Data Science – Process and Iterative Phases

The Data Science process follows the scientific method of refining knowledge by making observations, formulating and testing hypotheses, observing results, and formulating general theories that explain results. Within Data Science, this process takes the form of observing data and creating and evaluating models of behavior.

  • Define Big Data Strategy and Business Needs: Define the requirements that identify desired outcomes with measurable tangible benefits.
  • Choose Data Sources: Identify gaps in the current data asset base and find data sources to fill those gaps.
  • Acquire and Ingest Data Sources: Obtain data sets and onboard them.
  • Develop Data Science Hypotheses and Methods: Explore data sources via profiling, visualization, mining, etc.; refine requirements. Define model algorithm inputs, types, or model hypotheses and methods of analysis (e.g., groupings of data found by clustering).
  • Integrate and Align Data for Analysis: Model feasibility depends in part on the quality of the source data. Leverage trusted and credible sources. Apply appropriate data integration and cleansing techniques to increase quality and usefulness of provisioned data sets.
  • Explore Data Using Models: Apply statistical analysis and machine learning algorithms to the integrated data. Train, validate, and, over time, evolve the model. Training entails repeated runs of the model against actual data to verify assumptions and make adjustments, such as identifying outliers. Through this process, requirements will be refined. Initial feasibility metrics guide the evolution of the model. New hypotheses may be introduced that require additional data sets, and the results of this exploration will shape future modeling and outputs (even changing the requirements).
  • Deploy and Monitor: Models that produce useful information can be deployed to production, where their value and effectiveness are monitored on an ongoing basis. Data Science projects often turn into data warehousing projects, where more rigorous development processes are put in place (ETL, data quality, master data, etc.).
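
As a rough illustration of the "Explore Data Using Models" phase, the sketch below flags outliers, trains a simple model on part of the data, and checks it against a held-out portion. Everything in it is an assumption made for the example: the data is synthetic, the function names (flag_outliers, fit_line, mean_abs_error) are hypothetical, and the Tukey-fence rule and straight-line model merely stand in for whatever techniques a real project would choose.

```python
import random
import statistics


def flag_outliers(values, k=1.5):
    """Return indices of values outside the Tukey fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Average absolute prediction error of a (slope, intercept) model."""
    slope, intercept = model
    return statistics.fmean(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))


# Synthetic, illustrative data: y is roughly 3x + 5, with one corrupted record.
rng = random.Random(0)
xs = list(range(40))
ys = [3 * x + 5 + rng.gauss(0, 1) for x in xs]
ys[10] = 500.0  # simulated bad measurement

# Adjustment step: identify the outlier and exclude it from modeling.
outliers = flag_outliers(ys)
clean = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in outliers]

# Hold out a quarter of the clean records to validate the trained model.
rng.shuffle(clean)
split = int(len(clean) * 0.75)
train, holdout = clean[:split], clean[split:]

model = fit_line([x for x, _ in train], [y for _, y in train])
error = mean_abs_error(model, [x for x, _ in holdout], [y for _, y in holdout])
print(f"outliers={outliers} slope={model[0]:.2f} intercept={model[1]:.2f} holdout MAE={error:.2f}")
```

The holdout error plays the role of a feasibility metric here: if it were too high, the next iteration might revisit the hypothesis, the cleansing rules, or the data sources.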

P.S. The outputs of each step become the inputs into the next.
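
One way to picture this hand-off is a chain of functions in which each phase receives the accumulated state from the previous one and returns an extended copy. The sketch below is only an illustration: the step functions, the dictionary keys, and the sample values ("reduce customer churn", "crm_export", "support_tickets") are hypothetical placeholders, not part of any real system.

```python
from functools import reduce


def define_needs(state):
    # Phase 1: record the desired outcome.
    return {**state, "requirement": "reduce customer churn"}


def choose_sources(state):
    # Phase 2: pick sources that could fill gaps for that requirement.
    return {**state, "sources": ["crm_export", "support_tickets"]}


def acquire(state):
    # Phase 3: stand-in for ingestion, one empty data set per chosen source.
    return {**state, "datasets": {name: [] for name in state["sources"]}}


def run_process(steps, initial=None):
    """Feed the output of each step into the next, in order."""
    return reduce(lambda state, step: step(state), steps, initial or {})


result = run_process([define_needs, choose_sources, acquire])
print(result)
```

Because acquire reads the "sources" key written by choose_sources, reordering the steps would fail, which mirrors the dependency between phases.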
