Data Profiling is a form of data analysis used to inspect data and assess quality. Data profiling uses statistical techniques to discover the true structure, content, and quality of a collection of data (Olson, 2003). A profiling engine produces statistics that analysts can use to identify patterns in data content and structure. Such as:
- Counts of Nulls: Identifies nulls exist and allows for inspection of whether they are allowable or not
- Max/Min Value: Identifies outliers, like negatives
- Max/Min Length: Identifies outliers or invalids for fields with specific length requirements
- Frequency Distribution of Values for Individual Columns: Enables assessment of reason-ability (e.g., distribution of country codes for transactions, inspection of frequently or infrequently occurring values, as well as the percentage of the records populated with defaulted values)
- Data Type and Format: Identifies level of non-conformance to format requirements, as well as identification of unexpected formats (e.g., number of decimals, embedded spaces, sample values)
Profiling also includes cross-column analysis, which can identify overlapping or duplicate columns and expose embedded value dependencies. Inter-table analysis explores overlapping values sets and helps identify foreign key relationships. Most data profiling tools allow for drilling down into the analyzed data for further investigation.
Data Profiling is an effective way to understand data, it is just a first step to data quality improvement. Solving problems requires other forms of analysis, including business process analysis, analysis of data lineage, and deeper data analysis that can help isolate root causes of problems.
Results of data profiling can be used to identify opportunities to improve the quality of both Data and Metadata. Results from the profiling engine must be assessed by an analyst to determine whether data conforms to rules and other requirements. A good analyst can use profiling results to confirm known relationships and uncover hidden characteristics and patterns within and between data sets, including business rules, and validity constraints. Profiling is usually used as part of data discovery for projects (especially data integration projects; or to assess the current state of data that is targeted for improvement.