Over 70% of all business in the country is small business – less than five employees, less than one million dollars gross revenue per year. Along with this dominance of small business is the dominance of relatively small rather than big data. Smaller data sets, fewer customer records, usually fewer data elements – all indicate that the world’s data problems are dominated by small rather than big data. But while data sets vary in size and scope, many share similar problems of data quality.
In recent years numerous somewhat related words have appeared to muddy the “data” water: data profiling, data quality, dirty data, data cleansing, data standardization, and so forth all refer directly or indirectly to data quality. Data profiling refers more specifically to the process of identifying through structured methods and filtering techniques, the nature and types of data errors. Thus different software and cloud services now exist with common and unique features of data profiling.
Regardless of business size, data quality fulfillment requires data profiling as both diagnostic prerequisite and data correction method. For small business, a variety of desk top programs and software tools provide both diagnosis and correction capabilities. For example, USPS CASS pre-mailing software provides both assessment of address and zip code errors (thus providing a limited data profiling function) but also corrects and standardizes addresses information.
Even excel, employing its’ de-duplication tool and pivot table applications can provide some assessment of the completion of data fields and data inconsistencies by simply reporting the types and number of entries in each recorded data element. For example, a pivot table would report correct and mislabeled CBSA designations thusly (red labeled mis-defined :
|ATLANTA-SANDY SPRINGS-MARIETTA, GA||682|
|ATLANTA SPGS, GA||42|
|ATLANTIC CITY, NJ||26|
|ATLANTIC CTY, NJ||2|
|AUSTIN-ROUND ROCK, TX||234|
|BEAUMONT-PORT ARTHUR, TX||88|
|BOWLING GREEN, KY||28|
While numerous off the shelf software applications have some or limited data profiling capability, only a handful have comprehensive data profiling capability. In fact, when considering the following types of data errors defined by Rahm and Hai then even fewer qualify as relatively complete data profiling programs.
|Attribute||Illegal values||bdate=30.13.70||values outside of domain range|
|Record||Violated attribute dependencies||age=22, bdate=12.02.70||age = (current date – birth date) should hold|
|Record type||Uniqueness violation||emp1=(name=”John Smith”, SSN=”123456”) emp2=(name=”Peter Miller”, SSN=”123456”)||uniqueness for SSN (social security number) violated|
|Source||Referential integrity violation||emp=(name=”John Smith”,deptno=127)||referenced department (127) not defined|
In their treatise “Data Cleaning: Problems and Current Approaches”, Erhard Rahm and Hong Hai provide a good initial overview of schema and instance level data errors which they define according to single source at schema and instance levels as shown in Tables 1 and 2.
Table 1 – Examples for single-source problems at schema level (violated integrity constraints)
Table 2 – Examples for single-source problems at instance level
|Attribute||Missing values||phone=9999-999999||unavailable values during data|
|Misspellings Cryptic||city=”Sacremento”||usually typos, phonetic errors|
|Abbreviations Embedded values||experience=”B”;occupation=”DB Prog.”|
|name=”J. Smith 12.02.70||multiple values entered in one attribute (e.g. in a free-form|
|Misfielded values||city=”Texas”||misclassification of attribute|
|Record||Violated attribute||city=”Redmond”, zip=77777||city and zip code should correspond|
|Record type||Word transpositions||name1= “J. Smith”,name2=”Miller P.”||usually in a free-form field|
|Contridicting records||emp1=(name=”John Smith”,…);||same employee represented twice due to some data entry errors|
|emp1=(name=”John Smith”, bdate=12.02.70); emp2=(name=”John Smith”, bdate=12.12.70)||the same real world entity is described by different values|
Regardless the size of your business and whether you have small to big data, these types of data errors are largely universal. Schema level errors: illegal values, violated attribute dependencies, uniqueness violations; single source errors: missing values, misspellings, abbreviation errors, misfielded values, violated attributes, word transpositions, contradicting records are common data errors that data profiling can and should, at minimum detect.
There is no shortage of software and cloud accesses data profiling services. If your customer and prospect records are hundreds or thousands in number, quality data profiling tools abound: Datiris, Datamartist, Trillium, X88, Pitney Bowes, SAS applications, and so forth. Check some of these options out. All usually offer free trial periods and test scale data sets for testing each data profiling application. All offer variations of data profiling criteria and data profiling variations.