Data Profiling Small (Rather than Big) Data

Over 70% of all business in the country is small business – less than five employees, less than one million dollars gross revenue per year. Along with this dominance of small business is the dominance of relatively small rather than big data. Smaller data sets, fewer customer records, usually fewer data elements – all indicate that the world’s data problems are dominated by small rather than big data. But while data sets vary in size and scope, many share similar problems of data quality.

In recent years numerous somewhat related words have appeared to muddy the “data” water: data profiling, data quality, dirty data, data cleansing, data standardization, and so forth all refer directly or indirectly to data quality. Data profiling refers more specifically to the process of identifying through structured methods and filtering techniques, the nature and types of data errors. Thus different software and cloud services now exist with common and unique features of data profiling.

Regardless of business size, data quality fulfillment requires data profiling as both diagnostic prerequisite and data correction method. For small business, a variety of desk top programs and software tools provide both diagnosis and correction capabilities. For example, USPS CASS pre-mailing software provides both assessment of address and zip code errors (thus providing a limited data profiling function) but also corrects and standardizes addresses information.

Even excel, employing its’ de-duplication tool and pivot table applications can provide some assessment of the completion of data fields and data inconsistencies by simply reporting the types and number of entries in each recorded data element. For example, a pivot table would report correct and mislabeled CBSA designations thusly (red labeled mis-defined :

 

CBSA Designation Count
ALBANY-LEBANON, OR 30
ALBANY-LEBN, OR 7
ATLANTA-SANDY SPRINGS-MARIETTA, GA 682
ATLANTA SPGS, GA 42
ATLANTIC CITY, NJ 26
ATLANTIC CTY, NJ 2
AUSTIN-ROUND ROCK, TX 234
AUSTIN, TX 7
BEAUMONT-PORT ARTHUR, TX 88
BEAUMONT, TX 9
BOWLING GREEN, KY 28
BOWLING GREEN,KY 5
CHARLOTTE-GASTONIA-CONCORD, NC-SC 221
CHARLOTTE, NC 16
CLARKSVILLE, TN-KY 55
CLARKSVLE, TN-KY 4

While numerous off the shelf software applications have some or limited data profiling capability, only a handful have comprehensive data profiling capability. In fact, when considering the following types of data errors defined by Rahm and Hai then even fewer qualify as relatively complete data profiling programs.

 

Scope/Problem Flawed Data Reasons/Remarks
Attribute Illegal values bdate=30.13.70 values outside of domain range
Record Violated attribute dependencies age=22, bdate=12.02.70 age = (current date – birth date) should hold
Record type Uniqueness violation emp1=(name=”John Smith”, SSN=”123456”) emp2=(name=”Peter Miller”, SSN=”123456”) uniqueness for SSN (social security number) violated
Source Referential integrity violation emp=(name=”John Smith”,deptno=127) referenced department (127) not defined

In their treatise “Data Cleaning: Problems and Current Approaches”, Erhard Rahm and Hong Hai provide a good initial overview of schema and instance level data errors which they define according to single source at schema and instance levels as shown in Tables 1 and 2.
Table 1 – Examples for single-source problems at schema level (violated integrity constraints)
Table 2 – Examples for single-source problems at instance level

 

Scope/Problem Flawed Data Reasons/Remarks
Attribute Missing values phone=9999-999999 unavailable values during data
Misspellings Cryptic city=”Sacremento” usually typos, phonetic errors
Abbreviations Embedded values experience=”B”;occupation=”DB Prog.”
name=”J. Smith 12.02.70 multiple values entered in one attribute (e.g. in a free-form
Misfielded values city=”Texas” misclassification of attribute
Record Violated attribute city=”Redmond”, zip=77777 city and zip code should correspond
Record type Word transpositions name1= “J. Smith”,name2=”Miller P.” usually in a free-form field
Contridicting records emp1=(name=”John Smith”,…); same employee represented twice due to some data entry errors
emp2=(name=”J. Smith”,…)
emp1=(name=”John Smith”, bdate=12.02.70); emp2=(name=”John Smith”, bdate=12.12.70) the same real world entity is described by different values

Regardless the size of your business and whether you have small to big data, these types of data errors are largely universal. Schema level errors: illegal values, violated attribute dependencies, uniqueness violations; single source errors: missing values, misspellings, abbreviation errors, misfielded values, violated attributes, word transpositions, contradicting records are common data errors that data profiling can and should, at minimum detect.

There is no shortage of software and cloud accesses data profiling services. If your customer and prospect records are hundreds or thousands in number, quality data profiling tools abound: Datiris, Datamartist, Trillium, X88, Pitney Bowes, SAS applications, and so forth. Check some of these options out. All usually offer free trial periods and test scale data sets for testing each data profiling application. All offer variations of data profiling criteria and data profiling variations.

17 replies
  1. freemusicdownloadsb.com
    freemusicdownloadsb.com says:

    I really like your blog.. very nice colors & theme. Did you
    make this website yourself or did you hire someone to do it for you?
    Plz reply as I’m looking to design my own blog and would
    like to know where u got this from. many thanks

    • admin
      admin says:

      The web site, blogs, other consultation was providd by a business in India, Web Experts On-Line (WEO): https://www.webexpertsonline.net/clients/ service, pricing, turnaround all excelent and much less expensive than similar services in the US. Highly recommended. Tell them Thomas Halatyn at PAC-COMM recommended them to you. Regards, TH

    • admin
      admin says:

      The blog topics we present are very expansive and involve a variety of services / partners. Can you give me more detail in terms of specifically what you are interested ina

    • admin
      admin says:

      Thanks for the response. Data profiling has many sub-topics. Can you be more specific in terms of what within data profiling about which you would like more information. Thomas Halatyn, thalatyn@pac-comm.biz

Comments are closed.