In the last blog, I mentioned that we need to investigate the data. Understanding the data is a necessary precursor to cleansing and consolidation: if we know the actual condition of the data in existing sources, we can correct data problems before they corrupt new systems. So let's begin with the issues in the data that make an investigation necessary before we handle it. An investigate stage helps us answer questions like the following:
– Does the data column have out-of-range, default (such as all nines in a telephone number), or unique values?
– Does the data column have free-form text that requires parsing to improve "addressability" to key components for optimal matching?
– Does the data column have values that do not match metadata labels and require new field definitions for matching and load format (such as DBA, C/O, Attention text, driver's license numbers, or other comment information in a name or address field)?
– Does the data column have values that overlap adjacent fields and thus require a re-alignment of field content? For example, name information, or city, state, and postal code, that extends into street and mailing address fields.
– Does the data column have invalid formats that can cause conversion problems, such as alphanumeric values in character-only or numeric-only fields?
– Does the data column have blank or missing data, or does it require special character and punctuation handling?
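The checks above can be approximated with a small profiling script. This is a minimal sketch, not the IBM tooling discussed later: the sample records, the `profile_column` helper, and the patterns and default values are all hypothetical, chosen only to illustrate counting blanks, default placeholders (like all nines), and format violations per column.

```python
import re

# Hypothetical sample records; in practice these come from the source systems.
records = [
    {"name": "ACME CORP C/O JOHN DOE", "phone": "9999999999", "zip": "30301"},
    {"name": "Jane Smith", "phone": "404-555-0100", "zip": "ABC12"},
    {"name": "", "phone": None, "zip": "30305"},
]

def profile_column(rows, column, pattern=None, default_values=()):
    """Count blanks, default placeholders, and format violations for one column."""
    stats = {"blank_or_missing": 0, "default_value": 0, "bad_format": 0}
    for row in rows:
        value = row.get(column)
        if value is None or str(value).strip() == "":
            stats["blank_or_missing"] += 1
            continue  # no further checks on an empty value
        if value in default_values:
            stats["default_value"] += 1
        if pattern and not re.fullmatch(pattern, str(value)):
            stats["bad_format"] += 1
    return stats

# All-nines phone numbers are treated as a default placeholder, not real data.
phone_stats = profile_column(records, "phone",
                             pattern=r"\d{3}-\d{3}-\d{4}",
                             default_values=("9999999999",))
zip_stats = profile_column(records, "zip", pattern=r"\d{5}")
```

Running this over real source tables, column by column, gives a first-cut frequency of the very anomalies listed above before any cleansing rules are written.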
Now, a brief look at the Investigate stage in action, in pictures…
What kind of report can we see from the Investigate stage?
Finally, we can also use a data profiling product such as IBM InfoSphere Information Analyzer to perform a comprehensive analysis across many columns and tables, creating direct input into the cleansing and matching process through shared metadata.