Two years back, in a post on Information Analyzer, I wrote: “Data quality specialists use InfoSphere Information Analyzer to scan samples and full volumes of data to determine their quality and structure.” In this blog I want to explore data sampling in a little more depth.
Need for Data Sampling:
While cooking rice, one does not check every grain to confirm that the rice is properly cooked. One takes a sample and, based on that sample, decides whether the rice is done. The same often holds for data. With very large volumes, analyzing the full volume is risky: it may take a very long time, with a high chance of failing due to resource constraints, competing processes, or other external factors.
To handle such cases, IA provides data sampling (random, sequential, and Nth). At a fraction of the cost, a lot of useful information can be generated:
– Is the column unique?
– Are there nulls?
– What are the high frequency values?
– What are the high frequency patterns?
– Are there many outliers?
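As an illustration only (a Python sketch, not IA itself; the column values and sampling rate are hypothetical), here is how even a small random sample of one column can already answer most of these questions:

```python
import random
from collections import Counter

def value_pattern(value):
    """Map letters to 'A' and digits to '9'; keep other characters as-is."""
    return "".join("A" if c.isalpha() else "9" if c.isdigit() else c for c in value)

def profile_sample(rows, rate=0.01, seed=42):
    """Profile a random sample of a column: size, nulls, uniqueness, top values/patterns."""
    rng = random.Random(seed)
    sample = [r for r in rows if rng.random() < rate]  # keep roughly `rate` of the rows
    non_null = [v for v in sample if v]                # drop None / empty strings
    return {
        "sampled": len(sample),
        "nulls": len(sample) - len(non_null),
        "is_unique": len(set(non_null)) == len(non_null),
        "top_values": Counter(non_null).most_common(3),
        "top_patterns": Counter(value_pattern(v) for v in non_null).most_common(3),
    }

# Hypothetical "Address Line 2" column with some missing values.
rows = ["Apt 12", "Apt 7", None, "Suite 300"] * 5000
print(profile_sample(rows))
```

The profile of the sample (top patterns such as `AAA 99`, null counts, uniqueness) is the kind of summary IA produces, at a fraction of the cost of scanning everything.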
Here is an example of analyzing data (the pattern of Address Line 2 in an input address) with a full load of 2.5 million records and with a 1% sample of 20K records; the analysis results for the full load and the sample are quite close.
Note: Sampling from the IA perspective may still incur a full table scan on the source system. All rows may still be read and sent to IA; IA then analyzes only every 100th record, or randomly ignores all but 10% of the records, depending on which sampling option you selected.
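The two options mentioned in the note can be sketched as simple row filters (a Python illustration of the idea, not IA's actual implementation; the row counts are hypothetical):

```python
import random

def nth_sample(rows, n=100):
    """Nth sampling: every row is still read, but only every nth one is kept."""
    return [row for i, row in enumerate(rows) if i % n == 0]

def random_sample(rows, keep=0.10, seed=7):
    """Random sampling: every row is still read, but ~90% are randomly ignored."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < keep]

rows = list(range(10_000))
print(len(nth_sample(rows)))     # 100
print(len(random_sample(rows)))  # roughly 1,000
```

Either way, the full 10,000 rows pass through the filter; only the analysis workload downstream is reduced.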
Challenges in Sampling
- One challenge in sampling is that it identifies the high-frequency blocks of values or patterns and misses most of the low-frequency ones. However, knowledge of the big, frequent value blocks can help develop strategies for analyzing the rest of the data at a reduced cost.
- Another challenge is that sampling assumes the data is uniform, i.e. that bad data patterns are uniformly distributed across all data sources and time periods. If there is reason to believe this is not the case, the best practice is to split the data. Using virtual tables or database views, you can analyze each segment separately (with or without sampling), compare the results, and enhance your data knowledge. These segments can be defined by a hypothesis factor, i.e. a dimension along which we suspect error patterns may differ (for example application, data channel, period, or time of day).
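As a sketch of this segmented approach (plain Python rather than IA virtual tables or views; the `channel` column and the sample records are hypothetical), the same pattern analysis can be run per segment and the results compared:

```python
from collections import Counter, defaultdict

def value_pattern(value):
    """Map letters to 'A' and digits to '9'; keep other characters as-is."""
    return "".join("A" if c.isalpha() else "9" if c.isdigit() else c for c in value)

def analyze_by_segment(records, segment_key, value_key):
    """Compute the top value patterns per segment instead of over the whole table."""
    segments = defaultdict(list)
    for rec in records:
        segments[rec[segment_key]].append(rec[value_key])
    return {seg: Counter(value_pattern(v) for v in vals).most_common(2)
            for seg, vals in segments.items()}

# Hypothetical records: 'channel' is the hypothesis factor we segment by.
records = [
    {"channel": "web",   "addr2": "Apt 12"},
    {"channel": "web",   "addr2": "Apt 9"},
    {"channel": "batch", "addr2": "APT12"},   # pattern drifts in one channel
    {"channel": "batch", "addr2": "STE300"},
]
print(analyze_by_segment(records, "channel", "addr2"))
```

If one segment's top patterns differ sharply from another's, that segment is a candidate for deeper (possibly unsampled) analysis.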