In one of my earlier blogs I talked about matching in QualityStage. For me this is one of the most powerful parts of QualityStage: it can find duplicates in huge data sets in a matter of minutes. For example, one of our customers wanted to find duplicates among 5 million of his records, and we were able to do that on my server in minutes. Not only that, we found several cases where the data appeared to have been intentionally modified so that duplicates could not be identified, and the tool still caught them. But the heart of all this is how well we create our match specification. Unfortunately, creating a good match specification is an art that many practitioners have not mastered. I sit with many practitioners and spend time educating them on how to use the tool to build a good specification. So in this blog I want to share some of my tips for creating a good match specification. The list is not exhaustive, but it is a good starting point that I picked up from a Redbook.
Generally, raw information in the source systems is not acceptable to match on, because it is either stored in a free-form format or, even when it is parsed into individual columns, the content does not always correspond to the metadata. It is very difficult to achieve highly accurate match results with data in these formats. So, as a rule, the business intelligence fields and match fields generated by Domain-Specific Standardization should be used in match specifications.
In addition, single-domain fields such as SSN or TIN (which might require some validation) should be reviewed as possible additional fields to help determine a match. Initial investigation results can be used to assess their accuracy and how well they are populated.
Initial blocking strategy
Within a file with many records, comparing every single pair seems a logical approach to finding matches, but for large files the job takes a very long time to run because of the number of comparisons that must be made: n*(n-1)/2 for n records.
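To see how fast that number grows, here is a quick sketch (purely illustrative, not part of QualityStage) applying the formula to the 5-million-record example above:

```python
def pair_count(n: int) -> int:
    """Number of unique record pairs in a file of n records: n*(n-1)/2."""
    return n * (n - 1) // 2

print(pair_count(5_000_000))  # 12,499,997,500,000 -- about 12.5 trillion comparisons
```

At even a million comparisons per second, that is more than four months of runtime, which is why brute-force comparison is not an option.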
Blocking is a scheme that reduces the number of record pairs that need to be examined. In blocking, files are partitioned into mutually exclusive and exhaustive blocks designed to keep matching records together while sharply decreasing the number of pairs to compare. Consequently, blocking is generally implemented by partitioning files based on the values of one or more fields.
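The idea can be sketched in a few lines of Python (the record layout and field names here are made up for illustration; QualityStage does this internally):

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, key_fields):
    """Partition records into mutually exclusive blocks on the given
    fields, then yield only the within-block pairs for comparison."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[tuple(rec[f] for f in key_fields)].append(rec)
    for group in blocks.values():
        yield from combinations(group, 2)

# Hypothetical records: blocking on ZIP cuts 6 possible pairs down to 2
records = [
    {"name": "J SMITH",    "zip": "10001"},
    {"name": "JOHN SMITH", "zip": "10001"},
    {"name": "A JONES",    "zip": "94105"},
    {"name": "ANN JONES",  "zip": "94105"},
]
print(len(list(block_pairs(records, ["zip"]))))  # 2 instead of 4*3/2 = 6
```

Only records that agree on the blocking fields are ever compared, which is both the power and the risk of blocking.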
The following guidelines should be used when selecting fields for blocking:
- Select fields that are reliable, well populated, and have a good frequency distribution of values
- Use multiple fields in a block
- Go from most specific in the first pass to more lenient in later passes. Your first pass should produce the most obvious and “best” matches
- Multiple Pass Strategy: Use multiple passes for variation and to “widen the net”
- Use match fields generated from standardization (NYSIIS, Soundex) rather than the business intelligence fields when applicable (First Name, Primary Name, Street Name, City Name)
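The last point is worth illustrating. A phonetic match field blocks spelling variants of the same name together, where the raw business intelligence field would scatter them into different blocks. Here is a simplified Soundex (it omits the full standard's H/W rule, so treat it as a sketch; QualityStage generates these fields for you, and also offers NYSIIS):

```python
def soundex(name: str) -> str:
    """Simplified Soundex: keep the first letter, encode the rest as
    digits by sound group, drop vowels, collapse adjacent repeats."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit
    return (result + "000")[:4]  # pad or truncate to 4 characters

print(soundex("Smith"), soundex("Smyth"))  # S530 S530 -- same block
```

Blocking on `S530` puts Smith and Smyth in the same block; blocking on the raw surname would keep them apart forever.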
Of course, if a pair of records' blocking fields do not agree in any pass of the match specification, that pair never has a chance to be linked.
So defining blocking fields and multiple passes is a balance between scope and accuracy: you want to compare a reasonable number of “like” records without letting true matches slip through.
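The multiple-pass strategy can be sketched as taking the union of candidate pairs over several blocking schemes. The field names below (`sndx` for a phonetic name code, `zip`, `dob`) are hypothetical:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, passes):
    """Union of within-block pairs over several passes, each pass
    blocking on a different set of fields ("widening the net")."""
    pairs = set()
    for fields in passes:
        blocks = defaultdict(list)
        for i, rec in enumerate(records):
            blocks[tuple(rec[f] for f in fields)].append(i)
        for group in blocks.values():
            pairs.update(combinations(group, 2))
    return pairs

records = [
    {"sndx": "S530", "zip": "10001", "dob": "1980"},
    {"sndx": "S530", "zip": "10010", "dob": "1980"},  # ZIP has a typo
    {"sndx": "J520", "zip": "10001", "dob": "1975"},
]
# Pass 1 (strict: name code + ZIP) misses the typo'd pair;
# pass 2 (name code + birth year) recovers it.
print(candidate_pairs(records, [["sndx", "zip"], ["sndx", "dob"]]))  # {(0, 1)}
```

Each extra pass widens the net: a pair missed because one blocking field is wrong gets a second chance under a different blocking scheme.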
Select fields for match comparison
These are guidelines on the match field selection:
- Do not use match fields generated from Standardization. Instead, use the business intelligence fields (such as First Name, Primary Name, Street Name, and City Name) and apply a fuzzy comparison:
– UNCERT (Street Name, City Name)
– Name UNCERT (Middle Name, First Name Maybe)
– MultUNCERT (Individual Name)
- Use Exact Comparison methods on shorter fields (as examples, House Number, Codes, Gender, and Floor Value) as well as on fields well standardized (such as Street Type, Directions, State Code, Country Code, and Floor Type)
- Use default match parameters in the initial run
- Use default match cutoff values of 0 in the initial run
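To make the fuzzy-versus-exact distinction concrete, here is a rough stand-in for an UNCERT-style comparison using normalized Levenshtein distance. To be clear, this is not the actual UNCERT algorithm QualityStage uses; it only shows the idea of tolerating small spelling variations on longer fields while demanding exact agreement on short ones:

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_agree(a: str, b: str, threshold: float = 0.8) -> bool:
    """UNCERT-like comparison sketch: agree when the normalized
    similarity clears a tolerance threshold."""
    if a == b:
        return True
    return 1 - edit_distance(a, b) / max(len(a), len(b)) >= threshold

# Fuzzy on a long business intelligence field, exact on a short code field
print(fuzzy_agree("MAIN STREET", "MIAN STREET"))  # True: one transposition tolerated
print("123" == "124")                             # False: House Number must match exactly
```

On a field like House Number, a single changed character is a completely different value, so fuzzy tolerance would only create false matches; on Street Name, that same single-character slip is almost always a typo, which is why the guidelines above split the comparison methods the way they do.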