InfoSphere Quality Stage – XIII (Creating a good match specification)

In one of my earlier blogs I talked about matching in Quality Stage. For me this is one of the most powerful parts of Quality Stage: it can find duplicates in huge volumes of data in a matter of minutes. For example, one of our customers wanted to find duplicates in 5 million of his records, and we were able to do that on my server in minutes. Not only that, we found several cases where the data appeared to have been intentionally modified so that duplicates could not be identified, and the tool was still able to catch them. But the heart of all this depends on how well we create our match specification. Unfortunately, creating a good match specification is an art that many practitioners are not aware of. I sit with many practitioners and spend time educating them on how to use this tool to build a good specification. So in this blog I want to share some of my tips for creating a good match specification. This is not exhaustive, but it is a good starting point that I picked up from a Redbook.

Generally, raw information in the source systems is not suitable to match on, because it is either stored in a free-form format or, even when it is parsed into individual columns, the content does not always correspond to the metadata. It is very difficult to achieve high-accuracy match results when using data in these formats. So, as a rule, business intelligence fields and the match fields generated by Domain-Specific Standardization should be used in match specifications.

In addition, single-domain fields such as SSN or TIN (which might require some validation) should be reviewed as possible additional fields to help determine the match. Initial investigation results can be used to assess their accuracy and how well they are populated.

Initial blocking strategy
Within a file with many records, comparing every single pair in the file seems like a logical approach, but for large files such a job takes a very long time to run because of the sheer number of comparisons that must be made: n*(n-1)/2 for n records.
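To put that in perspective, for the 5 million records I mentioned above, n*(n-1)/2 works out to roughly 12.5 trillion comparisons, which is why a brute-force comparison of every pair is simply not an option.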

Blocking is a scheme that reduces the number of pairs of records that need to be examined. In blocking, files are partitioned into mutually exclusive and exhaustive blocks designed to increase the number of matches while decreasing the number of pairs to compare. Consequently, blocking is generally implemented by partitioning files based on the values of one or more fields.
The following guidelines should be used when selecting fields for blocking:

  •  Select fields that are reliable, well populated, and have a good frequency distribution of values
  •  Use multiple fields in a block
  •  Go from most specific in the first pass to more lenient in later passes; your first pass should result in the most obvious and “best” matches
  •  Multiple Pass Strategy: Use multiple passes for variation and to “widen the net”
  •  Use match fields generated from standardization (NYSIIS, Soundex) rather than the business intelligence fields when applicable (First Name, Primary Name, Street Name, City Name)

Of course, if a pair of records’ blocking fields are not the same in any of the passes of the match specification, that pair has no chance of being linked.
So defining blocking fields and multiple passes is a balance between scope and accuracy, so that a reasonable number of “like” records are compared.
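
To make the blocking idea concrete, here is a minimal Python sketch (my own illustration, not how the Quality Stage engine is implemented; the record fields and the block key are invented for the example):

from collections import defaultdict
from itertools import combinations

def block_key(record):
    # Hypothetical block key: first letter of the surname plus the postal code.
    # In Quality Stage you would typically block on the NYSIIS or Soundex
    # match fields produced by standardization.
    return (record["surname"][:1].upper(), record["postcode"])

def candidate_pairs(records):
    # Group records by block key and generate pairs only within a block,
    # so the comparison count drops from n*(n-1)/2 to the sum of
    # m*(m-1)/2 over the much smaller block sizes m.
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for members in blocks.values():
        yield from combinations(members, 2)

records = [
    {"surname": "Smith", "postcode": "10001"},
    {"surname": "Smyth", "postcode": "10001"},
    {"surname": "Jones", "postcode": "94105"},
]
print(sum(1 for _ in candidate_pairs(records)))  # 1 candidate pair instead of 3

Only the two records that share a block key are ever compared; an overly strict block key in a single pass can therefore also hide true matches, which is exactly why multiple passes with different blocking fields are recommended.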

Select fields for match comparison

These are guidelines for match field selection:

  •  Do not use Match Fields generated from Standardization. Use Business Intelligence Fields (such as First Name, Primary Name, Street Name, and City Name) and apply a fuzzy comparison (a rough illustration follows this list):
    – UNCERT (Street Name, City Name)
    – Name UNCERT (Middle Name, First Name Maybe)
    – MultUNCERT (Individual Name)
  •  Use Exact Comparison methods on shorter fields (for example, House Number, Codes, Gender, and Floor Value) as well as on well-standardized fields (such as Street Type, Directions, State Code, Country Code, and Floor Type)
  •  Use default match parameters in the initial run
  •  Use default match cutoff values of 0 in the initial run
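
For anyone wondering what a fuzzy comparison does in practice, here is a rough Python stand-in. UNCERT in Quality Stage uses its own string comparison algorithm and weighting, so this is only an illustration of the idea of scoring near-matches instead of rejecting them outright:

from difflib import SequenceMatcher

def fuzzy_score(a, b, weight=10):
    # Scale a character-level similarity ratio into a score contribution.
    # The weight of 10 is arbitrary here; in a real match specification the
    # agreement weight is derived from the field's match probabilities.
    ratio = SequenceMatcher(None, a.upper(), b.upper()).ratio()
    return round(weight * ratio, 2)

print(fuzzy_score("WASHINGTON", "WASHINGTN"))  # misspelling still scores close to the full weight
print(fuzzy_score("WASHINGTON", "LINCOLN"))    # unrelated street name scores much lower

Exact comparisons, by contrast, return either full agreement or full disagreement, which is why they are reserved for short, well-standardized fields such as house numbers and codes.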

 

One thought on “InfoSphere Quality Stage – XIII (Creating a good match specification)”

  1. Hi Namit,
    While “testing all passes” in the Match Designer specification I’m getting this error. We have already created the metadata in the database with the standardized output columns and the match frequency output columns. Also, we are testing only one pass.
    Match designer match pass run process failed.
    Could you please help me with this issue?

    ##I IIS-DSEE-TFCN-00006 16:31:32(001) conductor uname: -s=Windows_NT; -r=1; -v=5; -n=DAS-ZA1; -m=Pentium
    ##I IIS-DSEE-TOSH-00002 16:31:32(002) orchgeneral: loaded
    ##I IIS-DSEE-TOSH-00002 16:31:32(003) orchsort: loaded
    ##I IIS-DSEE-TOSH-00002 16:31:32(004) orchstats: loaded
    ##I IIS-DSEE-TDOD-00080 16:31:33(003) Process meta data not available in database
    ##I IIS-DSEE-TDOD-00080 16:31:34(001) Process meta data not available in database
    ##I IIS-DSEE-TDOD-00080 16:31:34(004) Process meta data not available in database
    ##I IIS-DSEE-TFSC-00001 16:31:34(006) APT configuration file: C:/IBM/InformationServer/Server/Configurations/default.apt
    ##W IIS-DSEE-TFIP-00072 16:31:34(007) When checking operator: When binding output interface field “qsMatchStatValue” to field “qsMatchStatValue”: Implicit conversion from source type “uint64” to result type “int64”: Possible range limitation.
    ##W IIS-DSEE-TFIP-00095 16:31:34(008) When checking operator: The modify operator drops field “qsMatchMasterFlag” which is not present in output interface.
    ##W IIS-DSEE-TFIP-00072 16:31:34(009) When checking operator: When binding output schema variable “outRec”: When binding output interface field “qsMatchSetID” to field “qsMatchSetID”: Implicit conversion from source type “uint64” to result type “int64”: Possible range limitation.
    ##W IIS-DSEE-TFIP-00072 16:31:34(010) When checking operator: When binding output schema variable “outRec”: When binding output interface field “qsMatchDataID” to field “qsMatchDataID”: Implicit conversion from source type “uint64” to result type “int64”: Possible range limitation.
    ##W IIS-DSEE-TFIP-00072 16:31:34(011) When checking operator: When binding output schema variable “outRec”: When binding output interface field “Name” to field “DataCol0001”: Implicit conversion from source type “string” to result type “string[255]”: Possible truncation of variable length string.
    ##F IIS-DSEE-TFPM-00352 16:32:37(000) Fatal Error: Unable to start ORCHESTRATE job: APT_PMwaitForPlayersToStart failed while waiting for players to confirm startup. This likely indicates a network problem.
    Status from APT_PMpoll is 0; node name is node0
