One of InfoSphere QualityStage’s core strengths is its ability to match records that refer to the same entity even when their values differ. To do so, QualityStage uses a statistical matching technique called probabilistic record linkage. This method evaluates each match field, taking into account its frequency distribution, discriminating value, and data reliability, and produces a score (or match weight) that measures how strongly the compared fields agree. The higher the score, the more likely it is that the two records are a match.
But before matching can take place, a data analyst must configure the specific match conditions through the Match Designer user interface, as shown below.
A match specification is built from the following fundamental elements:
Match passes define specific match conditions. A match specification can contain any number of match passes, which implement complementary or independent business rules to compensate for data errors or missing values in blocking columns, or to reduce complexity when you are dealing with large data volumes.
Blocking limits the number of record pairs that are examined, because comparing every possible pair is infeasible for sources of any reasonable size. Blocking partitions the sources into mutually exclusive and exhaustive subsets, and the matching process searches for matches only within a subset. If the subsets are designed to bring together pairs that are likely to match and to ignore pairs that are unlikely to match, successful matching becomes computationally feasible for large data volumes.
Match commands specify the matching columns, match comparison types, agreement and disagreement weights, and weight overrides.
Match and clerical cutoffs are thresholds that determine how to categorize scored record pairs.
Cutoff values are based on the composite weight that is assigned to each record pair. Record pairs with a composite weight equal to or above the match cutoff are considered matches (duplicates), record pairs with a composite weight below the clerical cutoff are considered nonmatches, and record pairs with a composite weight between the two values are considered clerical pairs, which require manual review.
Because the cutoff values directly influence whether a record pair is considered a match, the business purpose should determine how aggressively or conservatively those values are set. For example, if de-duplication is performed to create a mailing list for shopping catalogs, it might be acceptable to set a more aggressive (lower) match cutoff value than you would for patient records.
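To make the interplay of passes, blocking, weights, and cutoffs more concrete, the following Python sketch walks a few records through two match passes. It is an illustration only and not QualityStage's API or internal implementation: the function names, the sample records, the m- and u-probabilities, and the cutoff values are all hypothetical. The weights use the textbook probabilistic record linkage formulation (agreement weight log2(m/u), disagreement weight log2((1-m)/(1-u))), which is assumed here rather than taken from the text above, and columns are compared by exact equality instead of the richer comparison types that a match command can select.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical per-column probabilities (illustrative values, not product defaults):
# m = P(column agrees | pair is a true match), u = P(column agrees | pair is not a match)
PROBS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}

def agreement_weight(m, u):
    """Positive weight added when a column agrees."""
    return math.log2(m / u)

def disagreement_weight(m, u):
    """Negative weight added when a column disagrees."""
    return math.log2((1 - m) / (1 - u))

def composite_weight(a, b):
    """Sum the per-column weights for one record pair (exact comparison only)."""
    total = 0.0
    for col, (m, u) in PROBS.items():
        if a[col] == b[col]:
            total += agreement_weight(m, u)
        else:
            total += disagreement_weight(m, u)
    return total

def block(records, key):
    """Partition records into mutually exclusive subsets by blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return blocks

def classify(weight, match_cutoff, clerical_cutoff):
    """Categorize a scored pair by the two cutoffs."""
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"

def run_pass(records, key, match_cutoff, clerical_cutoff):
    """One match pass: compare pairs only within each block."""
    results = []
    for members in block(records, key).values():
        for a, b in combinations(members, 2):
            w = composite_weight(a, b)
            results.append((a["id"], b["id"], classify(w, match_cutoff, clerical_cutoff)))
    return results

records = [
    {"id": 1, "zip": "10001", "last_name": "SMITH", "first_name": "JOHN", "birth_year": 1980},
    {"id": 2, "zip": "10001", "last_name": "SMITH", "first_name": "JON", "birth_year": 1980},
    {"id": 3, "zip": "10001", "last_name": "JONES", "first_name": "MARY", "birth_year": 1975},
    {"id": 4, "zip": "94105", "last_name": "SMITH", "first_name": "JOHN", "birth_year": 1980},
]

# Pass 1 blocks on ZIP code; pass 2 blocks on last name.
pass1 = run_pass(records, key=lambda r: r["zip"], match_cutoff=12.0, clerical_cutoff=5.0)
pass2 = run_pass(records, key=lambda r: r["last_name"], match_cutoff=12.0, clerical_cutoff=5.0)
print(pass1)
print(pass2)
```

In this sketch, the first pass never compares records 1 and 4 because their ZIP codes place them in different blocks, while the second pass, which blocks on last name, brings them together and scores them as a match. Lowering the hypothetical match cutoff from 12.0 to 8.0 would also turn the pair (1, 2), which differs only in first name, from a clerical pair into a match, which is the kind of trade-off that the business purpose should decide.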