As explained in an earlier blog post, data matching finds records in a single data source or in independent data sources that refer to the same entity (such as a person, organization, location, product, or material), even when there is no predetermined key. There are two common approaches to deciding whether two similar records match: deterministic matching and probabilistic matching.
Deterministic matching typically searches for a pool of candidate duplicates and then compares values found in specified attributes between all pairs of possible duplicates. It makes allowances for missing data. The results are given a score, and the scores are used to decide if the records should be considered the same or different. There is a gray area where the scores indicate uncertainty, and such duplicates are usually referred to a data steward for investigation and decision.
Probabilistic matching looks at specified attributes and checks how frequently their values occur in the dataset before assigning scores, so the scores are influenced by the frequencies of the values found. A threshold can be set to decide whether a pair is a definite match or whether clerical review by a data steward is required to determine a match.
Deterministic decision tables:
- Fields are compared
- Letter grades are assigned
- Combined letter grades are compared to a vendor-delivered file
- Result: Match; Fail; Suspect
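The decision-table steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the three-letter grading scheme (A = equal, B = missing, D = different), and the contents of `DECISION_TABLE` are all hypothetical stand-ins for the vendor-delivered file, not what any particular product ships.

```python
# Hypothetical fields to compare, in a fixed order.
FIELDS = ("name", "birth_date", "postcode")

# Hypothetical stand-in for the vendor-delivered file: it maps a combined
# grade pattern (one letter per field) to an outcome. Unlisted patterns fail.
DECISION_TABLE = {
    "AAA": "Match",
    "AAB": "Match",
    "ABA": "Match",
    "BAA": "Suspect",
    "AAD": "Suspect",
    "ADA": "Suspect",
    "ABB": "Suspect",
}


def grade(a, b):
    """Assign a letter grade to one field comparison (assumed scheme)."""
    a = (a or "").strip().lower()
    b = (b or "").strip().lower()
    if not a or not b:
        return "B"  # allowance for missing data
    return "A" if a == b else "D"


def decide(rec1, rec2):
    """Return the combined grade pattern and the outcome looked up for it."""
    pattern = "".join(grade(rec1.get(f), rec2.get(f)) for f in FIELDS)
    return pattern, DECISION_TABLE.get(pattern, "Fail")


r1 = {"name": "Ann Lee", "birth_date": "1980-01-01", "postcode": "12345"}
r2 = {"name": "ann lee", "birth_date": "1980-01-01", "postcode": ""}
print(decide(r1, r2))
```

Note that every value is treated alike here: agreeing on a very common name earns the same grade as agreeing on a rare one.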
Probabilistic record linkage:
- Fields are evaluated for degree of match
- Weight is assigned and represents the information content by value.
- Weights are summed to derive a total score.
- Result: Statistical probability of a match
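To make the weighting steps above concrete, here is a minimal Python sketch. It assumes a common log-ratio weighting in which agreement on a value scores log2(m / u), where u is the value's relative frequency in the dataset; this formula, the `M_PROB` constant, the fallback frequency for unseen values, and the thresholds in `classify` are my assumptions for illustration, not something the source or InfoSphere QualityStage specifies.

```python
import math
from collections import Counter

M_PROB = 0.95  # assumed chance that a field agrees when two records truly match


def value_frequencies(records, field):
    """Relative frequency of each non-empty value of `field` in the dataset."""
    counts = Counter(r[field] for r in records if r.get(field))
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def field_weight(a, b, freqs):
    """Weight for one field: rare agreeing values carry more information."""
    if not a or not b:
        return 0.0  # missing data is treated as no evidence either way
    if a == b:
        u = freqs.get(a, 1e-3)  # assumed small frequency for unseen values
        return math.log2(M_PROB / u)
    u = sum(f * f for f in freqs.values())  # chance two random values agree
    return math.log2((1 - M_PROB) / max(1 - u, 1e-9))


def total_score(rec1, rec2, freq_tables):
    """Sum the per-field weights into a total score."""
    return sum(
        field_weight(rec1.get(field), rec2.get(field), freqs)
        for field, freqs in freq_tables.items()
    )


def classify(score, lower=0.0, upper=3.0):
    """Hypothetical thresholds: between them, a data steward reviews the pair."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "clerical review"


records = [
    {"surname": "Smith", "city": "London"},
    {"surname": "Smith", "city": "London"},
    {"surname": "Smith", "city": "Leeds"},
    {"surname": "Zabrowski", "city": "London"},
]
freq_tables = {f: value_frequencies(records, f) for f in ("surname", "city")}

rare = {"surname": "Zabrowski", "city": "Leeds"}
common = {"surname": "Smith", "city": "London"}
print(classify(total_score(rare, rare, freq_tables)))
print(classify(total_score(common, common, freq_tables)))
```

In this toy dataset, two records agreeing on the rare values score higher than two records agreeing on the common ones, which is the frequency effect described above.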
InfoSphere QualityStage can perform both deterministic matching and probabilistic record linkage, but uses probabilistic record linkage by default. The comparison above highlights the advantage of probabilistic matching.