In the last two posts we went through the need for, and the process of, standardizing input data. Suppose the input data was “Mr. Brian Rumbaugh NY 27/12/1947”. With the help of the Standardize stage, we correctly established the following:
First Name: Brian
Last Name: Rumbaugh
City: New York
Country: United States of America
Date Of Birth: 27/12/1947
And so on…
But how confident are we that there is just a single record of Mr. Rumbaugh in the source? It is imperative for the customer to understand this, as their business decisions are based on it. And what will give us that confidence? Is it the driving license number in the two records? Or is it the date of birth, the residential address, the SSN, or a combination of these? Answering these questions is the job of matching.
Data matching finds records in a single data source or independent data sources that refer to the same entity (such as a person, organization, location, product, or material) even if there is no predetermined key.
To increase its usability and completeness, data can be consolidated or linked (matched) along any relationship, such as a common person, business, place, product, part, or event. You can also use matching to find duplicate entities that are caused by data entry violations or account-oriented business practices.
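To make the idea of matching without a predetermined key more concrete, here is a minimal sketch in Python. It is not how QualityStage computes its scores; it only illustrates that agreement on several fields together can give more confidence than any single field. The field names, the weights, and every record value other than Mr. Rumbaugh's name and date of birth are made up for illustration.

```python
# Hypothetical weights: how much agreement on each field counts toward a match.
# These numbers are illustrative assumptions, not QualityStage values.
WEIGHTS = {"ssn": 5.0, "license": 4.0, "dob": 2.0, "last_name": 1.5, "first_name": 1.0}


def match_score(a, b, weights=WEIGHTS):
    """Add a field's weight when both records agree on it, subtract it when
    they disagree, and ignore fields that are missing in either record."""
    score = 0.0
    for field, weight in weights.items():
        va, vb = a.get(field), b.get(field)
        if not va or not vb:
            continue  # a missing value gives no evidence either way
        if va.strip().lower() == vb.strip().lower():
            score += weight
        else:
            score -= weight
    return score


# rec1 comes from the example above; rec2 and rec3 are invented records.
rec1 = {"first_name": "Brian", "last_name": "Rumbaugh", "dob": "27/12/1947", "ssn": None}
rec2 = {"first_name": "brian", "last_name": "Rumbaugh", "dob": "27/12/1947", "license": "X1"}
rec3 = {"first_name": "Brian", "last_name": "Rumbough", "dob": "01/01/1950"}

print(match_score(rec1, rec2))  # agreement on name and date of birth
print(match_score(rec1, rec3))  # only the first name agrees
```

A higher score means more fields agree. In practice you would compare the score against cutoff values to decide whether a pair is a match, a possible match for manual review, or a non-match.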
During the data matching stage, IBM InfoSphere QualityStage takes the following actions:
- Identifies duplicate records (such as customers, suppliers, products, or parts) within one or more data sources.
- Provides householding for individuals (such as a family or group of individuals at a location) and householding for commercial entities (multiple businesses in the same location or different locations).
- Enables the creation of match groups across data sources that might or might not have a predetermined key.
There are two types of match stage:
- One source Match locates and groups all similar records within a single input data source. This process identifies potential duplicate records, which might then be removed. An example is the need to eliminate duplicates from a consolidation of mailing lists purchased from multiple sources.
- Reference Match identifies relationships among records in two data sources. An example of many-to-one matching is matching the ZIP codes in a customer file with the list of valid ZIP codes. More than one record in the customer file can have the same ZIP code in it.
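The difference between the two match types can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the QualityStage implementation: the function names, field names, and sample records are hypothetical, and grouping is done on an exact normalized key, whereas a real match stage also tolerates spelling variations.

```python
from collections import defaultdict


def one_source_match(records, key_fields):
    """Group records from a single source that share the same normalized key."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(str(rec[f]).strip().lower() for f in key_fields)
        groups[key].append(rec)
    # Only groups with more than one record are potential duplicates.
    return [group for group in groups.values() if len(group) > 1]


def reference_match(data_records, reference_zips):
    """Many-to-one: split data records by whether their ZIP code appears in
    the reference list. Several records may match the same reference ZIP."""
    matched, unmatched = [], []
    for rec in data_records:
        if rec["zip"] in reference_zips:
            matched.append(rec)
        else:
            unmatched.append(rec)
    return matched, unmatched


# Invented mailing-list rows; the ZIP codes are placeholders.
mailing_list = [
    {"name": "Brian Rumbaugh", "zip": "10001"},
    {"name": "brian rumbaugh ", "zip": "10001"},
    {"name": "Jane Doe", "zip": "99999"},
]
valid_zips = {"10001", "10002"}  # stands in for the list of valid ZIP codes

duplicates = one_source_match(mailing_list, ["name", "zip"])
matched, unmatched = reference_match(mailing_list, valid_zips)
print(len(duplicates), len(matched), len(unmatched))
```

The first function works on one input only and returns groups of likely duplicates. The second needs two inputs, the data and the reference, and more than one data record can match the same reference entry.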
Here is a sample One source Match job…