Match and Manage your Data on Cloud

We left the last blog with two questions.

A few weeks back I wrote about IBM Bluemix Data Connect. If you missed it, watch this video on how you can put data to work with IBM Bluemix Data Connect.

Now Business Analysts can leverage entity matching technology using Data Connect. The Match and Manage (BETA) operation in Data Connect identifies possible matches and relationships across a plethora of data sets, including master data and non-master data sets, to create a unified view of your data. It also provides a visualization of the relationships between entities in the unified data set.

For example, say you have two sets of data: one containing customer profile information and the other containing a list of prospects. A Business Analyst can now use an intuitive UI to run the Match and Manage operation on these two data sets and answer questions such as:

  • Are there duplicates in the prospect list?
  • How many of the prospects are already existing customers?
  • Are there non-obvious relationships among prospects and customers that can be explored?
  • Are there other sources of information within that could provide better insights if brought together?
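
To make the first two questions concrete, here is a minimal Python sketch. It is not how Data Connect does its matching; it simply assumes that two records refer to the same person when their name and email agree after basic normalization. The field names, sample records, and helper functions are all hypothetical.

```python
def normalize(record):
    """Build a crude comparison key from name and email (illustrative only)."""
    name = " ".join(record["name"].lower().split())
    email = record["email"].strip().lower()
    return (name, email)


def find_duplicates(records):
    """Return groups of records that share the same normalized key."""
    groups = {}
    for record in records:
        groups.setdefault(normalize(record), []).append(record)
    return [group for group in groups.values() if len(group) > 1]


def existing_customers(prospects, customers):
    """Return the prospects whose normalized key also appears among customers."""
    customer_keys = {normalize(customer) for customer in customers}
    return [prospect for prospect in prospects if normalize(prospect) in customer_keys]


# Hypothetical sample data, made up for this example.
customers = [
    {"name": "Brian Rumbaugh", "email": "brian@example.com"},
    {"name": "Ana Silva", "email": "ana@example.com"},
]
prospects = [
    {"name": "brian  rumbaugh", "email": "Brian@Example.com"},
    {"name": "Chen Wei", "email": "chen@example.com"},
    {"name": "Chen Wei ", "email": "chen@example.com"},
]

duplicate_groups = find_duplicates(prospects)
already_customers = existing_customers(prospects, customers)
print(len(duplicate_groups), "duplicate group(s) in the prospect list")
print(len(already_customers), "prospect(s) are already customers")
```

Exact comparison on a normalized key only catches the easy cases. The non-obvious relationships in the third question are exactly where a real matching engine earns its keep.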

The two data sets are matched using cognitive capabilities, which allow the MDM matching technology to be auto-configured and tuned to intelligently match across different data sets:

[Image: dataconnect]

Business Analysts can understand the de-duplicated data sets by navigating through a relationship graph of the data to see how the entities are related across the entire data set. They can now discover non-obvious relationships within the data that were previously undiscoverable. The following generated canvas enables them to interactively explore relationships between entities.

[Image: dataconnect1]

The example above illustrates how clients can now easily understand the data they hold within their MDM repositories, and how they can match their MDM data with other data sources not included within the MDM system. This simplifies the Analytical MDM experience: MDM technologies are accessible to everyone, without the need to wait for Data Engineers to transform the data into a format that can be matched or to rely on MDM ninjas to configure matching algorithms.

Summary:

IBM Bluemix Data Connect provides a seamless, integrated self-service experience for data preparation. With the addition of the entity analytics capability, business users are empowered to gain insight from data that wasn’t previously available to them. Organizations can now extract further value from their MDM data by ensuring it is used across the organization to provide accurate analytics. Entity analytics within Data Connect is now available in beta. Go ahead and experience the next evolution of MDM.

InfoSphere Quality Stage – VII (Matching)

In the last two blogs we went through the need for and the process of standardization of input data. Suppose the input data was “Mr. Brian Rumbaugh NY 27/12/1947”. With the help of the Standardize stage, we correctly established the following:

First Name: Brian
Last Name: Rumbaugh
City: New York
Country: United States of America
Date Of Birth: 27/12/1947
And so on…

But how confident are we that there is just a single record of Mr. Rumbaugh in the source? It is imperative for the customer to understand this, as their business decisions are based on it. And what will give us that confidence? Is it the driving license number in the two records? Or is it the date of birth, the residential address, the SSN, or a combination of these? This is accomplished using Matching.
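
One way to picture "a combination of these" is a score that adds up evidence from every field on which two records agree. The Python sketch below is a simplified illustration, not QualityStage's actual algorithm. The weights, the threshold, and the field names are all hypothetical values chosen for the example.

```python
# Hypothetical weights: how much agreement on each field adds to the score.
# A strong identifier counts for more than a common attribute such as city.
FIELD_WEIGHTS = {
    "ssn": 5.0,
    "driving_license": 4.0,
    "date_of_birth": 2.0,
    "last_name": 1.5,
    "city": 0.5,
}

# Hypothetical cutoff above which two records are treated as the same person.
THRESHOLD = 3.5


def match_score(a, b, weights=FIELD_WEIGHTS):
    """Sum the weights of fields that are present in both records and equal."""
    score = 0.0
    for field, weight in weights.items():
        left, right = a.get(field), b.get(field)
        if left and right and left == right:
            score += weight
    return score


# Two made-up records for Mr. Rumbaugh; neither carries a shared strong identifier.
record_1 = {
    "last_name": "Rumbaugh",
    "city": "New York",
    "date_of_birth": "27/12/1947",
    "ssn": None,
}
record_2 = {
    "last_name": "Rumbaugh",
    "city": "New York",
    "date_of_birth": "27/12/1947",
    "driving_license": "D123",
}

score = match_score(record_1, record_2)
is_match = score >= THRESHOLD
print(score, is_match)
```

Here no single field settles the question, but date of birth, last name, and city together push the score over the cutoff. Missing values are simply skipped in this sketch, so they neither help nor hurt.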

Data matching finds records in a single data source or independent data sources that refer to the same entity (such as a person, organization, location, product, or material) even if there is no predetermined key.

To increase its usability and completeness, data can be consolidated or linked (matched) along any relationship, such as a common person, business, place, product, part, or event. You can also use matching to find duplicate entities that are caused by data entry violations or account-oriented business practices.

During the data matching stage, IBM InfoSphere QualityStage takes the following actions:

  • Identifies duplicate records (such as customers, suppliers, products, or parts) within one or more data sources.
  • Provides householding for individuals (such as a family or group of individuals at a location) and householding for commercial entities (multiple businesses in the same location or different locations).
  • Enables the creation of match groups across data sources that might or might not have a predetermined key.

There are two types of match stage:

  1. One source match locates and groups all similar records within a single input data source. This process identifies potential duplicate records, which might then be removed. An example is the need to eliminate duplicates from a consolidation of mailing lists purchased from multiple sources.
  2. Reference Match identifies relationships among records in two data sources. An example of many-to-one matching is matching the ZIP codes in a customer file with the list of valid ZIP codes. More than one record in the customer file can have the same ZIP code in it.
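
The difference between the two types can be sketched in a few lines of Python. This is only an illustration of the idea, not of how the match stages are implemented: the similarity measure (difflib's `SequenceMatcher`), the 0.85 threshold, the function names, and the sample data are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    """Treat two strings as similar when their difflib ratio reaches the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def one_source_match(names, threshold=0.85):
    """Greedily group similar names from a single source."""
    groups = []
    for name in names:
        for group in groups:
            # Compare against the first member of each existing group.
            if similar(name, group[0], threshold):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


def reference_match(records, reference_values, field):
    """Split records by whether their field value appears in the reference set."""
    matched, unmatched = [], []
    for record in records:
        target = matched if record[field] in reference_values else unmatched
        target.append(record)
    return matched, unmatched


# One source: a consolidated mailing list with a near-duplicate entry.
mailing_list = ["Brian Rumbaugh", "Brian Rumbaug", "Ana Silva"]
groups = one_source_match(mailing_list)

# Two sources: a customer file checked against a list of valid ZIP codes.
customer_file = [
    {"name": "Brian Rumbaugh", "zip": "10001"},
    {"name": "Ana Silva", "zip": "10001"},
    {"name": "Chen Wei", "zip": "00000"},
]
valid_zips = {"10001", "94105"}
matched, unmatched = reference_match(customer_file, valid_zips, "zip")

print(groups)
print(len(matched), "matched,", len(unmatched), "unmatched")
```

The reference half is many-to-one: two customer records share ZIP code 10001 and both match the single reference entry, while the record with an unknown ZIP code is left unmatched.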

Here is a sample One source Match job…

[Image: DeDuplication Match. Caption: Individual DeDuplication Match]