Match and Manage your Data on Cloud

We left the last blog with two questions.

A few weeks back I wrote on IBM Bluemix Data Connect. If you missed it, then watch this video on how you can put data to work with IBM Bluemix Data Connect.

Now, Business Analysts will be able to leverage Entity Matching technology using Data Connect. The Match and Manage (BETA) operation on Data Connect identifies possible matches and relationships across a plethora of data sets, including master data and non-master data sets, to create a unified view of your data. It also provides a visualization of the relationships between entities in the unified data set.

For example, suppose you have two sets of data: one containing customer profile information and the other containing a list of prospects. A Business Analyst can now use an intuitive UI to run the Match and Manage operation on these two data sets and get answers to questions such as the following (a rough sketch of this kind of matching appears after the list):

  • Are there duplicates in the prospect list?
  • How many of the prospects are already existing customers?
  • Are there non-obvious relationships among prospects and customers that can be explored?
  • Are there other sources of information within the organization that could provide better insights if brought together?
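
Conceptually, these questions reduce to de-duplicating one data set and matching it against another. Below is a minimal, hypothetical sketch of that idea in pandas; the file names and the `name` and `email` columns are invented for illustration, and real entity matching in Data Connect/MDM uses far more sophisticated fuzzy and frequency-aware comparisons rather than an exact normalized key.

```python
# Hypothetical sketch only -- not the Data Connect / MDM matching engine.
# Assumes two CSV files with invented columns: name, email.
import pandas as pd

def normalize(s: pd.Series) -> pd.Series:
    """Lowercase, trim, and collapse whitespace so near-identical values compare equal."""
    return s.fillna("").str.lower().str.strip().str.replace(r"\s+", " ", regex=True)

customers = pd.read_csv("customers.csv")   # assumed file
prospects = pd.read_csv("prospects.csv")   # assumed file

for df in (customers, prospects):
    df["match_key"] = normalize(df["name"]) + "|" + normalize(df["email"])

# 1) Duplicates within the prospect list
dupes = prospects[prospects.duplicated("match_key", keep=False)]
print(f"Duplicate prospect rows: {len(dupes)}")

# 2) Prospects who are already existing customers (exact match on the normalized key)
overlap = prospects.merge(customers[["match_key"]].drop_duplicates(), on="match_key", how="inner")
print(f"Prospects already in the customer base: {overlap['match_key'].nunique()}")
```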

The two data sets are matched using cognitive capabilities, which allow the MDM matching technology to be auto-configured and tuned to intelligently match across different data sets:

[Image: dataconnect]

Business Analysts can explore the de-duplicated data by navigating a relationship graph that shows how entities are related across the entire dataset. They can now discover non-obvious relationships in the data that were previously hidden. The following generated canvas enables them to interactively explore relationships between entities.

[Image: dataconnect1]

The above example illustrates how clients can now easily understand the data they hold within their MDM repositories and match that MDM data with other data sources not included within the MDM system. This simplifies the Analytical MDM experience: MDM technologies become accessible to everyone, without the need to wait for Data Engineers to transform the data into a format that can be matched or to rely on MDM ninjas to configure matching algorithms.

Summary:

IBM Bluemix Data Connect provides a seamless, integrated self-service experience for data preparation. With the addition of the entity analytics capability, business users are empowered to gain insight from data that wasn't previously available to them. Organizations can now extract further value from their MDM data by ensuring it is used across the organization to provide accurate analytics. Entity analytics within Data Connect is now available in beta. Go ahead and experience the next evolution of MDM.

IA Thin Client – Your entry point into the data lake

In one of my previous blogs, I mentioned that a data lake is a set of one or more data repositories created to support data discovery, analytics, ad hoc investigations, and reporting. Some enterprises have invested money and created a data lake, but are not sure how to begin utilizing their data. The IA Thin Client gives business users and analysts a first grip on that data. By extending the capabilities of Information Analyzer to Hadoop and providing a user-friendly thin client, it helps enterprises get to know their data. Here are a few of its capabilities:
1.    Customers can see a listing of all the data sets in their HDFS file system, preview them, and select a handful of interesting ones.

2.    They can group these interesting data sets into workspaces, say Customer related, Employee related, Finance related, and so on.

3.    The IA Thin Client gives them a dashboard where they can see the overall picture of the data in a particular workspace.

[Image: Workspace]

4.    From a workspace, users can drill into the details of one of these structured or semi-structured data sets and run data analysis to learn more about it. This detailed analysis gives insight about the data in an easily understandable way: What is the quality of the data? What is its format? Can the data be classified into one of several known data classes? Users can also see detailed information for each column (format, any data quality problems observed, data type, min-max values, classification, frequent values, a sampling of actual values, and so on); a rough sketch of this kind of column profiling appears after this list.

[Image: DatasetDetails]

5.    Using the tool, users can suggest corrections to the metadata of the data. For example, after reviewing the analysis they may feel that some data formats do not look correct, that the minimum value should have been something else, or that an identified data quality problem can be ignored. Editing these also updates the overall data quality score.

6.    The tool allows users to add a note to a data set or to link an interesting data set to the existing data governance catalog.

7.    The tool allows customers to apply an existing data rule to the data and see how the data performs against it.

8.    Moreover, all of this is done in a simple, intuitive, easy-to-use thin client so that a non-technical person can easily navigate through the data.
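
For readers who want a feel for what the column analysis in step 4 computes, here is a minimal pandas sketch of my own (not the Information Analyzer implementation); the input file name and the coarse pattern notation are assumptions for illustration.

```python
# Illustrative column-profiling sketch -- not the Information Analyzer implementation.
import pandas as pd

def format_pattern(value: str) -> str:
    """Coarse format pattern: letters become 'A', digits become '9', other characters are kept."""
    return "".join("A" if c.isalpha() else "9" if c.isdigit() else c for c in str(value))

def profile(df: pd.DataFrame, top_n: int = 5) -> None:
    """Print a few of the per-column statistics a profiling tool typically reports."""
    for col in df.columns:
        values = df[col].dropna()
        print(f"\nColumn: {col}")
        print(f"  data type       : {df[col].dtype}")
        print(f"  null count      : {int(df[col].isna().sum())}")
        print(f"  distinct values : {values.nunique()} (unique: {values.is_unique})")
        if pd.api.types.is_numeric_dtype(values):
            print(f"  min / max       : {values.min()} / {values.max()}")
        print(f"  frequent values : {values.value_counts().head(top_n).to_dict()}")
        print(f"  frequent formats: {values.astype(str).map(format_pattern).value_counts().head(top_n).to_dict()}")

profile(pd.read_csv("customers.csv"))   # assumed sample file
```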

You can watch a 4-minute video to get a first-hand experience of the tool.


Or see the InfoSphere Information Analyzer thin client presentation, which provides a comprehensive overview of the Information Analyzer thin client.

Who Leads the Forrester Wave in Data Quality?

This was Information Analyzer and QualityStage vs. the world, and IBM came out on top!

Forrester published its most recent Wave vendor evaluation report on Data Quality on December 14, 2015. IBM is positioned as a strong leader in this evaluation, receiving the highest possible strategy score.

[Image: ForresterDataQuality]

Here are some highlights:

  • IBM gets customers started on enterprise data quality with a rich set of data quality content to speed up the deployment and return on data quality investment across traditional, big data, cloud, and hybrid environments.
  • The stewardship consoles allow business data quality stewards to lead data quality with strong dashboarding, reporting, and data profiling.
  • In addition, business data stewards easily collaborate with data quality developers in the creation of rules, match, and survivor feedback.
  • IBM is also porting its full enterprise data quality capabilities to the cloud and evolving its pricing and services models to be flexible to a variety of customer architectures and implementations.

For the full Forrester report, click here.

How Bluemix Can Help in a Natural Disaster

A few minutes back, the news headline read, “A powerful earthquake has struck South Asia, with tremors felt in northern Pakistan, India and Afghanistan.” Natural disasters are becoming commonplace. Technology can help in predicting such disasters and also in the relief effort after they strike. Based on my involvement in the Uttarakhand and Nepal disaster relief efforts, I want to share how technology can help in post-disaster relief.

Why Cloud?

A cloud-based solution is the obvious choice for the following reasons:

  • Location – cloud datacenters are physically distant from the area of the natural disaster, so applications can keep running even when local power and telecommunications are disrupted.
  • Autoscaling – applications designed for the cloud can automatically scale up to accommodate the sudden spike in usage in the event of a disaster.
  • Support for distributed team development – you won’t be tied to inaccessible physical build and deployment servers if you hit a bug at exactly the wrong time.
  • On-demand pricing – using the infrastructure only when it is required reduces the cost of the solution; there is no need to keep infrastructure ready and waiting for a disaster to strike.


[Image: Disaster]

Why Bluemix?

Bluemix offers many out-of-the-box services that can help in this effort, so one does not need to create applications from scratch. A catalog of IBM, third-party, and open source services allows developers to stitch an application together quickly.

  • Plenty of available Node.js libraries for implementing pop-up sites like wikis
  • Language translation with Watson can be helpful for displaced persons whose first language is not English
  • Twilio can integrate into SMS messaging and VOIP phone networks

How can an ETL tool like Information Server help?

We can use the following functionalities of InfoSphere Information Server in disaster management:

  • Data Standardization: A lot of data about locations and disaster victims is passed around. It comes from various sources and can be dirty or unusable. A data standardization service can cleanse this data to remove noise and make it usable.
  • Data Matching: Victim information needs to be communicated dynamically between the disaster relief teams and the friends and relatives of the victims. These two different sources need to find each other and exchange information. Probabilistic matching algorithms become essential to bring the two together.

These are some of my thoughts. Please share yours so that others can learn and benefit …

Sampling Data Using Information Analyzer

Two years back, in a post on Information Analyzer, I wrote: “Data quality specialists use InfoSphere Information Analyzer to scan samples and full volumes of data to determine their quality and structure.” In this blog I wish to explore data sampling a little bit more.

Need for Data Sampling:
While cooking rice, one does not check every grain to confirm whether the rice is properly cooked; one takes a sample and decides based on that sample. The same often holds for data. With very large volumes, analyzing the full volume is risky: it may take a very long time to process, with a high chance of failing due to resource constraints, competing processes, or other external factors.

To handle such cases, IA provides data sampling (random, sequential, and Nth). At a fraction of the cost, a lot of useful information can be generated (a minimal sketch of these sampling modes follows the list below):
– Is the column unique?
– Are there nulls?
– What are the high frequency values?
– What are the high frequency patterns?
– Are there many outliers?
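
For intuition only, here is a rough sketch of what random, sequential, and Nth-record sampling look like over a pandas DataFrame. This is my own illustration, not how IA implements sampling; the input file and the address_line_2 column are assumptions.

```python
# Illustrative sketch of random, sequential, and Nth-record sampling -- not IA internals.
import pandas as pd

def random_sample(df: pd.DataFrame, fraction: float = 0.01, seed: int = 42) -> pd.DataFrame:
    """Keep a random fraction of the rows."""
    return df.sample(frac=fraction, random_state=seed)

def sequential_sample(df: pd.DataFrame, n: int = 20_000) -> pd.DataFrame:
    """Keep the first n rows."""
    return df.head(n)

def nth_sample(df: pd.DataFrame, n: int = 100) -> pd.DataFrame:
    """Keep every nth row."""
    return df.iloc[::n]

def patterns(s: pd.Series) -> pd.Series:
    """Map each value to a coarse format pattern (letters -> 'A', digits -> '9')."""
    return s.astype(str).map(lambda v: "".join("A" if c.isalpha() else "9" if c.isdigit() else c for c in v))

df = pd.read_csv("addresses.csv")          # assumed input file
sample = random_sample(df, fraction=0.01)

# Compare high-frequency format patterns between the full data and the sample,
# as in the Address Line 2 example that follows.
print(patterns(df["address_line_2"]).value_counts(normalize=True).head())       # assumed column name
print(patterns(sample["address_line_2"]).value_counts(normalize=True).head())
```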

Example
Here is an example of analyzing the data (the format patterns of Address Line 2 in an input address), once with the full load of 2.5 million records and once with a 1% sample of 20K records; the analysis results for the full load and for the sample are pretty close.

Address Line Two – Format Pattern Comparison

Note: Sampling from the IA perspective may still incur a full table scan on the source system. All rows may still be read and sent to IA, where IA only analyzes every 100th record or randomly ignores all but 10% of the records, depending on which sampling options you selected.

Challenges in Sampling

  • One challenge in sampling is that it identifies the high-frequency blocks of values or patterns but will miss most of the low-frequency values or patterns. However, knowledge of the big, frequent value blocks can help develop strategies for analyzing the rest of the data at a reduced cost.
  • Another challenge is that sampling assumes the data is uniform and that bad data patterns are uniformly distributed across all data sources and time periods. If there is reason to believe this is not the case, the best practice is to split the data. Using virtual tables or database views, you can analyze each segment separately (with or without sampling), compare the results, and enhance your data knowledge. These segments can be created based on a hypothesis of where error patterns might occur (for example, by application, data channel, period, or time of day); a small sketch of this segment-wise approach follows the list.
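
As a minimal illustration of the segment-wise idea (not an IA virtual table or a database view), the sketch below splits a data set on an assumed channel column and computes a simple quality measure per segment; the file and column names are invented.

```python
# Illustrative segment-wise analysis -- not an IA virtual table or database view.
import pandas as pd

# Assumed input file with invented 'channel' and 'address_line_2' columns.
df = pd.read_csv("addresses.csv")

for channel, segment in df.groupby("channel"):
    null_rate = segment["address_line_2"].isna().mean()
    print(f"Segment: {channel:<12} rows={len(segment):>8}  null rate of address_line_2: {null_rate:.2%}")
```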

Deterministic vs. Probabilistic Match

As explained in an earlier blog, data matching finds records in a single data source or across independent data sources that refer to the same entity (such as a person, organization, location, product, or material) even if there is no predetermined key. There are two common approaches to deciding whether two similar records match: deterministic matching and probabilistic matching.

[Image: DetermisticVsProbabilistic]

Deterministic matching typically searches for a pool of candidate duplicates and then compares values found in specified attributes between all pairs of possible duplicates. It makes allowances for missing data. The results are given a score, and the scores are used to decide if the records should be considered the same or different. There is a gray area where the scores indicate uncertainty, and such duplicates are usually referred to a data steward for investigation and decision.

Probabilistic matching looks at specified attributes and checks the frequency with which their values occur in the dataset before assigning scores. The scores are influenced by the frequencies of the existing values found. A threshold can be assigned to decide whether a pair is a definite match or whether clerical intervention by a data steward is required to determine a match.

In Summary
Deterministic decision tables:

  • Fields are compared
  • Letter grades are assigned
  • Combined letter grades are compared to a vendor-delivered file
  • Result: Match; Fail; Suspect

Probabilistic record linkage:

  • Fields are evaluated for degree of match
  • A weight is assigned that represents the information content of each value.
  • Weights are summed to derive a total score.
  • Result: Statistical probability of a match

InfoSphere QualityStage can perform both deterministic matching and probabilistic record linkage, but uses probabilistic record linkage by default. The above example highlights the advantage of probabilistic matching.
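
To make the contrast concrete, here is a toy, hand-rolled sketch of the two scoring styles. The fields, letter grades, decision table, and frequency weighting below are invented for illustration and are not the QualityStage algorithms or defaults.

```python
# Toy illustration of deterministic vs. probabilistic (frequency-weighted) scoring.
# Grades, decision table, and weights are invented; this is not the QualityStage implementation.
from collections import Counter

records = [
    {"first": "JOHN", "last": "SMITH", "zip": "10001"},
    {"first": "JON",  "last": "SMITH", "zip": "10001"},
    {"first": "MARY", "last": "LOPEZ", "zip": "94110"},
]

def deterministic_score(a: dict, b: dict) -> str:
    """Compare fields, assign letter grades, and map the combined grades to an outcome."""
    grades = "".join("E" if a[f] == b[f] else "D" for f in ("first", "last", "zip"))
    decision_table = {"EEE": "Match", "DEE": "Suspect", "EDE": "Suspect"}
    return decision_table.get(grades, "Fail")

def probabilistic_score(a: dict, b: dict, population: list) -> float:
    """Agreement on a rare value carries more weight than agreement on a common one."""
    score = 0.0
    for f in ("first", "last", "zip"):
        freq = Counter(r[f] for r in population)[a[f]] / len(population)
        if a[f] == b[f]:
            score += 1.0 / freq      # rarer value -> larger weight
        else:
            score -= 1.0
    return score

print(deterministic_score(records[0], records[1]))                   # 'Suspect' (first name differs)
print(round(probabilistic_score(records[0], records[1], records), 2))  # 2.0 with these toy weights
```

Loosely, the first function mirrors the deterministic decision-table summary above, and the second mirrors the probabilistic weighting, where agreement on a rare value counts for more than agreement on a common one.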

InfoSphere QualityStage – XIV (Standardization Rules Designer)

The Standardization Rules Designer (SRD) in QualityStage is designed to aid in the enhancement of standardization rule sets. Before moving to the SRD itself, I would like to take a moment to share where standardization rule sets are used. These rules are used in QualityStage to do things like the following:

  • Assign data to its appropriate metadata fields. Standardize ensures that the data within a specific field is being used for the business purpose defined in the metadata. For example, credit records might have a driver’s license number in the address line 1 field and a customer’s address in the address line 2 field. To synchronize data with its appropriate metadata field, the driver’s license number can be moved to a separate driver’s license field.
  • Decompose free-form fields into single component fields. For example, the customer’s address can be decomposed into House number, Street name, PO Box, Rural Route, and other smaller component fields.
  • Identify new data fields based on the underlying data. New fields that do not exist on input, such as Gender Flag, Individual/Business Record Indicator, or Nickname, can be populated by the application based on table or file look-ups.
  • Break up records storing multiple entities. It might be necessary to create a separate record for each person or entity that is represented on a single input record (such as joint accounts). A separate record allows for a more complete linkage of all entities in the input files.
  • Exclude records that do not meet minimum criteria. Based on defined business rules, the application can be required to exclude or reject records that do not meet basic requirements (for example, records that do not contain a name or address).

The new Standardization Rules Designer provides an intuitive and efficient framework that we can use to create or enhance standardization rule sets. You can use the browser-based interface to add or modify classifications, lookup tables, and rules. You can also import sample data to validate that the enhancements to the rule set work with your data. The following screen capture shows a part of the Standardization Rules Designer in which you can add or modify a rule by mapping input values from an example record to output columns. This rule splits concatenated values in an input address record by mapping each part of the input value to a different output column.

A rule in the Standardization Rules Designer splits the concatenated values in an address record and maps each value to an output column
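
To give a rough feel for what such a rule does conceptually, here is a tiny sketch that splits a free-form address line into output columns with a regular expression. The pattern, output column names, and street types are invented for illustration and are not an SRD rule set or QualityStage logic.

```python
# Toy illustration of splitting a free-form address into component output columns.
# The regex and output columns are invented; this is not an SRD rule set.
import re

ADDRESS_RE = re.compile(
    r"^(?P<house_number>\d+)\s+(?P<street_name>.+?)\s+(?P<street_type>ST|AVE|RD|BLVD)\.?$",
    re.IGNORECASE,
)

def standardize_address(line: str) -> dict:
    """Map one free-form address line to named output columns (empty strings if no match)."""
    m = ADDRESS_RE.match(line.strip())
    if not m:
        return {"house_number": "", "street_name": "", "street_type": "", "unhandled": line}
    return {**m.groupdict(), "unhandled": ""}

print(standardize_address("123 Main St"))
# {'house_number': '123', 'street_name': 'Main', 'street_type': 'St', 'unhandled': ''}
```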

For more information you can visit: