InfoSphere DataStage – VII (Data Cleansing)

In this blog, I would like to discuss an important feature of any ETL tool – support for data cleansing, which falls under the bigger umbrella of Data Quality. Some debate whether data cleansing should happen during the ETL phase or later, when data marts are being created. We will not enter that debate here; instead, we will focus on the need for data cleansing.

Most organizations today depend on the use of data in two general ways. Standard business processes use data for executing transactions, as well as supporting operational activities. Business analysts review data captured as a result of day-to-day operations through reports as a way of identifying new opportunities for efficiency or growth. So data is used to both run and improve the ways that organizations achieve their business objectives. If that is true, then there must be processes in place to ensure that data is of sufficient quality to meet the business needs.

The price of poor data is illustrated by these examples:

  • A data error in a bank causes 300 credit-worthy customers to receive mortgage default notices. The error costs the bank time, effort, and customer goodwill.
  • A marketing organization sends duplicate direct mail pieces. A six percent redundancy in each mailing costs hundreds of thousands of dollars a year.
  • A managed-care agency cannot relate prescription drug usage to patients and prescribing doctors. The agency’s OLAP application fails to identify areas to improve efficiency and inventory management, or to uncover new selling opportunities.

We should therefore have the ability to check, filter, and correct mistakes or corruption found in the source data. A typical case in which a cleansing process is mandatory is address (location) processing: “street,” “st,” “st.,” and so forth all indicate the word “street,” and an automatic process must be able to recognize them as such. For this particular purpose, there are specialized tools that apply ad-hoc algorithms and maintain their own topographic database from which to retrieve the correct name of every street, and so forth.
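To make the idea concrete, here is a minimal sketch of abbreviation standardization in Python. The lookup table and function name are my own illustration – real tools like QualityStage use far richer rule sets and reference data:

```python
# A toy standardization routine: map common address abbreviations
# to a canonical form via a lookup table (illustrative only).
ABBREVIATIONS = {
    "st": "street", "st.": "street",
    "rd": "road", "rd.": "road",
    "ave": "avenue", "ave.": "avenue",
}

def standardize_address(address: str) -> str:
    """Lowercase the address and replace known abbreviations."""
    tokens = address.lower().split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

print(standardize_address("221B Baker St."))   # -> "221b baker street"
```

Production-grade standardization also handles context (so that “St” in “St. Louis” is not turned into “street”), which is exactly why dedicated reference databases are needed.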

IBM InfoSphere QualityStage™, part of the InfoSphere Information Server suite, comprises a set of stages, a Match Designer, and related capabilities that provide a development environment for building data-cleansing tasks called jobs. It enables enterprises to create and maintain an accurate view of master data entities, such as customers, vendors, locations and products. Core capabilities in it include data investigation, standardization, address validation, probabilistic matching and survivorship. InfoSphere QualityStage may be deployed in transactional, operational, or analytic applications, and in batch and real-time environments.
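To illustrate the flavor of probabilistic matching (without pretending to replicate QualityStage's statistical field-weighting), here is a toy duplicate-detection sketch using Python's standard-library `difflib`; the 0.85 threshold is an arbitrary assumption:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1] (real matchers score per field)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_duplicates(records, threshold=0.85):
    """Return index pairs of records similar above the threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs

customers = ["John A. Smith", "Jon A Smith", "Mary Jones"]
print(find_duplicates(customers))   # -> [(0, 1)]
```

Once likely duplicates are identified, a survivorship step decides which attribute values from each matched pair are kept in the surviving master record.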

A cool thing about QualityStage is that it shares the design canvas with the data integration stages of DataStage.

You can watch a brief video of QualityStage here.
