In this blog, I would like to discuss an important feature of any ETL tool: support for Data Cleansing, which is part of a bigger umbrella called Data Quality. Some debate whether Data Cleansing should happen during the ETL phase or later, when Data Marts are being created. We will not enter that debate here; instead, we will focus on the need for Data Cleansing.
Most organizations today depend on the use of data in two general ways. Standard business processes use data for executing transactions, as well as supporting operational activities. Business analysts review data captured as a result of day-to-day operations through reports as a way of identifying new opportunities for efficiency or growth. So data is used to both run and improve the ways that organizations achieve their business objectives. If that is true, then there must be processes in place to ensure that data is of sufficient quality to meet the business needs.
The price of poor data is illustrated by these examples:
- A data error in a bank causes 300 credit-worthy customers to receive mortgage default notices. The error costs the bank time, effort, and customer goodwill.
- A marketing organization sends duplicate direct mail pieces. A six percent redundancy in each mailing costs hundreds of thousands of dollars a year.
- A managed-care agency cannot relate prescription drug usage to patients and prescribing doctors. The agency’s OLAP application fails to identify areas to improve efficiency and inventory management and new selling opportunities.
We should therefore have the ability to check, filter, and correct mistakes or corruption found in the source data. A typical case in which a cleansing process is mandatory is address (location) processing: “street,” “st,” “st.,” and so forth all indicate the word “street,” and an automatic process must be able to recognize them as the same. For this particular purpose, there are specialized tools that apply ad-hoc algorithms and maintain their own topographic database in which the correct name of every street is stored and looked up.
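The abbreviation-expansion idea above can be sketched in a few lines. This is a minimal illustration using a hand-built lookup table (the table contents and function name are my own assumptions, not part of any real tool); production cleansing tools use far richer rules and reference databases.

```python
# Minimal sketch of rule-based address standardization.
# ABBREVIATIONS is an assumed, hand-built table for illustration only;
# real tools rely on extensive rule sets and topographic databases.
ABBREVIATIONS = {
    "st": "street",
    "st.": "street",
    "str": "street",
    "ave": "avenue",
    "ave.": "avenue",
    "rd": "road",
    "rd.": "road",
}

def standardize_address(address: str) -> str:
    """Lower-case the address and expand known abbreviations token by token."""
    tokens = address.lower().split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)

print(standardize_address("221B Baker St."))  # → 221b baker street
```

With such a normalization step in place, “Baker St,” “Baker St.,” and “Baker Street” all collapse to one canonical form, which is exactly what downstream matching and de-duplication need.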
IBM InfoSphere QualityStage™, part of the InfoSphere Information Server suite, comprises a set of stages, a Match Designer, and related capabilities that provide a development environment for building data-cleansing tasks called jobs. It enables enterprises to create and maintain an accurate view of master data entities, such as customers, vendors, locations, and products. Its core capabilities include data investigation, standardization, address validation, probabilistic matching, and survivorship. InfoSphere QualityStage may be deployed in transactional, operational, or analytic applications, and in batch and real-time environments.
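To make the idea of probabilistic matching concrete, here is a generic sketch of how such matching is commonly scored: each field contributes an agreement or disagreement weight to a composite score, and record pairs above a threshold are treated as the same entity. The field names and weights below are assumptions chosen for illustration; they are not QualityStage's actual parameters or API.

```python
# Illustrative probabilistic matching: per-field agreement/disagreement
# weights summed into a composite score. All weights here are assumed
# values for demonstration, not taken from any real product.
FIELD_WEIGHTS = {          # field: (agreement weight, disagreement weight)
    "name": (4.0, -2.0),
    "city": (2.0, -1.0),
    "zip":  (3.0, -3.0),
}

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Sum agreement/disagreement weights over fields present in both records."""
    score = 0.0
    for field, (agree, disagree) in FIELD_WEIGHTS.items():
        a = rec_a.get(field, "").lower()
        b = rec_b.get(field, "").lower()
        if a and b:  # only score fields populated on both sides
            score += agree if a == b else disagree
    return score

a = {"name": "John Smith", "city": "Boston", "zip": "02101"}
b = {"name": "John Smith", "city": "boston", "zip": "02108"}
print(match_score(a, b))  # 4.0 + 2.0 - 3.0 → 3.0
```

A pair scoring above a chosen threshold would then be merged, with survivorship rules deciding which field values the surviving master record keeps.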
A cool thing about QualityStage is that its stages share the same design canvas as the data integration stages, so cleansing and integration logic can be built together in one job.
You can watch a brief video of QualityStage here.