Information Governance – Revisited

It has been more than five years since I wrote on Information Governance. Over that period some areas of Information Governance have matured, so I thought of revisiting the topic. A simple analogy: what a library does for books, data governance does for data. It organizes data, makes it simple to access, provides a means to check data for validity and accuracy, and makes it understandable to all who need it. With Information Governance in place, organizations can use data to generate insights, and they are also equipped to meet regulatory mandates (such as GDPR).

There are six sets of capabilities that make up the Information Management & Governance component:

1. Data Lifecycle Management is a discipline that applies not only to analytical data but also to operational, master and reference data within the enterprise. It involves defining and implementing policies on the creation, storage, transmission, usage and eventual disposal of data, in order to ensure that it is handled in such a way as to comply with business requirements and regulatory mandates.

2. MDM: Master and Entity Data acts as the ‘single source of the truth’ for entities – customers, suppliers, employees, contracts etc.  Such data is typically stored outside the analytics environment in a Master Data Management (MDM) system, and the analytics environment then accesses the MDM system when performing tasks such as data integration.

3. Reference Data is similar in concept to Master and Entity Data, but pertains to common data elements such as location codes, currency exchange rates etc., which are used by multiple groups or lines of business within the enterprise.  Like Master and Entity Data, Reference data is typically leveraged by operational as well as analytical systems.  It is therefore typically stored outside the analytics environment and accessed when required for data integration or analysis.

4. Data Catalog is a repository that contains metadata relating to the data stored in the Analytical Data Lake Storage repositories. The catalog maintains the location, meaning and lineage of data elements, the relationships between them, and the policies and rules relating to their security and management. The catalog is critical for enabling effective information governance and for supporting self-service access to data for exploration and analysis (a small sketch of a catalog entry follows this list).

5. Data Models provide a consistent representation of data elements and their relationships across the enterprise.  An effective Enterprise Data Model facilitates consistent representation of entities and relationships, simplifying management of and access to data.

6. Data Quality Rules describe the quality requirements for each data set within the Analytical Data Lake Storage component, and provide measures of data quality that potential consumers of data can use to determine whether a data set is suitable for a particular purpose. For example, data sets obtained from social media sources are often sparse and therefore ‘low quality’, but that does not necessarily disqualify a data set from being used. Provided a user of the data knows about its quality, they can use that knowledge to determine what kinds of algorithms can best be applied to that data (a minimal quality-rule sketch follows this list).
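As mentioned in item 4, a catalog entry can be pictured as a small metadata record. The following is a minimal, hypothetical sketch of such an entry; a real catalog such as the Information Governance Catalog holds far richer metadata:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CatalogEntry:
    name: str                 # business name of the data set
    location: str             # where it lives in the data lake
    meaning: str              # agreed business definition
    lineage: List[str] = field(default_factory=list)        # upstream sources
    policies: Dict[str, str] = field(default_factory=dict)  # governance rules

# All values below are invented for illustration.
entry = CatalogEntry(
    name="customer_profile",
    location="hdfs:///lake/raw/customer_profile",
    meaning="One row per customer, mastered in the MDM system",
    lineage=["crm_extract", "mdm_customer"],
    policies={"classification": "confidential", "retention": "7 years"},
)
print(entry.name, "->", entry.location)
```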
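And as a minimal sketch of a data quality rule, the snippet below measures the completeness of one column and compares it with a threshold; the sample records and the 0.8 threshold are invented purely for illustration:

```python
# A minimal data quality rule: measure the completeness of a column
# and report whether it meets a required threshold.
records = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3, "email": "c@example.com"},
]

def completeness(rows, column):
    """Fraction of rows in which the column is populated."""
    populated = sum(1 for r in rows if r.get(column) not in (None, ""))
    return populated / len(rows) if rows else 0.0

score = completeness(records, "email")
print(f"email completeness: {score:.2f}")    # 0.67
print("meets rule (>= 0.8)?", score >= 0.8)  # False
```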

 


Match and Manage your Data on Cloud


A few weeks back I wrote about IBM Bluemix Data Connect. If you missed it, watch this video on how you can put data to work with IBM Bluemix Data Connect.

Now, Business Analysts can leverage entity matching technology using Data Connect. The Match and Manage (BETA) operation on Data Connect identifies possible matches and relationships across a plethora of data sets, including master data and non-master data sets, to create a unified view of your data. It also provides a visualization of the relationships between entities in the unified data set.

For example, say you have two sets of data: one containing customer profile information and the other containing a list of prospects. A Business Analyst can now use an intuitive UI to run the Match and Manage operation on these two data sets and get insights into questions such as:

  • Are there duplicates in the prospect list?
  • How many of the prospects are already existing customers?
  • Are there non-obvious relationships among prospects and customers that can be explored?
  • Are there other sources of information within the organization that could provide better insights if brought together?

The two data sets are matched using cognitive capabilities, which allow the MDM matching technology to be auto-configured and tuned to match intelligently across different data sets.

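To give a feel for what matching does, here is a deliberately crude sketch; Data Connect's cognitive matching is probabilistic and self-tuning, whereas this example simply scores prospects against customers on name and city similarity, with all data, weights and thresholds invented for illustration:

```python
import difflib

# Tiny, made-up data sets: existing customers and incoming prospects.
customers = [
    {"id": "C1", "name": "Asha Patel", "city": "Pune"},
    {"id": "C2", "name": "John Smith", "city": "Austin"},
]
prospects = [
    {"id": "P1", "name": "A. Patel", "city": "Pune"},
    {"id": "P2", "name": "Maria Gomez", "city": "Madrid"},
]

def similarity(a, b):
    """Crude name + city similarity score in [0, 1]."""
    name_score = difflib.SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    city_score = 1.0 if a["city"].lower() == b["city"].lower() else 0.0
    return 0.7 * name_score + 0.3 * city_score

for p in prospects:
    best = max(customers, key=lambda c: similarity(p, c))
    score = similarity(p, best)
    status = "likely existing customer" if score > 0.6 else "new prospect"
    print(p["id"], "->", best["id"], f"{score:.2f}", status)
```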

A Business Analyst can understand the de-duplicated data sets by navigating a relationship graph of the data to see how entities are related across the entire data set, and can discover non-obvious relationships that were previously undiscoverable. The generated canvas enables them to interactively explore relationships between entities.


The example above illustrates how clients can now easily understand the data they hold within their MDM repositories and how they can match their MDM data with other data sources not included within the MDM system. This simplifies the Analytical MDM experience: MDM technology becomes accessible to everyone, without the need to wait for data engineers to transform the data into a format that can be matched, or to rely on MDM ninjas to configure matching algorithms.

Summary:

IBM Bluemix Data Connect provides a seamless, integrated self-service experience for data preparation. With the addition of the entity analytics capability, business users are empowered to gain insight from data that wasn’t previously available to them. Organizations can now extract further value from their MDM data by ensuring it is used across the organization to provide accurate analytics. Entity analytics within Data Connect is now available in beta. Go ahead and experience the next evolution of MDM.

IA Thin Client - Your entry point into the data lake

In one of my previous blogs, I mentioned that a data lake is a set of one or more data repositories created to support data discovery, analytics, ad hoc investigations, and reporting. Some enterprises have invested money and created a data lake, but are not sure how to begin utilizing their data. IA Thin Client gives the business user or analyst a first grip on that data. By extending the capabilities of Information Analyzer to Hadoop and providing a user-friendly thin client, it helps enterprises get to know their data. Here are a few of its capabilities:
1. Customers can see a listing of all the data they have in their HDFS file system, preview it, and select a handful of interesting data sets.

2. They can group these interesting data sets into workspaces, say Customer-related, Employee-related, Finance-related, and so on.

3. IA Thin Client gives them a dashboard where they can see the overall picture of the data in a particular workspace.

4. From a workspace, you can drill into the details of one of these structured or semi-structured data sets and run data analysis to learn more about it. This detailed analysis gives insight about the data in an easily understandable way: What is the quality of the data? What is its format? Can the data be classified into one of several known data classifications? Users can also see detailed information for each column of the data (format, any data quality problems observed, data type, min/max values, classification, frequent values, a sampling of actual values, and so on); a minimal profiling sketch in this spirit appears after this list.

5. Using the tool, users can suggest changes to the metadata of the data. For example, after reviewing a data set they may feel that some data formats do not look correct, that the minimum value should have been something else, or that an identified data quality problem can be ignored. Editing these also updates the overall data quality score.

6. The tool allows users to add a note to a data set or link one of the interesting data sets to the existing data governance catalog.

7. The tool allows the customer to apply an existing data rule to the data and see how the data performs against it.

8. Moreover, all of this is done in a simple, intuitive, easy-to-use thin client, so that a non-technical person can easily navigate through the data.
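As a rough idea of the column-level analysis described in item 4, here is a minimal profiling sketch over an invented CSV sample; Information Analyzer's actual analysis (classification, format inference, quality scoring) goes much further:

```python
import csv
import io
from collections import Counter

# Made-up sample data standing in for a file in the data lake.
sample = io.StringIO(
    "customer_id,country,age\n"
    "1,IN,34\n"
    "2,US,\n"
    "3,IN,41\n"
)

rows = list(csv.DictReader(sample))
for column in rows[0].keys():
    values = [r[column] for r in rows if r[column] not in (None, "")]
    profile = {
        "populated": len(values),
        "distinct": len(set(values)),
        "min": min(values) if values else None,  # lexicographic here; real profiling is type-aware
        "max": max(values) if values else None,
        "frequent": Counter(values).most_common(2),
    }
    print(column, profile)
```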

You can watch a four-minute video to get first-hand experience of the tool.


Or see the InfoSphere Information Analyzer thin client presentation, which provides a comprehensive overview of the Information Analyzer thin client.

Who Leads the Forrester Wave in Data Quality?

This was Information Analyzer and QualityStage vs. the world, and IBM came out on top!

Forrester published their most recent Wave vendor evaluation report on data quality on December 14, 2015. IBM is positioned as a strong leader in this evaluation, receiving the highest possible strategy score.


Here are some highlights:

  • IBM gets customers started on enterprise data quality with a rich set of data quality content to speed up the deployment and return on data quality investment across traditional, big data, cloud, and hybrid environments.
  • The stewardship consoles allow business data quality stewards to lead data quality with strong dashboarding, reporting, and data profiling.
  • In addition, business data stewards easily collaborate with data quality developers in the creation of rules, match, and survivor feedback.
  • IBM is also porting its full enterprise data quality capabilities to the cloud and evolving its pricing and services models to be flexible to a variety of customer architectures and implementations.

For the full Forrester report, click here.

How Bluemix can help in a Natural Disaster

A few minutes back the news headline read: “A powerful earthquake has struck south Asia, with tremors felt in northern Pakistan, India and Afghanistan.” Natural disasters are becoming commonplace. Technology can help predict such disasters and can also help in relief efforts after the disaster. Based on my involvement in the Uttarakhand and Nepal disaster relief efforts, I want to share how technology can help in post-disaster relief.

Why Cloud?

A solution on the cloud is the natural choice for the following reasons:

  • Location – the Cloud datacenters are physically distant from the area of the natural disaster and applications can keep running even when power and telecommunications are disrupted.
  • Autoscaling – applications designed for Cloud can automatically scale up easily to accommodate the sudden spike in the application usage on the event of disaster.
  • Support for distributed team development – you won’t be tied to inaccessible physical build and deployment servers if you hit a bug at exactly the wrong time.
  • On demand pricing – Using the infrastructure only when it is required – Reduces cost of solution. No need to keep the infrastructure ready, waiting for the disaster to strike.


Why Bluemix?

Bluemix offers many out-of-the-box services that can help in this effort, so one does not have to create applications from scratch. A catalog of IBM, third-party, and open source services allows the developer to stitch an application together quickly.

  • Lots of available libraries for Node.JS for implementing pop-up sites like Wikis
  • Language translation with Watson can be helpful for displaced persons whose first language is not English
  • Twilio can integrate into SMS messaging and VOIP phone networks

How can an ETL tool like Information Server help?

We can use the following capabilities of InfoSphere Information Server in disaster management:

  • Data Standardization: A lot of data about locations and disaster victims is passed around. It comes from various sources and can be dirty or unusable. A data standardization service can cleanse the data, removing noise and making it usable (a small cleansing sketch follows this list).
  • Data Matching: Victim information needs to be communicated dynamically between the disaster relief teams and the friends and relatives of the victims. These two different sources need to find each other and exchange information, and probabilistic matching algorithms are essential to bring them together.
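Here is a minimal standardization sketch to illustrate the idea; real cleansing in QualityStage is rule-driven and far richer, and the abbreviation table below is invented for this example:

```python
import re

# Hypothetical expansion table for a few common address abbreviations.
ABBREVIATIONS = {"RD": "ROAD", "ST": "STREET", "APT": "APARTMENT"}

def standardize(text):
    """Uppercase, strip punctuation and expand known abbreviations."""
    text = re.sub(r"[^\w\s]", " ", text.upper())
    tokens = [ABBREVIATIONS.get(t, t) for t in text.split()]
    return " ".join(tokens)

print(standardize("12, m.g. rd, Apt 4"))  # "12 M G ROAD APARTMENT 4"
```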

These are some of my thoughts. Please share yours so that others can learn and benefit …

DataStage job runtime architecture on Hadoop

In my earlier blog, I explored why enterprises are using Hadoop. In summary, scalable data platforms such as Hadoop offer unparalleled cost benefits and analytical opportunities (including content analytics) to enterprises. In this blog, I will cover some of the enhancements in IBM’s InfoSphere Information Server 11.5 that help leverage the scale and promise of Hadoop.

Data integration in Hadoop:
In this release, Information Server can execute directly inside a Hadoop cluster. This means that all of the data connectivity, transformation, cleansing, enhancement, and data delivery features that thousands of enterprises have relied on for years are immediately available to run within the Hadoop platform! Information Server is a market-leading product in terms of its data integration and governance capabilities. Now the same product can be used to solve some of the industry’s most complex data challenges directly inside a Hadoop cluster. Imagine the time saved by not moving data back and forth to and from HDFS!

What’s more, these new features for Hadoop use the same simple graphical design environment that IBM clients are already accustomed to using to build integration applications. In other words, organizations can build new Hadoop-based, information-intensive applications without retraining their development teams on newly emerging languages that require manual hand coding and lack governance support.

How is this accomplished? YARN! 
Apache Hadoop YARN is the framework for job scheduling and cluster resource management. Information Server can communicate with YARN to run a job on the data nodes of a Hadoop cluster using the following steps.

Here is more detail on how Information Server uses YARN:


  1. A job is submitted to run in the Information Server engine.
  2. The ‘Conductor’ (the process responsible for coordinating the job) asks YARN to instantiate the YARN version of the Conductor: the Application Master.
  3. The YARN Client is responsible for starting and stopping Application Masters.
  4. Now that the Application Master is ready, ‘Section Leaders’ (each responsible for the work on a data node) are prepared.
  5. Section Leaders are created and managed by YARN Node Managers. This is the point where the BigIntegrate/BigQuality binaries are copied to the Hadoop DataNode if they do not already exist there.
  6. Now the real work can begin: the ‘players’ (the processes that actually do the work) are started.

All of this is automatic and behind the scenes.  The actual user interface will look and feel identical to when a job is run on Windows, AIX, or Linux.
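The sketch below is purely conceptual and is not the actual Information Server implementation; the class and method names are invented simply to mirror the Conductor, Application Master, Section Leader and player roles described in the steps above:

```python
# Conceptual illustration only: how the roles relate when a job is run.
class YarnClient:
    def start_application_master(self):
        print("YARN: Application Master started")
        return ApplicationMaster()

class SectionLeader:
    def __init__(self, node):
        self.node = node
    def start_players(self, count):
        # Players are the processes that actually execute the work on a node.
        print(f"{self.node}: started {count} player processes")

class ApplicationMaster:
    def start_section_leaders(self, data_nodes):
        # Node Managers create one Section Leader per data node; binaries are
        # copied to the node first if they are not already present there.
        return [SectionLeader(node) for node in data_nodes]

def run_job(job_name, data_nodes):
    print(f"Conductor: submitting {job_name}")
    master = YarnClient().start_application_master()
    for leader in master.start_section_leaders(data_nodes):
        leader.start_players(count=2)

run_job("sample_datastage_job", ["datanode1", "datanode2"])
```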


Tasting the 4 flavors of the NEW IBM InfoSphere Information Server V11.5

IBM has just announced Information Server 11.5 and new products running Information Server on Hadoop, called BigIntegrate and BigQuality. The version 11.5 release includes improvements to products like DataStage, Information Governance Catalog and Information Analyzer. The new releases will be available on September 25, 2015.

InfoSphere Information Server V11.5 capabilities are available in four flavors (aka packages) to help firms overcome key information challenges:


  • InfoSphere Information Server for Data Integration: Flexibly transform data in any style and deliver it to any system, supporting faster time to value and reduced IT risk.
  • InfoSphere Information Server for Data Quality: Establish high-quality data and manage it, turning a deluge of data into trusted information.
  • InfoSphere Information Governance Catalog: Better understand data and foster collaboration between IT and line-of-business teams to narrow the communication gap and create a framework for information governance.
  • InfoSphere Information Server Enterprise Edition: Gain the capabilities of all three individual packages in one comprehensive offering; start information integration efforts in one area and then expand when needed to further optimize results.

More coming soon….