Information Governance – Revisited

It has been more than five years since I last wrote about Information Governance. Over those five years some areas of Information Governance have matured, and I thought it was time to revisit the topic. As a simple analogy, what a library does for books, data governance does for data: it organizes the data, makes it simple to access, provides the means to check its validity and accuracy, and makes it understandable to everyone who needs it. With Information Governance in place, organizations can use data to generate insights, and they are also equipped to meet regulatory mandates (like GDPR).

There are six sets of capabilities that make up the Information Management & Governance component:

  1. Data Lifecycle Management is a discipline that applies not only to analytical data but also to operational, master and reference data within the enterprise.  It involves defining and implementing policies on the creation, storage, transmission, usage and eventual disposal of data, in order to ensure that it is handled in such a way as to comply with business requirements and regulatory mandates.

2. MDM: Master and Entity Data acts as the ‘single source of the truth’ for entities – customers, suppliers, employees, contracts etc.  Such data is typically stored outside the analytics environment in a Master Data Management (MDM) system, and the analytics environment then accesses the MDM system when performing tasks such as data integration.

3. Reference Data is similar in concept to Master and Entity Data, but pertains to common data elements such as location codes, currency exchange rates etc., which are used by multiple groups or lines of business within the enterprise.  Like Master and Entity Data, Reference data is typically leveraged by operational as well as analytical systems.  It is therefore typically stored outside the analytics environment and accessed when required for data integration or analysis.

4. Data Catalog is a repository that contains metadata relating to the data stored in the Analytical Data Lake Storage repositories.  The catalog maintains the location, meaning and lineage of data elements, the relationships between them and the policies and rules relating to their security and management.  The catalog is critical for enabling effective information governance and for supporting self-service access to data for exploration and analysis.

5. Data Models provide a consistent representation of data elements and their relationships across the enterprise.  An effective Enterprise Data Model facilitates consistent representation of entities and relationships, simplifying management of and access to data.

6. Data Quality Rules describe the quality requirements for each data set within the Analytical Data Lake Storage component, and provide measures of data quality that potential consumers of the data can use to determine whether a data set is suitable for a particular purpose.  For example, data sets obtained from social media sources are often sparse and therefore ‘low quality’, but that does not necessarily disqualify a data set from being used.  Provided a user of the data knows about its quality, they can use that knowledge to determine what kinds of algorithms can best be applied to that data.
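To make the idea of a quality measure concrete, here is a minimal sketch in Python. It assumes a data set is just a list of records and measures only completeness (how well each field is populated). The names `profile_completeness` and `meets_rule`, the sample posts and the thresholds are all made up for illustration and do not come from any particular product.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    completeness: dict  # column name -> fraction of populated values
    row_count: int


def profile_completeness(rows, columns):
    """Measure how well populated each column is (hypothetical helper)."""
    total = len(rows)
    completeness = {}
    for col in columns:
        filled = sum(1 for r in rows if r.get(col) not in (None, ""))
        completeness[col] = filled / total if total else 0.0
    return QualityReport(completeness=completeness, row_count=total)


def meets_rule(report, column, min_completeness):
    """A quality rule: the column must be at least this well populated."""
    return report.completeness.get(column, 0.0) >= min_completeness


# Sparse, social-media-like sample data (invented for this sketch).
posts = [
    {"user": "a", "text": "hello", "location": "Pune"},
    {"user": "b", "text": "hi", "location": ""},
    {"user": "c", "text": "", "location": None},
    {"user": "d", "text": "hey", "location": None},
]

report = profile_completeness(posts, ["user", "text", "location"])
print(report.completeness)
print("text usable for text analysis:", meets_rule(report, "text", 0.7))
print("location usable for geo analysis:", meets_rule(report, "location", 0.5))
```

The point of the sketch is that the data set is not rejected outright: the consumer sees the measure and decides that the sparse `location` field rules out one kind of analysis but not another.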



Need for Governance in Self-Service Analytics

Analytics Offering without Self-Service

Self-Service Analytics is a form of business intelligence in which line-of-business professionals or data scientists are enabled and encouraged to perform queries and generate reports on their own, with nominal IT support. This empowers everyone in the organization to discover new insights and make informed decisions. Capitalizing on the data lake, or modernized data warehouse, they can analyze full data sets (no more sampling), gain insight from non-relational data, and pursue exploratory analysis and discovery with a 360° view of their business. At this stage, the organization can truly be data-savvy and insight-driven, leading to better decisions, more effective actions, and improved outcomes. Insight is used to make risk-aware decisions, fight fraud and counter threats, optimize operations or, most often, attract, grow and retain customers.

Any self-service analytics, regardless of persona, has to involve data governance. Here are three examples of how serious analytics work would be impossible without support for a proper data governance practice in the analytics technology:

  1. Fine-grained authorization controls: Most industries feature data sets where data access needs to be controlled so that sensitive data is protected. As data moves from one store to another, gets transformed, and aggregated, the authorization information needs to move with that data. Without the transfer of authorization controls as data changes state or location, self-service analytics would not be permitted under the applicable regulatory policies.
  2. Data lineage information: As data moves between different data storage layers and changes state, it’s important for the lineage of the data to be captured and stored. This helps analysts understand what goes into their analytic results, but it is also often a policy requirement for many regulatory frameworks. An example of where this is important is the right to be forgotten, which is a legislative initiative we are seeing in some Western countries. With this, any trace of information about a citizen would have to be tracked down and deleted from all of an organization’s data stores. Without a comprehensive data lineage framework, adherence to a right to be forgotten policy would be impossible.
  3. Business glossary: A current and complete business glossary acts as a roadmap for analysts to understand the nature of an organization’s data. Specifically, a business glossary maps an organization’s business concepts to the data schemas. One common problem with Hadoop data lakes is a lack of business glossary information as Hadoop has no proper set of metadata and governance tooling.
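The lineage point can be illustrated with a small sketch. It assumes lineage is recorded as simple "derived from" links between data sets and answers the question a right-to-be-forgotten request raises: starting from one source, which downstream data sets could hold a trace of the record? The `LineageGraph` class and the data set names are hypothetical. Real governance tools capture much richer lineage (columns, jobs, timestamps).

```python
from collections import defaultdict, deque


class LineageGraph:
    """Hypothetical, minimal lineage store: source -> derived data sets."""

    def __init__(self):
        self._downstream = defaultdict(set)

    def record(self, source, target):
        """Record that `target` was derived from `source`."""
        self._downstream[source].add(target)

    def affected_by(self, origin):
        """Return every data set derived, directly or indirectly, from `origin`."""
        seen = set()
        queue = deque([origin])
        while queue:
            node = queue.popleft()
            for child in self._downstream.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen


graph = LineageGraph()
graph.record("crm.customers", "staging.customers_clean")
graph.record("staging.customers_clean", "lake.customer_360")
graph.record("staging.customers_clean", "mart.churn_scores")
graph.record("web.clickstream", "lake.customer_360")

# Data sets to check when a customer in crm.customers asks to be forgotten.
to_check = graph.affected_by("crm.customers")
print(sorted(to_check))
```

Without the recorded links there would be no way to know that `mart.churn_scores` needs to be checked, which is exactly why lineage has to be captured as data moves rather than reconstructed afterwards.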

A core design point of any self-service analytics offering (like IBM DataWorks) is that data governance capabilities should be baked in. This enables self-service data analysis where analysts only see data they’re entitled to see, where data movement and transformation are automatically tracked for a complete lineage story, and where business glossary information is used as users search for data.

A dip into ‘Data Reservoir’

In the previous blog, we discussed in great detail the limitations of a data lake and how, without proper governance, a data lake can become overwhelming and unsafe to use. Hence an enhanced data lake solution emerged, known as a data reservoir. So how does a data reservoir assist the enterprise?

  • A data reservoir provides the right information to people so they can perform activities like the following:
    – Investigate and understand a particular situation or type of activity.
    – Build analytical models of the activity.
    – Assess the success of an analytic solution in production in order to improve it.
  • A data reservoir provides credible information to subject matter experts (such as data analysts, data scientists, and business teams) so they can perform analysis activities such as investigating and understanding a particular situation, event, or activity.
  • A data reservoir has capabilities that ensure the data is properly cataloged and protected so subject matter experts can confidently access the data they need for their work and analysis.
  • The creation and maintenance of the data reservoir is accomplished with little to no assistance and additional effort from the IT teams.

Design of a Data Reservoir:
This design point is critical because subject matter experts play a crucial role in ensuring that analytics provides worthwhile and valuable insights at appropriate points in the organization’s operation. With a data reservoir, line-of-business teams can take advantage of the data in the data reservoir to make decisions with confidence.

There are three main parts to a data reservoir, described as follows:

  • The data reservoir repositories (Figure 1, item 1) provide platforms both for storing data and running analytics as close to the data as possible.
  • The data reservoir services (Figure 1, item 2) provide the ability to locate, access, prepare, transform, process, and move data in and out of the data reservoir repositories.
  • The information management and governance fabric (Figure 1, item 3) provides the engines and libraries to govern and manage the data in the data reservoir. This set of capabilities includes validating and enhancing the quality of the data, protecting the data from misuse, and ensuring it is refreshed, retained, and eventually removed at appropriate points in its lifecycle.

The data reservoir is designed to offer simple and flexible access to data because people are key to making analytics successful. For more information please read Governing and Managing Big Data for Analytics and Decision Makers.

Data Governance: And the winner is…

When an organization runs a strong Information Governance program, it helps ensure that information used for critical decisions in the organization is trusted, particularly when it comes from such a central hub as the information warehouse. The information must come from an authoritative source and be known to be complete, timely, and relevant to the people and systems that are involved in making the decision. It must be managed by an Information Steward who can communicate to others about its purpose, usage, and quality. Through communication of Information Governance policy and rules, business terms, and their relationship to the information assets, the information can be clearly understood across the organization.

I was going through The Forrester Wave™: Data Governance Tools, Q2 2014.  IBM has been named a leader and has earned the highest scores for both strategy and market presence.

IBM was adjudged the Leader based on an evaluation across the following 5 domains of data governance management:

  1. quality
  2. reference
  3. life-cycle management
  4. security/privacy
  5. metadata

These are the products in the Information Governance story of IBM (with links to my previous blogs on these topics)



InfoSphere Quality Stage – XIII (Creating a good match specification)

In one of my earlier blogs I talked about matching in Quality Stage. For me this is one of the most powerful parts of Quality Stage, where we can find duplicates in huge volumes of data in a matter of minutes. For example, one of our customers wanted to find duplicates in 5 million of his records, and we were able to do that on my server in minutes. Not only that, we found several cases where it appeared that the data had been intentionally modified so that duplicates could not be identified, and the tool was still able to catch them. But all of this depends on how well we create our match specification. Unfortunately, creating a good match specification is an art that many practitioners are not aware of. I sit with many practitioners and spend time teaching them how to use this tool to create a good specification. So in this blog I want to share some of my tips for creating a good match specification. This is not exhaustive, but it is a good starting point that I got from a redbook.

Generally, raw information in the source systems is not acceptable to match on because it is either stored in a free-form format or, even if it is parsed into individual columns, the content does not always correspond to the metadata. It is very difficult to achieve high-accuracy match results when using data in these formats. So, as a rule, business intelligence fields and match fields that are generated by Domain Specific Standardization should be used for match specifications.

In addition, single domain fields, such as SSN or TIN (that might require some validation) should be reviewed as possible additional fields to help in determining the match. Initial investigation results could be used to determine the accuracy and field population.

Initial blocking strategy
Within a file with many records, comparing every single pair of records may seem the logical way to find matches, but for large files such a job takes a very long time to run because of the number of comparisons that need to be made: n*(n-1)/2 for n records.

Blocking is a scheme that reduces the number of pairs of records that need to be examined. In blocking, files are partitioned into mutually exclusive and exhaustive blocks designed to increase the number of matches while decreasing the number of pairs to compare. Consequently, blocking is generally implemented by partitioning files based on the values of one or more fields.
The following guidelines should be used when selecting fields for blocking:

  •  Select fields that are reliable, well populated, and have a good frequency distribution of values
  •  Use multiple fields in a block
  •  Go from most specific in the first pass – to more lenient in future passes. Your first pass should result in the most obvious and “best” matches
  •  Multiple Pass Strategy: Use multiple passes for variation and to “widen the net”
  •  Use match fields generated from standardization (NYSIIS, Soundex) rather than the business intelligence fields when applicable (First Name, Primary Name, Street Name, City Name)

Of course, if the blocking fields of a pair of records are not the same in any of the passes of the match specification, that pair does not have a chance to be linked.
So defining blocking fields and multiple passes is a balance between scope and accuracy, with the aim of comparing a reasonable number of “like” records.
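The effect of blocking and multiple passes can be sketched in a few lines of Python. This is only an illustration of the idea and not how Quality Stage implements it. The field names (`name_key`, `zip`, `birth_year`), the phonetic key values and the two passes are assumptions invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n):
    """Number of comparisons if every record is compared with every other."""
    return n * (n - 1) // 2


def candidate_pairs(records, passes):
    """Collect the pairs that share a block in at least one pass.

    `passes` is a list of tuples of field names; each tuple defines
    the blocking key for one pass.
    """
    pairs = set()
    for fields in passes:
        blocks = defaultdict(list)
        for rec in records:
            key = tuple(rec[f] for f in fields)
            if all(key):  # records with an empty blocking value are skipped
                blocks[key].append(rec["id"])
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


records = [
    {"id": 1, "name_key": "S530", "zip": "10001", "birth_year": "1980"},
    {"id": 2, "name_key": "S530", "zip": "10001", "birth_year": "1980"},
    {"id": 3, "name_key": "S530", "zip": "10002", "birth_year": "1980"},
    {"id": 4, "name_key": "J520", "zip": "10001", "birth_year": "1975"},
    {"id": 5, "name_key": "J520", "zip": "", "birth_year": "1975"},
]

# Pass 1 is the most specific; pass 2 widens the net.
passes = [("name_key", "zip"), ("name_key", "birth_year")]

pairs = candidate_pairs(records, passes)
print("pairs without blocking:", all_pairs_count(len(records)))
print("pairs after blocking:  ", len(pairs), sorted(pairs))
print("pairs for 5 million records without blocking:", all_pairs_count(5_000_000))
```

Note how record 5, which has no zip code, is only picked up by the second pass. That is the reason for using more than one pass: a pair that never lands in the same block is never compared at all.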

Select fields for match comparison

These are guidelines on the match field selection:

  •  Do not use Match Fields generated from Standardization. Use Business Intelligence Fields (such as First Name, Primary Name, Street Name, and City Name) and apply a fuzzy comparison:
    – UNCERT (Street Name, City Name)
    – Name UNCERT (Middle Name, First Name Maybe)
    – MultUNCERT (Individual Name)
  • Use Exact Comparison methods on shorter fields (as examples, House Number, Codes, Gender, and Floor Value) as well as on fields well standardized (such as Street Type, Directions, State Code, Country Code, and Floor Type)
  •  Use default match parameters in the initial run
  •  Use default match cutoff values of 0 in the initial run
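As a rough sketch of how these guidelines combine, the following Python example scores a pair of records field by field and compares the total with a cutoff of 0. It is an illustration under loud assumptions: `difflib.SequenceMatcher` from the Python standard library is used as a stand-in for a fuzzy comparison such as UNCERT, and the agreement and disagreement weights are made-up constants. The record layout and function names are hypothetical as well.

```python
from difflib import SequenceMatcher


def fuzzy_agrees(a, b, threshold=0.8):
    """Stand-in for a fuzzy comparison on longer text fields."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def exact_agrees(a, b):
    """Exact comparison for short or well-standardized fields."""
    return a == b


# field -> (comparison, weight if fields agree, weight if they disagree)
# The weights are invented for this sketch.
FIELD_RULES = {
    "first_name": (fuzzy_agrees, 3.0, -1.0),
    "last_name": (fuzzy_agrees, 4.0, -2.0),
    "street_name": (fuzzy_agrees, 2.0, -1.0),
    "house_number": (exact_agrees, 2.0, -1.0),
    "state_code": (exact_agrees, 1.0, -1.0),
}

MATCH_CUTOFF = 0.0  # start with a cutoff of 0 in the initial run


def composite_weight(rec_a, rec_b, rules=FIELD_RULES):
    """Sum the per-field weights for a candidate pair."""
    total = 0.0
    for field, (compare, agree_w, disagree_w) in rules.items():
        a, b = rec_a.get(field, ""), rec_b.get(field, "")
        if not a or not b:
            continue  # a missing value contributes nothing either way
        total += agree_w if compare(a, b) else disagree_w
    return total


rec_a = {"first_name": "Jonathan", "last_name": "Smith",
         "street_name": "Oak", "house_number": "12", "state_code": "NY"}
rec_b = {"first_name": "Jonathon", "last_name": "Smith",
         "street_name": "Oak", "house_number": "12", "state_code": "NY"}
rec_c = {"first_name": "Maria", "last_name": "Garcia",
         "street_name": "Elm", "house_number": "77", "state_code": "CA"}

for name, other in (("a-b", rec_b), ("a-c", rec_c)):
    weight = composite_weight(rec_a, other)
    print(name, weight, "match" if weight >= MATCH_CUTOFF else "non-match")
```

The misspelled first name still agrees under the fuzzy comparison, which is why the business intelligence fields are compared with a fuzzy method while short codes are compared exactly. After the initial run, the cutoff and parameters would be tuned by reviewing the pairs that fall near the boundary.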


Information Server and Big Data Integration – I

Organizations exploring big data analytics, such as Apache Hadoop for data at rest or streaming technology for data in motion, face many of the same challenges as they do with other analytical environments. These challenges include determining the location of the information sources needed for analysis, how that information can be moved into the analytical environment, how it must be reformatted so that it becomes easier and more efficient to explore, and what data should be persisted to quickly get to the next level of analysis. Several Data Integration tools are helping these organizations solve these challenges. I plan to write a few blogs describing how Information Server helps in this Big Data world.

InfoSphere Information Server includes capabilities that organizations need to integrate the extreme volume, variety and velocity of big data – from new and emerging big data sources. Here are some of these capabilities (not an exhaustive list). I will explain them in more detail as time permits.

Balanced optimization for Hadoop
When a data integration job includes a big data source, InfoSphere Information Server now can push the processing to the data. Using the same common set of InfoSphere DataStage stages and links to build the data integration logic, developers may now choose to run the entire logic, or only portions of that logic, as a MapReduce job that will execute directly on the Hadoop platform. When the sources and targets of the integration task are Hadoop data stores, this approach will yield significant performance gains, as well as savings in network resource consumption.

IBM InfoSphere Streams integration
For big data projects that focus on real-time analytical processing, IBM now offers direct data flow integration between InfoSphere Information Server and InfoSphere Streams to combine the power and reach of both platforms. With this feature, organizations can use standard data integration conventions to gather information from across the enterprise and pass that information to the real-time analytical processes. Similarly, when InfoSphere Streams finds records of insight, that data can now be passed directly to a running data-integration job and made available to data stores or applications across the information landscape, using the full depth and breadth of InfoSphere Information Server connectivity.

Big data job sequencing
InfoSphere Information Server now allows any InfoSphere BigInsights or Cloudera-certified Oozie-contained MapReduce job to be included in the job sequencer. This feature provides end-to-end workflow across heterogeneous topologies executed in both InfoSphere Information Server and Hadoop.


Big-data governance
InfoSphere Information Server also supports big data-related governance features, such as impact analysis and data lineage, on any big data integration points, thus providing enterprises the ability to deliver on the promises of massively scalable analytics, without sacrificing organizational insight into the information infrastructure.

Data Governance VI (Protecting Database)

Data Governance encompasses preventing issues with data, so that the enterprise can become more efficient. Most organizations have formal policies that govern how and when privileged users such as database administrators, help desk members, and outsourced personnel can access database systems. But (there is always a BUT!), organizations do not always have effective mechanisms to monitor, control, and audit the actions of these privileged users. To make matters worse, accountability is difficult to achieve because privileged users often share the credentials used to access database systems.

Monitoring privileged users helps ensure Data Governance in the following ways:

  • Data privacy—Monitoring ensures that only authorized applications and users are viewing sensitive data.
  • Database change control—Monitoring ensures that critical database structures and values are not being changed outside of corporate change control procedures.
  • Protection against external attacks—A successful, targeted attack frequently results in the attacker gaining privileged user access. For example, an outsider in Timbuktu might look like an insider because he has authenticated access, until you look at other identifying information such as the user’s location.

An organization will want to track all database changes to the following:

  • Database structures such as tables, triggers, and stored procedures. For example, the organization will want to detect accidental deletions or insertions in critical tables that affect the quality of business decisions.
  • Critical data values such as data that affects the integrity of financial transactions.
  • Security and access control objects such as users, roles, and permissions. For example, an outsourced contractor might create a new user account with unfettered access to critical databases and then delete the entire account, eliminating all traces of her activity.
  • Database configuration files and other external objects, such as environment/registry variables, configuration files (e.g., NAMES.ORA), shell scripts, OS files, and executables such as Java programs.

IBM InfoSphere Guardium Database Activity Monitor offers a solution that creates a continuous, fine-grained audit trail of all database activities, including the “who,” “what,” “when,” “where,” and “how” of each transaction. This audit trail is continuously analyzed and filtered in real time to identify unauthorized or suspicious activities. To enforce separation of duties, all audit data is stored in a secure, tamper-proof repository external to monitored databases.

IBM InfoSphere Guardium Database Activity Monitor’s solution has a minimal impact on database performance and does not require any changes to databases or applications. IBM InfoSphere Guardium Database Activity Monitor also enables an organization to automate the time-consuming process of tracking all observed database changes and reconciling them with authorized work orders within existing change-ticketing systems, such as BMC Remedy and custom change management applications. For example, a large financial institution set up an automated change-reconciliation process with IBM InfoSphere Guardium Database Activity Monitor.
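The idea behind change reconciliation can be sketched independently of any product. The example below is not Guardium's API or data model. It simply assumes that observed changes and authorized work orders are available as records and flags every change that is not covered by a work order for the same object within its approved time window. All class names, field names and sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class WorkOrder:
    ticket_id: str
    db_object: str
    window_start: datetime
    window_end: datetime


@dataclass(frozen=True)
class ObservedChange:
    user: str
    db_object: str
    statement: str
    at: datetime


def reconcile(changes, work_orders):
    """Split observed changes into (authorized, unauthorized)."""
    authorized, unauthorized = [], []
    for change in changes:
        ticket = next(
            (wo.ticket_id for wo in work_orders
             if wo.db_object == change.db_object
             and wo.window_start <= change.at <= wo.window_end),
            None,
        )
        if ticket is not None:
            authorized.append((change, ticket))
        else:
            unauthorized.append(change)
    return authorized, unauthorized


work_orders = [
    WorkOrder("CHG-101", "PAYMENTS.LEDGER",
              datetime(2024, 1, 10, 22, 0), datetime(2024, 1, 11, 2, 0)),
]
changes = [
    ObservedChange("dba_anna", "PAYMENTS.LEDGER", "ALTER TABLE ...",
                   datetime(2024, 1, 10, 23, 15)),
    ObservedChange("contractor7", "SECURITY.USERS", "CREATE USER ...",
                   datetime(2024, 1, 12, 3, 40)),
]

authorized, unauthorized = reconcile(changes, work_orders)
for change in unauthorized:
    print("needs review:", change.user, change.db_object, change.at)
```

Doing this by hand for every observed change is what makes the process so time-consuming, and it is the kind of matching that an automated reconciliation against a change-ticketing system takes over.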