IBM Vs Informatica (2 of 2)

In my last blog, we compared IBM’s Information Server and Informatica’s PowerCenter based on their scalability. Here is the summary: Big Data and enterprise-class data environments need unlimited data scalability to keep pace with data volume growth. Informatica’s PowerCenter is NOT designed to provide unlimited data scalability, which may lead to investment in expensive workarounds.

In this blog we will touch upon two other important aspects of ETL tools.

Data Governance and Metadata Management

Governance

  • IBM provides a data governance solution (Information Governance Catalog) designed for business users.
    • Information Governance Catalog has deep support for a REST API interface. This makes Information Server more open and ensures compatibility with other enterprise systems. Users can create custom enhancements and loaders, as well as build unique user interfaces for a consistent look and feel.
    • There is superior event-based notification that takes advantage of open source Kafka messaging. For example, an import of metadata is an “event” that can be monitored for workflow and approval purposes, or simply for notification (see the sketch after this list).
    • There is graphical reporting to illustrate relationships, data design origins, and data flow lineage to help answer “what does this mean” and “where did this data come from?”
    • There is an advanced search and navigation or a “shopping” experience for the data.
    • Metadata Asset Manager controls what data goes into the repository. “Import Areas” govern what is imported into the repository (or not) and who is able to import. These imports are initiated via a browser interface; no local Windows installation is required for the metadata administrator.
  • Informatica lacks these capabilities and provides a data governance solution designed for technical users. Its platform lacks openness, and you get locked into an “Informatica only” architecture.
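
To make the event-notification point concrete, here is a minimal sketch of a consumer listening for governance events on the Kafka bus used by Information Server. It is only an illustration: the topic name, broker address, and event fields shown are assumptions (check your installation’s configuration), and it uses the kafka-python client.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Topic name, broker address, and event fields are assumptions for
# illustration only; verify them against your Information Server setup.
consumer = KafkaConsumer(
    "InfosphereEvents",
    bootstrap_servers="is-services-host:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # React to metadata-import events, e.g. to kick off an approval workflow.
    if "IMPORT" in str(event.get("eventType", "")).upper():
        print("Metadata import detected:", event)
```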

Data Quality

  • IBM provides an integrated data integration platform with one processing engine, one user design experience for data integration and data quality, and one shared metadata repository. Information Server gives you the ability to write a DataStage job once and run it anywhere (transactional database, Hadoop or, eventually, Spark).
  • Informatica provides a collection of multiple and incompatible processing engines, user design experiences, and metadata repositories. Informatica Data Quality and Informatica PowerCenter are two different products with different user interfaces. In fact, PowerCenter needs two interfaces to design jobs and manage workflows, and it also uses two engines. This means that Data Quality processes have to be “pushed” or “exported” to PowerCenter to run.
Summary:
In summary, Information Server is the better solution when you want scalable workflows, openness in architecture, and better productivity in designing and running workflows. Information Server supports the power of 1.
  • 1 Engine: The same engine runs stand-alone, in a grid, or natively in Hadoop/YARN. Jobs can remain unchanged regardless of deployment model.
  • 1 Design Experience: Single design experience for Data Integration and Data Quality that increases productivity and reduces error.
  • 1 Repository: A single active metadata repository across the entire portfolio, so design and execution metadata is instantly shared among team members.
Disclaimer: The postings on this site are my own and don’t necessarily represent IBM‘s positions, strategies or opinions

What’s new in IBM InfoSphere Information Server 11.7 – Part 1

IBM® InfoSphere® Information Server V11.7 was released last week. In the next couple of blogs, I wish to share how 11.7 is a major milestone for governance functionality. First let’s look at the changes from a very high level before going closer.

At a Glance
IBM® InfoSphere® Information Server V11.7 accelerates the delivery of trusted and meaningful information to your business with enhanced automation and new design and usage experiences:

  • New Enterprise smart search to discover and view enterprise information
  • Automated data discovery and classification powered by machine learning
  • Policy-and-business-classification-driven data quality evaluation
  • New browser-based cognitive design experience for data engineers
  • New and expanded Hadoop data lake and cloud capabilities and connectivity
  • Single and holistic catalog view of information across the information landscape, enabling users to derive insight through a knowledge graph

Unified Governance

Now let’s get into some details. InfoSphere® Information Server V11.7 introduces the unified governance platform, a fabric that supports data governance objectives throughout the analytics lifecycle. Unified Governance focuses on the following themes and capabilities to construct a data foundation for the enterprise.

  • Auto Discovery and Classification: For data in traditional repositories or in modern Hadoop data lakes, the ability to catalog data accurately and quickly with minimal user intervention is a key requirement for all modern enterprises. Auto Discovery gives the user the ability to point to a data source and ingest metadata from it; Auto Classification is an optional feature, used after discovery, that conducts data profiling and quality analysis.
  • Auto Quality Management: Data quality is a key component of data governance. Automation rules provide a way to associate data quality evaluation with the business classification of data and to automate that evaluation, which helps lower the cost of quality evaluation significantly (see the sketch after this list).
  • Enterprise Search: An enterprise wants to leverage its data, yet a lot of data does not get used simply because there is no good way to find it. The Knowledge Graph is a self-service user experience that provides information with insight to the business user. This allows a CDO to improve the use of data in business decisions with a high level of confidence that it is governed data. Starting with a simple keyword search, a user can leverage the context of the data and use social collaboration to narrow down the data to be used for analytics or business decision making.
  • Customizable User Experience: This release introduces the ability for an enterprise to customize its users’ experience based on roles, and allows each user to tailor the experience to their personal preference.
  • Metadata Integration from StoredIQ into IGC: enables organizations to govern all their information assets (structured and unstructured) in a centralized repository. This is critical to support customers’ needs for GDPR.
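
As a rough illustration of the automation-rule idea (this is not the product’s rule syntax, and the classifications and checks are invented for the example), the sketch below binds a simple quality check to a business classification and scores a column against it:

```python
import re

# Hypothetical binding of business classifications to quality checks.
AUTOMATION_RULES = {
    "Email Address": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v or "") is not None,
    "US Phone": lambda v: re.fullmatch(r"\d{3}-\d{3}-\d{4}", v or "") is not None,
}

def evaluate_column(classification, values):
    """Return the fraction of values that pass the rule bound to a classification."""
    rule = AUTOMATION_RULES.get(classification)
    if rule is None:
        return None  # no rule is bound to this classification
    passed = sum(1 for v in values if rule(v))
    return passed / len(values) if values else 1.0

print(evaluate_column("Email Address", ["a@b.com", "not-an-email"]))  # 0.5
```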


This release introduces key technological innovations as well as open source technology. There has also been a tremendous change in the DataStage Designer, which I will share in an upcoming blog. So stay tuned.

Information Governance – Revisited

It has been more than 5 years since I wrote on Information Governance. Over the last 5 years some areas of Information Governance have become more mature, and I thought of revisiting this topic. In a simple analogy, what a library does for books, data governance does for data. It organizes data, makes it simple to access, provides means to check the validity and accuracy of data, and makes it understandable to all who need it. With Information Governance in place, organizations can use data to generate insights and are also equipped for regulatory mandates (like GDPR).

There are six sets of capabilities that make up the Information Management & Governance component:

  1. Data Lifecycle Management is a discipline that applies not only to analytical data but also to operational, master and reference data within the enterprise.  It involves defining and implementing policies on the creation, storage, transmission, usage and eventual disposal of data, in order to ensure that it is handled in such a way as to comply with business requirements and regulatory mandates.

2. MDM: Master and Entity Data acts as the ‘single source of the truth’ for entities – customers, suppliers, employees, contracts etc.  Such data is typically stored outside the analytics environment in a Master Data Management (MDM) system, and the analytics environment then accesses the MDM system when performing tasks such as data integration.

3. Reference Data is similar in concept to Master and Entity Data, but pertains to common data elements such as location codes, currency exchange rates etc., which are used by multiple groups or lines of business within the enterprise.  Like Master and Entity Data, Reference data is typically leveraged by operational as well as analytical systems.  It is therefore typically stored outside the analytics environment and accessed when required for data integration or analysis.

4. Data Catalog is a repository that contains metadata relating to the data stored in the Analytical Data Lake Storage repositories.  The catalog maintains the location, meaning and lineage of data elements, the relationships between them, and the policies and rules relating to their security and management.  The catalog is critical for enabling effective information governance, and to support self-service access to data for exploration and analysis.

5. Data Models provide a consistent representation of data elements and their relationships across the enterprise.  An effective Enterprise Data Model facilitates consistent representation of entities and relationships, simplifying management of and access to data.

6. Data Quality Rules describe the quality requirements for each data set within the Analytical Data Lake Storage component, and provide measures of data quality that can be used by potential consumers of data to determine whether a data set is suitable for a particular purpose.  For example, data sets obtained from social media sources are often sparse and therefore ‘low quality’, but that does not necessarily disqualify a data set from being used.  Provided a user of the data knows about its quality, they can use that knowledge to determine what kinds of algorithms can best be applied to that data.  A minimal sketch of a catalog record that carries lineage and a quality measure follows.
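
To tie the catalog and quality-rule ideas together, here is a minimal, purely illustrative sketch of what a single catalog record might carry. The field names are assumptions for the example, not the schema of any particular product:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    """Illustrative catalog record: location, meaning, lineage and quality."""
    name: str                                         # technical name of the data set
    business_term: str                                # glossary term it maps to
    location: str                                     # where the data physically lives
    lineage: List[str] = field(default_factory=list)  # upstream sources it was derived from
    quality_score: float = 1.0                        # measure published by quality rules

customers = CatalogEntry(
    name="DL_CUSTOMER_MASTER",
    business_term="Customer",
    location="hdfs://lake/curated/customers",
    lineage=["CRM.CUSTOMERS", "WEB.SIGNUPS"],
    quality_score=0.97,
)
print(customers)
```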

 

Need for Governance in Self-Service Analytics

Analytics Offering without Self-Service

Self-Service Analytics is a form of business intelligence in which line-of-business professionals or data scientists are enabled and encouraged to perform queries and generate reports on their own, with nominal IT support. This empowers everyone in the organization to discover new insights and enables informed decision-making. Capitalizing on the data lake, or modernized data warehouse, they can do full data set analysis (no more sampling), gain insight from non-relational data, and support individuals in their desire for exploratory analysis and discovery with a 360° view of their business. At this stage, the organization can truly be data-savvy and insight-driven, leading to better decisions, more effective actions, and improved outcomes. Insight is used to make risk-aware decisions, fight fraud and counter threats, optimize operations or, most often, attract, grow and retain customers.

Self-service analytics, regardless of persona, has to involve data governance. Here are three examples of how any serious analytics work would be impossible without support for a proper data governance practice in the analytics technology:

  1. Fine-grained authorization controls: Most industries feature data sets where data access needs to be controlled so that sensitive data is protected. As data moves from one store to another, gets transformed, and aggregated, the authorization information needs to move with that data. Without the transfer of authorization controls as data changes state or location, self-service analytics would not be permitted under the applicable regulatory policies.
  2. Data lineage information: As data moves between different data storage layers and changes state, it’s important for the lineage of the data to be captured and stored. This helps analysts understand what goes into their analytic results, but it is also often a policy requirement for many regulatory frameworks. An example of where this is important is the right to be forgotten, a legislative initiative we are seeing in some Western countries. With this, any trace of information about a citizen would have to be tracked down and deleted from all of an organization’s data stores. Without a comprehensive data lineage framework, adherence to a right to be forgotten policy would be impossible (a toy lineage walk is sketched after this list).
  3. Business glossary: A current and complete business glossary acts as a roadmap for analysts to understand the nature of an organization’s data. Specifically, a business glossary maps an organization’s business concepts to the data schemas. One common problem with Hadoop data lakes is a lack of business glossary information as Hadoop has no proper set of metadata and governance tooling.
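
As a toy illustration of the lineage point in item 2, the sketch below records source-to-target movements as data flows between stores and then walks them to find every store that received data derived from a given source, which is exactly the traversal a right-to-be-forgotten request needs. The data set names are made up:

```python
# Toy lineage capture: one (source, target) edge per data movement.
lineage_edges = [
    ("crm.customers", "lake.customers_raw"),
    ("lake.customers_raw", "warehouse.customer_dim"),
    ("warehouse.customer_dim", "mart.customer_360"),
]

def downstream_copies(source, edges):
    """Return every data set that received data derived from `source`."""
    targets = {}
    for src, dst in edges:
        targets.setdefault(src, []).append(dst)
    seen, stack = set(), [source]
    while stack:
        node = stack.pop()
        for nxt in targets.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Which stores must be checked when a customer asks to be forgotten?
print(downstream_copies("crm.customers", lineage_edges))
```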

Summary:
A core design point of any self-service analytics offering (like IBM DataWorks) is that data governance capabilities should be baked in. This enables self-service data analysis where analysts only see data they’re entitled to see, where data movement and transformation are automatically tracked for a complete lineage story, and where business glossary information is used as users search for data.

The 4 Personas for Data Analytics

With new modernization strategies, data analytics is architected from the top down, or through the lens of the consumers of the data. In this blog, I will describe the four roles that are integral to the data lifecycle. These are the personas who interact with data while uncovering and deploying insights as they explore organizational data.

Citizen analysts/knowledge workers

A knowledge worker is primarily a subject-matter expert (SME) in a specific area of business—for example, a business analyst focused on risk or fraud, a marketing analyst aiming to build out new offers, or someone who works to drive efficiencies into the supply chain. These users do not know where or how data is stored, or how to build an ETL flow or a machine learning algorithm. They simply want to access information on demand, driving analysis from their base of expertise, and create visualizations. They are the users of offerings like Watson Analytics.

Data scientists

Data scientists can do more sophisticated analysis, find the root cause of a problem, and develop a solution based on the insights they discover. They can use SPSS, SAS, or open source tools with built-in data shaping and point-and-click machine learning to manipulate large amounts of data.

Data engineers

They focus on enabling data integration, connections (the plumbing) and data quality. They do the underlying enablement that data scientists and citizen analysts depend on. They typically depend on solutions like DataWorks Forge to access multiple data sources and to transform them within a fully managed service.

Application developers

Application developers are responsible for making analytics algorithms actionable within a business process, generally supported by a production system. Beginning with the analytics algorithms built by citizen analysts or data scientists, they work with the final data model representation created by data engineers, building an application that ties into the overall business process. They use something like Bluemix development platform and APIs for the individual data and analytics services.

Putting it all together

Imagine a scenario where a citizen analyst notices (from a dashboard) that retail sales are down for the quarter. She pulls up Watson Analytics and uses it to discover that the underlying problem is specific to a category of goods and services in stores in a specific region. But she needs more help to find the exact cause and a remedy.

She engages her data scientist and data engineer. They discuss the need to pull in more data than just the transactional data the business analyst already has access to, specifically weather, social, and IoT data from the stores. The data engineer helps create the necessary access, and the data scientist can then form and test various hypotheses using different analytic models.

Once the data scientist determines the root cause, he shares the model with the developer, who can then leverage it to make the company’s mobile apps and websites more responsive in real time to address the issue. The citizen analyst also shares the insight with the marketing department so they can take corrective action.


IBM Stewardship Center

Need for IBM Stewardship Center in Data Curation: 

Managing data quality requires the joint effort of business and IT. Business defines the information policies that govern data quality for an organization. Based on these policies, the IT team implements rules so that any deviations in data quality can be reported for business to review. For example, suppose a bank’s policy is that an account holder’s age must be greater than 18. During data load, an ETL tool can run profiling on the data to check how many records violate this rule. These records then need to be shared with the business (non-technical domain experts called data stewards), who can take appropriate action to fix the issue; a minimal sketch of such a check follows the list below. As data stewards become increasingly responsible for improving the value of their data assets, they need capabilities to help them manage new requirements like:

  • Collaborating across multiple lines of business to build information policies that support regulatory requirements
  • Assessing the cost of poor data quality and managing such data quality issues to closure
  • Engaging subject matter experts through business processes to review and approve corporate glossary changes
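
Returning to the bank example above, here is a minimal sketch, in plain Python rather than any product tooling, of the kind of check IT might implement and the exception records it would hand over to data stewards. Field names, the rule name, and the statuses are invented for illustration:

```python
from datetime import datetime, timezone

def find_exceptions(records, min_age=18):
    """Flag account holders who violate the 'age must be greater than 18' policy."""
    exceptions = []
    for rec in records:
        age = rec.get("age")
        if age is None or age <= min_age:
            exceptions.append({
                "record_id": rec.get("account_id"),
                "rule": "ACCOUNT_HOLDER_MIN_AGE",              # invented rule name
                "value": age,
                "detected_at": datetime.now(timezone.utc).isoformat(),
                "status": "OPEN",                              # a steward moves this to RESOLVED
            })
    return exceptions

batch = [{"account_id": 1, "age": 25}, {"account_id": 2, "age": 16}]
print(find_exceptions(batch))  # only account 2 is flagged for review
```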


IBM Stewardship Center is a powerful browser-based interface that helps bridge the gap between business and IT, providing a central location for users to collaborate on data governance and manage data quality issues. Stewardship Center is built on an open event management infrastructure, which makes it possible to integrate Information Server-based stewardship seamlessly into your existing stewardship solutions and collaboration environments.

IBM Stewardship Center leverages the strengths of IBM® Business Process Manager to offer solutions to these challenges that can be used immediately or can be customized or extended to suit the specific needs of your organization. The capabilities that are provided with IBM Stewardship Center are divided into three categories: data quality exception management, governance and workflow notification, and performance metrics.
IBM Stewardship Center includes these components:

  • The Data Quality Exception sample process application, a sample workflow for resolving data quality issues that can be customized or extended.
  • The Stewardship Center Application Toolkit, which can be used to extend the capabilities of the Data Quality Exception sample process application or to create your own custom workflows.
  • The Email Notification process application, which can be used to notify users by email when governance events are generated in Information Governance Catalog.
  • The Governance Rule Approval process application, which can be used to manage approvals for information governance rules from Information Governance Catalog.

For more information, see Overview of IBM Stewardship Center.
For a video see Tech Talk: Stewardship Center.