Need for Governance in Self-Service Analytics

[Figure: Analytics Offering without Self-Service]

Self-service analytics is a form of business intelligence in which line-of-business professionals and data scientists are enabled and encouraged to run queries and generate reports on their own, with minimal IT support. This empowers everyone in the organization to discover new insights and make informed decisions. Capitalizing on the data lake, or a modernized data warehouse, they can analyze full data sets (no more sampling), gain insight from non-relational data, and pursue exploratory analysis and discovery with a 360° view of the business. At this stage, the organization can be truly data-savvy and insight-driven, leading to better decisions, more effective actions, and improved outcomes. That insight is used to make risk-aware decisions, fight fraud and counter threats, optimize operations, or, most often, attract, grow, and retain customers.

Self-service analytics, regardless of persona, has to involve data governance. Here are three examples of how serious analytics work would be impossible without support for a proper data governance practice in the analytics technology:

  1. Fine-grained authorization controls: Most industries feature data sets where access must be controlled so that sensitive data is protected. As data moves from one store to another and is transformed and aggregated, the authorization information needs to move with it. Without the transfer of authorization controls as data changes state or location, self-service analytics would not be permitted under the applicable regulatory policies.
  2. Data lineage information: As data moves between different data storage layers and changes state, it's important for the lineage of the data to be captured and stored. This helps analysts understand what goes into their analytic results, and it is also often a requirement under many regulatory frameworks. An example of where this matters is the right to be forgotten, a legislative initiative appearing in some Western countries: any trace of information about a citizen must be tracked down and deleted from all of an organization's data stores. Without a comprehensive data lineage framework, adherence to a right-to-be-forgotten policy would be impossible.
  3. Business glossary: A current and complete business glossary acts as a roadmap for analysts to understand the nature of an organization's data. Specifically, a business glossary maps an organization's business concepts to its data schemas. One common problem with Hadoop data lakes is a lack of business glossary information, because Hadoop itself provides no proper metadata and governance tooling.
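To make the lineage point above concrete, here is a minimal sketch in Python. The classes and names are hypothetical (not any IBM API): every derived dataset keeps a reference back to its parents, so a right-to-be-forgotten request can locate every copy that descends from a given source.

```python
# Hypothetical lineage-tracking sketch: each derived dataset records its
# parents, so all datasets touching a given source can be enumerated later.

class Dataset:
    def __init__(self, name, records, parents=()):
        self.name = name
        self.records = records
        self.parents = list(parents)   # lineage: where this data came from

    def transform(self, name, fn):
        # A derived dataset keeps a reference back to its source.
        return Dataset(name, [fn(r) for r in self.records], parents=[self])

def lineage_closure(dataset):
    """Names of all datasets upstream of `dataset`, including itself."""
    seen, stack = [], [dataset]
    while stack:
        d = stack.pop()
        if d not in seen:
            seen.append(d)
            stack.extend(d.parents)
    return [d.name for d in seen]

raw = Dataset("customers_raw", [{"id": 1, "spend": 120.4}])
clean = raw.transform("customers_clean", lambda r: {**r, "spend": round(r["spend"])})
report = clean.transform("spend_report", lambda r: r)

print(lineage_closure(report))  # ['spend_report', 'customers_clean', 'customers_raw']
```

A real governance catalog would persist this graph as metadata alongside access-control and glossary information, but the walk from any result back to its sources is the essential mechanism.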

A core design point of any self-service analytics offering (such as IBM DataWorks) is that data governance capabilities should be baked in. This enables self-service data analysis where analysts see only the data they're entitled to see, where data movement and transformation are automatically tracked for a complete lineage story, and where business glossary information powers data search.

The 4 Personas for Data Analytics

With new modernization strategies, data analytics is architected from the top down, through the lens of the consumers of the data. In this blog, I will describe the four roles that are integral to the data lifecycle: the personas who interact with data while uncovering and deploying insights from organizational data.

Citizen analysts/knowledge workers

A knowledge worker is primarily a subject-matter expert (SME) in a specific area of business—for example, a business analyst focused on risk or fraud, a marketing analyst aiming to build out new offers, or someone who works to drive efficiencies into the supply chain. These users do not know where or how data is stored, or how to build an ETL flow or a machine learning algorithm. They simply want to access information on demand, drive analysis from their base of expertise, and create visualizations. They are the users of offerings like Watson Analytics.

Data scientists

Data scientists can do more sophisticated analysis, find the root cause of a problem, and develop a solution based on the insights they discover. They can use SPSS, SAS, or open source tools with built-in data shaping and point-and-click machine learning to manipulate large amounts of data.

Data engineers

Data engineers enable data integration, connections (the plumbing), and data quality. They do the underlying enablement that data scientists and citizen analysts depend on. They typically rely on solutions like DataWorks Forge to access multiple data sources and transform them within a fully managed service.

Application developers

Application developers are responsible for making analytics algorithms actionable within a business process, generally supported by a production system. Beginning with the analytics algorithms built by citizen analysts or data scientists, they work with the final data model representation created by data engineers, building an application that ties into the overall business process. They use a development platform like Bluemix and APIs for the individual data and analytics services.

Putting it all together

Imagine a scenario where a citizen analyst notices (from a dashboard) that retail sales are down for the quarter. She pulls up Watson Analytics and discovers that the underlying problem is specific to a category of goods and services in stores in a specific region. But she needs more help to find the exact cause and a remedy.

She engages her data scientist and data engineer. They discuss the need to pull in more data than just the transactional data the business analyst already has access to: specifically weather, social, and IoT data from the stores. The data engineer creates the necessary access; the data scientist can then form and test various hypotheses using different analytic models.

Once the data scientist determines the root cause, he shares the model with the developer, who can leverage it to make the company's mobile apps and websites more responsive in real time to address the issue. The citizen analyst also shares the insight with the marketing department so they can take corrective action.


IBM Stewardship Center

Need for IBM Stewardship Center in Data Curation: 

Managing data quality requires a joint effort between business and IT. The business defines the information policies that govern data quality for an organization. Based on these policies, the IT team implements rules so that any deviations in data quality can be reported for the business to review. For example, suppose a bank's policy is that an account holder's age must be greater than 18. During data load, an ETL tool can profile the data to check how many records violate this rule. These records then need to be shared with the business (non-technical domain experts called data stewards), who can take appropriate action to fix the issue. As data stewards become increasingly responsible for improving the value of their data assets, they need capabilities to help them manage new requirements like:

  • Collaborating across multiple lines of business to build information policies that support regulatory requirements
  • Assessing the cost of poor data quality and managing such data quality issues to closure
  • Engaging subject matter experts through business processes to review and approve corporate glossary changes
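The age-check example above can be sketched in a few lines of Python. This is an illustrative profiling step, not how any IBM tool actually implements it: records violating a business rule are flagged and queued for a data steward rather than silently dropped.

```python
# Hypothetical sketch of rule-based data profiling during an ETL load:
# flag records that violate a business rule (account holder older than 18)
# and collect the exceptions for steward review.

def profile(records, rule, rule_name):
    exceptions = [r for r in records if not rule(r)]
    return {
        "rule": rule_name,
        "checked": len(records),
        "violations": len(exceptions),
        "records": exceptions,   # handed to data stewards, not discarded
    }

accounts = [
    {"id": "A1", "age": 34},
    {"id": "A2", "age": 16},   # violates the policy
    {"id": "A3", "age": 19},
]

report = profile(accounts, lambda r: r["age"] > 18, "age_over_18")
print(report["violations"])                  # 1
print([r["id"] for r in report["records"]])  # ['A2']
```

In a real deployment the exception report would flow into a stewardship workflow (such as the Data Quality Exception process application described below) instead of being printed.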


IBM Stewardship Center is a browser-based interface that helps bridge the gap between business and IT, providing a central location for users to collaborate on data governance and manage data quality issues. Stewardship Center is built on an open event management infrastructure, which makes it possible to integrate Information Server-based stewardship seamlessly into your existing stewardship solutions and collaboration environments.

IBM Stewardship Center leverages the strengths of IBM® Business Process Manager to offer solutions to these challenges that can be used immediately or can be customized or extended to suit the specific needs of your organization. The capabilities that are provided with IBM Stewardship Center are divided into three categories: data quality exception management, governance and workflow notification, and performance metrics.
IBM Stewardship Center includes these components:

  • The Data Quality Exception sample process application, a sample workflow for resolving data quality issues that can be customized or extended.
  • The Stewardship Center Application Toolkit, which can be used to extend the capabilities of the Data Quality Exception sample process application or to create your own custom workflows.
  • The Email Notification process application, which can be used to notify users by email when governance events are generated in Information Governance Catalog.
  • The Governance Rule Approval process application, which can be used to manage approvals for information governance rules from Information Governance Catalog.

For more information, see Overview of IBM Stewardship Center.
For a video see Tech Talk: Stewardship Center.

Data Reservoir – Need for Information Governance

As we know, a data reservoir contains data from many different sources to ease data discovery, data analytics, and ad hoc investigations. Let's delve a little deeper to see the need for information governance in a data reservoir solution.

[Figure: Information Governance]

Let's start with the use case of a bank that plans to implement a data reservoir. It has the traditional sources of data: how much a person earns, what they spend their money on, where they live, even where they travel or eat. Customers may share similar information on social media sites as well. However, people who willingly share their information on a social media site know that this data will become more or less public. When people share their data with their bank, they trust that the bank will use it responsibly, for the purposes for which it was shared, and this responsibility goes further than just abiding by the law.

Take a customer's payment transactions as an example. Many customers would be unhappy if they felt the bank was monitoring how they spent their money, yet they would probably also expect the bank to detect fraudulent use of their debit card. Both use cases involve the same data, but the first seems to pry into a person's privacy while the second is an aspect of fraud prevention. The difference between the cases is the purpose of the analytics. So as the bank makes data more widely available to its employees for analytics, it must monitor both access to the data and the types of analytics it is used for. It does that through information governance and security capabilities. Let's delve into it more.
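This "same data, different purpose" distinction can be sketched as purpose-based access control. The policy table and function names below are hypothetical, not drawn from any IBM product: access is granted per (dataset, purpose) pair, and every decision is written to an audit log.

```python
# Illustrative purpose-based access control: the same payment data is
# reachable for fraud detection but denied for ad-hoc spending profiling,
# and every decision is recorded for later audit.

ALLOWED_PURPOSES = {
    "payment_transactions": {"fraud_detection", "regulatory_reporting"},
}
audit_log = []

def access(user, dataset, purpose):
    allowed = purpose in ALLOWED_PURPOSES.get(dataset, set())
    audit_log.append({"user": user, "dataset": dataset,
                      "purpose": purpose, "allowed": allowed})
    return allowed

assert access("analyst_1", "payment_transactions", "fraud_detection")
assert not access("analyst_1", "payment_transactions", "spending_profiling")
```

The audit log is the piece that lets the bank detect, after the fact, employees requesting data for purposes outside their role.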

No data can enter the data reservoir without first being described in the data reservoir’s catalog. The data owner classifies their information sources that will feed the data reservoir to determine how the data reservoir should manage the data, including access control, quality control, masking of sensitive data, and data retention periods.

The classification assigned to data drives different management actions on that data in the data reservoir. For example, when data is classified as highly sensitive, the data reservoir can enforce masking of the data on ingestion. Less sensitive data that is nevertheless personal to the bank's customers may be stored in secured repositories in the data reservoir so it can be used for production analytics. However, when it is copied into sandboxes for analytical discovery, it is masked to prevent data scientists from seeing the values, without losing the referential integrity of the data. Behind the scenes, the data reservoir audits access to data to detect whether employees are accessing more data than is reasonable for their role. Thus the data reservoir opens access to the bank's data, but only for legitimate and approved purposes.
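Masking that preserves referential integrity is worth a small example. One common approach (assumed here for illustration, not necessarily how any IBM product does it) is a keyed hash: the same customer value always maps to the same token, so joins across sandbox tables still line up, while the real value stays hidden.

```python
# Sketch of deterministic masking on ingestion: a keyed hash (HMAC-SHA256)
# replaces a sensitive value with a stable token. Same input -> same token,
# so referential integrity across masked tables is preserved.

import hashlib
import hmac

SECRET_KEY = b"reservoir-masking-key"  # would come from a key vault in practice

def mask(value: str) -> str:
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return "cust_" + digest[:12]

# The same input always yields the same token (joins still work)...
assert mask("alice@example.com") == mask("alice@example.com")
# ...while different inputs yield different tokens.
assert mask("alice@example.com") != mask("bob@example.com")
```

Because the hash is keyed, a data scientist in a sandbox cannot reverse the tokens without the secret, yet can still join customer records across masked data sets.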


A dip into ‘Data Reservoir’

In the previous blog, we discussed in great detail the limitations of a data lake and how, without proper governance, a data lake can become overwhelming and unsafe to use. Hence emerged an enhanced data lake solution known as a data reservoir. So how does a data reservoir assist the enterprise?

  • A data reservoir provides the right information to people so they can perform activities like the following:
    – Investigate and understand a particular situation or type of activity.
    – Build analytical models of the activity.
    – Assess the success of an analytic solution in production in order to improve it.
  • A data reservoir provides credible information to subject matter experts (such as data analysts, data scientists, and business teams) so they can perform analysis activities such as investigating and understanding a particular situation, event, or activity.
  • A data reservoir has capabilities that ensure the data is properly cataloged and protected so subject matter experts can confidently access the data they need for their work and analysis.
  • The creation and maintenance of the data reservoir is accomplished with little to no assistance or additional effort from the IT teams.

Design of a Data Reservoir:
This design point is critical because subject matter experts play a crucial role in ensuring that analytics provides worthwhile and valuable insights at appropriate points in the organization’s operation. With a data reservoir, line-of-business teams can take advantage of the data in the data reservoir to make decisions with confidence.

[Figure 1: Data Reservoir]

There are three main parts to a data reservoir, described as follows:

  • The data reservoir repositories (Figure 1, item 1) provide platforms both for storing data and running analytics as close to the data as possible.
  • The data reservoir services (Figure 1, item 2) provide the ability to locate, access, prepare, transform, process, and move data in and out of the data reservoir repositories.
  • The information management and governance fabric (Figure 1 item 3) provides the engines and libraries to govern and manage the data in the data reservoir. This set of capabilities includes validating and enhancing the quality of the data, protecting the data from misuse, and ensuring it is refreshed, retained, and eventually removed at appropriate points in its lifecycle.

The data reservoir is designed to offer simple and flexible access to data because people are key to making analytics successful. For more information please read Governing and Managing Big Data for Analytics and Decision Makers.

Data Governance: And the winner is…

When an organization runs a strong information governance program, it helps ensure that information used for critical decisions is trusted, particularly when it comes from such a central hub as the information warehouse. The information must come from an authoritative source and be known to be complete, timely, and relevant to the people and systems involved in making the decision. It must be managed by an information steward who can communicate its purpose, usage, and quality to others. Through communication of information governance policy and rules, business terms, and their relationship to the information assets, the information can be clearly understood across the organization.

I was going through The Forrester Wave™: Data Governance Tools, Q2 2014. IBM has been named a Leader and earned the highest scores for both strategy and market presence.

IBM was adjudged the Leader based on the evaluation of the following five domains of data governance management:

  1. quality
  2. reference
  3. life-cycle management
  4. security/privacy
  5. metadata

These are the products in IBM's information governance story (with links to my previous blogs on these topics):



Need For Data Virtualization

We are seeing massive upswings in the volume, variety, and velocity of data. Businesses now want to use data that was difficult to process in the past for decision making, exploratory analysis, advanced analytics, and new applications (including mobile). Added to that, the information landscape has become more complex, so the ability to get simple access to data is more relevant than ever. Data virtualization technologies can help create a single access point to the pool of data you need to support your business.

Here are some of the reasons why organizations may benefit from data virtualization.

Simplified access to data:

  • Data virtualization focuses on simplifying access to data by isolating the details of storage and retrieval and making the process transparent to data consumers.
  • The end user will never know—and doesn’t need to know—the details about the data sources. This means changes to source systems will not affect downstream applications. The burden on IT is reduced since the solution does not have to be reconfigured each time a change is made.
  • An organization can create a virtual view across data in the warehouse and data that exists within Hadoop to create a new logical data warehouse. Because end users interact with only one endpoint and use the same access methods they are accustomed to, they can easily take advantage of these new insights from the big data platform without learning new languages and without administrators having to maintain user privileges in multiple locations.
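The "virtual view across the warehouse and Hadoop" idea above can be sketched with a tiny federation facade. The classes here are hypothetical stand-ins for a real virtualization engine: consumers query one endpoint, and the view fans out to whichever physical stores hold the rows.

```python
# Illustrative sketch of a virtual view: one query endpoint that unions
# rows from multiple physical sources, hiding where the data actually lives.

class Source:
    """A physical data store (e.g., a warehouse table or a Hadoop data set)."""
    def __init__(self, name, rows):
        self.name, self.rows = name, rows

    def scan(self):
        return self.rows

class VirtualView:
    """A single access point over many sources; consumers never see them."""
    def __init__(self, *sources):
        self.sources = sources

    def query(self, predicate=lambda row: True):
        return [row for s in self.sources for row in s.scan() if predicate(row)]

warehouse = Source("warehouse", [{"sku": "A", "sales": 100}])
hadoop = Source("hadoop", [{"sku": "B", "sales": 40}])

sales = VirtualView(warehouse, hadoop)
print(sales.query(lambda r: r["sales"] > 50))  # [{'sku': 'A', 'sales': 100}]
```

The point of the facade is the one illustrated in the bullets above: if the Hadoop data set moves or the warehouse schema is re-hosted, only the `Source` wiring changes; every downstream consumer of `sales.query` is untouched.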

Reduced Costs:

  • Multiple information access points also increase the total cost of ownership, since each access point requires administration, security, and management resources.
  • Reliance on older technologies to get to data eventually requires organizations to purchase more hardware, more databases and more software with more capabilities—all of which require budget.
  • Because data virtualization involves virtual consolidation of data rather than physical movement, there is no need to create new databases or purchase additional hardware to store the consolidated data.

Faster Time to Value:

  • Building virtual data stores is much quicker than creating physical ones, since data does not have to be physically moved.
  • Data virtualization reduces the time required to take advantage of disparate data, which makes it easier for users and processes to get the information they need in a timely manner.