Need for Governance in Self-Service Analytics

Figure: Analytics Offering without Self-Service

Self-service analytics is a form of business intelligence in which line-of-business professionals and data scientists are enabled and encouraged to perform queries and generate reports on their own, with minimal IT support. This empowers everyone in the organization to discover new insights and apply them to informed decision-making. Capitalizing on the data lake, or modernized data warehouse, they can analyze full data sets (no more sampling), gain insight from non-relational data, and pursue exploratory analysis and discovery with a 360° view of the business. At this stage, the organization can become truly data-savvy and insight-driven, leading to better decisions, more effective actions, and improved outcomes. Insight is used to make risk-aware decisions, fight fraud and counter threats, optimize operations, or, most often, attract, grow, and retain customers.

Any self-service analytics offering, regardless of the persona it serves, has to involve data governance. Here are three examples of how serious analytics work would be impossible without support for a proper data governance practice in the analytics technology:

  1. Fine-grained authorization controls: Most industries feature data sets where data access needs to be controlled so that sensitive data is protected. As data moves from one store to another, gets transformed, and aggregated, the authorization information needs to move with that data. Without the transfer of authorization controls as data changes state or location, self-service analytics would not be permitted under the applicable regulatory policies.
  2. Data lineage information: As data moves between different data storage layers and changes state, it’s important for the lineage of the data to be captured and stored. This helps analysts understand what goes into their analytic results, but it is also often a policy requirement for many regulatory frameworks. An example of where this is important is the right to be forgotten, which is a legislative initiative we are seeing in some Western countries. With this, any trace of information about a citizen would have to be tracked down and deleted from all of an organization’s data stores. Without a comprehensive data lineage framework, adherence to a right to be forgotten policy would be impossible.
  3. Business glossary: A current and complete business glossary acts as a roadmap for analysts to understand the nature of an organization’s data. Specifically, a business glossary maps an organization’s business concepts to its data schemas. One common problem with Hadoop data lakes is a lack of business glossary information, because Hadoop has no native metadata and governance tooling.
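To make the lineage requirement concrete, here is a minimal sketch of how lineage and authorization metadata might travel with a data set through each transformation. The class and method names (`GovernedDataset`, `transform`) are hypothetical, invented for illustration only:

```python
import datetime

class GovernedDataset:
    """Toy dataset wrapper that carries lineage and access metadata
    through each transformation (hypothetical names, for illustration)."""

    def __init__(self, name, rows, source=None, allowed_roles=None):
        self.name = name
        self.rows = rows
        self.allowed_roles = set(allowed_roles or [])
        # Lineage: a list of (timestamp, operation, source-dataset) records,
        # inherited from the source dataset and extended on every step.
        self.lineage = list(source.lineage) if source else []
        self.lineage.append((
            datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "create" if source is None else "derive",
            source.name if source else None,
        ))

    def transform(self, new_name, fn):
        # A derived dataset inherits both lineage and authorization controls,
        # so access policy moves with the data as it changes state.
        return GovernedDataset(new_name, [fn(r) for r in self.rows],
                               source=self, allowed_roles=self.allowed_roles)

raw = GovernedDataset("sales_raw", [{"amt": 10}, {"amt": 20}],
                      allowed_roles={"analyst"})
agg = raw.transform("sales_doubled", lambda r: {"amt": r["amt"] * 2})
print([op for _, op, _ in agg.lineage])  # ['create', 'derive']
```

With this pattern, a right-to-be-forgotten request can walk the lineage records back to every derived copy, and the entitlement check is the same no matter where the data has moved.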

A core design point of any self-service analytics offering (such as IBM DataWorks) is that data governance capabilities should be baked in. This enables self-service data analysis in which analysts see only the data they are entitled to see, data movement and transformation is automatically tracked for a complete lineage story, and business glossary information informs data search.
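The search behavior described above, where glossary terms drive discovery but entitlements filter what is returned, can be sketched in a few lines. The catalog entries and role names here are invented for illustration, not taken from any product:

```python
# Hypothetical catalog: each entry maps a business-glossary term to a
# physical data asset and the roles entitled to see it.
CATALOG = [
    {"term": "customer churn", "asset": "crm.churn_scores",
     "roles": {"analyst", "data_scientist"}},
    {"term": "payroll", "asset": "hr.salaries", "roles": {"hr_admin"}},
]

def search(query, user_roles):
    """Return only assets whose glossary term matches the query AND
    that the user's roles entitle them to see."""
    return [entry["asset"] for entry in CATALOG
            if query in entry["term"] and entry["roles"] & set(user_roles)]

print(search("churn", ["analyst"]))    # ['crm.churn_scores']
print(search("payroll", ["analyst"]))  # [] -- the asset exists but is hidden
```

The point of the sketch is that the entitlement check happens inside the search itself: an unauthorized analyst never even learns the asset exists.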


A dip into ‘Data Reservoir’

In the previous blog, we discussed in great detail the limitations of a data lake and how, without proper governance, a data lake can become overwhelming and unsafe to use. Hence an enhanced data lake solution emerged, known as a data reservoir. So how does a data reservoir assist the enterprise?

  • A data reservoir provides the right information to people so they can perform activities like the following:
    – Investigate and understand a particular situation or type of activity.
    – Build analytical models of the activity.
    – Assess the success of an analytic solution in production in order to improve it.
  • A data reservoir provides credible information to subject matter experts (such as data analysts, data scientists, and business teams) so they can perform analysis activities such as investigating and understanding a particular situation, event, or activity.
  • A data reservoir has capabilities that ensure the data is properly cataloged and protected so subject matter experts can confidently access the data they need for their work and analysis.
  • The creation and maintenance of the data reservoir is accomplished with little to no assistance or additional effort from the IT teams.

Design of a Data Reservoir:
This design point is critical because subject matter experts play a crucial role in ensuring that analytics provides worthwhile and valuable insights at appropriate points in the organization’s operation. With a data reservoir, line-of-business teams can take advantage of the data in the data reservoir to make decisions with confidence.

There are three main parts to a data reservoir:

  • The data reservoir repositories (Figure 1, item 1) provide platforms both for storing data and running analytics as close to the data as possible.
  • The data reservoir services (Figure 1, item 2) provide the ability to locate, access, prepare, transform, process, and move data in and out of the data reservoir repositories.
  • The information management and governance fabric (Figure 1, item 3) provides the engines and libraries to govern and manage the data in the data reservoir. This set of capabilities includes validating and enhancing the quality of the data, protecting the data from misuse, and ensuring it is refreshed, retained, and eventually removed at appropriate points in its lifecycle.

The data reservoir is designed to offer simple and flexible access to data, because people are key to making analytics successful. For more information, please read Governing and Managing Big Data for Analytics and Decision Makers.

Challenges of Data Lake paving way for Data Reservoir

In my previous blogs I discussed the data lake. Imagine you have pooled your enterprise’s entire data into a data lake; there will be challenges. All this raw data will be overwhelming and unsafe to use because no one is sure where the data came from, how reliable it is, or how it should be protected. Without proper management and governance, such a data lake can quickly become a data swamp, causing frustration for business users, application developers, IT, and even customers.

So there is a need for a facility that transforms raw data into information that is clean, timely, useful, and relevant. Hence an enhanced data lake solution was built with management, affordability, and governance at its core. This solution is known as a data reservoir. In one of the subsequent blogs we will take a dip into the data reservoir. Stay tuned!

Data Lake vs. Data Warehouse

In my last blog, I wrote about the data lake. The first comment on the blog asked about the difference between a data lake and a data warehouse. So in this blog, I will share some of my understanding of their differences:

Schema: In a data warehouse (DW), the schema is defined before data is stored. This is called “schema on write”: the required data is identified and modeled in advance. In a data lake, the schema is defined after the data is stored. This is called “schema on read”: the interpretation of the data must be captured in code by each program that accesses it.
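The contrast between the two approaches can be shown side by side. In this sketch (table and field names are invented for illustration), the warehouse table must be modeled before any row lands, while the lake simply keeps raw JSON lines and each reading program imposes its own structure:

```python
import json
import sqlite3

# Schema on WRITE: the warehouse table is modeled before any data lands.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES ('EMEA', 1200.0)")

# Schema on READ: the lake stores raw records as-is; note the second
# record has a different shape, which the store happily accepts.
raw_lines = [
    '{"region": "EMEA", "amount": 1200.0, "note": "promo"}',
    '{"region": "APAC", "amount": 800.0}',
]

def read_with_schema(lines):
    # The interpretation lives in the reading program, not in the store.
    for line in lines:
        rec = json.loads(line)
        yield rec.get("region"), float(rec.get("amount", 0))

print(list(read_with_schema(raw_lines)))  # [('EMEA', 1200.0), ('APAC', 800.0)]
```

A second program could read the same raw lines with an entirely different schema (say, extracting only the `note` field), which is exactly the flexibility, and the burden, of schema on read.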

Cost (storage and processing): A data lake provides cheaper storage for large volumes of data and has the potential to reduce processing cost by bringing analytics close to the data.

Data access: The data lake gives business users immediate access to all data. They don’t have to wait for the data warehousing (DW) team to model the data or grant them access; instead, they shape the data however they want to meet local requirements. The data lake speeds delivery, which a dynamic market economy requires.

Flexibility: Data lakes offer unparalleled flexibility, since nothing stands between business users and the data.

Data quality: Data in a traditional data warehouse has been cleansed, whereas data in a data lake is typically raw.

Relevance in the big data world: The traditional approach of manually curated data warehouses provides a limited window onto the data and is designed to answer only specific questions identified at design time. This may not be adequate for data discovery in today’s big data world. Moreover, a data lake can contain any type of data: clickstream, machine-generated, social media, and external data, and even audio, video, and text. Traditional data warehouses are limited to structured data. For example, data lakes are an ideal way to manage the millions of patient records in a hospital, ranging from physicians’ notes to lab results. With a data lake, the hospital stores all of that disparate data in its original format, calling upon specific types of record when needed and converting the data into uniform structures only when the situation calls for it.
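The hospital example above can be sketched as a toy illustration: heterogeneous records are stored in their original shapes, and filtering or conversion to a uniform view happens only at query time. The record fields and patient IDs are invented for illustration:

```python
# Toy "lake": records of different shapes live side by side, unconverted.
lake = [
    {"type": "lab_result", "patient": "p1", "test": "glucose", "value": 5.4},
    {"type": "physician_note", "patient": "p1",
     "text": "Patient reports fatigue."},
]

def records_for(patient, record_type=None):
    """Filter raw records on demand; any conversion to a uniform
    structure would happen here, at read time, not at ingest."""
    for rec in lake:
        if rec["patient"] == patient and (
                record_type is None or rec["type"] == record_type):
            yield rec

print([r["type"] for r in records_for("p1")])  # ['lab_result', 'physician_note']
```

Nothing about the lab result’s shape constrains the physician’s note; a warehouse would have forced both into predefined tables before either could be stored.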

The data lake does provide advantages to enterprises that require quick access to data, but data lakes bring their own set of challenges. I will explore these in my subsequent blogs.

A dive into “Data Lake”

With the advancements in big data and cloud, a lot of new concepts are surfacing. “Data lake” is one such concept that is being talked about, and in this blog I wish to demystify it.

So let’s begin with the need for a data lake.

It is estimated that a staggering 70% of the time spent on analytics projects goes to identifying, cleansing, and integrating data, because of the following issues:

  • Data is often difficult to locate because it is scattered among many business applications and business systems.
  • Frequently the data needs reengineering and reformatting in order to make it easier to analyze.
  • The data must be refreshed regularly to keep it up to date while it is in use by analytics.

Acquiring data for analytics in an ad hoc manner creates a huge burden on the teams that own the systems supplying the data. Often the same type of data is requested repeatedly, and the original information owner finds it hard to keep track of who has copies of which data.

As a result, many organizations are considering implementing a data lake solution. A data lake is a set of one or more data repositories that have been created to support data discovery, analytics, ad hoc investigations, and reporting. The data lake contains data from many different sources. People in the organization are free to add data to the data lake and access any updates as necessary.