Information Governance – Revisited

It has been more than 5 years since I wrote on Information Governance. Over the last 5 years some areas of Information Governance have matured, and I thought of revisiting this topic. In a simple analogy, what a library does for books, data governance does for data. It organizes data, makes it simple to access, gives a means to check the validity and accuracy of the data, and makes it understandable to all who need it. If Information Governance is in place, organizations can use data to generate insights and are also equipped for regulatory mandates (like GDPR).

There are six sets of capabilities that make up the Information Management & Governance component:

  1. Data Lifecycle Management is a discipline that applies not only to analytical data but also to operational, master and reference data within the enterprise.  It involves defining and implementing policies on the creation, storage, transmission, usage and eventual disposal of data, in order to ensure that it is handled in such a way as to comply with business requirements and regulatory mandates.

2. MDM: Master and Entity Data acts as the ‘single source of the truth’ for entities – customers, suppliers, employees, contracts etc.  Such data is typically stored outside the analytics environment in a Master Data Management (MDM) system, and the analytics environment then accesses the MDM system when performing tasks such as data integration.

3. Reference Data is similar in concept to Master and Entity Data, but pertains to common data elements such as location codes, currency exchange rates etc., which are used by multiple groups or lines of business within the enterprise.  Like Master and Entity Data, Reference data is typically leveraged by operational as well as analytical systems.  It is therefore typically stored outside the analytics environment and accessed when required for data integration or analysis.

4. Data Catalog is a repository that contains metadata relating to the data stored in the Analytical Data Lake Storage repositories.  The catalog maintains the location, meaning and lineage of data elements, the relationships between them and the policies and rules relating to their security and management.  The catalog is critical for enabling effective information governance and for supporting self-service access to data for exploration and analysis.

5. Data Models provide a consistent representation of data elements and their relationships across the enterprise.  An effective Enterprise Data Model facilitates consistent representation of entities and relationships, simplifying management of and access to data.

6. Data Quality Rules describe the quality requirements for each data set within the Analytical Data Lake Storage component, and provide measures of data quality that potential consumers of the data can use to determine whether a data set is suitable for a particular purpose.  For example, data sets obtained from social media sources are often sparse and therefore ‘low quality’, but that does not necessarily disqualify a data set from being used.  Provided a user of the data knows about its quality, they can use that knowledge to determine what kinds of algorithms can best be applied to that data.
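As a rough illustration of the data quality point above, here is a minimal Python sketch of one possible quality measure, completeness (the share of cells that actually hold a value). The function names, the field names and the threshold are all hypothetical and chosen only for this example; real data quality tooling typically tracks many more dimensions.

```python
def completeness(rows, fields):
    """Return the fraction of (row, field) cells that hold a non-empty value."""
    total = len(rows) * len(fields)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for row in rows
        for field in fields
        if row.get(field) not in (None, "")
    )
    return filled / total


def is_fit_for(score, minimum):
    """A consumer decides suitability against the minimum its own purpose needs."""
    return score >= minimum


# Hypothetical, sparse social media records.
posts = [
    {"user": "a", "text": "great service", "location": None},
    {"user": "b", "text": "", "location": None},
]
score = completeness(posts, ["user", "text", "location"])
```

The same score can be acceptable for one consumer and too low for another, which is the point of publishing the measure rather than simply rejecting the data set.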


A World with Watson

A year ago I wrote my first blog about Watson. I have been closely following what’s happening with Watson since then. Here are some facts on Watson and what users of Watson are saying about it.


Quick Facts About Watson:

  • By the end of this year, Watson will touch one billion people in some way
  • Watson can “see”: it is able to describe the contents of an image. For example, Watson can identify melanoma from skin lesion images with 95 percent accuracy, according to research with Memorial Sloan Kettering.
  • Watson can “hear,” understanding speech including Japanese, Mandarin, Spanish, Portuguese, among others.
  • Watson can “read” 9 languages.
  • Watson can “feel” impulses from sensors in elevators, buildings, autos and even ball bearings.
  • Watson has been trained on 8 types of cancers, with plans to add 6 more this year.
  • Beyond oncology, Watson is in use by nearly half of the top 25 life sciences companies, major manufacturers for IoT applications, retail and financial services firms, and partners like GM, H&R Block and
  • At IBM, there are more than 1,000 researchers focused solely on artificial intelligence

But perhaps more important than what Watson can do is what people, businesses and institutions of all sizes are doing with Watson. See what some IBM Watson users are saying.

“What IBM and Watson has been at the leading edge of is providing enterprise grade, commercially ready cognitive services, fully integrated with a top notch cloud and many other services from analytics to support and sales & marketing.” – André M. König, Co-Founder @ Opentopic Inc. This quote was included in Mr. König’s article “Watson is a Joke?” featured on LinkedIn.

“All of us involved in training Watson… are absolutely convinced that this technology will become an indispensable part of a doctor’s armamentarium to care for patients.” – Mark G. Kris, MD, lead physician of the Memorial Sloan-Kettering-IBM Watson collaboration. Dr. Kris’s quote was featured in a June 25, 2017 article in the American Society of Clinical Oncology entitled “How Watson for Oncology is Advancing Personalized Patient Care.”

“But, the probably more exciting part about it is in 30 percent of patients Watson found something new. And so that’s 300-plus people where Watson identified a treatment that a well-meaning, hard-working group of physicians hadn’t found.” – Dr. Norman “Ned” Sharpless, director of the Lineberger Comprehensive Cancer Center at the University of North Carolina at Chapel Hill and recent presidential appointee as director of the National Cancer Institute.
Dr. Sharpless made these comments in a “60 Minutes” segment that aired in October 2016 and again on June 25, 2017. The segment can be viewed here.

“30 minutes is down to 8 minutes to screen a patient…That coordinator can now spend that valuable time gained … in educating the patient on why it’s important for her to be in that clinical trial, helping to break down other barriers.” – Dr. Tufia Haddad, MD, Breast Medical Oncologist, Mayo Clinic, made these comments during an AI in Healthcare panel during HIMSS 2017, reported here.

“We could have individually looked at the 1,500 proteins and genes but it would have taken us much longer to do so.  IBM Watson for Drug Discovery, with its robust knowledge base, was able to rapidly give us new and novel information we would not otherwise have had.” – Robert Bowser, PhD, director of the Gregory W. Fulton ALS Research Center at Barrow Neurological Institute and one of the nation’s leading ALS researchers. Quote is from a press release announcing the recent Society for Neuroscience study findings.

“[With Watson], we’re seeing some really tremendous efficiencies gained in the drilling business – [including] an 80 percent reduction in the geoscience research time we need to actually design our wells. That means geoscience searchers are doing geoscience not looking out for more data.” – Peter Coleman, CEO and Managing Director for Woodside [source: Investor Briefing, March 7, 2017]

“[Watson services] was a wake-up call for us – that cognitive solutions are real and powerful. We felt that IBM had, by far, the largest lead in terms of where cognitive was going and that the Watson team would be in the best position to help our business users.” – Ryan Bartley, Head of Applied Innovation at Staples [source: IBM Watson blog, February 10, 2017]

“It’s not man versus machine – they very much work hand and hand. Our analysts continue to play a critical role in evaluating a cyber security incident, while Watson for Cyber Security enforces their decisions and validates what they are sharing with the customer at risk. It enables security analysts to deliver faster and more accurate details on a breach, so we may better protect our customers.” – Ronan Murphy, CEO, Smarttech [source: Press Release, May 11, 2017]

Why Blockchain?

There has been a lot of buzz around blockchain, enough to earn it a place on Gartner’s Hype Cycle for Emerging Technologies, 2016. It has been envisioned that blockchain will do for transactions what the Internet did for information. So in this blog, let’s discuss the need for blockchain.

Why Blockchain?

Complex Transactions

If you’ve ever bought a house, you probably had to sign a huge stack of papers from a variety of different stakeholders to make that transaction happen. It is a complex transaction involving banks, attorneys, title companies, insurers, regulators, tax agencies and inspectors. They all maintain separate records, and it’s costly to verify and record each step. That’s why the average closing takes several days. The same holds true if you are registering a vehicle. In both examples, what you are doing is establishing ownership of an asset, and the problem is that the information resides in several ledgers (or databases), all of which have to hold the same version of the truth. So the problems are manifold:

  • Multiple ledgers have to be updated to represent business transactions as they occur.
  • This is EXPENSIVE due to duplication of effort and intermediaries adding margin for services.
  • It is clearly INEFFICIENT, as the business conditions (the contract) are duplicated by every network participant and we need to rely on intermediaries throughout this paper-laden process.
  • It is also VULNERABLE because if a central system (e.g. a bank) is compromised by an incident, this affects the whole business network.  Incidents can include fraud, a cyber attack or a simple mistake.


What if there existed a common ledger (or a distributed database) that everyone had access to and that everyone trusted? This is what blockchain brings to business!
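To give a feel for what a common ledger that everyone can check might look like, here is a minimal Python sketch of an append-only ledger in which each entry carries a hash of the previous one. This is only an illustration of the tamper-evidence idea, under simplifying assumptions: it leaves out distribution across participants, consensus and permissions, and the function names and the sample house-sale records are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(payload, prev_hash):
    """Hash an entry's payload together with the hash of the entry before it."""
    body = json.dumps({"payload": payload, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(ledger, payload):
    """Add a record to the end of the ledger, linking it to the last entry."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append(
        {"payload": payload, "prev": prev_hash, "hash": entry_hash(payload, prev_hash)}
    )


def is_consistent(ledger):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(entry["payload"], entry["prev"]):
            return False
        prev_hash = entry["hash"]
    return True


ledger = []
append(ledger, {"asset": "house-42", "owner": "seller"})
append(ledger, {"asset": "house-42", "owner": "buyer"})
```

Because every participant can run the same check on their copy, a quietly altered record is detectable without asking an intermediary to vouch for it.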

Why now?

There are three reasons why blockchain is starting to take a foothold now.
  • Industries are merging and interacting like never before. The growth of ecommerce, online banking and in-app purchases, and the increasing mobility of people around the world, have fueled the growth of transaction volumes. And transaction volumes will explode with the rise of the Internet of Things (IoT): autonomous objects, such as refrigerators that buy groceries when supplies are running low and cars that deliver themselves to your door, stopping for fuel along the way. These partnerships require more trust and transparency to succeed.
  • Increasing regulation, cybercrime and fraud are inhibiting business growth. The last 10 years have seen the growth of global, cross-industry regulations, including HIPAA, the Sarbanes-Oxley Act, anti-money laundering rules and more. And to keep pace with regulatory changes, companies are rapidly increasing compliance staff and budgets.
  • Advancements in technologies like cloud (offering the compute power to track billions of transactions) and cryptography (securing both networks and transactions) are also enablers for blockchain.

In a future blog post I will discuss how blockchain makes things better and how it works. So stay tuned.

Data Science Vs BI & Predictive Analytics

Business intelligence (BI) has been evolving for decades as data has become cheaper, easier to access, and easier to share. BI analysts take historical data, perform queries, and summarize findings in static reports that often include charts. The outputs of business intelligence are “known knowns” that are manifested in stand-alone reports examined by a single business analyst or shared among a few managers. For example, a BI report might identify the probable high-net-worth clients who could be sold a premium bank account, based on considerations such as the average account balance.

Predictive analytics has been unfolding on a parallel track to business intelligence. With predictive analytics, numerous tools allow analysts to gain insight into “known unknowns”. These tools track trends and make predictions, but are often limited to specialized programs. In the previous example, a probable high-net-worth client could also be the spouse of an existing high-net-worth client, a connection that can be figured out using predictive analytics.

Data science, on the other hand, is an interdisciplinary field that combines machine learning, statistics, advanced analysis, high-performance computing and visualization. It is a new form of art that draws out hidden insights and puts data to work in the cognitive era. The tools of data science originated in the scientific community, where researchers used them to test and verify hypotheses that include “unknown unknowns”. Here are some examples of what data science can do:

  • Uncover totally unanticipated relationships and changes in markets or other patterns. For example, how the price of a house depends on its nearness to high-voltage power lines or on whether it has a brick exterior.
  • Handle streams of data. In fact, some embedded intelligent services make decisions and carry out those decisions automatically in microseconds. For example, analyzing a user’s click pattern to dynamically propose a product or promotion that attracts the customer.

As discussed, data science differs from traditional business intelligence and predictive analytics in the following ways.

  • It brings in data that is orders of magnitude larger than what previous generations of data warehouses could store, and it even works on streaming data sources.
  • The analytical tools used in data science are also increasingly powerful, using artificial intelligence techniques to identify hidden patterns in data and pull new insights out of it.
  • The visualization tools used in data science leverage modern web technologies to deliver interactive browser-based applications. Not only are these applications visually stunning, they also provide rich context and relevance to their consumers.

Data science enriches the value of data, going beyond what the data says to what it means for your organization—in other words, it turns raw data into intelligence that empowers everyone in your organization to discover new innovations, increase sales, and become more cost-efficient. Data science is not just about the algorithm, but about deriving value.


Disclaimer: The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.


Need for Governance in Self-Service Analytics

[Figure: Analytics Offering without Self-Service]

Self-service analytics is a form of business intelligence in which line-of-business professionals or data scientists are enabled and encouraged to perform queries and generate reports on their own, with nominal IT support. This empowers everyone in the organization to discover new insights and make informed decisions. Capitalizing on the data lake, or modernized data warehouse, they can analyze full data sets (no more sampling), gain insight from non-relational data, and pursue exploratory analysis and discovery with a 360° view of their business. At this stage, the organization can truly be data-savvy and insight-driven, leading to better decisions, more effective actions, and improved outcomes. Insight is used to make risk-aware decisions, fight fraud and counter threats, optimize operations or, most often, attract, grow and retain customers.

Any self-service analytics, regardless of persona, has to involve data governance. Here are three examples of how any serious analytics work would be impossible without support for a proper data governance practice in the analytics technology:

  1. Fine-grained authorization controls: Most industries feature data sets where data access needs to be controlled so that sensitive data is protected. As data moves from one store to another, gets transformed, and aggregated, the authorization information needs to move with that data. Without the transfer of authorization controls as data changes state or location, self-service analytics would not be permitted under the applicable regulatory policies.
  2. Data lineage information: As data moves between different data storage layers and changes state, it’s important for the lineage of the data to be captured and stored. This helps analysts understand what goes into their analytic results, but it is also often a policy requirement for many regulatory frameworks. An example of where this is important is the right to be forgotten, which is a legislative initiative we are seeing in some Western countries. With this, any trace of information about a citizen would have to be tracked down and deleted from all of an organization’s data stores. Without a comprehensive data lineage framework, adherence to a right to be forgotten policy would be impossible.
  3. Business glossary: A current and complete business glossary acts as a roadmap for analysts to understand the nature of an organization’s data. Specifically, a business glossary maps an organization’s business concepts to the data schemas. One common problem with Hadoop data lakes is a lack of business glossary information as Hadoop has no proper set of metadata and governance tooling.
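The first two points above can be sketched in a few lines of Python. This is a minimal illustration only, assuming lineage is recorded as a simple mapping from each derived data set to its sources and that the mapping has no cycles; the data set names, the "pii" tag and both function names are hypothetical. `downstream` finds every data set that would have to be checked for a right-to-be-forgotten request, and `inherited_tags` shows sensitivity labels travelling with the data as it is transformed.

```python
from collections import deque


def downstream(derived_from, origin):
    """Return every data set derived, directly or indirectly, from origin."""
    # Invert the "derived from" edges so we can walk from source to consumers.
    children = {}
    for dataset, sources in derived_from.items():
        for source in sources:
            children.setdefault(source, set()).add(dataset)

    seen = set()
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def inherited_tags(derived_from, tags, dataset):
    """Union of the data set's own tags and those of all its upstream sources."""
    result = set(tags.get(dataset, ()))
    for source in derived_from.get(dataset, ()):
        result |= inherited_tags(derived_from, tags, source)
    return result


# Hypothetical lineage: each key was produced from the listed sources.
lineage = {
    "crm_clean": ["crm_raw"],
    "customer_360": ["crm_clean", "web_clicks"],
    "churn_scores": ["customer_360"],
}
sensitivity = {"crm_raw": {"pii"}}
```

With this lineage, a deletion request against `crm_raw` reaches `crm_clean`, `customer_360` and `churn_scores`, and `churn_scores` inherits the "pii" tag even though it was never tagged directly.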

A core design point of any self-service analytics offering (like IBM DataWorks) is that data governance capabilities should be baked in. This enables self-service data analysis where analysts only see data they’re entitled to see, where data movement and transformation are automatically tracked for a complete lineage story, and where business glossary information is used as users search for data.

The Best Data Science Platform

Data science platforms are engines for creating machine-learning solutions. Innovation in this market focuses on cloud, Apache Spark, automation, collaboration and artificial-intelligence capabilities. When choosing the best one, organizations often rely on the Gartner Magic Quadrant, which aims to provide a qualitative analysis of a market and its direction, maturity and participants. Gartner previously called these platforms “advanced analytics platforms”. But since they are primarily used by data scientists, from this year the Quadrant has been renamed the Magic Quadrant for Data Science Platforms.

This Magic Quadrant evaluates vendors of data science platforms. These are products that organizations use to build machine-learning solutions themselves, as opposed to outsourcing their creation or buying ready-made solutions. These platforms are used by data scientists for demand prediction, failure prediction, determination of customers’ propensity to buy or churn, and fraud detection.

The report aims to rank the platforms on their ability to execute and their completeness of vision. The Magic Quadrant is divided into four parts:

  • Niche Players
  • Challengers
  • Visionaries
  • Leaders

    Source: Gartner (February 2017)

Adoption of open-source platforms and diversity of tools are important characteristics of this market. IBM’s mission to make data simple and accessible to the world, together with its commitment to open source and to numerous open-source ecosystem providers, has made it the most attractive platform for data science.  A data scientist needs the following to be more successful, all of which are provided by IBM Data Science Experience (DSX):

  • Community: A data scientist needs to stay updated with the latest news from the data science community. There are plenty of new open-source packages, libraries, techniques and tutorials available every day. A good data scientist follows the most important sources and shares their opinions and experiments with the community. IBM brings this into the UI of DSX.
  • Open Source: Today there are companies that rely on open source for data science. Open source has become so mature that it is directly competing with commercial offerings. IBM provides the best of open source within DSX, such as RStudio and Jupyter.
  • IBM Value Add: DSX improves on open source by adding some capabilities from IBM. Data shaping, for example, takes 80% of a data scientist’s time. IBM provides tools with a visual GUI to help users perform this task better. You can also execute Spark jobs on the managed Spark service in Bluemix from within DSX.

Watson Analytics

Need for Watson Analytics
If an organization is good at analyzing data and extracting relevant insights from it, then decision makers can make more informed and thus better decisions. But decision makers are often forced to make decisions with incomplete information. The reason? Decision makers, or citizen analysts, for the most part tend to be consumers of analytics, and they rely on more skilled resources in the organization (like data engineers, data scientists and application developers) to provide data-driven answers to their questions. Moreover, the answer to one question is often just the start of another; think of a detective interrogating a suspect. The consumer/builder model is hardly conducive to the iterative nature of data analysis. Therefore, the time it takes for these answers to reach the decision makers is far from optimal, and many questions go unanswered every day.

Watson Analytics
So a logical solution is to provide an easier-to-use analytics offering. Watson Analytics provides that value add, so that more people are able to leverage data to drive better decision making using analytics.

When we think of Watson, we think of cognitive. And when we think of analytics, we think of traditional analytics (querying, dashboarding), along with some more advanced analytic capabilities (data mining and social media analytics). So Watson Analytics is a cloud-based offering that can make analytics child’s play even for a non-skilled user.

Watson Analytics helps users understand their data in a guided way, using a natural language interface to ask a series of business questions. For example, a user can ask “What is the trend of revenue over years?” and get a visualization in response. So, instead of having to first choose a visualization and work backwards to try to answer the business question, Watson Analytics allows you to describe your intent in natural language, and it chooses the best visualization for you. Even better, Watson Analytics gives you an initial set of questions which you can keep refining.

Watson Analytics for Social Media
Watson Analytics can work on Social Media data to take the pulse of an audience by spotting trends and identifying new insights and relationships across multiple social channels allowing greater visibility into a given topic or market. It combines structured and unstructured self-service analysis to enrich your social media analytics experience for exceptionally insightful discoveries. All on the cloud!

Summary of Steps:
Watson Analytics does the following to surface the insights hidden in your big data.

  • Import data from a robust set of data source options (on cloud and on premises), with the option to prepare and cleanse it via IBM Bluemix Data Connect.
  • Answering What: Identifying issues, early problem detection, finding anomalies or exceptions, challenging conventional wisdom or the status quo.
  • Answering Why: Understanding or explaining outcomes, i.e. why something happened.
  • Dashboarding to share results