The Best Data Science Platform

Data science platforms are engines for creating machine-learning solutions. Innovation in this market focuses on cloud, Apache Spark, automation, collaboration and artificial-intelligence capabilities. When choosing a platform, organizations often rely on the Gartner Magic Quadrant, which aims to provide a qualitative analysis of a market and its direction, maturity and participants. Gartner previously called these platforms “advanced analytics platforms”. Because they are primarily used by data scientists, the report has been renamed this year to the Magic Quadrant for Data Science Platforms.

This Magic Quadrant evaluates vendors of data science platforms. These are products that organizations use to build machine-learning solutions themselves, as opposed to outsourcing their creation or buying ready-made solutions. Data scientists use these platforms for demand prediction, failure prediction, determination of customers’ propensity to buy or churn, and fraud detection.

The report ranks data science platform vendors on their ability to execute and the completeness of their vision. The Magic Quadrant is divided into four parts:

  • Niche Players
  • Challengers
  • Visionaries
  • Leaders

    [Figure: Gartner Magic Quadrant for Data Science Platforms]
    Source: Gartner (February 2017)

Adoption of open-source platforms and diversity of tools are important characteristics of this market. IBM’s mission to make data simple and accessible to the world, together with its commitment to open source and to numerous open-source ecosystem providers, has made it a most attractive platform for data science. A data scientist needs the following to be successful, all of which are provided by IBM Data Science Experience (DSX):

  • Community: A data scientist needs to keep up with the latest news from the data science community. New open-source packages, libraries, techniques and tutorials become available every day. A good data scientist follows the most important sources and shares opinions and experiments with the community. IBM brings this into the DSX user interface.
  • Open Source: Today there are companies that rely on open source for data science. Open source has become so mature that it competes directly with commercial offerings. IBM provides the best of open source within DSX, such as RStudio and Jupyter.
  • IBM Value Add: DSX improves on open source by adding capabilities from IBM. Data shaping, for example, takes 80% of a data scientist’s time. IBM provides tools with a visual interface to help users perform this task better. You can also execute Spark jobs on the managed Spark service in Bluemix from within DSX.

Who Leads the Forrester Wave in Data Quality?

This was Information Analyzer and QualityStage vs. the world, and IBM came out on top!

Forrester published its most recent Wave vendor evaluation report on data quality on December 14, 2015. IBM is positioned as a strong leader in this evaluation, receiving the highest possible strategy score.

[Figure: Forrester Wave for Data Quality]

Here are some highlights:

  • IBM gets customers started on enterprise data quality with a rich set of data quality content to speed up the deployment and return on data quality investment across traditional, big data, cloud, and hybrid environments.
  • The stewardship consoles allow business data quality stewards to lead data quality with strong dashboarding, reporting, and data profiling.
  • In addition, business data stewards easily collaborate with data quality developers in the creation of rules, match, and survivor feedback.
  • IBM is also porting its full enterprise data quality capabilities to the cloud and evolving its pricing and services models to be flexible to a variety of customer architectures and implementations.

The full report is available from Forrester.

Spark – Sparkling framework for big data management and analytics

There has been a lot of buzz around Apache Spark over the last several months, and I have been following it to some extent and comparing it with Hadoop. In this blog, I will share some of what I have read about it.

Apache Spark is an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers. Well, wasn’t that Hadoop’s claim to fame? Well yes, but Spark was developed as a way to speed up processing jobs in Hadoop systems. Spark advocates claim that with its in-memory computing layer, Spark can run batch-processing programs up to 100 times faster than MapReduce can. When data is processed from disk, Spark can run batch jobs up to 10 times faster, they say.

While MapReduce is limited to batch processing, the Apache Spark architecture includes a stream processing module, a machine learning library and a graph processing API with related algorithms, which makes it a more general-purpose platform. The Spark Streaming technology in particular has found its way into deployments at Spark early adopters, for uses such as analyzing online advertising data and processing satellite images and geo-tagged tweets. Does that imply that handling these additional processing workloads may require companies to expand the size of their Hadoop clusters? The answer is yes.

Unlike Hadoop, Spark doesn’t include its own file system. It can run in a standalone mode and access a variety of data sources, but most often it is used to process and analyze data stored in the Hadoop Distributed File System (HDFS). So it should not be surprising that Spark has been incorporated into every major distribution of Hadoop, including the ones from Cloudera, Hortonworks, IBM, MapR and Pivotal. In such installations, one can still use MapReduce because of its reliability, but Spark may require less development expertise than MapReduce does because of its high-level APIs and support for writing applications in Java, Scala or Python.

Seven Reasons Why Enterprises Trust IBM Software

Recently IBM announced that it would be backing Spark in its effort to embrace and promote open source. At this, Ben Horowitz, technology entrepreneur and co-founder of the venture capital firm Andreessen Horowitz, said, “It’s like Spark just got blessed by the enterprise rabbi.” This is the position that IBM commands as a technology company that has supported its clients for over a century. In this blog I will share seven reasons why major corporations around the world rely heavily on IBM for critical services and solutions.


1. Innovation: IBM’s CEO once asked one of the top Indian telco customers to describe IBM in one word. They immediately said: innovation. Innovation is in IBMers’ DNA. Watson, which demolished human competitors in a highly touted series of Jeopardy! games, is just one illustration of IBM’s innovative prowess. IBM has held the top position in number of inventions for more than two decades now. You can read about some notable inventions here. From eWeek: “IBM might be, at heart, an old school, enterprise-focused company, but it also keep coming up with innovative ideas, including artificial intelligence, supercomputing and the role of the mainframe in cloud computing. The company’s Watson invention is one of the most important it’s brought to the public in some time, and its work on capturing and analyzing big data to make it actionable in a corporate environment could have a positive effect on the world for decades to come.”

2. Understands Customers’ Needs: Management expert Jim Collins (author of books such as Built to Last) says, “If you consider what IBM’s mission is, it’s not about computers or technology. It’s about allowing its individual employees to create ways for its customers to solve operational problems. Whether that’s a task best done with scales, typewriters or computers doesn’t matter; what matters is that customers’ needs are answered.” IBM understands the business of enterprises, and so it is a Leader in Gartner’s Magic Quadrants in almost every technology area.

3. Spread Across Geographies: IBM has offices in over 170 countries, making it easy to reach an executive for a demo or quick help. In my induction to IBM 13 years back, I was told that it is one of the top three most popular brand names around the world!

4. Trust: Which company can an enterprise trust to last for the next decade? Will it be acquired by another company, leaving its fate unknown? IBM has managed to grow organically and survive for 10 decades. Nobody will ever complete a leveraged buyout of IBM. When a company is looking for important solutions in key areas such as infrastructure software or security, the vendor’s reputation and trustworthiness are crucial considerations. There is an old saying in the industry: “Nobody ever got fired for buying IBM.”

5. Big Pockets: Why is IBM a Leader in most of Gartner’s Magic Quadrants? You guessed it. Either it innovates to get there or it acquires a company that is already there. The mobile and cloud solutions markets are on the rise, and IBM is ready with a $4 billion investment in these areas. Hardware operations lost half a billion dollars in 2013 due to large shifts in the commodity hardware market. For most companies, that sort of loss would spell the end, but given IBM’s deep pockets, the management team is simply transitioning the business through this change cycle.

6. Experience: IBM has survived several recessions, technological shifts and intense competition, and has demonstrated a strength shared by most 100-year-old companies: the ability to learn and change. For example, many enterprises are now joining the big data bandwagon, whereas IBM’s InfoSphere Information Server has over a decade of experience in big data movement and data governance. You may watch this video that captures IBM’s 100 years of experience that changed the world.

7. Stack Integration: The one advantage you get with IBM is that IBM does everything, from silicon to solutions (end-to-end). Morningstar analyst Peter Wahlstrom says, “IBM holds a defensible position in enterprise software, services and hardware. While each of these businesses is an industry leader in its own right, the combination of these products and services provides the firm with a unique solution creation perspective and delivery ability that is key to its wide economic moat.”

I hope this has been an interesting read, especially coming from an IBM developer who has been developing market-leading software for over a decade.

Disclaimer: The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.

Deterministic vs. Probabilistic Match

As explained in an earlier blog, data matching finds records in a single data source or in independent data sources that refer to the same entity (such as a person, organization, location, product, or material) even if there is no predetermined key. There are two common approaches to deciding whether two similar records match: deterministic matching and probabilistic matching.

[Figure: deterministic vs. probabilistic match example]

Deterministic matching typically searches for a pool of candidate duplicates and then compares values found in specified attributes between all pairs of possible duplicates. It makes allowances for missing data. The results are given a score, and the scores are used to decide if the records should be considered the same or different. There is a gray area where the scores indicate uncertainty, and such duplicates are usually referred to a data steward for investigation and decision.

Probabilistic matching looks at specified attributes and checks how frequently their values occur in the dataset before assigning scores. The scores are influenced by the frequencies of the existing values found. Thresholds can be set to decide whether a pair is a definite match or whether clerical intervention by a data steward is required to determine a match.

In Summary
Deterministic decision tables:

  • Fields are compared
  • Letter grades are assigned
  • Combined letter grades are compared to a vendor-delivered file
  • Result: Match; Fail; Suspect
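
The decision-table steps above can be sketched in a few lines of Python. This is only an illustration: the grading rules, the field names (`name`, `birth_date`, `postcode`) and the contents of `DECISION_TABLE` are hypothetical stand-ins for the vendor-delivered file, not what any product actually ships.

```python
# Hypothetical grading rules and decision table, for illustration only.

def grade(a, b):
    """Assign a letter grade to the comparison of one field."""
    a = (a or "").strip().lower()
    b = (b or "").strip().lower()
    if not a or not b:
        return "C"  # missing data on either side
    if a == b:
        return "A"  # exact agreement
    if a[0] == b[0]:
        return "B"  # crude placeholder for a real fuzzy comparison
    return "D"      # disagreement

# Stand-in for the vendor-delivered file: grade pattern -> decision.
# Any pattern that is not listed is treated as "Fail".
DECISION_TABLE = {
    "AAA": "Match",
    "AAB": "Match",
    "AAC": "Suspect",
    "ABA": "Suspect",
    "BAA": "Suspect",
}

def decide(rec1, rec2, fields=("name", "birth_date", "postcode")):
    """Grade each field, combine the grades, and look up the decision."""
    pattern = "".join(grade(rec1.get(f), rec2.get(f)) for f in fields)
    return pattern, DECISION_TABLE.get(pattern, "Fail")

if __name__ == "__main__":
    r1 = {"name": "John Smith", "birth_date": "1980-01-02", "postcode": "560001"}
    r2 = {"name": "Jon Smith", "birth_date": "1980-01-02", "postcode": "560001"}
    print(decide(r1, r2))
```

Note that the outcome depends entirely on which grade patterns someone has listed in advance; nothing in the data itself influences the decision.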

Probabilistic record linkage:

  • Fields are evaluated for degree of match
  • Weight is assigned and represents the information content by value.
  • Weights are summed to derive a total score.
  • Result: Statistical probability of a match
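
The weighting idea can be sketched as follows. This is a simplified illustration and not QualityStage's actual algorithm: the agreement weight here is just log2(1 / relative frequency) of the agreed value, while the fixed disagreement penalty, the two cutoffs and the sample records are assumptions made up for the example.

```python
import math
from collections import Counter

def value_weights(records, field):
    """Agreement weight per value: rarer values carry more information."""
    counts = Counter(r[field] for r in records if r.get(field))
    total = sum(counts.values())
    return {value: math.log2(total / n) for value, n in counts.items()}

def score(rec1, rec2, weights, disagreement_penalty=-2.0):
    """Sum the field weights for a pair of records (penalty is hypothetical)."""
    total = 0.0
    for field, by_value in weights.items():
        a, b = rec1.get(field), rec2.get(field)
        if not a or not b:
            continue  # missing data adds no evidence either way
        if a == b:
            total += by_value.get(a, 0.0)
        else:
            total += disagreement_penalty
    return total

def classify(total, match_cutoff=4.0, clerical_cutoff=1.0):
    """Compare the total score with two (hypothetical) cutoffs."""
    if total >= match_cutoff:
        return "match"
    if total >= clerical_cutoff:
        return "clerical review"
    return "non-match"

# Made-up sample data: "Smith" and "Pune" are common, "Zabrowski" is rare.
surnames = ["Smith"] * 4 + ["Jones"] * 2 + ["Kumar", "Zabrowski"]
cities = ["Pune"] * 4 + ["Delhi"] * 2 + ["Oslo"] * 2
records = [{"surname": s, "city": c} for s, c in zip(surnames, cities)]
weights = {f: value_weights(records, f) for f in ("surname", "city")}

if __name__ == "__main__":
    pair = ({"surname": "Zabrowski", "city": "Oslo"},
            {"surname": "Zabrowski", "city": "Oslo"})
    total = score(*pair, weights)
    print(total, classify(total))
```

With this data, two records that agree on a common surname earn less weight than two records that agree on a rare one, which is the behavior a fixed decision table cannot express.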

InfoSphere QualityStage can perform both deterministic matching and probabilistic record linkage, but uses probabilistic record linkage by default. The above example highlights the advantage of probabilistic matching.

Challenges of Data Lake paving way for Data Reservoir

In my previous blogs I discussed the data lake. Imagine you have pooled all the data of your enterprise into a data lake: there will be challenges. All this raw data will be overwhelming and unsafe to use, because no one is sure where the data came from, how reliable it is, and how it should be protected. Without proper management and governance, such a data lake can quickly become a data swamp. This data swamp can frustrate business users, application developers, IT and even customers.

So there is a need for a facility for transforming raw data into information that is Clean, Timely, Useful and Relevant. Hence an enhanced data lake solution was built with management, affordability, and governance at its core. This solution is known as a data reservoir. Probably in one of the subsequent blogs we will take a dip into data reservoir! Stay tuned.

Data Lake Vs Data Warehouse

In my last blog, I wrote about the data lake. The first comment on that blog asked about the difference between a data lake and a data warehouse. So in this blog, I will try to share some of my understanding of how they differ:

Schema: In a data warehouse (DW), the schema is defined before data is stored. This is called “schema on write”: the required data is identified and modeled in advance. In a data lake, the schema is defined after the data is stored. This is called “schema on read”. As a result, the structure of the data must be captured in code by each program that accesses it.
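
Here is a rough sketch of the difference, using plain Python lists as stand-ins for the warehouse table and the lake. The schema, the function names and the sample rows are all hypothetical; real warehouses and lakes enforce or skip the schema in their own ways.

```python
import json

# Hypothetical warehouse model, agreed on before any data is loaded.
WAREHOUSE_SCHEMA = {"customer_id": int, "amount": float}

def write_to_warehouse(table, row):
    """Schema on write: a row must fit the model before it is stored."""
    if set(row) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"row does not match schema: {sorted(row)}")
    table.append({col: cast(row[col]) for col, cast in WAREHOUSE_SCHEMA.items()})

def write_to_lake(lake, raw_line):
    """Schema on read: store the raw text as is, with no checks."""
    lake.append(raw_line)

def read_amounts(lake):
    """Each reader carries its own idea of the schema in its code."""
    amounts = []
    for line in lake:
        try:
            record = json.loads(line)
            if isinstance(record, dict) and "amount" in record:
                amounts.append(float(record["amount"]))
        except (json.JSONDecodeError, TypeError, ValueError):
            continue  # this reader skips anything it cannot interpret
    return amounts

if __name__ == "__main__":
    lake = []
    write_to_lake(lake, '{"customer_id": 1, "amount": "12.5"}')
    write_to_lake(lake, "free-form text that no schema anticipated")
    print(read_amounts(lake))
```

The warehouse rejects unexpected rows at load time, while the lake accepts everything and leaves it to `read_amounts` (and every other reader) to decide what the data means.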

Cost (Storage and Processing): A data lake provides cheaper storage for large volumes of data and has the potential to reduce processing cost by bringing analytics closer to the data.

Data Access: The data lake gives business users immediate access to all data. They don’t have to wait for the data warehousing (DW) team to model the data or give them access. Rather, they shape the data however they want to meet local requirements. The data lake speeds delivery which is required in a dynamic market economy.

Flexibility: Data lakes offer unparalleled flexibility, since nobody and nothing stands between business users and the data.

Data Quality: Data in a traditional data warehouse is cleansed, whereas data in a data lake is typically raw.

Relevance in Big Data world: The traditional approach of manually curated data warehouses provides a limited window onto the data and is designed to answer only the specific questions identified at design time. This may not be adequate for data discovery in today’s big data world. Moreover, traditional data warehouses are limited to structured data, whereas a data lake can contain any type of data: clickstream, machine-generated, social media, and external data, and even audio, video, and text. For example, data lakes are an ideal way to manage the millions of patient records for a hospital. These patient records can range from physicians’ notes to lab results. With a data lake, the hospital stores all of that disparate data in its original format, calling upon specific types of record when needed and converting the data into uniform structures only when the situation calls for it.

A data lake does provide some advantages to enterprises that require quick access to data. But data lakes bring their own set of challenges. I will explore these in my subsequent blogs.