IBM Vs Informatica (2 of 2)

In my last blog, we compared IBM’s Information Server and Informatica’s Power Center based on their scalability. Here is the summary: Big Data and enterprise class data environments need unlimited data scalability to keep pace with data volume growth. Informatica’s PowerCenter is NOT designed to provide unlimited data scalability which may lead to investment in expensive workarounds.

In this blog we will touch upon two other important aspect of ETL tools.

Data Governance and Metadata Management


  • IBM provides a data governance solution (Information Governance Catalog) designed for business users.
    • Information Governance Catalog has deep support of REST API interface. This makes Information Sever more open and ensures compatibility with other enterprise systems. User can create custom enhancement and loaders as well as can create unique user interfaces for a consistent look and feel.
    • There is a superior Event based notification that takes advantage of open source kafka messaging. For example, Import of metadata is an “event” that can be monitored for workflow and approval purposes, or simply for notification.
    • There is graphical reporting to illustrate relationships, data design origins, and data flow lineage to help answer “what does this mean” and “where did this data come from?”
    • There is an advanced search and navigation or a “shopping” experience for the data.
    • Metatadata Asset Manager controls what data goes in the repository. “Import Areas”  govern what is being imported into the repository (or not), and who is able to import. These imports are initiated via browser interface.  No local Windows installation is required for the metadata administrator.
  • Informatica lacks these capabilities and provides a data governance solution designed for technical users. It lacks openness of their platform and you get locked to “Informatica Only” architecture.

Data Quality

  • IBM provides an integrated data integration platform with one processing engine, one user design experience for data integration and data quality, and one shared metadata repository. Information Server gives ability to write a datastage job once and run it anywhere (transcational database, hadoop or eventually spark)
  • Informaticdataquality.pnga provides a  collection of multiple and incompatible processing engines, user design experiences, and metadata repositories. Informatica Data Quality and Informatica Power Center are two different products that have different user interfaces.  In fact, PC needs two interfaces to design jobs an manage workflows. It also uses two engines. This means that Data Quality processes have to be ‘pushed’ or ‘exported’ to PC to run.
In Summary, we can say Information Server is a better solution to go in case we want to create scalable workflows, open-ness in architecture and better productivity design and running the workflows. Information Server supports the power of 1.
  • 1 Engine: The same engine runs stand-alone, in a grid, or natively in Hadoop/YARN. Jobs can remain unchanged regardless of deployment model.
  • 1 Design Experience: Single design experience for Data Integration and Data Quality that increases productivity and reduces error.
  • 1 Repository: A single active metadata repository across the entire portfolio and so design and execution metadata instantly shared among team members.
Disclaimer: The postings on this site are my own and don’t necessarily represent IBM‘s positions, strategies or opinions

IBM Vs Informatica (1 of 2)

Some of you would have noticed in “about me” page, that one of my post that gets lot of hit is IBM and Informatica Leads Gartner Magic Quadrant for Data Integration Tools 2013. I realize that many of visitors would want to get a comparison between IBM Information Server and Informatica. I am into ETL domain since last 13 years, and have several publications and patents in this domain. So I thought of venturing into comparing these two solutions. One may ask that Gartner anyhow compares these solutions, so why is a need for me to do the same. The answer is that Gartner takes into account many different factors, and I am basing my comparison based on just the technical capability perspective of the key areas. Also I am opening a dialog where other practitioners who have worked on these two can provide inputs so that all the readers (including me) can benefit.

In this blog, I will focus on the scalability aspect of these two ETL solutions.

Scalability and Parallel Processing

Big Data Integration requires something called Massive Data Scalability. Massive Data Scalability requires the ability to process more data by simply adding more hardware.

  • IBM’s Information Server is built on a shared nothing, massively parallel processing architecture. There is no limitation on throughput and performance. If you want to process more data, you just add hardware. You don’t change your application. You can refer to my earlier blog which describes the Information Server Parallel processing which is much faster (10X to 15X) than the processing by Hadoop.

    Information Server Architecture Supporting Data Partitioning
  • Informatica’s  PowerCenter and Blaze can’t support partitioning a large data set across the nodes of a cluster or grid or MPP system. This is one of the fundamental architectural requirements for processing large data volumes. This means there is no support for running data integration logic in parallel across computing nodes, with the same logic running against separate data partitions. Because of this architectural limitations, the amount of data that you can sort, aggregate, transform, join, etc. is limited to what you can process on one node. So what does this mean?
    • First, you can’t exploit commodity grid hardware and storage for processing Big Data. You have to buy expensive SMP servers and storage because the amount of data that you can sort, transform, aggregate is what you can process on one server.
    • A second implication is that you are forced to push big ETL workloads into the parallel database. Users will be forced to push big ETL workloads into parallel databases such as Netezza, DB2, Oracle, and Teradata
    • Finally, because you can’t run all complex data transformations in the parallel database, you have to live with dirty data that has not been cleansed.

In Summary:
Processing large data in a scalable manner require data to be partitioned across separate nodes so that a single job executes the same application logic against all partitioned data. This is not possible by Informaticas Power Center. And so for processing large dataset, INFA customer has to depend on pushing the processing in the Database (too expensive) or offload some of the work to Hadoop (too slow).



The Best Data Science Platform

Data science platforms are engines for creating machine-learning solutions. Innovation in this market focuses on Cloud, Apache Spark, automation, collaboration and artificial-intelligence capabilities.When choosing the best one, organizations often trust on The Gartner Magic Quadrants which aims to provide a qualitative analysis into a market and its direction, maturity and participants. Gartner previously called these platforms “advanced analytics platforms”. But since this platform is primarily used by data scientists so from this year the Quadrant has been renamed to Magic Quadrant for Data Science Platforms. 

This Magic Quadrant evaluates vendors of data science platforms. These are products that organizations use to build machine-learning solutions themselves, as opposed to outsourcing their creation or buying ready-made solution. These platforms are used by data scientists for  demand prediction, failure prediction, determination of customers’ propensity to buy or churn, and fraud detection.

The report aims to rank the BI platforms on the ability to execute and the completeness of vision. The Magic Quadrant is divided in 4 parts:

  • Niche Players
  • Challengers
  • Visionaries
  • Leaders

    Source: Gartner (Feburary 2017)

Adoption of open-source platforms and Diversity of tools is an important characteristic of this market. IBM’s mission is to make data simple and accessible to the world and commitment to open source and numerous open-source ecosystem providers made it most attractive platform for Data Science.  A data scientist needs the following to be more successful, which is provided by IBM Data Scientist Experience

  • Community: A data scientist needs to be updated with the latest news from the Data Science Community. There are plenty of new Open Source packages, libraries, techniques and tutorials available every day. A good data scientist follows the most important sources and shares their opinion and experiments with the community. IBM brings this into the UI of the DSX.
  • Open Source: Today there are companies that rely on open source for data science. Open source has become so mature that is directly competing with commercial offerings. IBM provide the best of open source within DSX, such as RStudio and Jupyter.
  • IBM Value Add: DSX improve open source by adding some capabilities from IBM. Data Shaping for example takes 80% of the data scientist time. IBM tools with visual GUI to help users better perform this task. You can execute Spark jobs on  managed Spark Service in Bluemix from within the DSX.

Who Leads the Forrester Wave in Data Quality?

This was Information Analyzer and QualityStage Vs. the World and IBM came out on top!!!

Forrester published their most recent Wave vendor evaluation report on Data Quality December 14th, 2015. IBM is positioned as a strong leader in this evaluation, receiving the highest possible strategy score.


Here are some highlights:

  • IBM gets customers started on enterprise data quality with a rich set of data quality content to speed up the deployment and return on data quality investment across traditional, big data, cloud, and hybrid environments.
  • The stewardship consoles allow business data quality stewards to lead data quality with strong dashboarding, reporting, and data profiling.
  • In addition, business data stewards easily collaborate with data quality developers in the creation of rules, match, and survivor feedback.
  • IBM is also porting its full enterprise data quality capabilities to the cloud and evolving its pricing and services models to be flexible to a variety of customer architectures and implementations.

For full Forrester report click here.

Spark – Sparkling framework for big data management and analytics

sparkThere has been lot of buzz around Apache Spark since last several months, and I have been following it to some extent and comparing it with Hadoop. In this blog, I will share some of what I have read about it.

Apache Spark is an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers. Well, wasn’t that Hadoop’s claim to fame? Well yes, but Spark was developed as a way to speed up processing jobs in Hadoop systems. Spark advocates claim that with its in-memory computing layer, Spark can run batch-processing programs up to 100 times faster than MapReduce can. When data is processed from disk, Spark can run batch jobs up to 10 times faster, they say.

While MapReduce is limited to batch processing, the Apache Spark architecture includes a stream processing module, a machine learning library and a graph processing API with related algorithms thereby making it a more general purpose platform. The Spark Streaming technology in particular has found its way into deployments at Spark early adopters, for uses such as analyzing online advertising data and processing satellite images and geo-tagged tweets. Does that imply that handling of these additional processing workload may require companies to expand the size of their Hadoop clusters? And the answer is obviously, Yes.

Unlike Hadoop, Spark doesn’t include its own file system. It can run in a standalone mode and access a variety of data sources, but most often it is used to process and analyze data stored in the Hadoop Distributed File System (HDFS). So it should not be surprising to note that Spark has been incorporated into the top Hadoop distributions in every major distribution of Hadoop, including the ones from Cloudera, Hortonworks, IBM, MapR and Pivotol. In such installations, one can still use MapReduce because of its reliability, but Spark may require less development expertise than MapReduce does because of its high-level APIs and support for writing applications in Java, Scala or Python.

Seven Reasons Why Enterprises Trust IBM Software

Recently IBM announced that it would be backing Spark in it’s effort to embrace and promote Open Source. At this, technology entrepreneur and co-founder of the venture capital firm Andreessen Horowitz, Ben Horowitz said, “It’s like Spark just got blessed by the enterprise rabbi.” So this is the position that IBM commands as it stands as a technology company supporting it’s clients for over a century. In this blog I will share Seven reasons why  major corporations around the world rely heavily on IBM for critical services and solutions.


1. Innovation: Once IBM’s CEO asked one of top Indian Telco customer to describe IBM in one word. They immediately said – Innovation. Innovation is in IBMers DNA. Watson is just an illustration of IBM’s innovative prowess which demolished human competitors in a highly touted series of Jeopardy! games. IBM has been the top position in number of inventions for more than two decades now. You can read some notable inventions hereFrom eWeek: “IBM might be, at heart, an old school, enterprise-focused company, but it also keep coming up with innovative ideas, including artificial intelligence, supercomputing and the role of the mainframe in cloud computing. The company’s Watson invention is one of the most important it’s brought to the public in some time, and its work on capturing and analyzing big data to make it actionable in a corporate environment could have a positive effect on the world for decades to come.”

2. Understands Customers Needs: Management expert (and author of books such as Built to Last) Jim Collins says, “If you consider what IBM’s mission is, it’s not about computers or technology. It’s about allowing its individual employees to create ways for its customers to solve operational problems. Whether that’s a task best done with scales, typewriters or computers doesn’t matter; what matters is that customers’ needs are answered.” IBM understands the business of Enterprises and so is the market Leader the Gartner’s Magic quadrant in almost any technology area.

3. Spread Across Geographies: IBM has it’s offices in over 170 countries making it easy to reach an executive to get a demo or a quick help. In my induction to IBM 13 years back I was told that it is one  the top three most popular brand name around the world!

4. Trust: Which of the company can an Enterprise trust that will last for the next decade? Will it be acquired by another company and with its fate unknown? IBM has managed to have organic growth to survive 10 decades. Nobody will ever complete a leveraged buy out of IBM. When a company is looking for important solutions in key areas such as infrastructure software or security, the vendor’s reputation and trustworthiness are crucial considerations. There is an old saying in the industry: “Nobody ever got fired for buying IBM.”

5. Big Pockets: Why IBM is a Leader in most of Gartner’s magic quadrants? You guessed it. Either it innovates to be there or it acquires the company which is there. Mobile and Cloud solutions market are on rise and IBM is ready with $4 billion investment in these areas. Hardware operations lost half a billion dollars in 2013 due to large shifts in the commodity hardware market. For most companies, that sort of loss would spell the end, but given that IBMs big pocket, the management team is simply transitioning the business through this change cycle.

6. Experience: IBM survived several recessions, technological shifts and intense competition and demonstrated a strength shared by most 100-year-old companies: the ability to learn and change. For example,  many enterprises are now joining the band wagon of big data,  whereas IBM’s InfoSphere Information Server has over a decade of experience in big data movement and data governance. You may watch this video that captures IBM’s 100 years of experience that changed the world.

ibm7. Stack Integration: The one advantage you get with IBM is that IBM does everything – from silicon to solutions (end-to-end). Morningstar analyst Peter Wahlstrom says, “IBM holds a defensible position in enterprise software, services and hardware. While each of these businesses is an industry leader in its own right, the combination of these products and services provides the firm with a unique solution creation perspective and delivery ability that is key to its wide economic moat.“‘

I hope this would have been an interesting read – specially when it comes from an IBM developer who had been developing market leading software since over a decade.

Disclaimer: The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions

Determistic Vs Probabilistic Match

As explained in an earlier blog , Data matching finds records in a single data source or independent data sources that refer to the same entity (such as a person, organization, location, product, or material) even if there is no predetermined key. There are two common approaches to decide a match in data while comparing two similar records. They are deterministic match and probabilistic match.


Deterministic matching typically searches for a pool of candidate duplicates and then compares values found in specified attributes between all pairs of possible duplicates. It makes allowances for missing data. The results are given a score, and the scores are used to decide if the records should be considered the same or different. There is a gray area where the scores indicate uncertainty, and such duplicates are usually referred to a data steward for investigation and decision.

Probabilistic matching looks at specified attributes and checks the frequency that these attributes occur in the dataset before assigning scores. The scores are influenced by the frequencies of existing values found. A threshold can be assigned to decide whether it is a definite match or a clerical intervention of data steward is required to determine a match.

In Summary
Deterministic decisions tables:

  • Fields are compared
  • Letter grades are assigned
  • Combined letter grades are compared to a vendor-delivered file
  • Result: Match; Fail; Suspect

Probabilistic record linkage:

  • Fields are evaluated for degree of match
  • Weight is assigned and represents the information content by value.
  • Weights are summed to derive a total score.
  • Result: Statistical probability of a match

InfoSphere QualityStage can perform both deterministic matching and probabilistic record linkage, but uses probabilistic record linkage by default. The above example highlights the advantage of probabilistic matching.