Some of you may have noticed on my “About me” page that one of my most-viewed posts is IBM and Informatica Leads Gartner Magic Quadrant for Data Integration Tools 2013. I realize that many visitors want a comparison between IBM Information Server and Informatica. I have worked in the ETL domain for the last 13 years and have several publications and patents in it, so I thought I would venture a comparison of these two solutions. One may ask why I need to do this when Gartner already compares them. The answer is that Gartner takes many different factors into account, whereas I am basing my comparison only on the technical capabilities in key areas. I am also opening a dialog in which other practitioners who have worked with these two products can provide input, so that all readers (including me) can benefit.
In this post, I will focus on the scalability aspect of these two ETL solutions.
Scalability and Parallel Processing
Big data integration requires massive data scalability: the ability to process more data simply by adding more hardware.
- IBM’s Information Server is built on a shared-nothing, massively parallel processing architecture. The architecture places no built-in limit on throughput and performance. If you want to process more data, you just add hardware; you don’t change your application. You can refer to my earlier blog post describing Information Server parallel processing, which is much faster (10X to 15X) than processing with Hadoop.
- Informatica’s PowerCenter and Blaze can’t partition a large data set across the nodes of a cluster, grid, or MPP system, which is one of the fundamental architectural requirements for processing large data volumes. This means there is no support for running data integration logic in parallel across computing nodes, with the same logic running against separate data partitions. Because of this architectural limitation, the amount of data that you can sort, aggregate, transform, join, etc. is limited to what you can process on one node. So what does this mean?
- First, you can’t exploit commodity grid hardware and storage for processing Big Data. You have to buy expensive SMP servers and storage because the amount of data that you can sort, transform, or aggregate is limited to what you can process on one server.
- Second, you are forced to push big ETL workloads into parallel databases such as Netezza, DB2, Oracle, and Teradata.
- Finally, because you can’t run all complex data transformations in the parallel database, you have to live with dirty data that has not been cleansed.
Processing large data in a scalable manner requires the data to be partitioned across separate nodes so that a single job executes the same application logic against every partition. This is not possible with Informatica’s PowerCenter. So, to process large data sets, an Informatica customer has to either push the processing into the database (too expensive) or offload some of the work to Hadoop (too slow).
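To make the idea of "same logic, separate partitions" concrete, here is a minimal Python sketch. It is not how Information Server or PowerCenter is implemented; the function names (`partition_by_key`, `aggregate_partition`, `run_job`) and the sample rows are my own assumptions for illustration, and threads stand in for the separate nodes of a real shared-nothing cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib


def partition_by_key(rows, num_partitions):
    """Hash-partition (key, value) rows so equal keys land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        # crc32 is stable across runs, unlike Python's built-in hash() for strings
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


def aggregate_partition(partition):
    """The application logic: sum values per key. It runs unchanged on every partition."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


def run_job(rows, num_partitions):
    """Run the same aggregation against every partition, then combine the results."""
    partitions = partition_by_key(rows, num_partitions)
    # Hypothetical stand-in: each worker thread plays the role of one node
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(aggregate_partition, partitions))
    # A key never spans two partitions, so the partial results do not conflict
    merged = {}
    for partial in partial_results:
        merged.update(partial)
    return merged


# Made-up sample data for illustration only
rows = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]
result_2 = run_job(rows, num_partitions=2)
result_4 = run_job(rows, num_partitions=4)
```

Note that going from 2 to 4 partitions changes only the `num_partitions` argument; `aggregate_partition` is untouched. That is the property I mean when I say you add hardware without changing your application.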