I posted Part I of this discussion a couple of months back, where I discussed the role Information Server can play in Hadoop-based solutions. Many Hadoop advocates argue that Hadoop’s data-processing platform is an ideal place to handle data transformation, as it offers scalability and cost advantages over conventional ETL software and server infrastructure. Some go so far as to claim that the rise of Hadoop could bring the end of ETL. In this post I want to bring out some of the limitations of a purely Hadoop-based solution.
Limitations of Hadoop-based solutions:
While Hadoop and Hadoop-based solutions have their advantages when it comes to addressing big data volumes, Hadoop is not designed for data integration. Data integration carries its own unique requirements (such as supporting governance, metadata management, data quality and flexible data delivery styles) for success. In a recent paper, Gartner states “As use of the Hadoop stack continues to grow, organizations are asking if it is a suitable solution for data integration. Today, the answer is no. Not only are many key data integration capabilities immature or missing from the stack, but many have not been addressed in current projects.”
Hadoop is essentially a file system. It is a way to store big data so that it can be analyzed. The performance issues associated with using Hadoop for data processing are well known:
- Hadoop is written in Java, which is generally slower than frameworks written in C or C++.
- MapReduce/Hadoop File System lands data between reduce steps, which is a huge performance constraint.
- Hadoop File System centrally manages an index necessary to map tasks to the data distributed throughout the nodes, which is a documented bottleneck.
- Operations requiring data collocation to compute the result (joins, aggregations, sorts, deduplications and so on) will run inefficiently when the data distribution is different from the index.
- Hadoop is not a good choice when real-time, low-latency processing is required because there is no “real-time” version of Hadoop available.
- Job start-up is slow, which can be a big performance penalty, particularly for small, bursty jobs.
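To make the data collocation point a little more concrete, here is a rough sketch in plain Python (not Hadoop code). It assumes a toy cluster where a record's node is simply its key modulo the number of nodes; the `orders` data, the field names and the helper functions are all made up for illustration. It counts how many records would have to move between nodes before a join or aggregation on a different key could run locally.

```python
def node_for(key: int, num_nodes: int) -> int:
    """Toy placement rule (an assumption): node = key modulo node count."""
    return key % num_nodes


def records_to_move(records, stored_by, needed_by, num_nodes):
    """Count records whose current node differs from the node the operation needs."""
    moved = 0
    for rec in records:
        current = node_for(rec[stored_by], num_nodes)
        required = node_for(rec[needed_by], num_nodes)
        if current != required:
            moved += 1
    return moved


# Hypothetical data: orders stored by order_id, but joined on customer_id.
orders = [{"order_id": i, "customer_id": (i * 7) % 10} for i in range(1000)]

same_key = records_to_move(orders, "order_id", "order_id", num_nodes=4)
other_key = records_to_move(orders, "order_id", "customer_id", num_nodes=4)

print("moved when the keys match:", same_key)
print("moved when the keys differ:", other_key)
```

When the data is already distributed on the key the operation needs, nothing moves. When it is not, a share of the records has to be shipped across the network first, and that redistribution is where the inefficiency comes from.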
So where do we go from here? Stay tuned…
Previous Post: Information Server and Big Data Integration – I