Information Server and Big Data Integration – II (Limitations of Hadoop)

I posted Part I of this discussion a couple of months back, where I discussed the role Information Server can play in Hadoop-based solutions. Many Hadoop advocates argue that Hadoop’s data-processing platform is an ideal place to handle data transformation, as it offers scalability and cost advantages over conventional ETL software and server infrastructure. Some go so far as to claim that the rise of Hadoop could bring the end of ETL. In this post I want to bring out some of the limitations of a purely Hadoop-based solution.

Limitations of Hadoop based solutions:
While Hadoop and Hadoop-based solutions have their advantages when it comes to addressing big data volumes, Hadoop is not designed for data integration. Data integration carries its own unique requirements (such as supporting governance, metadata management, data quality and flexible data delivery styles) for success. In a recent paper, Gartner states “As use of the Hadoop stack continues to grow, organizations are asking if it is a suitable solution for data integration. Today, the answer is no. Not only are many key data integration capabilities immature or missing from the stack, but many have not been addressed in current projects.”

Hadoop is, at its core, a distributed file system (HDFS) paired with a batch processing framework (MapReduce). It is a way to store big data so that it can be analyzed. The performance issues associated with using Hadoop for data processing are well known:

  • Hadoop is written in Java, which can be slower than frameworks written in C or C++.
  • MapReduce/Hadoop File System lands data on disk between reduce steps, which is a huge performance constraint.
  • Hadoop File System centrally manages an index necessary to map tasks to the data distributed throughout the nodes, a documented bottleneck.
  • Operations requiring data collocation to compute the result (joins, aggregations, sorts, deduplications and so on) will run inefficiently when the data distribution is different from the index.
  • Hadoop is not a good choice when real-time, low-latency processing is required because there is no “real-time” version of Hadoop available.
  • Job start-up is slow, which can be a big performance penalty, particularly for small, bursty jobs.
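
To put rough numbers on the last point, here is a minimal Python sketch of how a fixed start-up cost weighs on jobs of different sizes. The 30-second start-up figure is purely an assumption for illustration, not a measurement of any Hadoop cluster, and `startup_overhead_fraction` is a hypothetical helper, not part of any Hadoop API.

```python
def startup_overhead_fraction(startup_seconds: float, work_seconds: float) -> float:
    """Return the share of total run time spent on start-up rather than useful work."""
    total = startup_seconds + work_seconds
    if total <= 0:
        raise ValueError("total run time must be positive")
    return startup_seconds / total


# Assumed, illustrative start-up cost per job (not a measured value).
ASSUMED_STARTUP_SECONDS = 30.0

# Compare a small bursty job, a medium job, and a long batch job.
for work_seconds in (5.0, 60.0, 3600.0):
    fraction = startup_overhead_fraction(ASSUMED_STARTUP_SECONDS, work_seconds)
    print(f"{work_seconds:>6.0f}s of work -> {fraction:.0%} of run time is start-up")
```

Under these assumed numbers, start-up accounts for most of the run time of the 5-second job but only about 1% of the hour-long one, which is why the penalty falls hardest on small, bursty workloads.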

So where do we go from here? Stay tuned…

Previous Post: Information Server and Big Data Integration – I


7 thoughts on “Information Server and Big Data Integration – II (Limitations of Hadoop)”

  1. I understand that this post is a bit old now, but I am curious about the details behind the last two points. Could you please elaborate on why Hadoop is not a good choice for real-time systems, and on what makes job start-up slow?


    • Hi Amit,
      Here is my understanding. Hadoop, especially HDFS, was made for commodity hardware, predominantly for cold storage. It also replicates the data in many places. What would make it slow for real-time applications is the poor quality of the hardware (which cannot respond fast) and the need to replicate data before even processing it.

      • You mention two points: poor hardware and data replication. I am not sure it would be right to state that. Those points would apply to any distributed storage and computing engine, such as Apache Spark or Apache Ignite. I believe the reason it is not a good choice for real-time systems has something to do with its internal architecture. I do not know what that is, though.

  2. Another aspect is that the minimum time taken for a task/job to run on a Hadoop cluster is more than a minute. For every task it starts the JVM, does the task, serializes the result (which means I/O overhead) and then stops the JVM. This makes real-time analytics very slow. On the other hand, frameworks like Spark do not have such a limitation. They need not start the JVM for every call, and their batch window for new data can be set to something as low as 1 second.

    • Don’t you need an external component like a Spark Job Server to be ready to accept jobs for execution? Spark out of the box doesn’t provide this capability.

  3. The single most defining attribute of Spark is its departure from slow I/O operations against spinning disks and hard drives. For traditional Hadoop/MapReduce, the slow performance of disks for frequent read and write operations is a significant, speed-hampering bottleneck. MapReduce jobs are very much dependent on high throughput of read & write requests; every time the system needs to perform a look-up or a commit to a spinning disk, there is some latency associated with that process. Over time, the cumulative effect of this bottleneck on performance can be significant; more importantly, it limits the types of use cases and data streams that traditional Hadoop distributions have been able to work with.
