In an earlier blog, I described what Hadoop is. In the last two parts of this series, I mentioned some of the limitations of Hadoop. That should not discount the potential Hadoop has. So in this blog I will share, in simple terms, what Hadoop is and what role it can play for enterprises that are testing the waters in the Big Data world.
An example to illustrate the potential of Hadoop:
Imagine you have a jar of multicolored candies, and you need the count of blue candies relative to red and yellow ones. You could empty the jar onto a plate, sift through them and tally up your answer. If the jar held only a few hundred candies, this process would take only a few minutes.
Now imagine you have four plates and four helpers. You pour out about one-fourth of the candies onto each plate. Each helper sifts through their own share and counts it, and the partial counts are combined into a total. Isn't it much faster?
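The candy example maps directly onto Hadoop's MapReduce model: each helper runs a "map" step over its own partition of the data, and the partial tallies are combined in a "reduce" step. Here is a minimal sketch of that idea in plain Python; the variable names and the candy data are illustrative only, not part of any Hadoop API:

```python
from collections import Counter
from functools import reduce

# The "jar" of candies, split across four "plates" (data partitions).
jar = ["blue", "red", "yellow", "blue", "red", "blue", "yellow", "red"] * 100
plates = [jar[i::4] for i in range(4)]  # roughly one-fourth each

# Map step: each helper tallies only their own plate, in parallel in principle.
partial_counts = [Counter(plate) for plate in plates]

# Reduce step: combine the partial tallies into one overall count.
total = reduce(lambda a, b: a + b, partial_counts)

print(total["blue"])                         # count of blue candies: 300
print(total["blue"] / sum(total.values()))   # blue relative to all: 0.375
```

Hadoop applies the same divide-count-combine pattern, except the "plates" are disks on different servers and the "helpers" are tasks running on those servers.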
That is what Hadoop does for data. Hadoop is an open-source software framework for running applications on large clusters of commodity hardware. Hadoop delivers enormous processing power – the ability to handle virtually limitless concurrent tasks and jobs – making it a remarkably low-cost complement to a traditional enterprise data infrastructure.
Enterprises are using Hadoop for several notable merits:
• Hadoop is distributed. Putting a high-tech twist on the adage "many hands make light work," Hadoop stores data on the local disks of a distributed cluster of servers and brings the processing to the data.
• Hadoop runs on commodity hardware. Based on the average cost per terabyte of compute capacity, a prepackaged Hadoop system is easily 10 times cheaper than comparable specialized hardware.
• Hadoop is fault-tolerant. Hardware failure is expected and is mitigated by data replication and speculative processing. If capacity is available, Hadoop runs multiple copies of the same task, accepting the results from the task that finishes first.
• Hadoop does not require a predefined data schema. A key benefit of Hadoop is the ability to just upload any unstructured files without having to “schematize” them first. You can dump any type of data into Hadoop and allow the consuming programs to determine and apply structure when necessary.
• Hadoop scales to handle big data. Hadoop clusters can scale to between 6,000 and 10,000 nodes and handle more than 100,000 concurrent tasks and 10,000 concurrent jobs. Yahoo! runs thousands of clusters and more than 42,000 Hadoop nodes storing more than 200 petabytes of data. LinkedIn serves more than 1 billion personalized recommendations every week using Hadoop's MapReduce and HDFS features. Facebook keeps track of 1 billion user profiles, along with related data such as posts, comments, images, and videos, using Hadoop.
• Hadoop is fast. In a performance test, a 1,400-node cluster sorted a terabyte of data in 62 seconds. To put that in context, 10 terabytes could store the entire US Library of Congress print collection.
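The "no predefined schema" point above is often called schema-on-read: raw files are dumped into storage as-is, and structure is applied only when a program consumes them. A small Python sketch of the idea follows; the log format and field names are invented for illustration and are not any Hadoop API:

```python
# Raw, unstructured lines stored as-is -- no schema was required up front.
raw_lines = [
    "2014-03-01 login user=alice",
    "2014-03-01 purchase user=bob amount=19.99",
    "malformed line with no structure",
    "2014-03-02 login user=carol",
]

def parse_event(line):
    """Schema-on-read: the consuming program decides how to interpret each line."""
    parts = line.split()
    if len(parts) < 3 or "=" not in parts[2]:
        return None  # skip records that don't fit this reader's schema
    record = {"date": parts[0], "event": parts[1]}
    for field in parts[2:]:
        key, _, value = field.partition("=")
        record[key] = value
    return record

# Structure is imposed at read time; a different program could parse differently.
events = [e for e in (parse_event(line) for line in raw_lines) if e is not None]
logins = [e for e in events if e["event"] == "login"]
print(len(logins))  # 2
```

A different consumer could read the very same files with a different parser, which is exactly the flexibility the bullet describes.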
Hadoop handles big data. It does it fast. It redefines the possible when it comes to analyzing large volumes of data, particularly semi-structured and unstructured data (text). In my upcoming blogs on IBM InfoSphere BigInsights, I will share how tools built on top of Hadoop help enterprise users unlock its potential.