Before I move on to a different series (hopefully on Quality Stage), I thought the Big Data series would not be complete until I covered Hadoop. So here are some things about Hadoop…
What is Hadoop?
Hadoop is a top-level Apache project in the Apache Software Foundation that’s written in Java. For all intents and purposes, we can think of Hadoop as a computing environment built on top of a distributed clustered file system that was designed specifically for very large-scale data operations.
Hadoop was inspired by Google’s work on its Google File System (GFS) and the MapReduce programming paradigm, in which work is broken down into mapper and reducer tasks to manipulate data that is stored across a cluster of servers for massive parallelism. Hadoop has made it practical to apply this approach to a much wider set of use cases. Unlike transactional systems, Hadoop is designed to scan through large data sets to produce its results through a highly scalable, distributed batch processing system.
Hadoop is not about speed-of-thought response times, real-time warehousing, or blazing transactional speeds; it is about discovery and making the once near-impossible possible from a scalability and analysis perspective. The Hadoop methodology is built around a function-to-data model as opposed to data-to-function; in this model, because there is so much data, the analysis programs are sent to the data.
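To make the mapper/reducer breakdown concrete, here is a minimal word-count sketch in plain Python. It is purely conceptual (no Hadoop involved); the function names and sample data are my own, but the structure mirrors the paradigm: a map phase emits key/value pairs, a shuffle/sort step groups them by key, and a reduce phase aggregates each group.

```python
# Conceptual sketch of MapReduce in plain Python -- illustrative only,
# not Hadoop's actual API.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: sum all the counts emitted for one word.
    return (word, sum(counts))

def map_reduce(lines):
    # Shuffle/sort step: group the mapper output by key before reducing.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return [reducer(word, (count for _, count in group))
            for word, group in groupby(pairs, key=itemgetter(0))]

print(map_reduce(["big data big cluster", "big data"]))
# -> [('big', 3), ('cluster', 1), ('data', 2)]
```

In a real Hadoop job, the mappers and reducers run on the cluster nodes that already hold the data blocks, which is exactly the function-to-data model described above.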
Use cases where Hadoop is not suitable:
– With Hadoop, massively parallel processing delivers great throughput; however, it is a batch system handling massive quantities of data, so response times are not immediate.
– As of Hadoop version 0.20.2, updates are not possible, but appends will be possible starting in version 0.21.
– Hadoop is not suitable for OnLine Transaction Processing (OLTP) workloads, where structured data is accessed randomly, as in a relational database.
– Hadoop is not suitable for OnLine Analytical Processing (OLAP) or Decision Support System (DSS) workloads, where structured data is accessed sequentially, as in a relational database, to generate reports that provide business intelligence.
– It is NOT a replacement for a relational database system.
Components of Hadoop:
Hadoop is generally seen as having two parts: a file system (the Hadoop Distributed File System) and a programming paradigm (MapReduce). One of the key components of Hadoop is the redundancy built into the environment. Not only is the data redundantly stored in multiple places across the cluster, but the programming model is such that failures are expected and are resolved automatically by running portions of the program on various servers in the cluster. Due to this redundancy, it’s possible to distribute the data and its associated programming across a very large cluster of commodity components. It is well known that commodity hardware components will fail (especially when we have very large numbers of them), but this redundancy provides fault tolerance and a capability for the Hadoop cluster to heal itself. This allows Hadoop to scale out workloads across large clusters of inexpensive machines to work on Big Data problems.
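The redundancy idea can be sketched in a few lines of Python. This is an illustrative toy, not HDFS's actual placement algorithm (which is rack-aware), but it shows the principle: with each block replicated on several nodes, no single node failure loses data. The function names, node names, and round-robin placement are my own assumptions for the sketch; the replication factor of 3 is HDFS's default.

```python
# Toy illustration of block replication -- NOT the real HDFS placement
# algorithm, just the redundancy principle it relies on.
REPLICATION = 3  # HDFS's default replication factor

def place_blocks(blocks, nodes, replication=REPLICATION):
    # Assign each block to `replication` distinct nodes, round-robin style.
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)]
                            for r in range(replication)]
    return placement

def surviving_blocks(placement, failed_node):
    # Blocks still readable after one node fails: any surviving replica
    # is enough to serve the block.
    return [block for block, replicas in placement.items()
            if any(node != failed_node for node in replicas)]

nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_0", "blk_1", "blk_2"]
placement = place_blocks(blocks, nodes)

# With 3 replicas spread over 4 nodes, losing any one node loses no block.
assert surviving_blocks(placement, "node2") == blocks
```

In the real system, the NameNode detects the failed DataNode via missed heartbeats and re-replicates the under-replicated blocks onto healthy nodes, which is the "self-healing" behavior described above.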
There are a number of Hadoop-related projects. Some of the notable ones include: Apache Avro (for data serialization), Cassandra and HBase (databases), Chukwa (a monitoring system specifically designed with large distributed systems in mind), Hive (provides ad hoc SQL-like queries for data aggregation and summarization), Mahout (a machine learning library), Pig (a high-level Hadoop programming language that provides a data-flow language and execution framework for parallel computation), ZooKeeper (provides coordination services for distributed applications), YARN (the traffic cop; manages the system’s resources), Ambari (the manager; software to provision, manage, and monitor the cluster), Solr/Lucene (index and search) and more.