In my earlier blog, I explored why enterprises are using Hadoop. In summary, scalable data platforms such as Hadoop offer unparalleled cost benefits and analytical opportunities (including content analytics) to enterprises. In this blog, I will mention some of the enhancements in IBM's InfoSphere Information Server 11.5 that help leverage the scale and promise of Hadoop.
Data integration in Hadoop:
In this release, Information Server can execute directly inside a Hadoop cluster. This means that all of the data connectivity, transformation, cleansing, enhancement, and data delivery features that thousands of enterprises have relied on for years can be immediately available to run within the Hadoop platform! Information Server is a market-leading product in terms of its data integration and governance capabilities. Now the same product can be used to solve some of the industry's most complex data challenges directly inside a Hadoop cluster. Imagine the time saved by not moving the data back and forth from HDFS!
Even better, these new features for Hadoop use the same simple graphical design environment that IBM clients are already accustomed to using for building integration applications. In other words, organizations can build new Hadoop-based, information-intensive applications without retraining their development teams on newly emerging languages that require manual hand coding and lack governance support.
How is this accomplished? YARN!
Apache Hadoop YARN is the framework for job scheduling and cluster resource management. Information Server can communicate with YARN to run a job on the data nodes of a Hadoop cluster using the following steps.
- Step 1
- The conductor process manages the section leader and player processes that run on the InfoSphere Information Server engine. The conductor process on the engine tier receives a job run request for an InfoSphere DataStage or InfoSphere QualityStage job. This job might be generated from an InfoSphere Information Analyzer analysis.
- Step 2
- The conductor connects to the YARN client, which assigns an Application Master to the job from the pool of available Application Masters that it maintains. If no Application Master is available in the pool, the client starts a new one for this job. The conductor then connects to the Application Master and sends the details of the resources that are required to run the job.
- Step 3
- The Application Master requests resources from the YARN resource manager. The job's processes run in YARN containers, with each container running a section leader and its players. The YARN container designates resource requirements such as CPU and memory. When the resources are allocated, the conductor sends the process commands for the section leader to the Application Master, which starts those commands on the allocated resources.
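To make the three steps above a little more concrete, here is a minimal toy model of the hand-off in Python. It is only a sketch of the flow as described in this post, not Information Server or YARN code: every class, method, and command string in it (`YarnClient`, `ApplicationMaster`, `ResourceManager`, `run_job`, `section_leader --partition N`) is a hypothetical name made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Toy YARN container: a CPU/memory grant hosting one section leader and its players."""
    cpu_cores: int
    memory_mb: int
    commands: list = field(default_factory=list)


class ResourceManager:
    """Toy stand-in for the YARN resource manager; grants containers while capacity lasts."""

    def __init__(self, total_cores, total_memory_mb):
        self.free_cores = total_cores
        self.free_memory_mb = total_memory_mb

    def allocate(self, cpu_cores, memory_mb):
        if cpu_cores > self.free_cores or memory_mb > self.free_memory_mb:
            raise RuntimeError("not enough cluster resources")
        self.free_cores -= cpu_cores
        self.free_memory_mb -= memory_mb
        return Container(cpu_cores, memory_mb)


class ApplicationMaster:
    """Toy Application Master: obtains containers, then starts commands in them."""

    def __init__(self, am_id, resource_manager):
        self.am_id = am_id
        self.resource_manager = resource_manager
        self.containers = []

    def request_resources(self, count, cpu_cores, memory_mb):
        # Step 3: ask the resource manager for one container per section leader.
        self.containers = [
            self.resource_manager.allocate(cpu_cores, memory_mb) for _ in range(count)
        ]

    def start(self, section_leader_commands):
        # Step 3 (continued): start the conductor's commands on the allocated containers.
        for container, command in zip(self.containers, section_leader_commands):
            container.commands.append(command)


class YarnClient:
    """Toy YARN client that keeps a pool of idle Application Masters."""

    def __init__(self, resource_manager, pool_size):
        self.resource_manager = resource_manager
        self.started = 0
        self.pool = [self._new_master() for _ in range(pool_size)]

    def _new_master(self):
        self.started += 1
        return ApplicationMaster(self.started, self.resource_manager)

    def assign_master(self):
        # Step 2: reuse a pooled Application Master, or start a new one if the pool is empty.
        return self.pool.pop() if self.pool else self._new_master()


def run_job(client, partitions, cpu_cores=1, memory_mb=1024):
    """Conductor side (step 1 onward): a job run request arrives and is handed to YARN."""
    master = client.assign_master()
    master.request_resources(partitions, cpu_cores, memory_mb)
    # Hypothetical command strings; the real section leader commands are not shown in this post.
    master.start([f"section_leader --partition {i}" for i in range(partitions)])
    return master
```

For example, with `client = YarnClient(ResourceManager(4, 8192), pool_size=1)`, the first `run_job(client, 2)` reuses the pooled Application Master, while a second call finds the pool empty and starts a new one. This toy never returns masters or containers to the pool; how the real product recycles them is not covered here.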
More details can be found here.
A few other capabilities offered by InfoSphere Information Server on Hadoop include:
- InfoSphere Information Analyzer features are now supported, executing directly inside a Hadoop cluster.
- Hadoop metadata management is made easy
- HDFS file connectivity including new data formats, additional character sets and additional data types
- Support for Kerberos-enabled clusters
- YARN job browser