Organizations exploring big data analytics, such as Apache Hadoop for data at rest or streaming technology for data in motion, face many of the same challenges as they do with other analytical environments. These challenges include determining where the information sources needed for analysis are located, how that information can be moved into the analytical environment, how it must be reformatted so that it becomes easier and more efficient to explore, and what data should be persisted to get quickly to the next level of analysis. Several data integration tools help organizations solve these challenges. I plan to write a few blog posts describing how Information Server helps in this big data world.
InfoSphere Information Server includes capabilities that organizations need to integrate the extreme volume, variety and velocity of big data from new and emerging sources. Here are some of them (not an exhaustive list); I will explain each in more detail as time permits.
Balanced optimization for Hadoop
When a data integration job includes a big data source, InfoSphere Information Server can now push the processing to the data. Using the same common set of InfoSphere DataStage stages and links to build the data integration logic, developers can choose to run the entire logic, or only portions of it, as a MapReduce job that executes directly on the Hadoop platform. When the sources and targets of the integration task are Hadoop data stores, this approach can yield significant performance gains as well as savings in network resource consumption, because the bulk of the data no longer has to leave the cluster to be processed.
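To make the network-savings argument concrete, here is a toy sketch in plain Python. It is not how InfoSphere DataStage generates or runs MapReduce code; the record layout, the function names and the "records transferred" counter are all assumptions made up for illustration. It only compares pulling every record to the integration engine against aggregating where the data lives.

```python
from collections import defaultdict

# Hypothetical records sitting on the Hadoop side: (region, amount).
RECORDS = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 8)]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for region, amount in records:
        yield region, amount


def reduce_phase(pairs):
    """Sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_on_engine(records):
    """Move every record to the integration engine, then aggregate there."""
    transferred = len(records)  # all raw records cross the network
    return reduce_phase(map_phase(records)), transferred


def run_pushed_down(records):
    """Aggregate next to the data; only the totals cross the network."""
    totals = reduce_phase(map_phase(records))
    transferred = len(totals)  # one row per key leaves the cluster
    return totals, transferred


if __name__ == "__main__":
    print("engine:", run_on_engine(RECORDS))
    print("pushed:", run_pushed_down(RECORDS))
```

Both paths produce the same totals; the difference is only in how many rows have to travel. With real data volumes the gap between raw records and aggregated results is what drives the savings described above.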
IBM InfoSphere Streams integration
For big data projects that focus on real-time analytical processing, IBM now offers direct data flow integration between InfoSphere Information Server and InfoSphere Streams to combine the power and reach of both platforms. With this feature, organizations can use standard data integration conventions to gather information from across the enterprise and pass that information to the real-time analytical processes. Similarly, when InfoSphere Streams finds records of insight, that data can now be passed directly to a running data-integration job and made available to data stores or applications across the information landscape, using the full depth and breadth of InfoSphere Information Server connectivity.
Big data job sequencing
InfoSphere Information Server now allows any InfoSphere BigInsights or Cloudera-certified Oozie-contained MapReduce job to be included in the job sequencer. This feature provides an end-to-end workflow across heterogeneous topologies, with steps executed in both InfoSphere Information Server and Hadoop.
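The idea of a single sequence spanning two platforms can be sketched with a toy sequencer in Python. This is a conceptual model only: the `Step` class, the platform labels and `run_sequence` are hypothetical names, and the sketch does not call the InfoSphere job sequencer or Oozie. Each step is just a callable that reports success or failure, and the sequence stops at the first failure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    platform: str              # illustrative label, e.g. "datastage" or "hadoop"
    run: Callable[[], bool]    # returns True on success, False on failure


def run_sequence(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for step in steps:
        if not step.run():
            break
        completed.append(step.name)
    return completed


if __name__ == "__main__":
    # Stand-in callables; a real step would launch a job and wait for its status.
    workflow = [
        Step("extract_customers", "datastage", lambda: True),
        Step("sessionize_clicks", "hadoop", lambda: True),
        Step("load_warehouse", "datastage", lambda: True),
    ]
    print(run_sequence(workflow))
```

The point of the sketch is that the sequencer treats a Hadoop step and an Information Server step the same way: it needs only a way to start each one and a success or failure signal to decide whether to continue.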
InfoSphere Information Server also supports big data-related governance features, such as impact analysis and data lineage, on any big data integration points, thus providing enterprises the ability to deliver on the promises of massively scalable analytics, without sacrificing organizational insight into the information infrastructure.