In Part1 of this series I mentioned the role an ETL tool can play in the world of Hadoop. In Part2, we discussed some of the technical limitations of Hadoop. In Part3, based on my recent readings, we will discuss more on how Hadoop cannot play a part of a Data Integration Solution independently. This may come as a surprise to some of the Hadoop proponents as they see Hadoop projects performing extract, transform and load workstreams. Although these serve a purpose, the technology lacks the necessary key features and functions of commercially-supported data integration tools. Here are a few…
- Data integration requires a method for rationalizing inconsistent semantics, which helps developers rationalize various sources of data (depending on some of the metadata and policy capabilities that are entirely absent from the Hadoop stack).
- Data quality is a key component of any appropriately governed data integration project. The Hadoop stack offers no support for this, other than the individual programmer’s code, one data element at a time, or one program at a time.
- Because Hadoop workstreams are independent — and separately programmed for specific use cases — there is no method for relating one to another, nor for identifying or reconciling underlying semantic differences.
- Purely Hadoop-based approach to data integration will require custom code and higher costs, which demands specialized skills and ongoing effort to maintain and change.
- Data integration projects requires good governance principles, and select technologies that support the application of the required policies and procedures. This is not addressed in Hadoop based projects as of now.
Not only are many key data integration capabilities immature or missing from the Hadoop stack, but many have not been addressed in current projects.
Disclaimer: “The postings on this site are my own (based on my readings and interpretations) and don’t necessarily represent IBM’s positions, strategies or opinions.”