DataStage job run time architecture on Hadoop

In my earlier blog, I explored why enterprises are using Hadoop. In summary, scalable data platforms such as Hadoop offer unparalleled cost benefits and analytical opportunities (including content analytics) to enterprises. In this blog, I will cover some of the enhancements in IBM's InfoSphere Information Server 11.5 that help leverage the scale and promise of Hadoop.

Data integration in Hadoop:
In this release, Information Server can execute directly inside a Hadoop cluster. This means that all of the data connectivity, transformation, cleansing, enhancement, and data delivery features that thousands of enterprises have relied on for years are immediately available to run within the Hadoop platform! Information Server is a market-leading product in terms of its data integration and governance capabilities. Now the same product can be used to solve some of the industry's most complex data challenges directly inside a Hadoop cluster. Imagine the time saved in moving the data back and forth from HDFS!

Even better, these new features for Hadoop use the same simple graphical design environment that IBM clients are already accustomed to for building integration applications. In other words, organizations can build new Hadoop-based, information-intensive applications without retraining their development team on newly emerging languages that require manual hand coding and lack governance support.

How is this accomplished? YARN! 
Apache Hadoop YARN is the framework for job scheduling and cluster resource management. Information Server can communicate with YARN to run a job on the data nodes of a Hadoop cluster.

Here is more detail on how Information Server uses YARN:


  1. A job is submitted to run in the Information Server engine.
  2. The ‘Conductor’ (the process responsible for coordinating the job) asks YARN to instantiate the YARN version of the Conductor: the Application Master.
  3. The YARN Client is responsible for starting and stopping Application Masters.
  4. Now that the Application Master is ready, ‘Section Leaders’ (responsible for work on a data node) are prepared.
  5. Section Leaders are created and managed by YARN Node Managers. This is the point where the BigIntegrate/BigQuality binaries are copied to the Hadoop DataNode if they do not already exist there.
  6. Now the real work can begin: the ‘players’ (the processes that actually do the work) are started.

All of this is automatic and behind the scenes. The user interface looks and feels identical to running a job on Windows, AIX, or Linux.
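The process hierarchy in the steps above can be sketched as a toy model. This is purely illustrative (the class and function names are my own, not Information Server internals), but it shows how the Conductor's work fans out to Section Leaders on each data node and then to players:

```python
# Illustrative sketch of the run time hierarchy, NOT the actual
# Information Server implementation.

class Player:
    """Runs one operator instance on a partition of the data (step 6)."""
    def __init__(self, name):
        self.name = name

class SectionLeader:
    """Coordinates the players on a single Hadoop data node (steps 4-5)."""
    def __init__(self, node, degree):
        self.node = node
        self.players = [Player(f"{node}-player-{i}") for i in range(degree)]

class ApplicationMaster:
    """YARN-side counterpart of the Conductor (steps 2-3); asks Node
    Managers to start a Section Leader on each data node."""
    def __init__(self, data_nodes, degree_per_node):
        self.section_leaders = [SectionLeader(n, degree_per_node)
                                for n in data_nodes]

def run_job(data_nodes, degree_per_node=2):
    # Step 1: the job is submitted; YARN instantiates the Application Master.
    am = ApplicationMaster(data_nodes, degree_per_node)
    # Steps 4-6: Section Leaders start players; return the total player count.
    return sum(len(sl.players) for sl in am.section_leaders)

print(run_job(["dn1", "dn2", "dn3"]))  # 6 players across 3 data nodes
```

The point of the model: parallelism scales with the number of data nodes, and all of this orchestration is hidden behind the same job design the developer already uses.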



IBM Stewardship Center

Need for IBM Stewardship Center in Data Curation: 

Managing data quality requires the joint effort of business and IT. The business defines the information policies that govern data quality for an organization. Based on these policies, the IT team implements rules so that any deviations in data quality can be reported for the business to review. For example, a bank may have a policy that an account holder's age must be greater than 18. During data load, an ETL tool can profile the data to check how many records violate this rule. These records then need to be shared with the business (non-technical domain experts called Data Stewards), who can take appropriate action to fix the issue. As data stewards become increasingly responsible for improving the value of their data assets, they need capabilities that help them manage new requirements like:

  • Collaborating across multiple lines of business to build information policies that support regulatory requirements
  • Assessing the cost of poor data quality and managing such data quality issues to closure
  • Engaging subject matter experts through business processes to review and approve corporate glossary changes
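The bank example above can be sketched in a few lines. This is a hypothetical illustration (the record layout and field names are my own), showing how a simple rule check produces the exception records a data steward would review:

```python
# Hypothetical sketch of the bank policy check: flag records that violate
# the "account holder's age must be greater than 18" rule so a data
# steward can review them. Record layout is illustrative.

accounts = [
    {"id": 1, "name": "Alice", "age": 34},
    {"id": 2, "name": "Bob",   "age": 16},   # violates the policy
    {"id": 3, "name": "Carol", "age": None}, # missing data also needs review
]

def violates_age_policy(record, minimum_age=18):
    # Treat missing ages as violations so they are routed for review too.
    age = record.get("age")
    return age is None or age <= minimum_age

exceptions = [r for r in accounts if violates_age_policy(r)]
print([r["id"] for r in exceptions])  # [2, 3]
```

In practice the rule lives in the profiling/quality tool rather than hand-written code, and the exception records flow to the steward's work queue rather than a list.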


IBM Stewardship Center is a powerful browser-based interface that helps bridge the gap between business and IT, providing a central location for users to collaborate on data governance and manage data quality issues. Stewardship Center is built on an open event management infrastructure, which makes it possible to integrate Information Server-based stewardship seamlessly into your existing stewardship solutions and collaboration environments.

IBM Stewardship Center leverages the strengths of IBM® Business Process Manager to offer solutions to these challenges that can be used immediately or can be customized or extended to suit the specific needs of your organization. The capabilities that are provided with IBM Stewardship Center are divided into three categories: data quality exception management, governance and workflow notification, and performance metrics.
IBM Stewardship Center includes these components:

  • The Data Quality Exception sample process application, a sample workflow for resolving data quality issues that can be customized or extended.
  • The Stewardship Center Application Toolkit, which can be used to extend the capabilities of the Data Quality Exception sample process application or to create your own custom workflows.
  • The Email Notification process application, which can be used to notify users by email when governance events are generated in Information Governance Catalog.
  • The Governance Rule Approval process application, which can be used to manage approvals for information governance rules from Information Governance Catalog.

For more information, see Overview of IBM Stewardship Center.
For a video see Tech Talk: Stewardship Center.

DataStage Best Practices – 3

1. Avoid unnecessary type conversions.
Set the OSH_PRINT_SCHEMAS environment variable to verify that run time schemas match the job design column definitions. If you are using stage variables on a Transformer stage, ensure that their data types match the expected result types.

2. Use Transformer stages sparingly and wisely.
Transformer stages can slow down your job. Do not use multiple Transformer stages where the functionality could be incorporated into a single stage, and use other stage types to perform simple transformation operations.

3. Increase Sort performance where possible.
Careful job design can improve the performance of sort operations, both in standalone Sort stages and in on-link sorts specified in the Inputs page Partitioning tab of other stage types.

4. Remove Unneeded Columns.
Remove unneeded columns as early as possible within the job flow. Every additional unused column requires additional buffer memory, which can impact performance and make each row transfer from one stage to the next more expensive. If possible, when reading from databases, use a select list to read just the columns required, rather than the entire table.

5. Avoid reading from sequential files using the Same partitioning method.
Unless you have specified more than one source file, this will result in the entire file being read into a single partition, making the entire downstream flow run sequentially unless you explicitly re-partition.

6. The individual SQL statements required by a Sparse Lookup are an expensive operation from a performance perspective. In most cases, it is faster to use a DataStage Join stage between the input and DB2 reference data than to perform a Sparse Lookup.

7. For scenarios where the number of input rows is significantly smaller (1:100 or more) than the number of reference rows in a DB2 or Oracle table, a Sparse Lookup may be appropriate. CPU intensive applications, which typically perform multiple CPU demanding operations on each record, benefit from the greatest possible parallelism, up to the capacity supported by your system.
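The 1:100 rule of thumb from items 6 and 7 can be captured in a small decision helper. This is a sketch of the heuristic only (the ratio is the guideline quoted above, not a hard limit, and real decisions should be confirmed by measurement):

```python
# Rule-of-thumb helper: a Sparse Lookup issues one SQL statement per
# input row, so it only tends to win when the reference table is far
# larger than the input (the 1:100 guideline above). Otherwise a Join
# stage's single bulk read of the reference data is usually faster.

def prefer_sparse_lookup(input_rows, reference_rows, ratio=100):
    """Return True when the reference data is at least `ratio` times
    larger than the input stream."""
    return reference_rows >= input_rows * ratio

print(prefer_sparse_lookup(1_000, 1_000_000))   # True  (1:1000)
print(prefer_sparse_lookup(50_000, 1_000_000))  # False (1:20)
```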

8. Parallel jobs with large memory requirements can benefit from parallelism if they act on data that has been partitioned and if the required memory is also divided among partitions.

9. Applications that are disk- or I/O-intensive, such as those that extract data from and load data into RDBMS’s, benefit from configurations in which the number of logical nodes equals the number of disk spindles being accessed. For example, if a table is fragmented 16 ways inside a database or if a data set is spread across 16 disk drives, set up a node pool consisting of 16 processing nodes.

10. Turn off Runtime Column Propagation wherever it is not required.


DataStage Best Practices – 1

In this series, I wish to share some of the best practices that I have come across or learnt from my peers in using DataStage. I hope this will be helpful for DataStage practitioners. Here are the links to the Best Practice 2 and Best Practice 3 blogs, which continue this one.

The following are best practices:

a. There should be no Network bottleneck between Source -> ETL -> Target.
– Typically this means Private Network connection using dedicated switches.
– It also means proper capacity planning in terms of network bandwidth for the network card as well as switch capacity.

b. There should be no bottleneck within the Source/Target System/Application to provide/consume data. ETL server can only process as fast as what the source can provide or the target can consume.
How fast can the ETL process the data in this example?
– Source can provide data at 10K rows / sec
– ETL can handle data at 100K rows / sec
– Target can consume data at 25K rows / sec
The answer is 10K rows / sec: the end-to-end rate is capped by the slowest link in the chain, here the source.
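The bottleneck arithmetic for this example is simply the minimum over the three stages:

```python
# End-to-end throughput is bounded by the slowest link in the
# source -> ETL -> target chain (rates from the example above).

rates = {"source": 10_000, "etl": 100_000, "target": 25_000}  # rows/sec

bottleneck = min(rates, key=rates.get)
effective_rate = rates[bottleneck]
print(bottleneck, effective_rate)  # source 10000
```

Even though the ETL engine can handle 100K rows/sec, the pipeline as a whole never exceeds what the source can supply.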

c. There should be no I/O bottleneck within the ETL Server.

d. There should be proper Capacity planning to cater for growth.

e. There should be proper job design to ensure job scalability as the hardware scales. You can get some information on job design here.

f. Always use dedicated server or at least “dedicated CPU” if virtualization is required.

g. When there is a bottleneck in the source or target, use fewer nodes in the configuration file. This helps improve performance and reduce resource usage.

h. You should run the “right” number of jobs to ensure there is no system/process overload and no resources wasted managing those processes.

Here are some of the good reads:

  1. Architecture and Deployments
  2. Redbook on Deployment Architecture
  3. IBM InfoSphere Information Server Installation and Configuration Guide
  4. Managing and Deleting Persistent Data Sets within IBM InfoSphere Datastage

Deterministic vs. Probabilistic Match

As explained in an earlier blog, data matching finds records in a single data source or independent data sources that refer to the same entity (such as a person, organization, location, product, or material) even if there is no predetermined key. There are two common approaches to deciding a match when comparing two similar records: deterministic matching and probabilistic matching.


Deterministic matching typically searches for a pool of candidate duplicates and then compares values found in specified attributes between all pairs of possible duplicates. It makes allowances for missing data. The results are given a score, and the scores are used to decide if the records should be considered the same or different. There is a gray area where the scores indicate uncertainty, and such duplicates are usually referred to a data steward for investigation and decision.

Probabilistic matching looks at specified attributes and checks how frequently each value occurs in the dataset before assigning scores. The scores are influenced by the frequencies of the existing values found. A threshold can be set to decide whether a pair is a definite match or whether clerical intervention by a data steward is required.

In summary:
Deterministic decision tables:

  • Fields are compared
  • Letter grades are assigned
  • Combined letter grades are compared to a vendor-delivered file
  • Result: Match; Fail; Suspect

Probabilistic record linkage:

  • Fields are evaluated for degree of match
  • Weights are assigned, representing the information content of each value
  • Weights are summed to derive a total score
  • Result: statistical probability of a match

InfoSphere QualityStage can perform both deterministic matching and probabilistic record linkage, but uses probabilistic record linkage by default. The frequency-based weighting described above is the key advantage of probabilistic matching.
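The contrast between the two approaches can be sketched in miniature. This is an illustrative toy model, not QualityStage's actual algorithms; the field names, grading scheme, and weighting formula are my own simplifications:

```python
# Toy contrast of deterministic grading vs. frequency-weighted
# probabilistic scoring (illustrative only).

from collections import Counter

def deterministic_grade(a, b, fields):
    """Letter-grade each field comparison, then decide from the grades."""
    grades = ["A" if a[f] == b[f] else "F" for f in fields]
    if grades.count("A") == len(fields):
        return "Match"
    return "Suspect" if "A" in grades else "Fail"

def probabilistic_score(a, b, fields, population):
    """Weight each agreeing field by how rare its value is: agreement on
    a rare surname carries more information than on a common one."""
    score = 0.0
    for f in fields:
        if a[f] == b[f] and a[f]:
            freq = Counter(r[f] for r in population)[a[f]]
            score += len(population) / freq  # rarer value -> higher weight
    return score

people = [
    {"last": "Smith", "city": "NYC"},
    {"last": "Smith", "city": "LA"},
    {"last": "Zorn",  "city": "NYC"},
    {"last": "Zorn",  "city": "NYC"},
]
print(deterministic_grade(people[2], people[3], ["last", "city"]))  # Match
# Agreement on the rare surname "Zorn" scores higher than on "Smith":
print(probabilistic_score(people[2], people[3], ["last", "city"], people) >
      probabilistic_score(people[0], people[1], ["last", "city"], people))  # True
```

The deterministic grader treats every field agreement equally, while the probabilistic scorer rewards agreement on rare values, which is exactly what the frequency analysis in the description above provides.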

Information Server and Big Data Integration – III (Limitations of Hadoop contd.)

In Part 1 of this series I mentioned the role an ETL tool can play in the world of Hadoop. In Part 2, we discussed some of the technical limitations of Hadoop. In Part 3, based on my recent readings, we will discuss why Hadoop cannot serve as a data integration solution on its own. This may come as a surprise to some Hadoop proponents, as they see Hadoop projects performing extract, transform, and load workstreams. Although these serve a purpose, the technology lacks key features and functions of commercially supported data integration tools. Here are a few…

  • Data integration requires a method for rationalizing inconsistent semantics, which helps developers rationalize various sources of data (depending on some of the metadata and policy capabilities that are entirely absent from the Hadoop stack).
  • Data quality is a key component of any appropriately governed data integration project. The Hadoop stack offers no support for this, other than the individual programmer’s code, one data element at a time, or one program at a time.
  • Because Hadoop workstreams are independent — and separately programmed for specific use cases — there is no method for relating one to another, nor for identifying or reconciling underlying semantic differences.
  • A purely Hadoop-based approach to data integration requires custom code and incurs higher costs, demanding specialized skills and ongoing effort to maintain and change.
  • Data integration projects require good governance principles and technologies that support the application of the required policies and procedures. As of now, this is not addressed in Hadoop-based projects.

Concluding Remarks:
Not only are many key data integration capabilities immature or missing from the Hadoop stack, but many have not been addressed in current projects.

Disclaimer: “The postings on this site are my own (based on my readings and interpretations) and don’t necessarily represent IBM’s positions, strategies or opinions.”

Just Arrived – InfoSphere Information Server v11.3

IBM today announced InfoSphere Information Server v11.3 and InfoSphere Data Replication v11.3.

Here are some of the InfoSphere Information Server V11.3  highlights:

  • The new InfoSphere Information Governance Catalog (replacing InfoSphere Business Information Exchange) to provide a one-stop-shop for business and technical metadata
  • New on-premises data integration for cloud environments – providing integration with Amazon S3 and with distributed Database as a Service (DBaaS) offerings, including IBM Cloudant (via new REST-based service support)
  • New, deeper integration for MDM projects – via an MDM Stage to make it easy to load data into and extract data out of InfoSphere MDM (and benefit from peak scale and performance by using bulk API loading). The MDM Integration Stage is supported by a metadata-driven design, which means that metadata captured by the MDM Integration Stage will align with the MDM data model, enabling multiple segments of data to be sent in a single request.
  • Enhancements for InfoSphere Data Click – including native high-speed load to move data into InfoSphere BigInsights, integration with Cloud environments, Information Governance Catalog Integration (so that users may now search for information through the metadata available in InfoSphere Information Server Governance Catalog and launch directly into Data Click to integrate that data), and more
  • Governance Dashboard: The InfoSphere Information Governance Dashboard and its companion SQLViews (a fully described query layer to key metadata) are now included as part of the integrated Information Server installer, and reports have been expanded to include analysis of critical business elements that lack a data steward owner. Additionally, Information Server now includes entitlement to IBM Cognos BI components to support the organizational rollout of your data governance program to both IT and line of business users.
  • New ‘clerical pairs’ are now available in the Data Quality Console to better support matching stewardship.
  • New performance optimization for data quality – with new, faster algorithms for multi-column primary key detection, the ability to take full advantage of parallelism in column analysis, bulk-load write to the analysis database, and optimized client/server communication to reduce network traffic.
  • New operational quality rules – these pre-built data rules introspect Operational Metadata to help operations teams monitor how the data integration platform is meeting service level agreements and best practices of the organization.
  • Updated MDM bundling – Information Server Enterprise Edition is NOW included with MDM v11.3
  • InfoSphere Information Server cloud support: InfoSphere Information Server Enterprise Hypervisor Edition V9.1 is now available for public cloud deployments via the IBM PureApplication Service on SoftLayer.

For more information:
Official Announcement
InfoSphere Information Server v11.3 – Era of Business Driven Governance