DataStage now available on Cloud

For data integration projects, DataStage has been the workhorse for many years. Data engineers use it to extract data from many different sources, transform and combine the data, and then deliver it to applications and end users. DataStage has many distinct advantages over other popular ETL tools.

Until recently, these capabilities were only available with the on-premises offering. Now DataStage is available on the Cloud as a hosted cloud offering. Customers can take advantage of the full capabilities of DataStage without the burden and time required to stand up the infrastructure and install the software themselves. Customers can quickly deploy a DataStage environment (from ordering to provisioning it on the cloud) and be up and running in a day or less. There is no up-front capital expenditure, as customers only pay a monthly subscription based on the capacity they purchase. Licensing is also greatly simplified.

Using DataStage on Cloud, existing DataStage customers can start new projects quickly. Since it is hosted in the IBM cloud, the machine and operating system are managed by IBM. The customer does not have to spend time expanding the current environment or creating a new one; in other words, cloud elasticity makes the environment ready to scale and handle any workload. DataStage ETL job developers can be productive immediately, and the data integration activities can span both on-premises and cloud data if necessary, as the DataStage jobs can be exported from the cloud and brought back to an on-premises DataStage environment.

As an example, a customer has data sources such as Teradata, DB2, etc. in their data center as well as Salesforce, MongoDB, and other data residing in the Cloud. They need access to their existing data sources and their cloud data sources for a new customer retention project. This project requires some sophisticated data integration to bring it all together, but they don't have the IT resources or budget to stand up a new data integration environment in their own data center for this project. So, an instance of DataStage on the Cloud can be deployed for their use. The customer can access the DataStage client programs on the Cloud to work with DataStage, either through the public Internet or a private connection via the SoftLayer VPN. DataStage ETL jobs running in the Cloud can access the customer's on-premises data sources and targets using secured protocols and encryption methods. In addition, these DataStage jobs can also access cloud data sources like dashDB as well as data sources on other cloud platforms using the appropriate secured protocols.

So with DataStage hosted on the Cloud you can:

  1. Extend your ETL infrastructure: Expand your InfoSphere DataStage environment or begin transitioning into a private or public cloud with flexible deployment options and subscription pricing.
  2. Establish ad hoc environments: Extend your on-premises capacity to quickly create new environments for ad hoc development and testing or for limited duration projects.
  3. Start new projects in the cloud: Move straight to the cloud without establishing an on-premises environment. Realize faster time-to-value, reduce administration burden and use low-risk subscription pricing.

Go here for more information: https://developer.ibm.com/clouddataservices/docs/information-server/

DataStage job runtime architecture on Hadoop

In my earlier blog, I explored why enterprises are using Hadoop. In summary, scalable data platforms such as Hadoop offer unparalleled cost benefits and analytical opportunities (including content analytics) to enterprises. In this blog, I will mention some of the enhancements in IBM's InfoSphere Information Server 11.5 that help leverage the scale and promise of Hadoop.

Data integration in Hadoop:
In this release, Information Server can execute directly inside a Hadoop cluster. This means that all of the data connectivity, transformation, cleansing, enhancement, and data delivery features that thousands of enterprises have relied on for years are immediately available to run within the Hadoop platform! Information Server is a market-leading product in terms of its data integration and governance capabilities. Now the same product can be used to solve some of the industry's most complex data challenges directly inside a Hadoop cluster. Imagine the time saved by not having to move the data back and forth from HDFS!

What's more, these new features for Hadoop use the same simple graphical design environment that IBM clients have long used to build integration applications. In other words, organizations can build new Hadoop-based, information-intensive applications without the need to retrain their development teams on newly emerging languages that require manual hand coding and lack governance support.

How is this accomplished? YARN! 
Apache Hadoop YARN is the framework for job scheduling and cluster resource management. Information Server can communicate with YARN to run a job on the data nodes of a Hadoop cluster using the following steps.

Here is more detail on how Information Server uses YARN:

[Diagram: DataStage on Hadoop runtime architecture]

  1. A job is submitted to run in the Information Server engine.
  2. The ‘Conductor’ (the process responsible for coordinating the job) asks YARN to instantiate the YARN version of the Conductor: the Application Master.
  3. The YARN Client is responsible for starting and stopping Application Masters.
  4. Now that the Application Master is ready, ‘Section Leaders’ (responsible for work on a data node) are prepared.
  5. Section Leaders are created and managed by YARN Node Managers. This is the point where the BigIntegrate/BigQuality binaries are copied to the Hadoop DataNode if they do not already exist there.
  6. Now the real work can begin: the ‘players’ (the processes that actually do the work) are started.

All of this is automatic and behind the scenes. The user interface looks and feels identical to when a job is run on Windows, AIX, or Linux.
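
Although the orchestration is invisible, it is easy to observe: the DataStage Application Master shows up in the YARN ResourceManager like any other application. The sketch below is a minimal example that lists running YARN applications through the standard ResourceManager REST API; the ResourceManager host and port are assumptions that will differ in your cluster.

```python
# Minimal sketch: list RUNNING YARN applications to spot the DataStage
# Application Master while a job executes.
# Assumption: the ResourceManager web service is reachable at RM_URL and the
# standard YARN REST API (/ws/v1/cluster/apps) is enabled.
import requests

RM_URL = "http://resourcemanager.example.com:8088"  # assumed host and port

def running_apps():
    resp = requests.get(f"{RM_URL}/ws/v1/cluster/apps", params={"states": "RUNNING"})
    resp.raise_for_status()
    apps = (resp.json().get("apps") or {}).get("app") or []
    return [(a["id"], a["name"], a["state"]) for a in apps]

if __name__ == "__main__":
    for app_id, name, state in running_apps():
        print(app_id, state, name)
```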

 

IBM Stewardship Center

Need for IBM Stewardship Center in Data Curation: 

Managing data quality requires the joint effort of business and IT. The business defines the information policies that govern data quality for an organization. Based on these policies, the IT team implements rules so that any deviations in data quality can be reported for the business to review. For example, suppose a bank's policy is that an account holder's age must be greater than 18. During data load, an ETL tool can run some profiling on the data to check how many records violate this rule. These records then need to be shared with the business (non-technical domain experts called Data Stewards), who can take appropriate action to fix the issue (a minimal sketch of this kind of check appears after the list below). As data stewards become increasingly responsible for improving the value of their data assets, they need capabilities to help them manage new requirements like:

  • Collaborating across multiple lines of business to build information policies that support regulatory requirements
  • Assessing the cost of poor data quality and managing such data quality issues to closure
  • Engaging subject matter experts through business processes to review and approve corporate glossary changes
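
To make the age policy above concrete, here is a minimal sketch of the kind of check IT might implement: records that violate the rule (age must be greater than 18) are collected as exceptions for a data steward to review. The file name and column name are hypothetical, and in practice the rule would be implemented with Information Server data rules rather than a script.

```python
# Hedged sketch of the age policy check: collect records that violate the
# "age must be greater than 18" rule as exceptions for steward review.
# The input file "accounts.csv" and its "age" column are hypothetical.
import csv

def find_age_exceptions(path, min_age=18):
    exceptions = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                age = int(row["age"])
            except (KeyError, ValueError):
                age = None  # missing or non-numeric ages are exceptions too
            if age is None or age <= min_age:
                exceptions.append(row)
    return exceptions

if __name__ == "__main__":
    for record in find_age_exceptions("accounts.csv"):
        print("Exception:", record)
```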

[Screenshot: IBM Stewardship Center]

IBM Stewardship Center is a powerful browser-based interface that helps bridge the gap between business and IT, providing a central location for users to collaborate on data governance and manage data quality issues. Stewardship Center is built on an open event management infrastructure, which makes it possible to integrate Information Server-based stewardship seamlessly into your existing stewardship solutions and collaboration environments.

IBM Stewardship Center leverages the strengths of IBM® Business Process Manager to offer solutions to these challenges that can be used immediately or can be customized or extended to suit the specific needs of your organization. The capabilities that are provided with IBM Stewardship Center are divided into three categories: data quality exception management, governance and workflow notification, and performance metrics.
IBM Stewardship Center includes these components:

  • The Data Quality Exception sample process application, a sample workflow for resolving data quality issues that can be customized or extended.
  • The Stewardship Center Application Toolkit, which can be used to extend the capabilities of the Data Quality Exception sample process application or to create your own custom workflows.
  • The Email Notification process application, which can be used to notify users by email when governance events are generated in Information Governance Catalog.
  • The Governance Rule Approval process application, which can be used to manage approvals for information governance rules from Information Governance Catalog.

For more information, see Overview of IBM Stewardship Center.
For a video see Tech Talk: Stewardship Center.

DataStage Best Practices – 3

1. Avoid unnecessary type conversions:
Set the OSH_PRINT_SCHEMAS environment variable to verify that runtime schemas match the job design column definitions. If you are using stage variables on a Transformer stage, ensure that their data types match the expected result types.
2. Use Transformer stages sparingly and wisely.
Transformer stages can slow down your job. Do not use multiple Transformer stages where the functionality could be incorporated into a single stage, and use other stage types to perform simple transformation operations.

3. Increase Sort performance where possible.
Careful job design can improve the performance of sort operations, both in standalone Sort stages and in on-link sorts specified in the Inputs page Partitioning tab of other stage types.

4. Remove Unneeded Columns.
Remove unneeded columns as early as possible within the job flow. Every additional unused column requires additional buffer memory, which can impact performance and make each row transfer from one stage to the next more expensive. If possible, when reading from databases, use a select list to read just the columns required, rather than the entire table.

5. Avoid reading from sequential files using the Same partitioning method.
Unless you have specified more than one source file, this will result in the entire file being read into a single partition, making the entire downstream flow run sequentially unless you explicitly re-partition.

6. It is important to note that the individual SQL statements required by a Sparse Lookup are expensive operations from a performance perspective. In most cases, it is faster to use a DataStage Join stage between the input and DB2 reference data than it is to perform a Sparse Lookup.

7. For scenarios where the number of input rows is significantly smaller (1:100 or more) than the number of reference rows in a DB2 or Oracle table, a Sparse Lookup may be appropriate. CPU intensive applications, which typically perform multiple CPU demanding operations on each record, benefit from the greatest possible parallelism, up to the capacity supported by your system.

8. Parallel jobs with large memory requirements can benefit from parallelism if they act on data that has been partitioned and if the required memory is also divided among partitions.

9. Applications that are disk- or I/O-intensive, such as those that extract data from and load data into RDBMSs, benefit from configurations in which the number of logical nodes equals the number of disk spindles being accessed. For example, if a table is fragmented 16 ways inside a database or if a data set is spread across 16 disk drives, set up a node pool consisting of 16 processing nodes (see the configuration sketch after this list).

10. Turn off Runtime Column Propagation wherever it is not required.
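
For point 9, the node-per-spindle layout is expressed in the parallel engine configuration file. The sketch below simply generates such a file with 16 logical nodes; the host name and the disk/scratch paths are illustrative assumptions and should be replaced with your own values.

```python
# Hedged sketch: generate a parallel configuration file with one logical node
# per disk spindle (16 here), following point 9 above.
# The fastname and the resource/scratch disk paths are illustrative only.
SPINDLES = 16
HOST = "etl-server.example.com"  # assumed engine host (fastname)

def make_config(spindles=SPINDLES, host=HOST):
    nodes = []
    for i in range(spindles):
        nodes.append(
            f'  node "node{i}"\n'
            f'  {{\n'
            f'    fastname "{host}"\n'
            f'    pools ""\n'
            f'    resource disk "/data/ds/disk{i}" {{pools ""}}\n'
            f'    resource scratchdisk "/scratch/ds/disk{i}" {{pools ""}}\n'
            f'  }}\n'
        )
    return "{\n" + "".join(nodes) + "}\n"

if __name__ == "__main__":
    with open("16node.apt", "w") as f:
        f.write(make_config())
```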

 

DataStage Best Practices – 2


Some considerations and recommendations when it comes to deploying Information Server:
1. For a deployment on Intel-based servers, it is recommended to have 4 GB of RAM per core as a minimum, or 8 GB per core if you can afford it.

2. Key file systems that need good I/O are –

  • Source Files Folder
  • Target Files Folder
  • Scratch Disk (for those cases with 8 GB/core, consider creating a RAM disk to be used for the Scratch Disk)
  • Resource Disk

3. GRID-related considerations –
Consider Network File System (NFS) for the program files that are shared by the conductor with all compute nodes. NFS or a clustered file system can be used for the Source/Target Files and Resource Disk.

4. In 11.3, a few more repositories were added to store information, with the intention of managing the systems better. This implies higher capacity requirements for the repository, which can mean more licenses if you are not using DB2. The alternatives are to do more housekeeping or not to use those features.

5. Job Design and Operation Considerations –

  • Try to keep the Project within 500 objects.
  • Always clean the job log in Director (Auto Purge can be set in the Administrator client). There are also new ways to store the logs in the Repository.
  • When possible, complete the process within a single job, since all the processing is then done in memory.
  • If there is a requirement for multiple jobs to complete the process, the intermediate “files” should always be DataSets.
  • For big DataSets, it is recommended to use Compress/Expand. In some cases you will see over a 50% reduction in processing time. The extra processing time to Compress/Expand the DataSet is compensated for by using less storage, which is typically the slowest component in a server. The size of a DataSet after compression can be as small as 10% of the original size, a difference of 5 GB versus 50 GB for a large file.

6. It is better to use an enterprise scheduler rather than a Job Sequence when one is available. Leverage a good enterprise scheduler to handle the following –

  • Dependency Management (Cross Platform and Application)
  • Priority Management
  • Resource Management
  • Concurrency Management
  • DataStage Upgrade – it can be used to handle dependencies across multiple versions of DataStage during a transition
  • Active-Active Engine requirements

Here is another site providing valuable hints for DataStage.

DataStage Best Practices – 1

In this series, I wish to share some of the best practices that I have come across or learnt from my peers in using DataStage. I hope this will be helpful for DataStage practitioners. Here are the links to the Best Practice 2 and Best Practice 3 blogs, which are continuations of this one.

The following are best practices:

a. There should be no network bottleneck between Source -> ETL -> Target.
– Typically this means a private network connection using dedicated switches.
– It also means proper capacity planning in terms of network bandwidth for the network cards as well as switch capacity.

b. There should be no bottleneck within the Source/Target system/application to provide/consume data. The ETL server can only process as fast as the source can provide or the target can consume.
How fast can the ETL process the data in this example (see the quick calculation after this list)?
– Source can provide data at 10K rows/sec
– ETL can handle data at 100K rows/sec
– Target can consume data at 25K rows/sec
In this case the end-to-end throughput is limited to about 10K rows/sec by the source, the slowest link.

c. There should be no I/O bottleneck within the ETL Server.

d. There should be proper Capacity planning to cater for growth.

e. There should be proper job design to ensure job scalability as the hardware scales. You can get some information on job design here.

f. Always use dedicated server or at least “dedicated CPU” if virtualization is required.

g. When there is a bottleneck in the Source/Target, we can use fewer nodes in the configuration file. This helps to improve performance and reduce resource usage.

h. You should be running the “right” number of jobs to ensure there is no system/process overload and no waste of resources managing those processes.
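
Here is the quick calculation promised in point b. The pipeline can only run as fast as its slowest link, so the rates from the example give:

```python
# The example rates from point b: throughput is bounded by the slowest stage.
rates = {"source": 10_000, "etl": 100_000, "target": 25_000}  # rows/sec

bottleneck = min(rates, key=rates.get)
print(f"End-to-end throughput ~ {rates[bottleneck]:,} rows/sec, limited by the {bottleneck}")
```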

Here are some of the good reads:
Architecture and Deployments
Redbook on Deployment Architecture
IBM InfoSphere Information Server Installation and Configuration Guide

Making DataStage jobs Cloud ready using S3 connector

In my last blog, I mentioned how a user can trigger a batch ETL job dynamically using the power of ISD, and we saw how the user can dynamically choose the input and output file. But there was a limitation: these files needed to be available on the server where InfoSphere Information Server is installed. Can we modify the behavior so that the input and output files live on the Cloud? That would bring the ETL batch job a step closer to being Cloud ready. Let's explore this.

In the last blog we saw the following job, which takes its input and output from files.

[Screenshot: the CASS job with file input and output]

In my earlier blog I mentioned that the Information Server 11.3 release included some Cloud additions, among them an S3 connector (a connector that can directly connect to the storage available on Amazon). Now we can replace the input and output of the above DataStage job with the S3 connector. This is as simple as deleting the Sequential File input and output stages and replacing them with S3 connector stages. The job now looks like the following.

[Screenshot: the job with Amazon S3 connector stages for input and output]

To access a file on S3, we require the following parameters: Access Key, Secret Key, Bucket, and File name. Now we double-click the Amazon_S3 stage to edit its properties and parameterize these values for both the input and output stages.

[Screenshot: Amazon S3 connector stage properties]
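
Before wiring these values into job parameters, it can save time to verify them outside DataStage. The following is a hedged sketch using the boto3 library; the bucket and file names are placeholders, and the S3 connector itself does not require this step.

```python
# Hedged sketch: verify the Access Key, Secret Key, Bucket, and File name
# before passing them to the S3 connector as job parameters.
# The bucket and file names below are placeholders.
import boto3

def check_s3_object(access_key, secret_key, bucket, filename):
    s3 = boto3.client(
        "s3",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    head = s3.head_object(Bucket=bucket, Key=filename)  # raises if missing or access denied
    return head["ContentLength"]

if __name__ == "__main__":
    size = check_s3_object("ACCESS_KEY", "SECRET_KEY", "my-input-bucket", "customers.csv")
    print(f"Input file found, {size} bytes")
```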

In another minute we can compile and deploy this DataStage job as a SOAP or REST service. Now we can dynamically pick a file from the Cloud, process it, and put the result back on the Cloud! We do not even have to write an app to test it, as there are many free plugins that can be used to exercise the new REST service. I typically use the Poster plugin for the Firefox web browser to compose the REST request and view the response; a short script works just as well, as sketched below.
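
If you prefer a script to a browser plugin, a few lines of Python can exercise the same service. The endpoint URL, parameter names, and payload below are purely illustrative assumptions; use the URL and request schema that Information Services Director generates for your deployed service.

```python
# Hedged sketch: call the REST service that wraps the DataStage job.
# The URL and payload are illustrative; substitute the values generated for
# your own ISD service.
import requests

SERVICE_URL = "https://isd-host:9443/myApplication/myService"  # assumed endpoint

payload = {
    "bucket": "my-input-bucket",      # hypothetical job parameters
    "inputFile": "customers.csv",
    "outputFile": "customers_out.csv",
}

resp = requests.post(SERVICE_URL, json=payload, verify=False, timeout=300)  # self-signed cert assumed
resp.raise_for_status()
print(resp.status_code, resp.text)
```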

Information Server is now available on the Cloud. Check out this blog to learn more about it.