In this series, I wish to share some of the best practices that I have come across or learnt from my peers while using DataStage. I hope this will be helpful for DataStage practitioners. Here are the links to the Best Practice 2 and Best Practice 3 blogs, which continue this one.
The following are best practices:
a. There should be no network bottleneck between Source -> ETL -> Target.
– Typically this means a private network connection using dedicated switches.
– It also means proper capacity planning in terms of network bandwidth, for both the network cards and the switch capacity.
b. There should be no bottleneck within the source/target system/application in providing/consuming data. The ETL server can only process as fast as the source can provide or the target can consume.
How fast can the ETL process the data in this example?
– Source can provide data at 10K rows/sec
– ETL can handle data at 100K rows/sec
– Target can consume data at 25K rows/sec
The answer is 10K rows/sec: end-to-end throughput is capped by the slowest stage, which here is the source.
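The point above can be sketched in a few lines of Python (the rates are the figures from the example; the variable names are mine, not anything from DataStage itself):

```python
# Rates in rows per second for each stage of the pipeline.
source_rate = 10_000   # source can provide 10K rows/sec
etl_rate    = 100_000  # ETL engine can handle 100K rows/sec
target_rate = 25_000   # target can consume 25K rows/sec

# End-to-end throughput is limited by the slowest stage,
# no matter how fast the ETL engine itself is.
effective_rate = min(source_rate, etl_rate, target_rate)
print(effective_rate)  # 10000
```

Tuning the ETL engine alone would change nothing here; the source (and then the target) must be addressed first.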
c. There should be no I/O bottleneck within the ETL Server.
d. There should be proper Capacity planning to cater for growth.
e. There should be proper job design to ensure job scalability as the hardware scales. You can get some information on job design here.
f. Always use a dedicated server, or at least “dedicated CPUs”, if virtualization is required.
g. When there is a bottleneck in the source/target, we can use fewer nodes in the configuration file. This helps improve performance and reduces resource usage.
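For illustration, here is a minimal sketch of a two-node parallel configuration file; the hostname and disk paths are placeholders I have assumed, not settings from this post. When the source or target is the bottleneck, pointing the job at a version of this file with only one node trims the processes DataStage spawns without losing throughput:

```
{
  node "node1"
  {
    fastname "etl-server"
    pools ""
    resource disk "/data/ds/dataset1" {pools ""}
    resource scratchdisk "/data/ds/scratch1" {pools ""}
  }
  node "node2"
  {
    fastname "etl-server"
    pools ""
    resource disk "/data/ds/dataset2" {pools ""}
    resource scratchdisk "/data/ds/scratch2" {pools ""}
  }
}
```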
h. You should run the “right” number of jobs to ensure there is no system/process overload, and no waste of resources on managing those processes.
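One simple way to enforce a cap on concurrent jobs is a bounded worker pool. The sketch below is an assumption on my part, not a DataStage feature: `run_job` is a hypothetical stand-in for whatever actually launches a job (in practice, the `dsjob` CLI), and the cap of 3 is arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical job runner: in a real setup this would invoke the
# dsjob command; here it only records that the job "finished".
def run_job(name):
    return f"{name}: finished"

jobs = [f"load_job_{i}" for i in range(8)]

# Cap concurrency so the server is never asked to manage more
# simultaneous jobs than it has capacity for.
MAX_CONCURRENT_JOBS = 3
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_JOBS) as pool:
    results = list(pool.map(run_job, jobs))

print(results[0])  # load_job_0: finished
```

All eight jobs still run, but never more than three at a time, which keeps CPU, memory, and I/O contention predictable.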
Here are some good reads:
- Architecture and Deployments
- Redbook on Deployment Architecture
- IBM InfoSphere Information Server Installation and Configuration Guide
- Managing and Deleting Persistent Data Sets within IBM InfoSphere Datastage