InfoSphere DataStage – IV (Parallel processing)

Companies today must manage, store, and sort through rapidly expanding volumes of data and deliver it to end users as quickly as possible. To address these challenges, organizations need a scalable data integration architecture that contains the following components:

  • A method for processing data without writing to disk, in batch and real time.
  • Dynamic data partitioning and in-flight repartitioning.
  • Scalable hardware that supports symmetric multiprocessing (SMP), clustering, grid, and massively parallel processing (MPP) platforms without requiring changes to the underlying integration process.
  • Support for parallel databases including DB2®, Oracle, and Teradata, in parallel and partitioned configurations.
  • An extensible framework to incorporate in-house and vendor software.

IBM® InfoSphere™ Information Server addresses all of these requirements by exploiting both pipeline parallelism and partition parallelism to achieve high throughput, performance, and scalability.

Data pipelining

Data pipelining is the process of pulling records from the source system and moving them through the sequence of processing functions that are defined in the data flow (the job). Because records flow through the pipeline, they can be processed without being written to disk, as the following figure shows.

Data can be buffered in blocks so that each process is not slowed when other components are running. This approach avoids deadlocks and speeds performance by allowing both upstream and downstream processes to run concurrently.
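The buffered pipeline described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the InfoSphere parallel engine is implemented: the stage functions (source, transform, sink), the run_pipeline helper, and the buffer_size parameter are all hypothetical names chosen for this example. Each stage runs in its own thread, and bounded in-memory queues stand in for the block buffers between stages, so no record is written to disk.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker


def source(records, out_q):
    # Upstream stage: push records into the buffer; put() blocks while the buffer is full.
    for rec in records:
        out_q.put(rec)
    out_q.put(SENTINEL)


def transform(in_q, out_q):
    # Middle stage: starts working as soon as the first record arrives.
    while True:
        rec = in_q.get()
        if rec is SENTINEL:
            out_q.put(SENTINEL)
            break
        out_q.put(rec.upper())


def sink(in_q, results):
    # Downstream stage: collects records as they become available.
    while True:
        rec = in_q.get()
        if rec is SENTINEL:
            break
        results.append(rec)


def run_pipeline(records, buffer_size=4):
    # Bounded queues act as the in-memory buffers between stages.
    q1 = queue.Queue(maxsize=buffer_size)
    q2 = queue.Queue(maxsize=buffer_size)
    results = []
    stages = [
        threading.Thread(target=source, args=(records, q1)),
        threading.Thread(target=transform, args=(q1, q2)),
        threading.Thread(target=sink, args=(q2, results)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return results
```

Calling run_pipeline(["smith", "jones"]) passes each surname through all three stages while the stages run concurrently; the bounded queue size limits how far an upstream stage can run ahead of a slower downstream one.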

Without data pipelining, the following issues arise:

  • Data must be written to disk between processes, degrading performance and increasing storage requirements and the need for disk management.
  • The developer must manage the I/O processing between components.
  • The process becomes impractical for large data volumes.
  • The application will be slower, as disk use, management, and design complexities increase.
  • Each process must complete before downstream processes can begin, which limits performance and full use of hardware resources.

Data partitioning

Data partitioning is an approach to parallelism that involves breaking the record set into partitions, or subsets of records. Data partitioning generally provides linear increases in application performance. The figure below shows data that is partitioned by customer surname before it flows into the Transformer stage.

InfoSphere Information Server automatically partitions data based on the type of partition that the stage requires. Typical packaged tools lack this capability and require developers to manually create data partitions, which results in costly and time-consuming rewriting of applications or the data partitions whenever the administrator wants to use more hardware capacity.

In a well-designed, scalable architecture, the developer does not need to be concerned about the number of partitions that will run, the ability to increase the number of partitions, or repartitioning data.
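As a rough sketch of partitioning by customer surname, the Python below hashes each surname to choose a partition and then applies the same transformation to every partition in parallel. The function names (partition_by_key, transform_partition, run_partitioned) and the record layout are assumptions made for this example; they are not DataStage APIs. The number of partitions is a single parameter, so the transformation logic itself does not change when that number changes. Python threads are used only to show the structure; they do not provide the CPU scaling of a real parallel engine.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition_by_key(records, key, num_partitions):
    # Hash partitioning: records with the same key value always land in the same partition.
    # zlib.crc32 is used because it gives the same result on every run.
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        idx = zlib.crc32(str(rec[key]).encode("utf-8")) % num_partitions
        partitions[idx].append(rec)
    return partitions


def transform_partition(partition):
    # Hypothetical transformation: normalize the capitalization of the surname.
    return [{**rec, "surname": rec["surname"].title()} for rec in partition]


def run_partitioned(records, num_partitions):
    # The same transformation runs on each partition; only the partition count varies.
    partitions = partition_by_key(records, "surname", num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return list(pool.map(transform_partition, partitions))
```

Running run_partitioned(customers, 2) and run_partitioned(customers, 8) uses the same code path; only the degree of parallelism differs.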

Dynamic repartitioning

In the examples shown earlier, data is partitioned based on customer surname, and then the data partitioning is maintained throughout the flow.

This type of partitioning is impractical for many uses. For example, a transformation might require data that is partitioned on surname, while the load into the data warehouse that follows it requires data that is partitioned on customer account number.

Dynamic data repartitioning is a more efficient and accurate approach. With dynamic data repartitioning, data is repartitioned while it moves between processes without writing the data to disk, based on the downstream process that data partitioning feeds. The InfoSphere Information Server parallel engine manages the communication between processes for dynamic repartitioning.

Data is also pipelined to downstream processes as soon as it is available, as the figure below shows.
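The surname-to-account-number case can be illustrated with a short Python sketch. It is a simplified, single-process model under stated assumptions: the helper names (stable_hash, partition, repartition) and the "account" field are hypothetical, and the real parallel engine handles the communication between processes itself. The point shown is that records move from the old partitions directly to the new ones in memory, keyed on what the downstream process needs, with no intermediate file.

```python
import zlib


def stable_hash(value):
    # Deterministic hash so the same key value always maps to the same partition.
    return zlib.crc32(str(value).encode("utf-8"))


def partition(records, key, num_partitions):
    # Initial partitioning, for example on surname for the transformation step.
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        parts[stable_hash(rec[key]) % num_partitions].append(rec)
    return parts


def repartition(partitions, new_key, num_partitions):
    # Redistribute records from the existing partitions to new ones, keyed on the
    # field the downstream step needs. Everything stays in memory.
    new_parts = [[] for _ in range(num_partitions)]
    for part in partitions:
        for rec in part:
            new_parts[stable_hash(rec[new_key]) % num_partitions].append(rec)
    return new_parts
```

For example, partition(rows, "surname", 3) followed by repartition(by_surname, "account", 2) yields two partitions in which every account number appears in exactly one partition, ready for a load step that is keyed on account number.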

Without partitioning and dynamic repartitioning, the developer must take these steps:

  • Create separate flows for each data partition, based on the current hardware configuration.
  • Write data to disk between processes.
  • Manually repartition the data.
  • Start the next process.

The application will be slower, disk use and management will increase, and the design will be much more complex. The dynamic repartitioning feature of InfoSphere Information Server helps us overcome these issues.
