I had a series of post on InfoSphere Information Server (DataStage). The core functionality of the tool is data integration (loosely called as ETL). So now I wish to take a step back and dwell more on what is this ETL and why is this required and what is the big deal about it? I plan to write some posts around this.
ETL is short for Extract, Transform, and Load. Some years back ETL would refer to “the set of processes for getting data from OLTP systems into a data warehouse.” In today’s world, data is not only coming from OLTP systems but from websites, flat files, e-mail databases, spreadsheets, and personal databases such as Access as well. Moreover, ETL is not only used to load a single data warehouse but can have many other use cases, like loading data marts, generating spreadsheets, scoring customers using data mining models, or even loading forecasts back into OLTP systems. The main ETL steps, however, can still be grouped into three sections:
- Extract:All processing required to connect to various data sources, extract the data from these data sources, and make the data available to the subsequent processing steps. Sounds trivial? Think about the connectivity to each tool or other possible data source, understanding their metadata etc.
- Transform:Any function applied to the extracted data between the extraction from sources and loading into targets. These functions can contain (but are not limited to) the following operations:
- Movement of data
- Validation of data against data quality rules
- Modification of the content or structure of the data
- Integration of the data with data from other sources
- Calculation of derived or aggregated values based on processed data
- Load: All processing required to load the data in a target system. This part of the process consists of a lot more than just bulk loading transformed data into a target table. Parts of the loading process include, for instance, surrogate key management and dimension table management.