In Part 1 of this series I discussed the challenges a group of college interns faced as they went to gather data for an analytics report. Continuing the fictitious story, the students remembered their studies on OLAP vs. OLTP and realized that they needed to move the data to a warehouse to run their analytics queries. For this they would need to develop the following functionality:
1. A tool that can connect to multiple data sources to EXTRACT data and finally LOAD (or land) it into a warehouse.
2. The tool should be able to transform or cleanse the data where possible, so that the warehouse becomes a source of truth to run analytics on.
3. Since the window to move the data is very small (5 hours in this case), the tool should ideally read the data very fast, perhaps in parallel.
4. Since there may be no direct access at night to the place where the database is physically located, there should be a way to trigger a script or job that starts extracting the data and loading it to the destination without anyone having to start it manually.
5. The amount of data to be moved may grow in the future, so one should be able to easily add more CPU/memory to ensure that the data movement still completes within the given time window (called the batch window), without any code change.
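To make the extract–transform–load flow above concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `orders` table, the column names, and the cleansing rules are invented for illustration, and two in-memory SQLite databases stand in for the operational source and the warehouse.

```python
import sqlite3

def extract(source_conn):
    """EXTRACT: pull raw rows from the operational (OLTP) source."""
    return source_conn.execute(
        "SELECT id, customer, amount FROM orders"
    ).fetchall()

def transform(rows):
    """TRANSFORM: cleanse the data -- drop rows with a missing amount
    and normalize customer names."""
    return [
        (rid, customer.strip().title(), amount)
        for rid, customer, amount in rows
        if amount is not None
    ]

def load(warehouse_conn, rows):
    """LOAD: land the cleansed rows in the warehouse fact table."""
    warehouse_conn.executemany(
        "INSERT INTO fact_orders VALUES (?, ?, ?)", rows
    )
    warehouse_conn.commit()

# Wire the three steps together. In-memory databases stand in for the
# real source and warehouse repositories.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, " alice ", 10.0), (2, "BOB", None), (3, "carol", 7.5)])

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_orders (id INTEGER, customer TEXT, amount REAL)")

load(warehouse, transform(extract(source)))
print(warehouse.execute("SELECT * FROM fact_orders").fetchall())
# -> [(1, 'Alice', 10.0), (3, 'Carol', 7.5)]
```

The row with a missing amount never reaches the warehouse, which is exactly the kind of cleansing that keeps the warehouse trustworthy. A scheduler (cron, or the ETL tool's own job scheduler) would kick this job off at the start of the batch window, which covers requirement 4.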
And here is the good news: such tools already exist in the market, so there is no need to re-invent the wheel. A typical ETL tool like IBM InfoSphere Information Server provides connectivity to several source (or target) repositories. It can transform data using a number of built-in stages (a stage can be seen as the equivalent of a function call in the programming world). The tool supports parallelism, so data can be quickly extracted, transformed, and loaded into the target repository. All one has to do is create an ETL job using a drag-and-drop palette, compile it, and run it. Something that would have taken weeks to build by hand happens in a matter of hours.
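The parallelism such tools offer can be sketched in a few lines. This is only an illustration of the idea, not how any particular product works: the `read_partition` function and the id ranges are made up, standing in for an engine that splits a large source table into slices and reads them concurrently to fit the extract inside the batch window.

```python
from concurrent.futures import ThreadPoolExecutor

def read_partition(id_range):
    """Simulate reading one slice of a large source table."""
    lo, hi = id_range
    return [f"row-{i}" for i in range(lo, hi)]

# Split a hypothetical 1000-row table into four id ranges and read them
# in parallel, shrinking the elapsed extract time versus one sequential scan.
partitions = [(0, 250), (250, 500), (500, 750), (750, 1000)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = pool.map(read_partition, partitions)

# Gather the slices back into one result set for the transform step.
rows = [row for chunk in results for row in chunk]
print(len(rows))  # -> 1000
```

Scaling up then means adding workers (or nodes) rather than rewriting the job, which is requirement 5 from the list above.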
In Part 3 of this series we will extend the story as we explore the need for data quality tools. So stay tuned…