So let’s begin with the need for a data lake.
It is estimated that a staggering 70% of the time spent on analytics projects goes into identifying, cleansing, and integrating data, because of the following issues:
- Data is often difficult to locate because it is scattered among many business applications and business systems.
- The data frequently needs reengineering and reformatting to make it easier to analyze.
- The data must be refreshed regularly to keep it up to date while it is in use for analytics.
Acquiring data for analytics in an ad hoc manner places a heavy burden on the teams that own the systems supplying the data. Often the same type of data is requested repeatedly, and the original information owner finds it hard to keep track of who has copies of which data.
As a result, many organizations are considering implementing a data lake solution. A data lake is a set of one or more data repositories that have been created to support data discovery, analytics, ad hoc investigations, and reporting. The data lake contains data from many different sources. People in the organization are free to add data to the data lake and access any updates as necessary.