A dive into “Data Lake”

With the advancement of Big Data and the Cloud, a lot of new concepts are surfacing. “Data Lake” is one such concept that is being talked about, and in this blog I wish to demystify it.

So let’s begin with the need for a Data Lake.

It is estimated that a staggering 70% of the time spent on analytics projects is concerned with identifying, cleansing, and integrating data, because of the following issues:

  • Data is often difficult to locate because it is scattered among many business applications and business systems.
  • Frequently the data needs reengineering and reformatting in order to make it easier to analyze.
  • The data must be refreshed regularly to keep it up to date while it is in use by analytics.

Acquiring data for analytics in an ad hoc manner creates a huge burden on the teams that own the systems supplying data. Often the same type of data is repeatedly requested and the original information owner finds it hard to keep track of who has copies of which data.

As a result, many organizations are considering implementing a data lake solution. A data lake is a set of one or more data repositories that have been created to support data discovery, analytics, ad hoc investigations, and reporting. The data lake contains data from many different sources. People in the organization are free to add data to the data lake and access any updates as necessary.

7 thoughts on “A dive into “Data Lake”

  1. Thanks for the information; your blog is very helpful. Great work. I hope you will go into more detail on data lakes in a future post. However, I would like to clarify the modelling used for a data lake, since you mention that it will be used for analytics and reporting. If dimensional modelling is used, how will this differ from a data warehouse (DW) and data marts (DM)? I am just trying to understand the pros and cons for an organization of having a data lake versus a DW/DM solution.

    • Hi Balaji. Thank you for your comment and a thoughtful question.
      In a Data Warehouse (DW), the schema is defined before the data is stored: the required data is identified and modeled in advance. This is called “Schema on WRITE”. In a Data Lake, the schema is defined after the data is stored. This is called “Schema on READ”, so the schema must be captured in code by each program that accesses the data.
      The data lake arose because new types of data needed to be captured and exploited by the enterprise. Today many varied data types are being created, ranging from video and sound to structured data, documents, and sensor data. We therefore need a way to bring it all together and extract some meaning from it. But because there is so much data, we need to do this at low cost, which gave rise to Data Lakes. Data Warehouses, traditionally popular for business intelligence tasks, are being replaced by less-structured Data Lakes, which allow more flexibility. For more details you can read Data Lake Vs Data Warehouse.
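      To make the “Schema on WRITE” versus “Schema on READ” distinction concrete, here is a rough sketch in Python. The record layouts, the RAW_LAKE list, and all function names are made up for illustration; a real data lake would sit on a file system or object store rather than an in-memory list.

```python
import json

# Hypothetical raw records, stored exactly as they arrived.
# Nothing checked their shape when they were written to the "lake".
RAW_LAKE = [
    '{"sensor": "t1", "temp_c": "21.5", "ts": "2016-01-01T00:00:00"}',
    '{"sensor": "t2", "temp_c": "19.0", "site": "plant-a"}',
    '{"user": "balaji", "action": "login"}',
]


def write_with_schema(record, required_fields, store):
    """Schema on write: the record must match the model before it is stored."""
    missing = [field for field in required_fields if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Only the modeled fields are kept; anything else is dropped.
    store.append({field: record[field] for field in required_fields})


def read_temperatures(raw_lines):
    """Schema on read: this reader decides which fields matter and their types."""
    readings = []
    for line in raw_lines:
        record = json.loads(line)
        if "sensor" in record and "temp_c" in record:
            readings.append((record["sensor"], float(record["temp_c"])))
    return readings


def read_logins(raw_lines):
    """A second reader applies a different schema to the same raw data."""
    users = []
    for line in raw_lines:
        record = json.loads(line)
        if record.get("action") == "login":
            users.append(record["user"])
    return users


if __name__ == "__main__":
    print(read_temperatures(RAW_LAKE))
    print(read_logins(RAW_LAKE))
```

      Notice that each reader carries its own idea of the schema. That is the flexibility of the lake, and also its cost: the parsing and typing logic has to be repeated or shared across every program that touches the data.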
