In my last blog post, I wrote about the Data Lake. The first comment on that post asked about the difference between a Data Lake and a Data Warehouse. So in this post, I will try to share my understanding of how they differ:
Schema: In a Data Warehouse (DW), the schema is defined before the data is stored. This is called “Schema on WRITE”: the required data is identified and modeled in advance. In a Data Lake, the schema is defined after the data is stored. This is called “Schema on READ”. As a result, the structure of the data must be captured in the code of each program that accesses it.
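To make the difference concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the `sales` table, the sample records, and the `read_amounts` helper are all hypothetical, with an in-memory SQLite table standing in for a warehouse and a list of raw JSON strings standing in for a lake.

```python
import json
import sqlite3

# Schema on WRITE: the table structure is declared before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER NOT NULL, amount REAL NOT NULL)")
conn.execute("INSERT INTO sales VALUES (?, ?)", (1, 19.99))

# A row that does not fit the declared schema is rejected at write time.
try:
    conn.execute("INSERT INTO sales VALUES (?, ?)", (2, None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

warehouse_total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# Schema on READ: raw records are stored as-is, with no structure enforced.
raw_lake_records = [
    '{"order_id": 1, "amount": "19.99"}',
    '{"order_id": 2}',  # missing field, but still stored
    '{"order_id": 3, "amount": "5.01", "note": "gift"}',
]


def read_amounts(lines):
    """Apply this reader's own schema while parsing the raw records."""
    amounts = []
    for line in lines:
        record = json.loads(line)
        if "amount" in record:  # this reader decides how to treat gaps
            amounts.append(float(record["amount"]))
    return amounts


lake_total = sum(read_amounts(raw_lake_records))
```

In the first half the bad row never reaches storage. In the second half everything is stored, and the decision about missing or oddly typed fields moves into `read_amounts`, which every other reading program would have to repeat in its own way.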
Cost (Storage and Processing): A Data Lake provides cheaper storage for large volumes of data and has the potential to reduce processing cost by bringing analytics closer to the data.
Data Access: The data lake gives business users immediate access to all data. They don’t have to wait for the data warehousing (DW) team to model the data or grant them access. Instead, they shape the data however they need to meet local requirements. The data lake thus speeds up delivery, which matters in a dynamic market economy.
Flexibility: Data Lakes offer unparalleled flexibility, since nobody and nothing stands between business users and the data.
Data Quality: The data in a traditional Data Warehouse is cleansed, whereas the data in a typical Data Lake is raw.
Relevance in the Big Data world: The traditional approach of manually curated data warehouses provides a limited window onto the data, and such warehouses are designed to answer only the specific questions identified at design time. This may not be adequate for data discovery in today’s big data world. Moreover, traditional data warehouses are limited to structured data, whereas a data lake can hold any type of data: clickstream, machine-generated, social media, and external data, and even audio, video, and text. For example, data lakes are well suited to managing the millions of patient records of a hospital. These records can range from physicians’ notes to lab results. With a data lake, the hospital stores all of that disparate data in its original format, calls upon specific types of record when needed, and converts the data into uniform structures only when the situation calls for it.
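The hospital example can be sketched in a few lines of Python. This is a toy illustration, not a real system: the `raw_store` list, its `format` and `body` fields, the sample values, and the `lab_results` function are all hypothetical names and data I made up for the purpose.

```python
import csv
import io
import json

# Hypothetical raw store: each record is kept in its original format.
raw_store = [
    {"format": "text", "body": "Patient reports mild headache. Advised rest."},
    {"format": "csv", "body": "test,value,unit\nglucose,5.4,mmol/L\n"},
    {"format": "json", "body": '{"test": "hemoglobin", "value": 13.8, "unit": "g/dL"}'},
]


def lab_results(store):
    """Convert only lab-result records into one uniform structure, on demand."""
    results = []
    for record in store:
        if record["format"] == "csv":
            for row in csv.DictReader(io.StringIO(record["body"])):
                results.append(
                    {"test": row["test"], "value": float(row["value"]), "unit": row["unit"]}
                )
        elif record["format"] == "json":
            item = json.loads(record["body"])
            results.append(
                {"test": item["test"], "value": float(item["value"]), "unit": item["unit"]}
            )
        # Free-text notes are left untouched until some other reader needs them.
    return results


uniform = lab_results(raw_store)
```

Nothing in `raw_store` is changed by the call. The uniform list of lab results exists only for the program that asked for it, while the physician's note stays in its original form.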
A Data Lake does provide some advantages to enterprises that require quick access to data. But Data Lakes bring their own set of challenges. I will explore these in my subsequent posts.