Do we need a data lake? And what to do with the data warehouse?

This article is a translation of my Medium article, Getting Started with Data Lake, which turned out to be quite popular, probably because of its simplicity. So I decided to rewrite it in Russian and expand it a little, so that an ordinary person who is not a data specialist can understand what a data warehouse (DW) is, what a data lake is, and how they get along together.

Why did I want to write about the data lake? I have been working with data and analytics for over 10 years, and now I work with big data at Amazon Alexa AI in Cambridge, near Boston, although I live in Victoria on Vancouver Island and often visit Boston, Seattle, and Vancouver, and sometimes even speak at conferences in Moscow. I also write from time to time, mostly in English, and have already written some books. I feel a need to share analytics trends from North America, so I also post to a Telegram channel now and then.

I have always worked with data warehouses; since 2015 I have worked closely with Amazon Web Services and have generally switched to cloud analytics (AWS, Azure, GCP). I have been watching analytics solutions evolve since 2007; I even worked for the data warehouse vendor Teradata and implemented its product at Sberbank. Then Big Data with Hadoop appeared, and everyone started saying that the era of the warehouse was over and everything would now run on Hadoop. Then the talk shifted to the Data Lake and, again, to the supposed end of the data warehouse. But fortunately (or unfortunately for those who made good money setting up Hadoop), the data warehouse did not go away.

In this article, we will look at what a data lake is. It is aimed at people who have little or no experience with data warehouses.

[Image: Lake Bled]

The picture shows Lake Bled, one of my favorite lakes; I have only been there once, but I will remember it for the rest of my life. Here, though, we will talk about another type of lake: the data lake. Many of you may have heard this term more than once, but one more definition will not hurt anyone.

First of all, here are the most popular Data Lake definitions:

"A file store of all types of raw data that is available for analysis by anyone in the organization" - Martin Fowler.

"If you think of a data mart as a bottle of water - cleaned, packaged, and bottled for easy drinking - then a data lake is a huge reservoir of water in its natural state. Users can draw water for themselves, dive to the depths, and explore" - James Dixon.

Now we know for sure that a data lake is about analytics: it lets us store large amounts of data in its original form while keeping that data conveniently accessible.

I like to simplify things: if I can explain a complex term in simple words, then I understand for myself how it works and what it is for. One day I was poking around the photo gallery on my iPhone, and it dawned on me: this is a real data lake. I even made a slide for conferences:

[Image: iPhone photo gallery as a data lake (conference slide)]

Everything is very simple. We take a photo on the phone; the photo is saved on the phone and can also be saved in iCloud (file storage in the cloud). The phone also collects the photo's metadata: what is shown, the geotag, the time. As a result, we can use the iPhone's user-friendly interface to find our photos, and we even see counts; for example, when I search for photos with the word "fire", I find 3 photos of a campfire. For me, this works just like a Business Intelligence tool: fast and clear.
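The gallery analogy can be sketched in a few lines of Python: raw objects (the photo files) plus a metadata layer, and a search over that metadata. The file names, tags, and places here are invented for illustration:

```python
# Toy model of the photo-gallery "data lake": raw objects plus a metadata layer.
photos = [
    {"file": "IMG_0001.jpg", "tags": ["campfire", "beach"], "place": "Victoria"},
    {"file": "IMG_0002.jpg", "tags": ["campfire"], "place": "Boston"},
    {"file": "IMG_0003.jpg", "tags": ["lake", "mountains"], "place": "Bled"},
    {"file": "IMG_0004.jpg", "tags": ["campfire", "friends"], "place": "Seattle"},
]

def search(keyword):
    """Return files whose metadata tags contain the keyword."""
    return [p["file"] for p in photos if keyword in p["tags"]]

print(search("campfire"))  # three matches, just like searching "fire" in the gallery
```

The raw files stay untouched; all the convenience comes from the thin metadata layer on top, which is exactly the role the catalog plays in a real data lake.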

And of course, we must not forget about security (authentication and authorization), otherwise our data can easily end up in the public domain. There is plenty of news about large corporations and startups whose data became public because developers were careless and did not follow simple rules.

Even such a simple picture helps us imagine what a data lake is, how it differs from a traditional data warehouse, and what its main elements are:

  1. Loading data (Ingestion) is a key component of the data lake. Data can enter the data lake in two ways: batch (loading at intervals) and streaming (a continuous data stream).
  2. File storage (Storage) is the main component of the Data Lake. Storage needs to be highly scalable, extremely reliable, and low cost. In AWS, for example, this is S3.
  3. Catalog and Search - to avoid a Data Swamp (when we dump all the data into one heap and then cannot work with it), we need a metadata layer that classifies the data so that users can easily find what they need for analysis. Additional search solutions, such as Elasticsearch, can also be used; search helps users find the data they want through a user-friendly interface.
  4. Processing (Process) - this step is responsible for processing and transforming data: we can transform data, change its structure, clean it up, and much more.
  5. Security - it is important to spend time on the security design of the solution: for example, data encryption during storage, processing, and loading; authentication and authorization; and, finally, auditing.
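For the storage and catalog layers, data lakes commonly lay files out under Hive-style partitioned prefixes (year=/month=/day=), so catalog tools such as AWS Glue Crawler can register partitions automatically. A minimal sketch of building such a key; the prefix, table, and file names are invented:

```python
from datetime import date

def make_key(prefix, table, day, filename):
    """Build a Hive-style partitioned object key, e.g. for an S3 bucket."""
    return (f"{prefix}/{table}/"
            f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
            f"{filename}")

key = make_key("raw", "weblogs", date(2018, 7, 2), "part-0001.gz")
print(key)  # raw/weblogs/year=2018/month=07/day=02/part-0001.gz
```

Keeping the layout predictable like this is what lets query engines prune partitions and keeps the lake from turning into a swamp.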

From a practical point of view, we can characterize a data lake with three attributes:

  1. Collect and store everything - the data lake contains all data: both raw, unprocessed data for any period of time and processed/cleaned data.
  2. Deep exploration - the data lake allows users to explore and analyze the data.
  3. Flexible access - the data lake provides flexible access for different data and different scenarios.

Now we can talk about the difference between a data warehouse and a data lake. Usually people ask:

  • But what about the data warehouse?
  • Are we replacing the data warehouse with a data lake or are we expanding it?
  • Is it still possible to do without a data lake?

In short, there is no clear answer; it all depends on the situation, the skills on the team, and the budget. For example, an Amazon subsidiary, Woot, migrated its Oracle data warehouse to AWS and built a data lake - see Our data lake story: How Woot.com built a serverless data lake on AWS.

On the other hand, the vendor Snowflake says you no longer need to think about a data lake at all, since their data platform (until 2020 it was called a data warehouse) lets you combine a data lake and a data warehouse. I have not worked with Snowflake much, but it is a truly unique product that can do this. Price is another matter.

In conclusion, my personal opinion is that we still need a data warehouse as the main data source for reporting, and everything that does not fit there we store in the data lake. The whole role of analytics is to give the business convenient access to data for decision-making. Whatever one may say, business users work more efficiently with a data warehouse than with a data lake; for example, Amazon has Redshift (an analytical data warehouse) and Redshift Spectrum / Athena (SQL interfaces to a data lake in S3, based on Hive / Presto). The same applies to other modern analytical data warehouses.

Let's look at a typical data warehouse architecture:

[Image: typical data warehouse architecture]

This is a classic solution: we have source systems, we copy data into an analytical data warehouse with ETL/ELT, and we connect a Business Intelligence solution to it (my favorite is Tableau - what is yours?).

This solution has the following disadvantages:

  • ETL/ELT operations require time and resources.
  • As a rule, storage in an analytical data warehouse is not cheap (for example, Redshift, BigQuery, Teradata), since you have to buy a whole cluster.
  • Business users get access to clean and often aggregated data, and have no way to reach the raw data.

Of course, it all depends on your case. If you have no problems with your data warehouse, you do not need a data lake at all. But when you run into problems with space or capacity, or when price plays a key role, a data lake is worth considering. That is why the data lake is so popular. Here is an example of a data lake architecture:
[Image: example data lake architecture]
With the data lake approach, we load raw data into the lake (batch or streaming) and then process it as needed. The data lake lets business users create their own data transformations (ETL/ELT) or analyze the data in Business Intelligence solutions (if a suitable driver is available).

The goal of any analytics solution is to serve business users, so we must always work from the requirements of the business. (At Amazon, this is one of the principles: working backwards.)

Having worked with both the data warehouse and the data lake, we can compare the two solutions:

[Image: data warehouse vs data lake comparison table]

The main conclusion is that the data warehouse does not compete with the data lake; rather, it complements it. What suits your case is up to you to decide, and it is always interesting to try it yourself and draw your own conclusions.

I would also like to describe one of the cases where I started using the data lake approach. It is quite mundane: I first tried an ELT tool (we had Matillion ETL) and Amazon Redshift; that solution worked, but did not meet the requirements.

I needed to take web logs, transform and aggregate them, and provide the data for two use cases:

  1. The marketing team wanted to analyze bot activity for SEO
  2. IT wanted to look at site performance metrics

The logs were very simple. Here is an example:

https 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 
192.168.131.39:2817 10.0.0.1:80 0.086 0.048 0.037 200 200 0 57 
"GET https://www.example.com:443/ HTTP/1.1" "curl/7.46.0" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 
arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067
"Root=1-58337281-1d84f3d73c47ec4e58577259" "www.example.com" "arn:aws:acm:us-east-2:123456789012:certificate/12345678-1234-1234-1234-123456789012"
1 2018-07-02T22:22:48.364000Z "authenticate,forward" "-" "-"

One file weighed 1-4 megabytes.
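These are space-delimited AWS ALB access logs with quoted fields, so each line can be parsed with the standard library's shlex. A sketch of the parsing step, assuming the standard ALB field order (the trailing fields are truncated here with "..."):

```python
import shlex

# Shortened version of the sample log line above; trailing fields elided.
LINE = ('https 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 '
        '192.168.131.39:2817 10.0.0.1:80 0.086 0.048 0.037 200 200 0 57 '
        '"GET https://www.example.com:443/ HTTP/1.1" "curl/7.46.0" ...')

def parse_alb_line(line):
    """Split a quoted, space-delimited ALB log line into the fields we need."""
    f = shlex.split(line)
    method, url, _proto = f[12].split(" ", 2)  # request field: 'GET <url> HTTP/1.1'
    return {"time": f[1], "status": f[8], "method": method,
            "url": url, "user_agent": f[13]}

rec = parse_alb_line(LINE)
print(rec["status"], rec["method"], rec["user_agent"])  # 200 GET curl/7.46.0
```

shlex keeps the quoted request and user-agent strings intact, which a naive split on spaces would break apart.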

But there was one difficulty: we had 7 domains around the world, generating 7,000 files per day. That is not much volume, only about 50 gigabytes, but our Redshift cluster was also small (4 nodes), and loading one file in the traditional way took about a minute. In other words, the problem could not be solved head-on, and that was when I decided to use the data lake approach. The solution looked something like this:

[Image: data lake architecture for the log-processing case]

It is quite simple (I want to point out that a big advantage of working in the cloud is simplicity). I used:

  • AWS Elastic MapReduce (Hadoop) as the compute layer
  • AWS S3 as file storage, with data encryption and access restrictions
  • Spark as the in-memory compute engine, with PySpark for the logic and data transformation
  • Parquet as the Spark output format
  • AWS Glue Crawler as the metadata collector for new data and partitions
  • Redshift Spectrum as the SQL interface to the data lake for existing Redshift users

The smallest EMR + Spark cluster processed the entire batch of files in 30 minutes. There are other such cases at AWS, many of them related to Alexa, where there is a lot of data.
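At its core, the PySpark job parsed each line and aggregated the results for the two use cases. The same aggregation logic, sketched here in plain Python over a few invented sample records (the real job did this per domain and per day on the cluster):

```python
from collections import Counter

# Invented sample of already-parsed log records (the real input was the ALB logs).
records = [
    {"domain": "example.com", "status": "200", "agent": "curl/7.46.0"},
    {"domain": "example.com", "status": "200", "agent": "Googlebot/2.1"},
    {"domain": "example.com", "status": "404", "agent": "Googlebot/2.1"},
    {"domain": "example.de",  "status": "200", "agent": "bingbot/2.0"},
]

# IT view: status codes per domain (site health metrics).
status_per_domain = Counter((r["domain"], r["status"]) for r in records)

# Marketing view: bot activity for SEO, recognized by the user-agent string.
bots = Counter(r["agent"] for r in records if "bot" in r["agent"].lower())

print(status_per_domain[("example.com", "200")], bots["Googlebot/2.1"])  # 2 2
```

In Spark this becomes a groupBy over the parsed DataFrame, with the results written out as Parquet for Redshift Spectrum to query.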

More recently, I learned about one of the drawbacks of a data lake: GDPR. The problem is that when a client asks to be deleted, and their data sits somewhere inside one of the files, we cannot use the Data Manipulation Language and a DELETE statement as in a database.
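Since there is no DELETE on files, the usual workaround is copy-on-write: read the file, drop the records belonging to the user, and write the file back in its place. A minimal sketch over newline-delimited records; the record format and user ids are invented:

```python
import io

def scrub_user(src, dst, user_id):
    """Rewrite a file of comma-separated records, dropping those of user_id."""
    kept = 0
    for line in src:
        if line.split(",")[0] != user_id:  # first column holds the user id
            dst.write(line)
            kept += 1
    return kept

# In-memory stand-ins for the source and rewritten files.
src = io.StringIO("u1,click\nu2,view\nu1,purchase\n")
dst = io.StringIO()
print(scrub_user(src, dst, "u1"))  # 1 record kept; all of u1's records are gone
```

On real lakes this rewrite is expensive, which is why table formats that track files and support row-level deletes grew popular for GDPR workloads.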

I hope this article has clarified the difference between a data warehouse and a data lake. If it was interesting, I can also translate more of my own articles or articles by professionals I read, and describe the solutions I work with and their architecture.

Source: habr.com
