How we organized a highly efficient and inexpensive DataLake and why

We live in an amazing time when you can quickly wire together several ready-made open-source tools, configure them on autopilot following advice from Stack Overflow, without wading through the "wall of letters" in the documentation, and launch the result into production. And then, when you need to update or scale something, or someone accidentally reboots a couple of machines, you realize that a recurring bad dream has come true: everything has grown dramatically complicated beyond recognition, there is no way back, the future is foggy, and it feels safer to give up programming and take up beekeeping and cheesemaking instead.

It's not for nothing that more experienced colleagues, heads already gray from past bugs, watch the incredibly fast deployment of packs of "containers" in "cubes" (Kubernetes) across dozens of servers in "fashionable languages" with built-in support for asynchronous non-blocking I/O, and smile modestly. They silently keep rereading "man ps", digging through the nginx sources until their eyes bleed, and writing unit tests, on and on. These colleagues know that the most interesting part comes later, when "all this" one day goes down hard on New Year's Eve. And only a deep understanding of the nature of Unix, a memorized TCP/IP state table, and basic sort and search algorithms will help them bring the system back to life while the midnight chimes ring.

Oh yes, I got a little carried away, but I hope I managed to convey the sense of anticipation.
Today I want to share our experience of deploying a convenient and inexpensive DataLake stack that solves most of the analytical tasks in the company for completely different business units.

Some time ago we came to understand that the company increasingly needs the fruits of both product and technical analytics (not to mention the cherry on top in the form of machine learning), and that to understand trends and risks you have to collect and analyze more and more metrics.

Basic technical analytics in Bitrix24

Several years ago, simultaneously with the launch of the Bitrix24 service, we actively invested time and resources in creating a simple and reliable analytical platform that would help us quickly see problems in the infrastructure and plan the next step. Of course, we wanted ready-made tools that were as simple and understandable as possible. As a result, Nagios was chosen for monitoring and Munin for analytics and visualization. Now we have thousands of checks in Nagios and hundreds of charts in Munin, and colleagues use them daily and successfully. The metrics are clear, the graphs are clear, the system has worked reliably for several years, and new checks and graphs are added regularly: whenever we put a new service into operation, we add a few checks and graphs for it.

A finger on the pulse - advanced technical analytics

The desire to learn about problems "as quickly as possible" led us to active experiments with simple and understandable tools: Pinba and XHProf.

Pinba sent us statistics on the rendering speed of parts of web pages in PHP via UDP packets, and in the MySQL storage (Pinba ships with its own MySQL engine for fast event analytics) we could watch a short list of problems online and respond to them. And XHProf automatically collected execution traces of the slowest PHP pages from clients and let us analyze what led to the slowness, calmly, over tea or something stronger.
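The pattern behind Pinba is simple: the application fires off a tiny UDP packet per request and never waits for an answer, so instrumentation costs almost nothing. A minimal stand-in sketch (Pinba's real payload is a compact protobuf; the JSON payload and field names here are simplifying assumptions):

```python
import json
import socket

def send_timer_metric(host, port, script, elapsed_ms):
    """Fire-and-forget one timing metric over UDP, Pinba-style.

    Real Pinba uses a protobuf payload; JSON here is a simplification
    to show the fire-and-forget pattern only.
    """
    payload = json.dumps({"script": script, "elapsed_ms": elapsed_ms}).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload

# Local demo: bind a receiver, send one metric, read it back.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))             # OS picks a free port
port = receiver.getsockname()[1]
send_timer_metric("127.0.0.1", port, "/index.php", 42.5)
data, _ = receiver.recvfrom(4096)
receiver.close()
print(json.loads(data))
```

The sender never blocks on the collector, which is exactly why such metrics can be emitted from every PHP request without hurting latency.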

Some time ago the toolkit was extended with another fairly simple and understandable engine based on the inverted-index algorithm superbly implemented in the legendary Lucene library: Elasticsearch/Kibana. The simple idea of multi-threaded writing of documents built from log events into a Lucene inverted index, and quickly searching through them with faceting, turned out to be really useful.

Despite the rather technical look of the visualizations in Kibana, with low-level concepts like "bucket" leaking upward and a reinvented dialect of the not-yet-forgotten relational algebra, the tool became a good helper for us in tasks such as:

  • How many PHP errors did a Bitrix24 client have on the p1 portal in the last hour, and which ones? Understand, forgive, and fix quickly.
  • How many video calls were made on portals in Germany in the previous 24 hours, with what quality, and were there any problems with the channel or network?
  • How well does the system functionality (our C extension for PHP), compiled from source in the latest service update and rolled out to clients, actually work? Are there any segfaults?
  • Does client data fit into PHP memory? Are there errors about exceeding the memory allocated to processes, i.e. "out of memory"? Find and destroy.

Here is a concrete example. Despite careful multi-level testing, a client with a very non-standard case and corrupted input data hit an annoying and unexpected error; the siren sounded, and the process of quickly fixing it began:


Additionally, Kibana lets you set up notifications for specified events, and within a short time dozens of employees from different departments, from technical support and development to QA, began using the tool in the company.

It became convenient to track and measure the activity of any division in the company: instead of manually analyzing logs on the servers, it is enough to set up log parsing once and ship the logs to the Elasticsearch cluster, and then enjoy, for example, contemplating in a Kibana dashboard the number of two-headed kittens printed on a 3D printer and sold over the last lunar month.
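That one-time log parsing usually boils down to a small routine like the following sketch, which turns an access-log line into a JSON-ready document for Elasticsearch (the combined-log format and field names here are illustrative assumptions, not our actual configuration):

```python
import re

# Hypothetical access-log line format (combined-log style); the real
# parsing rules depend on your web server configuration.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_line(line):
    """Turn one access-log line into a dict ready for indexing, or None."""
    m = LOG_RE.match(line)
    if not m:
        return None
    doc = m.groupdict()
    doc["status"] = int(doc["status"])
    doc["bytes"] = int(doc["bytes"])
    return doc

line = '203.0.113.7 - - [10/Oct/2020:13:55:36 +0300] "GET /index.php HTTP/1.1" 200 2326'
print(parse_line(line))
```

Once every line becomes such a document, shipping it to the cluster and faceting over `status` or `path` in Kibana is the easy part.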

Basic business intelligence

Everyone knows that business intelligence in companies often starts with the intensive use of, yes, yes, Excel. The main thing is that it does not end there. Cloud-based Google Analytics adds fuel to the fire: you get used to the good things quickly.

In our harmoniously developing company, "prophets" of more intensive work with bigger data began to appear here and there. The need for deeper and more multifaceted reports arose regularly, and through the efforts of guys from different departments a simple and practical solution was put together some time ago: a combination of ClickHouse and PowerBI.

For quite a long time this flexible solution helped a lot, but gradually we came to understand that ClickHouse is not made of rubber and cannot be abused like that forever.

It is important to understand here that ClickHouse, like Druid, Vertica, and Amazon Redshift (which is based on Postgres), is an analytical engine optimized for fairly convenient analytics (sums, aggregations, minimums and maximums over a column, and a few joins), because it is organized around efficient storage of the columns of relational tables, in contrast to MySQL and the other row-oriented databases we know.

In essence, ClickHouse is just a more capacious "database", with not very convenient point inserts (that is by design, and it's fine), but with pleasant analytics and a set of interesting, powerful functions for working with data. Yes, you can even build a cluster, but, you understand, hammering nails with a microscope is not quite right, so we began to look for other solutions.
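The row-versus-column distinction described above can be shown with a toy sketch in plain Python (the data is invented; real columnar engines add compression and vectorized execution on top of this idea):

```python
# Toy illustration of why column-oriented storage helps analytics:
# an aggregate over one field touches one contiguous array instead of
# materializing every full row. Field names and values are made up.
rows = [
    {"user_id": i, "country": "DE", "duration_ms": i * 10}
    for i in range(1000)
]

# Row-oriented: every record is visited, all fields come along for the ride.
total_row = sum(r["duration_ms"] for r in rows)

# Column-oriented: the same data pivoted into per-column arrays;
# the aggregate reads only the single column it needs.
columns = {
    "user_id": [r["user_id"] for r in rows],
    "country": [r["country"] for r in rows],
    "duration_ms": [r["duration_ms"] for r in rows],
}
total_col = sum(columns["duration_ms"])

assert total_row == total_col  # same answer, very different I/O profile
print(total_col)
```

On disk the difference is dramatic: a row store must read every byte of every record to sum one field, while a column store reads one compressed column and skips the rest.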

Demand for Python and analysts

There are many developers in our company who have been writing code almost daily for 10-20 years in PHP, JavaScript, C#, C/C++, Java, Go, Rust, Python, and Bash. There are also many experienced system administrators who have lived through more than one absolutely incredible disaster that defies the laws of statistics (for example, when most of the disks in a RAID-10 are destroyed by a strong lightning strike). In such surroundings, for a long time it was unclear what a "Python analyst" even is. Python is like PHP, only the name is slightly longer and there are slightly fewer traces of mind-altering substances in the source code of the interpreter. However, as more and more analytical reports were created, experienced developers increasingly came to appreciate the importance of narrow specialization in tools like numpy, pandas, matplotlib, and seaborn.
The decisive role, most likely, was played by the sudden fainting of employees at the combination of the words "logistic regression" and by the demonstration of effective reporting on large data using, yes, yes, pyspark.
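For flavor, a minimal sketch of the kind of report that pulls developers toward pandas (the dataset and column names are invented for illustration):

```python
import pandas as pd

# Invented sample of call-quality events, the kind of thing our
# video-call logs might contain.
calls = pd.DataFrame({
    "country":  ["DE", "DE", "RU", "RU", "RU"],
    "quality":  [4.5, 3.9, 4.8, 4.1, 2.5],
    "duration": [120, 300, 45, 600, 90],
})

# Per-country summary in one expression: count, mean quality, total time.
report = (
    calls.groupby("country")
         .agg(calls=("quality", "size"),
              avg_quality=("quality", "mean"),
              total_seconds=("duration", "sum"))
         .reset_index()
)
print(report)
```

One readable expression replaces a page of hand-rolled loops, which is precisely the specialization argument the paragraph above is making.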

Apache Spark, with its functional paradigm onto which relational algebra maps well, and its capabilities, made such an impression on developers used to MySQL that the need to reinforce the ranks with experienced analysts became clear as day.

Further attempts to take off with Apache Spark/Hadoop, and what went wrong

However, it soon became clear that something was systematically not quite right with Spark, or that we simply needed to wash our hands better. If the Hadoop/MapReduce/Lucene stack was made by fairly experienced programmers, which is obvious if you look lovingly at the Java sources or at Doug Cutting's ideas in Lucene, then Spark is, suddenly, written in Scala, an exotic language that is highly debatable in terms of practicality and is not developing much now. And the regular failure of computations on the Spark cluster due to illogical and not very transparent memory allocation for reduce operations (many keys arriving at once) created a halo around it of something that still has room to grow. On top of that, the situation was aggravated by a large number of strangely open ports, temporary files growing in the most unexpected places, and jar-dependency hell, all of which gave the system administrators one feeling well known from childhood: fierce hatred (or maybe they should have washed their hands with soap).

As a result, we have "survived" several internal analytical projects that actively use Apache Spark (including Spark Streaming and Spark SQL) and the Hadoop ecosystem. Even though over time we learned to "cook" and monitor "it" well, and "it" practically stopped falling over suddenly when the nature of the data changed or the uniform hashing of RDDs became unbalanced, the desire to take something ready-made, updated and administered somewhere in the cloud grew stronger and stronger. It was at this time that we tried the ready-made cloud build from Amazon Web Services, EMR, and later tried to solve our problems on it. EMR is Apache Spark prepared by Amazon, with additional software from the ecosystem, much like the Cloudera/Hortonworks builds.

"Rubber" file storage for analytics - an urgent need

The experience of "cooking" Hadoop / Spark with burns of different parts of the body was not in vain. The need to create a single inexpensive and reliable file storage that would be resistant to hardware failures and in which it would be possible to store files in different formats from different systems and make effective and reasonable selections for reports based on this data has become increasingly clear.

We also wanted software updates of this platform not to turn into a New Year's nightmare of reading 20-page Java stack traces and analyzing kilometer-long detailed cluster logs with Spark History Server and a backlit magnifying glass. We wanted a simple and transparent tool that would not require regular dives under the hood whenever a developer's standard MapReduce job stopped running because a reduce worker fell out of memory due to a poorly chosen partitioning of the source data.

Is Amazon S3 a Candidate for DataLake?

Experience with Hadoop/MapReduce taught us that we need a scalable, reliable file system and, above it, scalable workers that "come" closer to the data so as not to drag the data across the network. The workers should be able to read data in different formats, but preferably should not read unnecessary information, and it should be possible to store data in advance in formats convenient for the workers.

Once again, the main idea. There is no desire to "pour" big data into a single clustered analytical engine, which sooner or later will choke and will have to be sharded in some ugly way. We want to store files, just files, in an understandable format, and to run effective analytical queries over them with different but understandable tools. And there will be more and more files in different formats. And it is better to shard not the engine, but the source data. We need an extensible and universal DataLake, we decided...

And what if we store the files in the familiar and well-known scalable cloud storage Amazon S3, without having to cook our own chops out of Hadoop?

It is clear that some of the data is a "no-no" to put there, but what about the rest, if we move it out there and "crunch it efficiently"?

The cluster big-data analytics ecosystem of Amazon Web Services - in very simple words

Judging by our experience with AWS, Apache Hadoop/MapReduce has long been actively used there under various sauces, for example in the DataPipeline service (I envy my colleagues there, they learned how to cook it properly). Here is how we set up backups of DynamoDB tables from different services:

And they have been running regularly on embedded Hadoop/MapReduce clusters like clockwork for several years now. "Set it and forget it":


You can also effectively engage in data satanism by spinning up Jupyter notebooks in the cloud for analysts and using the AWS SageMaker service for training AI models and deploying them into battle. Here is what it looks like for us:


And yes, you can spin up a notebook for yourself or for an analyst in the cloud, attach it to a Hadoop/Spark cluster, run the calculations, and then "nail down" everything:


This is really handy for individual analytical projects, and for some of them we have successfully used the EMR service for large-scale calculations and analytics. But what about a systemic solution for the DataLake - would it work? At this point we were on the verge of hope and despair and continued the search.

AWS Glue - neatly packaged Apache Spark on steroids

It turned out that AWS has "its own" version of the Hive/Pig/Spark stack. The role of Hive, i.e. the catalog of files and their types in the DataLake, is played by the "Data Catalog" service, which does not hide its compatibility with the Apache Hive format. You add to this service information about where your files live and what format they are in. The data can be not only in s3 but also in a database, but that is beyond this post. Here is how our DataLake data catalog is organized:


The files are registered, great. If files have been updated, we launch crawlers, by hand or on a schedule, which refresh the information about them from the lake and save it. Then the data from the lake can be processed and the results uploaded somewhere; in the simplest case we also upload them to s3. The data processing can be done anywhere, but the suggested route is to run it on an Apache Spark cluster using the advanced capabilities of the AWS Glue API. In fact, you can take good old familiar Python code using the pyspark library and schedule its execution on N nodes of a cluster of some capacity, with monitoring, without digging into Hadoop's guts or dragging around docker-schmocker containers and resolving dependency conflicts.

Once again, a simple idea. You don't need to configure Apache Spark itself; you just write Python code for pyspark, test it locally on your desktop, and then run it on a big cluster in the cloud, specifying where the source data is and where to put the result. Sometimes this is necessary and useful, and here is how we have it set up:
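The "test locally, then ship to the cluster" workflow is easiest when the per-record logic is a plain Python function. A sketch of the pattern (the field names and the Glue wiring are illustrative assumptions, not our real job):

```python
# The per-record transformation is kept as a plain Python function so it
# can be unit-tested on a desktop, and only then handed to pyspark in
# the Glue job. Field names below are invented.
def enrich(record):
    """One record in, one record out - trivially testable without a cluster."""
    out = dict(record)
    out["is_error"] = record["status"] >= 500
    return out

# Local "cluster of one": a plain map over a list stands in for the RDD.
raw = [{"status": 200, "path": "/a"}, {"status": 502, "path": "/b"}]
processed = list(map(enrich, raw))
print(processed)

# On the cluster, the very same function would be applied roughly as
#   spark.createDataFrame(raw).rdd.map(enrich)
# (a sketch; the Glue job boilerplate is omitted here).
```

Because the function has no Spark dependency, the local tests and the cloud job exercise exactly the same logic.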


Thus, when we need to compute something on a Spark cluster over data in s3, we write code in python/pyspark, test it, and wish it a good trip to the cloud.

What about orchestration? What if a task fails and disappears? Yes, the suggested route is to build a beautiful pipeline in the Apache Pig style, and we even tried those, but for now we decided to stick with our own deeply customized orchestration in PHP and JavaScript (I know, there is some cognitive dissonance here, but it has worked for years without errors).


The format of the files stored in the lake is the key to performance

It is very, very important to understand two more key points. To make queries over file data in the lake run as fast as possible, and to keep performance from degrading as new information is added, you need to:

  • Store the columns of the files separately (so you don't have to read all the rows to understand what is in the columns). For this we chose the parquet format with compression.
  • It is very important to shard files into folders in the spirit of: language, year, month, day, week. Engines that understand this type of sharding will look only into the right folders, without shoveling through all the data in a row.

In effect, this lays out the source data in the most efficient form for the analytical engines attached on top, which can selectively descend into the sharded folders and read only the necessary columns from the files. You don't need to "pour" the data into anything (the storage would simply burst); just put it into the file system intelligently, in the right format, from the start. Of course, it should be clear by now that storing a huge CSV file in the DataLake, which the cluster must first read line by line to extract the columns, is not very advisable. Think about the two points above again if it is not yet clear why all this matters.
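The effect of such folder sharding can be sketched in plain Python: a query over a date range only ever touches the matching prefixes, and everything else is never even listed (the Hive-style `year=/month=/day=` key layout below is illustrative):

```python
from datetime import date, timedelta

# Illustrative s3 key layout with Hive-style partitions, as Athena
# and Spark understand them.
all_keys = [
    "logs/year=2020/month=05/day=01/part-0000.parquet",
    "logs/year=2020/month=05/day=02/part-0000.parquet",
    "logs/year=2020/month=06/day=01/part-0000.parquet",
]

def prefixes_for_range(start, end):
    """Expand a date range into the partition prefixes it covers."""
    day, out = start, []
    while day <= end:
        out.append(f"logs/year={day.year}/month={day.month:02d}/day={day.day:02d}/")
        day += timedelta(days=1)
    return out

wanted = prefixes_for_range(date(2020, 5, 1), date(2020, 5, 2))
selected = [k for k in all_keys if any(k.startswith(p) for p in wanted)]
print(selected)  # only the two May keys; the June folder is never touched
```

Partition pruning plus columnar parquet is the whole trick: the engine reads only the needed folders, and inside them only the needed columns.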

AWS Athena - "hell" from a snuffbox

And then, while building the lake, we somehow in passing stumbled upon Amazon Athena. It suddenly turned out that by carefully laying out our huge log files into sharded folders in the correct (parquet) columnar format, you can very quickly run extremely informative selections over them and build reports WITHOUT any Apache Spark/Glue cluster at all.

The Athena engine running over data in s3 is based on the legendary Presto, a representative of the MPP (massively parallel processing) family of approaches to data processing, which takes data where it lies, from s3 and Hadoop to Cassandra and plain text files. You just ask Athena to execute a SQL query, and then everything "works quickly and by itself". Importantly, Athena is "smart": it goes only into the needed sharded folders and reads only the columns needed by the query.

The pricing of Athena queries is also interesting: you pay for the amount of data scanned. That is, not for machine-minutes in a cluster, but for the data actually scanned across the 100-500 machines, and only the data necessary to complete the query.

And by requesting only the necessary columns from correctly sharded folders, it turned out that the Athena service costs us tens of dollars a month. Great - almost free, compared to analytics on clusters!
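A back-of-the-envelope model of that pricing (the roughly $5 per terabyte scanned is Athena's long-standing list price, but check the current AWS pricing page before relying on it):

```python
# Athena bills per byte scanned, not per cluster-minute. The $5/TB
# figure is the long-standing list price - an assumption to verify
# against the current AWS pricing page.
PRICE_PER_TB_USD = 5.0

def athena_query_cost(scanned_gb):
    """Approximate cost in USD of one query that scans scanned_gb of data."""
    return scanned_gb / 1024 * PRICE_PER_TB_USD

# One pruned parquet column of a day's partition vs. an unpruned pile of CSV:
print(round(athena_query_cost(2), 4))      # 0.0098
print(round(athena_query_cost(2048), 2))   # 10.0
```

The arithmetic explains the "tens of dollars a month": with pruning and columnar files, individual queries scan gigabytes, not terabytes.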

By the way, here is how we shard our data in s3:


As a result, within a short time, completely different departments in the company, from information security to analytics, began actively querying Athena and quickly, in seconds, getting useful answers from "big" data over rather long periods: months, half a year, and so on.

But we went further and began going to the cloud for answers via the ODBC driver: an analyst writes a SQL query in a familiar console, which on 100-500 machines "for pennies" chews through the data in s3 and returns an answer, usually within a few seconds. Convenient. And fast. We still can't quite believe it.

As a result, having decided to store the data in s3 in an efficient columnar format, with reasonable sharding of the data into folders, we got a DataLake and a fast, cheap analytical engine practically for free. It became very popular in the company because it understands SQL and works orders of magnitude faster than anything involving starting, stopping, and configuring clusters. "And if the result is the same, why pay more?"

A query to Athena looks something like this. You can, of course, compose a sufficiently complex and multi-part SQL query, but here we will limit ourselves to a simple grouping. Let's see what response codes a client had over the last few weeks in the web server logs and make sure there are no errors:
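Since the console screenshot cannot be reproduced here, a sqlite3 stand-in shows the shape of such a grouping query (Athena speaks the Presto SQL dialect, but this simple GROUP BY looks the same; the table and data below are invented):

```python
import sqlite3

# In-memory stand-in for the Athena table over parquet in s3.
# Table name, columns, and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (path TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO access_log VALUES (?, ?)",
    [("/index.php", 200), ("/index.php", 200), ("/api/v1", 404), ("/index.php", 200)],
)

# The same grouping an analyst would run in the Athena console:
rows = conn.execute(
    """
    SELECT status, COUNT(*) AS hits
    FROM access_log
    GROUP BY status
    ORDER BY hits DESC
    """
).fetchall()
print(rows)  # [(200, 3), (404, 1)]
conn.close()
```

Against Athena the only real differences are the connection (ODBC/JDBC or the console) and the fact that the rows live in sharded parquet rather than an in-memory table.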


Conclusions

Having traveled a path that was not so much long as painful, constantly and soberly assessing the risks and the level of complexity and cost of support, we found a solution for a DataLake and analytics that never ceases to please us with both its speed and its cost of ownership.

It turned out that building an efficient, fast, and cheap-to-operate DataLake for the needs of completely different departments of the company is entirely within the power of even experienced developers who have never worked as architects, who cannot draw squares upon squares with arrows, and who do not know all 50 terms of the Hadoop ecosystem.

At the start of the journey, my head was splitting from the wild zoo of open and closed software and from the weight of responsibility to our descendants. Just start building your DataLake from simple tools: nagios/munin -> elastic/kibana -> Hadoop/Spark/s3..., collecting feedback and deeply understanding the physics of the processes under way. Everything complicated and murky - leave that to your enemies and competitors.

If you do not want to go to the cloud and you like maintaining, updating, and patching open-source projects, you can build a scheme similar to ours locally, on inexpensive office machines, with Hadoop and Presto on top. The main thing is not to stop, to keep moving forward, to count, and to look for simple and clear solutions - and everything will definitely work out! Good luck to everyone and see you soon!

Source: habr.com
