What is special about Cloudera and how to cook it

The market for distributed computing and big data is, according to statistics, growing at 18-19% per year. This means that choosing software for these purposes remains a relevant question. In this post, we will start with why distributed computing is needed, then dwell in more detail on the choice of software, talk about using Hadoop with Cloudera, and finally discuss the choice of hardware and the different ways it affects performance.

Why does an ordinary business need distributed computing? Everything here is simple and complicated at the same time. Simple - because in most cases we perform relatively straightforward calculations per unit of information. Complicated - because there is a lot of such information. A whole lot. As a consequence, terabytes of data have to be processed in 1000 threads. So the use cases are quite universal: these calculations can be applied wherever a large number of metrics has to be computed over an even larger array of data.

One recent example: Dodo Pizza determined, based on an analysis of its customer order base, that when choosing a pizza with arbitrary toppings, users usually operate with only six basic sets of ingredients plus a couple of random ones. The pizzeria adjusted its purchasing accordingly. In addition, it was able to better recommend the add-on products offered at the ordering stage, which increased its profits.

One more example: merchandise analysis allowed H&M to reduce the assortment in individual stores by 40% while maintaining the level of sales. This was achieved by excluding poorly selling items, with seasonality taken into account in the calculations.

Tool selection

The industry standard for this kind of computing is Hadoop. Why? Because Hadoop is an excellent, well-documented framework (Habr alone offers many detailed articles on the topic), accompanied by a whole set of utilities and libraries. You can feed it huge sets of both structured and unstructured data, and the system itself distributes them across the available computing capacity. Moreover, that capacity can be expanded or taken offline at any time - the same horizontal scalability in action.

In 2017, the influential consulting company Gartner concluded that Hadoop would soon become obsolete. The reasoning is fairly mundane: analysts believe companies will massively migrate to the cloud, where they can pay based on the computing power they actually use. The second important factor supposedly capable of "burying" Hadoop is speed: options like Apache Spark or Google Cloud Dataflow are faster than the MapReduce engine underlying Hadoop.

Hadoop rests on several pillars, the most notable of which are MapReduce (a system for distributing computations across servers) and the HDFS file system. The latter is specifically designed to store information distributed among cluster nodes: each block of a fixed size can be placed on several nodes, and thanks to replication the system is resilient to failures of individual nodes. Instead of a file table, a special server called the NameNode is used.
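For illustration, here is a minimal sketch of talking to HDFS from Python over the WebHDFS interface. It assumes the third-party `hdfs` package and a NameNode whose web interface is reachable at the hypothetical address http://namenode:9870; the field names come from the standard WebHDFS file status response.

    from hdfs import InsecureClient  # third-party WebHDFS client: pip install hdfs

    # Hypothetical NameNode address; 9870 is the default WebHDFS port in Hadoop 3.
    client = InsecureClient('http://namenode:9870', user='hdfs')

    # Write a small file; HDFS splits it into fixed-size blocks and replicates them.
    client.write('/demo/orders.csv', data='order_id,pizza\n1,pepperoni\n', overwrite=True)

    # The NameNode keeps the metadata: block size, replication factor, file length, etc.
    meta = client.status('/demo/orders.csv')
    print(meta['blockSize'], meta['replication'], meta['length'])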

The illustration below shows how MapReduce works. At the first stage, the data is split according to some attribute; at the second, it is distributed across the computing nodes; at the third, the calculation itself takes place.

[Illustration: the stages of a MapReduce job]
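To make the three stages concrete, here is a minimal word-count sketch in the spirit of Hadoop Streaming, where the mapper and reducer are plain scripts that read stdin and write stdout. The file name and data set are hypothetical; the shuffle stage (sorting mapper output by key) is performed by Hadoop itself between the two functions.

    # wordcount_streaming.py - used as the mapper or the reducer of a Hadoop Streaming job,
    # e.g. -mapper "python wordcount_streaming.py map" -reducer "python wordcount_streaming.py reduce"
    import sys
    from itertools import groupby

    def mapper():
        # Stages 1-2: each mapper receives its own split of the input data.
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reducer():
        # Stage 3: Hadoop has already sorted the pairs by key (the shuffle),
        # so equal words arrive as consecutive lines and can be summed per group.
        pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print(f"{word}\t{sum(int(count) for _, count in group)}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()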
MapReduce was originally created by Google for the needs of its search engine. Then MapReduce was released as open source and Apache took over the project, while Google gradually migrated to other solutions. An interesting nuance: Google currently has a project called Google Cloud Dataflow, positioned as the next step after Hadoop and as its quick replacement.

A closer look shows that Google Cloud Dataflow is based on a variation of Apache Beam, while Apache Beam includes a runner for the well-documented Apache Spark framework, which lets us talk about practically the same execution speed for either solution. And Apache Spark works fine on the HDFS file system, which allows it to be deployed on Hadoop servers.
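The point can be illustrated with a short Apache Beam (Python SDK) sketch: the same pipeline can be handed to the Spark runner on your own Hadoop cluster or to Dataflow in Google's cloud simply by changing the runner option. The paths here are hypothetical placeholders, and running against HDFS additionally requires Beam's Hadoop filesystem to be configured.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Swap "SparkRunner" for "DataflowRunner" (plus GCP options) or "DirectRunner" for local tests.
    options = PipelineOptions(runner="SparkRunner")

    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | "Read" >> beam.io.ReadFromText("hdfs://namenode/data/orders/*.csv")
         | "DropComments" >> beam.Filter(lambda line: line and not line.startswith("#"))
         | "Write" >> beam.io.WriteToText("hdfs://namenode/data/filtered/orders"))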

Add to this the volume of documentation and ready-made solutions for Hadoop and Spark compared to Google Cloud Dataflow, and the choice of tool becomes obvious. Moreover, engineers can decide for themselves which code - Hadoop MapReduce or Spark - they will run, depending on the task, their experience and their qualifications.
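For comparison, the word count from the MapReduce sketch above takes only a few lines in Spark. This is a hedged example assuming a PySpark installation and hypothetical HDFS paths; on a Cloudera cluster it would typically be submitted with spark-submit under YARN.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    lines = spark.sparkContext.textFile("hdfs:///data/orders/*.csv")  # hypothetical input path
    counts = (lines.flatMap(lambda line: line.split())   # map: line -> words
                   .map(lambda word: (word, 1))          # map: word -> (word, 1)
                   .reduceByKey(lambda a, b: a + b))     # shuffle + reduce handled by Spark
    counts.saveAsTextFile("hdfs:///data/out/wordcount")  # hypothetical output path

    spark.stop()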

Cloud or local server

The trend towards a general move to the cloud has even given rise to an interesting term: Hadoop-as-a-service. In such a scenario, the administration of the connected servers becomes very important. Because, alas, despite its popularity, pure Hadoop is a rather difficult tool to configure: a great deal has to be done by hand. For example, servers have to be configured individually, their performance monitored, and many parameters fine-tuned. All in all, it is work for enthusiasts, with a big chance of messing something up or missing something.
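To give a feel for what "by hand" means, here is a sketch that walks every node over SSH using the paramiko library to check disk usage and whether the DataNode process is alive. The host names and paths are hypothetical; in real life such checks and configuration changes have to be repeated, validated and kept in sync across the whole fleet.

    import paramiko

    # Hypothetical host names; with pure Hadoop each node is visited and checked by hand.
    NODES = ["hadoop-node-01", "hadoop-node-02", "hadoop-node-03"]

    for host in NODES:
        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        ssh.connect(host, username="hadoop")   # assumes key-based auth is already in place
        # Check data-disk usage and whether the DataNode process is running on this node.
        _, stdout, _ = ssh.exec_command("df -h /data; jps | grep DataNode")
        print(f"--- {host} ---\n{stdout.read().decode()}")
        ssh.close()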

Therefore, various distributions that come with convenient deployment and administration tools out of the box have become very popular. One of the more popular distributions that supports Spark and makes life easier is Cloudera. It has both paid and free versions - and in the latter all the core functionality is available, with no limit on the number of nodes.


During setup, Cloudera Manager connects to your servers via SSH. An interesting point: during installation it is better to choose the so-called parcels: special packages, each of which contains all the necessary components, pre-configured to work with each other. In essence, this is an improved version of a package manager.

After installation, we get a cluster management console where you can see cluster telemetry and the installed services, add or remove resources, and edit the cluster configuration.
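The same information that the console shows can also be pulled over the Cloudera Manager REST API. The sketch below is only an assumption-laden illustration: the host name and credentials are placeholders, and the API version prefix (here v19) depends on the Cloudera Manager release.

    import requests

    CM = "http://cm-host:7180/api/v19"   # hypothetical Cloudera Manager host; v19 depends on the release
    AUTH = ("admin", "admin")            # default credentials, to be changed in any real deployment

    # List the clusters managed by this Cloudera Manager instance ...
    for cluster in requests.get(f"{CM}/clusters", auth=AUTH).json()["items"]:
        print(cluster["name"], cluster.get("fullVersion"))
        # ... and, for each cluster, its services with their state and health summary.
        services = requests.get(f"{CM}/clusters/{cluster['name']}/services", auth=AUTH).json()["items"]
        for service in services:
            print("   ", service["type"], service["serviceState"], service["healthSummary"])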

[Screenshot: the Cloudera Manager cluster management console]

As a result, the control panel of that very rocket, the one that will carry you into the bright future of BigData, appears in front of you. But before we say "let's go," let's take a look under the hood.

Hardware requirements

On its website, Cloudera mentions various possible configurations. The general principles by which they are built are shown in the illustration:

[Illustration: general principles of Cloudera hardware configurations]
MapReduce can blur this optimistic picture. Looking again at the diagram from the previous section, it becomes clear that in almost all cases a MapReduce job can hit a bottleneck when reading data from disk or over the network. This is also noted in the Cloudera blog. As a result, I/O speed is very important for any fast calculations, including those done with Spark, which is often used for near-real-time processing. Therefore, when using Hadoop, it is very important that the cluster is made up of balanced and fast machines, which, to put it mildly, is not always the case in cloud infrastructure.

Balanced load distribution is achieved through the use of OpenStack virtualization on servers with powerful multi-core CPUs. Data nodes are allocated their own processor resources and dedicated disks. In our solution, Atos Codex Data Lake Engine, broad virtualization is achieved, which is why we win both in performance (the impact of the network infrastructure is minimized) and in TCO (extra physical servers are eliminated).

When BullSequana S200 servers are used, we get a very uniform load, free of some of these bottlenecks. The minimum configuration includes 3 BullSequana S200 servers, each with two JBODs, and additional S200s containing four data nodes can optionally be connected. Here is an example of the load in a TeraGen test:

[Chart: node load during a TeraGen test]

Tests with different data volumes and replication factors show the same results in terms of load distribution across cluster nodes. Below is a graph of disk access distribution from the performance tests.

[Chart: disk access distribution in the performance tests]

The calculations are based on the minimum configuration of 3 BullSequana S200 servers. It includes 9 data nodes and 3 master nodes, plus reserved virtual machines in case protection based on OpenStack virtualization is deployed. TeraSort test result: with a 512 MB block size, a replication factor of three and encryption enabled, the run takes 23.1 minutes.
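For reference, this kind of TeraGen/TeraSort run is typically launched from the standard Hadoop examples jar. The sketch below wraps the usual commands in Python; the jar path is the one commonly found in a CDH parcel installation and the row count is a placeholder, so treat both as assumptions.

    import subprocess

    # Typical location of the examples jar in a CDH parcel install (may differ per release).
    EXAMPLES_JAR = "/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"

    ROWS = 10_000_000_000            # 10^10 rows x 100 bytes each = roughly 1 TB of test data
    GEN_DIR = "/benchmarks/teragen"  # hypothetical HDFS paths
    SORT_DIR = "/benchmarks/terasort"
    TUNING = ["-Ddfs.blocksize=512m", "-Ddfs.replication=3"]  # block size and replication quoted above

    # Generate the synthetic data set, then sort it; the TeraSort wall-clock time is the quoted figure.
    subprocess.run(["hadoop", "jar", EXAMPLES_JAR, "teragen", *TUNING, str(ROWS), GEN_DIR], check=True)
    subprocess.run(["hadoop", "jar", EXAMPLES_JAR, "terasort", *TUNING, GEN_DIR, SORT_DIR], check=True)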

How can the system be expanded? Various types of extensions are available for the Data Lake Engine:

  • Data nodes: for every 40 TB of usable space
  • Analytic nodes with the ability to install a GPU
  • Other options depending on business needs (for example, if you need Kafka and the like)


The Atos Codex Data Lake Engine complex includes both the servers themselves and the pre-installed software: the licensed Cloudera kit, Hadoop itself, OpenStack with virtual machines based on the Red Hat Enterprise Linux kernel, and data replication and backup systems (including a backup node and Cloudera BDR - Backup and Disaster Recovery). Atos Codex Data Lake Engine was the first virtualization solution to be certified by Cloudera.

If you are interested in the details, we will be happy to answer your questions in the comments.

Source: habr.com
