The market for distributed computing and big data, according to
Why does an ordinary business need distributed computing? The answer is simple and complicated at the same time. Simple, because in most cases we perform relatively simple calculations per unit of information. Complicated, because there is a lot of such information. A great deal. As a consequence, one has to spread the processing across many machines.
One recent example: Dodo Pizza
Tool selection
The industry standard for this kind of computing is Hadoop. Why? Because Hadoop is an excellent, well-documented framework (Habr itself offers many detailed articles on the topic), accompanied by a whole ecosystem of utilities and libraries. You can feed it huge sets of both structured and unstructured data, and the system itself will distribute them across the available computing capacity. Moreover, that capacity can be increased or taken offline at any time: the same horizontal scalability in action.
In 2017, the influential consulting company Gartner
Hadoop rests on several pillars, the most notable of which are MapReduce (a model for distributing computations across servers) and the HDFS file system. The latter is specifically designed to store information distributed between cluster nodes: each fixed-size block can be placed on several nodes, and thanks to this replication the system tolerates failures of individual nodes. Instead of a file table, a dedicated server called the NameNode is used.
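To make the replication idea concrete, here is a minimal sketch of how a NameNode-style catalog might map file blocks to nodes. This is a toy model for illustration only: the function name, the round-robin placement, and the node names are invented here, and real HDFS uses a more sophisticated, rack-aware placement policy.

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size, 128 MB
REPLICATION = 3                 # default HDFS replication factor

def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a NameNode-style mapping: block index -> nodes holding a replica."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    ring = itertools.cycle(nodes)
    for block in range(num_blocks):
        replicas = []
        while len(replicas) < min(replication, len(nodes)):
            node = next(ring)
            if node not in replicas:  # replicas of one block must sit on distinct nodes
                replicas.append(node)
        placement[block] = replicas
    return placement

# A 500 MB file splits into 4 blocks, each written to 3 of the 4 nodes:
plan = place_blocks(file_size=500 * 1024 * 1024,
                    nodes=["node1", "node2", "node3", "node4"])
for block, replicas in plan.items():
    print(block, replicas)
```

If any single node fails, every block still has at least two surviving replicas, which is exactly the resilience property the article describes.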
The illustration below shows how MapReduce works: at the first stage the data is split according to some attribute, at the second stage it is distributed across the computing nodes, and at the third stage the computation itself takes place.
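The three stages above can be sketched in a few lines of plain Python (no Hadoop involved) using the classic MapReduce example, a word count. The function names and the in-memory shuffle are simplifications: in a real cluster each chunk would live on a different node and the shuffle would move data over the network.

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit a (word, 1) pair for each word in the chunk
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped):
    # Shuffle: group intermediate pairs by key so each reducer sees one key
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values collected for each key
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data is big", "data is everywhere"]  # each chunk could sit on a different node
counts = reduce_phase(shuffle([map_phase(c) for c in chunks]))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The map calls are independent of each other, which is what lets Hadoop scatter them across as many machines as the cluster has.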
MapReduce was originally created by Google for the needs of its search engine. The code was later opened up and Apache took over the project, while Google itself gradually migrated to other solutions. An interesting nuance: at the moment Google has a project called Google Cloud Dataflow, positioned as the next step after Hadoop and as its quick replacement.
A closer look shows that Google Cloud Dataflow is based on Apache Beam, and Apache Beam in turn supports the well-documented Apache Spark framework as one of its execution engines, which lets us talk about almost the same execution speed in either case. Apache Spark, meanwhile, works fine on top of the HDFS file system, which allows it to be deployed on Hadoop servers.
Add to this the volume of documentation and ready-made solutions for Hadoop and Spark versus Google Cloud Dataflow, and the choice of tool becomes obvious. Moreover, engineers can decide for themselves which code, for Hadoop or for Spark, they will run, based on the task at hand, their experience, and their qualifications.
Cloud or local server
The general trend of moving to the cloud has even given rise to the term Hadoop-as-a-Service. In such a scenario, administering the connected servers becomes critical. Because, alas, despite its popularity, vanilla Hadoop is a rather difficult tool to configure: a great deal has to be done by hand. For example, servers are configured individually, their performance has to be monitored, and many parameters fine-tuned. In short, it is not work for an amateur, and there is a big chance of breaking something or overlooking something.
Therefore, various distributions that come with convenient deployment and administration tools out of the box have become very popular. One of the better-known distributions that supports Spark and makes life easier is Cloudera. It has both paid and free versions, and in the latter all the main functionality is available, with no limit on the number of nodes.
During setup, Cloudera Manager connects to your servers via SSH. An interesting point: during installation it is better to choose deployment via so-called parcels: special packages, each of which contains all the necessary components configured to work with one another. In effect, this is an improved version of a package manager.
After installation we get a cluster management console, where you can see telemetry for the clusters and the installed services, add or remove resources, and edit the cluster configuration.
As a result, the outline of the rocket that will carry you to the bright future of Big Data takes shape before you. But before we say "liftoff", let's take a look under the hood.
Hardware requirements
On its website, Cloudera describes several possible configurations. The general principles by which they are built are shown in the illustration:
MapReduce can blur this optimistic picture. A second look at the diagram from the previous section makes it clear that in almost all cases a MapReduce job can hit a bottleneck when reading data from disk or from the network; the Cloudera blog notes this as well. As a result, I/O speed is very important for any fast computation, including via Spark, which is often used for near-real-time processing. Therefore, when using Hadoop, it is very important that balanced and fast machines end up in the cluster, which, to put it mildly, is not always guaranteed in cloud infrastructure.
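A quick back-of-the-envelope calculation shows why disk throughput dominates. The numbers below (dataset size, disk counts, 100 MB/s of sequential throughput per disk) are illustrative assumptions, not figures from the article:

```python
def scan_time_minutes(data_tb, disks, mb_per_s_per_disk=100):
    """Lower bound on the time just to read a dataset off disk,
    assuming the read load is spread perfectly evenly across all disks."""
    total_mb = data_tb * 1024 * 1024
    return total_mb / (disks * mb_per_s_per_disk) / 60

# Scanning 10 TB on a cluster with 24 data disks vs. 96 data disks:
print(round(scan_time_minutes(10, 24), 1))  # ~72.8 minutes
print(round(scan_time_minutes(10, 96), 1))  # ~18.2 minutes
```

No amount of CPU helps here: until the blocks are off the disks, the map tasks have nothing to compute, which is why balanced I/O matters as much as core count.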
Balanced load distribution is achieved through OpenStack virtualization on servers with powerful multi-core CPUs. Data nodes are allocated their own processor resources and specific disks. In our Atos Codex Data Lake Engine solution we achieve broad virtualization, which is why we win both on performance (the impact of the network infrastructure is minimized) and on TCO (extra physical servers are eliminated).
When BullSequana S200 servers are used, we get a very uniform load, free of some of these bottlenecks. The minimum configuration includes three BullSequana S200 servers, each with two JBODs; additional S200s containing four data nodes each can optionally be connected. Here is an example of the load in a TeraGen test:
Tests with different data volumes and replication factors show the same results in terms of load distribution across cluster nodes. Below is a graph of the distribution of disk access during the performance tests.
The calculations are based on the minimum configuration of three BullSequana S200 servers. It includes 9 data nodes and 3 master nodes, as well as virtual machines reserved for deploying protection based on OpenStack Virtualization. TeraSort test result: with a 512 MB block size, a replication factor of three, and encryption enabled, the run takes 23.1 minutes.
How can the system be expanded? Various types of extensions are available for the Data Lake Engine:
- Data nodes: for every 40 TB of usable space
- Analytic nodes with the ability to install a GPU
- Other options depending on business needs (for example, if you need Kafka and the like)
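As a sanity check on the sizing of those 40 TB data-node increments, here is the arithmetic that replication imposes on raw disk. The helper name and the optional scratch-space overhead are assumptions for illustration, not an Atos or Cloudera sizing rule; the replication factor of three matches the one quoted in the TeraSort test above:

```python
def raw_capacity_tb(usable_tb, replication=3, overhead=0.0):
    """Raw disk space needed to provide a given usable capacity under HDFS replication.

    `overhead` is an optional fraction reserved for temporary/intermediate
    job data (an illustrative assumption, not a vendor sizing rule).
    """
    return usable_tb * replication * (1 + overhead)

# One 40 TB data-node increment at a replication factor of three:
print(raw_capacity_tb(40))                 # 120.0 TB of raw disk
print(raw_capacity_tb(40, overhead=0.25))  # 150.0 TB with a 25% scratch-space reserve
```

In other words, every "40 TB of usable space" quietly implies three times that in physical disks, which is worth remembering when comparing the TCO of such a cluster with single-copy storage.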
The Atos Codex Data Lake Engine complex includes both the servers themselves and pre-installed software: the Cloudera kit with a license, Hadoop itself, OpenStack with virtual machines based on the Red Hat Enterprise Linux kernel, and data replication and backup systems (including a backup node and Cloudera BDR, Backup and Disaster Recovery). Atos Codex Data Lake Engine is the first virtualization solution to be certified
If you are interested in the details, we will be happy to answer your questions in the comments.
Source: habr.com