Release of Apache Hadoop 3.3, a platform for distributed data processing

After a year and a half of development, the Apache Software Foundation has published the release of Apache Hadoop 3.3.0, a free platform for organizing distributed processing of large amounts of data using the map/reduce paradigm, in which a task is divided into many smaller isolated fragments, each of which can be run on a separate cluster node. Hadoop-based storage can span thousands of nodes and contain exabytes of data.
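
The map/reduce paradigm described above can be illustrated with a minimal single-machine sketch in Python. It does not use Hadoop's API: the function names (`map_fragment`, `shuffle`, `reduce_counts`, `word_count`) and the `fragment_size` parameter are assumptions made for this illustration, and threads merely stand in for cluster nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_fragment(lines):
    """Map step: turn one isolated fragment of input into (word, 1) pairs."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_fragments):
    """Group the values emitted by all map tasks by key."""
    grouped = defaultdict(list)
    for pairs in mapped_fragments:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, fragment_size=2, workers=4):
    # Split the job into small fragments that do not depend on each other.
    fragments = [lines[i:i + fragment_size]
                 for i in range(0, len(lines), fragment_size)]
    # Threads stand in for cluster nodes; each fragment is mapped independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_fragment, fragments))
    return reduce_counts(shuffle(mapped))
```

Because each fragment is mapped without reference to any other, the map calls could in principle run on different machines; only the grouping and reduce steps need to see results from all of them.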

Hadoop includes an implementation of the Hadoop Distributed Filesystem (HDFS), which provides automatic data redundancy and is optimized for MapReduce applications. To simplify access to data in Hadoop storage, the HBase database and the SQL-like Pig language were developed; Pig is a kind of SQL for MapReduce whose queries can be parallelized and processed by several Hadoop platforms. The project is considered fully stable and ready for production use. Hadoop is actively used in large industrial projects, providing capabilities similar to Google's Bigtable/GFS/MapReduce platform, and Google has officially granted Hadoop and other Apache projects the right to use technologies covered by patents related to the MapReduce method.

Hadoop ranks first among Apache repositories by the number of changes made and fifth by the size of the code base (about 4 million lines of code). Large Hadoop deployments include Netflix (more than 500 billion events stored per day), Twitter (a cluster of 10 thousand nodes stores more than a zettabyte of data in real time and processes more than 5 billion sessions per day), and Facebook (a cluster of 4 thousand nodes stores over 300 petabytes and grows by 4 petabytes a day).

Key changes in Apache Hadoop 3.3:

  • Added support for platforms based on the ARM architecture.
  • The implementation of the protobuf (Protocol Buffers) format, used to serialize structured data, has been updated to release 3.7.1 because the protobuf-2.5.0 branch has reached end of life.
  • The capabilities of the S3A connector have been expanded: support for token-based authentication (Delegation Token) has been added, caching of responses with code 404 has been improved, S3Guard performance has been increased, and reliability has been improved.
  • In the ABFS file system, problems with automatic tuning have been resolved.
  • Added native support for the Tencent Cloud COS file system, used to access COS object storage.
  • Added full support for Java 11.
  • The implementation of HDFS RBF (Router-based Federation) has been stabilized. Added security controls to HDFS Router.
  • A DNS Resolution service has been added that lets the client discover servers by hostname via DNS, so that not every host has to be listed in the settings.
  • Added support for scheduling the launch of opportunistic containers through the centralized resource manager (ResourceManager), including the ability to distribute containers according to the load of each node.
  • Added a searchable catalog of YARN (Yet Another Resource Negotiator) applications.
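
The idea of distributing containers according to node load, mentioned in the list above, can be sketched as a simple greedy placement. This is an illustration only, not the algorithm YARN's ResourceManager uses: the function `place_containers` and the load metric (the number of containers already running or queued on a node) are assumptions made for this example.

```python
import heapq


def place_containers(node_loads, container_count):
    """Assign containers one at a time to the currently least-loaded node.

    node_loads maps a node name to a hypothetical load metric (containers
    already running or queued). Returns node name -> newly placed containers.
    """
    if not node_loads:
        raise ValueError("at least one node is required")
    # Min-heap ordered by (load, name), so ties are broken deterministically.
    heap = [(load, name) for name, load in node_loads.items()]
    heapq.heapify(heap)
    placed = {name: 0 for name in node_loads}
    for _ in range(container_count):
        load, name = heapq.heappop(heap)
        placed[name] += 1
        # The chosen node is now busier; push it back with the updated load.
        heapq.heappush(heap, (load + 1, name))
    return placed
```

For example, with loads `{"n1": 3, "n2": 0, "n3": 1}` and four new containers, the idle node `n2` receives most of the work and the busy node `n1` receives none until the others have caught up.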

Source: opennet.ru
