Apache Bigtop and choosing a Hadoop distribution today


It's probably no secret that the past year has been a year of great change for Apache Hadoop. Cloudera and Hortonworks merged (essentially a takeover of the latter), and MapR, facing serious financial problems, was sold to Hewlett Packard Enterprise. Whereas a few years ago, for on-premises installations, the choice usually came down to Cloudera versus Hortonworks, today, alas, that choice no longer exists. Another surprise was that since February of this year, Cloudera has stopped publishing binary builds of its distribution to the public repository; they are now available only by paid subscription. Of course, you can still download the latest versions of CDH and HDP released before the end of 2019, and support for them is expected for another one to two years. But what comes after that? For those who were already paying for a subscription, nothing has changed. But for those who do not want to switch to the paid version of the distribution, yet still want to receive the latest versions of the cluster components, along with patches and other updates, we have prepared this article. In it, we will consider possible ways out of this situation.

The article is more of an overview. It will not compare distributions or analyze them in detail, and there will be no recipes for installing and configuring them. So what will it cover? We will briefly talk about Arenadata Hadoop, a distribution that has rightfully earned our attention thanks to its availability, which is a rarity today. And then we'll talk about vanilla Hadoop, mainly about how it can be "cooked" using Apache Bigtop. Ready? Then read on.

Arenadata Hadoop


This is a completely new and, for now, little-known distribution of Russian origin. Unfortunately, at the moment the only coverage of it on Habr is this article.

More information can be found on the official project website. The latest versions of the distribution are based on Hadoop 3.1.2 for version 3, and Hadoop 2.8.5 for version 2.

Roadmap information can be found here.

Arenadata Cluster Manager Interface

Arenadata's key product is Arenadata Cluster Manager (ADCM), which is used to install, configure, and monitor the company's various software solutions. ADCM is distributed free of charge, and its functionality is extended by adding bundles to it, each of which is a set of Ansible playbooks. Bundles come in two types: enterprise and community. The latter are available for free download from the Arenadata website. You can also develop your own bundle and connect it to ADCM.

For deploying and managing Hadoop 3, a community version of the bundle is offered in conjunction with ADCM; for Hadoop 2, the only alternative is Apache Ambari. As for the package repositories, they are open to the public, and packages for all cluster components can be downloaded and installed in the usual way. Overall, the distribution looks very interesting. I'm sure there are those who are used to solutions such as Cloudera Manager and Ambari and who will like ADCM itself. For some, it will also be a huge plus that the distribution is included in the Russian software registry for import substitution.

As for the downsides, they are the same as for all other Hadoop distributions, namely:

  • The so-called "vendor lock-in". The example of Cloudera and Hortonworks has already shown that there is always a risk of a change in vendor policy.
  • A significant lag behind the Apache upstream.

Vanilla Hadoop


As you know, Hadoop is not a monolithic product but, in fact, a whole galaxy of services around its distributed file system, HDFS. Few people will be satisfied with a bare file cluster. Some need Hive, others Presto, then there are HBase and Phoenix, and Spark is used more and more. Oozie, Sqoop, and Flume are sometimes encountered for orchestration and data loading. And if the question of security arises, Kerberos in conjunction with Ranger immediately comes to mind.

Binary versions of the Hadoop components are available on the website of each ecosystem project in the form of tarballs. You can download them and start the installation, but with one caveat: besides having to build packages yourself from the "raw" binaries, which you most likely do not want to do, you will have no confidence that the downloaded versions of the components are compatible with each other. The preferred option is to build with Apache Bigtop. Bigtop can build from the Apache Maven repositories, run tests, and build packages. But, what is very important for us, Bigtop builds versions of the components that are compatible with each other. We will talk about it in more detail below.

Apache Bigtop


Apache Bigtop is a tool for building, packaging, and testing a number of open source projects, such as Hadoop and Greenplum. Bigtop has had plenty of releases. At the time of writing, the latest stable release was version 1.4, and master was at 1.5. Different releases use different versions of the components. For example, in 1.4 the Hadoop core components are at version 2.8.5, while in master they are at 2.10.0. The list of supported components also changes: something old and unmaintained goes away, and something new and more in demand takes its place, not necessarily from the Apache family itself.

In addition, Bigtop has many forks.

When we began to get acquainted with Bigtop, we were first of all surprised by its modest prevalence and fame compared with other Apache projects, as well as by its very small community. It follows that there is minimal information about the product, and searching forums and mailing lists for solutions to problems that arise may yield nothing at all. At first, completing a full build of the distribution turned out to be a difficult task for us due to peculiarities of the tool itself, but we will get to that a little later.

As a teaser: those who were once into such projects of the Linux universe as Gentoo and LFS may find it nostalgically pleasant to work with this thing and remember those "epic" times when we ourselves looked for (or even wrote) ebuilds and regularly rebuilt Mozilla with new patches.

The big advantage of Bigtop is the openness and versatility of the tools it is based on: Gradle and Apache Maven. Gradle is fairly well known as the tool Google builds Android with. It is flexible and, as they say, "battle-tested". Maven is the standard tool for building projects in Apache itself, and since most of its products are released through Maven, Bigtop could not do without it either. It is worth paying attention to the POM (project object model), the "fundamental" XML file describing everything Maven needs to work with your project, around which all the work is built. It is precisely in the Maven part that newcomers to Bigtop usually run into some of the hurdles.

Practice

So where should you start? Go to the download page and download the latest stable version as an archive. You can also find binary artifacts compiled by Bigtop there. By the way, of the common package managers, YUM and APT are supported.

Alternatively, you can download the latest stable release directly from GitHub:

$ git clone --branch branch-1.4 https://github.com/apache/bigtop.git

Cloning into 'bigtop'...

remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 40217 (delta 14), reused 10 (delta 1), pack-reused 40171
ΠŸΠΎΠ»ΡƒΡ‡Π΅Π½ΠΈΠ΅ ΠΎΠ±ΡŠΠ΅ΠΊΡ‚ΠΎΠ²: 100% (40217/40217), 43.54 MiB | 1.05 MiB/s, Π³ΠΎΡ‚ΠΎΠ²ΠΎ.
ΠžΠΏΡ€Π΅Π΄Π΅Π»Π΅Π½ΠΈΠ΅ ΠΈΠ·ΠΌΠ΅Π½Π΅Π½ΠΈΠΉ: 100% (20503/20503), Π³ΠΎΡ‚ΠΎΠ²ΠΎ.
Updating files: 100% (1998/1998), Π³ΠΎΡ‚ΠΎΠ²ΠΎ.

The resulting ./bigtop directory looks something like this:

./bigtop-bigpetstore - demo applications, synthetic examples
./bigtop-ci - CI toolkit, jenkins
./bigtop-data-generators - data generation, synthetics, for smoke tests, etc.
./bigtop-deploy - deployment tools
./bigtop-packages - configs, scripts, build patches, the main part of the tool
./bigtop-test-framework - testing framework
./bigtop-tests - the tests themselves, load and smoke
./bigtop_toolchain - build environment, preparing the environment for the tool to work
./build - build working directory
./dl - directory for downloaded sources
./docker - build in docker images, testing
./gradle - gradle config
./output - directory where build artifacts go
./provisioner - provisioning

The most interesting thing for us at this stage is the main config, ./bigtop/bigtop.bom, in which we see all the supported components with their versions. This is where we can specify a different version of a product (if we suddenly want to try building it) or a build version (if, for example, we added a significant patch).
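
As an illustration, a component entry in the BOM looks roughly like this. This sketch is based on the structure of the 1.4-era bigtop.bom; the DSL and field names may differ between Bigtop releases, so treat it as a guide, not a reference:

```groovy
'hadoop' {
    name    = 'hadoop'
    relNotes = 'Apache Hadoop'
    // Bump 'base' here to try building a different upstream version
    version { base = '2.8.5'; pkg = base; release = 1 }
    tarball { destination = "${name}-${version.base}.tar.gz"
              source      = "${name}-${version.base}-src.tar.gz" }
    url     { download_path = "/$name/common/$name-${version.base}"
              site    = "${apache.APACHE_MIRROR}/${download_path}"
              archive = "${apache.APACHE_ARCHIVE}/${download_path}" }
}
```

Changing the version here is exactly the kind of edit that may then require extra patches under bigtop-packages, since the build scripts are written against the versions the release was tested with.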

Also of great interest is the subdirectory ./bigtop/bigtop-packages, which is directly related to the process of building the components and packaging them.

So, we have downloaded the archive and unpacked it, or cloned from GitHub; can we start building?

No, let's prepare the environment first.

Preparing the Environment

And here a small digression is needed. Building almost any reasonably complex product requires a certain environment: in our case this is the JDK, various shared libraries, header files, and so on, plus tools such as ant, ivy2, and much more. One option for getting the right environment for Bigtop is to install the necessary components on the build host. I may be wrong about the chronology, but it seems that since version 1.0 there has also been an option to build in preconfigured and publicly available Docker images, which you can find here.

As for preparing the environment, there is a helper for this: Puppet.

You can use the following commands; they are run from the tool's root directory, ./bigtop:

./gradlew toolchain
./gradlew toolchain-devtools
./gradlew toolchain-puppetmodules

Or directly via puppet:

puppet apply --modulepath=<path_to_bigtop> -e "include bigtop_toolchain::installer"
puppet apply --modulepath=<path_to_bigtop> -e "include bigtop_toolchain::deployment-tools"
puppet apply --modulepath=<path_to_bigtop> -e "include bigtop_toolchain::development-tools"

Unfortunately, difficulties can arise even at this stage. The general advice here is to use a supported, up-to-date distribution on the build host, or to try the Docker path.

Assembly

What can we try to build? The answer is given by the output of the command

./gradlew tasks

In the Package tasks section there are a number of products that are Bigtop's end artifacts. They can be identified by the suffix -rpm or -pkg-ind (in the case of building in Docker). In our case, the most interesting is Hadoop.

Let's try to build in our build server environment:

./gradlew hadoop-rpm

Bigtop will download the sources needed for the particular component and start the build. Thus the tool's operation is tied to the Maven repositories and other sources, that is, it needs Internet access.

During operation, standard output is generated. Sometimes it, together with the error messages, is enough to understand what went wrong. And sometimes you need more information. In this case, you should add the --info or --debug arguments; --stacktrace can also be useful. There is a convenient way to generate a data set for subsequent posting to the mailing lists: the --scan key.

With it, Bigtop will collect all the information and upload it to Gradle's build scan service, after which it will provide a link that a knowledgeable person can follow to understand why the build failed. Be aware that this option may expose information you don't want to share, such as usernames, host names, environment variables, and so on, so be careful.
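
Putting these flags together, a diagnostic run might look like this (a sketch; note that --scan will ask you to accept Gradle's terms of service before anything is uploaded):

```
./gradlew hadoop-rpm --info --stacktrace --scan
```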

Often errors are the result of failing to obtain some component needed for the build. Typically, the fix is to create a patch that changes something in the sources, for example an address in pom.xml in the source root directory. This is done by creating the patch and placing it in the appropriate directory, ./bigtop/bigtop-packages/src/common/oozie/ in this example, for instance as patch2-fix.diff.

--- a/pom.xml
+++ b/pom.xml
@@ -136,7 +136,7 @@
<repositories>
<repository>
<id>central</id>
- <url>http://repo1.maven.org/maven2</url>
+ <url>https://repo1.maven.org/maven2</url>
<snapshots>
<enabled>false</enabled>
</snapshots>

Most likely, by the time you read this article, you will not have to make the above fix yourself.

When introducing patches or edits to the build mechanism, you may need to "reset" the build with the clean task:

./gradlew hadoop-clean
> Task :hadoop_vardefines
> Task :hadoop-clean
BUILD SUCCESSFUL in 5s
2 actionable tasks: 2 executed

This operation rolls back all changes to the build of this component, after which the build will be performed from scratch. This time, let's try to build the project in a Docker image:

./gradlew -POS=centos-7 -Pprefix=1.2.1 hadoop-pkg-ind
> Task :hadoop-pkg-ind
Building 1.2.1 hadoop-pkg on centos-7 in Docker...
+++ dirname ./bigtop-ci/build.sh
++ cd ./bigtop-ci/..
++ pwd
+ BIGTOP_HOME=/tmp/bigtop
+ '[' 6 -eq 0 ']'
+ [[ 6 -gt 0 ]]
+ key=--prefix
+ case $key in
+ PREFIX=1.2.1
+ shift
+ shift
+ [[ 4 -gt 0 ]]
+ key=--os
+ case $key in
+ OS=centos-7
+ shift
+ shift
+ [[ 2 -gt 0 ]]
+ key=--target
+ case $key in
+ TARGET=hadoop-pkg
+ shift
+ shift
+ [[ 0 -gt 0 ]]
+ '[' -z x ']'
+ '[' -z x ']'
+ '[' '' == true ']'
+ IMAGE_NAME=bigtop/slaves:1.2.1-centos-7
++ uname -m
+ ARCH=x86_64
+ '[' x86_64 '!=' x86_64 ']'
++ docker run -d bigtop/slaves:1.2.1-centos-7 /sbin/init
+
CONTAINER_ID=0ce5ac5ca955b822a3e6c5eb3f477f0a152cd27d5487680f77e33fbe66b5bed8
+ trap 'docker rm -f
0ce5ac5ca955b822a3e6c5eb3f477f0a152cd27d5487680f77e33fbe66b5bed8' EXIT
....
ΠΌΠ½ΠΎΠ³ΠΎ Π²Ρ‹Π²ΠΎΠ΄Π°
....
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-hdfs-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-yarn-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-mapreduce-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-hdfs-namenode-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-hdfs-secondarynamenode-2.8.5-
1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-hdfs-zkfc-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-hdfs-journalnode-2.8.5-
1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-hdfs-datanode-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-httpfs-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-yarn-resourcemanager-2.8.5-
1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-yarn-nodemanager-2.8.5-
1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-yarn-proxyserver-2.8.5-
1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-yarn-timelineserver-2.8.5-
1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-mapreduce-historyserver-2.8.5-
1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-client-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-conf-pseudo-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-doc-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-libhdfs-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-libhdfs-devel-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-hdfs-fuse-2.8.5-1.el7.x86_64.rpm
Wrote: /bigtop/build/hadoop/rpm/RPMS/x86_64/hadoop-debuginfo-2.8.5-1.el7.x86_64.rpm
+ umask 022
+ cd /bigtop/build/hadoop/rpm//BUILD
+ cd hadoop-2.8.5-src
+ /usr/bin/rm -rf /bigtop/build/hadoop/rpm/BUILDROOT/hadoop-2.8.5-1.el7.x86_64
Executing(%clean): /bin/sh -e /var/tmp/rpm-tmp.uQ2FCn
+ exit 0
+ umask 022
Executing(--clean): /bin/sh -e /var/tmp/rpm-tmp.CwDb22
+ cd /bigtop/build/hadoop/rpm//BUILD
+ rm -rf hadoop-2.8.5-src
+ exit 0
[ant:touch] Creating /bigtop/build/hadoop/.rpm
:hadoop-rpm (Thread[Task worker for ':',5,main]) completed. Took 38 mins 1.151 secs.
:hadoop-pkg (Thread[Task worker for ':',5,main]) started.
> Task :hadoop-pkg
Task ':hadoop-pkg' is not up-to-date because:
Task has not declared any outputs despite executing actions.
:hadoop-pkg (Thread[Task worker for ':',5,main]) completed. Took 0.0 secs.
BUILD SUCCESSFUL in 40m 37s
6 actionable tasks: 6 executed
+ RESULT=0
+ mkdir -p output
+ docker cp
ac46014fd9501bdc86b6c67d08789fbdc6ee46a2645550ff6b6712f7d02ffebb:/bigtop/build .
+ docker cp
ac46014fd9501bdc86b6c67d08789fbdc6ee46a2645550ff6b6712f7d02ffebb:/bigtop/output .
+ docker rm -f ac46014fd9501bdc86b6c67d08789fbdc6ee46a2645550ff6b6712f7d02ffebb
ac46014fd9501bdc86b6c67d08789fbdc6ee46a2645550ff6b6712f7d02ffebb
+ '[' 0 -ne 0 ']'
+ docker rm -f ac46014fd9501bdc86b6c67d08789fbdc6ee46a2645550ff6b6712f7d02ffebb
Error: No such container:
ac46014fd9501bdc86b6c67d08789fbdc6ee46a2645550ff6b6712f7d02ffebb
BUILD SUCCESSFUL in 41m 24s
1 actionable task: 1 executed

The build was done under CentOS, but you can also do it under Ubuntu:

./gradlew -POS=ubuntu-16.04 -Pprefix=1.2.1 hadoop-pkg-ind

In addition to building packages for various Linux distributions, the tool can create a repository from the compiled packages, for example:

./gradlew yum
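
The resulting repository can then be hooked up to a test host. A minimal .repo file might look like the sketch below; the baseurl path is an assumption, so point it at wherever the yum task actually placed the generated repodata/ in your output directory:

```ini
# /etc/yum.repos.d/bigtop-local.repo
# NOTE: the baseurl is hypothetical; adjust it to the directory
# that actually contains the generated repodata/.
[bigtop-local]
name=Local Bigtop build
baseurl=file:///opt/bigtop/output/
enabled=1
gpgcheck=0
```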

Also worth remembering are the smoke tests and deployment in Docker.

Create a cluster of three nodes:

./gradlew -Pnum_instances=3 docker-provisioner

Run smoke tests on a cluster of three nodes:

./gradlew -Pnum_instances=3 -Prun_smoke_tests docker-provisioner

Delete cluster:

./gradlew docker-provisioner-destroy

Get commands to connect inside docker containers:

./gradlew docker-provisioner-ssh

Show state:

./gradlew docker-provisioner-status

You can read more about Deployment tasks in the documentation.

As for the tests, there are quite a lot of them, mostly smoke and integration tests. Their analysis is beyond the scope of this article. Let me just say that building a distribution is not as difficult a task as it might seem at first glance. We were able to build all the components we use in our production and run the tests on them, and we also had no problems deploying them and performing basic operations in a test environment.

Besides the components already in Bigtop, it is possible to add something else, even your own software. All of this is perfectly automated and fits into the CI/CD concept.

Conclusion

Obviously, a distribution built this way should not be sent straight to production. You need to understand that if there is a real need to build and support your own distribution, you have to invest both money and time in it.

However, in combination with the right approach and a professional team, it is quite possible to do without commercial solutions.

It is important to note that the Bigtop project itself needs development, and it seems there is no active development in it today. The prospect of Hadoop 3 appearing in it is also unclear. By the way, if you have a real need to build Hadoop 3, you can look at the fork from Arenadata, which, in addition to the standard components, includes a number of extras (Ranger, Knox, NiFi).

As for Rostelecom, for us Bigtop is one of the options under consideration today. Whether we decide on it or not, only time will tell.

Appendix

To include a new component in the build, you need to add its description to bigtop.bom and ./bigtop-packages. You can try to do this by analogy with the existing components. Try to figure it out; it is not as difficult as it seems at first glance.
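
As a starting point, the directory skeleton for a hypothetical new component might be laid out as follows. The component name "mytool" and the exact set of subdirectories are assumptions based on how existing components are organized under bigtop-packages/; copy from a similar existing component (ZooKeeper is a good small example) rather than relying on this sketch:

```shell
# Sketch: skeleton for a hypothetical new component called "mytool".
BIGTOP_ROOT="${BIGTOP_ROOT:-./bigtop}"

# common/: build and install scripts shared by all package formats, plus patches
mkdir -p "$BIGTOP_ROOT/bigtop-packages/src/common/mytool"
# rpm/: the .spec file for yum-based targets
mkdir -p "$BIGTOP_ROOT/bigtop-packages/src/rpm/mytool/SPECS"
# deb/: debian control files for apt-based targets
mkdir -p "$BIGTOP_ROOT/bigtop-packages/src/deb/mytool"

echo "Skeleton created; now describe 'mytool' in $BIGTOP_ROOT/bigtop.bom"
```

After that, the new component should show up among the gradle tasks and can be built like any other.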

What do you think? We will be glad to hear your opinion in the comments, and thank you for your attention!

The article was prepared by the data management team of Rostelecom

Source: habr.com
