How Alibaba Cloud Manages Tens of Thousands of Kubernetes Clusters with… Kubernetes

Kube-on-kube, metaclusters, cells, resource allocation

Fig. 1. Kubernetes ecosystem on Alibaba Cloud

Since 2015, Alibaba Cloud Container Service for Kubernetes (ACK) has been one of the fastest growing cloud services in Alibaba Cloud. It serves numerous external customers and also underpins Alibaba's internal infrastructure and the company's other cloud services.

As with similar container services from world-class cloud providers, our top priorities are reliability and availability. To meet them, we have built a scalable, globally available platform for tens of thousands of Kubernetes clusters.

In this article, we will share the experience of managing a large number of Kubernetes clusters on a cloud infrastructure, as well as the architecture of the underlying platform.

Introduction

Kubernetes has become the de facto standard for a wide range of cloud workloads. As shown in Fig. 1 above, more and more Alibaba Cloud applications now run on Kubernetes clusters: stateful and stateless applications as well as application managers. Managing Kubernetes has always been an interesting and serious topic for engineers who build and maintain infrastructure, and for a cloud provider like Alibaba Cloud the question of scale comes to the fore. How do you manage Kubernetes clusters at this scale? We have already covered best practices for managing huge Kubernetes clusters with 10,000 nodes. That is an interesting scaling problem in itself, but there is another dimension of scale: the number of clusters themselves.

We have discussed this topic with many ACK users. Most of them prefer to run dozens, if not hundreds, of small or medium sized Kubernetes clusters. There are good reasons for this: limiting potential damage, separating clusters for different teams, creating virtual clusters for testing. If ACK aims to serve a global audience with this usage model, it must reliably and efficiently manage a large number of clusters in more than 20 regions.

Fig. 2. Challenges of managing a huge number of Kubernetes clusters

What are the main challenges of managing clusters at this scale? As shown in the figure, there are four problems to be dealt with:

  • Heterogeneity

ACK should support various types of clusters, including standard, serverless, Edge, Windows, and a few others. Different clusters require different parameters, components and hosting models. Some customers need help with customization for their specific needs.

  • Different size of clusters

Clusters vary in size, from a couple of nodes with a few pods to tens of thousands of nodes with thousands of pods. Resource requirements also vary greatly. Incorrect resource allocation can affect performance or even cause a crash.

  • Different versions

Kubernetes is evolving very fast, with new versions released every few months. Customers are always eager to try new features, so they want to run test workloads on new Kubernetes versions while keeping production workloads on stable ones. To meet this requirement, ACK must continually deliver new Kubernetes versions to customers while continuing to support stable ones.

  • Security compliance

Clusters are distributed across different regions, so they must comply with varying security requirements and official regulations. For example, a cluster in Europe must comply with the GDPR, while a financial cloud in China must have additional layers of protection. These requirements are mandatory; ignoring them is unacceptable, as it creates huge risks for customers of the cloud platform.

The ACK platform is designed to solve most of the problems above. It currently manages over 10,000 Kubernetes clusters around the world securely and reliably. Let's look at how we achieved this, thanks in part to several key design and architecture principles.

Design

Kube-on-kube and cells

Rather than a centralized hierarchy, a cell-based architecture is typically used to scale a platform beyond a single data center or to extend the disaster-recovery domain.

Each region in Alibaba Cloud is made up of multiple availability zones (AZs) and usually corresponds to a specific data center. In a large region (such as Hangzhou), there are often thousands of client Kubernetes clusters managed by ACK.

ACK manages these Kubernetes clusters with Kubernetes itself: we run a Kubernetes metacluster that manages the client Kubernetes clusters. This architecture is also called kube-on-kube (KoK). KoK simplifies the management of client clusters because cluster deployment becomes simple and deterministic. More importantly, we can reuse native Kubernetes features, for example managing API servers through Deployments or using the etcd operator to manage multiple etcd instances. Such recursion always brings a special pleasure.
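To make the KoK idea concrete, here is a minimal sketch of how a client cluster's API server might be described as an ordinary Deployment applied to the metacluster. All names, images, labels, and the zone-spread constraint are hypothetical illustrations, not ACK's actual manifests.

```python
# Sketch of kube-on-kube: a client cluster's control-plane components
# are ordinary workloads in the metacluster. Names are illustrative.

def apiserver_deployment(cluster_id: str, replicas: int = 3) -> dict:
    """Build a Deployment-style manifest for one client cluster's
    kube-apiserver, to be applied to the metacluster."""
    name = f"apiserver-{cluster_id}"
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {
            "name": name,
            # One namespace per client cluster keeps tenants isolated.
            "namespace": f"cluster-{cluster_id}",
            "labels": {"component": "kube-apiserver", "cluster": cluster_id},
        },
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    # Spreading replicas across zones gives the
                    # multi-active deployment described in the text.
                    "topologySpreadConstraints": [{
                        "maxSkew": 1,
                        "topologyKey": "topology.kubernetes.io/zone",
                        "whenUnsatisfiable": "DoNotSchedule",
                        "labelSelector": {"matchLabels": {"app": name}},
                    }],
                    "containers": [{
                        "name": "kube-apiserver",
                        "image": "registry.example.com/kube-apiserver:v1.18",
                    }],
                },
            },
        },
    }
```

Because the control plane is "just pods," upgrades, restarts, and replica scaling all fall out of standard Deployment mechanics.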

Within a region, several Kubernetes metaclusters are deployed, depending on the number of clients. We call these metaclusters cells. To protect against the failure of an entire zone, ACK supports multi-active deployments within a region: the metacluster distributes the master components of client Kubernetes clusters across multiple zones and runs them simultaneously, i.e. in multi-active mode. To keep the masters reliable and efficient, ACK optimizes component placement and ensures that the API server and etcd are located close to each other.

This model allows efficient, flexible and reliable management of Kubernetes.

Metacluster Resource Planning

As mentioned above, the number of metaclusters in each region depends on the number of clients. But at what point do we add a new metacluster? This is a typical resource planning problem. The usual answer: create a new one when the existing metaclusters have exhausted their resources.

Take network resources, for example. In the KoK architecture, Kubernetes components of client clusters are deployed as pods in a metacluster. We use Terway (Fig. 3), a high-performance plug-in developed by Alibaba Cloud for container networking. It provides a rich set of security policies and connects to customers' virtual private clouds (VPCs) through the Alibaba Cloud Elastic Network Interface (ENI). To allocate network resources efficiently across nodes, pods, and services in a metacluster, we must carefully track their usage within the metacluster's VPC. When network resources run out, a new cell is created.

To determine the optimal number of client clusters in each metacluster, we also consider our costs, density requirements, resource quota, reliability requirements, and statistics. The decision to create a new metacluster is made based on all this information. Please note that small clusters can expand greatly in the future, so resource consumption increases even with the same number of clusters. We usually leave enough free space for each cluster to grow.

Fig. 3. Terway network architecture
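The cell-planning rule described above can be sketched as a simple placement check: a new client cluster goes into the first cell that can absorb it while keeping headroom free, and a new cell is provisioned otherwise. The headroom threshold and the resource fields (ENI and IP counts) are invented for illustration.

```python
# Illustrative sketch of cell planning: create a new metacluster (cell)
# when no existing cell can absorb a new client cluster with headroom
# to spare. The 30% reserve and the field names are made up.

HEADROOM = 0.30  # keep 30% of each cell's capacity free for growth

def pick_cell(cells, demand):
    """Return the first cell that can host `demand` while keeping the
    headroom reserve, or None if a new metacluster must be created.
    Each cell: {"eni_free", "eni_total", "ip_free", "ip_total"};
    demand: {"eni", "ip"}."""
    for cell in cells:
        eni_left = cell["eni_free"] - demand["eni"]
        ip_left = cell["ip_free"] - demand["ip"]
        if (eni_left >= HEADROOM * cell["eni_total"]
                and ip_left >= HEADROOM * cell["ip_total"]):
            return cell
    return None  # caller provisions a new cell
```

The reserve encodes the observation from the text that small clusters can grow substantially, so resource consumption rises even with a fixed cluster count.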

Scaling Master Components of Client Clusters

Master components have different resource requirements, depending on the number of nodes and pods in the cluster and on the number of custom controllers/operators interacting with the API server.

In ACK, each client Kubernetes cluster differs in size and runtime requirements, so there is no universal configuration for hosting master components. If we mistakenly set a low resource limit for a large client, its cluster will not cope with the load; if we set a conservatively high limit for all clusters, resources are wasted.

To strike a fine balance between reliability and cost, ACK uses a type system. We define three cluster types: small, medium, and large, each with its own resource allocation profile. The type is determined from the load on the master components, the number of nodes, and other factors, and it can change over time. ACK continuously monitors these factors and upgrades or downgrades the type accordingly; after a type change, resource allocation is updated automatically with minimal user intervention.

We are working to improve this system with finer-grained scaling and more accurate type updates so that these changes are smoother and make more economic sense.
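As a rough illustration of the type system, the following sketch classifies a cluster from its node and pod counts, with a hysteresis gap between the promote and demote thresholds so a cluster hovering near a boundary does not flap between profiles. All thresholds and profile values are invented, not ACK's actual figures.

```python
# Hypothetical sketch of the small/medium/large type scheme with
# hysteresis. The score, thresholds, and profiles are illustrative.

PROFILES = {  # resource allocation per master-component replica
    "small":  {"cpu": "1",  "memory": "2Gi"},
    "medium": {"cpu": "4",  "memory": "8Gi"},
    "large":  {"cpu": "16", "memory": "32Gi"},
}

def classify(nodes, pods, current="small"):
    """Return the (possibly updated) cluster type, moving at most one
    step per evaluation. Promote/demote thresholds overlap to give a
    hysteresis band that prevents flapping at the boundary."""
    score = nodes + pods // 10           # crude load score
    order = ["small", "medium", "large"]
    up = {"small": 100, "medium": 1000}  # promote above these scores
    down = {"medium": 80, "large": 800}  # demote below these scores
    idx = order.index(current)
    if current in up and score > up[current]:
        return order[idx + 1]
    if current in down and score < down[current]:
        return order[idx - 1]
    return current
```

A real system would of course feed in master-component load and richer statistics, as the text notes, but the one-step-at-a-time transition with overlapping thresholds is the part that keeps type changes smooth.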

Fig. 4. Intelligent multi-stage type switching

The evolution of client clusters at scale

The previous sections have covered some aspects of managing a large number of Kubernetes clusters. However, there is one more problem that needs to be solved: the evolution of clusters.

Kubernetes is the "Linux" of the cloud world. It is continuously updated and becomes more modular. We must constantly deliver new versions to our customers, fix vulnerabilities and update existing clusters, and manage a large number of related components (CSI, CNI, Device Plugin, Scheduler Plugin and many others).

Let's take Kubernetes component management as an example. To begin with, we have developed a centralized registration and management system for all these plug-ins.

Fig. 5. Flexible and pluggable components

Before rolling an update forward, we need to make sure it was successful. To do this, we developed a component health-check system that runs before and after each update.

Fig. 6. Pre-update check of cluster components

For fast and reliable updates of these components, we run a continuous deployment system that supports gradual (grayscale) rollout, pauses, and other functions. Standard Kubernetes controllers are not well suited to this use case, so we developed a set of specialized controllers for managing cluster components, including a plug-in manager and a sidecar management module.

For example, the BroadcastJob controller is designed to update components on every worker node or to inspect every node. A BroadcastJob runs a pod on each cluster node, like a DaemonSet; but while a DaemonSet always keeps its pods alive, a BroadcastJob's pods terminate once their work is done. The BroadcastJob controller also launches pods on newly joined nodes to initialize them with the required components. In June 2019, we open-sourced OpenKruise, the automation engine we use internally.

Fig. 7. OpenKruise orchestrates a BroadcastJob run on all nodes
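The DaemonSet-versus-BroadcastJob distinction can be captured in a toy reconciliation function: both kinds want one pod per node, but only the BroadcastJob lets a pod reach a terminal state. This mimics the semantics described above; it is not OpenKruise's real API.

```python
# Toy model of the difference between a DaemonSet and a BroadcastJob:
# one pod per node in both cases, but a DaemonSet pod is always kept
# Running, while a BroadcastJob pod may complete. Not the real API.

def reconcile(kind, nodes, finished):
    """Return the desired pod phase per node.
    `finished` is the set of nodes whose pod has completed its work;
    a DaemonSet ignores it, a BroadcastJob honors it."""
    desired = {}
    for node in nodes:
        if kind == "DaemonSet":
            desired[node] = "Running"  # restarted forever
        elif kind == "BroadcastJob":
            desired[node] = "Succeeded" if node in finished else "Running"
    return desired
```

Run-to-completion per node is exactly what a one-shot component upgrade or node inspection needs, which is why the text reaches for BroadcastJob rather than DaemonSet.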

To help customers choose the right cluster configurations, we also provide a set of predefined profiles, including Serverless, Edge, Windows, and Bare Metal profiles. As the landscape expands and our customers' needs grow, we will be adding more profiles to ease the tedious setup process.

Fig. 8. Rich, flexible cluster profiles for various scenarios

Global Observability Across Data Centers

As shown below in Fig. 9, Alibaba Cloud Container Service is deployed in twenty regions worldwide. At this scale, one of ACK's key goals is to easily monitor the status of running clusters: if a client cluster runs into a problem, we can respond quickly. In other words, we needed a solution that efficiently and securely collects real-time statistics from client clusters in all regions and presents the results visually.

Fig. 9. Global deployment of Alibaba Cloud Container Service in twenty regions

As in many Kubernetes monitoring setups, Prometheus is our main tool. For each metacluster, Prometheus agents collect the following metrics:

  • OS metrics such as host resources (CPU, memory, disk, etc.) and network bandwidth.
  • Metrics for managing metaclusters and client clusters, such as kube-apiserver, kube-controller-manager, and kube-scheduler.
  • Metrics from kube-state-metrics and cAdvisor.
  • etcd metrics such as disk write latency, database size, peer-to-peer throughput, etc.

Global statistics are collected using a typical multilayer aggregation model: monitoring data from each metacluster is first aggregated within its region and then sent to a central server that shows the big picture. Everything works through the Prometheus federation mechanism: the Prometheus server in each data center collects that data center's metrics, and the central Prometheus server aggregates the monitoring data. AlertManager connects to the central Prometheus and sends alerts via DingTalk, email, SMS, etc. as needed; visualization is handled by Grafana.

As shown in Fig. 10, the monitoring system is divided into three levels:

  • Edge level

The layer farthest from the center. A Prometheus edge server runs in each metacluster, collecting metrics from the metacluster and its client clusters within the same network domain.

  • Cascade level

The function of the Prometheus cascade layer is to collect monitoring data from multiple regions. These servers operate at the level of larger geographic areas such as China, Asia, Europe and America. As clusters grow, a region can be split, and then a cascading Prometheus server will appear in each new large region. With this strategy, you can seamlessly scale as needed.

  • Central level

The central Prometheus server connects to all cascading servers and performs the final data aggregation. For reliability, two central Prometheus instances, connected to the same cascading servers, run in different availability zones.

Fig. 10. Global multi-level monitoring architecture based on the Prometheus federation mechanism
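The three-level federation can be modeled as repeated summation up the hierarchy: edge servers expose per-cell samples, cascade servers roll up a region, and the central server rolls up the cascades. Real Prometheus federation does this with /federate scrapes; the numbers and topology below are purely illustrative.

```python
# Minimal sketch of multilayer aggregation: each layer sums the
# samples handed up by its children. The regions, cells, and values
# are invented; real deployments federate via Prometheus /federate.

def aggregate(layer):
    """Sum all samples from every child in one layer.
    `layer` maps child name -> list of metric samples."""
    return sum(sum(samples) for samples in layer.values())

# Edge level: per-metacluster apiserver request rates in one region.
hangzhou_edge = {"cell-1": [120.0, 95.0], "cell-2": [60.0]}
europe_edge = {"cell-1": [40.0]}

# Cascade level: one rolled-up value per region.
cascade = {"hangzhou": [aggregate(hangzhou_edge)],
           "europe": [aggregate(europe_edge)]}

# Central level: the global picture.
global_rate = aggregate(cascade)  # 275.0 + 40.0 = 315.0
```

Because each layer only talks to the layer below it, splitting a grown region into two cascades changes nothing above or below, which is the seamless-scaling property the text describes.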

Summary

Kubernetes-based cloud solutions continue to transform our industry. Alibaba Cloud Container Service provides secure, reliable, high-performance hosting and is one of the best cloud-hosted Kubernetes offerings. The Alibaba Cloud team strongly believes in open source and the open-source community, and we will continue to share our knowledge of operating and managing cloud technologies.

Source: habr.com
