Tupperware: Facebook's Kubernetes Killer?

Efficient and reliable cluster management at any scale with Tupperware


Today at the Systems @Scale conference we presented Tupperware, our cluster management system that orchestrates containers across millions of servers running almost all of our services. We first deployed Tupperware in 2011, and since then our infrastructure has grown from one data center to 15 geo-distributed data centers. Tupperware has not stood still all this time; it has evolved along with us. We'll show how Tupperware provides first-class cluster management, including convenient support for stateful services, a single control plane across all data centers, and the ability to shift capacity between services in real time. We will also share the lessons we learned as our infrastructure evolved.

Tupperware serves several purposes. Application developers use it to deliver and manage applications. It packages an application's code and dependencies into an image and delivers it to servers as containers. Containers provide isolation between applications on the same server, so developers can focus on application logic without worrying about finding servers or orchestrating updates. Tupperware also monitors server health, and if it detects a failure, it moves containers off the problematic server.

Capacity planning engineers use Tupperware to allocate server capacity to teams based on budgets and constraints. They also use it to improve server utilization. Data center operators rely on Tupperware to distribute containers appropriately across data centers and to stop or move containers during maintenance. As a result, maintaining servers, the network, and equipment requires minimal human intervention.

Tupperware architecture

[Figure: Tupperware architecture in the PRN region]

The diagram above shows the Tupperware architecture in PRN, one of our data center regions. The region consists of several data center buildings (PRN1 and PRN2) located next to each other. We plan to have a single control plane manage all the servers in one region.

Application developers deliver services as Tupperware jobs. A job consists of multiple containers, and they all typically run the same application code.
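
As a rough illustration of what a job amounts to, here is a minimal sketch of a job description; the Python class and field names below are invented for this article and are not Tupperware's real API.

```python
from dataclasses import dataclass

@dataclass
class JobSpec:
    """Hypothetical sketch of a Tupperware-style job: N identical containers
    built from one application image. Field names are illustrative only."""
    name: str        # job name, e.g. "webtier.frontend"
    image: str       # packaged application code plus its dependencies
    replicas: int    # number of containers in the job
    cpu: float       # CPU cores requested per container
    memory_mb: int   # memory requested per container

# A job of three identical containers all running the same application code.
web_job = JobSpec(name="webtier.frontend", image="webtier:2019-06-01",
                  replicas=3, cpu=4.0, memory_mb=8192)
```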

Tupperware is responsible for allocating containers and managing their lifecycle. It consists of several components:

  • The Tupperware frontend provides the API used by the UI, CLI, and other automation tools to interact with Tupperware. It hides Tupperware's internal structure from job owners.
  • The Tupperware scheduler is the control plane responsible for managing the container and job lifecycle. It is deployed regionally and globally: a regional scheduler manages servers in one region, and a global scheduler manages servers across regions. The scheduler is split into shards, and each shard manages a set of jobs.
  • The Tupperware scheduler proxy hides the internal sharding and presents a convenient single control plane to Tupperware users.
  • The Tupperware allocator assigns containers to servers. The scheduler is responsible for stopping, starting, updating, and failing over containers. Currently a single allocator can manage an entire region without being split into shards. (Note the difference in terminology: the Tupperware scheduler corresponds to the control plane in Kubernetes, and the Tupperware allocator is what Kubernetes calls a scheduler.)
  • The resource broker stores the source of truth for server and service events. We run one resource broker per data center, and it stores all the information about the servers in that data center. Together with the capacity management system (resource allocation system), the resource broker dynamically decides which scheduler manages which server. The health check service monitors servers and stores their health data in the resource broker. If a server has problems or needs maintenance, the resource broker tells the allocator and scheduler to stop its containers or move them to other servers (see the sketch after this list).
  • The Tupperware agent is a daemon running on every server that handles preparing and removing containers. Applications run inside containers, which gives them better isolation and reproducibility. At last year's Systems @Scale conference we described how individual Tupperware containers are created using images, btrfs, cgroupv2, and systemd.
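
Here is a minimal sketch of the failure-handling flow mentioned in the resource broker item above: the health check service reports a server's status to the resource broker, which asks the scheduler to drain a failing server. All class and method names are invented for illustration; this is not Tupperware's real interface.

```python
class Scheduler:
    """Stand-in for a scheduler shard (hypothetical interface)."""
    def drain(self, server: str) -> None:
        # Stop the containers on this server and reschedule them elsewhere.
        print(f"draining {server}: stopping containers and rescheduling")

class ResourceBroker:
    """Stand-in for the per-data-center resource broker (hypothetical)."""
    def __init__(self, scheduler: Scheduler):
        self.scheduler = scheduler
        self.server_health: dict[str, str] = {}   # server -> health status

    def report_health(self, server: str, status: str) -> None:
        # The health check service stores each server's status here.
        self.server_health[server] = status
        if status != "healthy":
            # Problematic server: ask the scheduler to move its containers.
            self.scheduler.drain(server)

broker = ResourceBroker(Scheduler())
broker.report_health("server-042.prn1", "needs-maintenance")
```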

Distinctive features of Tupperware

Tupperware is similar in many ways to other cluster management systems, such as Kubernetes and Mesos, but there are some differences:

  • Built-in support for stateful services.
  • A single control plane for servers across data centers, which automates intent-based container placement, cluster decommissioning, and maintenance.
  • Transparent sharding of the control plane for scalability.
  • Elastic computing, which makes it possible to shift capacity between services in real time.

We developed these capabilities to support a variety of stateless and stateful applications across a huge, global, shared fleet of servers.

Built-in support for stateful services

Tupperware runs many critical stateful services that store persistent product data for Facebook, Instagram, Messenger, and WhatsApp. These include large key-value stores (for example, ZippyDB) and monitoring data stores (for example, ODS Gorilla and Scuba). Maintaining stateful services is not easy, because the system must ensure that container placements survive large-scale disruptions, including network partitions and power outages. And while conventional techniques, such as spreading containers across fault domains, work well for stateless services, stateful services need additional support.

For example, if a server failure makes one database replica unavailable, should automated maintenance be allowed to proceed and upgrade the kernels on 50 servers out of a pool of 10,000? It depends on the situation. If one of those 50 servers holds another replica of the same database, it is better to wait rather than lose two replicas at once. To make maintenance and failure-handling decisions dynamically, the system needs to know about each stateful service's internal data replication and placement logic.

The TaskControl interface lets stateful services influence decisions that affect data availability. Through this interface, the scheduler notifies external applications about container operations (restart, upgrade, migration, maintenance). A stateful service implements a controller that tells Tupperware when it is safe to perform each operation, and operations can be reordered or temporarily deferred. In the example above, the database controller might tell Tupperware to upgrade 49 of the 50 servers but leave a specific server, X, alone for now. If the kernel-upgrade deadline then passes and the database still has not recovered the problematic replica, Tupperware will upgrade server X anyway.
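
As a sketch of the idea (not the real TaskControl API; the interface, class, and method names below are assumptions), a database's controller might defer any operation that would take a shard below two healthy replicas:

```python
from dataclasses import dataclass
from enum import Enum

class Decision(Enum):
    PROCEED = "proceed"   # safe to perform the operation now
    DEFER = "defer"       # postpone; the scheduler will ask again later

@dataclass
class Operation:
    kind: str             # "restart", "upgrade", "move", "maintenance"
    server: str           # server hosting the affected container

class DatabaseTaskController:
    """Hypothetical TaskControl-style controller for a replicated database."""
    def __init__(self, replicas_by_shard: dict[str, set[str]]):
        # shard id -> set of servers currently holding a healthy replica
        self.replicas_by_shard = replicas_by_shard

    def decide(self, op: Operation) -> Decision:
        for servers in self.replicas_by_shard.values():
            if op.server in servers and len(servers - {op.server}) < 2:
                # Taking this server down would leave fewer than two healthy
                # replicas of some shard, so ask Tupperware to wait.
                return Decision.DEFER
        return Decision.PROCEED
```

As described above, Tupperware does not defer forever: once the maintenance deadline passes, it proceeds with the operation anyway.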


Many stateful services in Tupperware use TaskControl not directly but through ShardManager, a common platform for building stateful services at Facebook. With Tupperware, developers specify their intent for how containers should be distributed across data centers; with ShardManager, they specify their intent for how data shards should be distributed across containers. ShardManager is aware of its applications' data placement and replication and interacts with Tupperware through the TaskControl interface to schedule container operations without involving the applications directly. This integration greatly simplifies the management of stateful services, but TaskControl is capable of more. For example, our large web tier is stateless and uses TaskControl to dynamically adjust the rate at which containers are updated. As a result, the web tier can quickly ship multiple software releases per day without compromising availability.

Server management in data centers

When Tupperware first appeared in 2011, each server cluster was managed by a separate scheduler. Back then, a Facebook cluster was a group of server racks connected to one network switch, and a data center contained several clusters. A scheduler could only manage servers in one cluster, which meant a job could not span multiple clusters. As our infrastructure grew, we decommissioned clusters more and more often. Since Tupperware could not move a job from a decommissioned cluster to other clusters without changes, decommissioning required a lot of effort and careful coordination between application developers and data center operators. This process wasted resources: servers sat idle for months while clusters were being decommissioned.

We built the resource broker to solve the cluster decommissioning problem and to coordinate other kinds of maintenance tasks. The resource broker keeps track of all the physical information associated with a server and dynamically decides which scheduler manages each server. Dynamically binding servers to schedulers lets a scheduler manage servers in different data centers. Since a Tupperware job is no longer limited to a single cluster, Tupperware users can specify how containers should be spread across fault domains. For example, a developer can declare an intent (say, "run my job on 2 fault domains in the PRN region") without naming specific availability zones, and Tupperware will find the right servers to satisfy that intent even if a cluster is decommissioned or under maintenance.
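
Such an intent might be written down roughly like this (a sketch; the field names are invented, not Tupperware's real job specification):

```python
# Hypothetical intent spec: the developer names only a region and how many
# fault domains to spread across; the allocator picks the concrete servers
# and keeps the intent satisfied when a cluster is decommissioned or under
# maintenance.
job_intent = {
    "job": "feed.ranker",
    "region": "PRN",                   # a region, not a cluster or zone
    "replicas": 200,
    "spread": {"fault_domains": 2},    # place containers in 2 fault domains
}
```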

Scaling to support a global shared fleet of servers

Historically, our infrastructure was divided into hundreds of dedicated server pools owned by individual teams. Because of fragmentation and a lack of standards, we had high operational overhead, and idle servers were harder to reuse. At last year's Systems @Scale conference we presented Infrastructure as a Service (IaaS), which is meant to consolidate our infrastructure into one large shared fleet of servers. But a single fleet of servers has its own challenges. It must meet certain requirements:

  • Scalability. Our infrastructure grows as we add data centers to each region. Servers are becoming smaller and more energy efficient, so there are many more of them in each region. As a result, a single scheduler per region cannot keep up with the number of containers running on hundreds of thousands of servers.
  • Reliability. Even if the scheduler could be scaled up that far, its larger scope would increase the risk of errors, and an entire region of containers could become unmanageable.
  • Fault tolerance. In the event of a major infrastructure failure (for example, the servers running the scheduler go down because of a network partition or power outage), the damage should be limited to only part of the region's servers.
  • Ease of use. It may seem that you need to run several independent schedulers per region. But from a convenience point of view, having a single point of entry to a shared pool across a region simplifies capacity and job management.

We split the scheduler into shards to address the problems of maintaining a large shared pool. Each scheduler shard manages its own set of jobs in the region, which reduces the risk associated with any one scheduler. As the shared pool grows, we can add more scheduler shards. To Tupperware users, the shards and scheduler proxies look like a single control plane; they do not have to deal with a collection of shards orchestrating their jobs. Scheduler shards are fundamentally different from the cluster schedulers we used before, because they partition the control plane without statically splitting the shared server pool along network topology lines.
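
Conceptually, the scheduler proxy preserves the single entry point by routing each request to the shard that owns the job, along these lines (a simplified sketch with invented names; the real system likely tracks shard ownership explicitly rather than hashing):

```python
import hashlib

class SchedulerProxy:
    """Hypothetical proxy that hides scheduler sharding from users."""
    def __init__(self, shards: list):
        self.shards = shards  # clients for the regional scheduler shards

    def _shard_for(self, job_name: str):
        # Map each job deterministically to one shard.
        digest = hashlib.sha1(job_name.encode()).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def submit(self, job_name: str, job_spec: dict):
        # Callers see one endpoint; the proxy picks the owning shard.
        return self._shard_for(job_name).submit(job_name, job_spec)
```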

Improving Usage Efficiency with Elastic Computing

The larger our infrastructure, the more important it is to use our servers efficiently to optimize infrastructure costs and reduce load. There are two ways to increase server utilization efficiency:

  • Elastic computing: scale down online services during off-peak hours and use the freed servers for offline workloads such as machine learning and MapReduce jobs.
  • Overcommitment: run online services and batch workloads on the same servers, with the batch workloads running at low priority.

The bottleneck in our data centers is power consumption, which is why we prefer small, energy-efficient servers that together provide more computing capacity. Unfortunately, overcommitment is less effective on small servers with little CPU and memory. We can, of course, put several containers of small services that consume little CPU and memory on one small, energy-efficient server, but large services would perform poorly in that setup. We therefore advise the developers of our large services to optimize them to use entire servers.

Mostly, though, we improve utilization through elastic computing. Utilization of many of our major services, such as News Feed, messaging, and the front-end web tier, varies with the time of day. We deliberately scale down these online services during quiet hours and use the freed servers for offline workloads such as machine learning and MapReduce jobs.


We know from experience that it is best to provide entire servers as units of elastic capacity, because large services are both the main contributors and the main consumers of elastic capacity, and they are optimized to use entire servers. When a server is released from an online service during quiet hours, the resource broker lends it to the scheduler to run offline workloads on it. If the online service hits a load peak, the resource broker quickly recalls the borrowed server and, together with the scheduler, returns it to the online service.
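
The lend-and-reclaim cycle can be modeled very roughly as follows (a simplified sketch, not the real resource broker; names are invented):

```python
class ElasticCapacityBroker:
    """Toy model of lending whole servers between online and offline pools."""
    def __init__(self, online_servers: set[str]):
        self.online_pool = set(online_servers)
        self.offline_pool: set[str] = set()

    def lend(self, server: str) -> None:
        # Quiet hours: the online service scaled down, so the whole server
        # goes to offline workloads (machine learning, MapReduce, and so on).
        self.online_pool.discard(server)
        self.offline_pool.add(server)

    def reclaim(self, server: str) -> None:
        # Peak load: preempt offline work and return the server to the
        # online service as quickly as possible.
        self.offline_pool.discard(server)
        self.online_pool.add(server)
```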

Lessons learned and plans for the future

In the past 8 years, we have evolved Tupperware to keep up with the rapid development of Facebook. We share what we have learned and hope it will help others manage rapidly growing infrastructures:

  • Keep the relationship between the control plane and the servers it manages flexible. This flexibility allows the control plane to manage servers across data centers, helps automate cluster decommissioning and maintenance, and enables dynamic capacity allocation through elastic computing.
  • With a single control plane per region, it is more convenient to work with jobs and easier to manage a large shared fleet of servers. Note that the control plane can maintain a single entry point even if its internals are sharded for scale or fault tolerance.
  • With a plugin model, the control plane can notify external applications of upcoming container operations, and stateful services can use the plugin interface to customize container management. This plugin model lets the control plane stay simple while efficiently serving many different stateful services.
  • We believe elastic computing, in which we borrow entire servers from donor services for batch jobs, machine learning, and other non-urgent workloads, is the best way to improve the utilization of small, energy-efficient servers.

We are just getting started on a single global shared fleet of servers. Right now about 20% of our servers are in the shared pool. Reaching 100% requires solving many problems, including shared storage pool support, maintenance automation, multi-tenant demand management, improved server utilization, and better support for machine learning workloads. We can't wait to take on these challenges and share our successes.

Source: habr.com
