Tinder's transition to Kubernetes

Translator's note: The engineering team at the world-famous Tinder service recently shared some technical details of migrating their infrastructure to Kubernetes. The process took almost two years and resulted in a very large-scale platform on K8s: 200 services running in 48,000 containers. Read on for the interesting difficulties Tinder's engineers ran into and the results they achieved.

Why?

Almost two years ago, Tinder decided to migrate its platform to Kubernetes. Kubernetes would let the Tinder team containerize applications and move them to production with minimal effort through immutable deployments: application builds, their deployment, and the infrastructure itself would be defined entirely by code.

We were also looking for a solution to our scalability and stability problems. When scaling was critical, we often had to wait several minutes for new EC2 instances to spin up. The idea of launching containers and serving traffic within seconds instead of minutes was very attractive to us.

The process turned out to be difficult. During the migration in early 2019, the Kubernetes cluster reached critical mass, and we started running into various issues caused by traffic volume, cluster size, and DNS. Along the way, we solved a lot of interesting problems related to migrating 200 services and operating a Kubernetes cluster of 1,000 nodes, 15,000 pods, and 48,000 running containers.

How?

Since January 2018, we have gone through various stages of migration. We started by containerizing all of our services and deploying them to Kubernetes test cloud environments. Starting in October 2018, we began methodically migrating all existing services to Kubernetes. By March 2019, we had completed the move, and the Tinder platform now runs exclusively on Kubernetes.

Building images for Kubernetes

We have over 30 source code repositories for the microservices running in the Kubernetes cluster. The code in these repositories is written in different languages (for example, Node.js, Java, Scala, and Go) with multiple runtime environments for the same language.

The build system is designed to provide a fully customizable build context for each microservice, which usually consists of a Dockerfile and a list of shell commands. Their content is fully customizable, yet all of these build contexts follow a standardized format. Standardizing the build contexts allows a single build system to handle all of the microservices.
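
For illustration, here is a purely hypothetical sketch of what such a standardized build context could look like; the field names and commands are invented for this example and are not Tinder's actual format.

    # Hypothetical build-context spec for one microservice (illustrative only).
    service: example-api
    dockerfile: Dockerfile
    build_commands:
      - npm ci
      - npm run build
      - npm test
    artifacts:
      - dist/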

Figure 1-1. Standardized build process via the Builder container

For maximum consistency across runtime environments, the same build process is used during development and testing. This posed a very interesting challenge: we had to devise a way to guarantee a consistent build environment across the entire platform. To achieve this, all build processes run inside a special Builder container.

Its container implementation required some advanced Docker tricks. Builder inherits the local user ID and secrets (such as the SSH key, AWS credentials, and so on) required to access Tinder's private repositories. It mounts local directories containing the sources, so build artifacts are stored naturally on the host. This approach improves performance because it eliminates the need to copy build artifacts between the Builder container and the host, and stored build artifacts can be reused without additional configuration.
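
As a rough sketch of the idea (not Tinder's actual tooling), the Builder could be expressed as a Compose service that runs under the invoking user's ID, mounts the credentials and the source tree, and writes artifacts straight back to the host; the image name, paths, and build script below are assumptions.

    # Hedged sketch of a Builder container; assumes UID/GID are exported in the shell.
    services:
      builder:
        image: registry.example.com/builder:latest
        user: "${UID}:${GID}"              # inherit the local user ID
        working_dir: /workspace
        volumes:
          - ./:/workspace                  # sources and build artifacts stay on the host
          - ~/.ssh:/home/builder/.ssh:ro   # SSH key for private repositories
          - ~/.aws:/home/builder/.aws:ro   # AWS credentials
        command: ["./build.sh"]            # the per-service build commands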

For some services, we needed another container to match the compile-time environment with the runtime environment (for example, the Node.js bcrypt library generates platform-specific binary artifacts during installation). Compile-time requirements vary between services, so the final Dockerfile is composed on the fly.

Kubernetes cluster architecture and migration

Cluster sizing

We decided to use kube-aws for automated cluster provisioning on Amazon EC2 instances. At the very beginning, everything ran in one common pool of nodes. We quickly recognized the need to separate workloads by instance size and type to use resources more efficiently. The reasoning was that a few heavily loaded multi-threaded pods deliver more predictable performance than they do when coexisting with a large number of single-threaded pods.

In the end, we settled on the following node pools (see the sketch after this list):

  • m5.4xlarge: monitoring (Prometheus);
  • c5.4xlarge: the Node.js workload (single-threaded);
  • c5.2xlarge: Java and Go (multi-threaded workloads);
  • c5.4xlarge: the control plane (3 nodes).
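
A minimal sketch of how a workload ends up on its pool; the node label key and values here are assumptions for illustration, not Tinder's actual labels.

    # Pinning a single-threaded Node.js service to its instance pool (labels assumed).
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-node-service
    spec:
      replicas: 4
      selector:
        matchLabels:
          app: example-node-service
      template:
        metadata:
          labels:
            app: example-node-service
        spec:
          nodeSelector:
            node-pool: c5-4xlarge-nodejs
          containers:
            - name: app
              image: registry.example.com/example-node-service:latest
              resources:
                requests:
                  cpu: "1"        # single-threaded, one core per pod
                  memory: 2Gi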

Migration

One of the preparatory steps for migrating from the old infrastructure to Kubernetes was to redirect the existing direct service-to-service communication to the new Elastic Load Balancers (ELBs). They were created in a specific virtual private cloud (VPC) subnet, and that subnet was connected to the Kubernetes VPC. This allowed us to migrate services incrementally without worrying about the specific order of service dependencies.

These endpoints were created using weighted DNS record sets with CNAMEs pointing to each new ELB. To cut over, we added a new record pointing to the new ELB of the Kubernetes service with a weight of 0, and set the Time To Live (TTL) of the record set to 0. We then slowly adjusted the old and new weights until 100% of the load was going to the new service. Once the cut-over was complete, we raised the TTL back to a more reasonable value.
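
Expressed as CloudFormation-style Route53 resources (the record and zone names are invented for illustration), the cut-over for a single service looked roughly like this:

    # Two weighted CNAMEs for the same name with TTL 0; the Kubernetes record starts at
    # weight 0 and is raised step by step until it carries 100% of the traffic.
    LegacyServiceRecord:
      Type: AWS::Route53::RecordSet
      Properties:
        HostedZoneName: internal.example.com.
        Name: some-service.internal.example.com.
        Type: CNAME
        TTL: "0"
        SetIdentifier: legacy-elb
        Weight: 100
        ResourceRecords:
          - legacy-elb-1234.us-east-1.elb.amazonaws.com
    KubernetesServiceRecord:
      Type: AWS::Route53::RecordSet
      Properties:
        HostedZoneName: internal.example.com.
        Name: some-service.internal.example.com.
        Type: CNAME
        TTL: "0"
        SetIdentifier: kubernetes-elb
        Weight: 0
        ResourceRecords:
          - k8s-service-elb-5678.us-east-1.elb.amazonaws.com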

Our Java services honored the low DNS TTL, but the Node applications did not. One of the engineers rewrote part of the connection pool code and wrapped it in a manager that refreshed the pools every 60 seconds. This approach worked very well and caused no noticeable performance degradation.

Lessons

Network fabric limits

In the early hours of January 8, 2019, the Tinder platform suddenly went down. In response to an unrelated increase in platform latency earlier that morning, the number of pods and nodes in the cluster had been increased. This caused the ARP cache to be exhausted on all of our nodes.

There are three Linux options related to the ARP cache:

(the table in the original article lists the gc_thresh1, gc_thresh2, and gc_thresh3 settings; source)

gc_thresh3 is a hard limit. The appearance of "neighbor table overflow" entries in the log meant that even after a synchronous garbage collection (GC) run, there was not enough room in the ARP cache to store the new neighbor entry. In this case, the kernel simply dropped the packet entirely.

We use Flannel as the network fabric in Kubernetes. Packets are sent over VXLAN, a Layer 2 tunnel on top of a Layer 3 network. The technology uses MAC-in-UDP (MAC Address-in-User Datagram Protocol) encapsulation and makes it possible to extend Layer 2 network segments. The transport protocol on the physical data center network is IP plus UDP.

Figure 2-1. Flannel networking diagram (source)

Figure 2-2. VXLAN packet (source)

Each Kubernetes worker node allocates its own /24 virtual address space out of a larger /9 block. For each node, this means one entry in the routing table, one entry in the ARP table (on the flannel.1 interface), and one entry in the switching table (FDB). These are added the first time a worker node starts or whenever a new node is discovered.

In addition, node-to-pod (or pod-to-pod) communication ultimately goes through the eth0 interface (as shown in the Flannel diagram above). This results in an additional ARP table entry for each corresponding source and destination host.

In our environment, this type of communication is very common. For Kubernetes service objects, an ELB is created and Kubernetes registers every node with it. The ELB knows nothing about pods, and the node it selects may not be the packet's final destination. When a node receives a packet from the ELB, it evaluates it against the iptables rules for the particular service and randomly selects a pod, possibly on another node.

At the time of the failure, there were 605 nodes in the cluster. For the reasons described above, this was enough to exceed the default gc_thresh3 value. When this happens, not only are packets dropped, but entire Flannel /24 virtual address spaces disappear from the ARP table. Pod-to-pod communication and DNS lookups break (DNS is hosted inside the cluster; more on that later in this article).

To fix this problem, you need to increase the gc_thresh1, gc_thresh2, and gc_thresh3 values and restart Flannel to re-register the missing networks.
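
One way to apply such a change across all nodes (a sketch only; the threshold values below are illustrative, not the ones Tinder chose) is a privileged DaemonSet that raises the kernel's neighbor-table limits:

    # Raises the ARP cache thresholds on every node; the values are examples only.
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: arp-cache-tuning
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          app: arp-cache-tuning
      template:
        metadata:
          labels:
            app: arp-cache-tuning
        spec:
          hostNetwork: true
          containers:
            - name: sysctl
              image: busybox:1.36
              securityContext:
                privileged: true
              command:
                - sh
                - -c
                - |
                  echo 80000  > /proc/sys/net/ipv4/neigh/default/gc_thresh1
                  echo 90000  > /proc/sys/net/ipv4/neigh/default/gc_thresh2
                  echo 100000 > /proc/sys/net/ipv4/neigh/default/gc_thresh3
                  while true; do sleep 3600; done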

Unexpected DNS Scaling

During the migration, we actively used DNS to manage traffic and gradually move services from the old infrastructure to Kubernetes. We set relatively low TTL values for the associated record sets in Route53. While the old infrastructure was running on EC2 instances, our resolver configuration pointed at Amazon's DNS. We took this for granted, and the impact of the low TTL on our services and on Amazon services (e.g., DynamoDB) went almost unnoticed.

As we migrated services to Kubernetes, we found DNS handling 250,000 requests per second. As a result, applications began to experience constant and serious DNS lookup timeouts. This happened despite an enormous optimization effort and switching the DNS provider to CoreDNS (which at peak load reached 1,000 pods running on 120 cores).

While investigating other possible causes and solutions, we found an article describing a race condition in netfilter, the Linux packet filtering framework. The timeouts we observed, together with an incrementing insert_failed counter on the Flannel interface, matched the article's conclusions.

The problem occurs during source and destination network address translation (SNAT and DNAT) and the subsequent insertion into the conntrack table. One of the workarounds discussed internally and suggested by the community was to move DNS onto the worker node itself. In this case:

  • SNAT is not needed, because the traffic stays on the node; it does not have to go out through the eth0 interface.
  • DNAT is not needed, because the destination IP is local to the node rather than a pod chosen at random by the iptables rules.

We decided to take this approach. CoreDNS was deployed as a DaemonSet in Kubernetes, and we injected the node-local DNS server into each pod's resolv.conf by setting the kubelet's --cluster-dns flag. This solution proved effective against the DNS timeouts.
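
In rough outline (the manifest details below are assumptions, not taken from Tinder's configuration), this amounts to a CoreDNS DaemonSet bound to the host network, with kubelet's --cluster-dns pointed at the node's own address:

    # CoreDNS on every node, on the host network, so pod DNS traffic never leaves the node.
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: coredns-node-local
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          k8s-app: coredns-node-local
      template:
        metadata:
          labels:
            k8s-app: coredns-node-local
        spec:
          hostNetwork: true
          dnsPolicy: Default
          containers:
            - name: coredns
              image: coredns/coredns:1.6.2
              args: ["-conf", "/etc/coredns/Corefile"]
              volumeMounts:
                - name: config
                  mountPath: /etc/coredns
          volumes:
            - name: config
              configMap:
                name: coredns-node-local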

However, we still saw packet drops and the insert_failed counter incrementing on the Flannel interface. This persisted even after the workaround was in place, because we were only able to eliminate SNAT and/or DNAT for DNS traffic; the race condition remained for other types of traffic. Fortunately, most of our packets are TCP, and when the problem occurs they are simply retransmitted. We are still looking for a suitable solution covering all types of traffic.

Using Envoy for Better Load Balancing

As backend services migrated to Kubernetes, we began to suffer from unbalanced load across pods. We found that HTTP keepalive caused ELB connections to stick to the first pods of each rolling deployment that became ready, so the bulk of the traffic flowed through a small percentage of the available pods. The first solution we tried was setting MaxSurge to 100% on new deployments for the worst offenders. The effect was minor and unpromising for larger deployments.
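
For reference, here is what that rollout setting looks like (a sketch, not our exact manifests): with maxSurge at 100%, a full replacement set of pods comes up at once, so keepalive connections have more new pods to land on instead of piling onto the first few that become ready.

    # Fragment of a Deployment spec: rolling update with 100% surge (illustrative values).
    spec:
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 100%
          maxUnavailable: 0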

Another approach we used was to artificially inflate resource requests for critical services, so that neighboring pods would have more headroom compared to other heavy pods. In the long run this would not work either, because it wastes resources. In addition, our Node applications were single-threaded and could therefore only use one core. The only real solution was better load balancing.

We had long wanted to properly evaluate Envoy. The situation allowed us to deploy it in a very limited way and get immediate results. Envoy is an open source, high-performance Layer 7 proxy designed for large SOA applications. It supports advanced load balancing techniques, including automatic retries, circuit breakers, and global rate limiting. (Translator's note: you can read more about this in this article on Istio, which is built on Envoy.)

We came up with the following configuration: an Envoy sidecar for each pod with a single route, and a cluster configured to reach the container locally on its port. To minimize potential cascading failures and keep the blast radius small, we used a fleet of Envoy front-proxy pods, one per Availability Zone (AZ) per service. They relied on a simple service discovery mechanism written by one of our engineers that simply returned the list of pods in each AZ for a given service.

The service front-Envoys then used this service discovery mechanism with a single upstream cluster and route. We set adequate timeouts, increased all the circuit breaker settings, and added a minimal retry configuration to help with one-off failures and ensure smooth deployments. In front of each of these service front-Envoys we placed a TCP ELB. Even if the keepalive from our main proxy layer got stuck on some of the Envoy pods, they could still handle the load much better and were configured to balance across the backends via least_request.
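
The key piece of the front-proxy configuration is the upstream cluster balanced with least_request. The fragment below is a hedged illustration with invented names and simplified discovery, not Tinder's actual Envoy config:

    # Fragment of an Envoy static_resources config: one upstream cluster per service,
    # balanced with LEAST_REQUEST and generous circuit breaker thresholds.
    clusters:
      - name: example-service
        connect_timeout: 0.25s
        type: STRICT_DNS
        lb_policy: LEAST_REQUEST
        circuit_breakers:
          thresholds:
            - max_connections: 100000
              max_pending_requests: 100000
              max_requests: 100000
        load_assignment:
          cluster_name: example-service
          endpoints:
            - lb_endpoints:
                - endpoint:
                    address:
                      socket_address:
                        address: example-service.internal
                        port_value: 8080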

For deployments, we used a preStop hook on both the application pods and the sidecar pods. In the sidecar, the hook triggered a health check failure on the admin endpoint and then slept for a while to allow active connections to drain.
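
A sketch of that drain sequence on the Envoy sidecar container (the admin port, endpoint usage, and sleep duration are assumptions, and curl is assumed to be available in the image):

    # Fragment of a container spec: fail the sidecar's admin health check, then wait
    # so that in-flight connections can finish before the container is stopped.
    lifecycle:
      preStop:
        exec:
          command:
            - sh
            - -c
            - |
              curl -s -X POST http://127.0.0.1:9901/healthcheck/fail
              sleep 20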

One of the reasons we were able to make progress so quickly was the detailed metrics we could easily plug into our standard Prometheus setup. They let us see exactly what was happening while we tweaked configuration options and redistributed traffic.

The results were immediate and obvious. We started with the most unbalanced services, and Envoy now runs in front of the twelve most important services in the cluster. This year we plan to move to a full service mesh with more advanced service discovery, circuit breaking, outlier detection, rate limiting, and tracing.

Figure 3-1. CPU convergence of one of the services during the transition to Envoy

Final result

Through the experience gained and additional research, we have built a strong infrastructure team with solid skills in designing, deploying, and operating large Kubernetes clusters. All Tinder engineers now have the knowledge and experience to package containers and deploy applications to Kubernetes.

When the old infrastructure needed additional capacity, we had to wait several minutes for new EC2 instances to start up. Now containers start and begin serving traffic within seconds instead of minutes. Scheduling multiple containers on a single EC2 instance also gives us better horizontal packing density. As a result, we expect a significant reduction in EC2 costs in 2019 compared to last year.

The migration took almost two years, but we completed it in March 2019. The Tinder platform now runs exclusively on a Kubernetes cluster of 200 services, 1,000 nodes, 15,000 pods, and 48,000 running containers. Infrastructure is no longer the exclusive domain of the operations team: all of our engineers share this responsibility and control how their applications are built and deployed using nothing but code.


Source: habr.com
