Our findings from a year of migrating GitLab.com to Kubernetes

Translator's note: GitLab's adoption of Kubernetes is considered one of the two main factors contributing to the company's growth. Until recently, however, the infrastructure of the GitLab.com online service was built on virtual machines; its migration to K8s began only about a year ago and is not yet complete. We are pleased to present a translation of a recent article by a GitLab SRE about how this is going and what conclusions the project's engineers have drawn.


For about a year now, our infrastructure division has been migrating all services running on GitLab.com to Kubernetes. During this time, we have faced challenges not only with moving services to Kubernetes, but also with managing hybrid deployments during the transition. The valuable lessons we have learned will be discussed in this article.

Since GitLab.com's earliest days, its servers have run on virtual machines in the cloud. These VMs are managed by Chef and installed from our official Linux package. When the application needs updating, our deployment strategy is simply to update the server fleet in a coordinated, sequential manner using a CI pipeline. This method, albeit slow and a little boring, ensures that GitLab.com uses the same installation and configuration methods as users of self-managed GitLab installations, who use the same Linux packages.

We use this method because it's critical to experience the pain and joy that the average member of the community goes through when they install and configure their copies of GitLab. This approach worked well for some time, but when the number of projects on GitLab exceeded 10 million, we realized that it no longer met our scaling and deployment needs.

First steps towards Kubernetes and cloud-native GitLab

The GitLab Charts project was created in 2017 to prepare GitLab for deployment in the cloud and to enable users to install GitLab on Kubernetes clusters. Back then, we knew that migrating GitLab to Kubernetes would increase the SaaS platform's scalability, simplify deployments, and improve compute efficiency. At the same time, many features of our application depended on NFS-mounted partitions, which slowed down the transition from virtual machines.

The focus on cloud native and Kubernetes allowed our engineers to plan a gradual transition, during which we dropped some of the application's NFS dependencies while continuing to develop new features along the way. Since we started planning the migration in the summer of 2019, many of these restrictions have been removed, and the migration of GitLab.com to Kubernetes is now well under way!

Features of GitLab.com in Kubernetes

For GitLab.com, we use a single regional GKE cluster that handles all application traffic. To minimize the complexity of an (already tricky) migration, we are focusing on services that do not rely on local storage or NFS. GitLab.com uses a predominantly monolithic Rails codebase and we route traffic based on workload characteristics to different endpoints isolated in their own node pools.

On the frontend, traffic is divided into web, API, Git SSH/HTTPS, and Registry requests. On the backend, we split queued jobs according to various characteristics, depending on predefined resource boundaries that let us set service-level objectives (SLOs) for different workloads.
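Routing each workload class to its own node pool can be sketched with a node selector on the pod spec. The pool name and taint below are hypothetical placeholders, though `cloud.google.com/gke-nodepool` is the label GKE applies to its nodes:

```yaml
# Fragment of a pod spec pinning API workloads to a dedicated node pool.
# The pool name "api-pool" and the "workload-class" taint are placeholders.
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: api-pool
  tolerations:
    - key: workload-class
      operator: Equal
      value: api
      effect: NoSchedule
```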

All of these GitLab.com services are configured with an unmodified GitLab Helm chart. Configuration is done in sub-charts that can be selectively enabled as we gradually migrate services to the cluster. Even though we decided to leave some stateful services, such as Redis, Postgres, GitLab Pages, and Gitaly, out of the migration, Kubernetes lets us drastically reduce the number of VMs currently managed by Chef.
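As an illustration, sub-charts in the GitLab Helm chart can be toggled through values. The fragment below is a hypothetical sketch (exact keys vary by chart version, and the hosts are placeholders), showing stateful services disabled in the chart and pointed at external instances:

```yaml
# Hypothetical values.yaml fragment: stateful services stay on VMs, so the
# corresponding sub-charts are disabled and the chart is pointed at external
# instances instead. Key names and hosts are illustrative.
global:
  psql:
    host: patroni.internal.example.com   # external PostgreSQL (placeholder)
  redis:
    host: redis.internal.example.com     # external Redis (placeholder)
  gitaly:
    enabled: false                       # Gitaly remains on VMs for now
postgresql:
  install: false   # do not deploy the bundled PostgreSQL sub-chart
redis:
  install: false   # do not deploy the bundled Redis sub-chart
registry:
  enabled: true    # the Container Registry runs in the cluster
```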

Transparency and configuration management of Kubernetes

All configuration is managed through GitLab itself, using three configuration projects based on Terraform and Helm. We try to use GitLab wherever possible to run GitLab, but for operational tasks we maintain a separate GitLab installation, so that deployments and updates of GitLab.com do not depend on the availability of GitLab.com itself.

Although our pipelines for the Kubernetes cluster run on a separate GitLab installation, there are mirrors of the code repositories publicly available at the following addresses:

  • k8s-workloads/gitlab-com - GitLab.com configuration bindings for the GitLab Helm chart;
  • k8s-workloads/gitlab-helmfiles - configurations for services not directly related to the GitLab application, including logging, cluster monitoring, and integrated tools like PlantUML;
  • gitlab-com-infrastructure - Terraform configuration for the Kubernetes and legacy VM infrastructure. This is where all the resources needed to run the cluster are configured, including the cluster itself, node pools, service accounts, and IP address reservations.

When changes are made, a short public summary is shown, with a link to a detailed diff that an SRE analyzes before applying the changes to the cluster.

For SREs, the link leads to a detailed diff in the GitLab installation used for production, access to which is restricted. This allows employees and community members without access to the operational instance (which is open only to SREs) to view the proposed configuration changes. By combining a public GitLab instance for code with a private instance for CI pipelines, we maintain a single workflow while guaranteeing that configuration updates do not depend on GitLab.com.

What we learned during the migration

During the move we gained experience that we now apply to new migrations and deployments in Kubernetes.

1. Increased costs due to traffic between Availability Zones

Daily egress statistics (bytes per day) for the Git storage fleet on GitLab.com

Google divides its network into regions, which in turn are divided into availability zones (AZs). Git hosting involves large amounts of data, so it is important for us to control network egress. Internal traffic is free only while it stays within a single availability zone. At the time of this writing, we serve roughly 100 TB of data on a typical business day (and that's just for Git repositories). Services that lived on the same virtual machine in our old VM-based topology now run in different Kubernetes pods, which means some traffic that used to be local to a VM can now cross availability zone boundaries.
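To see why this matters, here is a back-of-the-envelope sketch of the cost exposure. The ~100 TB/day figure comes from the article; the $0.01/GB inter-zone rate and the cross-zone fraction are assumptions for illustration only:

```python
# Back-of-the-envelope cost of cross-zone egress (illustrative numbers only).
# The ~100 TB/day figure is from the article; the $0.01/GB inter-zone rate
# and the cross-zone traffic fraction are assumptions, not measured values.

TB_PER_DAY = 100
GB_PER_TB = 1000
INTER_ZONE_USD_PER_GB = 0.01

def daily_egress_cost(tb_per_day, usd_per_gb, cross_zone_fraction):
    """Cost per day if `cross_zone_fraction` of traffic leaves its zone."""
    return tb_per_day * GB_PER_TB * usd_per_gb * cross_zone_fraction

# If even 20% of Git traffic crossed zone boundaries:
print(daily_egress_cost(TB_PER_DAY, INTER_ZONE_USD_PER_GB, 0.2))  # ~ $200/day
```

Even a modest cross-zone fraction adds up quickly at this volume, which is why keeping traffic zone-local is worth architectural effort.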

Regional GKE clusters span multiple availability zones for redundancy. We are considering splitting the regional GKE cluster into single-zone clusters for services that generate large volumes of traffic. This would reduce egress costs while maintaining redundancy at the cluster level.

2. Limits, resource requests and scaling

The number of replicas handling production traffic to registry.gitlab.com. Traffic peaks at ~15:00 UTC.

Our migration story began in August 2019, when we moved the first service, the GitLab Container Registry, to Kubernetes. This mission-critical, high-traffic service was a good fit for the first migration because it is a stateless application with few external dependencies. The first problem we encountered was a large number of pods being evicted due to lack of memory on the nodes, which forced us to change our requests and limits.

We found that for an application whose memory consumption grows over time, low request values (which reserve memory for each pod), combined with a "generous" hard limit on usage, led to node saturation and a high rate of evictions. To deal with this, we decided to increase requests and reduce limits. This took the pressure off the nodes and gave pods a lifecycle that didn't strain the node. We now start migrations with generous (and nearly identical) request and limit values, adjusting them as needed.
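The resulting policy can be sketched as a pod resource spec with nearly identical requests and limits. The numbers below are placeholders, not GitLab.com's actual values:

```yaml
# Illustrative pod resource settings (placeholder values): requests and
# limits are kept nearly identical, so the scheduler reserves what the pod
# will really use instead of overcommitting the node and evicting pods later.
resources:
  requests:
    cpu: "1"
    memory: 2Gi     # reserve close to the expected working set up front
  limits:
    cpu: "1"
    memory: 2.5Gi   # small headroom only; a "generous" limit paired with a
                    # low request is what caused node saturation and evictions
```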

3. Metrics and logs

The infrastructure division focuses on latency, error rate, and saturation, with established service-level objectives (SLOs) tied to the overall availability of our system.

Over the past year, one of the key developments in the infrastructure division has been improvements in monitoring and working with SLO. SLOs allowed us to set goals for individual services, which we kept a close eye on during the migration. But even with this improved observability, it is not always possible to immediately see problems using metrics and alerts. For example, by focusing on latency and error rates, we do not fully cover all use cases for a service that is being migrated.

We discovered this issue almost immediately after moving some workloads to the cluster. It was especially acute for features that receive few requests but have very specific configuration dependencies. One of the key lessons of the migration was the need to consider not only metrics in monitoring, but also logs and the "long tail" of errors (translator's note: referring to the long tail of a distribution on a chart). Now, for every migration, we prepare a detailed list of log queries and plan clear rollback procedures that can be handed off from one shift to the next in case of problems.
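As an illustration of watching the long tail rather than the average, a histogram-based Prometheus query can compare the median against a high quantile. The metric name below is the conventional Prometheus one and is an assumption here, not GitLab's actual alerting rule:

```promql
# p50 vs p99 request latency: the long tail only shows up in the high quantile
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

A dashboard that only shows the p50 can look perfectly healthy while a small set of requests, such as those low-traffic, configuration-sensitive features, quietly fails.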

Serving the same requests in parallel on the old VM infrastructure and the new Kubernetes-based one was a unique challenge. Unlike a lift-and-shift migration (quickly moving applications "as is" to new infrastructure), running in parallel on "old" VMs and Kubernetes requires monitoring tools that are compatible with both environments and can combine metrics into a single view. It is important that we use the same dashboards and log queries to achieve consistent observability during the transition period.

4. Switching traffic to the new cluster

For GitLab.com, part of the server fleet is allocated to a canary stage. This canary fleet serves our internal projects and can also be enabled by users, but it is primarily designed to test changes to the infrastructure and application. The first migrated service started by accepting a limited amount of internal traffic, and we continue to use this method to make sure SLOs are met before sending all traffic to the cluster.

For a migration, this means that requests to internal projects are sent to Kubernetes first, and then we gradually switch the rest of the traffic to the cluster by changing the backend weights in HAProxy. During the transition from VMs to Kubernetes, it became clear that it is very useful to have an easy way to redirect traffic between the old and new infrastructure and, accordingly, to keep the old infrastructure ready for rollback in the first few days after a migration.
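The weight-based switch can be sketched as an HAProxy backend with both fleets registered; the backend name and addresses below are placeholders, not GitLab.com's real topology:

```haproxy
# Hypothetical HAProxy backend: shift traffic gradually by adjusting weights.
backend web
    balance roundrobin
    # 90% of requests still go to the VM fleet, 10% to the GKE cluster;
    # raising the cluster's weight moves more traffic over, and setting it
    # back to 0 is the rollback path.
    server vm-fleet    10.0.0.10:443 weight 90 check
    server gke-cluster 10.0.1.10:443 weight 10 check
```

Because both backends stay registered, rolling back is a weight change rather than a redeploy, which is exactly what makes the old infrastructure cheap to keep on standby.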

5. Spare pod capacity and its use

Almost immediately, we identified the following problem: pods for the Registry service started quickly, but pods for Sidekiq took up to two minutes to launch. Sidekiq's long pod startup times became a problem when we began migrating to Kubernetes workloads whose workers need to process jobs and scale quickly.

In this case, the lesson was that while the Horizontal Pod Autoscaler (HPA) in Kubernetes handles traffic growth well, it is important to account for workload characteristics and allocate spare pod capacity (especially when demand is uneven). In our case, a sudden burst of jobs triggered rapid scaling, which saturated CPU resources before we could scale the node pool.
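Allocating spare capacity can be expressed as an HPA with a warm baseline and an early scaling threshold. The names and numbers below are illustrative, not GitLab.com's actual settings:

```yaml
# Sketch of an HPA with built-in headroom (names and values are placeholders).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sidekiq-urgent
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sidekiq-urgent
  minReplicas: 10    # spare capacity: never scale below a warm baseline
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale early, before CPU saturates
```

The low utilization target and non-trivial `minReplicas` trade some idle cost for the ability to absorb a sudden burst of jobs while new nodes are still coming up.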

It's always tempting to squeeze as much out of a cluster as possible. However, having run into performance issues early on, we now start with a generous pod budget and scale it down later, keeping a close eye on SLOs. Launching pods for the Sidekiq service has been significantly accelerated and now takes about 40 seconds on average. Both GitLab.com and users of self-managed installations running the official GitLab Helm chart have benefited from the reduced pod startup time.

Conclusion

After migrating each service, we rejoiced at the benefits of using Kubernetes in production: faster and safer application deployment, scaling, and more efficient resource allocation. Moreover, the advantages of migration go beyond the GitLab.com service. With every improvement on the official Helm Chart, its users also benefit.

I hope you enjoyed the story of our Kubernetes migration adventure. We continue to migrate all new services to the cluster.


Source: habr.com
