Dailymotion Kubernetes adventure: building infrastructure in the clouds + on-premises

Translator's note: Dailymotion is one of the world's largest video hosting services and therefore a notable Kubernetes user. In this post, system architect David Donchez shares the results of building the company's production platform on K8s, which started as a cloud installation on GKE and ended up as a hybrid solution that delivered better response times and savings on infrastructure costs.

When we decided to rebuild Dailymotion's core API three years ago, we wanted a more efficient way to host applications and to simplify our development and production workflows. To do this, we adopted a container orchestration platform and naturally chose Kubernetes.

Why build your own platform based on Kubernetes?

Production-grade API in no time with Google Cloud

Summer 2016

Three years ago, right after Vivendi's acquisition of Dailymotion, our engineering teams focused on one global goal: to create a brand-new Dailymotion product.

Based on our analysis of containers and orchestration solutions, and on our past experience, we were convinced that Kubernetes was the right choice. Some of our developers already understood its core concepts and knew how to use it, which was a huge advantage for the infrastructure transformation.

In terms of infrastructure, we needed a powerful and flexible system to host new types of cloud-native applications. We chose to start our journey in the cloud so that we could calmly build a secure on-premises platform. We decided to deploy our applications on Google Kubernetes Engine (GKE), although we knew that sooner or later we would move to our own data centers and adopt a hybrid strategy.

Why choose GKE?

We made this choice mainly for technical reasons; we also needed to quickly provision infrastructure that met the company's business needs. We had several application hosting requirements, such as geographic distribution, scalability, and fault tolerance.

[Image: GKE clusters in Dailymotion]

Since Dailymotion is a video platform available worldwide, we really wanted to improve the quality of the service by reducing latency. Previously our API was only available in Paris, which was sub-optimal: we wanted to be able to host applications not only in Europe, but also in Asia and the US.

This latency sensitivity meant a lot of work had to go into the platform's network architecture. While most cloud providers force you to create a separate network in each region and then connect them via VPN or some managed service, Google Cloud lets you create a single fully routable network covering all Google regions. This is a big plus in terms of operations and system efficiency.

In addition, Google Cloud's network services and load balancers do an excellent job. They let you announce the same public IP addresses from every region, and BGP takes care of the rest (i.e., routes users to the nearest cluster). Naturally, in the event of a failure, traffic automatically goes to another region without any human intervention.
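On GKE, this pattern maps naturally onto a multi-cluster setup in which each cluster's load balancer announces the same reserved global IP. A minimal illustrative sketch (the static-address name `api-global-ip` and the service name `api` are assumptions, not Dailymotion's actual configuration):

```yaml
# Illustrative GKE Ingress: each regional cluster applies the same manifest,
# so Google's global load balancer serves one anycast IP from every region.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    kubernetes.io/ingress.class: gce
    # Reserved global static IP, shared by all clusters (assumed name)
    kubernetes.io/ingress.global-static-ip-name: api-global-ip
spec:
  defaultBackend:
    service:
      name: api        # assumed backend service
      port:
        number: 80
```

Because the address is global, users reach the nearest healthy backend without any DNS tricks.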

[Image: Google load balancing monitoring]

Our platform also makes heavy use of GPUs. Google Cloud allows you to use them very effectively right in Kubernetes clusters.

At the time, the infrastructure team was mostly focused on the legacy stack deployed on physical servers. That is why using a managed service (including the Kubernetes master components) met our requirements and gave us time to train the teams to operate on-premises clusters.

As a result, we were able to start accepting production traffic on the Google Cloud infrastructure in just 6 months from the start.

However, despite a number of advantages, working with a cloud provider has a cost, and that cost can grow with load. That is why we carefully analyzed every managed service we used, hoping to eventually implement it on-premises. In fact, the rollout of on-premises clusters began at the end of 2016, and the hybrid strategy was initiated at the same time.

Launch of Dailymotion local container orchestration platform

Autumn 2016

With the entire cloud stack ready for production and work on the API ongoing, it was time to focus on the regional clusters.

At that time, users were watching more than 3 billion videos every month, and we had been operating our own extensive Content Delivery Network for more than a year. We wanted to take advantage of this and deploy Kubernetes clusters in the existing data centers.

The Dailymotion infrastructure consisted of more than 2,500 servers in six data centers, all configured with SaltStack. We started preparing all the Salt states needed to create master and worker nodes, as well as an etcd cluster.
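In SaltStack, such provisioning is described declaratively as states. The fragment below is a hypothetical sketch of a worker-node state, not Dailymotion's actual formula; the file paths and package name are assumptions:

```yaml
# worker.sls -- illustrative Salt state: install and run the kubelet on a worker node
kubelet:
  pkg.installed: []
  service.running:
    - enable: True
    - watch:
      # Restart the kubelet whenever its config file changes
      - file: /etc/kubernetes/kubelet.conf
    - require:
      - pkg: kubelet

/etc/kubernetes/kubelet.conf:
  file.managed:
    - source: salt://kubernetes/files/kubelet.conf   # assumed location in the Salt tree
    - makedirs: True
```

Applying the same states across all six data centers keeps node configuration reproducible.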

Network part

Our network is fully routable: each server advertises its IP on the network using ExaBGP. We compared several network plugins, and the only one that satisfied all our needs (thanks to its L3 approach) was Calico. It fit perfectly into the existing network infrastructure model.

Since we wanted to reuse all of the existing infrastructure, the first thing we had to do was adapt our homebrew network utility (deployed on all servers) so it could advertise the IP ranges of the Kubernetes nodes on the network. We let Calico assign IP addresses to pods, but did not (and still do not) use it for BGP sessions with the network equipment. Routing is actually handled by ExaBGP, which announces the subnets used by Calico. This allows us to reach any pod from the internal network (and from the load balancers in particular).
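This split of responsibilities (Calico for pod IPAM, ExaBGP for route announcements) implies a Calico pool with encapsulation and NAT turned off, so that pod subnets stay natively routable. The sketch below is illustrative; the CIDR is an assumption:

```yaml
# Illustrative Calico IPPool: pods get addresses from this pool,
# while route announcement is left to ExaBGP rather than Calico's BGP.
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: pod-pool
spec:
  cidr: 10.96.0.0/12    # assumed pod address range
  ipipMode: Never       # no overlay: pod traffic is routed natively at L3
  natOutgoing: false    # pod IPs remain reachable from the internal network
```

With no overlay and no NAT, any internal host that learns the announced routes can reach pods directly.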

How we manage ingress traffic

To route incoming requests to the right service, we decided to use an Ingress controller because of its integration with Kubernetes Ingress resources.

Three years ago, nginx-ingress-controller was the most mature controller: Nginx had been in use for a long time and was known for its stability and performance.

We decided to place the controllers on dedicated 10-gigabit blade servers. Each controller connects to the kube-apiserver endpoint of its cluster. These servers also use ExaBGP to advertise public or private IP addresses. Our network topology allows us to use BGP from these controllers to route all traffic directly to the pods without going through a service like NodePort. This avoids horizontal traffic between nodes and improves efficiency.
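An Ingress resource handled by such a controller might look like the following sketch (the hostname and backend service name are assumptions for illustration):

```yaml
# Illustrative Ingress resource consumed by nginx-ingress-controller
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
    - host: api.example.com        # assumed hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api          # assumed backend service
                port:
                  number: 80
```

The controller watches such resources via the kube-apiserver and proxies matching requests straight to the pod endpoints.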

[Image: Traffic flow from the Internet to the pods]

Now that we have covered our hybrid platform, we can dig into the traffic migration process itself.

Migrating traffic from Google Cloud to Dailymotion infrastructure

Autumn 2018

After nearly two years of building, testing, and configuring, we finally had a full Kubernetes stack ready to take part of the traffic.

The current routing strategy is quite simple, but it suits our needs well. In addition to the public IPs (on Google Cloud and at Dailymotion), AWS Route 53 is used to define policies and redirect users to the cluster of our choice.

[Image: Routing policy example using Route 53]

With Google Cloud this is easy, since we use the same IP for all clusters and the user is redirected to the nearest GKE cluster. For our own clusters the technique is different, because their IPs differ.

During the migration, we aimed to redirect regional requests to the appropriate clusters and evaluated the benefits of this approach.

Because our GKE clusters are configured to autoscale using Custom Metrics, they scale up/down based on incoming traffic.
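Custom-metric autoscaling of this kind can be expressed with a HorizontalPodAutoscaler. The sketch below is illustrative only; the metric name `http_requests_per_second`, the deployment name `api`, and the replica bounds are assumptions, not Dailymotion's actual configuration:

```yaml
# Illustrative HPA scaling a deployment on a custom per-pod traffic metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                          # assumed deployment name
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # assumed custom metric
        target:
          type: AverageValue
          averageValue: "100"              # target requests/s per pod
```

When traffic fails over from the on-premises clusters, the metric rises and GKE adds replicas automatically.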

In normal mode, all regional traffic is directed to the local cluster, and GKE serves as a backup in case of problems (health checks are carried out by Route 53).
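This primary/backup arrangement can be sketched as Route 53 failover records. The CloudFormation fragment below is purely illustrative; the zone, record names, health-check reference, and IPs are placeholders:

```yaml
# Illustrative failover routing: on-prem cluster is PRIMARY, GKE is SECONDARY
ApiOnPrem:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneName: example.com.
    Name: api.example.com.
    Type: A
    SetIdentifier: on-prem
    Failover: PRIMARY
    HealthCheckId: !Ref OnPremHealthCheck   # Route 53 health check on the local cluster
    TTL: '60'
    ResourceRecords:
      - 203.0.113.10        # placeholder on-prem ingress IP
ApiGke:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneName: example.com.
    Name: api.example.com.
    Type: A
    SetIdentifier: gke
    Failover: SECONDARY
    TTL: '60'
    ResourceRecords:
      - 198.51.100.10       # placeholder GKE anycast IP
```

If the health check on the primary fails, Route 53 starts answering with the secondary record and traffic shifts to GKE with no human intervention.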

...

In the future, we want to fully automate the routing policies to achieve an autonomous hybrid strategy that continually improves availability for our users. On the plus side: cloud costs have been significantly reduced, and we have even managed to reduce the API's response time. We trust the resulting cloud platform and are ready to redirect more traffic to it if necessary.

PS from translator

You might also be interested in another recent Dailymotion post about Kubernetes. It is dedicated to deploying applications with Helm on multiple Kubernetes clusters and was published about a month ago.

Source: habr.com
