Speed up Internet requests and sleep peacefully

Netflix is the market leader in Internet TV, the company that created and actively develops this segment. Netflix is known not only for its vast catalog of movies and TV shows, accessible from almost every corner of the planet and any device with a display, but also for its reliable infrastructure and unique engineering culture.

At DevOops 2019, Sergey Fedorov, Director of Engineering at Netflix, presented a clear example of the Netflix approach to developing and maintaining complex systems. A graduate of the Faculty of Computational Mathematics and Cybernetics at Lobachevsky State University of Nizhny Novgorod (UNN), Sergey was one of the first engineers on Open Connect, the CDN team at Netflix. He built systems for monitoring and analyzing video data, launched FAST.com, the popular service for measuring Internet connection speed, and for the past few years has been working on optimizing Internet requests so that the Netflix application works as fast as possible for users.

The report received the best reviews from the conference participants, and we have prepared a text version for you.

In the talk, Sergey covered in detail:

  • what affects the latency of Internet requests between the client and the server;
  • how to reduce this latency;
  • how to design, maintain, and monitor fault-tolerant systems;
  • how to achieve results quickly and with minimal risk to the business;
  • how to analyze the results and learn from mistakes.

Answers to these questions are needed not only by those who work in large corporations.

The presented principles and techniques should be known and practiced by everyone who develops and maintains Internet products.

What follows is a narration from the speaker's point of view.

The Importance of Internet Speed

The speed of Internet requests is directly related to the business. Consider shopping: back in 2009, Amazon said that a 100 ms delay results in a loss of 1% of sales.

There are more and more mobile devices, followed by mobile sites and applications. If your page takes longer than 3 seconds to load, you lose about half of your users. Since July 2018, Google has taken page load speed into account in search results: the faster the page, the higher its position in Google.

Connection speed is also important in financial institutions, where latency is critical. In 2015, Hibernia Networks completed a $400 million cable between New York and London to reduce city-to-city latency by 6 ms. That works out to about $66M for 1 ms of latency reduction!

According to research, a connection speed above 5 Mbps no longer directly affects the load time of a typical website. Connection latency, however, has a linear relationship with page load time:

[image]

However, Netflix is not a typical product. The impact of latency and speed on users is an active area of analysis and development. App startup and content selection depend on request latency, while downloading static elements and streaming depend on connection speed. Analyzing and optimizing the key factors that affect user experience is an active area of work for several teams at Netflix. One of the goals is to reduce the latency of requests between Netflix devices and the cloud infrastructure.

In this talk, we will focus on reducing latency, using the Netflix infrastructure as an example. We will look, from a practical point of view, at how to approach the design, development, and operation of complex distributed systems, and how to spend time on innovation and results rather than on diagnosing operational problems and breakdowns.

Inside Netflix

Thousands of different devices support Netflix apps. They are developed by four different teams that make separate versions of the client for Android, iOS, TV and web browsers. And we spend a lot of effort on improving and personalizing the user interface. To do this, we run hundreds of A/B tests in parallel.

Personalization is supported through hundreds of microservices in the AWS cloud, providing personalized user data, request dispatching, telemetry, Big Data, and Encoding. Traffic visualization looks like this:

Link to video with demonstration (6:04-6:23)

On the left is the entry point, and then the traffic is distributed among several hundred microservices, which are supported by different backend teams.

Another important component of our infrastructure is the Open Connect CDN, which delivers static content to the end user: videos, images, client code, and so on. The CDN runs on custom-built servers (OCAs, Open Connect Appliances). Inside are arrays of SSDs and HDDs running optimized FreeBSD, with NGINX and a set of services. We design and optimize the hardware and software components so that such a CDN server can send as much data as possible to users.

The "wall" of these servers at the Internet traffic exchange point (Internet eXchange - IX) looks like this:

[image]

Internet Exchange provides an opportunity for ISPs and content providers to "connect" to each other for a more direct exchange of data on the Internet. There are approximately 70-80 Internet Exchange points around the world where our servers are installed, and we independently install and maintain them:

[image]

In addition, we also provide servers directly to ISPs that they install in their network, improving the localization of Netflix traffic and the quality of streaming for users:

[image]

A set of AWS services is responsible for dispatching video requests from clients to CDN servers, as well as configuring the servers themselves - updating content, code, settings, etc. For the latter, we also built a backbone network that connects servers at Internet Exchange points to AWS. The backbone network is a global network of fiber optic cables and routers that we can design and configure based on our needs.

By Sandvine's estimates, our CDN infrastructure delivers approximately ⅛ of the world's Internet traffic during peak hours and ⅓ of the traffic in North America, where Netflix has been around the longest. Impressive numbers, but for me one of the most amazing achievements is that the entire CDN system is developed and maintained by a team of fewer than 150 people.

Initially, the CDN infrastructure was designed to deliver video data. However, over time, we realized that we can also use it to optimize dynamic requests from clients in the AWS cloud.

About speeding up the internet

Today, Netflix uses three AWS regions, and the latency of requests to the cloud depends on how far the client is from the nearest region. At the same time, we have many CDN servers that are used to deliver static content. Can we somehow use this infrastructure to speed up dynamic requests? Unfortunately, these requests cannot be cached: the APIs are personalized and each result is unique.

Let's make a proxy on the CDN server and start sending traffic through it. Will it be faster?

Handshakes

Recall how network protocols work. Today, most traffic on the Internet uses HTTPS, which depends on the lower-level TCP and TLS protocols. To connect to the server, the client performs a handshake: establishing a secure connection takes three message exchanges with the server, plus at least one more round trip to transfer data. With a round-trip time (RTT) of 100 ms, it takes 400 ms to receive the first bit of data:

[image]

If we place the certificates on a CDN server, the handshake time between client and server can be significantly reduced, provided the CDN is closer. Let's assume the latency to the CDN server is 30 ms. Then it takes 220 ms to get the first bit:

[image]
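
A quick sanity check of these numbers, assuming TLS 1.2 (one round trip for TCP, two for TLS, and one more for the request itself) and a warm, persistent connection between the CDN and the cloud:

    direct to the cloud:  4 × 100 ms ≈ 400 ms to the first byte
    via a nearby CDN:     3 × 30 ms (handshake) + 30 ms + 100 ms (forwarded request) ≈ 220 ms

TLS 1.3 would shave one round trip off both paths, but the benefit of terminating the handshake nearby remains.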

But the benefits don't end there. Once a connection has been established, TCP grows the congestion window (the amount of data it can have in flight on that connection). If a data packet is lost, classic implementations of TCP (such as TCP New Reno) cut the open "window" in half. How fast the congestion window grows, and how quickly it recovers from a loss, again depends on the latency (RTT) to the server. If the connection only goes as far as the CDN server, this recovery is faster. Meanwhile, packet loss is a normal phenomenon, especially on wireless networks.
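
A rough estimate of why the shorter loop matters: after a loss, New Reno roughly halves the congestion window and then grows it by about one segment per round trip, so climbing back to a window of W segments takes on the order of (W/2) × RTT. For W = 100 segments, that is about 5 seconds at a 100 ms RTT, but only about 1.5 seconds at 30 ms to the CDN server.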

Internet capacity can shrink, especially during peak hours, due to user traffic that creates "traffic jams". And there is no way on the public Internet to give one request priority over another - for example, to favor small, latency-sensitive requests over the "heavy" data streams that load the network. In our case, however, having our own backbone network lets us do exactly that on part of the request path - between the CDN and the cloud - and we can configure it fully: small, latency-sensitive packets get priority, and large data streams go a little later. The closer the CDN is to the client, the greater the effect.

Application layer protocols (OSI Layer 7) also affect latency. Newer protocols such as HTTP/2 optimize the performance of parallel requests, but there are Netflix clients on older devices that don't support them, and not all clients can be updated or optimally configured. Between the CDN proxy and the cloud, however, we have full control and can use new, optimal protocols and settings; only the leg between the client and the CDN server is left running the older, less efficient protocols. Moreover, we can multiplex requests over an already established connection between the CDN and the cloud, improving connection utilization at the TCP level:

[image]

We measure

Despite the fact that the theory promises improvements, we do not immediately rush to launch the system in production. Instead, we must first prove that the idea will work in practice. To do this, you need to answer several questions:

  • Speed: will the proxy be faster?
  • Reliability: will it break more often?
  • Complexity: how to integrate with applications?
  • Price: how much does it cost to deploy the additional infrastructure?

Let us consider in detail our approach to the evaluation of the first point. The rest are dealt with in a similar way.

To analyze the speed of requests, we want to get data from all users, without spending a lot of time on development and without breaking production. There are several approaches to this:

  1. RUM, or passive request measurement. We measure the execution time of current requests from users and get full user coverage. The disadvantage is that the signal is not very stable, due to many factors - for example, different request sizes and different processing times on the server and client. Moreover, you cannot test a new configuration without affecting production.
  2. Laboratory tests. Dedicated servers and infrastructure that imitate clients. With them we run the necessary tests and get full control over the measurements and a clean signal. But there is no full coverage of devices and user locations (especially with a worldwide service and support for thousands of device models).

How can the benefits of both methods be combined?

Our team found a solution. We wrote a small piece of code - a probe - and built it into our application. Probes let us run fully controlled network tests from our devices. It works like this:

  1. Shortly after the app loads and the initial activity completes, we run our probes.
  2. The client makes a request to the server and receives a "recipe" for the test. The recipe is a list of URLs to which HTTP(S) requests should be made. In addition, the recipe configures the request parameters: delays between requests, the amount of data requested, HTTP(S) headers, and so on. We can test several different recipes in parallel: when a configuration is requested, we randomly decide which recipe to issue.
  3. The probe start time is chosen so as not to conflict with active use of network resources on the client; in effect, we pick a moment when the client is idle.
  4. After receiving the recipe, the client makes requests to each of the URLs in parallel. The request to each address can be repeated - so-called "pulses". On the first pulse, we measure how long it took to establish a connection and download the data. On the second pulse, we measure the download time over an already established connection. Before the third, we can add a delay and measure the speed of reconnection, and so on.

    During the test, we measure all the parameters the device can capture:

    • DNS query time;
    • TCP connection establishment time;
    • TLS connection establishment time;
    • time of receipt of the first byte of data;
    • total download time;
    • resulting status code.
  5. After all pulses finish, the probe uploads the results of all measurements for analysis.

[image]

The key points: minimal logic on the client, data processing on the server, and parallel measurement of requests. This gives us the ability to isolate and test the influence of the various factors that affect request performance, vary them within a single recipe, and get results from real clients.
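
To make the mechanics concrete, here is a rough sketch of what a single pulse could look like on a Go client. The Recipe and PulseResult types and their fields are hypothetical; the timing hooks come from the standard net/http/httptrace package, a common way to capture DNS, TCP, TLS, and first-byte timings. On a repeated pulse over a kept-alive connection, the DNS/TCP/TLS callbacks simply do not fire, which separates the "already established connection" case from the first pulse.

```go
package probe

import (
	"crypto/tls"
	"io"
	"net/http"
	"net/http/httptrace"
	"time"
)

// Recipe is a hypothetical probe recipe: a list of URLs plus request parameters.
type Recipe struct {
	URLs   []string
	Pulses int           // how many times to request each URL
	Delay  time.Duration // pause before a pulse, e.g. to measure reconnection
}

// PulseResult holds the timings a single pulse can observe on the device.
type PulseResult struct {
	DNS, Connect, TLS, FirstByte, Total time.Duration
	Status                              int
	Err                                 error
}

// Pulse performs one request against url and records protocol-level timings.
func Pulse(client *http.Client, url string) PulseResult {
	var res PulseResult
	var start, dnsStart, connStart, tlsStart time.Time

	trace := &httptrace.ClientTrace{
		DNSStart:             func(httptrace.DNSStartInfo) { dnsStart = time.Now() },
		DNSDone:              func(httptrace.DNSDoneInfo) { res.DNS = time.Since(dnsStart) },
		ConnectStart:         func(_, _ string) { connStart = time.Now() },
		ConnectDone:          func(_, _ string, _ error) { res.Connect = time.Since(connStart) },
		TLSHandshakeStart:    func() { tlsStart = time.Now() },
		TLSHandshakeDone:     func(tls.ConnectionState, error) { res.TLS = time.Since(tlsStart) },
		GotFirstResponseByte: func() { res.FirstByte = time.Since(start) },
	}

	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		res.Err = err
		return res
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	start = time.Now()
	resp, err := client.Do(req)
	if err != nil {
		res.Err = err
		return res
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // download the whole payload to get the total time
	res.Total = time.Since(start)
	res.Status = resp.StatusCode
	return res
}
```

A real probe would run such pulses for every URL in the recipe in parallel, pause between pulses according to the recipe, and upload the collected results to the analytics pipeline.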

This infrastructure has proved useful for more than just request performance analysis. We currently have 14 active recipes and over 6,000 probes per second, receiving data from all corners of the earth with full device coverage. If Netflix were to buy a similar service from a third party, it would cost millions of dollars a year, with much worse coverage.

Testing theory in practice: a prototype

With such a system, we were able to evaluate the effectiveness of CDN proxies in terms of request latency. Now we needed to:

  • create a proxy prototype;
  • host a prototype on a CDN;
  • determine how to direct clients to a proxy on a specific CDN server;
  • compare performance with queries in AWS without a proxy.

The task was to evaluate the effectiveness of the proposed solution as quickly as possible. We chose Go for the prototype because of its good networking libraries. On each CDN server, we deployed the prototype proxy as a static binary to minimize dependencies and simplify integration. In the initial implementation, we used standard components as much as possible, with minor modifications for HTTP/2 connection pooling and request multiplexing.
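
As an illustration of how little such a prototype needs, here is a hedged sketch in Go; the origin URL and certificate paths are placeholders, and the real prototype certainly contained more logic around pooling and multiplexing. The proxy terminates TLS on the CDN server and forwards requests to the cloud over a shared, long-lived, HTTP/2-capable transport, so many client connections can share a few warm connections to AWS.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// Hypothetical cloud entry point; in reality this would come from configuration.
	origin, err := url.Parse("https://api.cloud.example.com")
	if err != nil {
		log.Fatal(err)
	}

	proxy := httputil.NewSingleHostReverseProxy(origin)

	// One shared transport: connections to the cloud are pooled and reused,
	// and (where the origin supports it) requests are multiplexed over HTTP/2.
	proxy.Transport = &http.Transport{
		ForceAttemptHTTP2:   true,
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 100,
		IdleConnTimeout:     90 * time.Second,
	}

	// TLS terminates here, on the CDN server close to the client,
	// which is what shortens the handshake path.
	log.Fatal(http.ListenAndServeTLS(":443", "cert.pem", "key.pem", proxy))
}
```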

For balancing across AWS regions, we used the same geographic DNS database that is used for balancing clients. To select a CDN server for the client, we use TCP Anycast for the servers at Internet Exchanges (IX): a single IP address is announced from all CDN servers, and the client is directed to the CDN server with the fewest IP hops. For CDN servers installed by Internet Service Providers (ISPs), we do not control the routers and cannot configure TCP Anycast, so we use the same logic that directs clients to ISP servers for video streaming.

So, we have three types of request paths: to the cloud over the open Internet, via a CDN server at an IX, or via a CDN server hosted by an ISP. Our goal is to understand which path is better and what the benefit of a proxy is compared to how requests go to production. To do this, we use the probe system as follows:

[image]

Each path becomes a separate probe target, and we look at the resulting times. For analysis, we combine the proxy results into one group (choosing the best time between the IX and ISP proxies) and compare them with the time of requests to the cloud without a proxy:

[image]

As you can see, the results were mixed: in most cases the proxy gives a good speedup, but there is also a sizable share of clients for whom the situation gets noticeably worse.

As a result, we did several important things:

  1. We evaluated the expected performance of requests from clients to the cloud through a CDN proxy.
  2. We received data from real customers, from all types of devices.
  3. We realized that the theory was not 100% confirmed and the initial proposal with a CDN proxy would not work for us.
  4. We didn’t take risks - we didn’t change production configurations for clients.
  5. We didn't break anything.

Prototype 2.0

So, back to the drawing board to repeat the process.

The idea: instead of proxying 100% of traffic, we determine the fastest path for each client and send requests there - that is, we do what is called client steering.

[image]

How do we implement it? We cannot use logic on the server side, because the decision determines which server the client connects to in the first place. We need to do it somehow on the client - ideally with a minimum of complex logic, so that we don't have to solve the problem of integrating with a huge number of client platforms.

The answer is DNS. In our case, we have our own DNS infrastructure, and we can set up a domain zone for which our servers are authoritative. It works like this:

  1. The client makes a request to the DNS server using a host such as api.netflix.xom.
  2. The request goes to our DNS server.
  3. The DNS server knows which path is fastest for this client and returns the corresponding IP address.

The solution has an additional complication: authoritative DNS servers do not see the client's IP address - they only see the IP address of the recursive resolver the client is using.

As a result, our authoritative server has to make a decision not for an individual client, but for the group of clients behind a given recursive resolver.

To solve this, we use the same probes: we aggregate the measurement results from clients for each recursive resolver and decide where to send that group - to a proxy at an IX via TCP Anycast, to an ISP proxy, or directly to the cloud.

We get this system:

[image]

The resulting DNS steering model lets us direct clients based on historical observations of the speed of connections from clients to the cloud.
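
The per-resolver decision itself can stay very small. Below is an illustrative sketch in Go: the types, thresholds, and addresses are made up, and the authoritative DNS server that would call it is omitted. Probe results aggregated per recursive resolver are mapped to whichever path has been fastest and healthy for that group.

```go
package steering

import (
	"math"
	"net"
)

// Path is one of the three ways a client request can reach the cloud.
type Path int

const (
	Direct   Path = iota // straight to the nearest AWS region
	ProxyIX              // via a CDN proxy at an Internet Exchange (TCP Anycast)
	ProxyISP             // via a CDN proxy embedded in an ISP network
)

// PathStats is what the probe pipeline aggregates per recursive resolver.
type PathStats struct {
	MedianLatencyMs float64
	SuccessRate     float64
}

// targetIPs maps each path to the address the authoritative DNS answers with.
// These are placeholder addresses from documentation ranges.
var targetIPs = map[Path]net.IP{
	Direct:   net.ParseIP("198.51.100.10"),
	ProxyIX:  net.ParseIP("203.0.113.10"),
	ProxyISP: net.ParseIP("192.0.2.10"),
}

// Answer picks the A record for a resolver group: the fastest path that is
// healthy enough, falling back to the direct cloud path otherwise.
func Answer(stats map[Path]PathStats) net.IP {
	best, bestLatency := Direct, math.Inf(1)
	if s, ok := stats[Direct]; ok {
		bestLatency = s.MedianLatencyMs
	}
	for _, p := range []Path{ProxyIX, ProxyISP} {
		s, ok := stats[p]
		if !ok || s.SuccessRate < 0.99 { // illustrative health threshold
			continue
		}
		if s.MedianLatencyMs < bestLatency {
			best, bestLatency = p, s.MedianLatencyMs
		}
	}
	return targetIPs[best]
}
```

Recomputing this decision periodically from fresh probe data is also what later makes an automatic reaction to breakdowns possible.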

Again, the question is how effective this approach will be. To answer it, we again use our probe system: we set up a reference configuration in which one of the targets follows the direction from DNS steering and the other goes directly to the cloud (current production).

[image]

As a result, we compare the results and obtain an estimate of efficiency:

[image]

As a result, we learned a few important things:

  1. We evaluated the expected performance of requests from clients to the cloud using DNS steering.
  2. We received data from real customers, from all types of devices.
  3. We proved the effectiveness of the proposed idea.
  4. We didn't take risks - we didn't change production configurations for clients.
  5. We didn't break anything.

Now for the hard part: launching in production

The easiest part is over - there is a working prototype. Now the hard part: launching the solution for all Netflix traffic, deploying it to 150 million users, thousands of devices, hundreds of microservices, and a constantly changing product and infrastructure. Netflix servers receive millions of requests per second, and it is easy to break the service with a careless action. At the same time, we want to dynamically route traffic through thousands of CDN servers on the Internet, where something changes and breaks constantly and at the most inopportune moment.

And with all this, the team has 3 engineers responsible for the development, deployment and full support of the system.

Therefore, we will continue to talk about a calm and healthy sleep.

How to continue development, and not spend all the time on support? Our approach is based on 3 principles:

  1. Limit the potential scale of damage (blast radius).
  2. Prepare for surprises - we expect something to break, despite testing and experience.
  3. Graceful degradation - if something goes wrong, the system should recover automatically, even if not in the most efficient way.

It turned out that with this approach to the problem, we could find a simple and effective solution that greatly simplifies system support. We realized that we could add a small piece of code to the client and watch for network request errors caused by connection problems. On a network error, we fall back directly to the cloud. This solution does not require significant effort from the client teams, but greatly reduces the risk of unexpected breakdowns and surprises for us.
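
A hedged sketch of what such a client-side fallback could look like, again in Go (host names are placeholders, and real Netflix clients are written per platform): if the request through the steered host fails at the transport level, it is retried once directly against the cloud, and the fallback is counted in telemetry.

```go
package netclient

import "net/http"

// directHost goes straight to the AWS region, bypassing steering.
// The incoming request is assumed to already target the steered host
// (the one resolved through our DNS steering zone).
const directHost = "api.cloud.example.com" // placeholder name

// Do sends the request via the steered path and falls back to the cloud if the
// steered path fails at the network level. It does not retry on application
// errors such as HTTP 5xx: those would come from the cloud either way. It
// assumes requests without a body (or with GetBody set), so the retry is safe.
func Do(client *http.Client, req *http.Request, onFallback func()) (*http.Response, error) {
	resp, err := client.Do(req)
	if err == nil {
		return resp, nil
	}

	// Transport-level failure: rebuild the same request against the direct host.
	retry := req.Clone(req.Context())
	retry.URL.Host = directHost
	retry.Host = directHost

	if onFallback != nil {
		onFallback() // the client fallback percentage is one of our two critical alerts
	}
	return client.Do(retry)
}
```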

Of course, despite the fallback, we still follow a clear discipline during development:

  1. Probe testing.
  2. A/B testing or Canaries.
  3. Gradual release (progressive rollout).

The probe-based approach has already been described: changes are first tested with a dedicated recipe.

For canary testing, we need comparable pairs of servers on which we can compare how the system behaves before and after the changes. To do this, from our numerous CDN sites we select pairs of servers that receive comparable traffic:

[image]

Then we deploy the build with the changes to the Canary server. To evaluate the results, we run a system that compares roughly 100-150 metrics against a set of Control servers:

[image]

If the Canary testing is successful, we roll the release out gradually, in waves. We do not update all the servers at a site at once: losing an entire site to a problem has a bigger impact on users than losing the same number of servers spread across different locations.

In general, the effectiveness and safety of this approach depends on the quantity and quality of the collected metrics. For our query acceleration system, we collect metrics from all possible components:

  • from clients - the number of sessions and requests, fallback rates;
  • from the proxy - statistics on the number and timing of requests;
  • from DNS - the number and results of requests;
  • from the cloud edge - the number of requests and their processing time in the cloud.

All this is collected in a single pipeline, and, depending on the needs, we decide which metrics to send to real-time analytics, and which ones to Elasticsearch or Big Data for more detailed diagnostics.

Monitoring

[image]

In our case, we are making changes to the critical request path between client and server. At the same time, the number of different components on the client, on the server, and along the path through the Internet is huge. Changes on the client and server happen constantly, driven by the work of dozens of teams and natural changes in the ecosystem. We sit in the middle, so when problems are being diagnosed, there is a good chance we will be involved. Therefore, we need to clearly understand how to define, collect, and analyze metrics in order to isolate problems quickly.

The ideal would be full access to all kinds of metrics and filters in real time. But there are a lot of metrics, so the question of cost arises. In our case, we separate metrics and tooling as follows:

[image]

To detect and triage problems, we use our own open-source real-time system Atlas, with Lumen for visualization. It stores aggregated metrics in memory, is reliable, and integrates with the alerting system. For localization and diagnostics, we have access to logs from Elasticsearch and Kibana. For statistical analysis and modeling, we use big data and visualization in Tableau.

It may seem that such a setup is difficult to work with. But with a hierarchical organization of metrics and tools, we can quickly analyze a problem, determine its type, and then drill down into detailed metrics. Identifying the source of a breakdown generally takes about 1-2 minutes. After that, we work with a specific team on diagnostics, which takes from tens of minutes to several hours.

Even if diagnosis is fast, we don't want it to happen often. Ideally, we should receive a critical alert only when there is a significant impact on the service. For our request acceleration system, we have only two critical alerts:

  • client fallback percentage - an assessment of client behavior;
  • probe error percentage - stability data for the network components.

These critical alerts keep track of whether the system works for most users. We're looking at how many clients took advantage of the fallback if they weren't able to get faster requests. We average less than 1 critical alert per week even though there is a huge amount of change going on in the system. Why is this enough for us?

  1. There is a client fallback in case our proxy does not work.
  2. There is an automatic steering system that responds to problems.

About the latter in more detail. Our probing system, and the system for automatically determining the optimal path for requests from the client to the cloud, allows us to automatically deal with some problems.

Let's go back to our probe configuration and the three path categories. In addition to load time, we can look at whether the data was delivered at all. If the data could not be loaded, then by comparing results across the different paths we can determine where and what broke, and whether we can fix it automatically by changing the request path.

Examples:

[images]

This process can be automated: include it in the steering system and teach it to respond to performance and reliability problems. If something starts to break, react - if there is a better option. Instant reaction is not critical, thanks to the fallback on clients.

Thus, the principles of system support can be formulated as follows:

  • reduce the scale of breakdowns;
  • collect metrics;
  • repair breakdowns automatically when we can;
  • notify when we can't;
  • invest in dashboards and a triage toolset for quick response.

Lessons learned

It doesn't take long to write a prototype. In our case, it was ready in 4 months. With it, we obtained new metrics, and 10 months after the start of development we received the first production traffic. Then the tedious and very difficult work began: gradually productizing and scaling the system, migrating the main traffic, and learning from mistakes. This process will not be linear, though - despite all efforts, it is impossible to predict everything. Fast iteration and responding to new data is far more effective.

[image]

Based on our experience, we can advise the following:

  1. Don't trust your intuition.

    Our intuition let us down constantly, despite the vast experience of the team members. For example, we incorrectly predicted the expected speedup from using a CDN proxy, or the behavior of TCP Anycast.

  2. Get data from production.

    It is important to get access to at least a small amount of production data as quickly as possible. The number of unique cases, configurations, and settings is almost impossible to reproduce in a lab. Quick access to the results lets you learn about potential problems early and account for them in the system architecture.

  3. Do not follow other people's advice and results - collect your data.

    Follow the principles of data collection and analysis, but do not blindly accept other people's results and statements. Only you can know what works for your users. Your systems and your customers may differ significantly from those of other companies. Fortunately, analysis tools are now available and easy to use. Your results may not match what Netflix, Facebook, Akamai, and other companies claim. In our case, TLS and HTTP/2 performance and DNS query statistics differ from the results of Facebook, Uber, and Akamai, because we have different devices, clients, and data flows.

  4. Do not follow fashion trends without the need and evaluation of effectiveness.

    Start simple. It is better to make a simple working system in a short time than to spend a huge amount of time developing components you do not need. Solve tasks and problems that matter based on your measurements and results.

  5. Get ready for new applications.

    Just as it is difficult to predict all the problems, it is difficult to predict the benefits and applications in advance. Take a cue from startups - their ability to adapt to customer conditions. In your case, you can discover new problems and their solutions. In our project, we set a goal to reduce the latency of requests. However, during the analysis and discussions, we realized that we can also use proxy servers:

    • to balance traffic across AWS regions and reduce costs;
    • for modeling CDN stability;
    • to configure DNS;
    • to configure TLS/TCP.

Conclusion

In this talk, I described how Netflix solves the problem of speeding up Internet requests between clients and the cloud: how we collect data through a client probe system and use the collected historical data to route production requests from clients over the fastest path on the Internet, and how we use the principles of network protocols, our CDN infrastructure, backbone network, and DNS servers to achieve this goal.

However, our solution is just an example of how we implemented such a system at Netflix - of what worked for us. The practical takeaway of my talk is the development and support principles that we follow and that bring good results.

Our solution may not work for you. However, the theory and development principles remain even if you do not have your own CDN infrastructure, or if it differs significantly from ours.

The importance of request speed for the business also remains. Even for a simple service, you have to make choices: between cloud providers, server locations, CDN and DNS providers. Your choice will affect the effectiveness of Internet requests for your customers, and it is important for you to measure and understand this impact.

Start with simple solutions, and be mindful of how you change the product. Learn as you go and improve the system based on data from your customers, your infrastructure, and your business. Think about the possibility of unexpected breakdowns during design. Then you can speed up your development process, improve the efficiency of the solution, avoid an unnecessary support burden, and sleep peacefully.

This year the conference will be held online from July 6 to 10. You will be able to ask questions of one of the fathers of DevOps, John Willis himself!

Source: habr.com
