Service Level Objectives - Google Experience (Google SRE book chapter translation)

SRE (Site Reliability Engineering) is an approach to keeping web services available and reliable. It is often described as a framework for DevOps: it explains how to apply DevOps practices successfully. This article is a translation of Chapter 4, Service Level Objectives, of Google's book Site Reliability Engineering. I prepared this translation myself and drew on my own experience with monitoring processes. In my Telegram channel monitorim_it and in my previous post on Habr I also published a translation of Chapter 6 of the same book.

The translation itself is below the cut. Enjoy reading!

It is impossible to manage a service without understanding which indicators really matter and how to measure and evaluate them. To that end, we want to define and deliver a particular level of service to our users, whether they use one of our internal APIs or a public product.

We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), service level objectives (SLOs), and service level agreements (SLAs). These measurements describe the key metrics we want to monitor and react to if we cannot provide the expected quality of service. Ultimately, choosing the right metrics helps drive the right actions when something goes wrong and gives the SRE team confidence in the health of the service.

This chapter describes the approach we use to deal with the problems of metric modeling, metric selection, and metric analysis. Much of this explanation would be abstract without an example, so we will use the Shakespeare search service, described earlier in the book as a sample implementation, to illustrate the main points.

Service level terminology

Many readers are likely familiar with the concept of an SLA, but the terms SLI and SLO deserve careful definition, because the term SLA is overloaded and takes on a range of meanings depending on the context. For clarity, we want to separate these terms.

Indicators

An SLI is a service level indicator: a carefully defined quantitative measure of some aspect of the level of service being provided.

For most services, the key SLI is request latency: how long it takes to return a response to a request. Other common SLIs include the error rate, often expressed as a fraction of all requests received, and system throughput, usually measured in requests per second. Measurements are often aggregated: raw data is collected first and then converted into a rate, average, or percentile.

Ideally, an SLI directly measures the service level of interest, but sometimes only a related metric is available, because the desired one is difficult to obtain or interpret. For example, client-side latency is often the more relevant metric, but sometimes latency can only be measured on the server.

Another kind of SLI that is important to SRE is availability: the fraction of time that a service is usable. It is often defined as the fraction of well-formed requests that succeed, sometimes called yield. (Durability, the likelihood that data will be retained over a long period of time, is equally important for storage systems.) Although 100% availability is impossible, availability close to 100% is often achievable, and availability values are expressed as a number of "nines" in the availability percentage. For example, availability of 99% and 99.999% can be referred to as "2 nines" and "5 nines". The current published target for Google Compute Engine availability is "three and a half nines", or 99.95%.
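
As a quick illustration (my own arithmetic, not part of the book), the downtime that a given number of nines allows over 30 days can be computed directly:

```python
# Rough downtime budget implied by an availability target (illustrative sketch).
def allowed_downtime_minutes(availability: float, period_days: float = 30) -> float:
    """Minutes of downtime allowed per period at the given availability level."""
    return (1.0 - availability) * period_days * 24 * 60

for target in (0.99, 0.999, 0.9995, 0.99999):
    print(f"{target:.5%}: {allowed_downtime_minutes(target):7.1f} min per 30 days")
```

At 99.95% ("three and a half nines") this works out to roughly 21.6 minutes of downtime per 30-day month.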

Objectives

An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for an SLO is "SLI ≤ target" or "lower bound ≤ SLI ≤ upper bound". For example, we may decide that we will return Shakespeare search results "quickly" by adopting an SLO that the average search request latency is less than 100 milliseconds.

Choosing the right SLO is a complex process. First, you cannot always pick a specific value. For external incoming HTTP requests to your service, the queries-per-second (QPS) metric is essentially determined by your users' desire to visit your service, and you cannot set an SLO for it.

On the other hand, you can say that you want the average latency per request to be below 100 milliseconds. Setting such a goal may force you to write your frontends with low latency in mind or to buy hardware that can deliver such latency. (100 milliseconds is obviously an arbitrary value, and lower latency is generally better. There is evidence that fast is better than slow, and that delays in serving user requests beyond certain values actually drive people away from your service.)

Again, this is more subtle than it might seem at first glance: you should not throw QPS out of the calculation entirely. QPS and latency are strongly related: higher QPS often leads to higher latency, and services usually experience a sharp drop in performance once a certain load threshold is reached.

Choosing and publishing an SLO sets users' expectations for how the service will perform. This strategy can reduce unfounded complaints to the service owner about, for example, the service being slow. Without an explicit SLO, users form their own expectations about the desired performance, which may have nothing to do with the views of the people designing and operating the service. This can lead both to over-reliance on the service, when users mistakenly believe it is more available than it really is, and to under-reliance, when users believe the system is less reliable than it really is.

Agreements

An SLA is an explicit or implicit contract with your users that includes the consequences of meeting (or missing) the SLOs it contains. The consequences are most easily recognized when they are financial, such as a discount or a penalty, but they can take other forms. An easy way to tell the difference between an SLO and an SLA is to ask "what happens if the SLO is not met?". If there are no obvious consequences, you are almost certainly looking at an SLO.

SRE is not typically involved in creating SLAs, because SLAs are closely tied to business and product decisions. SRE does, however, get involved in helping to avoid triggering the consequences of missed SLOs. SREs can also help define the SLIs: obviously there must be an objective way to measure the SLOs in the agreement, or there will be disagreements.

Google Search is an example of an important service that does not have an SLA for the public: we want everyone to use Search as efficiently as possible, but we have not signed a contract with the whole world. However, there are still consequences if search is unavailable - unavailability leads to a drop in our reputation, as well as a decrease in advertising revenue. Many other Google services, such as Google for Work, have explicit SLAs with users. Whether or not a particular service has an SLA, it is important to define SLIs and SLOs and use them to manage the service.

So much for theory; now on to practice.

Indicators in practice

Given that it is important to choose the right metrics to measure the level of service, how do you know which metrics matter for your service or system?

What do you and your users care about?

You shouldn't use every metric you can track in your monitoring system as an SLI; understanding what users want from the system will help you select a few indicators. Choosing too many indicators makes it hard to pay proper attention to the important ones, while choosing too few may leave large parts of your system unexamined. We usually use a handful of key indicators to evaluate and understand the state of a system.

Services tend to fall into several broad categories in terms of the SLIs that are relevant to them:

  • User-facing serving systems, such as the search frontends of the Shakespeare service from our example. They care about availability, low latency, and sufficient throughput. Accordingly, the questions to ask are: Could we respond to the request? How long did it take to respond? How many requests can be handled?
  • Storage systems. Low response latency, availability, and durability matter for them. Related questions: How long does it take to read or write data? Can we access the data on demand? Is the data still there when we need it? See Chapter 26, Data Integrity: What You Read Is What You Wrote, for a detailed discussion of these issues.
  • Big data systems, such as data processing pipelines, care about throughput and end-to-end latency. Related questions: How much data is being processed? How long does it take data to travel from ingestion to completion? (Some parts of the pipeline may also have latency targets for individual stages.)

Collection of indicators

Many service level indicators are most naturally collected on the server side, using a monitoring system such as Borgmon (see Chapter 10, Practical Alerting from Time-Series Data) or Prometheus, or simply by periodically parsing logs for HTTP responses with a 500 status. However, some systems should also be instrumented with client-side metrics collection, because the absence of client-side monitoring can cause you to miss a range of problems that affect users but do not show up in server-side metrics. For example, focusing on the response latency of the backend of our Shakespeare search test application might miss delays in handling requests on the user side caused by JavaScript problems; in that case, measuring how long it takes to render the page in the browser is a better indicator.
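
As a rough sketch of the log-parsing approach mentioned above, the snippet below counts 5xx responses in an access log to estimate an error-rate SLI; the log path and the position of the status field are assumptions for illustration only:

```python
# Sketch: estimate an error-rate SLI by scanning an access log for 5xx responses.
# The log path and the index of the status field are illustrative assumptions.
def error_rate(log_path: str, status_field: int = 8) -> float:
    total = errors = 0
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) <= status_field:
                continue  # skip malformed lines
            total += 1
            if fields[status_field].startswith("5"):
                errors += 1
    return errors / total if total else 0.0

print(f"error rate: {error_rate('/var/log/nginx/access.log'):.4%}")
```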

Aggregation

For simplicity and ease of use, we often aggregate raw measurements. This must be done carefully.

Some metrics seem simple, like requests per second, but even this apparently straightforward measurement implicitly aggregates data over time. Is the measurement taken exactly once per second, or is it averaged over the number of requests per minute? The latter can hide a much higher instantaneous request rate that lasts only a few seconds. Consider a system that serves 200 requests per second during even-numbered seconds and 0 the rest of the time. A constant average of 100 requests per second and an instantaneous load twice that size are not at all the same thing. Similarly, averaging request latency may seem attractive, but it hides an important detail: it is quite possible for most requests to be fast while a long tail of requests is much slower.

Most measurements are best thought of as distributions rather than averages. For example, for a latency SLI, some requests will be served quickly, while others will always take longer, sometimes much longer. A simple average can hide these long delays. The figure shows an example: although a typical request is served in roughly 50 ms, 5% of requests are 20 times slower! Monitoring and alerting based only on the average latency would show no change in behavior over the course of the day, when in fact there are noticeable changes in the processing time of some requests (top row).

50th, 85th, 95th, and 99th percentiles of system latency. The y-axis uses a logarithmic scale.

Using percentiles for indicators lets you see the shape of the distribution and its characteristics: a high percentile, such as the 99th or 99.9th, shows the worst-case values, while the 50th percentile (also known as the median) shows the typical state of the metric. The higher the variance in response time, the more the long-running requests affect the user experience, and the effect is amplified under high load and in the presence of queues. User experience research has shown that people generally prefer a slightly slower system over one with high variance in response time, so some SRE teams focus only on high percentiles, on the assumption that if behavior at the 99.9th percentile is good, most users will not experience problems.
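
A small sketch (mine, not from the book) of why percentiles are reported instead of the mean: the synthetic latency sample below, with roughly 5% of requests about 20 times slower than the rest, mirrors the figure described above, and the mean lands nowhere near what a typical user experiences.

```python
import math
import random
import statistics

# Synthetic latency sample (invented for illustration): most requests ~50 ms,
# but about 5% take roughly 20x longer.
random.seed(42)
latencies_ms = [random.gauss(50, 5) if random.random() > 0.05 else random.gauss(1000, 100)
                for _ in range(10_000)]

def percentile(data, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(data)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

print(f"mean: {statistics.mean(latencies_ms):7.1f} ms")
for pct in (50, 85, 95, 99):
    print(f"p{pct}:  {percentile(latencies_ms, pct):7.1f} ms")
```

The mean sits well above the median, pulled up by the slow tail, while the 50th percentile stays near 50 ms.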

Note on statistical errors

We usually prefer to work with percentiles rather than the mean (arithmetic average) of a set of values. Doing so makes it possible to consider the more dispersed values, which often have significantly different (and more interesting) characteristics than the average. Because of the artificial nature of computing systems, metric values are often skewed: for example, no request can be answered in less than 0 ms, and a timeout of 1,000 ms means there can be no successful responses with values greater than the timeout. As a result, we cannot assume that the mean and the median are the same, or even close to each other!

We try not to assume that our data is normally distributed without checking first, in case some standard intuitions and approximations do not hold. If the distribution is not what was expected, an automated process that fixes a problem (for example, one that restarts a server with high request-processing latency when it sees outliers) may do so too often or not often enough, and neither option is good.

Standardize indicators

We recommend standardizing common characteristics of SLIs so that you don't have to reason about them from scratch every time. Any feature that conforms to a standard template can be left out of the specification of an individual SLI, for example:

  • Aggregation intervals: "averaged over 1 minute"
  • Aggregation areas: "All tasks in the cluster"
  • How often measurements are taken: "Every 10 seconds"
  • What requests are included: "HTTP GET from black box monitoring jobs"
  • How the data is acquired: "Measured at the server by our monitoring"
  • Data access latency: "Time to last byte"

To save effort, build a set of reusable SLI templates for each common metric; these also make it easier for everyone to understand what a particular SLI means.
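
One possible shape for such a template (the field names here are hypothetical, not a Google standard) is a set of defaults that individual SLIs inherit unless they explicitly override them:

```python
# Hypothetical SLI template: defaults that individual SLIs inherit.
SLI_DEFAULTS = {
    "aggregation_interval": "1 minute",
    "aggregation_scope": "all tasks in the cluster",
    "measurement_frequency": "every 10 seconds",
    "requests_included": "HTTP GET from black-box monitoring jobs",
    "data_source": "server-side monitoring",
    "latency_definition": "time to last byte",
}

# A specific SLI only needs to state what differs from the template.
search_latency_sli = {**SLI_DEFAULTS, "metric": "search request latency", "unit": "ms"}
```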

Objectives in practice

Start by thinking about (or finding out!) what your users care about, not about what you can measure. Often what your users care about is difficult or impossible to measure, so you end up approximating their needs. However, if you simply start with what is easy to measure, you will end up with less useful SLOs. As a result, we have sometimes found that starting from the desired objectives and working backward to specific indicators works better than choosing indicators and then deriving the objectives.

Define objectives

For maximum clarity, you should define how SLOs are measured and the conditions under which they are valid. For example, we might say the following (the second line is the same as the first, but relies on the SLI defaults):

  • 99% (averaged over 1 minute) of Get RPC calls will complete in less than 100ms (measured on all backend servers).
  • 99% of Get RPC calls will complete in less than 100ms.

If the shape of performance curves is important, you can specify multiple SLO targets:

  • 90% of Get RPC calls completed in less than 1 ms.
  • 99% of Get RPC calls completed in less than 10 ms.
  • 99.9% of Get RPC calls completed in less than 100 ms.
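
As a rough sketch of how such tiered targets might be checked against measured latencies (the function and the sample values below are invented for illustration, not part of the book):

```python
# Tiered latency SLO targets: (threshold in ms, required fraction of requests below it).
SLO_TARGETS = [(1, 0.90), (10, 0.99), (100, 0.999)]

def slo_met(latencies_ms, targets=SLO_TARGETS):
    """Return, per threshold, the measured fraction and whether the target is met."""
    results = {}
    for threshold, required in targets:
        fraction = sum(1 for l in latencies_ms if l < threshold) / len(latencies_ms)
        results[threshold] = (fraction, fraction >= required)
    return results

# Placeholder sample; in practice these values would come from monitoring.
for threshold, (fraction, ok) in slo_met([0.4, 0.8, 3.0, 7.5, 42.0]).items():
    print(f"< {threshold} ms: {fraction:.1%} ({'meets' if ok else 'misses'} target)")
```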

If your users generate heterogeneous workloads, such as bulk processing (for which throughput matters) and interactive processing (for which latency matters), it may make sense to define separate targets for each class of load:

  • 95% of throughput clients' Set RPC calls will complete in less than 1 second.
  • 99% of latency clients' Set RPC calls with payloads smaller than 1 kB will complete in less than 10 ms.

It is unrealistic and undesirable to insist that SLOs be met 100% of the time: doing so can slow the rate at which new features are introduced and deployed, and can require expensive solutions. Instead, it is better to allow an error budget, the acceptable fraction of system unavailability, and to track it on a daily or weekly basis. Senior management will probably want a monthly or quarterly assessment as well. (An error budget is just an SLO for meeting other SLOs.)

The rate of SLO violations can be compared against the error budget (see Chapter 3 and the section "Motivation for Error Budgets"), with the difference used as an input to the process that decides when to roll out new releases.
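
As an illustration of this bookkeeping (my own sketch, not from the book), the following assumes a hypothetical 99.9% availability SLO and request counts pulled from monitoring:

```python
# Sketch: how much of the error budget has been consumed under a 99.9% availability SLO.
SLO_AVAILABILITY = 0.999

def budget_consumed(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget used so far (can exceed 1.0)."""
    allowed_failures = (1.0 - SLO_AVAILABILITY) * total_requests
    return failed_requests / allowed_failures if allowed_failures else float("inf")

# Placeholder numbers for illustration.
used = budget_consumed(total_requests=10_000_000, failed_requests=4_200)
print(f"error budget consumed: {used:.0%}")
if used >= 1.0:
    print("budget exhausted: hold risky releases until reliability recovers")
```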

Choosing target values

Choosing target values (SLOs) is not a purely technical activity, because of the product and business interests that must be reflected in the chosen SLIs, SLOs (and possibly SLAs). Likewise, it may be necessary to trade off questions of staffing, time to market, hardware availability, and funding. SRE should be part of this conversation and help work through the risks and viability of the different options. We have written down a few lessons that can help make the discussion more productive:

Don't choose a target based on current performance
While understanding the strengths and limits of a system is important, adopting current values as targets without reflection can lock you into supporting the system as it is: one that requires heroic efforts to meet its targets and cannot be improved without significant redesign.

Keep it simple
Complex SLI calculations can hide changes in system performance and make it harder to find the cause of a problem.

Avoid absolutes
While it is tempting to have a system that can handle an infinitely growing load without any increase in latency, this requirement is unrealistic. A system that lives up to such ideals will probably take a long time to design and build, be expensive to operate, and be far better than what users with more modest expectations actually need.

Use as few SLOs as possible
Choose just enough SLOs to provide good coverage of your system's attributes. Defend the SLOs you've chosen: if you can never win a priority argument by citing a particular SLO, it is probably not worth having that SLO. However, not all attributes of a system lend themselves to SLOs: it is difficult to quantify user delight with an SLO.

Don't chase perfection
You can always refine SLO definitions and targets over time as you learn more about a system's behavior under load. It is better to start with a relatively loose target that you tighten over time than to pick an overly strict target that has to be relaxed when you find it is unattainable.

SLOs can and should be a major driver in prioritizing work for SREs and product developers, because they reflect what users care about. A good SLO is a useful forcing function for a development team. But a poorly thought-out SLO can lead to wasted work if a team makes heroic efforts to meet an overly aggressive SLO, or to a poor product if the SLO is too lax. SLOs are a powerful lever: use them wisely.

Control measurements

SLI and SLO are the key elements used to manage systems:

  • Monitor and measure the system's SLIs.
  • Compare the SLIs to the SLOs and decide whether action is needed.
  • If action is required, figure out what needs to happen to reach the goal.
  • Take this action.

For example, if step 2 shows that request latency is increasing and will violate the SLO in a few hours unless something is done, step 3 might include testing the hypothesis that the servers are CPU-bound and that adding more servers would spread the load. Without an SLO, you would not know whether (or when) to take action.
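
A minimal sketch of that loop, with measure_sli() and add_capacity() as invented placeholders rather than a real API:

```python
import random
import time

SLO_LATENCY_MS = 100  # hypothetical target: 99th-percentile latency below 100 ms

def measure_sli() -> float:
    """Placeholder: in practice this would query the monitoring system."""
    return random.uniform(50, 150)

def add_capacity() -> None:
    """Placeholder: in practice this might resize the serving job."""
    print("requesting more backend capacity")

def control_loop(iterations: int = 3, interval_s: float = 1.0) -> None:
    for _ in range(iterations):
        p99 = measure_sli()            # step 1: monitor and measure the SLI
        if p99 > SLO_LATENCY_MS:       # step 2: compare the SLI with the SLO
            add_capacity()             # steps 3-4: decide on and take an action
        time.sleep(interval_s)

control_loop()
```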

Published SLOs set user expectations
Publishing SLOs sets users' expectations of how the system will behave. Users (and potential users) often want to know what to expect from a service in order to understand whether it suits their use case. For example, people who want to use a photo-sharing website may want to avoid a service that promises durability and low cost in exchange for somewhat lower availability, even though the same service might be ideal for an archival records-management system.

To set realistic expectations for your users, use one or both of the following tactics:

  • Maintain a safety margin. Use a tighter internal SLO than the one advertised to users. This gives you room to respond to problems before they become visible from the outside. An SLO buffer also provides a margin of safety when rolling out releases that affect system performance and ensures that the system is easy to maintain without frustrating users with downtime.
  • Don't overdeliver. Users build on what you actually provide, not on what you promise. If your service's real performance is much better than its stated SLO, users will come to rely on the current performance. You can avoid over-dependence by deliberately taking the system offline from time to time or by throttling performance under light load.

Understanding how well a system meets expectations helps you decide whether to invest in making it faster, more available, and more resilient. Alternatively, if the service is already performing well enough, some staff time should perhaps be spent on other priorities, such as paying off technical debt, adding new features, or bringing new products into production.

Agreements in practice

Creating an SLA requires the business and legal teams to define the consequences and penalties for violating it. The role of SRE is to help them understand how difficult it is likely to be to meet the SLOs contained in the SLA. Most of the advice on creating SLOs also applies to SLAs. It is wise to be conservative in what you promise users, because the broader the audience, the harder it is to change or remove SLAs that prove to be unreasonable or difficult to meet.

Thank you for reading the translation to the end. Subscribe to my Telegram channel about monitoring, monitorim_it, and to my blog on Medium.

Source: habr.com
