Monitoring Distributed Systems: Google's Experience (a translation of a chapter from the Google SRE book)

SRE (Site Reliability Engineering) is an approach to keeping web services available and reliable. It is often treated as a framework for DevOps and describes how to succeed in applying DevOps practices. This article is a translation of Chapter 6, Monitoring Distributed Systems, of the book Site Reliability Engineering from Google. I prepared this translation myself, drawing on my own experience with monitoring processes. In my Telegram channel @monitorim_it and blog on Medium I have also posted a link to a translation of Chapter 4 of the same book, on Service Level Objectives.

Enjoy reading!

Google's SRE teams have basic principles and best practices for building successful monitoring and alerting systems. This chapter offers guidance on which problems should interrupt a person with an alert, and how to deal with problems that are not serious enough to warrant one.

Definitions

There is no uniformly shared vocabulary for discussing topics related to monitoring. Even at Google, the terms below are not in universal use, but we list the most common interpretations here.

Monitoring

Collection, processing, aggregation and display of real-time quantitative data about the system: number of requests and types of requests, number of errors and types of errors, request processing time and server uptime.

White box monitoring

Monitoring based on metrics exposed by the internals of the system: logs, JVM profiling interfaces, or an HTTP handler that emits internal statistics.

Black box monitoring

Testing the behavior of the application from the user's point of view.

Dashboard

An interface (usually a web interface) that provides an overview of the key health indicators of the services. The dashboard can have filters, the ability to select which metrics to display, and so on. The interface is designed to surface the metrics that matter most to its users. The dashboard can also display information for technical support staff: a request queue, a list of high-priority errors, the engineer assigned to a given area of responsibility.

Alert

Notifications intended to be received by a person by e-mail or otherwise, which may be triggered as a result of errors or an increase in the request queue. Notifications are categorized as: tickets, email alerts and messenger messages.

Root cause

A software defect or human error that, once fixed, gives confidence that the same event will not happen again in the same way. A problem can have several root causes: insufficient process automation, a software defect, insufficient understanding of the application logic. Each of these factors can be a root cause, and each of them must be eliminated.

Node and machine

Interchangeable terms to refer to a single instance of a running application on a physical server, virtual machine, or container. There can be several services on one machine. Services can be:

  • related to each other: for example, a cache server and a web server;
  • unrelated services sharing the same hardware: for example, a code repository and a master for a configuration system such as Puppet or Chef.

Push

Any change to a service's running software or its configuration.

Why monitoring is needed

There are several reasons why applications should be monitored:

Analysis of long-term trends

How big is the database and how fast is it growing? How does the daily number of users change?

Performance Comparison

Are queries faster on Acme Bucket of Bytes 2.72 than on Ajax DB 3.14? How much better is the cache hit rate after adding another node? Has the site become slower than it was last week?

Alerting

Something is broken and someone has to fix it. Or something will break soon and someone has to check it soon.

Creating dashboards

Dashboards should answer the basic questions about your service and normally include some form of the four golden signals: latency, traffic, errors, and saturation.

Conducting retrospective analysis (debugging)

Request processing latency just increased; what else happened around the same time?

Monitoring systems are also useful as a data source for business analytics and for facilitating the analysis of security incidents. Because this book focuses on the engineering domains in which SREs have particular expertise, we will not discuss these applications of monitoring here.

Monitoring and alerting allow a system to tell us when it is broken, or about to break. When a system cannot automatically repair itself, we want a human to investigate the alert, determine whether the problem is real and still present, mitigate it, and determine its root cause. Unless you are performing security auditing on very narrowly scoped components of a system, you should never trigger an alert simply because "something seems a little odd."

Paging a human is a rather expensive use of an employee's time. If the employee is at work, the alert interrupts their workflow. If the employee is at home, the alert interrupts their personal time, and possibly their sleep. When alerts occur too frequently, employees start to skim, postpone, or ignore incoming alerts, and from time to time they ignore a real alert that is masked by noise. Outages can then last longer, because the noise prevents rapid diagnosis and resolution. Effective alerting systems have a good signal-to-noise ratio.

Determining reasonable expectations from the monitoring system

Setting up monitoring for a complex application is a complex engineering task in itself. Even with a significant infrastructure of collection, display, and alerting tools, a Google SRE team of 10-12 members typically includes one or two people whose main purpose is to build and maintain monitoring systems. This number has decreased over time as we generalize and centralize the monitoring infrastructure, but each SRE team typically has at least one monitoring-only staff member. It must be said that while it is quite interesting to watch monitoring system dashboards, SRE teams carefully avoid situations that require someone to look at the screen to monitor problems.

In general, Google has gravitated toward simple and fast monitoring systems with better tools for after-the-fact analysis. We avoid "magic" systems that try to learn thresholds or automatically detect root causes. Rules that detect unexpected changes in end-user request rates are one counterexample; as long as these rules remain simple, they can very quickly detect severe anomalies. Other uses of monitoring data, such as capacity planning or traffic forecasting, can tolerate more fragility and therefore more complexity. Observations over very long periods (months or years) at a low sampling rate (hours or days) will reveal long-term trends.

The Google SRE teams have had mixed success with complex dependency hierarchies. We rarely use rules such as "if I know the database is slow, alert me about a slow database; otherwise alert me about a slow site." Dependency-based rules usually apply to very stable parts of our system, such as the system for draining user traffic away from a data center. For example, "if a data center is drained, don't alert me about its latency" is one common data-center alerting rule. Few teams at Google maintain complex dependency hierarchies, because our infrastructure is continuously being refactored.
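
To make the idea concrete, here is a minimal sketch of what such a dependency-based rule might look like; the helper datacenter_is_drained, the drained set, and the threshold below are hypothetical placeholders, not Google's actual rule language.

```python
# Hypothetical sketch of a dependency-based alerting rule: suppress latency
# alerts for a data center whose user traffic has been drained on purpose.

DRAINED_DATACENTERS = {"dc-eu-west-2"}  # illustrative state, not real data


def datacenter_is_drained(datacenter: str) -> bool:
    # Placeholder lookup; in practice this would query the traffic-draining
    # system's source of truth.
    return datacenter in DRAINED_DATACENTERS


def should_alert_on_latency(datacenter: str, p99_latency_ms: float,
                            threshold_ms: float = 500.0) -> bool:
    """Alert on high latency only if the data center is actually serving users."""
    if datacenter_is_drained(datacenter):
        # No user traffic here, so high latency affects nobody: stay quiet.
        return False
    return p99_latency_ms > threshold_ms


print(should_alert_on_latency("dc-eu-west-2", 900.0))  # False: drained
print(should_alert_on_latency("dc-us-east-1", 900.0))  # True: real problem
```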

Some of the ideas described in this chapter are still aspirational: there is always room to move more quickly from symptom to root cause, especially in ever-changing systems. So while this chapter sets out some goals for monitoring systems and ways of achieving them, it is important that monitoring systems be simple and understandable for everyone on the team.

Likewise, to keep noise low and signal high, the elements of your monitoring system that trigger alerts to humans must be very simple and reliable. Rules that generate alerts for people should be easy to understand and represent a clear failure.

Symptoms versus causes

Your monitoring system should answer two questions: "what is broken" and "why is it broken".
"What is broken" refers to the symptom; "why it is broken" refers to the cause. The table below shows examples of such pairings.

Symptom: Serving HTTP 500 or 404 errors
Cause: Database servers are refusing connections

Symptom: Slow server responses
Cause: High CPU utilization or a damaged Ethernet cable

Symptom: Users in Antarctica are not getting cat GIFs
Cause: Your CDN hates scientists and felines, so some IPs are blacklisted

Symptom: Private content is available to everyone
Cause: Rolling out a new software release made the firewall forget all ACLs and let everyone in

"What" and "why" are one of the most important building blocks for creating a good monitoring system with maximum signal and minimum noise.

Black-box vs. White-box

We combine extensive white-box monitoring with modest but critical black-box monitoring. The simplest way to contrast black-box with white-box: black-box monitoring is symptom-oriented and reactive rather than proactive: "the system is not working correctly right now." White-box monitoring relies on the ability to inspect system internals, such as logs or web server statistics, and therefore lets you detect imminent problems, failures masked by request retries, and so on.
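
As a rough illustration, a black-box check knows nothing about internals and simply does what a user would do. The sketch below (the URL and timeout are arbitrary placeholders) requests a page over HTTP and records whether it succeeded and how long it took.

```python
# A minimal black-box probe: hit the public endpoint exactly as a user would,
# with no knowledge of logs, handlers, or any other internals.
import time
import urllib.request


def probe(url: str, timeout_s: float = 5.0) -> dict:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = (resp.status == 200)
    except Exception:
        ok = False  # timeouts, DNS failures, and non-2xx responses all count
    latency_ms = (time.monotonic() - start) * 1000
    return {"ok": ok, "latency_ms": latency_ms}


print(probe("https://example.com/"))  # placeholder URL
```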

Note that in a multilayer system, one engineer's symptom is another engineer's cause. For example, suppose database performance has degraded. Slow database reads are a symptom for the database SRE who detects them. However, for a front-end SRE watching a slow website, those same slow database reads are a cause. Therefore, white-box monitoring is sometimes focused on symptoms and sometimes on causes, depending on how much it can see.

When collecting telemetry for debugging, White-box monitoring is required. If web servers are slow to respond to database queries, you need to know how fast the web server is communicating with the database and how fast it is responding. Otherwise, you won't be able to tell the difference between a slow database server and a network problem between the web server and the database.

Black-box monitoring has a key advantage for alerting: a notification reaches a person only when the problem is already causing real symptoms. On the other hand, for problems that have not yet occurred but are imminent, black-box monitoring is fairly useless.

Four golden signals

The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can measure only four metrics of your user-facing system, focus on these four.

Latency

The time required to serve a request. It is important to distinguish between the latency of successful and failed requests. For example, an HTTP 500 error caused by a lost connection to a database or another backend may be served very quickly; however, since an HTTP 500 indicates a failed request, including 500s in the overall latency can lead to misleading conclusions. On the other hand, a slow error is even worse than a fast error! Therefore, it is important to track error latency rather than simply filtering errors out.
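
A minimal sketch of that idea, assuming a simple in-process collector: latencies are recorded in separate series for successful and failed requests, so a slow error neither hides among the successes nor gets filtered away.

```python
# Record latency separately for successful and failed requests so that error
# latency can be tracked on its own rather than filtered out.
from collections import defaultdict

latencies_ms = defaultdict(list)  # keyed by outcome: "success" or "error"


def record_request(status_code: int, latency_ms: float) -> None:
    outcome = "success" if status_code < 500 else "error"
    latencies_ms[outcome].append(latency_ms)


record_request(200, 42.0)    # an ordinary successful request
record_request(500, 3.0)     # a fast error: the backend refused the connection
record_request(500, 2400.0)  # a slow error: even worse than a fast one
```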

Traffic

The demand on your system, measured as a high-level system-specific metric. For a web service, this is typically the number of HTTP requests per second, perhaps broken down by the nature of the requests (for example, static versus dynamic content). For an audio streaming system, the measurement might focus on the network I/O rate or the number of concurrent sessions. For a key-value storage system, it might be transactions or lookups per second.

Errors

The rate of failed requests: explicit failures (for example, HTTP 500), implicit failures (for example, HTTP 200 combined with the wrong content), or failures by policy (for example, "if you committed to one-second response times, any request over one second is an error"). Where HTTP response codes are not enough to express all failure conditions, secondary (internal) protocols may be needed to track partial failures. Monitoring all such failure modes in one place can be uninformative, while end-to-end system tests can help you discover that you are serving the wrong content.
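
A hedged sketch of how the three kinds of errors might be counted in one place; the one-second deadline and the content check are illustrative assumptions, not a prescribed policy.

```python
# Classify a request as an error explicitly (5xx), implicitly (200 with the
# wrong content), or by policy (a committed one-second deadline was missed).

def is_error(status_code: int, content_ok: bool, latency_ms: float,
             deadline_ms: float = 1000.0) -> bool:
    if status_code >= 500:
        return True                      # explicit failure
    if status_code == 200 and not content_ok:
        return True                      # implicit failure: wrong content served
    if latency_ms > deadline_ms:
        return True                      # failure by policy: too slow to count
    return False


print(is_error(500, True, 20.0))    # True
print(is_error(200, False, 20.0))   # True
print(is_error(200, True, 1500.0))  # True
print(is_error(200, True, 80.0))    # False
```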

Saturation

A measure of how heavily your service is being used. It highlights the resource that is most constrained (for example, in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade before they reach 100% utilization, so having a utilization target is essential.

In complex systems, saturation can be supplemented with higher-level load measurements: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it does now? For simple services that have no parameters changing the complexity of a request (e.g. "Give me a nonce" or "I need a globally unique monotonic integer") and that rarely change configuration, a static load-test value may be adequate. However, as discussed in the previous paragraph, most services need to use indirect signals such as CPU utilization or network bandwidth, which have a known upper bound. Rising latency is often a leading indicator of saturation. Measuring the 99th percentile response time over a small window (for example, one minute) can give a very early signal of saturation.
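
For illustration, a minimal sketch of that last point: computing the 99th percentile latency over a one-minute sliding window. The window size and the nearest-rank percentile method are assumptions.

```python
# Track the 99th-percentile latency over a short sliding window (one minute)
# as an early, indirect signal of saturation.
import time
from collections import deque

WINDOW_S = 60
samples = deque()  # (timestamp, latency_ms) pairs


def observe(latency_ms, now=None):
    now = time.time() if now is None else now
    samples.append((now, latency_ms))
    # Drop samples that have fallen out of the window.
    while samples and samples[0][0] < now - WINDOW_S:
        samples.popleft()


def p99_ms():
    if not samples:
        return None
    values = sorted(lat for _, lat in samples)
    return values[int(0.99 * (len(values) - 1))]  # nearest-rank percentile
```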

Finally, saturation is also associated with predictions of impending saturation, such as: "Looks like your database will fill up your hard drive in 4 hours."

If you measure all four golden signals and notify a person whenever one of them is problematic (or, in the case of saturation, nearly problematic), your service will be at least decently covered by monitoring.

Worry about the tail (or instrumentation and performance)

When building a monitoring system from scratch, it is tempting to design it around averages: average latency, average CPU utilization of nodes, or average database fullness. The danger of the latter two examples is obvious: CPUs and databases are utilized in very uneven ways. The same applies to latency. If you run a web service with an average latency of 100 ms at 1,000 requests per second, 1% of requests can easily take 5 seconds. If users depend on several such web services, the 99th percentile of a single backend can easily become the median response time of the frontend.

The simplest way to distinguish between a slow average and a very slow tail of requests is to collect request counts bucketed by latency (which is well suited to rendering as a histogram) rather than actual latencies: how many requests did the service handle that took between 0 ms and 10 ms, between 10 ms and 30 ms, between 30 ms and 100 ms, between 100 ms and 300 ms, and so on. Widening the histogram bounds approximately exponentially (by a factor of about 3) is often an easy way to visualize the distribution of requests.
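
A minimal sketch of such bucketing, with upper bounds growing by roughly a factor of three; the exact bounds are an assumption.

```python
# Count requests per latency bucket instead of keeping every raw latency.
# Upper bounds grow roughly exponentially (by a factor of about 3).
BUCKET_BOUNDS_MS = [10, 30, 100, 300, 1000, 3000]


def bucket_counts(latencies_ms):
    counts = [0] * (len(BUCKET_BOUNDS_MS) + 1)  # the last bucket is overflow
    for latency in latencies_ms:
        for i, bound in enumerate(BUCKET_BOUNDS_MS):
            if latency <= bound:
                counts[i] += 1
                break
        else:
            counts[-1] += 1  # slower than the largest bound
    return counts


print(bucket_counts([4, 12, 95, 110, 5000]))  # [1, 1, 1, 1, 0, 0, 1]
```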

Choosing the right granularity for measurements

Different elements of the system should be measured with different levels of detail. For example:

  • Observing CPU load averaged over a long period will not reveal even fairly long spikes that cause high tail latencies.
  • On the other hand, for a web service targeting no more than 9 hours of downtime per year (99.9% annual uptime), checking for an HTTP 200 response more than once or twice per minute is probably unnecessarily frequent.
  • Similarly, checking free hard-drive space more than once every 1-2 minutes for a 99.9% availability target is probably unnecessary.

Take care in how you structure the granularity of your measurements. Collecting CPU utilization once per second can produce interesting data, but such frequent measurements can be very expensive to collect, store, and analyze. If your monitoring goal requires high granularity but does not require high responsiveness, you can reduce these costs by sampling on the server and then configuring an external system to collect and aggregate those samples. For example, you could:

  1. Record the current CPU utilization each second.
  2. Using buckets of 5% granularity, increment the appropriate bucket each second.
  3. Aggregate those values every minute.

This strategy allows you to collect highly granular data without incurring high overhead for analysis and storage; a minimal sketch of the pipeline is shown below.
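
The sketch assumes the third-party psutil library purely as an example per-second sampler; any source of per-second CPU readings would do.

```python
# Sample CPU utilization every second, coarsen each sample to 5% granularity,
# and push only the per-minute aggregate to the external monitoring system.
import psutil  # third-party library, used here only as an example sampler


def collect_one_minute():
    buckets = {}
    for _ in range(60):
        cpu = psutil.cpu_percent(interval=1)  # blocks ~1 s: one sample per second
        coarse = round(cpu / 5) * 5           # 5% granularity bucket
        buckets[coarse] = buckets.get(coarse, 0) + 1
    return buckets  # e.g. {55: 40, 60: 18, 95: 2}, shipped once per minute
```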

As simple as possible, but not easier

Stacking different requirements on top of each other can lead to a very complex monitoring system. For example, your system may have the following complicating elements:

  • Alerting on different thresholds for request latency, at different percentiles, across all kinds of different metrics.
  • Writing additional code to detect and expose possible causes.
  • Creating associated dashboards for each of these possible causes.

The sources of potential complication never end. Like all software systems, monitoring can become so complex that it becomes fragile, difficult to change and maintain.

Therefore, design your monitoring system to simplify it as much as possible. When choosing what to track, keep the following in mind:

  • The rules that most often catch real incidents should be as simple, predictable, and reliable as possible.
  • Data collection, aggregation, and alerting configuration that is rarely exercised (for example, less than once a quarter for some SRE teams) should be a candidate for removal.
  • Metrics that are collected but not shown on any dashboard and not used by any alert are candidates for removal.

At Google, basic collection and aggregation of metrics, combined with alerting and dashboards, works well as a relatively self-contained system (Google's monitoring system is actually broken into several subsystems, but people are usually aware of all aspects of those subsystems). It can be tempting to combine monitoring with other techniques for inspecting complex systems: detailed system profiling, process debugging, tracking details of exceptions or failures, load testing, log collection and analysis, or traffic inspection. While most of these things share commonalities with basic monitoring, blending them together produces an overly complex and fragile system. As with many other aspects of software development, supporting distinct systems with clear, simple, loosely coupled integration points is the better strategy (for example, using a web API to retrieve summary data in a format that can remain stable over a long period of time).

Linking principles together

The principles discussed in this chapter can be combined into a monitoring and alerting philosophy that is endorsed and followed by Google SRE teams. Adhering to this philosophy fully is aspirational, but it is a good starting point for creating or revising an alerting methodology, and it can help you ask the right questions of your operations team regardless of the size of your organization or the complexity of the service or system.

When creating monitoring and alerting rules, asking the following questions can help you avoid false positives and unnecessary alerts:

  • Does this rule detect an otherwise undetected state of the system that is urgent, actionable, and visibly affecting (or about to affect) users?
  • Can I ignore this warning knowing it's benign? When and why can I ignore this warning and how can I avoid this scenario?
  • Does this alert mean that users are being adversely affected? Are there situations where users are not negatively impacted, for example, due to traffic filtering or when using test systems, alerts on which should be filtered?
  • Can I take action in response to this alert? Are these measures urgent or can they wait until the morning? Is it safe to automate the action? Will this action be a long term solution or a short term workaround?
  • Do other people also get alerts for this issue, making some of the alerts unnecessary? Is it possible to reduce their number?

These questions reflect the fundamental philosophy on alerts and alert systems:

  • Every time an alert comes in, I must be able to react with a sense of urgency. I can only react urgently a few times a day before I become fatigued.
  • Every alert must be actionable.
  • Each response to an alert must require human intervention. If the notification can be processed automatically, it should not come.
  • Alerts should be about a new issue or an event that hasn't happened before.

This approach blurs certain distinctions: if an alert satisfies the previous four conditions, it does not matter whether it was sent by a white-box or a black-box monitoring system. This approach also reinforces other distinctions: it is better to spend much more effort on catching symptoms than on causes; when it comes to causes, you only need to worry about very definite, imminent ones.

Long term monitoring

In modern production environments, monitoring systems track an ever-evolving production system with changing software architecture, load characteristics, and performance targets. An alert that is rare and hard to automate today may become frequent, perhaps even deserving of a hastily written script to resolve it. At that point, someone should find and eliminate the root causes of the problem; if that is not possible, the response to the alert deserves to be fully automated.

It is important that decisions about monitoring are made with long-term goals in mind. Every alert handled today takes a person away from improving the system tomorrow, so it is often worth accepting a short-term drop in the availability or performance of a production system in order to improve its long-term outlook. Let's look at two examples that illustrate this trade-off.

Bigtable SRE: a story of over-alerting

Google's internal infrastructure is typically offered and measured against a service level objective (SLO). Years ago, the Bigtable SLO was based on the average performance of a synthetic transaction simulating a well-behaved client. Because of problems in Bigtable and in lower layers of the storage stack, average performance was driven by a "large" tail: the worst 5% of requests were often significantly slower than the rest.

Email notifications were sent as the SLO threshold was approached, and messenger alerts were sent when the SLO was exceeded. Both types of alerts were sent fairly frequently, consuming unacceptable amounts of engineering time: the team spent a significant amount of time parsing the alerts to find a few that were actually relevant. We often missed an issue that actually affected users because only a few of the alerts were for that specific issue. Many of the alerts were non-urgent due to understandable infrastructure issues and were handled in a standard way, or not handled at all.

To remedy the situation, the team used a three-pronged approach: while making great efforts to improve Bigtable's performance, we also temporarily dialed back our SLO target to the 75th percentile of request latency. We also turned off email alerts, as there were so many of them that spending time diagnosing them was infeasible.

This strategy allowed us a breather to start fixing long-term issues in Bigtable and the lower layers of the storage stack, rather than constantly fixing tactical issues. Engineers could actually get the job done when they weren't bombarded with alerts all the time. Ultimately, the temporary delay in processing alerts allowed us to improve the quality of service.

Gmail: Predictable, Algorithmic Human Responses

At the very beginning, Gmail was built on a modified Workqueue process control system that was created to batch process search index chunks. Workqueue was adapted to long-lived processes and subsequently applied to Gmail, but some bugs in the opaque scheduler code proved very hard to fix.

At the time, Gmail monitoring was structured so that alerts fired when individual tasks were de-scheduled by Workqueue. This setup was less than ideal, because even then Gmail ran many thousands of tasks, each representing only a fraction of a percent of our users. We cared deeply about giving Gmail users a good experience, but handling that many alerts was out of the question.

To solve this problem, Gmail SRE created a tool to help debug the scheduler as best as possible to minimize the impact on users. The team had several discussions about whether to simply automate the entire cycle from finding the problem to fixing it until a long-term solution was found, but some were concerned that such a solution would delay the actual fixing of the problem.

This kind of tension is common within a team and often reflects a lack of trust in the team's self-discipline: while some team members want to buy time for a proper fix, others worry that the proper fix will be forgotten and the temporary workaround will live forever. This concern deserves attention, because it is too easy to keep patching problems temporarily instead of making a permanent fix. Managers and technical leadership play a key role in implementing true long-term fixes by supporting and prioritizing potentially time-consuming long-term fixes even after the initial pain subsides.

Regular recurring alerts and algorithmic reactions should be a red flag. Your team's reluctance to automate these alerts means the team lacks confidence that they can trust the algorithms. This is a serious problem that needs to be addressed.

In the long run

A common theme connects the Bigtable and Gmail examples: the tension between short-term and long-term availability. Often, a heroic effort can bring a fragile system to high availability, but this path is usually short-lived, fraught with burnout and dependence on a small number of heroic team members.

A controlled, short-term decline in availability is often painful but strategically important for the long-term stability of the system. It is important not to consider each alert in isolation but to ask whether the overall level of alerting leads to a healthy, appropriately available system with a viable team and a favorable long-term prognosis. We analyze alert-rate statistics (usually expressed as incidents per shift, where an incident may consist of several related alerts) in quarterly reports to management, giving decision makers an ongoing view of the alerting load and overall team health.

Conclusion

A healthy monitoring and alerting pipeline is simple and straightforward. It focuses on the symptoms of problems when generating alerts, while cause-oriented monitoring serves as an aid to debugging. Monitoring symptoms is easier the higher you are in the stack you control, although monitoring the load and performance of subsystems such as databases often must be done on the subsystem itself. Email notifications are of very limited use and tend to turn into noise easily; instead, you should use a dashboard that tracks all ongoing subcritical problems, the kind of information that typically ends up in email alerts. The dashboard can also be paired with an event log to analyze historical correlations.

In the long term, you need to strike a successful balance: alerting on symptoms and imminent real problems, adapting targets to goals that are actually achievable, and making sure that monitoring supports rapid diagnosis.

Thank you for reading the translation to the end. Subscribe to my Telegram channel about monitoring @monitorim_it and blog on Medium.

Source: habr.com
