Thanos - Scalable Prometheus

This translation of the article was prepared specifically for students of the "DevOps Practices and Tools" course.

Fabian Reinartz is a software engineer, Go fanatic, and problem solver. He is a Prometheus maintainer and co-founder of Kubernetes SIG Instrumentation. He was previously a production engineer at SoundCloud and led the monitoring team at CoreOS; he currently works at Google.

Bartek Plotka is an infrastructure engineer at Improbable. He is passionate about new technologies and distributed-systems problems. He has low-level programming experience from Intel, contributor experience from Mesos, and world-class SRE production experience at Improbable. He is dedicated to improving the world of microservices. His three loves are Golang, open source, and volleyball.

Looking at our flagship product SpatialOS, you can guess that Improbable needs a highly dynamic, global cloud infrastructure with dozens of Kubernetes clusters. We were among the first to adopt the Prometheus monitoring system. Prometheus can track millions of metrics in real time and comes with a powerful query language for extracting the information you need.

Prometheus's simplicity and reliability are among its main advantages. However, past a certain scale we ran into several shortcomings. To solve these problems, we developed Thanos, an open source project created by Improbable to seamlessly transform existing Prometheus clusters into a single monitoring system with unlimited historical data storage. Thanos is available on GitHub.


Our goals with Thanos

At a certain scale, problems arise that go beyond the capabilities of vanilla Prometheus. How do you store petabytes of historical data securely and economically? Can it be done without sacrificing query response time? Is it possible to access all metrics located on different Prometheus servers with a single API request? And is there a way to merge the replicated data collected by Prometheus HA pairs?

We created Thanos to address these issues. The following sections describe our approach and explain the goals we pursued.

Query data from multiple Prometheus instances (global query)

Prometheus offers a functional approach to sharding. Even a single Prometheus server provides enough scalability to free users from the complexities of horizontal sharding in almost all use cases.

While this is a great deployment model, you often need to access data on different Prometheus servers through a single API or UI: the global view. Of course, you can display multiple queries in one Grafana panel, but each query can only target a single Prometheus server. With Thanos, on the other hand, you can query and aggregate data from multiple Prometheus servers, since they are all accessible from a single endpoint.

Previously, to get a global view at Improbable, we organized our Prometheus instances into a multi-level hierarchical federation. This meant setting up one Prometheus meta-server that scrapes a portion of the metrics from each "leaf" server.

[Diagram]

This approach proved problematic. It resulted in more complex configuration, added a potential point of failure, and required complex rules to expose only the right data through the federated endpoint. In addition, this kind of federation does not give you a true global view, since not all data is available from a single API request.

Closely related to this is a single view over data collected on high-availability (HA) Prometheus pairs. The Prometheus HA model independently collects the data twice, which is as simple as it gets. However, a combined, deduplicated view of both streams would be much more convenient.

Of course, there is a need for highly available Prometheus servers. At Improbable, we take every minute of monitoring data seriously, and having one Prometheus instance per cluster is a single point of failure. Any configuration error or hardware failure can potentially lead to the loss of important data. Even a routine deployment can cause small gaps in metrics collection, because a restart can take significantly longer than the scrape interval.

Reliable storage of historical data

Cheap, fast, long-term storage of metrics is our dream (shared by most Prometheus users). At Improbable, we were forced to set the metrics retention period to nine days (on Prometheus 1.8), which puts obvious limits on how far back we can look.

Prometheus 2.0 has improved in this regard, as the number of time series no longer affects the overall performance of the server (see the KubeCon keynote about Prometheus 2). However, Prometheus stores data on the local disk. Although highly efficient data compression can get a lot out of a local SSD, there is ultimately a limit to how much historical data can be stored.

In addition, at Improbable we care about reliability, simplicity, and cost. Large local disks are harder to maintain and back up: they cost more and require extra backup tooling, resulting in unnecessary complexity.

Downsampling

As soon as we started working with historical data, we realized there are fundamental big-O difficulties that make queries slower and slower as we query weeks, months, and years of data.

The standard solution to this problem is downsampling: reducing the sampling frequency of the signal. With downsampling, we can "zoom out" to a larger time range while keeping the same number of samples, which keeps our queries responsive.

Downsampling of old data is an inevitable requirement of any long-term storage solution and goes beyond vanilla Prometheus.

Additional goals

One of the original goals of the Thanos project was seamless integration with any existing Prometheus installation. The second goal was simple operation with a minimal barrier to entry. Any dependencies should be easy to satisfy for both small and large users, which also implies a negligible baseline cost.

Thanos architecture

With the goals from the previous section in mind, let's look at how Thanos solves these problems.

Global view

To get a global view on top of existing Prometheus instances, we need to link a single query entry point to all the servers. This is exactly what the Thanos Sidecar component does. It is deployed next to every Prometheus server and acts as a proxy, serving local Prometheus data through the gRPC Store API, which allows time series data to be selected by label set and time range.

On the other side is a horizontally scalable, stateless Querier component that does little more than answer PromQL queries through the standard Prometheus HTTP API. The Querier, Sidecar, and other Thanos components communicate via a gossip protocol.

[Diagram]

  1. Upon receiving a request, Querier connects to the relevant Store API servers, that is, our Sidecars, and fetches time series data from the corresponding Prometheus servers.
  2. It then merges the responses and evaluates the PromQL query against them. Querier can merge both disjoint data and duplicated data from Prometheus HA servers.
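As an illustration of the deduplication step, here is a minimal Python sketch that merges the same series scraped by two HA replicas, keeping one sample per timestamp. This is a deliberate simplification for illustration only: the real Querier uses a more careful penalty-based algorithm, and the function and data below are hypothetical.

```python
def merge_replica_series(replicas):
    """Merge one series scraped by several HA replicas.

    Keeps a single sample per timestamp, preferring the first replica
    that has it. A simplification of Thanos's deduplication.
    """
    merged = {}
    for series in replicas:
        for ts, value in series:  # samples are (timestamp_ms, value)
            merged.setdefault(ts, value)
    return sorted(merged.items())

# Two HA replicas scraped the same target; replica B missed one scrape.
replica_a = [(1000, 1.0), (2000, 2.0), (3000, 3.0)]
replica_b = [(1000, 1.0), (3000, 3.1)]

print(merge_replica_series([replica_a, replica_b]))
# -> [(1000, 1.0), (2000, 2.0), (3000, 3.0)]
```

The gap at timestamp 2000 in replica B is filled from replica A, which is exactly the benefit of querying both HA replicas through one endpoint.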

This solves the main part of our puzzle: combining data from isolated Prometheus servers into a single view. In fact, Thanos can be used for this feature alone. No changes to existing Prometheus servers are required!

Unlimited storage time!

Sooner or later, however, we will want to keep data beyond Prometheus's normal retention period. We chose object storage for historical data. It is widely available in every cloud as well as in on-premises data centers, and it is very cost-effective. Moreover, almost any object storage is accessible through the well-known S3 API.

Prometheus writes data from RAM to disk roughly every two hours. Each stored block contains all the data for a fixed time range and is immutable. This is very convenient: Thanos Sidecar can simply watch the Prometheus data directory and upload new blocks to an object storage bucket as they appear.
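The Sidecar's upload loop can be imagined as a simple directory scan. The sketch below is a toy Python model, not the actual Thanos code; the block name and directory layout only approximate Prometheus's TSDB layout (ULID-named block directories next to a `wal` directory).

```python
import os
import tempfile

def find_new_blocks(data_dir, uploaded):
    """Return block directories that have not been uploaded yet.

    Each completed, immutable Prometheus block is a subdirectory of
    the TSDB data dir, so a periodic scan is enough to spot new
    blocks to ship to object storage.
    """
    new = []
    for entry in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, entry)
        # Skip the in-progress WAL and anything already shipped.
        if entry == "wal" or not os.path.isdir(path) or entry in uploaded:
            continue
        new.append(entry)
    return new

# Simulate a TSDB directory with a WAL and one finished block.
data_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(data_dir, "wal"))
os.makedirs(os.path.join(data_dir, "01BKGV7JBM69T2G1BGBGM6KB12"))

print(find_new_blocks(data_dir, uploaded=set()))
# -> ['01BKGV7JBM69T2G1BGBGM6KB12']
```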

[Diagram]

Uploading to object storage immediately after a block is written to disk also keeps the scraper (Prometheus plus Thanos Sidecar) simple, which simplifies the maintenance, cost, and design of the whole system.

As you can see, backing up data is very easy to implement. But what about querying data in object storage?

The Thanos Store component acts as a proxy for retrieving data from object storage. Like Thanos Sidecar, it participates in the gossip cluster and implements the Store API, so existing Queriers can treat it just like a Sidecar: as another source of time series data. No special configuration is required.

[Diagram]

Time series data blocks consist of several large files. Loading them on demand would be rather inefficient, and caching them locally would require a huge amount of memory and disk space.

Instead, the Store Gateway knows how to handle the Prometheus storage format. Thanks to a smart query planner, and by caching only the necessary index portions of blocks, it can reduce a complex query to a minimal number of HTTP requests against object storage files. This cuts the number of requests by four to six orders of magnitude and achieves response times that are generally hard to distinguish from queries against data on a local SSD.
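One way to picture that reduction in request count is range coalescing: instead of one small GET per index entry, nearby byte ranges are merged into a few bulk reads. The Python sketch below is a simplified stand-in for the Store Gateway's actual partitioning logic; all the numbers are made up.

```python
def coalesce_ranges(ranges, max_gap):
    """Merge nearby (start, end) byte ranges into fewer bulk reads.

    Fetching a few larger ranges from object storage is far cheaper
    than issuing one small HTTP request per index entry, even if a
    small gap of unused bytes is fetched along the way.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]

# Six index lookups collapse into two GETs with Range headers.
reads = [(0, 100), (120, 200), (210, 300),
         (4990, 5000), (5000, 5100), (5100, 5200)]
print(coalesce_ranges(reads, max_gap=64))
# -> [(0, 300), (4990, 5200)]
```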

[Diagram]

As shown in the diagram above, Thanos Querier significantly reduces the cost per query against object storage data by exploiting the Prometheus storage format, which places related data side by side. Using this approach, many individual requests can be combined into a minimal number of bulk operations.

Compaction and downsampling

Once a new block of time series data has been successfully uploaded to object storage, we treat it as "historical" data, immediately available through the Store Gateway.

After a while, however, blocks from a single source (a Prometheus with its Sidecar) accumulate and no longer exploit the full potential of the index. To solve this, we introduced another component called Compactor. It simply applies Prometheus's local compaction mechanism to the historical data in object storage and can run as a simple periodic batch job.

[Diagram]

Thanks to efficient compression, querying storage over a long period is not a problem in terms of data size. However, the potential cost of unpacking billions of values and running them through a query engine inevitably causes query execution time to balloon. On the other hand, with hundreds of data points per on-screen pixel, it becomes impossible to even render the data at full resolution. This makes downsampling not only possible but also free of any noticeable loss of accuracy.
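The back-of-the-envelope arithmetic behind that claim, for a single series (the scrape interval and graph width here are assumed values):

```python
# One series scraped every 15 seconds for a year:
samples_per_year = 365 * 24 * 3600 // 15
print(samples_per_year)  # -> 2102400

# Rendered on a graph 1000 pixels wide, that is over two thousand
# raw samples competing for every single pixel:
print(samples_per_year / 1000)  # -> 2102.4
```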

[Diagram]

To downsample data, Compactor continuously aggregates it at five-minute and one-hour resolutions. For each raw chunk, encoded with TSDB XOR compression, it stores several types of aggregates per block, such as min, max, and sum. This allows Querier to automatically select the aggregate appropriate for a given PromQL query.
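The idea of keeping several aggregates per window can be sketched as follows. This is illustrative Python only, not the TSDB on-disk format; a count is included so that averages can also be derived.

```python
def downsample(samples, window_ms):
    """Aggregate raw (timestamp_ms, value) samples into fixed windows,
    keeping min/max/sum/count per window: the kind of aggregates that
    let functions like min_over_time, max_over_time, sum, and avg be
    answered without touching the raw samples."""
    buckets = {}
    for ts, v in samples:
        key = ts - ts % window_ms  # align to the window start
        b = buckets.setdefault(key, {"min": v, "max": v, "sum": 0.0, "count": 0})
        b["min"] = min(b["min"], v)
        b["max"] = max(b["max"], v)
        b["sum"] += v
        b["count"] += 1
    return buckets

FIVE_MIN = 5 * 60 * 1000
raw = [(0, 1.0), (15_000, 3.0), (30_000, 2.0), (300_000, 10.0)]
print(downsample(raw, FIVE_MIN))
# -> {0: {'min': 1.0, 'max': 3.0, 'sum': 6.0, 'count': 3},
#     300000: {'min': 10.0, 'max': 10.0, 'sum': 10.0, 'count': 1}}
```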

No special configuration is required to use downsampled data. Querier automatically switches between the different resolutions and the raw data as the user zooms in and out. If desired, the user can control this directly through the "step" parameter of the request.
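The automatic switching can be imagined as picking the coarsest resolution that still yields several samples per query step. The sketch below is a hypothetical heuristic, not the exact rule Thanos applies:

```python
RESOLUTIONS_MS = [0, 5 * 60 * 1000, 60 * 60 * 1000]  # raw, 5m, 1h

def pick_resolution(step_ms):
    """Pick the coarsest resolution that still leaves at least a few
    samples per query step (an assumed factor of 5 here)."""
    best = 0
    for res in RESOLUTIONS_MS:
        if res <= step_ms / 5:
            best = res
    return best

print(pick_resolution(60 * 1000))         # 1m step  -> 0 (raw data)
print(pick_resolution(30 * 60 * 1000))    # 30m step -> 300000 (5m)
print(pick_resolution(24 * 3600 * 1000))  # 1d step  -> 3600000 (1h)
```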

Since storage cost per gigabyte is low, by default Thanos keeps the raw data alongside the five-minute and one-hour resolutions; there is no need to delete the raw data.

Recording rules

Even with Thanos, recording rules remain an essential part of the monitoring stack. They reduce query complexity, latency, and cost, and give users a convenient way to retrieve pre-aggregated metric data. Thanos builds on vanilla Prometheus instances, so it is perfectly acceptable to keep recording rules and alerting rules on an existing Prometheus server. However, in some cases this is not enough:

  • A global alert or rule (for example, an alert when a service is down in more than two of three clusters).
  • A rule over data that lies outside a single server's local storage.
  • The desire to store all rules and alerts in one place.

[Diagram]

For all these cases, Thanos includes a separate component called Ruler, which evaluates rules and alerts via Thanos Queriers. By exposing the well-known Store API, the Query node can access the freshly evaluated metrics. Later, they are also stored in object storage and become available through the Store Gateway.
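As a sketch of the first case above, a global rule evaluated by Ruler against a Querier can look like a regular Prometheus rule file; the job name, the `cluster` label, and the thresholds here are hypothetical:

```yaml
groups:
  - name: global-alerts
    rules:
      - alert: ServiceDownInMostClusters
        # Evaluated against Querier, so `up` covers every cluster at once.
        expr: count(count by (cluster) (up{job="my-service"} == 0)) > 2
        for: 5m
```

No single leaf Prometheus could evaluate this expression, because no single server sees the `up` series from all clusters.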

Might of Thanos

Thanos is flexible enough to be adapted to your needs. This is especially useful when migrating from plain Prometheus. Let's quickly recap what we've learned about the Thanos components with a small example. Here's how to bring your vanilla Prometheus into the world of "unlimited metric storage":

[Diagram]

  1. Add a Thanos Sidecar to each of your Prometheus servers, for example as a neighboring container in the same Kubernetes pod.
  2. Deploy several Thanos Querier replicas to view the data. At this point it is easy to set up gossip between the scrapers and the Queriers. Use the 'thanos_cluster_members' metric to verify that the components can see each other.

These two steps alone are enough to provide a global view and seamless deduplication of data from potential Prometheus HA replicas! Simply point your dashboards at the Querier HTTP endpoint, or use the Thanos UI directly.

However, if you need metrics backup and long-term retention, three more steps are required:

  1. Create an AWS S3 or GCS bucket and configure Sidecar to upload data to it. You can now minimize local storage.
  2. Deploy a Store Gateway and connect it to the existing gossip cluster. Your queries can now reach the backed-up data!
  3. Deploy Compactor to improve the performance of queries over long time ranges through compaction and downsampling.

If you want to know more, feel free to take a look at our Kubernetes manifest examples and documentation!

In just five steps, we turned Prometheus into a robust monitoring system with a global view, unlimited storage time, and potential high availability of metrics.

Pull request: we need you!

Thanos has been an open source project from the very beginning. Its seamless integration with Prometheus and the ability to use just a part of Thanos make it a great choice for scaling your monitoring system effortlessly.

We always welcome GitHub pull requests and issues. Feel free to contact us via GitHub Issues or the Improbable-eng Slack channel #thanos if you have questions or feedback, or want to share your experience! And if you like what we do at Improbable, don't hesitate to reach out: we are always hiring!


Source: habr.com
