Collecting logs with Loki


We at Badoo are constantly monitoring new technologies and evaluating whether or not to use them in our system. We want to share one of these studies with the community. It is dedicated to Loki, a log aggregation system.

Loki is a solution for storing and viewing logs, and the stack around it also provides a flexible system for analyzing logs and sending data to Prometheus. Another update came out in May, and the creators are actively promoting it. We wanted to see what Loki can do, what opportunities it offers, and to what extent it can serve as an alternative to ELK, the stack we currently use.

What is Loki

Grafana Loki is a set of components that form a complete logging system. Unlike other similar systems, Loki is based on the idea of indexing only log metadata, the labels (just as in Prometheus), and compressing the logs themselves into separate chunks stored next to the index.

Home page, GitHub

Before I get into what you can do with Loki, I want to clarify what "indexing only metadata" means. Let's compare the Loki approach with the indexing approach of traditional solutions, such as Elasticsearch, using a line from an nginx log as an example:

172.19.0.4 - - [01/Jun/2020:12:05:03 +0000] "GET /purchase?user_id=75146478&item_id=34234 HTTP/1.1" 500 8102 "-" "Stub_Bot/3.0" "0.001"

Traditional systems parse the entire line, including fields with many unique values such as user_id and item_id, and store everything in large indexes. The advantage of this approach is that complex queries run quickly, since almost all of the data lives in the index. But you pay for this with index size, which translates into memory requirements. As a result, a full-text index of logs is comparable in size to the logs themselves. To search it quickly, the index must be loaded into memory, and the more logs there are, the faster the index grows and the more memory it consumes.

The Loki approach extracts from the line only the fields with a small number of possible values. This way we get a small index and can search the data by filtering it by time and by the indexed fields, and then scanning the rest with regular expressions or substring matching. This does not sound like the fastest process, but Loki splits the request into several parts and executes them in parallel, processing a large amount of data in a short time. The number of shards and parallel requests within them is configurable, so the amount of data that can be processed per unit of time scales linearly with the amount of resources provided.

This trade-off between a large, fast index and a small index with parallel brute-force scanning allows Loki to keep the cost of the system under control. It can be flexibly configured and expanded according to your needs.

The Loki stack consists of three components: Promtail, Loki and Grafana. Promtail collects logs, processes them and sends them to Loki. Loki stores them. And Grafana can query data from Loki and display it. In general, Loki can be used for more than storing and searching logs: the entire stack provides great opportunities for processing and analyzing incoming data the Prometheus way.
A description of the installation process can be found here.

Log Search

You can search the logs in Grafana's dedicated Explore interface. The queries use the LogQL language, which is very similar to the PromQL used by Prometheus. In principle, it can be thought of as a distributed grep.

The search interface looks like this:

[Screenshot: the search interface in Grafana Explore]

The query consists of two parts: a selector and a filter. The selector is a search over the indexed metadata (labels) assigned to the logs, and the filter is a search string or regexp that filters the records selected by the selector. In the example below, the part in curly braces is the selector and everything after it is the filter.

{image_name="nginx.promtail.test"} |= "index"

Due to the way Loki works, you can't make requests without a selector, but labels can be made arbitrarily generic.

The selector is a set of key-value pairs in curly braces. You can combine selectors and specify different matching conditions using the = and != operators or regular expressions:

{instance=~"kafka-[23]",name!="kafka-dev"} 
// Найдёт Π»ΠΎΠ³ΠΈ с Π»Π΅ΠΉΠ±Π»ΠΎΠΌ instance, ΠΈΠΌΠ΅ΡŽΡ‰ΠΈΠ΅ Π·Π½Π°Ρ‡Π΅Π½ΠΈΠ΅ kafka-2, kafka-3, ΠΈ ΠΈΡΠΊΠ»ΡŽΡ‡ΠΈΡ‚ dev 

A filter is a text string or regexp that further narrows down the data matched by the selector.
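Filters can be chained. Besides |= (line contains), LogQL also has != (line does not contain), |~ (line matches a regexp) and !~ (line does not match). A quick sketch reusing the hypothetical kafka selector from above:

{instance=~"kafka-[23]"} |= "error" != "timeout" |~ "broker [0-9]+"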

In metrics mode, you can build ad-hoc graphs on top of the retrieved data. For example, you can find out how often entries containing the string "index" appear in the nginx logs:
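Such a graph can be obtained with a query along these lines (the label value here comes from the test setup described later in the article):

rate({image_name="nginx.promtail.test"} |= "index" [5m])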

[Screenshot: an ad-hoc rate graph in Grafana Explore]

A full description of the features can be found in the LogQL documentation.

Log parsing

There are several ways to collect logs:

  • With the help of Promtail, a standard component of the stack for collecting logs.
  • Directly from the docker container using Loki Docker Logging Driver.
  • Using Fluentd or Fluent Bit, which can send data to Loki. Unlike Promtail, they have ready-made parsers for almost any type of log and can also handle multiline logs.

Usually Promtail is used for parsing. It does three things:

  • Finds data sources.
  • Attaches labels to them.
  • Sends data to Loki.

Currently Promtail can read logs from local files and from systemd journal. It must be installed on every machine from which logs are collected.

There is integration with Kubernetes: Promtail automatically discovers the state of the cluster through the Kubernetes REST API and collects logs from a node, service or pod, immediately attaching labels based on metadata from Kubernetes (pod name, file name, etc.).
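As a rough sketch, pod discovery and label attachment in Promtail look something like this (adapted from the stock Promtail Kubernetes scrape config; real deployments carry many more relabel rules):

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                # discover pods via the Kubernetes API
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      # point Promtail at the pod's log files on the node
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log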

You can also attach labels based on data from the log itself using a pipeline. A Promtail pipeline can consist of four types of stages. More details are in the official documentation; here I will just note some of the nuances.

  1. Parsing stages. These are the regex and JSON stages. At this stage, we extract data from the logs into the so-called extracted map. From JSON we extract by simply copying the fields we need into the extracted map; with regular expressions, named capture groups are "mapped" into the extracted map. The extracted map is a key-value store, where the key is the field name and the value is its value from the logs.
  2. Transform stages. This stage has two options: template, where we set the transformation rules, and source, the field in the extracted map that provides the data for the transformation. If there is no such field in the extracted map, it will be created, so it is possible to create labels that are not based on data already in the extracted map. At this stage, we can manipulate the data in the extracted map using a fairly powerful Go template. In addition, remember that the whole extracted map is available during templating, which makes it possible, for example, to check whether a value exists in it: "{{if .tag}}tag value exists{{end}}". Templates support conditions, loops, and some string functions such as Replace and Trim.
  3. Action stages. At this stage, you can do something with the extracted data:
    • Create a label from the extracted data, which will be indexed by Loki.
    • Change or set the event time from the log.
    • Change the data (log text) that will go to Loki.
    • Create metrics.
  4. Filtering stages. The match stage, where we can either drop records that we don't need (send them to /dev/null) or route them for further processing.
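Putting the four stage types together, a minimal pipeline sketch might look like this (the level and message fields are purely illustrative and have nothing to do with the nginx example below):

pipeline_stages:
  - json:            # parsing: pull fields from the JSON log into the extracted map
      expressions:
        level: level
        message: message
  - template:        # transform: normalize a value in the extracted map
      source: level
      template: '{{ ToLower .Value }}'
  - labels:          # action: promote an extracted field to an indexed label
      level:
  - match:           # filtering: drop records we are not interested in
      selector: '{level="debug"}'
      action: drop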

Using the example of processing ordinary nginx logs, I will show how you can parse logs using Promtail.

For the test, let's take a modified nginx image jwilder/nginx-proxy:alpine as nginx-proxy and a small daemon that can query itself via HTTP. The daemon has several endpoints, to which it can give responses of different sizes, with different HTTP statuses and with different delays.

We will collect logs from Docker containers, which can be found at /var/lib/docker/containers/<container_id>/<container_id>-json.log

In docker-compose.yml we set up Promtail and specify the path to the config:

promtail:
  image: grafana/promtail:1.4.1
  # ...
  volumes:
    - /var/lib/docker/containers:/var/lib/docker/containers:ro
    - promtail-data:/var/lib/promtail/positions
    - ${PWD}/promtail/docker.yml:/etc/promtail/promtail.yml
  command:
    - '-config.file=/etc/promtail/promtail.yml'
  # ...

Add the path to the logs to promtail.yml (there is a "docker" option in the config that does the same thing in one line, but it would not be as illustrative):

scrape_configs:
  - job_name: containers
    static_configs:
      - targets:
          - localhost
        labels:
          job: containerlogs
          __path__: /var/lib/docker/containers/*/*log  # for linux only

When this configuration is enabled, Loki will receive logs from all containers. To avoid that, we change the settings of the test nginx in docker-compose.yml and add a logging section with a tag option:

proxy:
  image: nginx.test.v3
  # ...
  logging:
    driver: "json-file"
    options:
      tag: "{{.ImageName}}|{{.Name}}"

Edit promtail.yml and set up the pipeline. The logs look like this:

{"log":"u001b[0;33;1mnginx.1    | u001b[0mnginx.test 172.28.0.3 - - [13/Jun/2020:23:25:50 +0000] "GET /api/index HTTP/1.1" 200 0 "-" "Stub_Bot/0.1" "0.096"n","stream":"stdout","attrs":{"tag":"nginx.promtail.test|proxy.prober"},"time":"2020-06-13T23:25:50.66740443Z"}
{"log":"u001b[0;33;1mnginx.1    | u001b[0mnginx.test 172.28.0.3 - - [13/Jun/2020:23:25:50 +0000] "GET /200 HTTP/1.1" 200 0 "-" "Stub_Bot/0.1" "0.000"n","stream":"stdout","attrs":{"tag":"nginx.promtail.test|proxy.prober"},"time":"2020-06-13T23:25:50.702925272Z"}

pipeline_stages:

 - json:
     expressions:
       stream: stream
       attrs: attrs
       tag: attrs.tag

We extract the stream, attrs, attrs.tag fields (if any) from the incoming JSON and put them into the extracted map.

 - regex:
     expression: ^(?P<image_name>([^|]+))\|(?P<container_name>([^|]+))$
     source: "tag"

If the tag field made it into the extracted map, we use a regexp to extract the image name and container name from it.

 - labels:
     image_name:
     container_name:

We assign labels. If the keys image_name and container_name are present in the extracted map, their values will be assigned to the corresponding labels.

 - match:
     selector: '{job="containerlogs",container_name="",image_name=""}'
     action: drop

We discard all logs that do not have labels image_name and container_name set.

  - match:
     selector: '{image_name="nginx.promtail.test"}'
     stages:
       - json:
           expressions:
             row: log

For all logs whose image_name is equal to nginx.promtail.test, we extract the log field from the source log and put it in the extracted map with the row key.

        - regex:
            # suppress forego colors
            expression: .+nginx.+\|.+\[0m(?P<virtual_host>[a-z_\.-]+) +(?P<nginxlog>.+)
            source: row

We clean up the input string with a regular expression and pull out the nginx virtual host and the nginx log line.

        - regex:
            source: nginxlog
            expression: ^(?P<ip>[\w\.]+) - (?P<user>[^ ]*) \[(?P<timestamp>[^ ]+).*\] "(?P<method>[^ ]*) (?P<request_url>[^ ]*) (?P<request_http_protocol>[^ ]*)" (?P<status>[\d]+) (?P<bytes_out>[\d]+) "(?P<http_referer>[^"]*)" "(?P<user_agent>[^"]*)"( "(?P<response_time>[\d\.]+)")?

Parse the nginx log line with a regular expression.

        - regex:
            source: request_url
            expression: ^.+\.(?P<static_type>jpg|jpeg|gif|png|ico|css|zip|tgz|gz|rar|bz2|pdf|txt|tar|wav|bmp|rtf|js|flv|swf|html|htm)$
        - regex:
            source: request_url
            expression: ^/photo/(?P<photo>[^/?.]+).*$
        - regex:
            source: request_url
            expression: ^/api/(?P<api_request>[^/?.]+).*$

Parse request_url. Using regexps, we determine the purpose of the request (static content, a photo, the API) and set the corresponding key in the extracted map.

       - template:
           source: request_type
           template: "{{if .photo}}photo{{else if .static_type}}static{{else if .api_request}}api{{else}}other{{end}}"

Using conditional operators in the template, we check which fields are set in the extracted map and set the corresponding value for the request_type field: photo, static or api. If none match, we assign other. Now request_type contains the request type.

       - labels:
           api_request:
           virtual_host:
           request_type:
           status:

We set the labels api_request, virtual_host, request_type and status (HTTP status) based on what we managed to put in the extracted map.

        - output:
            source: nginxlog

Change output. Now the cleaned nginx log from the extracted map goes to Loki.

[Screenshot: log entries in Grafana, each labeled with data extracted from the log]

After running the above config, you can see that each entry is labeled based on data from the log.

Keep in mind that extracting labels with a large number of values (high cardinality) can significantly slow down Loki. In other words, you should not index, for example, user_id. Read more about this in the article "How labels in Loki can make log queries faster and easier". But this does not mean that you cannot search by user_id without indexes; you just have to use filters when searching (essentially grep over the data), and the index here acts as a stream identifier.
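For example, a search for a specific user relies on label filtering plus a substring scan rather than on an indexed field (the user_id value is taken from the example nginx line above):

{image_name="nginx.promtail.test"} |= "user_id=75146478"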

Log visualization


Loki can act as a data source for Grafana charts using LogQL. The following functions are supported:

  • rate - number of records per second;
  • count over time - the number of records in the given range.

There are also aggregation functions such as sum, avg and others. You can build quite complex graphs, for example a graph of the number of HTTP errors:
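A sketch of such a query, assuming the status label created in the pipeline above:

sum(count_over_time({image_name="nginx.promtail.test", status=~"5.."}[5m])) by (status)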

[Screenshot: graph of the number of HTTP errors]

Loki's native data source is a bit less functional than the Prometheus data source (for example, you can't change the legend), but Loki can also be connected as a data source of the Prometheus type. I'm not sure whether this is documented behavior, but judging by the developers' response in "How to configure Loki as Prometheus datasource? · Issue #1222 · grafana/loki", it is perfectly legal, and Loki is fully compatible with PromQL.

Add Loki as a data source of type Prometheus and append /loki to the URL:

[Screenshot: data source settings in Grafana]

And you can make graphs, as if we were working with metrics from Prometheus:

[Screenshot: graphs built as if from Prometheus metrics]

I think that the discrepancy in functionality is temporary and the developers will fix it in the future.


Metrics

Loki provides the ability to extract numerical metrics from logs and send them to Prometheus. For example, the nginx log contains the number of bytes per response, and also, with a certain modification of the standard log format, the time in seconds that it took to respond. This data can be extracted and sent to Prometheus.

Add another section to promtail.yml:

- match:
   selector: '{request_type="api"}'
   stages:
     - metrics:
         http_nginx_response_time:
           type: Histogram
           description: "response time ms"
           source: response_time
           config:
             buckets: [0.010,0.050,0.100,0.200,0.500,1.0]
- match:
   selector: '{request_type=~"static|photo"}'
   stages:
     - metrics:
         http_nginx_response_bytes_sum:
           type: Counter
           description: "response bytes sum"
           source: bytes_out
           config:
             action: add
         http_nginx_response_bytes_count:
           type: Counter
           description: "response bytes count"
           source: bytes_out
           config:
             action: inc

The metrics stage lets you define and update metrics based on data from the extracted map. These metrics are not sent to Loki; they appear on Promtail's /metrics endpoint, and Prometheus must be configured to scrape them. In the example above, for request_type="api" we collect a histogram metric; with this type of metric it is convenient to get percentiles. For statics and photos, we collect the sum of bytes and the number of lines in which we received bytes, in order to calculate the average.
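On the Prometheus side, the resulting series can then be queried roughly like this (the metric names assume Promtail's default promtail_custom prefix and the standard histogram suffixes; adjust them to whatever actually shows up on /metrics):

histogram_quantile(0.95, sum(rate(promtail_custom_http_nginx_response_time_bucket[5m])) by (le))

The first expression gives the 95th percentile of the API response time; the average response size for statics and photos is the ratio of the two counters:

sum(rate(promtail_custom_http_nginx_response_bytes_sum[5m])) / sum(rate(promtail_custom_http_nginx_response_bytes_count[5m]))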

Read more about metrics here.

Open a port on Promtail:

promtail:
     image: grafana/promtail:1.4.1
     container_name: monitoring.promtail
     expose:
       - 9080
     ports:
       - "9080:9080"

We make sure that the metrics with the promtail_custom prefix have appeared:

[Screenshot: metrics with the promtail_custom prefix]

Set up Prometheus by adding a promtail job:

- job_name: 'promtail'
 scrape_interval: 10s
 static_configs:
   - targets: ['promtail:9080']

And draw a graph:

[Screenshot: graph built from these metrics]

This way you can find out, for example, the four slowest queries. You can also configure monitoring for these metrics.

Scaling

Loki can run both in single binary mode and in sharded (horizontally scalable) mode. In the second case, it can store data in the cloud, with the chunks and the index stored separately. Version 1.5 adds the ability to store both in one place, but it is not yet recommended for production use.


Chunks can be kept in S3-compatible storage, and horizontally scalable databases can be used for the index: Cassandra, BigTable or DynamoDB. Other parts of Loki, the Distributor (for writes) and the Querier (for queries), are stateless and also scale horizontally.
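A rough sketch of what the storage section of a Loki config for such a setup might look like (the bucket, keyspace and index prefix are made up; see the Loki storage documentation for the exact schema):

schema_config:
  configs:
    - from: 2020-06-01
      store: cassandra          # where the index goes
      object_store: aws         # where the chunks go (S3-compatible)
      schema: v11
      index:
        prefix: loki_index_
        period: 168h
storage_config:
  cassandra:
    addresses: cassandra.local
    keyspace: loki
  aws:
    s3: s3://access_key:secret_key@eu-west-1/loki-chunks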

At the DevOpsDays Vancouver 2019 conference, one of the speakers, Callum Styan, said that with Loki his project stores petabytes of logs with an index of less than 1% of the total size: "How Loki Correlates Metrics and Logs - And Saves You Money".

Comparison of Loki and ELK

Index size

To test the resulting index size, I took logs from the nginx container for which the pipeline above was configured. The log file contained 406 thousand lines with a total volume of 624 MB. Logs were generated over one hour, approximately 109 records per second.

An example of two lines from the log:

[Screenshot: two example lines from the log]

When indexed by ELK, this gave an index size of 30.3 MB:

[Screenshot: ELK index size]

In the case of Loki, this gave about 128 KB of index and about 3.8 MB of data in chunks. It is worth noting that the log was artificially generated and did not contain a wide variety of data. A simple gzip of the original Docker JSON log gave 95.4% compression, and given that only the cleaned nginx log was sent to Loki itself, the compression down to 4 MB is understandable. The total number of unique values for Loki labels was 35, which explains the small size of the index. For ELK, the log was also cleaned. Thus, Loki compressed the original data by 96%, and ELK by 70%.

Memory consumption

[Screenshot: memory consumption comparison]

If we compare the two stacks as a whole, Loki "eats" several times less memory. A Go service obviously consumes less than a Java one, and comparing the Elasticsearch JVM heap size with the memory allocated to Loki is not entirely fair, but it is still worth noting that Loki uses much less memory. Its CPU advantage is not as obvious, but it is there too.

Speed

Loki "devours" logs faster. The speed depends on many factors - what kind of logs, how sophisticated we parse them, network, disk, etc. - but it is definitely higher than that of ELK (in my test - about two times). This is explained by the fact that Loki puts much less data into the index and, accordingly, spends less time on indexing. In this case, the situation is reversed with the search speed: Loki noticeably slows down on data larger than a few gigabytes, while for ELK, the search speed does not depend on the data size.

Log Search

Loki is significantly inferior to ELK in terms of log search capabilities. Grep with regular expressions is powerful, but it is no match for a mature database. The lack of range queries, aggregation only by labels, the inability to search without labels - all of this limits the search for information of interest in Loki. This does not mean that nothing can be found with Loki, but it defines the workflow: you first find a problem on the Prometheus charts, and then look at what happened in the logs using those labels.

Interface

First off, it's beautiful (sorry, couldn't resist). Grafana has a nice looking interface, but Kibana is much more functional.

Loki pros and cons

On the plus side, Loki integrates with Prometheus, so we get metrics and alerting out of the box. It is convenient for collecting and storing logs from Kubernetes pods, since it inherits service discovery from Prometheus and automatically attaches labels.

On the minus side: the documentation is poor. Some things, such as Promtail's features and capabilities, I discovered only by studying the code (the upside of open source). Another disadvantage is the weak parsing capabilities: for example, multiline logs cannot be handled. Finally, Loki is a relatively young technology (release 1.0 came out in November 2019).

Conclusion

Loki is definitely an interesting technology, suitable for small and medium projects, that lets you solve many problems of log aggregation, log search, monitoring and log analysis.

We don't use Loki at Badoo, because we have an ELK stack that suits us and that has been overgrown with various custom solutions over the years. For us, the stumbling block is the search in the logs. With almost 100 GB of logs per day, it is important for us to be able to find everything and a little more and do it quickly. For charting and monitoring, we use other solutions that are tailored to our needs and integrated with each other. The Loki stack has tangible benefits, but it won't give us more than what we have, and its benefits won't exactly outweigh the cost of migrating.

And although after research it became clear that we cannot use Loki, we hope that this post will help you in choosing.

The repository with the code used in the article is located here.

Source: habr.com
