Nginx Log Analytics with Amazon Athena and Cube.js

Usually, commercial products or ready-made open-source alternatives, such as Prometheus + Grafana, are used to monitor and analyze the operation of Nginx. This is a good option for monitoring or real-time analytics, but not very convenient for historical analysis. On any popular resource, the volume of data from the nginx logs is growing rapidly, and it is logical to use something more specialized to analyze a large amount of data.

In this article I will show you how to use Athena to analyze logs, taking Nginx as an example, and how to assemble an analytical dashboard from this data using the open-source Cube.js framework. Here is the complete solution architecture:

[Diagram: solution architecture]

TL;DR:
Link to finished dashboard.

To collect information we use fluentd; for processing, AWS Kinesis Data Firehose and AWS Glue; for storage, AWS S3. Using this bundle, you can store not only Nginx logs, but also other events, as well as logs of other services. You can replace some parts with equivalents from your stack: for example, you can write logs to Kinesis directly from Nginx, bypassing fluentd, or use Logstash for this.

Collecting Nginx logs

By default, Nginx logs look something like this:

1.1.1.1 - - [09/Apr/2019:09:58:17 +0000] "GET /sign-up HTTP/2.0" 200 9168 "https://example.com/sign-in" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" "-"
1.1.1.1 - - [09/Apr/2019:09:58:17 +0000] "GET /sign-in HTTP/2.0" 200 9168 "https://example.com/sign-up" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" "-"

They can be parsed, but it is much easier to fix the Nginx configuration so that it outputs logs in JSON:

log_format json_combined escape=json '{ "created_at": "$msec", '
            '"remote_addr": "$remote_addr", '
            '"remote_user": "$remote_user", '
            '"request": "$request", '
            '"status": $status, '
            '"bytes_sent": $bytes_sent, '
            '"request_length": $request_length, '
            '"request_time": $request_time, '
            '"http_referrer": "$http_referer", '
            '"http_x_forwarded_for": "$http_x_forwarded_for", '
            '"http_user_agent": "$http_user_agent" }';

access_log  /var/log/nginx/access.log  json_combined;
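With this format, each access-log entry becomes a single JSON object that can be consumed directly by downstream tools. As a quick sanity check, here is a minimal Python sketch that parses a hypothetical log line in this format (the sample values are made up for illustration):

```python
import json

# A hypothetical line as Nginx would emit it with the json_combined format above;
# numeric fields ($status, $bytes_sent, ...) are unquoted, string fields are quoted.
line = '{ "created_at": "1554803897.000", "remote_addr": "1.1.1.1", "remote_user": "", "request": "GET /sign-up HTTP/2.0", "status": 200, "bytes_sent": 9168, "request_length": 123, "request_time": 0.010, "http_referrer": "https://example.com/sign-in", "http_x_forwarded_for": "", "http_user_agent": "Mozilla/5.0" }'

entry = json.loads(line)
print(entry["status"], entry["request"])
```

The `escape=json` directive in the `log_format` above makes sure quotes and control characters inside values (for example, in the user agent) are escaped, so every line stays valid JSON.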

S3 for storage

To store logs, we will use S3. This allows us to store and analyze logs in one place, since Athena can work with data in S3 directly. Later in the article I will explain how to properly store and process the logs, but first we need a clean bucket in S3, in which nothing else will be stored. It is worth deciding in advance which region you will create the bucket in, because Athena is not available in all regions.

Creating a schema in the Athena console

Let's create a table in Athena for the logs. If you plan to use Kinesis Firehose, the table is needed for both writing and reading: Firehose takes the schema from it when converting records, and Athena reads the data through it. Open the Athena console and create a table:

SQL create table

CREATE EXTERNAL TABLE `kinesis_logs_nginx`(
  `created_at` double, 
  `remote_addr` string, 
  `remote_user` string, 
  `request` string, 
  `status` int, 
  `bytes_sent` int, 
  `request_length` int, 
  `request_time` double, 
  `http_referrer` string, 
  `http_x_forwarded_for` string, 
  `http_user_agent` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  's3://<YOUR-S3-BUCKET>'
TBLPROPERTIES ('has_encrypted_data'='false');

Creating a Kinesis Firehose Stream

Kinesis Firehose will write the data received from Nginx to S3 in the selected format, divided into directories in the YYYY/MM/DD/HH format. This will come in handy when reading the data. You can, of course, write directly to S3 from fluentd, but then you would have to store JSON, which is inefficient because of the large file size. Besides, when using PrestoDB or Athena, JSON is the slowest data format. So open the Kinesis Firehose console, click "Create delivery stream", and select "Direct PUT" as the source:

[Screenshot: creating the delivery stream]

In the next tab, set "Record format conversion" to "Enabled" and select "Apache ORC" as the record format. According to benchmarks by Owen O'Malley, this is the optimal format for PrestoDB and Athena. As the schema, we specify the table that we created above. Note that you can specify any S3 location in Kinesis; only the schema is taken from the table. But if you specify a different S3 location, these records will not be readable from this table.

[Screenshot: record format conversion settings]

Select S3 for storage and the bucket we created earlier. AWS Glue Crawler, which I will talk about a little later, does not work with prefixes in the S3 bucket, so it is important to leave the prefix empty.

[Screenshot: S3 destination settings]

The remaining options can be changed depending on your load; I usually use the defaults. Note that S3 compression is not available here, but ORC uses its own compression by default.

fluentd

Now that we have configured log storage and delivery, we need to configure sending. We will use fluentd, because I love Ruby, but you can use Logstash or send logs to Kinesis directly. A fluentd server can be started in several ways; I will use Docker because it is simple and convenient.

First, we need the fluent.conf configuration file. Create it and add a source:

<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

Now you can start the Fluentd server. If you need a more advanced configuration, there is a detailed guide on Docker Hub, including how to build your own image.

$ docker run \
  -d \
  -p 24224:24224 \
  -p 24224:24224/udp \
  -v /data:/fluentd/log \
  -v <PATH-TO-FLUENT-CONF>:/fluentd/etc \
  fluent/fluentd:stable \
  -c /fluentd/etc/fluent.conf

This configuration uses the path /fluentd/log to cache logs before sending. You can do without it, but then everything cached will be lost when the container restarts. Any port can be used as well; 24224 is the default Fluentd port.

Now that we have Fluentd running, we can send Nginx logs there. We usually run Nginx in a Docker container, in which case Docker has a native logging driver for Fluentd:

$ docker run \
  --log-driver=fluentd \
  --log-opt fluentd-address=<FLUENTD-SERVER-ADDRESS> \
  --log-opt tag="{{.Name}}" \
  -v /some/content:/usr/share/nginx/html:ro \
  -d \
  nginx

If you run Nginx some other way, you can use the log files directly: Fluentd has a tail input plugin.

Let's add parsing for the Nginx logs configured above to the Fluentd configuration:

<filter YOUR-NGINX-TAG.*>
  @type parser
  key_name log
  emit_invalid_record_to_error false
  <parse>
    @type json
  </parse>
</filter>

And send the logs to Kinesis using the kinesis firehose plugin:

<match YOUR-NGINX-TAG.*>
  @type kinesis_firehose
  region <YOUR-AWS-REGION>
  delivery_stream_name <YOUR-KINESIS-STREAM-NAME>
  aws_key_id <YOUR-AWS-KEY-ID>
  aws_sec_key <YOUR_AWS-SEC_KEY>
</match>

Athena

If you configured everything correctly, then after a while (by default, Kinesis flushes received data every 10 minutes) you should see log files in S3. In the "Monitoring" menu of Kinesis Firehose you can see how much data is written to S3, as well as errors. Don't forget to grant the Kinesis role write access to the S3 bucket. If Kinesis fails to parse a record, it writes the errors to the same bucket.

Now you can view the data in Athena. Let's find the most recent requests that returned errors:

SELECT * FROM "db_name"."table_name" WHERE status > 499 ORDER BY created_at DESC limit 10;

Scanning all records per request

Now our logs are processed and stored in S3 as ORC, compressed and ready for analysis. Kinesis Firehose even organized them into directories for every hour. However, as long as the table is not partitioned, Athena will scan the entire dataset on every query, with rare exceptions. This is a big problem for two reasons:

  • The volume of data is constantly growing, slowing down queries;
  • Athena is billed based on the amount of data scanned, with a minimum of 10 MB per query.

To fix this, we use AWS Glue Crawler, which will crawl the data in S3 and write the partition information to the Glue Metastore. This will allow us to use partitions as a filter when querying Athena, so that it only scans the directories specified in the query.
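The mapping from an event timestamp to the hourly S3 prefix that Kinesis Firehose writes to can be sketched in a few lines of Python (the bucket name here is a placeholder):

```python
from datetime import datetime, timezone

# Kinesis Firehose writes objects under hourly prefixes: YYYY/MM/DD/HH (UTC).
# These prefixes are what Glue Crawler later exposes as partition_0..partition_3.
def s3_prefix(ts: datetime) -> str:
    return ts.astimezone(timezone.utc).strftime("%Y/%m/%d/%H")

event_time = datetime(2019, 4, 8, 6, 15, tzinfo=timezone.utc)
print(f"s3://<YOUR-S3-BUCKET>/{s3_prefix(event_time)}/")
```

A query that filters on these partition values only has to read the objects under the matching prefixes, which is exactly what we exploit below.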

Setting up Amazon Glue Crawler

Amazon Glue Crawler crawls all the data in the S3 bucket and creates tables with partitions. Create a Glue Crawler from the AWS Glue Console and add a bucket where you store data. You can use one crawler for multiple buckets, in which case it will create tables in the specified database with names that match the names of the buckets. If you plan to use this data all the time, don't forget to adjust the Crawler launch schedule to suit your needs. We use one Crawler for all tables, which runs every hour.

Partitioned tables

After the first launch of the crawler, tables for each scanned bucket should appear in the database specified in the settings. Open the Athena console and find the Nginx log table. Let's try to read something:

SELECT * FROM "default"."part_demo_kinesis_bucket"
WHERE(
  partition_0 = '2019' AND
  partition_1 = '04' AND
  partition_2 = '08' AND
  partition_3 = '06'
  );

This query will select all records received between 6 am and 7 am on April 8, 2019. But how much more efficient is this than just reading from a non-partitioned table? Let's find out and select the same records by filtering them by timestamp:

[Screenshot: full-scan query statistics]

3.59 seconds and 244.34 megabytes of data on a dataset with only a week of logs. Let's try a partition filter:

[Screenshot: partition-filtered query statistics]

A little faster, but most importantly, only 1.23 megabytes of data! It would be much cheaper if it were not for the 10-megabyte minimum per query in the pricing. But it's still much better, and on large datasets the difference will be far more impressive.

Building a dashboard with Cube.js

To assemble the dashboard, we use the Cube.js analytics framework. It has quite a lot of features, but we are interested in two: the ability to automatically apply partition filters, and pre-aggregation of data. It uses a data schema, written in JavaScript, to generate SQL and execute queries against the database. We only need to specify how to apply the partition filter in the data schema.

Let's create a new Cube.js application. Since we are already using the AWS stack, it makes sense to use Lambda for deployment. You can use the express template for generation if you plan to host the Cube.js backend on Heroku or Docker. The documentation describes other hosting methods.

$ npm install -g cubejs-cli
$ cubejs create nginx-log-analytics -t serverless -d athena

Cube.js uses environment variables to configure database access. The generator will create an .env file where you can specify your keys for Athena.

Now we need a data schema that specifies exactly how our logs are stored. It is also where we define how to calculate metrics for the dashboards.

In the schema directory, create a file called Logs.js. Here is an example data model for the Nginx logs:

Model Code

const partitionFilter = (from, to) => `
    date(from_iso8601_timestamp(${from})) <= date_parse(partition_0 || partition_1 || partition_2, '%Y%m%d') AND
    date(from_iso8601_timestamp(${to})) >= date_parse(partition_0 || partition_1 || partition_2, '%Y%m%d')
    `

cube(`Logs`, {
  sql: `
  select * from part_demo_kinesis_bucket
  WHERE ${FILTER_PARAMS.Logs.createdAt.filter(partitionFilter)}
  `,

  measures: {
    count: {
      type: `count`,
    },

    errorCount: {
      type: `count`,
      filters: [
        { sql: `${CUBE.isError} = 'Yes'` }
      ]
    },

    errorRate: {
      type: `number`,
      sql: `100.0 * ${errorCount} / ${count}`,
      format: `percent`
    }
  },

  dimensions: {
    status: {
      sql: `status`,
      type: `number`
    },

    isError: {
      type: `string`,
      case: {
        when: [{
          sql: `${CUBE}.status >= 400`, label: `Yes`
        }],
        else: { label: `No` }
      }
    },

    createdAt: {
      sql: `from_unixtime(created_at)`,
      type: `time`
    }
  }
});

Here we use the FILTER_PARAMS variable to generate a SQL query with a partition filter.

We also define the metrics and dimensions we want to display on the dashboard and specify the pre-aggregations. Cube.js will create additional tables with pre-aggregated data and will automatically update them as new data arrives. This not only speeds up queries but also reduces the cost of using Athena.

Let's add this information to the data schema file:

preAggregations: {
  main: {
    type: `rollup`,
    measureReferences: [count, errorCount],
    dimensionReferences: [isError, status],
    timeDimensionReference: createdAt,
    granularity: `day`,
    partitionGranularity: `month`,
    refreshKey: {
      sql: FILTER_PARAMS.Logs.createdAt.filter((from, to) => 
        `select
           CASE WHEN from_iso8601_timestamp(${to}) + interval '3' day > now()
           THEN date_trunc('hour', now()) END`
      )
    }
  }
}

In this model we specify that data should be pre-aggregated for all the metrics used, partitioned by month. Partitioning pre-aggregations can significantly speed up building and refreshing the data.

Now we can assemble the dashboard!

The Cube.js backend provides a REST API and a set of client libraries for popular front-end frameworks. We will use the React client to build the dashboard. Cube.js provides only the data, so we need a visualization library. I like Recharts, but you can use any other.

The Cube.js server accepts queries in JSON format specifying the required metrics. For example, to calculate how many errors Nginx returned per day, you need to send the following query:

{
  "measures": ["Logs.errorCount"],
  "timeDimensions": [
    {
      "dimension": "Logs.createdAt",
      "dateRange": ["2019-01-01", "2019-01-07"],
      "granularity": "day"
    }
  ]
}
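Under the hood, the client libraries send this query to the backend's REST API as a URL-encoded JSON parameter. A minimal Python sketch of constructing such a request URL, assuming the backend is running locally on the default port:

```python
import json
from urllib.parse import urlencode

# The Cube.js query from the example above
query = {
    "measures": ["Logs.errorCount"],
    "timeDimensions": [{
        "dimension": "Logs.createdAt",
        "dateRange": ["2019-01-01", "2019-01-07"],
        "granularity": "day",
    }],
}

# The backend exposes the load endpoint at /cubejs-api/v1/load;
# the query travels as a URL-encoded JSON string in the `query` parameter.
base_url = "http://localhost:4000/cubejs-api/v1/load"
url = f"{base_url}?{urlencode({'query': json.dumps(query)})}"
print(url)
```

In practice you rarely build these URLs by hand; the client library shown below does it for you and also attaches the API token.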

Install the Cube.js client and the React component library via NPM:

$ npm i --save @cubejs-client/core @cubejs-client/react

Import the cubejs and QueryRenderer components to fetch the data, and assemble the dashboard:

Dashboard Code

import React from 'react';
import { LineChart, Line, XAxis, YAxis } from 'recharts';
import cubejs from '@cubejs-client/core';
import { QueryRenderer } from '@cubejs-client/react';

const cubejsApi = cubejs(
  'YOUR-CUBEJS-API-TOKEN',
  { apiUrl: 'http://localhost:4000/cubejs-api/v1' },
);

export default () => {
  return (
    <QueryRenderer
      query={{
        measures: ['Logs.errorCount'],
        timeDimensions: [{
            dimension: 'Logs.createdAt',
            dateRange: ['2019-01-01', '2019-01-07'],
            granularity: 'day'
        }]
      }}
      cubejsApi={cubejsApi}
      render={({ resultSet }) => {
        if (!resultSet) {
          return 'Loading...';
        }

        return (
          <LineChart data={resultSet.rawData()}>
            <XAxis dataKey="Logs.createdAt"/>
            <YAxis/>
            <Line type="monotone" dataKey="Logs.errorCount" stroke="#8884d8"/>
          </LineChart>
        );
      }}
    />
  )
}

The dashboard's source code is available on CodeSandbox.

Source: habr.com
