To monitor and analyze how Nginx is working, people usually use commercial products or ready-made open-source options such as Prometheus + Grafana. This is a good fit for monitoring and real-time analytics, but not very convenient for historical analysis. On any popular resource, the volume of Nginx log data grows rapidly, and it makes sense to use something more specialized to analyze large amounts of data.
In this article I will show you how to use AWS Athena to analyze Nginx logs and build an analytics dashboard on top of them with Cube.js.
TL;DR:
We use Fluentd to collect the logs, AWS Kinesis Data Firehose and AWS Glue to process them, AWS S3 for storage, and AWS Athena and Cube.js for analysis and visualization.
Collecting Nginx logs
By default, Nginx logs look something like this:
1.1.1.1 - - [09/Apr/2019:09:58:17 +0000] "GET /sign-up HTTP/2.0" 200 9168 "https://example.com/sign-in" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" "-"
1.1.1.1 - - [09/Apr/2019:09:58:17 +0000] "GET /sign-in HTTP/2.0" 200 9168 "https://example.com/sign-up" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" "-"
They can be parsed, but it is much easier to adjust the Nginx configuration so that it outputs logs in JSON:
log_format json_combined escape=json '{ "created_at": "$msec", '
'"remote_addr": "$remote_addr", '
'"remote_user": "$remote_user", '
'"request": "$request", '
'"status": $status, '
'"bytes_sent": $bytes_sent, '
'"request_length": $request_length, '
'"request_time": $request_time, '
'"http_referrer": "$http_referer", '
'"http_x_forwarded_for": "$http_x_forwarded_for", '
'"http_user_agent": "$http_user_agent" }';
access_log /var/log/nginx/access.log json_combined;
S3 for storage
To store the logs, we will use S3. This lets you keep and analyze the data in one place, since Athena can query data in S3 directly. Later in the article I will explain how to properly store and process the logs, but first we need a clean bucket in S3 in which nothing else will be stored. It is worth considering in advance which region you will create the bucket in, because Athena is not available in all regions.
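If you prefer the command line, creating the bucket is a single AWS CLI call. A minimal sketch, with a placeholder bucket name and an example region (pick one where Athena is available):
# create an empty bucket dedicated to the Nginx logs (name and region are examples)
$ aws s3 mb s3://my-nginx-logs-bucket --region us-east-1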
Creating a schema in the Athena console
Let's create a table in Athena for the logs. It is needed for both writing and reading: Kinesis Firehose will use this schema for the format conversion, and Athena will use it for queries. Open the Athena console and create a table:
SQL create table
CREATE EXTERNAL TABLE `kinesis_logs_nginx`(
`created_at` double,
`remote_addr` string,
`remote_user` string,
`request` string,
`status` int,
`bytes_sent` int,
`request_length` int,
`request_time` double,
`http_referrer` string,
`http_x_forwarded_for` string,
`http_user_agent` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://<YOUR-S3-BUCKET>'
TBLPROPERTIES ('has_encrypted_data'='false');
Creating a Kinesis Firehose Stream
Kinesis Firehose will write the data received from Nginx to S3 in the selected format, split into directories in the YYYY/MM/DD/HH format. This will come in handy when reading the data. You could, of course, write directly to S3 from Fluentd, but in that case you would have to store JSON, which is inefficient because of the large file size; besides, when using PrestoDB or Athena, JSON is the slowest data format. So open the Kinesis Firehose console, click "Create delivery stream" and select "Direct PUT" as the source:
In the next step, set "Record format conversion" to "Enabled" and select "Apache ORC" as the output format: according to some research, it is one of the most efficient formats for PrestoDB and Athena.
Select S3 as the destination and the bucket we created earlier. AWS Glue Crawler, which I will talk about a little later, does not work well with prefixes in the S3 bucket, so it is important to leave the prefix empty.
The remaining options can be changed depending on your load, I usually use the default ones. Note that S3 compression is not available, but ORC uses native compression by default.
Fluentd
Now that we have configured storing and receiving the logs, we need to configure sending them. For that, we will use Fluentd.
First, we need the fluent.conf configuration file. Create it and add a source (the <source> wrapper below assumes the standard forward input, which is what the Docker logging driver uses):
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>
Now you can start the Fluentd server. If you need a more advanced configuration, the Fluentd documentation describes all the available options.
$ docker run \
  -d \
  -p 24224:24224 \
  -p 24224:24224/udp \
  -v /data:/fluentd/log \
  -v <PATH-TO-FLUENT-CONF>:/fluentd/etc \
  fluent/fluentd:stable \
  -c /fluentd/etc/fluent.conf
This configuration uses the path /fluentd/log to buffer logs before sending them (see the file buffer sketch in the Kinesis output section below). You can do without it, but then on a restart you may lose everything that has been buffered. Any port can be used as well; 24224 is simply the default Fluentd port.
Now that Fluentd is running, we can send the Nginx logs to it. We usually run Nginx in a Docker container, and Docker has a native logging driver for Fluentd:
$ docker run \
  --log-driver=fluentd \
  --log-opt fluentd-address=<FLUENTD-SERVER-ADDRESS> \
  --log-opt tag="{{.Name}}" \
  -v /some/content:/usr/share/nginx/html:ro \
  -d \
  nginx
If you run Nginx in some other way, you can use the log files directly; Fluentd has a tail input plugin for that.
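For reference, here is a minimal sketch of such a tail source; the log path, pos_file location and tag are placeholders to adapt to your setup:
<source>
  @type tail
  # read new lines from the access log written by Nginx
  path /var/log/nginx/access.log
  # remember the read position across restarts
  pos_file /var/log/fluentd/nginx-access.log.pos
  tag nginx.access
  <parse>
    # the log format configured above is JSON
    @type json
  </parse>
</source>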
Let's add parsing of the log format configured above to the Fluentd configuration:
<filter YOUR-NGINX-TAG.*>
@type parser
key_name log
emit_invalid_record_to_error false
<parse>
@type json
</parse>
</filter>
And configure sending the logs to Kinesis using the Kinesis Firehose output plugin for Fluentd:
<match YOUR-NGINX-TAG.*>
@type kinesis_firehose
region <YOUR-AWS-REGION>
delivery_stream_name <YOUR-KINESIS-STREAM-NAME>
aws_key_id <YOUR-AWS-KEY-ID>
aws_sec_key <YOUR-AWS-SEC-KEY>
</match>
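To actually use the /fluentd/log mount mentioned earlier and survive restarts, you can add a file buffer inside this match section. This is only a sketch with assumed values; the buffer path and flush interval should be tuned to your load:
<buffer>
  # persist chunks on disk instead of in memory
  @type file
  path /fluentd/log/kinesis-buffer
  # how often buffered chunks are flushed to Kinesis Firehose
  flush_interval 10s
</buffer>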
Athena
If you configured everything correctly, then after a while (by default, Kinesis flushes the received data every 10 minutes) you should see the log files in S3. In the "Monitoring" tab of Kinesis Firehose you can see how much data is being written to S3, as well as errors. Don't forget to give the Kinesis role write access to the S3 bucket. If Kinesis could not parse something, it will write the errors to the same bucket.
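For the write access, a minimal IAM policy attached to the Firehose role could look like the sketch below; the bucket name is a placeholder, and your setup may need additional permissions (for Glue, CloudWatch logging, and so on):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:AbortMultipartUpload",
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::my-nginx-logs-bucket",
        "arn:aws:s3:::my-nginx-logs-bucket/*"
      ]
    }
  ]
}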
Now you can view the data in Athena. Let's find the latest requests for which we returned errors:
SELECT * FROM "db_name"."table_name" WHERE status > 499 ORDER BY created_at DESC limit 10;
Scanning all the data on every query
Now our logs are processed and stored in S3 as ORC, compressed and ready for analysis. Kinesis Firehose even organized them into directories for every hour. However, as long as the table is not partitioned, Athena will scan the data for the entire time range on every query, with rare exceptions. This is a big problem for two reasons:
- The volume of data is constantly growing, slowing down queries;
- Athena is billed based on the amount of data scanned, with a minimum of 10 MB per query.
To fix this, we will use AWS Glue Crawler, which will crawl the data in S3 and write the partition information to the Glue Metastore. This will allow us to use partitions as a filter when querying Athena, so that it only scans the directories specified in the query.
Setting up Amazon Glue Crawler
Amazon Glue Crawler crawls all the data in the S3 bucket and creates tables with partitions. Create a Glue Crawler from the AWS Glue console and add the bucket where you store the data. You can use one crawler for several buckets, in which case it will create tables in the specified database with names matching the bucket names. If you plan to use this data regularly, don't forget to set up the crawler's launch schedule to suit your needs. We use one crawler for all tables, and it runs every hour.
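The same can be scripted with the AWS CLI; a sketch with placeholder names for the crawler, IAM role, database and bucket, and an hourly schedule:
$ aws glue create-crawler \
  --name nginx-logs-crawler \
  --role AWSGlueServiceRole-NginxLogs \
  --database-name default \
  --targets '{"S3Targets": [{"Path": "s3://my-nginx-logs-bucket"}]}' \
  --schedule 'cron(0 * * * ? *)'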
Partitioned tables
After the first launch of the crawler, tables for each scanned bucket should appear in the database specified in the settings. Open the Athena console and find the Nginx log table. Let's try to read something:
SELECT * FROM "default"."part_demo_kinesis_bucket"
WHERE(
partition_0 = '2019' AND
partition_1 = '04' AND
partition_2 = '08' AND
partition_3 = '06'
);
This query will select all records received between 6 and 7 am on April 8, 2019. But how much more efficient is this than just reading from a non-partitioned table? Let's find out by selecting the same records filtered by timestamp:
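Such a query could look roughly like this; a sketch, assuming created_at holds the Unix timestamp written from $msec, with the epoch values for 06:00 and 07:00 UTC on April 8, 2019:
SELECT * FROM "default"."part_demo_kinesis_bucket"
WHERE created_at BETWEEN 1554703200 AND 1554706800;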
3.59 seconds and 244.34 megabytes of data on a dataset with only a week of logs. Let's try a partition filter:
A little faster, but most importantly - only 1.23 megabytes of data! It would be much cheaper if it were not for the minimum of 10 megabytes per query in the pricing. But it is still much better, and on large datasets the difference will be far more impressive.
Building a dashboard with Cube.js
To assemble the dashboard, we will use the Cube.js analytics framework. It has quite a lot of features, but we are interested in two: the ability to automatically use partition filters and the pre-aggregation of data. It uses a data schema, written in JavaScript, to generate SQL and execute the queries against the database.
Let's create a new Cube.js application. Since we are already using the AWS stack, it makes sense to use Lambda for deployment, which is what the serverless template is for. You can use the express template instead if you plan to host the Cube.js backend on Heroku or in Docker. The documentation describes other deployment methods as well.
$ npm install -g cubejs-cli
$ cubejs create nginx-log-analytics -t serverless -d athena
Cube.js uses environment variables to configure database access. The generator will create an .env file where you can specify your keys for Athena.
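A sketch of what the .env file might contain for the Athena driver; the variable names below are an assumption based on the Cube.js Athena driver conventions, so check the generated file and the documentation for the exact names:
CUBEJS_DB_TYPE=athena
CUBEJS_AWS_KEY=<YOUR-AWS-KEY-ID>
CUBEJS_AWS_SECRET=<YOUR-AWS-SECRET-KEY>
CUBEJS_AWS_REGION=<YOUR-AWS-REGION>
# S3 location where Athena writes query results
CUBEJS_AWS_S3_OUTPUT_LOCATION=<YOUR-ATHENA-S3-OUTPUT-LOCATION>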
Now we need a data schema. In the schema directory, create a file called Logs.js. Here is an example data model for the Nginx logs:
Model Code
const partitionFilter = (from, to) => `
date(from_iso8601_timestamp(${from})) <= date_parse(partition_0 || partition_1 || partition_2, '%Y%m%d') AND
date(from_iso8601_timestamp(${to})) >= date_parse(partition_0 || partition_1 || partition_2, '%Y%m%d')
`
cube(`Logs`, {
sql: `
select * from part_demo_kinesis_bucket
WHERE ${FILTER_PARAMS.Logs.createdAt.filter(partitionFilter)}
`,
measures: {
count: {
type: `count`,
},
errorCount: {
type: `count`,
filters: [
{ sql: `${CUBE.isError} = 'Yes'` }
]
},
errorRate: {
type: `number`,
sql: `100.0 * ${errorCount} / ${count}`,
format: `percent`
}
},
dimensions: {
status: {
sql: `status`,
type: `number`
},
isError: {
type: `string`,
case: {
when: [{
sql: `${CUBE}.status >= 400`, label: `Yes`
}],
else: { label: `No` }
}
},
createdAt: {
sql: `from_unixtime(created_at)`,
type: `time`
}
}
});
Here we use the FILTER_PARAMS variable to generate a SQL query with a partition filter: when the dashboard queries by createdAt, the same date range is applied to partition_0, partition_1 and partition_2, so Athena only scans the matching directories.
We also define the measures and dimensions we want to display on the dashboard and specify the pre-aggregations. Cube.js will create additional tables with pre-aggregated data and will automatically refresh them as new data arrives. This not only speeds up queries but also reduces the cost of using Athena.
Let's add this information to the data schema file:
preAggregations: {
main: {
type: `rollup`,
measureReferences: [count, errorCount],
dimensionReferences: [isError, status],
timeDimensionReference: createdAt,
granularity: `day`,
partitionGranularity: `month`,
refreshKey: {
sql: FILTER_PARAMS.Logs.createdAt.filter((from, to) =>
`select
CASE WHEN from_iso8601_timestamp(${to}) + interval '3' day > now()
THEN date_trunc('hour', now()) END`
)
}
}
}
In this model we specify that the data should be pre-aggregated for all the metrics used, partitioned by month.
Now we can assemble the dashboard!
The Cube.js backend provides a REST API and client libraries for popular front-end frameworks; we will use the React client to build the dashboard.
The Cube.js server accepts queries in the following JSON format:
{
"measures": ["Logs.errorCount"],
"timeDimensions": [
{
"dimension": "Logs.createdAt",
"dateRange": ["2019-01-01", "2019-01-07"],
"granularity": "day"
}
]
}
Install Cube.js client and React component library via NPM:
$ npm i --save @cubejs-client/core @cubejs-client/react
Import the cubejs client and the QueryRenderer component to load the data, and assemble the dashboard:
Dashboard Code
import React from 'react';
import { LineChart, Line, XAxis, YAxis } from 'recharts';
import cubejs from '@cubejs-client/core';
import { QueryRenderer } from '@cubejs-client/react';
const cubejsApi = cubejs(
'YOUR-CUBEJS-API-TOKEN',
{ apiUrl: 'http://localhost:4000/cubejs-api/v1' },
);
export default () => {
return (
<QueryRenderer
query={{
measures: ['Logs.errorCount'],
timeDimensions: [{
dimension: 'Logs.createdAt',
dateRange: ['2019-01-01', '2019-01-07'],
granularity: 'day'
}]
}}
cubejsApi={cubejsApi}
render={({ resultSet }) => {
if (!resultSet) {
return 'Loading...';
}
return (
<LineChart data={resultSet.rawData()}>
<XAxis dataKey="Logs.createdAt"/>
<YAxis/>
<Line type="monotone" dataKey="Logs.errorCount" stroke="#8884d8"/>
</LineChart>
);
}}
/>
)
}
The dashboard's source code is available online.
Source: habr.com