Structuring Unstructured Data with GROK
If you are using the Elastic Stack (ELK) and are interested in mapping custom Logstash logs to Elasticsearch, then this post is for you.
ELK is an acronym for three open source projects: Elasticsearch, Logstash, and Kibana. Together they form a log management platform.
- Elasticsearch is a search and analytics engine.
- Logstash is a server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a "stash" such as Elasticsearch.
- Kibana lets users visualize the data in Elasticsearch with charts and graphs.
Beats came along later as a lightweight data shipper. The introduction of Beats turned the ELK Stack into the Elastic Stack, but that's beside the point.
This article is about Grok, which is a feature in Logstash that can transform your logs before they are sent to the stash. For our purposes, I will only talk about processing data from Logstash to Elasticsearch.
Grok is a filter within Logstash that is used to parse unstructured data into something structured and queryable. It sits on top of a regular expression (regex) and uses text patterns to match strings in log files.
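To make "sits on top of regex" concrete, here is a minimal Python sketch of how a Grok token can expand into a named regex capture group. The two-entry pattern table below is a simplified, hypothetical stand-in for Logstash's built-in pattern library, not its actual definitions:

```python
import re

# Hypothetical mini-library: Logstash ships a much larger, more
# elaborate set of built-in patterns than these two.
GROK_PATTERNS = {
    "WORD": r"\b\w+\b",
    "NUMBER": r"\d+(?:\.\d+)?",
}

def grok_to_regex(pattern: str) -> str:
    """Expand each %{SYNTAX:SEMANTIC} token into a (?P<SEMANTIC>...) group."""
    def expand(m):
        syntax, semantic = m.group(1), m.group(2)
        return f"(?P<{semantic}>{GROK_PATTERNS[syntax]})"
    return re.sub(r"%\{(\w+):(\w+)\}", expand, pattern)

regex = grok_to_regex("%{WORD:method} %{NUMBER:response_status}")
match = re.match(regex, "GET 400")
print(match.groupdict())  # {'method': 'GET', 'response_status': '400'}
```

This is the core idea: the SYNTAX part names a reusable regex, and the SEMANTIC part names the field the match is stored under.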
As we'll see in the following sections, using Grok goes a long way when it comes to efficient log management.
Without Grok, your log data is unstructured
Without Grok, when logs are sent from Logstash to Elasticsearch and rendered in Kibana, all of the log data appears under a single message field.
Querying for meaningful information in this situation is difficult, because everything is stored in the same key. It would be better if the log messages were organized into separate fields.
Unstructured data from logs
localhost GET /v2/applink/5c2f4bb3e9fda1234edc64d 400 46ms 5bc6e716b5d6cb35fc9687c0
If you take a closer look at the raw data, you will see that it actually consists of different parts, each separated by a space.
More experienced developers can probably guess what each of the parts means and that this is a log message from an API call. Each item is broken down below.
Structured view of our data
- localhost == environment
- GET == method
- /v2/applink/5c2f4bb3e9fda1234edc64d == url
- 400 == response_status
- 46ms == response_time
- 5bc6e716b5d6cb35fc9687c0 == user_id
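Since the fields are simply space-separated, the mapping above can be sketched with plain string splitting. This quick Python illustration is not Grok itself, just a demonstration that the structure is already there in the raw line:

```python
# The example log line from above
log_line = "localhost GET /v2/applink/5c2f4bb3e9fda1234edc64d 400 46ms 5bc6e716b5d6cb35fc9687c0"

# Field names in the order they appear in the line
fields = ["environment", "method", "url",
          "response_status", "response_time", "user_id"]

# Zip the names with the whitespace-split parts
parsed = dict(zip(fields, log_line.split()))

print(parsed["method"])           # GET
print(parsed["response_status"])  # 400
```

Of course, naive splitting breaks down as soon as a field can contain spaces, which is one reason pattern-based parsing is the more robust approach.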
As the structured view shows, there is an order to the unstructured logs. The next step is to process the raw data programmatically. That's where Grok shines.
Grok Patterns
Built-in Grok Patterns
Logstash ships with over 100 built-in patterns for structuring unstructured data. You should definitely take advantage of them whenever possible for common log formats such as apache, linux, haproxy, aws, and so on.
However, what happens when you have custom logs like in the example above? Then you have to build your own Grok pattern.
Custom Grok Patterns
To build a custom pattern, you need a way to test it against your log data. For that, I used the Grok Debugger.
Note that the syntax for Grok patterns is %{SYNTAX:SEMANTIC}, where SYNTAX is the name of the pattern to match against and SEMANTIC is the field name the matched text is stored under.
The first thing I tried was the Discover tab in the Grok Debugger. I thought it would be great if this tool could generate the Grok pattern automatically, but it wasn't very helpful, as it only found two matches.
Starting from that, I built my own pattern in the Grok Debugger, using the pattern definitions published on Elastic's GitHub page.
After playing around with different combinations, I was finally able to structure the log data the way I wanted.
Raw log line:
localhost GET /v2/applink/5c2f4bb3e9fda1234edc64d 400 46ms 5bc6e716b5d6cb35fc9687c0
Pattern:
%{WORD:environment} %{WORD:method} %{URIPATH:url} %{NUMBER:response_status} %{WORD:response_time} %{USERNAME:user_id}
And here is the result:
{
"environment": [
[
"localhost"
]
],
"method": [
[
"GET"
]
],
"url": [
[
"/v2/applink/5c2f4bb3e9fda1234edc64d"
]
],
"response_status": [
[
"400"
]
],
"BASE10NUM": [
[
"400"
]
],
"response_time": [
[
"46ms"
]
],
"user_id": [
[
"5bc6e716b5d6cb35fc9687c0"
]
]
}
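(The extra BASE10NUM field shows up because the built-in NUMBER pattern is itself defined in terms of BASE10NUM, so the debugger reports both captures.) As a sanity check, the same match can be approximated outside Logstash with an ordinary regex. The following Python sketch is a rough equivalent of the pattern above, not the exact built-in WORD/URIPATH/NUMBER/USERNAME definitions:

```python
import re

# Approximate stand-ins for the built-in Grok patterns; the real
# definitions in Logstash are more elaborate.
pattern = re.compile(
    r"(?P<environment>\w+) "
    r"(?P<method>\w+) "
    r"(?P<url>/\S*) "
    r"(?P<response_status>\d+) "
    r"(?P<response_time>\w+) "
    r"(?P<user_id>[a-zA-Z0-9_.-]+)"
)

line = ("localhost GET /v2/applink/5c2f4bb3e9fda1234edc64d "
        "400 46ms 5bc6e716b5d6cb35fc9687c0")
result = pattern.match(line).groupdict()
print(result["url"])  # /v2/applink/5c2f4bb3e9fda1234edc64d
```

Testing the pattern this way before deploying it can save a round of Logstash restarts.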
With the Grok pattern and mapped data in hand, the final step is to add it to Logstash.
Update the logstash.conf configuration file
On the server where you installed the ELK stack, go to the Logstash configuration:
sudo vi /etc/logstash/conf.d/logstash.conf
Then apply your changes:
input {
  file {
    path => "/your_logs/*.log"
  }
}

filter {
  grok {
    match => { "message" => "%{WORD:environment} %{WORD:method} %{URIPATH:url} %{NUMBER:response_status} %{WORD:response_time} %{USERNAME:user_id}" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
After saving the changes, restart Logstash and check its status to make sure it's still running.
sudo service logstash restart
sudo service logstash status
Finally, to make sure the changes take effect, be sure to refresh the Elasticsearch index pattern for Logstash in Kibana!
With Grok, your log data is structured!
As we can see, Grok automatically maps the log data into separate Elasticsearch fields. This makes logs easier to manage and lets you query information quickly. Instead of digging through log files to debug, you can simply filter on what you're looking for, such as an environment or a url.
Give Grok expressions a try! If you have another way to do this, or have any issues with the examples above, just drop a comment below to let me know.
Thanks for reading - and please follow me here on Medium for more interesting software engineering articles!
Source: habr.com