Being written in Rust, Vector offers high performance and low RAM consumption compared to its counterparts. In addition, much attention is paid to correctness-related features, in particular the ability to buffer unsent events on disk and to handle file rotation.
Architecturally, Vector is an event router: it accepts messages from one or more sources, optionally applies transformations to them, and sends them on to one or more sinks.
Vector is a replacement for Filebeat and Logstash; it can act in both roles (receiving and sending logs). More details are available on the project website.
Where Logstash builds the chain as input → filter → output, in Vector it is sources → transforms → sinks.
Examples can be found in the documentation.
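As a minimal sketch of that chain (the component names here are illustrative, not from the article, and exact option names vary by Vector version), a config might look like:

```toml
# Hypothetical minimal pipeline: sources → transforms → sinks
[sources.nginx_log]
type = "file"
include = ["/var/log/nginx/access.log"]

[transforms.parse_json]
type = "json_parser"      # parse each log line as JSON
inputs = ["nginx_log"]

[sinks.out]
type = "console"          # print resulting events to stdout
inputs = ["parse_json"]
encoding = "json"
```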
This guide is a revised version of the instructions by Vyacheslav Rakhinsky. The original instructions include geoip processing. When I tested geoip from the internal network, Vector returned an error:
Aug 05 06:25:31.889 DEBUG transform{name=nginx_parse_rename_fields type=rename_fields}: vector::transforms::rename_fields: Field did not exist field=«geoip.country_name» rate_limit_secs=30
If you need geoip processing, refer to the original instructions from Vyacheslav Rakhinsky.
We will configure the pipeline Nginx (access logs) → Vector (client | Filebeat role) → Vector (server | Logstash role) → ClickHouse and, separately, Elasticsearch. This requires 4 servers, although you can get by with 3.
The scheme is something like this.
Turn off SELinux on all your servers:
sed -i 's/^SELINUX=.*/SELINUX=disabled/g' /etc/selinux/config
reboot
Install the HTTP server emulator + utilities on all servers
ClickHouse uses the SSE 4.2 instruction set, so unless otherwise noted, support for it in the processor becomes an additional system requirement. Here is the command to check whether the current processor supports SSE 4.2:
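A common way to check is to grep the CPU flags (this exact command is an assumption; the ClickHouse documentation suggests an equivalent check):

```shell
# Check whether the CPU advertises SSE 4.2 support
grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
```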
We set up Elasticsearch in single-node mode: 1 shard, 0 replicas. Most likely you will have a cluster with a large number of servers, in which case you do not need to do this.
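One way to do this (an assumption on my part; the template name and index pattern are illustrative) is an index template that sets 1 shard and 0 replicas for the log indices:

```json
PUT _template/logs_single_node
{
  "index_patterns": ["nginx-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
```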
INFO vector::topology::builder: Healthcheck: Passed.
INFO vector::topology::builder: Healthcheck: Passed.
On the client (Web server) - 1st server
On the server with Nginx, you need to disable IPv6, since the logs table in ClickHouse uses the IPv4 type for the upstream_addr field (I don't use IPv6 internally). If IPv6 is not disabled, there will be errors:
DB::Exception: Invalid IPv4 value.: (while read the value of key upstream_addr)
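Disabling IPv6 via sysctl is one option (a sketch, not from the original article; file name is illustrative):

```
# /etc/sysctl.d/99-disable-ipv6.conf
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
```

Apply it with `sysctl -p /etc/sysctl.d/99-disable-ipv6.conf`.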
First we need to configure the Nginx log format in the /etc/nginx/nginx.conf file
user nginx;
# you must set worker processes based on your CPU cores, nginx does not benefit from setting more than that
worker_processes auto; # recent versions calculate it automatically
# number of file descriptors used for nginx
# the limit for the maximum FDs on the server is usually set by the OS.
# if you don't set FD's then OS settings will be used which is by default 2000
worker_rlimit_nofile 100000;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
# provides the configuration file context in which the directives that affect connection processing are specified.
events {
# determines how many clients will be served per worker
# max clients = worker_connections * worker_processes
# max clients is also limited by the number of socket connections available on the system (~64k)
worker_connections 4000;
# optimized to serve many clients with each thread, essential for linux -- for testing environment
use epoll;
# accept as many connections as possible, may flood worker connections if set too low -- for testing environment
multi_accept on;
}
http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
log_format vector escape=json
'{'
'"node_name":"nginx-vector",'
'"timestamp":"$time_iso8601",'
'"server_name":"$server_name",'
'"request_full": "$request",'
'"request_user_agent":"$http_user_agent",'
'"request_http_host":"$http_host",'
'"request_uri":"$request_uri",'
'"request_scheme": "$scheme",'
'"request_method":"$request_method",'
'"request_length":"$request_length",'
'"request_time": "$request_time",'
'"request_referrer":"$http_referer",'
'"response_status": "$status",'
'"response_body_bytes_sent":"$body_bytes_sent",'
'"response_content_type":"$sent_http_content_type",'
'"remote_addr": "$remote_addr",'
'"remote_port": "$remote_port",'
'"remote_user": "$remote_user",'
'"upstream_addr": "$upstream_addr",'
'"upstream_bytes_received": "$upstream_bytes_received",'
'"upstream_bytes_sent": "$upstream_bytes_sent",'
'"upstream_cache_status":"$upstream_cache_status",'
'"upstream_connect_time":"$upstream_connect_time",'
'"upstream_header_time":"$upstream_header_time",'
'"upstream_response_length":"$upstream_response_length",'
'"upstream_response_time":"$upstream_response_time",'
'"upstream_status": "$upstream_status",'
'"upstream_content_type":"$upstream_http_content_type"'
'}';
access_log /var/log/nginx/access.log main;
access_log /var/log/nginx/access.json.log vector; # New JSON-format log
sendfile on;
#tcp_nopush on;
keepalive_timeout 65;
#gzip on;
include /etc/nginx/conf.d/*.conf;
}
In order not to break your current configuration, Nginx allows you to have several access_log directives:
access_log /var/log/nginx/access.log main; # Standard log
access_log /var/log/nginx/access.json.log vector; # New JSON-format log
Don't forget to add a rule to logrotate for the new log (if the log file name doesn't end with .log, the default rule won't pick it up).
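A minimal logrotate rule might look like this (the options are illustrative; adjust them to your distribution's defaults):

```
/var/log/nginx/access.json.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    postrotate
        [ -f /var/run/nginx.pid ] && kill -USR1 $(cat /var/run/nginx.pid)
    endscript
}
```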
Now configure the Filebeat replacement in /etc/vector/vector.toml. The IP address 172.26.10.108 is the address of the log server (Vector-Server).
data_dir = "/var/lib/vector"
[sources.nginx_file]
type = "file"
include = [ "/var/log/nginx/access.json.log" ]
start_at_beginning = false
fingerprinting.strategy = "device_and_inode"
[sinks.nginx_output_vector]
type = "vector"
inputs = [ "nginx_file" ]
address = "172.26.10.108:9876"
Don't forget to add the vector user to the right group so that it can read log files. For example, nginx on CentOS creates logs with adm group permissions.
usermod -a -G adm vector
Let's start the vector service
systemctl enable vector
systemctl start vector
Vector logs can be viewed like this
journalctl -f -u vector
The logs should contain something like this:
INFO vector::topology::builder: Healthcheck: Passed.
Stress Testing
Testing is carried out using Apache Benchmark.
The httpd-tools package was installed on all servers.
We start testing with Apache Benchmark from 4 different servers inside screen: first launch the screen terminal multiplexer, then start the benchmark. How to work with screen is described in this article.
From the 1st server
while true; do ab -H "User-Agent: 1server" -c 100 -n 10 -t 10 http://vhost1/; sleep 1; done
From the 2nd server
while true; do ab -H "User-Agent: 2server" -c 100 -n 10 -t 10 http://vhost2/; sleep 1; done
From the 3rd server
while true; do ab -H "User-Agent: 3server" -c 100 -n 10 -t 10 http://vhost3/; sleep 1; done
From the 4th server
while true; do ab -H "User-Agent: 4server" -c 100 -n 10 -t 10 http://vhost4/; sleep 1; done
select concat(database, '.', table) as table,
formatReadableSize(sum(bytes)) as size,
sum(rows) as rows,
max(modification_time) as latest_modification,
sum(bytes) as bytes_size,
any(engine) as engine,
formatReadableSize(sum(primary_key_bytes_in_memory)) as primary_keys_size
from system.parts
where active
group by database, table
order by bytes_size desc;
Let's find out how much space the logs take up in ClickHouse.
The size of the logs table is 857.19 MB.
The size of the same data in the Elasticsearch index is 4.5 GB.
With no extra parameters specified, the data in ClickHouse takes 4500 / 857.19 ≈ 5.25 times less space than in Elasticsearch.
Vector applies compression by default.
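The ratio can be recomputed quickly:

```shell
# 4.5 GB (≈4500 MB) in Elasticsearch vs 857.19 MB in ClickHouse
awk 'BEGIN { printf "%.2f\n", 4500 / 857.19 }'
# prints 5.25
```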