For many years, the Webalizer utility and the Google Analytics tool helped me gain insight into what was happening on my websites. Now I realize that they provide very little useful information. If you have access to your access.log file, it is quite easy to build your own statistics with a handful of elementary tools: sqlite, html, SQL, and any scripting language.
The data source for Webalizer is the server's access.log file. Here is how its columns of numbers look; only the total amount of traffic is clear from them:
Tools such as Google Analytics collect data from the loaded page themselves. They show us a couple of charts and lines from which it is often difficult to draw the right conclusions. Perhaps more effort would have helped; I don't know.
So, what would I like to see in the statistics of site visits?
User and bot traffic
Site traffic is often limited, and you need to see how much useful traffic is being consumed. For example, like this:
SQL query report
SELECT
1 as 'StackedArea: Traffic generated by Users and Bots',
strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Day',
SUM(CASE WHEN USG.AGENT_BOT!='n.a.' THEN FCT.BYTES ELSE 0 END)/1000 AS 'Bots, KB',
SUM(CASE WHEN USG.AGENT_BOT='n.a.' THEN FCT.BYTES ELSE 0 END)/1000 AS 'Users, KB'
FROM
FCT_ACCESS_USER_AGENT_DD FCT,
DIM_USER_AGENT USG
WHERE FCT.DIM_USER_AGENT_ID=USG.DIM_USER_AGENT_ID
AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT
The graph shows constant bot activity. It would be interesting to look at the most active ones in detail.
Annoying bots
We classify bots based on the user agent information. Additional statistics on daily traffic and on the numbers of successful and unsuccessful requests give a good idea of the bots' activity.
SQL query report
SELECT
1 AS 'Table: Annoying Bots',
MAX(USG.AGENT_BOT) AS 'Bot',
ROUND(SUM(FCT.BYTES)/1000 / 14.0, 1) AS 'KB per Day',
ROUND(SUM(FCT.IP_CNT) / 14.0, 1) AS 'IPs per Day',
ROUND(SUM(CASE WHEN STS.STATUS_GROUP IN ('Client Error', 'Server Error') THEN FCT.REQUEST_CNT / 14.0 ELSE 0 END), 1) AS 'Error Requests per Day',
ROUND(SUM(CASE WHEN STS.STATUS_GROUP IN ('Successful', 'Redirection') THEN FCT.REQUEST_CNT / 14.0 ELSE 0 END), 1) AS 'Success Requests per Day',
USG.USER_AGENT_NK AS 'Agent'
FROM FCT_ACCESS_USER_AGENT_DD FCT,
DIM_USER_AGENT USG,
DIM_HTTP_STATUS STS
WHERE FCT.DIM_USER_AGENT_ID = USG.DIM_USER_AGENT_ID
AND FCT.DIM_HTTP_STATUS_ID = STS.DIM_HTTP_STATUS_ID
AND USG.AGENT_BOT != 'n.a.'
AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY USG.USER_AGENT_NK
ORDER BY 3 DESC
LIMIT 10
In this case, the result of the analysis was the decision to restrict access to the site by adding the following to the robots.txt file:
User-agent: AhrefsBot
Disallow: /
User-agent: dotbot
Disallow: /
User-agent: bingbot
Crawl-delay: 5
The first two bots disappeared from the table, and the MS bots moved down from the top rows.
Day and time of the most activity
There are ups and downs in traffic. To study them in detail, it helps to single out the time when they occur, and there is no need to display every hour and day on the time axis. This makes it easier to find individual requests in the log file when a detailed analysis is needed.
SQL query report
SELECT
1 AS 'Line: Day and Hour of Hits from Users and Bots',
strftime('%d.%m-%H', datetime(EVENT_DT, 'unixepoch')) AS 'Date Time',
HIB AS 'Bots, Hits',
HIU AS 'Users, Hits'
FROM (
SELECT
EVENT_DT,
SUM(CASE WHEN AGENT_BOT!='n.a.' THEN LINE_CNT ELSE 0 END) AS HIB,
SUM(CASE WHEN AGENT_BOT='n.a.' THEN LINE_CNT ELSE 0 END) AS HIU
FROM FCT_ACCESS_REQUEST_REF_HH
WHERE datetime(EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY EVENT_DT
ORDER BY SUM(LINE_CNT) DESC
LIMIT 10
) ORDER BY EVENT_DT
On the chart we can see that the most active hours of the first day were 11:00, 14:00, and 20:00, while on the next day the bots were active at 13:00.
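To drill down into such a spike, you can query the hourly fact table directly for that particular hour; a sketch (the day-hour value '13.01-13' is hypothetical, and the join to the request dimension mirrors the error report further below):
SELECT
REQ.REQUEST_NK AS 'Request',
SUM(FCT.LINE_CNT) AS 'Hits'
FROM FCT_ACCESS_REQUEST_REF_HH FCT,
DIM_REQUEST_V_ACT REQ
WHERE FCT.DIM_REQUEST_ID = REQ.DIM_REQUEST_ID
AND FCT.AGENT_BOT != 'n.a.' /* bots only */
AND strftime('%d.%m-%H', datetime(FCT.EVENT_DT, 'unixepoch')) = '13.01-13'
GROUP BY REQ.REQUEST_NK
ORDER BY 2 DESC
LIMIT 20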
Average daily user activity by week
We have dealt with activity and traffic a bit; the next question is the activity of the users themselves. For such statistics, large aggregation periods such as a week are desirable.
SQL query report
SELECT
1 as 'Line: Average Daily User Activity by Week',
strftime('%W week', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Week',
ROUND(1.0*SUM(FCT.PAGE_CNT)/SUM(FCT.IP_CNT),1) AS 'Pages per IP per Day',
ROUND(1.0*SUM(FCT.FILE_CNT)/SUM(FCT.IP_CNT),1) AS 'Files per IP per Day'
FROM
FCT_ACCESS_USER_AGENT_DD FCT,
DIM_USER_AGENT USG,
DIM_HTTP_STATUS HST
WHERE FCT.DIM_USER_AGENT_ID=USG.DIM_USER_AGENT_ID
AND FCT.DIM_HTTP_STATUS_ID = HST.DIM_HTTP_STATUS_ID
AND USG.AGENT_BOT='n.a.' /* users only */
AND HST.STATUS_GROUP IN ('Successful') /* good pages */
AND datetime(FCT.EVENT_DT, 'unixepoch') > date('now', '-3 month')
GROUP BY strftime('%W week', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT
The weekly statistics show that on average one user opens 1.6 pages per day. The number of files requested per user depends in this case on whether new files are added to the site.
All requests and their statuses
Webalizer always showed specific page status codes, while I always wanted to see just the number of successful requests and errors.
SQL query report
SELECT
1 as 'Line: All Requests by Status',
strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Day',
SUM(CASE WHEN STS.STATUS_GROUP='Successful' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Success',
SUM(CASE WHEN STS.STATUS_GROUP='Redirection' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Redirect',
SUM(CASE WHEN STS.STATUS_GROUP='Client Error' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Client Error',
SUM(CASE WHEN STS.STATUS_GROUP='Server Error' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Server Error'
FROM
FCT_ACCESS_USER_AGENT_DD FCT,
DIM_HTTP_STATUS STS
WHERE FCT.DIM_HTTP_STATUS_ID=STS.DIM_HTTP_STATUS_ID
AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT
The report displays requests rather than hits: unlike LINE_CNT, the REQUEST_CNT metric is computed as COUNT(DISTINCT STG.REQUEST_NK). The goal is to show effective events; for example, MS bots poll the robots.txt file hundreds of times a day, and in that case such polls are counted once. This smooths out spikes on the chart.
The graph shows a large number of errors; these are requests for pages that no longer exist. The result of this analysis was adding redirects from the removed pages.
Failed Requests
For a closer look at the requests, detailed statistics can be displayed.
SQL query report
SELECT
1 AS 'Table: Top Error Requests',
REQ.REQUEST_NK AS 'Request',
'Error' AS 'Request Status',
ROUND(SUM(FCT.LINE_CNT) / 14.0, 1) AS 'Hits per Day',
ROUND(SUM(FCT.IP_CNT) / 14.0, 1) AS 'IPs per Day',
ROUND(SUM(FCT.BYTES)/1000 / 14.0, 1) AS 'KB per Day'
FROM
FCT_ACCESS_REQUEST_REF_HH FCT,
DIM_REQUEST_V_ACT REQ
WHERE FCT.DIM_REQUEST_ID = REQ.DIM_REQUEST_ID
AND FCT.STATUS_GROUP IN ('Client Error', 'Server Error')
AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY REQ.REQUEST_NK
ORDER BY 4 DESC
LIMIT 20
This list also contains all sorts of calls, for example requests to /wp-login.php. By adjusting the server's request-rewriting rules, you can change how the server reacts to such requests and send them to the start page.
So, a few simple reports based on the server log file give a fairly complete picture of what is happening on the site.
How to get information?
An sqlite database is enough. Let's create the tables: an auxiliary table for logging the ETL processes; a stage table into which the log files are written with PHP; two aggregate tables (a daily one with statistics on user agents and request statuses, and an hourly one with statistics on requests, status groups, and agents); and four corresponding dimension tables.
The result is the following relational model:
Data model
Script to create the objects in the sqlite database:
DDL object creation
DROP TABLE IF EXISTS DIM_USER_AGENT;
CREATE TABLE DIM_USER_AGENT (
DIM_USER_AGENT_ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
USER_AGENT_NK TEXT NOT NULL DEFAULT 'n.a.',
AGENT_OS TEXT NOT NULL DEFAULT 'n.a.',
AGENT_ENGINE TEXT NOT NULL DEFAULT 'n.a.',
AGENT_DEVICE TEXT NOT NULL DEFAULT 'n.a.',
AGENT_BOT TEXT NOT NULL DEFAULT 'n.a.',
UPDATE_DT INTEGER NOT NULL DEFAULT 0,
UNIQUE (USER_AGENT_NK)
);
INSERT INTO DIM_USER_AGENT (DIM_USER_AGENT_ID) VALUES (-1);
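The remaining objects are created in the same way. For illustration, here is a minimal sketch of the stage table that the parsed log lines are written to; the exact column set is an assumption derived from the fields used by the queries below:
DROP TABLE IF EXISTS STG_ACCESS_LOG;
CREATE TABLE STG_ACCESS_LOG (
STG_ACCESS_LOG_ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
IP_NR TEXT NOT NULL DEFAULT 'n.a.', /* client host or IP */
TIME_NK TEXT NOT NULL DEFAULT 'n.a.', /* raw timestamp, e.g. [28/Dec/2012:01:47:47 +0100] */
METHOD TEXT NOT NULL DEFAULT 'n.a.',
REQUEST_NK TEXT NOT NULL DEFAULT 'n.a.', /* requested path */
STATUS TEXT NOT NULL DEFAULT 'n.a.', /* HTTP status code */
BYTES INTEGER NOT NULL DEFAULT 0,
REFERRER_HOST TEXT NOT NULL DEFAULT 'n.a.',
REFERRER_PATH TEXT NOT NULL DEFAULT 'n.a.',
REFERRER_QUERY TEXT NOT NULL DEFAULT 'n.a.',
BROWSER TEXT NOT NULL DEFAULT 'n.a.', /* raw user agent string */
LINE_NK TEXT NOT NULL DEFAULT 'n.a.', /* natural key of the log line */
UPDATE_DT INTEGER NOT NULL DEFAULT 0
);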
Stage
In the case of the access.log file, all records must be read, parsed, and written to the database. This can be done either directly with a scripting language or by using sqlite itself.
Log file format:
//67.221.59.195 - - [28/Dec/2012:01:47:47 +0100] "GET /files/default.css HTTP/1.1" 200 1512 "https://project.edu/" "Mozilla/4.0"
//host ident auth time method request_nk protocol status bytes ref browser
$log_pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9-]+) ([0-9-]+) "(.*)" "(.*)"$/';
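Each matched line can then be written to the stage table with a parameterized insert, roughly like this (the placeholder names are illustrative; the referrer URL is split into host, path, and query by the parsing script):
INSERT INTO STG_ACCESS_LOG (IP_NR, TIME_NK, METHOD, REQUEST_NK, STATUS, BYTES,
REFERRER_HOST, REFERRER_PATH, REFERRER_QUERY, BROWSER, LINE_NK, UPDATE_DT)
VALUES (:ip, :time, :method, :request, :status, :bytes,
:referrer_host, :referrer_path, :referrer_query, :browser, :line_nk, strftime('%s','now'));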
Propagation of keys
Once the raw data is in the database, keys that are not yet present must be written to the dimension tables. Only then can the facts be linked to the dimensions. For example, in the DIM_REFERRER table the key is a combination of three fields.
SQL query propagating keys
/* Propagate the referrer from access log */
INSERT INTO DIM_REFERRER (HOST_NK, PATH_NK, QUERY_NK, UPDATE_DT)
SELECT
CLS.HOST_NK,
CLS.PATH_NK,
CLS.QUERY_NK,
STRFTIME('%s','now') AS UPDATE_DT
FROM (
SELECT DISTINCT
REFERRER_HOST AS HOST_NK,
REFERRER_PATH AS PATH_NK,
CASE WHEN INSTR(REFERRER_QUERY,'&sid')>0 THEN SUBSTR(REFERRER_QUERY, 1, INSTR(REFERRER_QUERY,'&sid')-1) /* cut off the sid parameter (site-specific) */
ELSE REFERRER_QUERY END AS QUERY_NK
FROM STG_ACCESS_LOG
) CLS
LEFT OUTER JOIN DIM_REFERRER TRG
ON (CLS.HOST_NK = TRG.HOST_NK AND CLS.PATH_NK = TRG.PATH_NK AND CLS.QUERY_NK = TRG.QUERY_NK)
WHERE TRG.DIM_REFERRER_ID IS NULL
The propagation into the user agent table can contain bot-detection logic, for example this SQL snippet:
CASE
WHEN INSTR(LOWER(CLS.BROWSER),'yandex.com')>0
THEN 'yandex'
WHEN INSTR(LOWER(CLS.BROWSER),'googlebot')>0
THEN 'google'
WHEN INSTR(LOWER(CLS.BROWSER),'bingbot')>0
THEN 'microsoft'
WHEN INSTR(LOWER(CLS.BROWSER),'ahrefsbot')>0
THEN 'ahrefs'
WHEN INSTR(LOWER(CLS.BROWSER),'mj12bot')>0
THEN 'majestic-12'
WHEN INSTR(LOWER(CLS.BROWSER),'compatible')>0 OR INSTR(LOWER(CLS.BROWSER),'http')>0
OR INSTR(LOWER(CLS.BROWSER),'libwww')>0 OR INSTR(LOWER(CLS.BROWSER),'spider')>0
OR INSTR(LOWER(CLS.BROWSER),'java')>0 OR INSTR(LOWER(CLS.BROWSER),'python')>0
OR INSTR(LOWER(CLS.BROWSER),'robot')>0 OR INSTR(LOWER(CLS.BROWSER),'curl')>0
OR INSTR(LOWER(CLS.BROWSER),'wget')>0
THEN 'other'
ELSE 'n.a.' END AS AGENT_BOT
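A complete propagation query for the user agent table might then look like this; a sketch that follows the DIM_REFERRER pattern above, with the CASE expression abbreviated:
INSERT INTO DIM_USER_AGENT (USER_AGENT_NK, AGENT_BOT, UPDATE_DT)
SELECT
CLS.BROWSER AS USER_AGENT_NK,
CASE WHEN INSTR(LOWER(CLS.BROWSER),'googlebot')>0 THEN 'google'
/* ... remaining WHEN branches from the snippet above ... */
ELSE 'n.a.' END AS AGENT_BOT,
STRFTIME('%s','now') AS UPDATE_DT
FROM (SELECT DISTINCT BROWSER FROM STG_ACCESS_LOG) CLS
LEFT OUTER JOIN DIM_USER_AGENT TRG
ON (CLS.BROWSER = TRG.USER_AGENT_NK)
WHERE TRG.DIM_USER_AGENT_ID IS NULL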
Aggregate tables
Finally, we load the aggregate tables. For example, the daily table can be loaded as follows:
SQL query load aggregate
/* Load fact from access log */
INSERT INTO FCT_ACCESS_USER_AGENT_DD (EVENT_DT, DIM_USER_AGENT_ID, DIM_HTTP_STATUS_ID, PAGE_CNT, FILE_CNT, REQUEST_CNT, LINE_CNT, IP_CNT, BYTES)
WITH STG AS (
SELECT
STRFTIME( '%s', SUBSTR(TIME_NK,9,4) || '-' ||
CASE SUBSTR(TIME_NK,5,3)
WHEN 'Jan' THEN '01' WHEN 'Feb' THEN '02' WHEN 'Mar' THEN '03' WHEN 'Apr' THEN '04' WHEN 'May' THEN '05' WHEN 'Jun' THEN '06'
WHEN 'Jul' THEN '07' WHEN 'Aug' THEN '08' WHEN 'Sep' THEN '09' WHEN 'Oct' THEN '10' WHEN 'Nov' THEN '11'
ELSE '12' END || '-' || SUBSTR(TIME_NK,2,2) || ' 00:00:00' ) AS EVENT_DT,
BROWSER AS USER_AGENT_NK,
REQUEST_NK,
IP_NR,
STATUS,
LINE_NK,
BYTES
FROM STG_ACCESS_LOG
)
SELECT
CAST(STG.EVENT_DT AS INTEGER) AS EVENT_DT,
USG.DIM_USER_AGENT_ID,
HST.DIM_HTTP_STATUS_ID,
COUNT(DISTINCT (CASE WHEN INSTR(STG.REQUEST_NK,'.')=0 THEN STG.REQUEST_NK END) ) AS PAGE_CNT,
COUNT(DISTINCT (CASE WHEN INSTR(STG.REQUEST_NK,'.')>0 THEN STG.REQUEST_NK END) ) AS FILE_CNT,
COUNT(DISTINCT STG.REQUEST_NK) AS REQUEST_CNT,
COUNT(DISTINCT STG.LINE_NK) AS LINE_CNT,
COUNT(DISTINCT STG.IP_NR) AS IP_CNT,
SUM(BYTES) AS BYTES
FROM STG,
DIM_HTTP_STATUS HST,
DIM_USER_AGENT USG
WHERE STG.STATUS = HST.STATUS_NK
AND STG.USER_AGENT_NK = USG.USER_AGENT_NK
AND CAST(STG.EVENT_DT AS INTEGER) > $param_epoch_from /* load epoch date */
AND CAST(STG.EVENT_DT AS INTEGER) < strftime('%s', date('now', 'start of day'))
GROUP BY STG.EVENT_DT, HST.DIM_HTTP_STATUS_ID, USG.DIM_USER_AGENT_ID
The sqlite database allows you to write fairly complex queries. The WITH clause contains the preparation of the data and keys, while the main query joins all the dimension references.
The condition CAST(STG.EVENT_DT AS INTEGER) > $param_epoch_from prevents the history from being loaded again; the parameter is the result of the query
SELECT COALESCE(MAX(EVENT_DT), '3600') AS LAST_EVENT_EPOCH FROM FCT_ACCESS_USER_AGENT_DD
The condition CAST(STG.EVENT_DT AS INTEGER) < strftime('%s', date('now', 'start of day')) ensures that only complete days are loaded.
Pages and files are distinguished in a primitive way, by looking for a dot in the request.
Reports
Complex visualization systems allow you to build a meta-model on top of database objects and to manage filters and aggregation rules dynamically. In the end, all decent tools generate an SQL query.
In this example, we will create ready-made SQL queries and save them as views in the database; these are our reports.
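For example, the report on user and bot traffic from the beginning of the article can be saved as a view; a sketch, where the view name RPT_ACCESS_USER_VS_BOT is taken from the list used for rendering below:
CREATE VIEW RPT_ACCESS_USER_VS_BOT AS
SELECT
1 as 'StackedArea: Traffic generated by Users and Bots',
strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Day',
SUM(CASE WHEN USG.AGENT_BOT!='n.a.' THEN FCT.BYTES ELSE 0 END)/1000 AS 'Bots, KB',
SUM(CASE WHEN USG.AGENT_BOT='n.a.' THEN FCT.BYTES ELSE 0 END)/1000 AS 'Users, KB'
FROM
FCT_ACCESS_USER_AGENT_DD FCT,
DIM_USER_AGENT USG
WHERE FCT.DIM_USER_AGENT_ID=USG.DIM_USER_AGENT_ID
AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT;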
Visualization
Bluff, a JavaScript library for beautiful graphs, was used as the visualization tool.
To do this, you need to run through all the reports with PHP and generate an html file with tables:
$sqls = array(
'SELECT * FROM RPT_ACCESS_USER_VS_BOT',
'SELECT * FROM RPT_ACCESS_ANNOYING_BOT',
'SELECT * FROM RPT_ACCESS_TOP_HOUR_HIT',
'SELECT * FROM RPT_ACCESS_USER_ACTIVE',
'SELECT * FROM RPT_ACCESS_REQUEST_STATUS',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_PAGE',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_REFERRER',
'SELECT * FROM RPT_ACCESS_NEW_REQUEST',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_SUCCESS',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_ERROR'
);
The tool simply renders result tables.
Conclusion
Using web analytics as an example, this article describes the mechanisms needed to build a data warehouse. As the results show, the simplest tools are sufficient for deep analysis and visualization of the data.
In the future, using this warehouse as an example, we will try to implement structures such as slowly changing dimensions, metadata, aggregation levels, and integration of data from different sources.
We will also take a closer look at the simplest tool for managing ETL processes, based on a single table.
We will return to the topic of measuring data quality and automating that process.
We will study the problems of the technical environment and maintenance of data warehouses, implementing a storage server with minimal resources, for example based on a Raspberry Pi.
Source: habr.com