Webalizer and Google Analytics have helped me gain insight into what is happening on websites for many years. Now I realize that they provide very little useful information. With access to your access.log file, it is very easy to work out the statistics yourself using basic tools such as sqlite, html, the SQL language and any scripting language.
The data source for Webalizer is the server's access.log file. This is what its bars and numbers look like, from which only the total volume of traffic is clear:
Tools such as Google Analytics collect data from the loaded page themselves. They show us a couple of charts and lines, from which it is often hard to draw correct conclusions. Maybe more effort should have been put in? I don't know.
So, what did I want to see in the website visit statistics?
User and bot traffic
Site traffic is often limited, and it is necessary to see how much of it is useful traffic. For example, like this:
SQL report query
SELECT
1 as 'StackedArea: Traffic generated by Users and Bots',
strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Day',
SUM(CASE WHEN USG.AGENT_BOT!='n.a.' THEN FCT.BYTES ELSE 0 END)/1000 AS 'Bots, KB',
SUM(CASE WHEN USG.AGENT_BOT='n.a.' THEN FCT.BYTES ELSE 0 END)/1000 AS 'Users, KB'
FROM
FCT_ACCESS_USER_AGENT_DD FCT,
DIM_USER_AGENT USG
WHERE FCT.DIM_USER_AGENT_ID=USG.DIM_USER_AGENT_ID
AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT
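The conditional aggregation used in this report can be tried on a tiny in-memory database. The following is only a sketch using Python's sqlite3 module: the table `demo_fact` and its rows are made up for illustration, and only the convention that `AGENT_BOT = 'n.a.'` marks a human user comes from the article.

```python
import sqlite3

# Hypothetical miniature fact table; not the article's real schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo_fact (day TEXT, agent_bot TEXT, bytes INTEGER)")
con.executemany(
    "INSERT INTO demo_fact VALUES (?, ?, ?)",
    [
        ("01.02", "n.a.", 4000),    # a user
        ("01.02", "google", 1000),  # a bot
        ("01.02", "ahrefs", 2000),  # another bot
        ("02.02", "n.a.", 5000),    # a user
    ],
)

# One pass over the facts, split into two columns by a CASE inside SUM.
traffic_by_day = con.execute(
    """
    SELECT day,
           SUM(CASE WHEN agent_bot != 'n.a.' THEN bytes ELSE 0 END) / 1000 AS bots_kb,
           SUM(CASE WHEN agent_bot  = 'n.a.' THEN bytes ELSE 0 END) / 1000 AS users_kb
    FROM demo_fact
    GROUP BY day
    ORDER BY day
    """
).fetchall()
```

Each row of `traffic_by_day` holds the day, the bot kilobytes and the user kilobytes, which is the shape a stacked area chart needs.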
The chart shows constant bot activity. It would be interesting to study the most active representatives in detail.
Annoying bots
We classify bots based on user agent information. Additional statistics on daily traffic and on the number of successful and unsuccessful requests give a good idea of bot activity.
SQL report query
SELECT
1 AS 'Table: Annoying Bots',
MAX(USG.AGENT_BOT) AS 'Bot',
ROUND(SUM(FCT.BYTES)/1000 / 14.0, 1) AS 'KB per Day',
ROUND(SUM(FCT.IP_CNT) / 14.0, 1) AS 'IPs per Day',
ROUND(SUM(CASE WHEN STS.STATUS_GROUP IN ('Client Error', 'Server Error') THEN FCT.REQUEST_CNT / 14.0 ELSE 0 END), 1) AS 'Error Requests per Day',
ROUND(SUM(CASE WHEN STS.STATUS_GROUP IN ('Successful', 'Redirection') THEN FCT.REQUEST_CNT / 14.0 ELSE 0 END), 1) AS 'Success Requests per Day',
USG.USER_AGENT_NK AS 'Agent'
FROM FCT_ACCESS_USER_AGENT_DD FCT,
DIM_USER_AGENT USG,
DIM_HTTP_STATUS STS
WHERE FCT.DIM_USER_AGENT_ID = USG.DIM_USER_AGENT_ID
AND FCT.DIM_HTTP_STATUS_ID = STS.DIM_HTTP_STATUS_ID
AND USG.AGENT_BOT != 'n.a.'
AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY USG.USER_AGENT_NK
ORDER BY 3 DESC
LIMIT 10
In this case, the analysis led to a decision to restrict access to the site by adding the following entries to the robots.txt file.
User-agent: AhrefsBot
Disallow: /
User-agent: dotbot
Disallow: /
User-agent: bingbot
Crawl-delay: 5
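One way to sanity-check such rules before deploying them is Python's standard `urllib.robotparser`. This is only a sketch: the path `/index.html` is an arbitrary example, and the parser shows how a compliant crawler would read the file, not how any particular bot actually behaves.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt fragment above.
ROBOTS_TXT = """\
User-agent: AhrefsBot
Disallow: /
User-agent: dotbot
Disallow: /
User-agent: bingbot
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "/index.html" is an arbitrary sample path.
ahrefs_allowed = parser.can_fetch("AhrefsBot", "/index.html")
dotbot_allowed = parser.can_fetch("dotbot", "/index.html")
bing_allowed = parser.can_fetch("bingbot", "/index.html")
bing_delay = parser.crawl_delay("bingbot")
```

With these rules the first two bots are denied everything, while bingbot may still crawl but is asked to wait five seconds between requests.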
The first two bots disappeared from the table, and the MS robots moved down from the top rows.
Day and time of the highest activity
Spikes are visible in the traffic. To study them in detail, it is necessary to pin down the time at which they occurred, and there is no need to show every hour and day on the time axis. This makes it easier to find individual requests in the log file if a detailed analysis is needed.
SQL report query
SELECT
1 AS 'Line: Day and Hour of Hits from Users and Bots',
strftime('%d.%m-%H', datetime(EVENT_DT, 'unixepoch')) AS 'Date Time',
HIB AS 'Bots, Hits',
HIU AS 'Users, Hits'
FROM (
SELECT
EVENT_DT,
SUM(CASE WHEN AGENT_BOT!='n.a.' THEN LINE_CNT ELSE 0 END) AS HIB,
SUM(CASE WHEN AGENT_BOT='n.a.' THEN LINE_CNT ELSE 0 END) AS HIU
FROM FCT_ACCESS_REQUEST_REF_HH
WHERE datetime(EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY EVENT_DT
ORDER BY SUM(LINE_CNT) DESC
LIMIT 10
) ORDER BY EVENT_DT
On the first day in the chart we observe the most active hours: 11, 14 and 20. On the next day, however, the bots were active at 13:00.
Average daily user activity by week
We have sorted out activity and traffic a little. The next question was the activity of the users themselves. For such statistics, long aggregation periods, such as a week, are desirable.
SQL report query
SELECT
1 as 'Line: Average Daily User Activity by Week',
strftime('%W week', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Week',
ROUND(1.0*SUM(FCT.PAGE_CNT)/SUM(FCT.IP_CNT),1) AS 'Pages per IP per Day',
ROUND(1.0*SUM(FCT.FILE_CNT)/SUM(FCT.IP_CNT),1) AS 'Files per IP per Day'
FROM
FCT_ACCESS_USER_AGENT_DD FCT,
DIM_USER_AGENT USG,
DIM_HTTP_STATUS HST
WHERE FCT.DIM_USER_AGENT_ID=USG.DIM_USER_AGENT_ID
AND FCT.DIM_HTTP_STATUS_ID = HST.DIM_HTTP_STATUS_ID
AND USG.AGENT_BOT='n.a.' /* users only */
AND HST.STATUS_GROUP IN ('Successful') /* good pages */
AND datetime(FCT.EVENT_DT, 'unixepoch') > date('now', '-3 month')
GROUP BY strftime('%W week', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT
The weekly statistics show that, on average, one user opens 1.6 pages per day. The number of files requested per user in this case depends on new files being added to the site.
All requests and their statuses
Webalizer always showed specific page codes, and I always wanted to see just the number of successful requests and errors.
SQL report query
SELECT
1 as 'Line: All Requests by Status',
strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Day',
SUM(CASE WHEN STS.STATUS_GROUP='Successful' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Success',
SUM(CASE WHEN STS.STATUS_GROUP='Redirection' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Redirect',
SUM(CASE WHEN STS.STATUS_GROUP='Client Error' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Customer Error',
SUM(CASE WHEN STS.STATUS_GROUP='Server Error' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Server Error'
FROM
FCT_ACCESS_USER_AGENT_DD FCT,
DIM_HTTP_STATUS STS
WHERE FCT.DIM_HTTP_STATUS_ID=STS.DIM_HTTP_STATUS_ID
AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT
The report shows requests, not clicks (hits): unlike LINE_CNT, the REQUEST_CNT measure is calculated as COUNT(DISTINCT STG.REQUEST_NK). The aim is to show effective events; for example, MS bots poll the robots.txt file hundreds of times a day and, in this case, such polls are counted once. This smooths out the spikes in the chart.
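The difference between the two measures is easy to reproduce. In this sketch the table `demo_stage` and its contents are invented: one bot polls `/robots.txt` a hundred times and one page is requested once.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical stage table: one row per log line.
con.execute("CREATE TABLE demo_stage (line_nk INTEGER, request_nk TEXT)")

rows = [(i, "/robots.txt") for i in range(1, 101)]  # 100 polls of the same file
rows.append((101, "/index"))                        # one ordinary page request
con.executemany("INSERT INTO demo_stage VALUES (?, ?)", rows)

# LINE_CNT counts every log line, REQUEST_CNT counts each distinct request once.
line_cnt, request_cnt = con.execute(
    "SELECT COUNT(DISTINCT line_nk), COUNT(DISTINCT request_nk) FROM demo_stage"
).fetchone()
```

Here `line_cnt` reflects all 101 hits, while `request_cnt` collapses the repeated polls into a single request.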
The chart shows many errors: these are pages that do not exist. As a result of the analysis, redirects from the deleted pages were added.
Bad requests
To examine the requests more closely, you can display detailed statistics.
SQL report query
SELECT
1 AS 'Table: Top Error Requests',
REQ.REQUEST_NK AS 'Request',
'Error' AS 'Request Status',
ROUND(SUM(FCT.LINE_CNT) / 14.0, 1) AS 'Hits per Day',
ROUND(SUM(FCT.IP_CNT) / 14.0, 1) AS 'IPs per Day',
ROUND(SUM(FCT.BYTES)/1000 / 14.0, 1) AS 'KB per Day'
FROM
FCT_ACCESS_REQUEST_REF_HH FCT,
DIM_REQUEST_V_ACT REQ
WHERE FCT.DIM_REQUEST_ID = REQ.DIM_REQUEST_ID
AND FCT.STATUS_GROUP IN ('Client Error', 'Server Error')
AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY REQ.REQUEST_NK
ORDER BY 4 DESC
LIMIT 20
This list will also contain all the probing calls, for example, a request to /wp-login.php. By adjusting the server's request rewriting rules, you can change how the server reacts to such requests and send them to the start page.
So, a few simple reports based on the server log file give a fairly complete picture of what is happening on the site.
How to get the information?
A sqlite database is enough. Let's create the tables: an auxiliary one for logging ETL processes.
A stage table, into which we will write the log files using PHP. Two aggregate tables: a daily one with statistics on user agents and request statuses, and an hourly one with statistics on requests, status groups and agents. Four tables of the corresponding dimensions.
The result is the following relational model:
Data model
Script for creating an object in a sqlite database:
DDL object creation
DROP TABLE IF EXISTS DIM_USER_AGENT;
CREATE TABLE DIM_USER_AGENT (
DIM_USER_AGENT_ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
USER_AGENT_NK TEXT NOT NULL DEFAULT 'n.a.',
AGENT_OS TEXT NOT NULL DEFAULT 'n.a.',
AGENT_ENGINE TEXT NOT NULL DEFAULT 'n.a.',
AGENT_DEVICE TEXT NOT NULL DEFAULT 'n.a.',
AGENT_BOT TEXT NOT NULL DEFAULT 'n.a.',
UPDATE_DT INTEGER NOT NULL DEFAULT 0,
UNIQUE (USER_AGENT_NK)
);
INSERT INTO DIM_USER_AGENT (DIM_USER_AGENT_ID) VALUES (-1);
Stage
In the case of the access.log file, all requests have to be read, parsed and written to the database. This can be done either directly with a scripting language or with sqlite tools.
Log file format:
//67.221.59.195 - - [28/Dec/2012:01:47:47 +0100] "GET /files/default.css HTTP/1.1" 200 1512 "https://project.edu/" "Mozilla/4.0"
//host ident auth time method request_nk protocol status bytes ref browser
$log_pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9-]+) ([0-9-]+) "(.*)" "(.*)"$/';
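The same parsing step can be sketched in Python. The pattern below is the article's expression with the square brackets around the timestamp escaped; the function name `parse_line` and the `FIELDS` tuple are illustrative names, with the field order taken from the comment line above.

```python
import re

# The article's pattern; the brackets around the timestamp are escaped.
LOG_PATTERN = re.compile(
    r'^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" '
    r'([0-9-]+) ([0-9-]+) "(.*)" "(.*)"$'
)

# Field names follow the comment line in the article.
FIELDS = ("host", "ident", "auth", "time", "method", "request_nk",
          "protocol", "status", "bytes", "ref", "browser")


def parse_line(line):
    """Return a dict of the log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.rstrip("\n"))
    return dict(zip(FIELDS, match.groups())) if match else None


sample = ('67.221.59.195 - - [28/Dec/2012:01:47:47 +0100] '
          '"GET /files/default.css HTTP/1.1" 200 1512 '
          '"https://project.edu/" "Mozilla/4.0"')
record = parse_line(sample)
```

Lines that do not match return `None`, so a loader can skip or log them instead of failing on the first malformed entry. The `time` field keeps its brackets, which the later `SUBSTR` offsets rely on.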
Key propagation
Once the raw data is in the database, the keys that are not there yet have to be written into the dimension tables. After that it becomes possible to build references to the dimensions. For example, in the DIM_REFERRER table, the key is a combination of three fields.
SQL key propagation query
/* Propagate the referrer from access log */
INSERT INTO DIM_REFERRER (HOST_NK, PATH_NK, QUERY_NK, UPDATE_DT)
SELECT
CLS.HOST_NK,
CLS.PATH_NK,
CLS.QUERY_NK,
STRFTIME('%s','now') AS UPDATE_DT
FROM (
SELECT DISTINCT
REFERRER_HOST AS HOST_NK,
REFERRER_PATH AS PATH_NK,
CASE WHEN INSTR(REFERRER_QUERY,'&sid')>0 THEN SUBSTR(REFERRER_QUERY, 1, INSTR(REFERRER_QUERY,'&sid')-1) /* cut off the sid: CMS-specific */
ELSE REFERRER_QUERY END AS QUERY_NK
FROM STG_ACCESS_LOG
) CLS
LEFT OUTER JOIN DIM_REFERRER TRG
ON (CLS.HOST_NK = TRG.HOST_NK AND CLS.PATH_NK = TRG.PATH_NK AND CLS.QUERY_NK = TRG.QUERY_NK)
WHERE TRG.DIM_REFERRER_ID IS NULL
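The LEFT OUTER JOIN ... IS NULL pattern inserts only the keys that the dimension does not contain yet, so the statement can be repeated safely. A minimal sketch with a single-column key follows; `demo_stage` and `demo_dim` are made-up tables, much simpler than DIM_REFERRER.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical stage and dimension tables with a one-field natural key.
con.executescript("""
    CREATE TABLE demo_stage (referrer_host TEXT);
    CREATE TABLE demo_dim (
        dim_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        host_nk TEXT NOT NULL UNIQUE
    );
    INSERT INTO demo_dim (host_nk) VALUES ('habr.com');
    INSERT INTO demo_stage VALUES ('habr.com'), ('example.org'), ('example.org');
""")

# Anti-join: keep only the staged keys with no match in the dimension.
PROPAGATE = """
    INSERT INTO demo_dim (host_nk)
    SELECT cls.host_nk
    FROM (SELECT DISTINCT referrer_host AS host_nk FROM demo_stage) cls
    LEFT OUTER JOIN demo_dim trg ON cls.host_nk = trg.host_nk
    WHERE trg.dim_id IS NULL
"""
COUNT = "SELECT COUNT(*) FROM demo_dim"

con.execute(PROPAGATE)
count_after_first = con.execute(COUNT).fetchone()[0]
con.execute(PROPAGATE)  # second run finds nothing new
count_after_second = con.execute(COUNT).fetchone()[0]
hosts = [r[0] for r in con.execute("SELECT host_nk FROM demo_dim ORDER BY dim_id")]
```

The second run adds nothing, which is what makes the propagation step repeatable after a partial or repeated load.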
Propagation into the user agent table can contain the bot detection logic, for example this SQL snippet:
CASE
WHEN INSTR(LOWER(CLS.BROWSER),'yandex.com')>0
THEN 'yandex'
WHEN INSTR(LOWER(CLS.BROWSER),'googlebot')>0
THEN 'google'
WHEN INSTR(LOWER(CLS.BROWSER),'bingbot')>0
THEN 'microsoft'
WHEN INSTR(LOWER(CLS.BROWSER),'ahrefsbot')>0
THEN 'ahrefs'
WHEN INSTR(LOWER(CLS.BROWSER),'mj12bot')>0
THEN 'majestic-12'
WHEN INSTR(LOWER(CLS.BROWSER),'compatible')>0 OR INSTR(LOWER(CLS.BROWSER),'http')>0
OR INSTR(LOWER(CLS.BROWSER),'libwww')>0 OR INSTR(LOWER(CLS.BROWSER),'spider')>0
OR INSTR(LOWER(CLS.BROWSER),'java')>0 OR INSTR(LOWER(CLS.BROWSER),'python')>0
OR INSTR(LOWER(CLS.BROWSER),'robot')>0 OR INSTR(LOWER(CLS.BROWSER),'curl')>0
OR INSTR(LOWER(CLS.BROWSER),'wget')>0
THEN 'other'
ELSE 'n.a.' END AS AGENT_BOT
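For experimenting with the rules outside the database, the same CASE logic can be mirrored in a few lines of Python. The function name `classify_agent` and the sample user agent strings are illustrative; the markers and labels are copied from the snippet above.

```python
# (marker, label) pairs checked in order, as in the CASE expression.
NAMED_BOTS = (
    ("yandex.com", "yandex"),
    ("googlebot", "google"),
    ("bingbot", "microsoft"),
    ("ahrefsbot", "ahrefs"),
    ("mj12bot", "majestic-12"),
)

# Generic markers that map to the catch-all label 'other'.
GENERIC_MARKERS = ("compatible", "http", "libwww", "spider", "java",
                   "python", "robot", "curl", "wget")


def classify_agent(browser):
    """Return the bot label for a user agent string, or 'n.a.' for users."""
    agent = browser.lower()
    for marker, label in NAMED_BOTS:
        if marker in agent:
            return label
    if any(marker in agent for marker in GENERIC_MARKERS):
        return "other"
    return "n.a."


bing = classify_agent("Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)")
tool = classify_agent("curl/7.68.0")
user = classify_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
old_ie = classify_agent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)")
```

The order matters: named bots are tested before the generic markers, otherwise bingbot would fall into 'other' because its string also contains 'compatible' and 'http'. The substring checks are coarse, so an old browser that announces itself as 'compatible' is also labelled 'other'.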
Aggregate tables
Finally, we load the aggregate tables; for example, the daily table can be loaded as follows:
SQL query for loading the aggregate
/* Load fact from access log */
INSERT INTO FCT_ACCESS_USER_AGENT_DD (EVENT_DT, DIM_USER_AGENT_ID, DIM_HTTP_STATUS_ID, PAGE_CNT, FILE_CNT, REQUEST_CNT, LINE_CNT, IP_CNT, BYTES)
WITH STG AS (
SELECT
STRFTIME( '%s', SUBSTR(TIME_NK,9,4) || '-' ||
CASE SUBSTR(TIME_NK,5,3)
WHEN 'Jan' THEN '01' WHEN 'Feb' THEN '02' WHEN 'Mar' THEN '03' WHEN 'Apr' THEN '04' WHEN 'May' THEN '05' WHEN 'Jun' THEN '06'
WHEN 'Jul' THEN '07' WHEN 'Aug' THEN '08' WHEN 'Sep' THEN '09' WHEN 'Oct' THEN '10' WHEN 'Nov' THEN '11'
ELSE '12' END || '-' || SUBSTR(TIME_NK,2,2) || ' 00:00:00' ) AS EVENT_DT,
BROWSER AS USER_AGENT_NK,
REQUEST_NK,
IP_NR,
STATUS,
LINE_NK,
BYTES
FROM STG_ACCESS_LOG
)
SELECT
CAST(STG.EVENT_DT AS INTEGER) AS EVENT_DT,
USG.DIM_USER_AGENT_ID,
HST.DIM_HTTP_STATUS_ID,
COUNT(DISTINCT (CASE WHEN INSTR(STG.REQUEST_NK,'.')=0 THEN STG.REQUEST_NK END) ) AS PAGE_CNT,
COUNT(DISTINCT (CASE WHEN INSTR(STG.REQUEST_NK,'.')>0 THEN STG.REQUEST_NK END) ) AS FILE_CNT,
COUNT(DISTINCT STG.REQUEST_NK) AS REQUEST_CNT,
COUNT(DISTINCT STG.LINE_NK) AS LINE_CNT,
COUNT(DISTINCT STG.IP_NR) AS IP_CNT,
SUM(BYTES) AS BYTES
FROM STG,
DIM_HTTP_STATUS HST,
DIM_USER_AGENT USG
WHERE STG.STATUS = HST.STATUS_NK
AND STG.USER_AGENT_NK = USG.USER_AGENT_NK
AND CAST(STG.EVENT_DT AS INTEGER) > $param_epoch_from /* load epoch date */
AND CAST(STG.EVENT_DT AS INTEGER) < strftime('%s', date('now', 'start of day'))
GROUP BY STG.EVENT_DT, HST.DIM_HTTP_STATUS_ID, USG.DIM_USER_AGENT_ID
The sqlite database lets you write complex queries. The WITH clause contains the preparation of data and keys. The main query collects all references to the dimensions.
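The date arithmetic in the WITH clause can be checked in isolation. The sketch below runs the same expression against one sample timestamp through Python's sqlite3 module; the bound parameter `:t` stands in for the TIME_NK column and is an assumption of this example.

```python
import sqlite3
from datetime import datetime, timezone

# The article's expression with TIME_NK replaced by the parameter :t.
DAY_EPOCH_SQL = """
    SELECT CAST(STRFTIME('%s',
        SUBSTR(:t, 9, 4) || '-' ||
        CASE SUBSTR(:t, 5, 3)
            WHEN 'Jan' THEN '01' WHEN 'Feb' THEN '02' WHEN 'Mar' THEN '03'
            WHEN 'Apr' THEN '04' WHEN 'May' THEN '05' WHEN 'Jun' THEN '06'
            WHEN 'Jul' THEN '07' WHEN 'Aug' THEN '08' WHEN 'Sep' THEN '09'
            WHEN 'Oct' THEN '10' WHEN 'Nov' THEN '11'
            ELSE '12' END
        || '-' || SUBSTR(:t, 2, 2) || ' 00:00:00') AS INTEGER)
"""

con = sqlite3.connect(":memory:")
day_epoch = con.execute(
    DAY_EPOCH_SQL, {"t": "[28/Dec/2012:01:47:47 +0100]"}
).fetchone()[0]

# Cross-check: midnight of the same calendar date, taken as UTC.
expected = int(datetime(2012, 12, 28, tzinfo=timezone.utc).timestamp())
```

The offsets 2, 5 and 9 assume that the timestamp still carries its leading bracket. The time of day and the +0100 offset are dropped, so every request is bucketed by the calendar date written in the log.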
The first condition prevents the history from being loaded again: CAST(STG.EVENT_DT AS INTEGER) > $param_epoch_from, where the parameter is the result of the query
'SELECT COALESCE(MAX(EVENT_DT), \'3600\') AS LAST_EVENT_EPOCH FROM FCT_ACCESS_USER_AGENT_DD'
The second condition loads only complete days: CAST(STG.EVENT_DT AS INTEGER) < strftime('%s', date('now', 'start of day'))
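Together the two conditions define a load window: after the last loaded day and before the start of today. A small sketch of that window follows; `demo_fact` and `should_load` are hypothetical names, and the default is written as the integer 3600 rather than the string used in the article.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical fact table holding only the day epoch.
con.execute("CREATE TABLE demo_fact (event_dt INTEGER)")

WATERMARK = "SELECT COALESCE(MAX(event_dt), 3600) FROM demo_fact"

# Empty table: COALESCE supplies the default, so the first load takes everything.
empty_watermark = con.execute(WATERMARK).fetchone()[0]

con.executemany("INSERT INTO demo_fact VALUES (?)", [(1356566400,), (1356652800,)])
loaded_watermark = con.execute(WATERMARK).fetchone()[0]

# Upper bound: midnight of the current day, so the unfinished day is skipped.
today_start = con.execute(
    "SELECT CAST(strftime('%s', date('now', 'start of day')) AS INTEGER)"
).fetchone()[0]


def should_load(event_dt, epoch_from, epoch_to):
    """True if a day epoch lies strictly inside the load window."""
    return epoch_from < event_dt < epoch_to
```

A day that equals the watermark is rejected, so re-running the load does not duplicate days that are already in the aggregate table.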
Pages and files are told apart in a primitive way, by looking for a dot in the request.
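The dot heuristic is simple enough to show in a few lines. The function name `is_file_request` and the sample requests are invented for illustration; the rule itself mirrors `INSTR(REQUEST_NK, '.') > 0`.

```python
def is_file_request(request_nk):
    """Mirror INSTR(REQUEST_NK, '.') > 0: any dot marks the request as a file."""
    return "." in request_nk


# Made-up sample requests.
requests = ["/", "/about", "/files/default.css", "/robots.txt", "/search?q=v1.2"]
pages = [r for r in requests if not is_file_request(r)]
files = [r for r in requests if is_file_request(r)]
```

The last sample shows the limit of the approach: a page whose query string happens to contain a dot is counted as a file.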
Reports
In complex visualization systems, it is possible to create a meta-model based on database objects and to manage filters and aggregation rules dynamically. In the end, every decent tool generates an SQL query.
In this example, we will create ready-made SQL queries and save them as views in the database: these are the reports.
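Saving a report as a view might look like the sketch below. The view name `RPT_DEMO_REQUESTS` only imitates the article's RPT_ naming, and the table `demo_fact` with its rows is made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical fact table plus a report stored as a view on top of it.
con.executescript("""
    CREATE TABLE demo_fact (day TEXT, request_cnt INTEGER);
    INSERT INTO demo_fact VALUES ('01.02', 10), ('01.02', 5), ('02.02', 7);

    CREATE VIEW RPT_DEMO_REQUESTS AS
        SELECT day, SUM(request_cnt) AS requests
        FROM demo_fact
        GROUP BY day
        ORDER BY day;
""")

# A consumer needs to know only the view name, not the query behind it.
cursor = con.execute("SELECT * FROM RPT_DEMO_REQUESTS")
columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()
```

Because the view is evaluated on every read, the report always reflects the current contents of the aggregate table.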
Visualization
Bluff (beautiful graphs in JavaScript) was used as the visualization tool.
To do this, it was necessary to go through all the reports using PHP and generate an html file with tables.
$sqls = array(
'SELECT * FROM RPT_ACCESS_USER_VS_BOT',
'SELECT * FROM RPT_ACCESS_ANNOYING_BOT',
'SELECT * FROM RPT_ACCESS_TOP_HOUR_HIT',
'SELECT * FROM RPT_ACCESS_USER_ACTIVE',
'SELECT * FROM RPT_ACCESS_REQUEST_STATUS',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_PAGE',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_REFERRER',
'SELECT * FROM RPT_ACCESS_NEW_REQUEST',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_SUCCESS',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_ERROR'
);
The tool simply visualizes the result tables.
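The article does this step in PHP; the loop itself is small, and a rough Python equivalent is sketched here. The helper `render_table`, the view `RPT_DEMO_BOT` and its data are assumptions of this example, and the sketch stops at plain html tables without the charting layer.

```python
import sqlite3
from html import escape


def render_table(con, sql):
    """Run one report query and return its result as an html table."""
    cursor = con.execute(sql)
    header = "".join(f"<th>{escape(str(d[0]))}</th>" for d in cursor.description)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
        for row in cursor
    )
    return f"<table><tr>{header}</tr>{body}</table>"


con = sqlite3.connect(":memory:")
# Hypothetical report view standing in for the RPT_ACCESS_* views.
con.executescript("""
    CREATE TABLE demo_fact (bot TEXT, kb INTEGER);
    INSERT INTO demo_fact VALUES ('google', 12), ('<unknown>', 3);
    CREATE VIEW RPT_DEMO_BOT AS SELECT bot, kb FROM demo_fact ORDER BY kb DESC;
""")

sqls = ["SELECT * FROM RPT_DEMO_BOT"]
page = "<html><body>" + "".join(render_table(con, s) for s in sqls) + "</body></html>"
```

Values are escaped before they are written into the page, since user agents and request paths come straight from the log and may contain markup.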
Conclusion
Using web analytics as an example, the article describes the mechanisms needed to build a data warehouse. As the results show, the simplest tools are enough for in-depth analysis and visualization of data.
In the future, using this repository as an example, we will try to implement such structures as slowly changing dimensions, metadata, aggregation levels and the integration of data from different sources.
We will also take a closer look at the simplest tool for managing ETL processes based on a single table.
We will return to the topic of measuring data quality and automating this process.
We will study the problems of the technical environment and the maintenance of data warehouses, for which we will implement a storage server with minimal resources, for example one based on a Raspberry Pi.
Source: www.habr.com