Site statistics and your own small data warehouse

For many years, Webalizer and Google Analytics helped me gain insight into what was happening on my websites. Now I realize that they provide very little useful information. If you have access to your access.log file, it is easy to work out the statistics yourself with quite basic tools: sqlite, html, the sql language and any scripting language.

The data source for Webalizer is the server's access.log file. From its bars and numbers, only the total volume of traffic is clear.

Tools like Google Analytics collect their data from the loaded page itself. They show us a couple of diagrams and lines, from which it is often hard to draw correct conclusions. Perhaps more effort was needed? I don't know.

So, what did I want to see in the website visitor statistics?

User and bot traffic

Site traffic is often limited, so it is necessary to see how much of it is useful traffic.


SQL report query

SELECT
1 as 'StackedArea: Traffic generated by Users and Bots',
strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Day',
SUM(CASE WHEN USG.AGENT_BOT!='n.a.' THEN FCT.BYTES ELSE 0 END)/1000 AS 'Bots, KB',
SUM(CASE WHEN USG.AGENT_BOT='n.a.' THEN FCT.BYTES ELSE 0 END)/1000 AS 'Users, KB'
FROM
  FCT_ACCESS_USER_AGENT_DD FCT,
  DIM_USER_AGENT USG
WHERE FCT.DIM_USER_AGENT_ID=USG.DIM_USER_AGENT_ID
  AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT

The graph shows constant bot activity. It would be interesting to study the most active representatives in detail.

Annoying bots

We classify bots based on the user agent information. Additional statistics on daily traffic and on the number of successful and unsuccessful requests give a good idea of bot activity.


SQL report query

SELECT 
1 AS 'Table: Annoying Bots',
MAX(USG.AGENT_BOT) AS 'Bot',
ROUND(SUM(FCT.BYTES)/1000 / 14.0, 1) AS 'KB per Day',
ROUND(SUM(FCT.IP_CNT) / 14.0, 1) AS 'IPs per Day',
ROUND(SUM(CASE WHEN STS.STATUS_GROUP IN ('Client Error', 'Server Error') THEN FCT.REQUEST_CNT / 14.0 ELSE 0 END), 1) AS 'Error Requests per Day',
ROUND(SUM(CASE WHEN STS.STATUS_GROUP IN ('Successful', 'Redirection') THEN FCT.REQUEST_CNT / 14.0 ELSE 0 END), 1) AS 'Success Requests per Day',
USG.USER_AGENT_NK AS 'Agent'
FROM FCT_ACCESS_USER_AGENT_DD FCT,
     DIM_USER_AGENT USG,
     DIM_HTTP_STATUS STS
WHERE FCT.DIM_USER_AGENT_ID = USG.DIM_USER_AGENT_ID
  AND FCT.DIM_HTTP_STATUS_ID = STS.DIM_HTTP_STATUS_ID
  AND USG.AGENT_BOT != 'n.a.'
  AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY USG.USER_AGENT_NK
ORDER BY 3 DESC
LIMIT 10

In this case, the result of the analysis was the decision to restrict access to the site by adding the bots to the robots.txt file.

User-agent: AhrefsBot
Disallow: /
User-agent: dotbot
Disallow: /
User-agent: bingbot
Crawl-delay: 5

The first two bots disappeared from the table, and the MS robots moved down from the top rows.
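
The robots.txt rules above can be sanity-checked before deployment. The sketch below is an illustration that is not part of the original setup: it feeds the same lines to Python's standard `urllib.robotparser` and shows how a compliant crawler would read them. The variable names are made up for the example, and crawlers that ignore robots.txt are of course unaffected.

```python
from urllib.robotparser import RobotFileParser

# The rules from the article, exactly as they appear in robots.txt.
ROBOTS_TXT = """\
User-agent: AhrefsBot
Disallow: /
User-agent: dotbot
Disallow: /
User-agent: bingbot
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Fully blocked crawlers may not fetch anything.
ahrefs_allowed = parser.can_fetch("AhrefsBot", "/")
dotbot_allowed = parser.can_fetch("dotbot", "/any/page")
# bingbot is still allowed, but is asked to pause between requests.
bing_allowed = parser.can_fetch("bingbot", "/")
bing_delay = parser.crawl_delay("bingbot")

print(ahrefs_allowed, dotbot_allowed, bing_allowed, bing_delay)
```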

Day and hour of greatest activity

Spikes are visible in the traffic. To study them in detail, you need to pinpoint when they occurred; it is not necessary to display every hour and day of the time dimension. This makes it easier to find the individual requests in the log file if a detailed analysis is needed.


SQL report query

SELECT
1 AS 'Line: Day and Hour of Hits from Users and Bots',
strftime('%d.%m-%H', datetime(EVENT_DT, 'unixepoch')) AS 'Date Time',
HIB AS 'Bots, Hits',
HIU AS 'Users, Hits'
FROM (
	SELECT
	EVENT_DT,
	SUM(CASE WHEN AGENT_BOT!='n.a.' THEN LINE_CNT ELSE 0 END) AS HIB,
	SUM(CASE WHEN AGENT_BOT='n.a.' THEN LINE_CNT ELSE 0 END) AS HIU
	FROM FCT_ACCESS_REQUEST_REF_HH
	WHERE datetime(EVENT_DT, 'unixepoch') >= date('now', '-14 day')
	GROUP BY EVENT_DT
	ORDER BY SUM(LINE_CNT) DESC
	LIMIT 10
) ORDER BY EVENT_DT

On the chart, the most active hours on the first day are 11, 14 and 20. But the next day, at 13:00, it was the bots that were active.

Average daily user activity by week

We have sorted out activity and traffic a little. The next question was the activity of the users themselves. For such statistics, long aggregation periods, such as a week, are desirable.


SQL report query

SELECT
1 as 'Line: Average Daily User Activity by Week',
strftime('%W week', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Week',
ROUND(1.0*SUM(FCT.PAGE_CNT)/SUM(FCT.IP_CNT),1) AS 'Pages per IP per Day',
ROUND(1.0*SUM(FCT.FILE_CNT)/SUM(FCT.IP_CNT),1) AS 'Files per IP per Day'
FROM
  FCT_ACCESS_USER_AGENT_DD FCT,
  DIM_USER_AGENT USG,
  DIM_HTTP_STATUS HST
WHERE FCT.DIM_USER_AGENT_ID=USG.DIM_USER_AGENT_ID
  AND FCT.DIM_HTTP_STATUS_ID = HST.DIM_HTTP_STATUS_ID
  AND USG.AGENT_BOT='n.a.' /* users only */
  AND HST.STATUS_GROUP IN ('Successful') /* good pages */
  AND datetime(FCT.EVENT_DT, 'unixepoch') > date('now', '-3 month')
GROUP BY strftime('%W week', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT

The weekly statistics show that, on average, one user opens 1.6 pages per day. The number of files requested per user in this case depends on new files being added to the site.

All requests and their statuses

Webalizer always showed specific page codes, while I always wanted to see just the number of successful requests and errors.


SQL report query

SELECT
1 as 'Line: All Requests by Status',
strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Day',
SUM(CASE WHEN STS.STATUS_GROUP='Successful' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Success',
SUM(CASE WHEN STS.STATUS_GROUP='Redirection' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Redirect',
SUM(CASE WHEN STS.STATUS_GROUP='Client Error' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Client Error',
SUM(CASE WHEN STS.STATUS_GROUP='Server Error' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Server Error'
FROM
  FCT_ACCESS_USER_AGENT_DD FCT,
  DIM_HTTP_STATUS STS
WHERE FCT.DIM_HTTP_STATUS_ID=STS.DIM_HTTP_STATUS_ID
  AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT

The report shows requests, not clicks (hits): unlike LINE_CNT, the REQUEST_CNT measure is calculated as COUNT(DISTINCT STG.REQUEST_NK). The goal is to show effective events. For example, MS bots poll the robots.txt file hundreds of times a day, and in this case such polls are counted once. This smooths out the jumps in the graph.
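
To make the difference between the two measures concrete, here is a minimal sketch using Python's `sqlite3` module and a cut-down, hypothetical stage table with only the two columns involved. It is not the article's loader, only an illustration of how COUNT(DISTINCT ...) collapses repeated polls of one file.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical, cut-down stage table: one row per access.log line.
con.execute("CREATE TABLE STG_ACCESS_LOG (LINE_NK TEXT, REQUEST_NK TEXT)")

# A bot polling the same file 300 times, plus two ordinary page requests.
rows = [(f"line-{i}", "/robots.txt") for i in range(300)]
rows += [("line-300", "/"), ("line-301", "/about")]
con.executemany("INSERT INTO STG_ACCESS_LOG VALUES (?, ?)", rows)

line_cnt, request_cnt = con.execute(
    """
    SELECT COUNT(DISTINCT LINE_NK)    AS LINE_CNT,    -- hits: every log line
           COUNT(DISTINCT REQUEST_NK) AS REQUEST_CNT  -- distinct requests only
    FROM STG_ACCESS_LOG
    """
).fetchone()

print(line_cnt, request_cnt)
```

In the real load query the counts are taken per group (day, user agent, status), so a repeated request is collapsed within each group rather than across the whole table.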

The graph shows many errors: these are non-existent pages. The result of the analysis was to add redirects from the removed pages.

Bad requests

To examine the requests in detail, you can display detailed statistics.


SQL report query

SELECT
  1 AS 'Table: Top Error Requests',
  REQ.REQUEST_NK AS 'Request',
  'Error' AS 'Request Status',
  ROUND(SUM(FCT.LINE_CNT) / 14.0, 1) AS 'Hits per Day',
  ROUND(SUM(FCT.IP_CNT) / 14.0, 1) AS 'IPs per Day',
  ROUND(SUM(FCT.BYTES)/1000 / 14.0, 1) AS 'KB per Day'
FROM
  FCT_ACCESS_REQUEST_REF_HH FCT,
  DIM_REQUEST_V_ACT REQ
WHERE FCT.DIM_REQUEST_ID = REQ.DIM_REQUEST_ID
  AND FCT.STATUS_GROUP IN ('Client Error', 'Server Error')
  AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY REQ.REQUEST_NK
ORDER BY 4 DESC
LIMIT 20

This list will also contain all kinds of probing calls, for example a request to /wp-login.php. By adjusting the server's request rewrite rules, you can change how the server reacts to such requests and send them to the start page.

So, a few simple reports based on the server log file give a fairly complete picture of what is happening on the site.

How do we get the information?

An sqlite database is enough. Let's create the tables, starting with an auxiliary one for logging the ETL processes.


Next comes a stage table, into which we will write the log files using PHP. Then two aggregate tables: a daily one with statistics on user agents and request statuses, and an hourly one with statistics on requests, status groups and agents. Finally, four tables for the corresponding dimensions.

The result is the following relational model:

[Figure: data model]

Script for creating an object in an sqlite database:

DDL object creation

DROP TABLE IF EXISTS DIM_USER_AGENT;
CREATE TABLE DIM_USER_AGENT (
  DIM_USER_AGENT_ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
  USER_AGENT_NK     TEXT NOT NULL DEFAULT 'n.a.',
  AGENT_OS          TEXT NOT NULL DEFAULT 'n.a.',
  AGENT_ENGINE      TEXT NOT NULL DEFAULT 'n.a.',
  AGENT_DEVICE      TEXT NOT NULL DEFAULT 'n.a.',
  AGENT_BOT         TEXT NOT NULL DEFAULT 'n.a.',
  UPDATE_DT         INTEGER NOT NULL DEFAULT 0,
  UNIQUE (USER_AGENT_NK)
);
INSERT INTO DIM_USER_AGENT (DIM_USER_AGENT_ID) VALUES (-1);

Stage

In the case of the access.log file, all requests have to be read, parsed and written to the database. This can be done either directly with a scripting language or with sqlite tools.

Log file format:

//67.221.59.195 - - [28/Dec/2012:01:47:47 +0100] "GET /files/default.css HTTP/1.1" 200 1512 "https://project.edu/" "Mozilla/4.0"
//host ident auth time method request_nk protocol status bytes ref browser
$log_pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) ([0-9\-]+) "(.*)" "(.*)"$/';
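
As an illustration of the parsing step, the sketch below applies the same pattern with Python's `re` module to the sample line from the comment above. The field names are taken from that comment; the function name `parse_line` is made up for this example, and the original article does this step in PHP.

```python
import re

# Same pattern as $log_pattern, with the time field kept in its brackets.
LOG_PATTERN = re.compile(
    r'^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" '
    r'([0-9\-]+) ([0-9\-]+) "(.*)" "(.*)"$'
)
FIELDS = ("host", "ident", "auth", "time", "method", "request_nk",
          "protocol", "status", "bytes", "ref", "browser")


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.rstrip("\n"))
    return dict(zip(FIELDS, match.groups())) if match else None


sample = ('67.221.59.195 - - [28/Dec/2012:01:47:47 +0100] '
          '"GET /files/default.css HTTP/1.1" 200 1512 '
          '"https://project.edu/" "Mozilla/4.0"')
record = parse_line(sample)
print(record)
```

Lines that do not match return None, so they can be counted or logged separately instead of silently breaking the load.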

Key propagation

Once the raw data is in the database, you need to write the keys that are not yet there into the dimension tables. After that it becomes possible to build references to the dimensions. For example, in the DIM_REFERRER table the key is a combination of three fields.

SQL key propagation query

/* Propagate the referrer from access log */
INSERT INTO DIM_REFERRER (HOST_NK, PATH_NK, QUERY_NK, UPDATE_DT)
SELECT
	CLS.HOST_NK,
	CLS.PATH_NK,
	CLS.QUERY_NK,
	STRFTIME('%s','now') AS UPDATE_DT
FROM (
	SELECT DISTINCT
	REFERRER_HOST AS HOST_NK,
	REFERRER_PATH AS PATH_NK,
	CASE WHEN INSTR(REFERRER_QUERY,'&sid')>0 THEN SUBSTR(REFERRER_QUERY, 1, INSTR(REFERRER_QUERY,'&sid')-1) /* cut off sid, a CMS-specific parameter */
	ELSE REFERRER_QUERY END AS QUERY_NK
	FROM STG_ACCESS_LOG
) CLS
LEFT OUTER JOIN DIM_REFERRER TRG
ON (CLS.HOST_NK = TRG.HOST_NK AND CLS.PATH_NK = TRG.PATH_NK AND CLS.QUERY_NK = TRG.QUERY_NK)
WHERE TRG.DIM_REFERRER_ID IS NULL
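
The propagation query above is an anti-join: only keys that have no match in the dimension are inserted, so it can be rerun safely. The sketch below shows this behaviour on a simplified, hypothetical version of the tables (a two-field key instead of three, and no UPDATE_DT), using Python's `sqlite3` module.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE STG_ACCESS_LOG (REFERRER_HOST TEXT, REFERRER_PATH TEXT);
-- Simplified two-field natural key; the article's DIM_REFERRER uses three.
CREATE TABLE DIM_REFERRER (
  DIM_REFERRER_ID INTEGER PRIMARY KEY AUTOINCREMENT,
  HOST_NK TEXT NOT NULL DEFAULT 'n.a.',
  PATH_NK TEXT NOT NULL DEFAULT 'n.a.'
);
""")

PROPAGATE = """
INSERT INTO DIM_REFERRER (HOST_NK, PATH_NK)
SELECT CLS.HOST_NK, CLS.PATH_NK
FROM (SELECT DISTINCT REFERRER_HOST AS HOST_NK, REFERRER_PATH AS PATH_NK
      FROM STG_ACCESS_LOG) CLS
LEFT OUTER JOIN DIM_REFERRER TRG
  ON (CLS.HOST_NK = TRG.HOST_NK AND CLS.PATH_NK = TRG.PATH_NK)
WHERE TRG.DIM_REFERRER_ID IS NULL
"""


def dim_size():
    return con.execute("SELECT COUNT(*) FROM DIM_REFERRER").fetchone()[0]


con.executemany(
    "INSERT INTO STG_ACCESS_LOG VALUES (?, ?)",
    [("project.edu", "/"), ("project.edu", "/"), ("example.org", "/a")],
)
con.execute(PROPAGATE)
size_first = dim_size()   # two distinct keys are inserted
con.execute(PROPAGATE)
size_second = dim_size()  # the rerun adds nothing
con.execute("INSERT INTO STG_ACCESS_LOG VALUES ('example.org', '/b')")
con.execute(PROPAGATE)
size_third = dim_size()   # only the new key is added

print(size_first, size_second, size_third)
```

Because the join compares the key fields with `=`, a NULL in any of them would never match and the row would be inserted again on every run; the sketch sidesteps this by using only non-NULL values.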

The propagation into the user agent table can contain the bot logic, for example this sql snippet:


CASE
WHEN INSTR(LOWER(CLS.BROWSER),'yandex.com')>0
	THEN 'yandex'
WHEN INSTR(LOWER(CLS.BROWSER),'googlebot')>0
	THEN 'google'
WHEN INSTR(LOWER(CLS.BROWSER),'bingbot')>0
	THEN 'microsoft'
WHEN INSTR(LOWER(CLS.BROWSER),'ahrefsbot')>0
	THEN 'ahrefs'
WHEN INSTR(LOWER(CLS.BROWSER),'mj12bot')>0
	THEN 'majestic-12'
WHEN INSTR(LOWER(CLS.BROWSER),'compatible')>0 OR INSTR(LOWER(CLS.BROWSER),'http')>0
	OR INSTR(LOWER(CLS.BROWSER),'libwww')>0 OR INSTR(LOWER(CLS.BROWSER),'spider')>0
	OR INSTR(LOWER(CLS.BROWSER),'java')>0 OR INSTR(LOWER(CLS.BROWSER),'python')>0
	OR INSTR(LOWER(CLS.BROWSER),'robot')>0 OR INSTR(LOWER(CLS.BROWSER),'curl')>0
	OR INSTR(LOWER(CLS.BROWSER),'wget')>0
	THEN 'other'
ELSE 'n.a.' END AS AGENT_BOT
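
For experimenting with the rules outside the database, the same CASE logic can be mirrored in a few lines of Python. This is only a sketch: the function and list names are invented for the example, while the substrings and labels are those of the snippet above.

```python
# Named bots, checked in order, like the WHEN branches of the CASE.
BOT_MARKERS = [
    ("yandex.com", "yandex"),
    ("googlebot", "google"),
    ("bingbot", "microsoft"),
    ("ahrefsbot", "ahrefs"),
    ("mj12bot", "majestic-12"),
]
# Generic hints that the agent is some other kind of bot or script.
GENERIC_MARKERS = ("compatible", "http", "libwww", "spider", "java",
                   "python", "robot", "curl", "wget")


def classify_bot(user_agent):
    """Return the bot label for a user agent string, or 'n.a.' for users."""
    ua = user_agent.lower()
    for marker, name in BOT_MARKERS:
        if marker in ua:
            return name
    if any(marker in ua for marker in GENERIC_MARKERS):
        return "other"
    return "n.a."


print(classify_bot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(classify_bot("curl/7.68.0"))
print(classify_bot("Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0"))
```

One side effect of the generic rule is worth knowing: any agent string containing "compatible", such as old Internet Explorer versions, is classified as 'other' and therefore counted as a bot.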

Aggregate tables

Finally, we load the aggregate tables; for example, the daily table can be loaded as follows:

SQL query for loading the aggregate

/* Load fact from access log */
INSERT INTO FCT_ACCESS_USER_AGENT_DD (EVENT_DT, DIM_USER_AGENT_ID, DIM_HTTP_STATUS_ID, PAGE_CNT, FILE_CNT, REQUEST_CNT, LINE_CNT, IP_CNT, BYTES)
WITH STG AS (
SELECT
	STRFTIME( '%s', SUBSTR(TIME_NK,9,4) || '-' ||
	CASE SUBSTR(TIME_NK,5,3)
	WHEN 'Jan' THEN '01' WHEN 'Feb' THEN '02' WHEN 'Mar' THEN '03' WHEN 'Apr' THEN '04' WHEN 'May' THEN '05' WHEN 'Jun' THEN '06'
	WHEN 'Jul' THEN '07' WHEN 'Aug' THEN '08' WHEN 'Sep' THEN '09' WHEN 'Oct' THEN '10' WHEN 'Nov' THEN '11'
	ELSE '12' END || '-' || SUBSTR(TIME_NK,2,2) || ' 00:00:00' ) AS EVENT_DT,
	BROWSER AS USER_AGENT_NK,
	REQUEST_NK,
	IP_NR,
	STATUS,
	LINE_NK,
	BYTES
FROM STG_ACCESS_LOG
)
SELECT
	CAST(STG.EVENT_DT AS INTEGER) AS EVENT_DT,
	USG.DIM_USER_AGENT_ID,
	HST.DIM_HTTP_STATUS_ID,
	COUNT(DISTINCT (CASE WHEN INSTR(STG.REQUEST_NK,'.')=0 THEN STG.REQUEST_NK END) ) AS PAGE_CNT,
	COUNT(DISTINCT (CASE WHEN INSTR(STG.REQUEST_NK,'.')>0 THEN STG.REQUEST_NK END) ) AS FILE_CNT,
	COUNT(DISTINCT STG.REQUEST_NK) AS REQUEST_CNT,
	COUNT(DISTINCT STG.LINE_NK) AS LINE_CNT,
	COUNT(DISTINCT STG.IP_NR) AS IP_CNT,
	SUM(BYTES) AS BYTES
FROM STG,
	DIM_HTTP_STATUS HST,
	DIM_USER_AGENT USG
WHERE STG.STATUS = HST.STATUS_NK
  AND STG.USER_AGENT_NK = USG.USER_AGENT_NK
  AND CAST(STG.EVENT_DT AS INTEGER) > $param_epoch_from /* load epoch date */
  AND CAST(STG.EVENT_DT AS INTEGER) < strftime('%s', date('now', 'start of day'))
GROUP BY STG.EVENT_DT, HST.DIM_HTTP_STATUS_ID, USG.DIM_USER_AGENT_ID

The sqlite database lets you write complex queries. The WITH clause prepares the data and the keys. The main query collects all the references to the dimensions.
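
The least obvious part of the WITH clause is the conversion of the log timestamp into the epoch of the start of its day. The sketch below runs that expression on its own through Python's `sqlite3` module, with the TIME_NK value passed in as a named parameter (`:t` is an assumption of this example, not part of the article's query), and compares the result with Python's own calculation.

```python
import calendar
import sqlite3

con = sqlite3.connect(":memory:")

# The EVENT_DT expression from the WITH clause, applied to one value.
DAY_EPOCH_SQL = """
SELECT CAST(STRFTIME('%s',
    SUBSTR(:t, 9, 4) || '-' ||
    CASE SUBSTR(:t, 5, 3)
      WHEN 'Jan' THEN '01' WHEN 'Feb' THEN '02' WHEN 'Mar' THEN '03'
      WHEN 'Apr' THEN '04' WHEN 'May' THEN '05' WHEN 'Jun' THEN '06'
      WHEN 'Jul' THEN '07' WHEN 'Aug' THEN '08' WHEN 'Sep' THEN '09'
      WHEN 'Oct' THEN '10' WHEN 'Nov' THEN '11' ELSE '12'
    END || '-' || SUBSTR(:t, 2, 2) || ' 00:00:00') AS INTEGER)
"""

# TIME_NK keeps the square brackets, which is why the day starts at offset 2.
time_nk = "[28/Dec/2012:01:47:47 +0100]"
day_epoch = con.execute(DAY_EPOCH_SQL, {"t": time_nk}).fetchone()[0]

# Midnight of the same calendar day, interpreted as UTC.
expected = calendar.timegm((2012, 12, 28, 0, 0, 0))
print(day_epoch, expected)
```

Note that the expression discards both the time of day and the +0100 offset, so days are cut by the calendar date written in the log rather than by UTC.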

The following condition prevents the history from being loaded again: CAST(STG.EVENT_DT AS INTEGER) > $param_epoch_from, where the parameter is the result of the query:
SELECT COALESCE(MAX(EVENT_DT), '3600') AS LAST_EVENT_EPOCH FROM FCT_ACCESS_USER_AGENT_DD

The following condition loads only complete days: CAST(STG.EVENT_DT AS INTEGER) < strftime('%s', date('now', 'start of day'))
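
Taken together, the two conditions define an open load window between the last loaded day and the start of today. The sketch below shows how that window moves, using Python's `sqlite3` module and a cut-down, hypothetical fact table with only the EVENT_DT column; `load_window` is a name invented for this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Cut-down fact table: only the column the watermark depends on.
con.execute("CREATE TABLE FCT_ACCESS_USER_AGENT_DD (EVENT_DT INTEGER)")


def load_window(con):
    """Return (epoch_from, epoch_to); days with from < EVENT_DT < to are loaded."""
    epoch_from = con.execute(
        "SELECT COALESCE(MAX(EVENT_DT), '3600') FROM FCT_ACCESS_USER_AGENT_DD"
    ).fetchone()[0]
    epoch_to = con.execute(
        "SELECT strftime('%s', date('now', 'start of day'))"
    ).fetchone()[0]
    return int(epoch_from), int(epoch_to)


first_from, today_start = load_window(con)  # empty table: fallback value
yesterday = today_start - 86400
candidates = [yesterday, today_start]       # day epochs present in the stage

# First run: yesterday qualifies, the unfinished current day does not.
to_load_first = [d for d in candidates if first_from < d < today_start]

con.execute("INSERT INTO FCT_ACCESS_USER_AGENT_DD VALUES (?)", (yesterday,))
second_from, _ = load_window(con)           # watermark moved to yesterday

# Second run on the same data: nothing is loaded twice.
to_load_second = [d for d in candidates if second_from < d < today_start]

print(first_from, to_load_first == [yesterday], to_load_second)
```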

Pages and files are told apart in a primitive way, by looking for a dot in the request.
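
The dot rule is easy to try out on a handful of requests. The following sketch mirrors the INSTR(REQUEST_NK, '.') test from the load query in Python; the sample requests and the function name are made up for illustration.

```python
def is_file_request(request_nk):
    # Same rule as INSTR(REQUEST_NK, '.') > 0 in the load query.
    return "." in request_nk


requests = ["/", "/about", "/files/default.css",
            "/index.php?id=5", "/search?q=1.5"]
pages = [r for r in requests if not is_file_request(r)]
files = [r for r in requests if is_file_request(r)]

print(pages)
print(files)
```

The last two samples show the limits of the heuristic: a script URL such as /index.php?id=5, or any query string containing a dot, is counted as a file.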

Reports

In complex visualization systems it is possible to create a meta-model based on the database objects and to manage filters and aggregation rules dynamically. In the end, every decent tool generates an SQL query.

In this example, we will create ready-made SQL queries and save them as views in the database: these are the reports.
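
A report saved as a view might look like the sketch below. It reuses a shortened form of the request status query from earlier in the article, without the 14-day filter, on two cut-down, hypothetical tables with a few made-up rows; the view name follows the RPT_ naming used in the report list.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DIM_HTTP_STATUS (
  DIM_HTTP_STATUS_ID INTEGER PRIMARY KEY,
  STATUS_GROUP TEXT
);
CREATE TABLE FCT_ACCESS_USER_AGENT_DD (
  EVENT_DT INTEGER,
  DIM_HTTP_STATUS_ID INTEGER,
  REQUEST_CNT INTEGER
);
INSERT INTO DIM_HTTP_STATUS VALUES (1, 'Successful'), (2, 'Client Error');
-- Made-up sample rows for 28 and 29 December 2012.
INSERT INTO FCT_ACCESS_USER_AGENT_DD VALUES
  (1356652800, 1, 40), (1356652800, 2, 3), (1356739200, 1, 25);

-- The report is simply a saved query.
CREATE VIEW RPT_ACCESS_REQUEST_STATUS AS
SELECT
  1 AS 'Line: All Requests by Status',
  strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Day',
  SUM(CASE WHEN STS.STATUS_GROUP='Successful' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Success',
  SUM(CASE WHEN STS.STATUS_GROUP='Client Error' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Client Error'
FROM FCT_ACCESS_USER_AGENT_DD FCT, DIM_HTTP_STATUS STS
WHERE FCT.DIM_HTTP_STATUS_ID = STS.DIM_HTTP_STATUS_ID
GROUP BY strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT;
""")

# Consumers only need to know the view name.
report = con.execute("SELECT * FROM RPT_ACCESS_REQUEST_STATUS").fetchall()
for row in report:
    print(row)
```

Keeping the query text inside the database means the visualization code never has to change when a report is adjusted.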

Visualization

Bluff (beautiful graphs in JavaScript) was used as the visualization tool.

To do this, it was necessary to go through all the reports using PHP and generate an html file with tables.

$sqls = array(
'SELECT * FROM RPT_ACCESS_USER_VS_BOT',
'SELECT * FROM RPT_ACCESS_ANNOYING_BOT',
'SELECT * FROM RPT_ACCESS_TOP_HOUR_HIT',
'SELECT * FROM RPT_ACCESS_USER_ACTIVE',
'SELECT * FROM RPT_ACCESS_REQUEST_STATUS',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_PAGE',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_REFERRER',
'SELECT * FROM RPT_ACCESS_NEW_REQUEST',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_SUCCESS',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_ERROR'
);

The tool simply visualizes the result tables.
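
The article performs this step in PHP; the sketch below shows the same idea in Python, purely as an illustration. It assumes, based on the report queries above, that the first column of every report is a constant whose name carries the chart type and the title (for example 'Table: Annoying Bots'). The function name, the demo view and the exact html layout are inventions of this example.

```python
import html
import sqlite3


def render_report(con, sql):
    """Run one report query and return (chart_type, title, html_table)."""
    cur = con.execute(sql)
    columns = [d[0] for d in cur.description]
    # Assumed convention: the first column name is 'ChartType: Title'.
    chart_type, _, title = columns[0].partition(": ")
    head = "".join(f"<th>{html.escape(c)}</th>" for c in columns[1:])
    body = "".join(
        "<tr>"
        + "".join(f"<td>{html.escape(str(v))}</td>" for v in row[1:])
        + "</tr>"
        for row in cur
    )
    table = (f"<table><caption>{html.escape(title)}</caption>"
             f"<tr>{head}</tr>{body}</table>")
    return chart_type, title, table


con = sqlite3.connect(":memory:")
# A tiny stand-in for one of the RPT_ views.
con.execute(
    "CREATE VIEW RPT_DEMO AS "
    "SELECT 1 AS 'Table: Demo Report', 'a<b' AS 'Request', 3 AS 'Hits per Day'"
)
chart_type, title, table = render_report(con, "SELECT * FROM RPT_DEMO")
print(chart_type, title)
print(table)
```

Looping this function over the list of report queries and concatenating the tables gives the html file that the charting library then reads.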

Conclusion

Using web analytics as an example, the article describes the mechanisms needed to build a data warehouse. As the results show, the simplest tools are enough for deep analysis and visualization of the data.

In the future, using this warehouse as an example, we will try to implement structures such as slowly changing dimensions, metadata, aggregation levels and the integration of data from different sources.

We will also take a closer look at the simplest tool for managing ETL processes, based on a single table.

We will return to the topic of measuring data quality and automating that process.

We will study the problems of the technical environment and the maintenance of data warehouses, for which we will set up a warehouse server with minimal resources, for example based on a Raspberry Pi.

Source: www.habr.com
