Site statistics and your own small data warehouse

Webalizer and Google Analytics have helped me understand what is happening on my websites for many years. I now realize that they provide very little useful information. With access to your own access.log file, it is easy to work out the statistics yourself with quite basic tools: sqlite, html, the SQL language and any scripting language.

The data source for Webalizer is the server's access.log file. This is what its bars and numbers look like; only the overall traffic volume is clear from them:

Tools such as Google Analytics collect the data themselves from the loaded page. They show us a few charts and lines, from which it is often hard to draw the right conclusions. Maybe more effort should have been put in? I don't know.

So, what do I want to see in the visitor statistics of a website?

User and bot traffic

Site traffic is often limited, so it is worth seeing how much of it is useful traffic. For example, like this:

SQL report query

SELECT
1 as 'StackedArea: Traffic generated by Users and Bots',
strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Day',
SUM(CASE WHEN USG.AGENT_BOT!='n.a.' THEN FCT.BYTES ELSE 0 END)/1000 AS 'Bots, KB',
SUM(CASE WHEN USG.AGENT_BOT='n.a.' THEN FCT.BYTES ELSE 0 END)/1000 AS 'Users, KB'
FROM
  FCT_ACCESS_USER_AGENT_DD FCT,
  DIM_USER_AGENT USG
WHERE FCT.DIM_USER_AGENT_ID=USG.DIM_USER_AGENT_ID
  AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT

The chart shows constant bot activity. It would be interesting to study the most active representatives in detail.

Annoying bots

We classify bots based on the user agent information. Additional statistics on daily traffic and on the number of successful and failed requests give a good idea of bot activity.

SQL report query

SELECT 
1 AS 'Table: Annoying Bots',
MAX(USG.AGENT_BOT) AS 'Bot',
ROUND(SUM(FCT.BYTES)/1000 / 14.0, 1) AS 'KB per Day',
ROUND(SUM(FCT.IP_CNT) / 14.0, 1) AS 'IPs per Day',
ROUND(SUM(CASE WHEN STS.STATUS_GROUP IN ('Client Error', 'Server Error') THEN FCT.REQUEST_CNT / 14.0 ELSE 0 END), 1) AS 'Error Requests per Day',
ROUND(SUM(CASE WHEN STS.STATUS_GROUP IN ('Successful', 'Redirection') THEN FCT.REQUEST_CNT / 14.0 ELSE 0 END), 1) AS 'Success Requests per Day',
USG.USER_AGENT_NK AS 'Agent'
FROM FCT_ACCESS_USER_AGENT_DD FCT,
     DIM_USER_AGENT USG,
     DIM_HTTP_STATUS STS
WHERE FCT.DIM_USER_AGENT_ID = USG.DIM_USER_AGENT_ID
  AND FCT.DIM_HTTP_STATUS_ID = STS.DIM_HTTP_STATUS_ID
  AND USG.AGENT_BOT != 'n.a.'
  AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY USG.USER_AGENT_NK
ORDER BY 3 DESC
LIMIT 10

In this case, the analysis led to the decision to restrict access to the site by adding the following to the robots.txt file:

User-agent: AhrefsBot
Disallow: /
User-agent: dotbot
Disallow: /
User-agent: bingbot
Crawl-delay: 5

The first two bots disappeared from the table, and the MS robots moved down from the top lines.

Days and hours of peak activity

Upswings are visible in the traffic. To study them in detail, it is necessary to highlight when they occurred; there is no need to display every hour and day of the measurement period. This makes it easier to find individual requests in the log file if a detailed analysis is needed.
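
Before waiting for the next report, the effect of such rules can be checked offline. The following is a minimal sketch, assuming Python's standard `urllib.robotparser` is an acceptable stand-in for how a well-behaved crawler reads the file; the page path is hypothetical, and real bots may interpret the rules differently or ignore them altogether.

```python
from urllib.robotparser import RobotFileParser

# The rules from the article, exactly as added to robots.txt.
ROBOTS_TXT = """\
User-agent: AhrefsBot
Disallow: /
User-agent: dotbot
Disallow: /
User-agent: bingbot
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical page path, used only for this check.
page = "/some/page"

ahrefs_allowed = parser.can_fetch("AhrefsBot", page)    # blocked by "Disallow: /"
dotbot_allowed = parser.can_fetch("dotbot", page)       # blocked by "Disallow: /"
bing_allowed = parser.can_fetch("bingbot", page)        # allowed, only slowed down
bing_delay = parser.crawl_delay("bingbot")              # seconds between requests
browser_allowed = parser.can_fetch("Mozilla/5.0", page) # no rule applies

print(ahrefs_allowed, dotbot_allowed, bing_allowed, bing_delay, browser_allowed)
```

The check only confirms that the file says what was intended; whether a bot obeys it still has to be read from the report.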

SQL report query

SELECT
1 AS 'Line: Day and Hour of Hits from Users and Bots',
strftime('%d.%m-%H', datetime(EVENT_DT, 'unixepoch')) AS 'Date Time',
HIB AS 'Bots, Hits',
HIU AS 'Users, Hits'
FROM (
	SELECT
	EVENT_DT,
	SUM(CASE WHEN AGENT_BOT!='n.a.' THEN LINE_CNT ELSE 0 END) AS HIB,
	SUM(CASE WHEN AGENT_BOT='n.a.' THEN LINE_CNT ELSE 0 END) AS HIU
	FROM FCT_ACCESS_REQUEST_REF_HH
	WHERE datetime(EVENT_DT, 'unixepoch') >= date('now', '-14 day')
	GROUP BY EVENT_DT
	ORDER BY SUM(LINE_CNT) DESC
	LIMIT 10
) ORDER BY EVENT_DT

On the chart we see the most active hours of the first day: 11, 14 and 20. But the next day, at 13:00, it was the bots that were active.

Average daily user activity by week

We have sorted out activity and traffic a little. The next question is the activity of the users themselves. Such statistics call for long aggregation periods, such as a week.

SQL report query

SELECT
1 as 'Line: Average Daily User Activity by Week',
strftime('%W week', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Week',
ROUND(1.0*SUM(FCT.PAGE_CNT)/SUM(FCT.IP_CNT),1) AS 'Pages per IP per Day',
ROUND(1.0*SUM(FCT.FILE_CNT)/SUM(FCT.IP_CNT),1) AS 'Files per IP per Day'
FROM
  FCT_ACCESS_USER_AGENT_DD FCT,
  DIM_USER_AGENT USG,
  DIM_HTTP_STATUS HST
WHERE FCT.DIM_USER_AGENT_ID=USG.DIM_USER_AGENT_ID
  AND FCT.DIM_HTTP_STATUS_ID = HST.DIM_HTTP_STATUS_ID
  AND USG.AGENT_BOT='n.a.' /* users only */
  AND HST.STATUS_GROUP IN ('Successful') /* good pages */
  AND datetime(FCT.EVENT_DT, 'unixepoch') > date('now', '-3 month')
GROUP BY strftime('%W week', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT

The weekly statistics show that, on average, one user opens 1.6 pages per day. The number of files requested per user in this case depends on new files being added to the site.

All requests and their statuses

Webalizer always showed specific status codes, whereas I always wanted to see just the number of successful requests and the number of errors.

SQL report query

SELECT
1 as 'Line: All Requests by Status',
strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Day',
SUM(CASE WHEN STS.STATUS_GROUP='Successful' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Success',
SUM(CASE WHEN STS.STATUS_GROUP='Redirection' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Redirect',
SUM(CASE WHEN STS.STATUS_GROUP='Client Error' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Client Error',
SUM(CASE WHEN STS.STATUS_GROUP='Server Error' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Server Error'
FROM
  FCT_ACCESS_USER_AGENT_DD FCT,
  DIM_HTTP_STATUS STS
WHERE FCT.DIM_HTTP_STATUS_ID=STS.DIM_HTTP_STATUS_ID
  AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT

The report shows requests, not clicks (hits): unlike LINE_CNT, the REQUEST_CNT metric is calculated as COUNT(DISTINCT STG.REQUEST_NK). The goal is to show effective events. For example, MS bots poll the robots.txt file hundreds of times a day, and in this case such polls are counted once. This smooths out the jumps in the chart.

The chart shows many errors: these are non-existent pages. The analysis resulted in adding redirects from the deleted pages.

Bad requests

To examine the requests in detail, you can display detailed statistics.
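
The difference between the two metrics can be seen on a toy data set. This is a minimal sketch, assuming a simplified, hypothetical stage table `STG` with one row per log line; the real stage table has more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical stage table: one row per access.log line.
conn.execute("CREATE TABLE STG (LINE_NK INTEGER, REQUEST_NK TEXT)")

rows = [(i, "/robots.txt") for i in range(300)]     # a bot polling one file
rows += [(300, "/"), (301, "/about"), (302, "/")]   # ordinary page views
conn.executemany("INSERT INTO STG VALUES (?, ?)", rows)

# LINE_CNT counts every log line, REQUEST_CNT counts each distinct request once.
line_cnt, request_cnt = conn.execute(
    "SELECT COUNT(DISTINCT LINE_NK), COUNT(DISTINCT REQUEST_NK) FROM STG"
).fetchone()

print(line_cnt, request_cnt)
```

Here the 300 polls of /robots.txt contribute 300 to the hit count but only 1 to the request count, which is why the request chart has smaller jumps.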

SQL report query

SELECT
  1 AS 'Table: Top Error Requests',
  REQ.REQUEST_NK AS 'Request',
  'Error' AS 'Request Status',
  ROUND(SUM(FCT.LINE_CNT) / 14.0, 1) AS 'Hits per Day',
  ROUND(SUM(FCT.IP_CNT) / 14.0, 1) AS 'IPs per Day',
  ROUND(SUM(FCT.BYTES)/1000 / 14.0, 1) AS 'KB per Day'
FROM
  FCT_ACCESS_REQUEST_REF_HH FCT,
  DIM_REQUEST_V_ACT REQ
WHERE FCT.DIM_REQUEST_ID = REQ.DIM_REQUEST_ID
  AND FCT.STATUS_GROUP IN ('Client Error', 'Server Error')
  AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY REQ.REQUEST_NK
ORDER BY 4 DESC
LIMIT 20

This list will also contain all the probes, for example requests to /wp-login.php. By adjusting the server's request rewrite rules, you can tune the server's response to such requests and send them to the start page.

So, a few simple reports based on the server log file give a fairly complete picture of what is happening on the site.

How to get the information?

An sqlite database is enough. Let's create the tables: an auxiliary one for logging the ETL processes.

A stage table, into which we will write the log files using PHP. Two aggregate tables: a daily table with statistics on user agents and request statuses, and an hourly table with statistics on requests, status groups and agents. Four tables for the corresponding dimensions.

The result is the following relational model:

Data model

Script for creating an object in the sqlite database:

DDL object creation

DROP TABLE IF EXISTS DIM_USER_AGENT;
CREATE TABLE DIM_USER_AGENT (
  DIM_USER_AGENT_ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
  USER_AGENT_NK     TEXT NOT NULL DEFAULT 'n.a.',
  AGENT_OS          TEXT NOT NULL DEFAULT 'n.a.',
  AGENT_ENGINE      TEXT NOT NULL DEFAULT 'n.a.',
  AGENT_DEVICE      TEXT NOT NULL DEFAULT 'n.a.',
  AGENT_BOT         TEXT NOT NULL DEFAULT 'n.a.',
  UPDATE_DT         INTEGER NOT NULL DEFAULT 0,
  UNIQUE (USER_AGENT_NK)
);
INSERT INTO DIM_USER_AGENT (DIM_USER_AGENT_ID) VALUES (-1);

Stage

For the access.log file, all requests must be read, parsed and written to the database. This can be done either directly with a scripting language or with the sqlite tools.

Log file format:

//67.221.59.195 - - [28/Dec/2012:01:47:47 +0100] "GET /files/default.css HTTP/1.1" 200 1512 "https://project.edu/" "Mozilla/4.0"
//host ident auth time method request_nk protocol status bytes ref browser
$log_pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9-]+) ([0-9-]+) "(.*)" "(.*)"$/';
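
The same pattern can be tried outside PHP. The following is a minimal sketch in Python, assuming the bracket escapes in the time group are as shown here; the `FIELDS` names follow the comment line above, and `parse_line` is a hypothetical helper, not part of the original script.

```python
import re

# Same structure as $log_pattern: 11 groups, one per field of the log line.
LOG_PATTERN = re.compile(
    r'^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" '
    r'([0-9-]+) ([0-9-]+) "(.*)" "(.*)"$'
)
FIELDS = ("host", "ident", "auth", "time", "method", "request_nk",
          "protocol", "status", "bytes", "ref", "browser")

def parse_line(line):
    """Return a dict of fields, or None when the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return dict(zip(FIELDS, match.groups())) if match else None

sample = ('67.221.59.195 - - [28/Dec/2012:01:47:47 +0100] '
          '"GET /files/default.css HTTP/1.1" 200 1512 '
          '"https://project.edu/" "Mozilla/4.0"')
record = parse_line(sample)
print(record)
```

Lines that do not match, such as malformed requests, come back as None and can be logged separately instead of being loaded into the stage table.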

Key propagation

Once the raw data is in the database, the keys that are not there yet must be written to the dimension tables. After that it becomes possible to build references to the dimensions. For example, in the DIM_REFERRER table the key is a combination of three fields.

SQL key propagation query

/* Propagate the referrer from access log */
INSERT INTO DIM_REFERRER (HOST_NK, PATH_NK, QUERY_NK, UPDATE_DT)
SELECT
	CLS.HOST_NK,
	CLS.PATH_NK,
	CLS.QUERY_NK,
	STRFTIME('%s','now') AS UPDATE_DT
FROM (
	SELECT DISTINCT
	REFERRER_HOST AS HOST_NK,
	REFERRER_PATH AS PATH_NK,
	CASE WHEN INSTR(REFERRER_QUERY,'&sid')>0 THEN SUBSTR(REFERRER_QUERY, 1, INSTR(REFERRER_QUERY,'&sid')-1) /* cut off the sid: CMS-specific */
	ELSE REFERRER_QUERY END AS QUERY_NK
	FROM STG_ACCESS_LOG
) CLS
LEFT OUTER JOIN DIM_REFERRER TRG
ON (CLS.HOST_NK = TRG.HOST_NK AND CLS.PATH_NK = TRG.PATH_NK AND CLS.QUERY_NK = TRG.QUERY_NK)
WHERE TRG.DIM_REFERRER_ID IS NULL
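
A useful property of this anti-join pattern is that it can be run repeatedly without creating duplicate keys. The following is a minimal sketch of that behavior, assuming reduced table definitions with only the columns the query touches; the sample referrers are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, assumed versions of the stage and dimension tables.
conn.executescript("""
CREATE TABLE STG_ACCESS_LOG (
  REFERRER_HOST TEXT, REFERRER_PATH TEXT, REFERRER_QUERY TEXT
);
CREATE TABLE DIM_REFERRER (
  DIM_REFERRER_ID INTEGER PRIMARY KEY AUTOINCREMENT,
  HOST_NK TEXT, PATH_NK TEXT, QUERY_NK TEXT, UPDATE_DT INTEGER
);
INSERT INTO STG_ACCESS_LOG VALUES
  ('example.org', '/a', 'q=1&sid=abc'),
  ('example.org', '/a', 'q=1&sid=xyz'),
  ('example.org', '/b', '');
""")

PROPAGATE_SQL = """
INSERT INTO DIM_REFERRER (HOST_NK, PATH_NK, QUERY_NK, UPDATE_DT)
SELECT CLS.HOST_NK, CLS.PATH_NK, CLS.QUERY_NK, STRFTIME('%s', 'now')
FROM (
  SELECT DISTINCT
    REFERRER_HOST AS HOST_NK,
    REFERRER_PATH AS PATH_NK,
    CASE WHEN INSTR(REFERRER_QUERY, '&sid') > 0
         THEN SUBSTR(REFERRER_QUERY, 1, INSTR(REFERRER_QUERY, '&sid') - 1)
         ELSE REFERRER_QUERY END AS QUERY_NK
  FROM STG_ACCESS_LOG
) CLS
LEFT OUTER JOIN DIM_REFERRER TRG
  ON (CLS.HOST_NK = TRG.HOST_NK AND CLS.PATH_NK = TRG.PATH_NK
      AND CLS.QUERY_NK = TRG.QUERY_NK)
WHERE TRG.DIM_REFERRER_ID IS NULL
"""

def propagate(conn):
    """Insert only the keys missing from DIM_REFERRER; return its row count."""
    conn.execute(PROPAGATE_SQL)
    return conn.execute("SELECT COUNT(*) FROM DIM_REFERRER").fetchone()[0]

first_run = propagate(conn)   # two distinct keys once the sid is cut off
second_run = propagate(conn)  # nothing new to insert
print(first_run, second_run)
```

One caveat of this sketch: the join uses plain equality, so a NULL in any of the three fields would never match and would be inserted again on every run. Storing a default such as 'n.a.' instead of NULL avoids that.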

Propagation to the user agent table can contain the bot logic, for example this SQL snippet:


CASE
WHEN INSTR(LOWER(CLS.BROWSER),'yandex.com')>0
	THEN 'yandex'
WHEN INSTR(LOWER(CLS.BROWSER),'googlebot')>0
	THEN 'google'
WHEN INSTR(LOWER(CLS.BROWSER),'bingbot')>0
	THEN 'microsoft'
WHEN INSTR(LOWER(CLS.BROWSER),'ahrefsbot')>0
	THEN 'ahrefs'
WHEN INSTR(LOWER(CLS.BROWSER),'mj12bot')>0
	THEN 'majestic-12'
WHEN INSTR(LOWER(CLS.BROWSER),'compatible')>0 OR INSTR(LOWER(CLS.BROWSER),'http')>0
	OR INSTR(LOWER(CLS.BROWSER),'libwww')>0 OR INSTR(LOWER(CLS.BROWSER),'spider')>0
	OR INSTR(LOWER(CLS.BROWSER),'java')>0 OR INSTR(LOWER(CLS.BROWSER),'python')>0
	OR INSTR(LOWER(CLS.BROWSER),'robot')>0 OR INSTR(LOWER(CLS.BROWSER),'curl')>0
	OR INSTR(LOWER(CLS.BROWSER),'wget')>0
	THEN 'other'
ELSE 'n.a.' END AS AGENT_BOT
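
To see which agents end up in which group, the same rules can be tried on sample strings. The following is a minimal sketch that mirrors the CASE expression in Python; `classify_bot` is a hypothetical helper and the user agent strings are only illustrations.

```python
# Named bots are checked first, in the same order as the CASE branches.
NAMED_BOTS = (
    ("yandex.com", "yandex"),
    ("googlebot", "google"),
    ("bingbot", "microsoft"),
    ("ahrefsbot", "ahrefs"),
    ("mj12bot", "majestic-12"),
)
# Generic markers from the last WHEN branch.
GENERIC_MARKERS = ("compatible", "http", "libwww", "spider", "java",
                   "python", "robot", "curl", "wget")

def classify_bot(user_agent):
    """Return the AGENT_BOT value for a raw user agent string."""
    ua = user_agent.lower()
    for marker, name in NAMED_BOTS:
        if marker in ua:
            return name
    if any(marker in ua for marker in GENERIC_MARKERS):
        return "other"
    return "n.a."

samples = {
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)": None,
    "curl/8.0.1": None,
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0": None,
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)": None,
}
for agent in samples:
    samples[agent] = classify_bot(agent)
    print(samples[agent], "<-", agent)
```

The last sample shows a side effect of the 'compatible' marker: old browsers that announce themselves as compatible are counted as 'other' bots, so the rule list is a heuristic that needs occasional review.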

Aggregate tables

Finally, we load the aggregate tables; for example, the daily table can be loaded as follows:

SQL query for loading the aggregate

/* Load fact from access log */
INSERT INTO FCT_ACCESS_USER_AGENT_DD (EVENT_DT, DIM_USER_AGENT_ID, DIM_HTTP_STATUS_ID, PAGE_CNT, FILE_CNT, REQUEST_CNT, LINE_CNT, IP_CNT, BYTES)
WITH STG AS (
SELECT
	STRFTIME( '%s', SUBSTR(TIME_NK,9,4) || '-' ||
	CASE SUBSTR(TIME_NK,5,3)
	WHEN 'Jan' THEN '01' WHEN 'Feb' THEN '02' WHEN 'Mar' THEN '03' WHEN 'Apr' THEN '04' WHEN 'May' THEN '05' WHEN 'Jun' THEN '06'
	WHEN 'Jul' THEN '07' WHEN 'Aug' THEN '08' WHEN 'Sep' THEN '09' WHEN 'Oct' THEN '10' WHEN 'Nov' THEN '11'
	ELSE '12' END || '-' || SUBSTR(TIME_NK,2,2) || ' 00:00:00' ) AS EVENT_DT,
	BROWSER AS USER_AGENT_NK,
	REQUEST_NK,
	IP_NR,
	STATUS,
	LINE_NK,
	BYTES
FROM STG_ACCESS_LOG
)
SELECT
	CAST(STG.EVENT_DT AS INTEGER) AS EVENT_DT,
	USG.DIM_USER_AGENT_ID,
	HST.DIM_HTTP_STATUS_ID,
	COUNT(DISTINCT (CASE WHEN INSTR(STG.REQUEST_NK,'.')=0 THEN STG.REQUEST_NK END) ) AS PAGE_CNT,
	COUNT(DISTINCT (CASE WHEN INSTR(STG.REQUEST_NK,'.')>0 THEN STG.REQUEST_NK END) ) AS FILE_CNT,
	COUNT(DISTINCT STG.REQUEST_NK) AS REQUEST_CNT,
	COUNT(DISTINCT STG.LINE_NK) AS LINE_CNT,
	COUNT(DISTINCT STG.IP_NR) AS IP_CNT,
	SUM(BYTES) AS BYTES
FROM STG,
	DIM_HTTP_STATUS HST,
	DIM_USER_AGENT USG
WHERE STG.STATUS = HST.STATUS_NK
  AND STG.USER_AGENT_NK = USG.USER_AGENT_NK
  AND CAST(STG.EVENT_DT AS INTEGER) > $param_epoch_from /* load epoch date */
  AND CAST(STG.EVENT_DT AS INTEGER) < strftime('%s', date('now', 'start of day'))
GROUP BY STG.EVENT_DT, HST.DIM_HTTP_STATUS_ID, USG.DIM_USER_AGENT_ID

The sqlite database allows you to write complex queries. The WITH clause contains the preparation of the data and keys. The main query collects all the references to the dimensions.

This condition prevents history from being loaded again: CAST(STG.EVENT_DT AS INTEGER) > $param_epoch_from, where the parameter is the result of the query
'SELECT COALESCE(MAX(EVENT_DT), '3600') AS LAST_EVENT_EPOCH FROM FCT_ACCESS_USER_AGENT_DD'

This condition loads only complete days: CAST(STG.EVENT_DT AS INTEGER) < strftime('%s', date('now', 'start of day'))

Pages and files are counted in a primitive way, by searching for a dot.
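
Together the two conditions define a load window: everything after the last loaded day and before the start of today. The following is a minimal sketch of how that window could be computed, assuming a fact table reduced to its EVENT_DT column; `load_window` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Fact table reduced to the one column the watermark query needs.
conn.execute("CREATE TABLE FCT_ACCESS_USER_AGENT_DD (EVENT_DT INTEGER)")

WATERMARK_SQL = ("SELECT COALESCE(MAX(EVENT_DT), '3600') AS LAST_EVENT_EPOCH "
                 "FROM FCT_ACCESS_USER_AGENT_DD")
TODAY_SQL = "SELECT strftime('%s', date('now', 'start of day'))"

def load_window(conn):
    """Return (epoch_from, epoch_to): the exclusive bounds of the next load."""
    epoch_from = int(conn.execute(WATERMARK_SQL).fetchone()[0])
    epoch_to = int(conn.execute(TODAY_SQL).fetchone()[0])
    return epoch_from, epoch_to

# Empty fact table: the default watermark applies, so all history is loaded.
empty_from, today = load_window(conn)

# After yesterday has been loaded, the watermark moves to yesterday.
conn.execute("INSERT INTO FCT_ACCESS_USER_AGENT_DD VALUES (?)", (today - 86400,))
next_from, _ = load_window(conn)

print(empty_from, today, next_from)
```

Because both bounds are exclusive, a second run on the same day finds nothing between yesterday and the start of today, and the unfinished current day is never loaded half-complete.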
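
The rule and its limits fit in a few lines. This is a minimal sketch, assuming the request string is checked as a whole, as INSTR(STG.REQUEST_NK,'.') does; `is_page` and the sample requests are hypothetical.

```python
def is_page(request_nk):
    """Mirror INSTR(REQUEST_NK, '.') = 0: no dot anywhere means 'page'."""
    return "." not in request_nk

samples = ["/about", "/files/default.css", "/news?ver=1.2"]
kinds = {r: ("page" if is_page(r) else "file") for r in samples}
print(kinds)
```

The last sample is a page whose query string happens to contain a dot, so the primitive rule counts it as a file. For a small site that is an acceptable trade-off; a stricter rule would look at the path only.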

Reports

In complex visualization systems, it is possible to build a meta-model on top of the database objects and to manage filters and aggregation rules dynamically. In the end, all decent tools generate an SQL query.

In this example, we will create ready-made SQL queries and save them as views in the database: these are the reports.

Visualization

Bluff: Beautiful graphs in JavaScript was used as the visualization tool.

To do this, it was necessary to go through all the reports with PHP and generate an html file with tables.
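
Saving a report as a view keeps the query text in the database, so a client only needs SELECT * FROM the view. The following is a minimal sketch, assuming a hypothetical fact table `FCT_DEMO` and view `RPT_DEMO_REQUESTS`; the real views wrap the report queries shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical fact table and report view, used only for this illustration.
conn.executescript("""
CREATE TABLE FCT_DEMO (EVENT_DT INTEGER, REQUEST_CNT INTEGER);
INSERT INTO FCT_DEMO VALUES (86400, 5), (86400, 7), (172800, 3);

CREATE VIEW RPT_DEMO_REQUESTS AS
SELECT
  1 AS 'Line: Requests per Day',
  strftime('%d.%m', datetime(EVENT_DT, 'unixepoch')) AS 'Day',
  SUM(REQUEST_CNT) AS 'Requests'
FROM FCT_DEMO
GROUP BY EVENT_DT
ORDER BY EVENT_DT;
""")

cur = conn.execute("SELECT * FROM RPT_DEMO_REQUESTS")
columns = [d[0] for d in cur.description]  # column aliases defined in the view
rows = cur.fetchall()
print(columns)
print(rows)
```

The client code stays the same when a report changes: only the view definition is replaced.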

$sqls = array(
'SELECT * FROM RPT_ACCESS_USER_VS_BOT',
'SELECT * FROM RPT_ACCESS_ANNOYING_BOT',
'SELECT * FROM RPT_ACCESS_TOP_HOUR_HIT',
'SELECT * FROM RPT_ACCESS_USER_ACTIVE',
'SELECT * FROM RPT_ACCESS_REQUEST_STATUS',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_PAGE',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_REFERRER',
'SELECT * FROM RPT_ACCESS_NEW_REQUEST',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_SUCCESS',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_ERROR'
);

The tool simply visualizes the result tables.

Conclusion

Using web analytics as an example, the article describes the mechanisms needed to build a data warehouse. As the results show, the simplest tools are enough for in-depth analysis and visualization of data.

In the future, using this warehouse as an example, we will try to implement such structures as slowly changing dimensions, metadata, aggregation levels and the integration of data from different sources.

We will also take a closer look at the simplest tool for managing ETL processes based on a single table.

We will return to the topic of measuring data quality and automating that process.

We will study the questions of the technical environment and the maintenance of data warehouses, for which we will implement a storage server with minimal resources, for example based on a Raspberry Pi.
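
The loop over the reports can be sketched as follows. This is not the original PHP script: it is a minimal Python illustration that assumes the first column alias of every report carries the chart type and the title, as in 'Table: Annoying Bots', and that the remaining columns are the data. `render_report` is a hypothetical helper.

```python
import html
import sqlite3

def render_report(conn, sql):
    """Run one report query; return (chart type, html fragment with a table)."""
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]
    # Assumed convention: first column alias is "<chart type>: <title>".
    kind, _, title = columns[0].partition(": ")
    header = "".join(f"<th>{html.escape(c)}</th>" for c in columns[1:])
    body = "".join(
        "<tr>"
        + "".join(f"<td>{html.escape(str(v))}</td>" for v in row[1:])
        + "</tr>"
        for row in cur
    )
    fragment = (f"<h2>{html.escape(title)}</h2>"
                f"<table><tr>{header}</tr>{body}</table>")
    return kind, fragment

conn = sqlite3.connect(":memory:")
# Stand-in for 'SELECT * FROM RPT_ACCESS_ANNOYING_BOT' with one made-up row.
kind, page = render_report(
    conn,
    "SELECT 1 AS 'Table: Annoying Bots', 'ahrefs' AS 'Bot', 12.5 AS 'KB per Day'",
)
print(kind)
print(page)
```

A charting library can then pick up the generated table and draw it according to the chart type returned alongside it.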

Source: www.habr.com