Webalizer va Google Analytics menga ko'p yillar davomida veb-saytlarda nima sodir bo'layotganini tushunishga yordam berdi. Endi men ular juda kam foydali ma'lumot berishlarini tushunaman. Access.log faylingizga kirish imkoniga ega bo'lgan holda, statistik ma'lumotlarni tushunish va sqlite, html, sql tili va har qanday skript dasturlash tili kabi juda oddiy vositalarni amalga oshirish juda oson.
Webalizer uchun ma'lumotlar manbai serverning access.log faylidir. Uning chiziqlari va raqamlari shunday ko'rinadi, ulardan faqat trafikning umumiy hajmi aniq:
Google Analytics kabi vositalar yuklangan sahifadan ma'lumotlarni o'zlari to'playdi. Ular bizga bir nechta diagrammalar va chiziqlarni ko'rsatadilar, ularning asosida to'g'ri xulosalar chiqarish ko'pincha qiyin. Balki ko'proq harakat qilish kerak edi? Bilmayman.
Xo'sh, men veb-saytga tashrif buyuruvchilar statistikasida nimani ko'rishni xohlardim?
Foydalanuvchi va bot trafiki
Ko'pincha sayt trafiki cheklangan va qanchalik foydali trafik ishlatilayotganini ko'rish kerak. Masalan, bu kabi:
SQL hisobot so'rovi
SELECT
1 as 'StackedArea: Traffic generated by Users and Bots',
strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Day',
SUM(CASE WHEN USG.AGENT_BOT!='n.a.' THEN FCT.BYTES ELSE 0 END)/1000 AS 'Bots, KB',
SUM(CASE WHEN USG.AGENT_BOT='n.a.' THEN FCT.BYTES ELSE 0 END)/1000 AS 'Users, KB'
FROM
FCT_ACCESS_USER_AGENT_DD FCT,
DIM_USER_AGENT USG
WHERE FCT.DIM_USER_AGENT_ID=USG.DIM_USER_AGENT_ID
AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT
Grafik botlarning doimiy faolligini ko'rsatadi. Eng faol vakillarni batafsil o'rganish qiziqarli bo'ladi.
Bezovta qiluvchi botlar
Biz botlarni foydalanuvchi agenti ma'lumotlari asosida tasniflaymiz. Kundalik trafik bo'yicha qo'shimcha statistika, muvaffaqiyatli va muvaffaqiyatsiz so'rovlar soni bot faoliyati haqida yaxshi tasavvur beradi.
SQL hisobot so'rovi
SELECT
1 AS 'Table: Annoying Bots',
MAX(USG.AGENT_BOT) AS 'Bot',
ROUND(SUM(FCT.BYTES)/1000 / 14.0, 1) AS 'KB per Day',
ROUND(SUM(FCT.IP_CNT) / 14.0, 1) AS 'IPs per Day',
ROUND(SUM(CASE WHEN STS.STATUS_GROUP IN ('Client Error', 'Server Error') THEN FCT.REQUEST_CNT / 14.0 ELSE 0 END), 1) AS 'Error Requests per Day',
ROUND(SUM(CASE WHEN STS.STATUS_GROUP IN ('Successful', 'Redirection') THEN FCT.REQUEST_CNT / 14.0 ELSE 0 END), 1) AS 'Success Requests per Day',
USG.USER_AGENT_NK AS 'Agent'
FROM FCT_ACCESS_USER_AGENT_DD FCT,
DIM_USER_AGENT USG,
DIM_HTTP_STATUS STS
WHERE FCT.DIM_USER_AGENT_ID = USG.DIM_USER_AGENT_ID
AND FCT.DIM_HTTP_STATUS_ID = STS.DIM_HTTP_STATUS_ID
AND USG.AGENT_BOT != 'n.a.'
AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY USG.USER_AGENT_NK
ORDER BY 3 DESC
LIMIT 10
Bunday holda, tahlil natijasi saytga kirishni robots.txt fayliga qo'shish orqali cheklash to'g'risida qaror qabul qilindi.
User-agent: AhrefsBot
Disallow: /
User-agent: dotbot
Disallow: /
User-agent: bingbot
Crawl-delay: 5
Dastlabki ikkita bot stoldan g'oyib bo'ldi va MS robotlari birinchi qatorlardan pastga siljidi.
Eng katta faollik kuni va vaqti
Ko'tarilishlar tirbandlikda ko'rinadi. Ularni batafsil o'rganish uchun ularning paydo bo'lish vaqtini ajratib ko'rsatish kerak va vaqtni o'lchashning barcha soatlari va kunlarini ko'rsatish shart emas. Agar batafsil tahlil zarur bo'lsa, bu jurnal faylida individual so'rovlarni topishni osonlashtiradi.
SQL hisobot so'rovi
SELECT
1 AS 'Line: Day and Hour of Hits from Users and Bots',
strftime('%d.%m-%H', datetime(EVENT_DT, 'unixepoch')) AS 'Date Time',
HIB AS 'Bots, Hits',
HIU AS 'Users, Hits'
FROM (
SELECT
EVENT_DT,
SUM(CASE WHEN AGENT_BOT!='n.a.' THEN LINE_CNT ELSE 0 END) AS HIB,
SUM(CASE WHEN AGENT_BOT='n.a.' THEN LINE_CNT ELSE 0 END) AS HIU
FROM FCT_ACCESS_REQUEST_REF_HH
WHERE datetime(EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY EVENT_DT
ORDER BY SUM(LINE_CNT) DESC
LIMIT 10
) ORDER BY EVENT_DT
Grafikdagi birinchi kunning eng faol soatlari 11, 14 va 20 ni kuzatamiz. Ammo ertasi kuni soat 13:XNUMX da botlar faollashdi.
Hafta bo'yicha o'rtacha kunlik foydalanuvchi faolligi
Biz harakat va tirbandlik bilan bir oz narsalarni tartibga soldik. Keyingi savol foydalanuvchilarning o'zlarining faolligi edi. Bunday statistik ma'lumotlar uchun bir hafta kabi uzoq yig'ish davrlari maqsadga muvofiqdir.
SQL hisobot so'rovi
SELECT
1 as 'Line: Average Daily User Activity by Week',
strftime('%W week', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Week',
ROUND(1.0*SUM(FCT.PAGE_CNT)/SUM(FCT.IP_CNT),1) AS 'Pages per IP per Day',
ROUND(1.0*SUM(FCT.FILE_CNT)/SUM(FCT.IP_CNT),1) AS 'Files per IP per Day'
FROM
FCT_ACCESS_USER_AGENT_DD FCT,
DIM_USER_AGENT USG,
DIM_HTTP_STATUS HST
WHERE FCT.DIM_USER_AGENT_ID=USG.DIM_USER_AGENT_ID
AND FCT.DIM_HTTP_STATUS_ID = HST.DIM_HTTP_STATUS_ID
AND USG.AGENT_BOT='n.a.' /* users only */
AND HST.STATUS_GROUP IN ('Successful') /* good pages */
AND datetime(FCT.EVENT_DT, 'unixepoch') > date('now', '-3 month')
GROUP BY strftime('%W week', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT
Haftalik statistika shuni ko'rsatadiki, o'rtacha bir foydalanuvchi kuniga 1,6 sahifa ochadi. Bu holda har bir foydalanuvchi uchun so'ralgan fayllar soni saytga yangi fayllar qo'shilishi bilan bog'liq.
Barcha so'rovlar va ularning holati
Webalizer har doim ma'lum sahifa kodlarini ko'rsatdi va men har doim muvaffaqiyatli so'rovlar va xatolar sonini ko'rishni xohlardim.
SQL hisobot so'rovi
SELECT
1 as 'Line: All Requests by Status',
strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Day',
SUM(CASE WHEN STS.STATUS_GROUP='Successful' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Success',
SUM(CASE WHEN STS.STATUS_GROUP='Redirection' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Redirect',
SUM(CASE WHEN STS.STATUS_GROUP='Client Error' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Customer Error',
SUM(CASE WHEN STS.STATUS_GROUP='Server Error' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Server Error'
FROM
FCT_ACCESS_USER_AGENT_DD FCT,
DIM_HTTP_STATUS STS
WHERE FCT.DIM_HTTP_STATUS_ID=STS.DIM_HTTP_STATUS_ID
AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT
Hisobot LINE_CNT dan farqli o'laroq, chertishlar (xitlar) emas, so'rovlarni ko'rsatadi, REQUEST_CNT ko'rsatkichi COUNT(DISTINCT STG.REQUEST_NK) sifatida hisoblanadi. Maqsad samarali voqealarni ko'rsatishdir, masalan, MS botlari robots.txt faylini kuniga yuzlab marta so'rov o'tkazadi va bu holda bunday so'rovlar bir marta hisobga olinadi. Bu sizga grafikdagi sakrashlarni tekislash imkonini beradi.
Grafikdan siz ko'plab xatolarni ko'rishingiz mumkin - bu mavjud bo'lmagan sahifalar. Tahlil natijasi uzoq sahifalardan qayta yo'naltirishlar qo'shildi.
Yomon so'rovlar
So'rovlarni batafsil ko'rib chiqish uchun siz batafsil statistikani ko'rsatishingiz mumkin.
SQL hisobot so'rovi
SELECT
1 AS 'Table: Top Error Requests',
REQ.REQUEST_NK AS 'Request',
'Error' AS 'Request Status',
ROUND(SUM(FCT.LINE_CNT) / 14.0, 1) AS 'Hits per Day',
ROUND(SUM(FCT.IP_CNT) / 14.0, 1) AS 'IPs per Day',
ROUND(SUM(FCT.BYTES)/1000 / 14.0, 1) AS 'KB per Day'
FROM
FCT_ACCESS_REQUEST_REF_HH FCT,
DIM_REQUEST_V_ACT REQ
WHERE FCT.DIM_REQUEST_ID = REQ.DIM_REQUEST_ID
AND FCT.STATUS_GROUP IN ('Client Error', 'Server Error')
AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY REQ.REQUEST_NK
ORDER BY 4 DESC
LIMIT 20
Ushbu ro'yxat shuningdek, barcha qo'ng'iroqlarni o'z ichiga oladi, masalan, /wp-login.php so'rovi Server tomonidan so'rovlarni qayta yozish qoidalarini o'zgartirish orqali siz serverning bunday so'rovlarga munosabatini sozlashingiz va ularni boshlang'ich sahifaga yuborishingiz mumkin.
Shunday qilib, server jurnali fayliga asoslangan bir nechta oddiy hisobotlar saytda sodir bo'layotgan voqealar haqida to'liq tasavvur beradi.
Ma'lumotni qanday olish mumkin?
Sqlite ma'lumotlar bazasi etarli. Jadvallarni yaratamiz: ETL jarayonlarini qayd qilish uchun yordamchi.
PHP yordamida jurnal fayllarini yozadigan jadval bosqichi. Ikki jamlangan jadval. Keling, foydalanuvchi agentlari va so'rov holatlari bo'yicha statistik ma'lumotlardan iborat kundalik jadval yarataylik. So'rovlar, status guruhlari va agentlar bo'yicha statistik ma'lumotlar bilan soatlik. Tegishli o'lchovlarning to'rtta jadvali.
Natijada quyidagi relyatsion model olinadi:
Ma'lumotlar modeli
Sqlite ma'lumotlar bazasida ob'ekt yaratish uchun skript:
DDL ob'ektini yaratish
DROP TABLE IF EXISTS DIM_USER_AGENT;
CREATE TABLE DIM_USER_AGENT (
DIM_USER_AGENT_ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
USER_AGENT_NK TEXT NOT NULL DEFAULT 'n.a.',
AGENT_OS TEXT NOT NULL DEFAULT 'n.a.',
AGENT_ENGINE TEXT NOT NULL DEFAULT 'n.a.',
AGENT_DEVICE TEXT NOT NULL DEFAULT 'n.a.',
AGENT_BOT TEXT NOT NULL DEFAULT 'n.a.',
UPDATE_DT INTEGER NOT NULL DEFAULT 0,
UNIQUE (USER_AGENT_NK)
);
INSERT INTO DIM_USER_AGENT (DIM_USER_AGENT_ID) VALUES (-1);
Bosqich
Access.log faylida ma'lumotlar bazasiga barcha so'rovlarni o'qish, tahlil qilish va yozish kerak. Bu to'g'ridan-to'g'ri skript tili yoki sqlite vositalari yordamida amalga oshirilishi mumkin.
Jurnal fayl formati:
//67.221.59.195 - - [28/Dec/2012:01:47:47 +0100] "GET /files/default.css HTTP/1.1" 200 1512 "https://project.edu/" "Mozilla/4.0"
//host ident auth time method request_nk protocol status bytes ref browser
$log_pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) ([[^]]+]) "(.*) (.*) (.*)" ([0-9-]+) ([0-9-]+) "(.*)" "(.*)"$/';
Kalitning tarqalishi
Xom ma'lumotlar ma'lumotlar bazasida bo'lganda, o'lchov jadvallariga u erda bo'lmagan kalitlarni yozishingiz kerak. Keyin o'lchovlarga mos yozuvlar qurish mumkin bo'ladi. Masalan, DIM_REFERRER jadvalida kalit uchta maydonning kombinatsiyasi hisoblanadi.
SQL kalitlarini tarqatish so'rovi
/* Propagate the referrer from access log */
INSERT INTO DIM_REFERRER (HOST_NK, PATH_NK, QUERY_NK, UPDATE_DT)
SELECT
CLS.HOST_NK,
CLS.PATH_NK,
CLS.QUERY_NK,
STRFTIME('%s','now') AS UPDATE_DT
FROM (
SELECT DISTINCT
REFERRER_HOST AS HOST_NK,
REFERRER_PATH AS PATH_NK,
CASE WHEN INSTR(REFERRER_QUERY,'&sid')>0 THEN SUBSTR(REFERRER_QUERY, 1, INSTR(REFERRER_QUERY,'&sid')-1) /* ΠΎΡΡΠ΅Π·Π°Π΅ΠΌ sid - ΡΠΏΠ΅ΡΠΈΡΠΈΠΊΠ° ΡΠΌΡ */
ELSE REFERRER_QUERY END AS QUERY_NK
FROM STG_ACCESS_LOG
) CLS
LEFT OUTER JOIN DIM_REFERRER TRG
ON (CLS.HOST_NK = TRG.HOST_NK AND CLS.PATH_NK = TRG.PATH_NK AND CLS.QUERY_NK = TRG.QUERY_NK)
WHERE TRG.DIM_REFERRER_ID IS NULL
Foydalanuvchi agenti jadvaliga tarqatish bot mantiqini o'z ichiga olishi mumkin, masalan, sql parchasi:
CASE
WHEN INSTR(LOWER(CLS.BROWSER),'yandex.com')>0
THEN 'yandex'
WHEN INSTR(LOWER(CLS.BROWSER),'googlebot')>0
THEN 'google'
WHEN INSTR(LOWER(CLS.BROWSER),'bingbot')>0
THEN 'microsoft'
WHEN INSTR(LOWER(CLS.BROWSER),'ahrefsbot')>0
THEN 'ahrefs'
WHEN INSTR(LOWER(CLS.BROWSER),'mj12bot')>0
THEN 'majestic-12'
WHEN INSTR(LOWER(CLS.BROWSER),'compatible')>0 OR INSTR(LOWER(CLS.BROWSER),'http')>0
OR INSTR(LOWER(CLS.BROWSER),'libwww')>0 OR INSTR(LOWER(CLS.BROWSER),'spider')>0
OR INSTR(LOWER(CLS.BROWSER),'java')>0 OR INSTR(LOWER(CLS.BROWSER),'python')>0
OR INSTR(LOWER(CLS.BROWSER),'robot')>0 OR INSTR(LOWER(CLS.BROWSER),'curl')>0
OR INSTR(LOWER(CLS.BROWSER),'wget')>0
THEN 'other'
ELSE 'n.a.' END AS AGENT_BOT
Jadvallarni jamlash
Va nihoyat, biz jamlangan jadvallarni yuklaymiz, masalan, kundalik jadvalni quyidagicha yuklash mumkin:
Agregatni yuklash uchun SQL so'rovi
/* Load fact from access log */
INSERT INTO FCT_ACCESS_USER_AGENT_DD (EVENT_DT, DIM_USER_AGENT_ID, DIM_HTTP_STATUS_ID, PAGE_CNT, FILE_CNT, REQUEST_CNT, LINE_CNT, IP_CNT, BYTES)
WITH STG AS (
SELECT
STRFTIME( '%s', SUBSTR(TIME_NK,9,4) || '-' ||
CASE SUBSTR(TIME_NK,5,3)
WHEN 'Jan' THEN '01' WHEN 'Feb' THEN '02' WHEN 'Mar' THEN '03' WHEN 'Apr' THEN '04' WHEN 'May' THEN '05' WHEN 'Jun' THEN '06'
WHEN 'Jul' THEN '07' WHEN 'Aug' THEN '08' WHEN 'Sep' THEN '09' WHEN 'Oct' THEN '10' WHEN 'Nov' THEN '11'
ELSE '12' END || '-' || SUBSTR(TIME_NK,2,2) || ' 00:00:00' ) AS EVENT_DT,
BROWSER AS USER_AGENT_NK,
REQUEST_NK,
IP_NR,
STATUS,
LINE_NK,
BYTES
FROM STG_ACCESS_LOG
)
SELECT
CAST(STG.EVENT_DT AS INTEGER) AS EVENT_DT,
USG.DIM_USER_AGENT_ID,
HST.DIM_HTTP_STATUS_ID,
COUNT(DISTINCT (CASE WHEN INSTR(STG.REQUEST_NK,'.')=0 THEN STG.REQUEST_NK END) ) AS PAGE_CNT,
COUNT(DISTINCT (CASE WHEN INSTR(STG.REQUEST_NK,'.')>0 THEN STG.REQUEST_NK END) ) AS FILE_CNT,
COUNT(DISTINCT STG.REQUEST_NK) AS REQUEST_CNT,
COUNT(DISTINCT STG.LINE_NK) AS LINE_CNT,
COUNT(DISTINCT STG.IP_NR) AS IP_CNT,
SUM(BYTES) AS BYTES
FROM STG,
DIM_HTTP_STATUS HST,
DIM_USER_AGENT USG
WHERE STG.STATUS = HST.STATUS_NK
AND STG.USER_AGENT_NK = USG.USER_AGENT_NK
AND CAST(STG.EVENT_DT AS INTEGER) > $param_epoch_from /* load epoch date */
AND CAST(STG.EVENT_DT AS INTEGER) < strftime('%s', date('now', 'start of day'))
GROUP BY STG.EVENT_DT, HST.DIM_HTTP_STATUS_ID, USG.DIM_USER_AGENT_ID
Sqlite ma'lumotlar bazasi murakkab so'rovlarni yozish imkonini beradi. WITH ma'lumotlar va kalitlarni tayyorlashni o'z ichiga oladi. Asosiy so'rov o'lchamlarga barcha havolalarni to'playdi.
Shart tarixni qayta yuklashga ruxsat bermaydi: CAST(STG.EVENT_DT AS INTEGER) > $param_epoch_from, bunda parametr soβrov natijasidir
'FCT_ACCESS_USER_AGENT_DDDAN SO'NGI_EVENT_DAVRAN OLIB COALESCE(MAX(EVENT_DT), '3600') TANLANING'
Shart faqat butun kunni yuklaydi: CAST(STG.EVENT_DT AS INTEGER) < strftime('%s', sana('hozir', 'kun boshi'))
Sahifalar yoki fayllarni hisoblash ibtidoiy usulda, nuqtani qidirish orqali amalga oshiriladi.
Hisobotlar
Murakkab vizualizatsiya tizimlarida ma'lumotlar bazasi ob'ektlari asosida metamodel yaratish, filtrlarni va yig'ish qoidalarini dinamik ravishda boshqarish mumkin. Oxir-oqibat, barcha munosib vositalar SQL so'rovini yaratadi.
Ushbu misolda biz tayyor SQL so'rovlarini yaratamiz va ularni ma'lumotlar bazasida ko'rinish sifatida saqlaymiz - bu hisobotlar.
Vizualizatsiya
Bluff: JavaScript-dagi chiroyli grafiklar vizualizatsiya vositasi sifatida ishlatilgan
Buning uchun PHP yordamida barcha hisobotlarni ko'rib chiqish va jadvallar bilan html faylni yaratish kerak edi.
$sqls = array(
'SELECT * FROM RPT_ACCESS_USER_VS_BOT',
'SELECT * FROM RPT_ACCESS_ANNOYING_BOT',
'SELECT * FROM RPT_ACCESS_TOP_HOUR_HIT',
'SELECT * FROM RPT_ACCESS_USER_ACTIVE',
'SELECT * FROM RPT_ACCESS_REQUEST_STATUS',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_PAGE',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_REFERRER',
'SELECT * FROM RPT_ACCESS_NEW_REQUEST',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_SUCCESS',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_ERROR'
);
Asbob shunchaki natijalar jadvallarini ingl.
xulosa
Misol sifatida veb-tahlildan foydalangan holda, maqolada ma'lumotlar omborlarini qurish uchun zarur bo'lgan mexanizmlar tasvirlangan. Natijalardan ko'rinib turibdiki, ma'lumotlarni chuqur tahlil qilish va vizualizatsiya qilish uchun eng oddiy vositalar etarli.
Kelajakda ushbu ombordan misol sifatida foydalanib, biz asta-sekin o'zgaruvchan o'lchamlar, metama'lumotlar, yig'ish darajalari va turli manbalardan olingan ma'lumotlarning integratsiyasi kabi tuzilmalarni amalga oshirishga harakat qilamiz.
Bundan tashqari, keling, bitta jadval asosida ETL jarayonlarini boshqarish uchun eng oddiy vositani batafsil ko'rib chiqaylik.
Keling, ma'lumotlar sifatini o'lchash va bu jarayonni avtomatlashtirish mavzusiga qaytaylik.
Biz texnik muhit va ma'lumotlar omboriga xizmat ko'rsatish muammolarini o'rganamiz, buning uchun biz minimal resurslarga ega saqlash serverini, masalan, Raspberry Pi-ga asoslangan holda amalga oshiramiz.
Manba: www.habr.com