Ƙididdiga na rukunin yanar gizon da ƙananan ma'ajiyar ku

Webalizer da Google Analytics sun taimaka mini in sami haske game da abubuwan da ke faruwa a gidajen yanar gizon shekaru da yawa. Yanzu na fahimci cewa suna ba da bayanai masu amfani kaɗan kaɗan. Samun damar shiga fayil ɗin access.log ɗin ku, yana da sauƙin fahimtar ƙididdiga da aiwatar da ainihin kayan aikin, kamar su sqlite, html, sql harshe da kowane harshe shirye-shirye na rubutun.

Tushen bayanai na Webalizer shine fayil access.log na uwar garken. Wannan shine yadda sandunansa da lambobi suka yi kama, wanda kawai jimlar yawan zirga-zirgar ababen hawa ke bayyana:

Ƙididdiga na rukunin yanar gizon da ƙananan ma'ajiyar ku
Ƙididdiga na rukunin yanar gizon da ƙananan ma'ajiyar ku
Kayan aiki kamar Google Analytics suna tattara bayanai daga shafin da aka ɗora da kansu. Suna nuna mana zane-zane guda biyu da layi, bisa ga abin da sau da yawa yana da wahala a yanke hukunci daidai. Wataƙila ya kamata a yi ƙarin ƙoƙari? Ban sani ba.

Don haka, menene nake so in gani a cikin kididdigar maziyartan gidan yanar gizon?

Mai amfani da zirga-zirgar bot

Sau da yawa zirga-zirgar wuraren yana iyakance kuma yana da mahimmanci don ganin yawan amfanin zirga-zirgar da ake amfani da shi. Misali, kamar haka:

Ƙididdiga na rukunin yanar gizon da ƙananan ma'ajiyar ku

Tambayar rahoton SQL

SELECT
1 as 'StackedArea: Traffic generated by Users and Bots',
strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Day',
SUM(CASE WHEN USG.AGENT_BOT!='n.a.' THEN FCT.BYTES ELSE 0 END)/1000 AS 'Bots, KB',
SUM(CASE WHEN USG.AGENT_BOT='n.a.' THEN FCT.BYTES ELSE 0 END)/1000 AS 'Users, KB'
FROM
  FCT_ACCESS_USER_AGENT_DD FCT,
  DIM_USER_AGENT USG
WHERE FCT.DIM_USER_AGENT_ID=USG.DIM_USER_AGENT_ID
  AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT

Jadawalin yana nuna yawan ayyukan bots. Zai zama mai ban sha'awa don nazarin daki-daki da wakilai mafi yawan aiki.

Bots mai ban haushi

Muna rarraba bots bisa bayanin wakilin mai amfani. Ƙarin ƙididdiga akan zirga-zirgar yau da kullun, adadin nasara da buƙatun da ba su yi nasara ba suna ba da kyakkyawan ra'ayi na ayyukan bot.

Ƙididdiga na rukunin yanar gizon da ƙananan ma'ajiyar ku

Tambayar rahoton SQL

SELECT 
1 AS 'Table: Annoying Bots',
MAX(USG.AGENT_BOT) AS 'Bot',
ROUND(SUM(FCT.BYTES)/1000 / 14.0, 1) AS 'KB per Day',
ROUND(SUM(FCT.IP_CNT) / 14.0, 1) AS 'IPs per Day',
ROUND(SUM(CASE WHEN STS.STATUS_GROUP IN ('Client Error', 'Server Error') THEN FCT.REQUEST_CNT / 14.0 ELSE 0 END), 1) AS 'Error Requests per Day',
ROUND(SUM(CASE WHEN STS.STATUS_GROUP IN ('Successful', 'Redirection') THEN FCT.REQUEST_CNT / 14.0 ELSE 0 END), 1) AS 'Success Requests per Day',
USG.USER_AGENT_NK AS 'Agent'
FROM FCT_ACCESS_USER_AGENT_DD FCT,
     DIM_USER_AGENT USG,
     DIM_HTTP_STATUS STS
WHERE FCT.DIM_USER_AGENT_ID = USG.DIM_USER_AGENT_ID
  AND FCT.DIM_HTTP_STATUS_ID = STS.DIM_HTTP_STATUS_ID
  AND USG.AGENT_BOT != 'n.a.'
  AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY USG.USER_AGENT_NK
ORDER BY 3 DESC
LIMIT 10

A wannan yanayin, sakamakon binciken shine yanke shawarar hana shiga shafin ta ƙara shi zuwa fayil ɗin robots.txt.

User-agent: AhrefsBot
Disallow: /
User-agent: dotbot
Disallow: /
User-agent: bingbot
Crawl-delay: 5

Bots biyu na farko sun bace daga tebur, kuma robots na MS sun sauko daga layin farko.

Rana da lokacin mafi girman aiki

Ana iya ganin abubuwan haɓakawa a cikin zirga-zirga. Don nazarin su daki-daki, wajibi ne a nuna lokacin da suka faru, kuma ba lallai ba ne a nuna duk sa'o'i da kwanakin lokacin ma'auni. Wannan zai sauƙaƙa samun buƙatun mutum ɗaya a cikin fayil ɗin log ɗin idan ana buƙatar cikakken bincike.

Ƙididdiga na rukunin yanar gizon da ƙananan ma'ajiyar ku

Tambayar rahoton SQL

SELECT
1 AS 'Line: Day and Hour of Hits from Users and Bots',
strftime('%d.%m-%H', datetime(EVENT_DT, 'unixepoch')) AS 'Date Time',
HIB AS 'Bots, Hits',
HIU AS 'Users, Hits'
FROM (
	SELECT
	EVENT_DT,
	SUM(CASE WHEN AGENT_BOT!='n.a.' THEN LINE_CNT ELSE 0 END) AS HIB,
	SUM(CASE WHEN AGENT_BOT='n.a.' THEN LINE_CNT ELSE 0 END) AS HIU
	FROM FCT_ACCESS_REQUEST_REF_HH
	WHERE datetime(EVENT_DT, 'unixepoch') >= date('now', '-14 day')
	GROUP BY EVENT_DT
	ORDER BY SUM(LINE_CNT) DESC
	LIMIT 10
) ORDER BY EVENT_DT

Muna kiyaye mafi yawan awoyi 11, 14 da 20 na ranar farko akan ginshiƙi. Amma washegari da karfe 13:XNUMX na bots suna aiki.

Matsakaicin ayyukan mai amfani na yau da kullun ta mako

Mun warware abubuwa kadan tare da aiki da zirga-zirga. Tambaya ta gaba ita ce ayyukan masu amfani da kansu. Don irin waɗannan ƙididdiga, tsawon lokacin tarawa, kamar mako guda, yana da kyawawa.

Ƙididdiga na rukunin yanar gizon da ƙananan ma'ajiyar ku

Tambayar rahoton SQL

SELECT
1 as 'Line: Average Daily User Activity by Week',
strftime('%W week', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Week',
ROUND(1.0*SUM(FCT.PAGE_CNT)/SUM(FCT.IP_CNT),1) AS 'Pages per IP per Day',
ROUND(1.0*SUM(FCT.FILE_CNT)/SUM(FCT.IP_CNT),1) AS 'Files per IP per Day'
FROM
  FCT_ACCESS_USER_AGENT_DD FCT,
  DIM_USER_AGENT USG,
  DIM_HTTP_STATUS HST
WHERE FCT.DIM_USER_AGENT_ID=USG.DIM_USER_AGENT_ID
  AND FCT.DIM_HTTP_STATUS_ID = HST.DIM_HTTP_STATUS_ID
  AND USG.AGENT_BOT='n.a.' /* users only */
  AND HST.STATUS_GROUP IN ('Successful') /* good pages */
  AND datetime(FCT.EVENT_DT, 'unixepoch') > date('now', '-3 month')
GROUP BY strftime('%W week', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT

Kididdigar mako-mako ta nuna cewa a matsakaita mai amfani daya yana buɗe shafuka 1,6 a kowace rana. Adadin fayilolin da ake buƙata ga kowane mai amfani a wannan yanayin ya dogara da ƙarin sabbin fayiloli zuwa rukunin yanar gizon.

Duk buƙatun da matsayinsu

Webalizer koyaushe yana nuna takamaiman lambobin shafi kuma koyaushe ina son ganin adadin buƙatun nasara da kurakurai koyaushe.

Ƙididdiga na rukunin yanar gizon da ƙananan ma'ajiyar ku

Tambayar rahoton SQL

SELECT
1 as 'Line: All Requests by Status',
strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Day',
SUM(CASE WHEN STS.STATUS_GROUP='Successful' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Success',
SUM(CASE WHEN STS.STATUS_GROUP='Redirection' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Redirect',
SUM(CASE WHEN STS.STATUS_GROUP='Client Error' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Customer Error',
SUM(CASE WHEN STS.STATUS_GROUP='Server Error' THEN FCT.REQUEST_CNT ELSE 0 END) AS 'Server Error'
FROM
  FCT_ACCESS_USER_AGENT_DD FCT,
  DIM_HTTP_STATUS STS
WHERE FCT.DIM_HTTP_STATUS_ID=STS.DIM_HTTP_STATUS_ID
  AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY strftime('%d.%m', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT

Rahoton yana nuna buƙatun, ba dannawa ba, sabanin LINE_CNT, ana ƙididdige ma'aunin REQUEST_CNT azaman COUNT(DISTINCT STG.REQUEST_NK). Manufar ita ce a nuna abubuwan da suka faru masu tasiri, alal misali, MS bots suna jefa kuri'a na robots.txt fayil daruruwan sau a rana kuma, a wannan yanayin, za a ƙidaya irin waɗannan zaɓen sau ɗaya. Wannan yana ba ku damar daidaita tsalle a cikin jadawali.

Daga cikin jadawali zaka iya ganin kurakurai da yawa - waɗannan shafuka ba su wanzu. Sakamakon binciken shine ƙari na turawa daga shafuka masu nisa.

Mummunan buƙatun

Don bincika buƙatun daki-daki, zaku iya nuna cikakken kididdiga.

Ƙididdiga na rukunin yanar gizon da ƙananan ma'ajiyar ku

Tambayar rahoton SQL

SELECT
  1 AS 'Table: Top Error Requests',
  REQ.REQUEST_NK AS 'Request',
  'Error' AS 'Request Status',
  ROUND(SUM(FCT.LINE_CNT) / 14.0, 1) AS 'Hits per Day',
  ROUND(SUM(FCT.IP_CNT) / 14.0, 1) AS 'IPs per Day',
  ROUND(SUM(FCT.BYTES)/1000 / 14.0, 1) AS 'KB per Day'
FROM
  FCT_ACCESS_REQUEST_REF_HH FCT,
  DIM_REQUEST_V_ACT REQ
WHERE FCT.DIM_REQUEST_ID = REQ.DIM_REQUEST_ID
  AND FCT.STATUS_GROUP IN ('Client Error', 'Server Error')
  AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-14 day')
GROUP BY REQ.REQUEST_NK
ORDER BY 4 DESC
LIMIT 20

Wannan jeri kuma zai ƙunshi duk kira, misali, buƙatun zuwa /wp-login.php Ta hanyar daidaita ƙa'idodin sake rubuta buƙatun ta uwar garken, zaku iya daidaita martanin uwar garken ga irin waɗannan buƙatun kuma aika su zuwa shafin farawa.

Don haka, kaɗan kaɗan rahotanni masu sauƙi dangane da fayil ɗin log ɗin uwar garken suna ba da cikakken hoto na abin da ke faruwa akan rukunin yanar gizon.

Yadda ake samun bayanai?

Mahimman bayanai na sqlite ya isa. Bari mu ƙirƙiri teburi: mataimaki don shiga ayyukan ETL.

Ƙididdiga na rukunin yanar gizon da ƙananan ma'ajiyar ku

Matakin tebur inda za mu rubuta fayilolin log ta amfani da PHP. Tebura tara guda biyu. Bari mu ƙirƙiri tebur na yau da kullun tare da ƙididdiga kan wakilai masu amfani da neman matsayi. Sa'a tare da ƙididdiga akan buƙatun, ƙungiyoyin matsayi da wakilai. Tebura huɗu na ma'auni masu dacewa.

Sakamakon shine samfurin alaƙa mai zuwa:

Samfurin bayanaiƘididdiga na rukunin yanar gizon da ƙananan ma'ajiyar ku

Rubutun don ƙirƙirar abu a cikin bayanan sqlite:

Halittar abu DDL

DROP TABLE IF EXISTS DIM_USER_AGENT;
CREATE TABLE DIM_USER_AGENT (
  DIM_USER_AGENT_ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
  USER_AGENT_NK     TEXT NOT NULL DEFAULT 'n.a.',
  AGENT_OS          TEXT NOT NULL DEFAULT 'n.a.',
  AGENT_ENGINE      TEXT NOT NULL DEFAULT 'n.a.',
  AGENT_DEVICE      TEXT NOT NULL DEFAULT 'n.a.',
  AGENT_BOT         TEXT NOT NULL DEFAULT 'n.a.',
  UPDATE_DT         INTEGER NOT NULL DEFAULT 0,
  UNIQUE (USER_AGENT_NK)
);
INSERT INTO DIM_USER_AGENT (DIM_USER_AGENT_ID) VALUES (-1);

Mataki

A cikin yanayin fayil access.log, yana da mahimmanci don karantawa, rarrabawa da rubuta duk buƙatun zuwa bayanan bayanai. Ana iya yin wannan ko dai kai tsaye ta amfani da yaren rubutun ko ta amfani da kayan aikin sqlite.

Tsarin fayil ɗin log:

//67.221.59.195 - - [28/Dec/2012:01:47:47 +0100] "GET /files/default.css HTTP/1.1" 200 1512 "https://project.edu/" "Mozilla/4.0"
//host ident auth time method request_nk protocol status bytes ref browser
$log_pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) ([[^]]+]) "(.*) (.*) (.*)" ([0-9-]+) ([0-9-]+) "(.*)" "(.*)"$/';

Mabuɗin yaɗawa

Lokacin da danyen bayanan ke cikin ma'ajin bayanai, kuna buƙatar rubuta maɓallan da babu su a cikin ma'auni. Sa'an nan kuma zai yiwu a gina tunani zuwa ma'auni. Misali, a cikin tebur na DIM_REFERRER, mabuɗin shine haɗewar filayen guda uku.

Tambayar yada maɓallin SQL

/* Propagate the referrer from access log */
INSERT INTO DIM_REFERRER (HOST_NK, PATH_NK, QUERY_NK, UPDATE_DT)
SELECT
	CLS.HOST_NK,
	CLS.PATH_NK,
	CLS.QUERY_NK,
	STRFTIME('%s','now') AS UPDATE_DT
FROM (
	SELECT DISTINCT
	REFERRER_HOST AS HOST_NK,
	REFERRER_PATH AS PATH_NK,
	CASE WHEN INSTR(REFERRER_QUERY,'&sid')>0 THEN SUBSTR(REFERRER_QUERY, 1, INSTR(REFERRER_QUERY,'&sid')-1) /* отрезаем sid - специфика цмс */
	ELSE REFERRER_QUERY END AS QUERY_NK
	FROM STG_ACCESS_LOG
) CLS
LEFT OUTER JOIN DIM_REFERRER TRG
ON (CLS.HOST_NK = TRG.HOST_NK AND CLS.PATH_NK = TRG.PATH_NK AND CLS.QUERY_NK = TRG.QUERY_NK)
WHERE TRG.DIM_REFERRER_ID IS NULL

Yadawa zuwa teburin wakilin mai amfani na iya ƙunsar dabaru na bot, misali snippet sql:


CASE
WHEN INSTR(LOWER(CLS.BROWSER),'yandex.com')>0
	THEN 'yandex'
WHEN INSTR(LOWER(CLS.BROWSER),'googlebot')>0
	THEN 'google'
WHEN INSTR(LOWER(CLS.BROWSER),'bingbot')>0
	THEN 'microsoft'
WHEN INSTR(LOWER(CLS.BROWSER),'ahrefsbot')>0
	THEN 'ahrefs'
WHEN INSTR(LOWER(CLS.BROWSER),'mj12bot')>0
	THEN 'majestic-12'
WHEN INSTR(LOWER(CLS.BROWSER),'compatible')>0 OR INSTR(LOWER(CLS.BROWSER),'http')>0
	OR INSTR(LOWER(CLS.BROWSER),'libwww')>0 OR INSTR(LOWER(CLS.BROWSER),'spider')>0
	OR INSTR(LOWER(CLS.BROWSER),'java')>0 OR INSTR(LOWER(CLS.BROWSER),'python')>0
	OR INSTR(LOWER(CLS.BROWSER),'robot')>0 OR INSTR(LOWER(CLS.BROWSER),'curl')>0
	OR INSTR(LOWER(CLS.BROWSER),'wget')>0
	THEN 'other'
ELSE 'n.a.' END AS AGENT_BOT

Tara teburi

A }arshe, za mu ɗora jimlar Tables; misali, ana iya loda teburin yau da kullum kamar haka:

Tambayar SQL don ɗaukar jimlar

/* Load fact from access log */
INSERT INTO FCT_ACCESS_USER_AGENT_DD (EVENT_DT, DIM_USER_AGENT_ID, DIM_HTTP_STATUS_ID, PAGE_CNT, FILE_CNT, REQUEST_CNT, LINE_CNT, IP_CNT, BYTES)
WITH STG AS (
SELECT
	STRFTIME( '%s', SUBSTR(TIME_NK,9,4) || '-' ||
	CASE SUBSTR(TIME_NK,5,3)
	WHEN 'Jan' THEN '01' WHEN 'Feb' THEN '02' WHEN 'Mar' THEN '03' WHEN 'Apr' THEN '04' WHEN 'May' THEN '05' WHEN 'Jun' THEN '06'
	WHEN 'Jul' THEN '07' WHEN 'Aug' THEN '08' WHEN 'Sep' THEN '09' WHEN 'Oct' THEN '10' WHEN 'Nov' THEN '11'
	ELSE '12' END || '-' || SUBSTR(TIME_NK,2,2) || ' 00:00:00' ) AS EVENT_DT,
	BROWSER AS USER_AGENT_NK,
	REQUEST_NK,
	IP_NR,
	STATUS,
	LINE_NK,
	BYTES
FROM STG_ACCESS_LOG
)
SELECT
	CAST(STG.EVENT_DT AS INTEGER) AS EVENT_DT,
	USG.DIM_USER_AGENT_ID,
	HST.DIM_HTTP_STATUS_ID,
	COUNT(DISTINCT (CASE WHEN INSTR(STG.REQUEST_NK,'.')=0 THEN STG.REQUEST_NK END) ) AS PAGE_CNT,
	COUNT(DISTINCT (CASE WHEN INSTR(STG.REQUEST_NK,'.')>0 THEN STG.REQUEST_NK END) ) AS FILE_CNT,
	COUNT(DISTINCT STG.REQUEST_NK) AS REQUEST_CNT,
	COUNT(DISTINCT STG.LINE_NK) AS LINE_CNT,
	COUNT(DISTINCT STG.IP_NR) AS IP_CNT,
	SUM(BYTES) AS BYTES
FROM STG,
	DIM_HTTP_STATUS HST,
	DIM_USER_AGENT USG
WHERE STG.STATUS = HST.STATUS_NK
  AND STG.USER_AGENT_NK = USG.USER_AGENT_NK
  AND CAST(STG.EVENT_DT AS INTEGER) > $param_epoch_from /* load epoch date */
  AND CAST(STG.EVENT_DT AS INTEGER) < strftime('%s', date('now', 'start of day'))
GROUP BY STG.EVENT_DT, HST.DIM_HTTP_STATUS_ID, USG.DIM_USER_AGENT_ID

Rubutun sqlite yana ba ku damar rubuta hadaddun tambayoyi. WITH ya ƙunshi shirye-shiryen bayanai da maɓallai. Babban tambayar tana tattara duk nassoshi zuwa girma.

Yanayin ba zai bari a sake loda tarihin ba: CAST(STG.EVENT_DT AS INTEGER)> $param_epoch_from, inda ma'aunin shine sakamakon buƙatun.
'Zabi COALESCE(MAX(EVENT_DT),'3600') AS LAST_EVENT_EPOCH DAGA FCT_ACCESS_USER_AGENT_DD'

Yanayin zai yi lodin cikakken ranar ne kawai: CAST(STG.EVENT_DT AS INTEGER) <strftime('%s', date('yanzu','farawar rana')))

Ana aiwatar da ƙidayar shafuka ko fayiloli ta hanyar da ta dace, ta hanyar neman batu.

Rahotanni

A cikin hadaddun tsarin gani, yana yiwuwa a ƙirƙiri samfurin meta bisa tushen bayanai, sarrafa matattara da ƙarfi da ƙa'idodin tarawa. A ƙarshe, duk kayan aikin da suka dace suna haifar da tambayar SQL.

A cikin wannan misalin, za mu ƙirƙiri shirye-shiryen tambayoyin SQL kuma mu adana su azaman ra'ayi a cikin bayanan - waɗannan rahotanni ne.

Nunawa

Bluff: Kyawawan zane-zane a JavaScript an yi amfani da su azaman kayan aikin gani

Don yin wannan, ya zama dole don shiga cikin duk rahotanni ta amfani da PHP kuma ƙirƙirar fayil ɗin html tare da tebur.

$sqls = array(
'SELECT * FROM RPT_ACCESS_USER_VS_BOT',
'SELECT * FROM RPT_ACCESS_ANNOYING_BOT',
'SELECT * FROM RPT_ACCESS_TOP_HOUR_HIT',
'SELECT * FROM RPT_ACCESS_USER_ACTIVE',
'SELECT * FROM RPT_ACCESS_REQUEST_STATUS',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_PAGE',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_REFERRER',
'SELECT * FROM RPT_ACCESS_NEW_REQUEST',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_SUCCESS',
'SELECT * FROM RPT_ACCESS_TOP_REQUEST_ERROR'
);

Kayan aiki kawai yana hango allunan sakamako.

ƙarshe

Yin amfani da nazarin yanar gizo a matsayin misali, labarin ya bayyana hanyoyin da ake bukata don gina ɗakunan ajiya na bayanai. Kamar yadda ake iya gani daga sakamakon, kayan aiki mafi sauƙi sun isa don bincike mai zurfi da hangen nesa na bayanai.

A nan gaba, ta yin amfani da wannan ma'ajiyar a matsayin misali, za mu yi ƙoƙari mu aiwatar da irin waɗannan tsare-tsare a hankali a hankali canza girma, metadata, matakan tarawa da haɗakar bayanai daga tushe daban-daban.

Har ila yau, bari mu dubi kayan aiki mafi sauƙi don sarrafa hanyoyin ETL bisa tebur guda.

Bari mu koma kan batun auna ingancin bayanai da sarrafa sarrafa wannan tsari.

Za mu yi nazarin matsalolin yanayin fasaha da kuma kula da ajiyar bayanai, wanda za mu aiwatar da uwar garken ajiya tare da ƙananan albarkatu, misali, dangane da Raspberry Pi.

source: www.habr.com

Add a comment