透過分析站點統計數據,我們可以了解它發生了什麼。 我們將結果與其他有關產品或服務的知識進行比較,從而改善我們的體驗。
當第一個結果的分析完成,資訊已經被理解並得出結論時,下一階段就開始了。 想法出現:如果你從另一邊看數據會發生什麼事?
現階段的分析工具仍有其限制。 這就是 Google Analytics 對我來說不夠的原因之一,即查看和操作我的資料的能力有限。
我一直想快速加載基本資料(主資料),添加另一個聚合級別,或以不同的方式解釋現有值。
這很容易做到 基於 access.log 檔案和 SQL 語言就足夠了。
那麼,我想要回答哪些問題呢?
網站上的內容和時間發生了變化
基礎資料(主資料)的變化歷史總是令人感興趣。

SQL報表查詢
SELECT
1 as 'SideStackedBar: Content Updates by Months',
strftime('%m/%Y', datetime(UPDATE_DT, 'unixepoch')) AS 'Day',
COUNT(CASE WHEN PAGE_TITLE != 'n.a.' THEN DIM_REQUEST_ID END) AS 'Web page updates',
COUNT(CASE WHEN PAGE_DESCR = 'IMAGES' THEN DIM_REQUEST_ID END) AS 'Image uploads',
COUNT(CASE WHEN PAGE_DESCR = 'VIDEO' THEN DIM_REQUEST_ID END) AS 'Video uploads',
COUNT(CASE WHEN PAGE_DESCR = 'AUDIO' THEN DIM_REQUEST_ID END) AS 'Audio uploads'
FROM DIM_REQUEST
WHERE PAGE_TITLE != 'n.a.' OR PAGE_DESCR != 'n.a.'
GROUP BY strftime('%m/%Y', datetime(UPDATE_DT, 'unixepoch'))
ORDER BY UPDATE_DT例如,在某些時候,進行了搜尋引擎優化或為網站添加了新內容,因此預計流量會增加。
用戶組
群組最簡單的範例是使用者代理程式或作業系統名稱。
使用者代理維度已經累積了大約一千筆記錄,我有興趣了解群組內代理分佈的動態。

SQL報表查詢
SELECT
1 AS 'SideStackedBar: User Agents',
AGENT_OS AS 'OS',
SUM(CASE WHEN AGENT_BOT = 'n.a.' THEN 1 ELSE 0 END ) AS 'User Agent of Users',
SUM(CASE WHEN AGENT_BOT != 'n.a.' THEN 1 ELSE 0 END ) AS 'User Agent of Bots'
FROM DIM_USER_AGENT
WHERE DIM_USER_AGENT_ID != -1
GROUP BY AGENT_OS
ORDER BY 3 DESC來自世界各地的代理商組合數量最多。 Windows未定義的項目包括 WhatsApp、PocketImageCache、PlayStation、SmartTV 等。
按週劃分的使用者群組活動
透過組合一些組,您可以觀察他們的活動分佈。
例如,叢集用戶 Linux 該網站的流量消耗量高於其他所有網站。

SQL報表查詢
SELECT
1 as 'StackedBar: Traffic Volume by User OS and by Week',
strftime('%W week', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Week',
SUM(CASE WHEN USG.AGENT_OS IN ('Android', 'Linux') THEN FCT.BYTES ELSE 0 END)/1000 AS 'Android/Linux Users',
SUM(CASE WHEN USG.AGENT_OS IN ('Windows') THEN FCT.BYTES ELSE 0 END)/1000 AS 'Windows Users',
SUM(CASE WHEN USG.AGENT_OS IN ('Macintosh', 'iOS') THEN FCT.BYTES ELSE 0 END)/1000 AS 'Mac/iOS Users',
SUM(CASE WHEN USG.AGENT_OS IN ('n.a.', 'BlackBerry') THEN FCT.BYTES ELSE 0 END)/1000 AS 'Other'
FROM
FCT_ACCESS_USER_AGENT_DD FCT,
DIM_USER_AGENT USG,
DIM_HTTP_STATUS HST
WHERE FCT.DIM_USER_AGENT_ID=USG.DIM_USER_AGENT_ID
AND FCT.DIM_HTTP_STATUS_ID = HST.DIM_HTTP_STATUS_ID
AND USG.AGENT_BOT = 'n.a.' /* users only */
AND HST.STATUS_GROUP IN ('Successful') /* good pages */
AND datetime(FCT.EVENT_DT, 'unixepoch') > date('now', '-3 month')
GROUP BY strftime('%W week', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY FCT.EVENT_DT密集流量消耗
表格顯示了最活躍的用戶群組及其活動日期。
最活躍的那些與…有關 Linux 簇。

SQL報表查詢
SELECT
1 AS 'Table: User Agent with Havy Usage',
strftime('%d.%m.%Y', datetime(FCT.EVENT_DT, 'unixepoch')) AS 'Day',
ROUND(1.0*SUM(FCT.BYTES)/1000000, 1) AS 'Traffic MB',
ROUND(1.0*SUM(FCT.IP_CNT)/SUM(1), 1) AS 'IPs',
ROUND(1.0*SUM(FCT.REQUEST_CNT)/SUM(1), 1) AS 'Requests',
USA.DIM_USER_AGENT_ID AS 'ID',
MAX(USA.USER_AGENT_NK) AS 'User Agent',
MAX(USA.AGENT_BOT) AS 'Bot'
FROM
FCT_ACCESS_USER_AGENT_DD FCT,
DIM_USER_AGENT USA
WHERE FCT.DIM_USER_AGENT_ID = USA.DIM_USER_AGENT_ID
AND datetime(FCT.EVENT_DT, 'unixepoch') >= date('now', '-30 day')
GROUP BY USA.DIM_USER_AGENT_ID, strftime('%d.%m.%Y', datetime(FCT.EVENT_DT, 'unixepoch'))
ORDER BY SUM(FCT.BYTES) DESC, FCT.EVENT_DT
LIMIT 10使用日期和座席 ID 屬性,可以快速尋找和追蹤各個使用者群組的日期統計資料。 如果需要,您可以在階段表中快速找到詳細資訊。
如何獲取資訊?
透過整合其他資料來源並引入新層級的聚合和分組,可以提高效率。
基礎資料和實體
基本資料包括有關實體的資訊:網頁、圖片、影片和音訊內容,對於商店來說是產品。
實體本身充當測量,儲存屬性變更的過程稱為歷史化。 在資料庫中,這個過程通常以緩慢變化維度(SCD)的形式實現。
來源資料可以來自各種系統,因此它們幾乎總是需要整合。
緩慢變化的維度
DIM_REQUEST 維度將包含有關網站上歷史形式的請求的資訊。
表SCD2
CREATE TABLE DIM_REQUEST ( /* scd table for user requests */
DIM_REQUEST_ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
DIM_REQUEST_ID_HIST INTEGER NOT NULL DEFAULT -1,
REQUEST_NK TEXT NOT NULL DEFAULT 'n.a.', /* request without ?parameters */
PAGE_TITLE TEXT NOT NULL DEFAULT 'n.a.',
PAGE_DESCR TEXT NOT NULL DEFAULT 'n.a.',
PAGE_KEYWORDS TEXT NOT NULL DEFAULT 'n.a.',
DELETE_FLAG INTEGER NOT NULL DEFAULT 0,
UPDATE_DT INTEGER NOT NULL DEFAULT 0,
UNIQUE (REQUEST_NK, DIM_REQUEST_ID_HIST)
);
INSERT INTO DIM_REQUEST (DIM_REQUEST_ID) VALUES (-1);除此之外,我們將建立一個始終顯示最新狀態的所有記錄的視圖。 加載測量本身所必需的。

SCD2 的當前表示
/* Content: actual view on scd table */
SELECT HI.DIM_REQUEST_ID,
HI.DIM_REQUEST_ID_HIST,
HI.REQUEST_NK,
HI.PAGE_TITLE,
HI.PAGE_DESCR,
HI.PAGE_KEYWORDS,
NK.CNT AS HIST_CNT,
HI.DELETE_FLAG,
strftime('%d.%m.%Y %H:%M', datetime(HI.UPDATE_DT, 'unixepoch')) AS UPDATE_DT
FROM
( SELECT REQUEST_NK, MAX(DIM_REQUEST_ID) AS DIM_REQUEST_ID, SUM(1) AS CNT
FROM DIM_REQUEST
GROUP BY REQUEST_NK
) NK,
DIM_REQUEST HI
WHERE 1 = 1
AND NK.REQUEST_NK = HI.REQUEST_NK
AND NK.DIM_REQUEST_ID = HI.DIM_REQUEST_ID;以及收集每個條目的歷史資訊的視圖。 必須與事實建立正確的歷史連結。

SCD2的歷史介紹
/* Content: actual view on scd table */
SELECT SCD.DIM_REQUEST_ID,
SCD.DIM_REQUEST_ID_HIST,
SCD.REQUEST_NK,
SCD.PAGE_TITLE,
SCD.PAGE_DESCR,
SCD.PAGE_KEYWORDS,
SCD.DELETE_FLAG,
CASE
WHEN HIS.UPDATE_DT IS NULL
THEN 1
ELSE 0 END ACTIVE_FLAG,
SCD.DIM_REQUEST_ID_HIST AS ID_FROM,
SCD.DIM_REQUEST_ID AS ID_TO,
CASE
WHEN SCD.DIM_REQUEST_ID_HIST=-1
THEN 3600
ELSE IFNULL(SCD.UPDATE_DT,3600)
END AS TIME_FROM,
CASE
WHEN HIS.UPDATE_DT IS NULL
THEN 253370764800
ELSE HIS.UPDATE_DT
END AS TIME_TO,
CASE
WHEN SCD.DIM_REQUEST_ID_HIST=-1
THEN STRFTIME('%d.%m.%Y %H:%M', DATETIME(3600, 'unixepoch'))
ELSE STRFTIME('%d.%m.%Y %H:%M', DATETIME(IFNULL(SCD.UPDATE_DT,3600), 'unixepoch'))
END AS ACTIVE_FROM,
CASE
WHEN HIS.UPDATE_DT IS NULL
THEN STRFTIME('%d.%m.%Y %H:%M', DATETIME(253370764800, 'unixepoch'))
ELSE STRFTIME('%d.%m.%Y %H:%M', DATETIME(HIS.UPDATE_DT, 'unixepoch'))
END AS ACTIVE_TO
FROM
DIM_REQUEST SCD
LEFT OUTER JOIN DIM_REQUEST HIS
ON SCD.REQUEST_NK = HIS.REQUEST_NK AND SCD.DIM_REQUEST_ID = HIS.DIM_REQUEST_ID_HIST;資料聚合
壓縮(聚合)可讓您在更高層級評估資料並檢測詳細報告中不可見的異常和趨勢。
例如,將一個群組新增至具有請求狀態代碼 DIM_HTTP_STATUS 的維度:
狀態/群組
0xx/na
1xx/信息
2xx/成功
3xx/重定向
4xx/客戶端錯誤
5xx/伺服器錯誤
使用者代理維度 DIM_USER_AGENT 將包含負責群組的 AGENT_OS 和 AGENT_BOT 屬性。 這些可以在 ETL 過程中填充:
正在載入 DIM_USER_AGENT
/* Propagate the user agent from access log */
INSERT INTO DIM_USER_AGENT (USER_AGENT_NK, AGENT_OS, AGENT_ENGINE, AGENT_DEVICE, AGENT_BOT, UPDATE_DT)
WITH CLS AS (
SELECT BROWSER
FROM STG_ACCESS_LOG WHERE LENGTH(BROWSER)>1
GROUP BY BROWSER
)
SELECT
CLS.BROWSER AS USER_AGENT_NK,
CASE
WHEN INSTR(CLS.BROWSER,'Macintosh')>0
THEN 'Macintosh'
WHEN INSTR(CLS.BROWSER,'iPhone')>0
OR INSTR(CLS.BROWSER,'iPad')>0
OR INSTR(CLS.BROWSER,'iPod')>0
OR INSTR(CLS.BROWSER,'Apple TV')>0
OR INSTR(CLS.BROWSER,'Darwin')>0
THEN 'iOS'
WHEN INSTR(CLS.BROWSER,'Android')>0
THEN 'Android'
WHEN INSTR(CLS.BROWSER,'X11;')>0 OR INSTR(CLS.BROWSER,'Wayland;')>0 OR INSTR(CLS.BROWSER,'linux-gnu')>0
THEN 'Linux'
WHEN INSTR(CLS.BROWSER,'BB10;')>0 OR INSTR(CLS.BROWSER,'BlackBerry')>0
THEN 'BlackBerry'
WHEN INSTR(CLS.BROWSER,'Windows')>0
THEN 'Windows'
ELSE 'n.a.' END AS AGENT_OS, -- OS
CASE
WHEN INSTR(CLS.BROWSER,'AppleCoreMedia')>0
THEN 'AppleWebKit'
WHEN INSTR(CLS.BROWSER,') ')>1 AND LENGTH(CLS.BROWSER)>INSTR(CLS.BROWSER,') ')
THEN COALESCE(SUBSTR(CLS.BROWSER, INSTR(CLS.BROWSER,') ')+2, LENGTH(CLS.BROWSER) - INSTR(CLS.BROWSER,') ')-1), 'N/A')
ELSE 'n.a.' END AS AGENT_ENGINE, -- Engine
CASE
WHEN INSTR(CLS.BROWSER,'iPhone')>0
THEN 'iPhone'
WHEN INSTR(CLS.BROWSER,'iPad')>0
THEN 'iPad'
WHEN INSTR(CLS.BROWSER,'iPod')>0
THEN 'iPod'
WHEN INSTR(CLS.BROWSER,'Apple TV')>0
THEN 'Apple TV'
WHEN INSTR(CLS.BROWSER,'Android ')>0 AND INSTR(CLS.BROWSER,'Build')>0
THEN COALESCE(SUBSTR(CLS.BROWSER, INSTR(CLS.BROWSER,'Android '), INSTR(CLS.BROWSER,'Build')-INSTR(CLS.BROWSER,'Android ')), 'n.a.')
WHEN INSTR(CLS.BROWSER,'Android ')>0 AND INSTR(CLS.BROWSER,'MIUI')>0
THEN COALESCE(SUBSTR(CLS.BROWSER, INSTR(CLS.BROWSER,'Android '), INSTR(CLS.BROWSER,'MIUI')-INSTR(CLS.BROWSER,'Android ')), 'n.a.')
ELSE 'n.a.' END AS AGENT_DEVICE, -- Device
CASE
WHEN INSTR(LOWER(CLS.BROWSER),'yandex.com')>0
THEN 'yandex'
WHEN INSTR(LOWER(CLS.BROWSER),'googlebot')>0
THEN 'google'
WHEN INSTR(LOWER(CLS.BROWSER),'bingbot')>0
THEN 'microsoft'
WHEN INSTR(LOWER(CLS.BROWSER),'ahrefsbot')>0
THEN 'ahrefs'
WHEN INSTR(LOWER(CLS.BROWSER),'jobboersebot')>0 OR INSTR(LOWER(CLS.BROWSER),'jobkicks')>0
THEN 'job.de'
WHEN INSTR(LOWER(CLS.BROWSER),'mail.ru')>0
THEN 'mail.ru'
WHEN INSTR(LOWER(CLS.BROWSER),'baiduspider')>0
THEN 'baidu'
WHEN INSTR(LOWER(CLS.BROWSER),'mj12bot')>0
THEN 'majestic-12'
WHEN INSTR(LOWER(CLS.BROWSER),'duckduckgo')>0
THEN 'duckduckgo'
WHEN INSTR(LOWER(CLS.BROWSER),'bytespider')>0
THEN 'bytespider'
WHEN INSTR(LOWER(CLS.BROWSER),'360spider')>0
THEN 'so.360.cn'
WHEN INSTR(LOWER(CLS.BROWSER),'compatible')>0 OR INSTR(LOWER(CLS.BROWSER),'http')>0
OR INSTR(LOWER(CLS.BROWSER),'libwww')>0 OR INSTR(LOWER(CLS.BROWSER),'spider')>0
OR INSTR(LOWER(CLS.BROWSER),'java')>0 OR INSTR(LOWER(CLS.BROWSER),'python')>0
OR INSTR(LOWER(CLS.BROWSER),'robot')>0 OR INSTR(LOWER(CLS.BROWSER),'curl')>0 OR INSTR(LOWER(CLS.BROWSER),'wget')>0
THEN 'other'
ELSE 'n.a.' END AS AGENT_BOT, -- Bot
STRFTIME('%s','now') AS UPDATE_DT
FROM CLS
LEFT OUTER JOIN DIM_USER_AGENT TRG
ON CLS.BROWSER = TRG.USER_AGENT_NK
WHERE TRG.DIM_USER_AGENT_ID IS NULL數據整合
包括組織從作業系統到報告系統的資料傳輸。 為此,您需要建立一個結構與來源類似的階段表。
有關網頁的資訊以插入請求的形式從 CMS 備份到達舞台。
使用基本資料載入歷史表 DIM_REQUEST 分三個步驟:載入新鍵和屬性、更新現有鍵和屬性以及提交已刪除的記錄。
載入新的 SCD2 錄音
/* Load request table SCD from master data */
INSERT INTO DIM_REQUEST (DIM_REQUEST_ID_HIST, REQUEST_NK, PAGE_TITLE, PAGE_DESCR, PAGE_KEYWORDS, DELETE_FLAG, UPDATE_DT)
WITH CLS AS ( -- prepare keys
SELECT
'/' || NAME AS REQUEST_NK,
TITLE AS PAGE_TITLE,
CASE WHEN DESCRIPTION = '' OR DESCRIPTION IS NULL
THEN 'n.a.' ELSE DESCRIPTION
END AS PAGE_DESCR,
CASE WHEN KEYWORDS = '' OR KEYWORDS IS NULL
THEN 'n.a.' ELSE KEYWORDS
END AS PAGE_KEYWORDS
FROM STG_CMS_MENU
WHERE CONTENT_TYPE != 'folder' -- only web pages
AND PAGE_TITLE != 'n.a.' -- master data which make sense
)
/* new records from stage: CLS */
SELECT
-1 AS DIM_REQUEST_ID_HIST,
CLS.REQUEST_NK,
CLS.PAGE_TITLE,
CLS.PAGE_DESCR,
CLS.PAGE_KEYWORDS,
0 AS DELETE_FLAG,
STRFTIME('%s','now') AS UPDATE_DT
FROM CLS
LEFT OUTER JOIN
(
SELECT
DIM_REQUEST_ID,
REQUEST_NK,
PAGE_TITLE,
PAGE_DESCR,
PAGE_KEYWORDS
FROM DIM_REQUEST_V_ACT
) TRG ON CLS.REQUEST_NK = TRG.REQUEST_NK
WHERE TRG.REQUEST_NK IS NULL -- no such record in data mart更新 SCD2 屬性
/* Load request table SCD from master data */
INSERT INTO DIM_REQUEST (DIM_REQUEST_ID_HIST, REQUEST_NK, PAGE_TITLE, PAGE_DESCR, PAGE_KEYWORDS, DELETE_FLAG, UPDATE_DT)
WITH CLS AS ( -- prepare keys
SELECT
'/' || NAME AS REQUEST_NK,
TITLE AS PAGE_TITLE,
CASE WHEN DESCRIPTION = '' OR DESCRIPTION IS NULL
THEN 'n.a.' ELSE DESCRIPTION
END AS PAGE_DESCR,
CASE WHEN KEYWORDS = '' OR KEYWORDS IS NULL
THEN 'n.a.' ELSE KEYWORDS
END AS PAGE_KEYWORDS
FROM STG_CMS_MENU
WHERE CONTENT_TYPE != 'folder' -- only web pages
AND PAGE_TITLE != 'n.a.' -- master data which make sense
)
/* updated records from stage: CLS and build reference to history: HIST */
SELECT
HIST.DIM_REQUEST_ID AS DIM_REQUEST_ID_HIST,
HIST.REQUEST_NK,
CLS.PAGE_TITLE,
CLS.PAGE_DESCR,
CLS.PAGE_KEYWORDS,
0 AS DELETE_FLAG,
STRFTIME('%s','now') AS UPDATE_DT
FROM CLS,
DIM_REQUEST_V_ACT TRG,
DIM_REQUEST HIST
WHERE CLS.REQUEST_NK = TRG.REQUEST_NK
AND TRG.DIM_REQUEST_ID = HIST.DIM_REQUEST_ID
AND ( CLS.PAGE_TITLE != HIST.PAGE_TITLE /* changes only */
OR CLS.PAGE_DESCR != HIST.PAGE_DESCR
OR CLS.PAGE_KEYWORDS != HIST.PAGE_KEYWORDS )已刪除的 SCD2 錄音
/* Load request table SCD from master data */
INSERT INTO DIM_REQUEST (DIM_REQUEST_ID_HIST, REQUEST_NK, PAGE_TITLE, PAGE_DESCR, PAGE_KEYWORDS, DELETE_FLAG, UPDATE_DT)
WITH CLS AS ( -- prepare keys
SELECT
'/' || NAME AS REQUEST_NK,
TITLE AS PAGE_TITLE
FROM STG_CMS_MENU
WHERE CONTENT_TYPE != 'folder' -- only web pages
AND PAGE_TITLE != 'n.a.' -- master data which make sense
)
/* deleted records in data mart: TRG */
SELECT
TRG.DIM_REQUEST_ID AS DIM_REQUEST_ID_HIST,
TRG.REQUEST_NK,
TRG.PAGE_TITLE,
TRG.PAGE_DESCR,
TRG.PAGE_KEYWORDS,
1 AS DELETE_FLAG,
STRFTIME('%s','now') AS UPDATE_DT
FROM (
SELECT
DIM_REQUEST_ID,
REQUEST_NK,
PAGE_TITLE,
PAGE_DESCR,
PAGE_KEYWORDS
FROM DIM_REQUEST_V_ACT
WHERE PAGE_TITLE != 'n.a.' -- track master data only
AND DELETE_FLAG = 0 -- not already deleted
) TRG
LEFT OUTER JOIN CLS ON TRG.REQUEST_NK = CLS.REQUEST_NK
WHERE CLS.REQUEST_NK IS NULL -- no such record in stage每個資料來源必須附有正式的描述,例如在 readme.txt 檔案中:
正式/技術上的資料接收者:姓名、電子郵件地址
資料提供者正式/技術上:姓名、電子郵件地址
資料來源:檔案路徑、服務名稱
資料存取資訊:使用者和密碼
資料流程圖將有助於維護和更新流程,例如以文字形式:
移動文件。 來源:ftp.domain.net:/logs/access.log 目標:/var/www/access.log
台上讀書。 目標:STG_ACCESS_LOG
載入和轉換。 目標:FCT_ACCESS_REQUEST_REF_HH
載入和轉換。 目標:FCT_ACCESS_USER_AGENT_DD
報告。 目標:/var/www/report.html
產量
因此,本文描述了底層資料的整合和新聚合層級的引入等機制。 建置資料倉儲時需要它們以獲得額外的知識並提高資訊品質。
來源: www.habr.com
