Ngawasi proses ETL ing gudang data cilik

Akeh sing nggunakake alat khusus kanggo nggawe prosedur kanggo ngekstrak, ngowahi, lan ngemot data menyang basis data relasional. Proses alat kerja dicathet, kesalahan didandani.

Yen ana kesalahan, log ngemot informasi manawa alat kasebut gagal ngrampungake tugas lan modul (asring java) mandheg ing ngendi. Ing baris pungkasan, sampeyan bisa nemokake kesalahan database, contone, Tabel nglanggar tombol unik.

Kanggo njawab pitakonan apa peran informasi kesalahan ETL, Aku wis klasifikasi kabeh masalah sing wis kedaden liwat rong taun kepungkur ing gudang rada gedhe.

Ngawasi proses ETL ing gudang data cilik

Kesalahan database kalebu papan sing ora cukup, sambungan sing ilang, sesi digantung, lsp.

Kesalahan logis kalebu kayata nglanggar tombol meja, obyek sing ora sah, kurang akses menyang obyek, lsp.
Penjadwal bisa uga ora diwiwiti ing wektu, bisa beku, lsp.

Kesalahan sing prasaja ora butuh wektu suwe kanggo ndandani. ETL sing apik bisa ngatasi umume dhewe.

Kewan omo sing rumit mbutuhake nemokake lan nguji prosedur kanggo nggarap data, kanggo njelajah sumber data. Asring mimpin kanggo perlu kanggo testing pangowahan lan panyebaran.

Dadi, setengah saka kabeh masalah ana hubungane karo database. 48% saka kabeh kesalahan iku kesalahan prasaja.
Katelu saka kabeh masalah sing ana hubungane karo ngganti logika panyimpenan utawa model, luwih saka setengah saka kesalahan kasebut rumit.

Lan kurang saka seprapat kabeh masalah sing ana gandhengane karo panjadwal tugas, 18% yaiku kesalahan sing gampang.

UmumΓ©, 22% saka kabeh kesalahan sing kedadeyan rumit, lan koreksi kasebut mbutuhake perhatian lan wektu sing paling akeh. Padha kedadeyan kira-kira sepisan seminggu. Dene kesalahan prasaja meh saben dina.

Temenan, ngawasi proses ETL bakal efektif nalika lokasi kesalahan dituduhake ing log kanthi akurat lan wektu minimal dibutuhake kanggo nemokake sumber masalah kasebut.

Ngawasi efektif

Apa aku pengin ndeleng ing proses ngawasi ETL?

Ngawasi proses ETL ing gudang data cilik
Mulai ing - nalika dheweke miwiti kerja,
Sumber - sumber data,
Lapisan - apa tingkat panyimpenan sing dimuat,
Jeneng Proyek ETL - prosedur unggahan, sing kalebu pirang-pirang langkah cilik,
Nomer Langkah - nomer langkah sing ditindakake,
Baris sing kena pengaruh - pira data sing wis diproses,
Duration sec - suwene suwene,
Status - apa kabeh apik utawa ora: OK, ERROR, RUNNING, HANGS
Pesen - Pesen sukses pungkasan utawa katrangan kesalahan.

Adhedhasar status rekaman, sampeyan bisa ngirim email. layang kanggo anggota liyane. Yen ora ana kesalahan, mula surat kasebut ora perlu.

Dadi, yen ana kesalahan, lokasi kedadeyan kasebut dituduhake kanthi jelas.

Kadhangkala alat ngawasi dhewe ora bisa digunakake. Ing kasus iki, bisa nelpon tampilan (view) langsung ing database, ing basis saka laporan dibangun.

Tabel monitoring ETL

Kanggo ngleksanakake pemantauan proses ETL, siji tabel lan siji tampilan cukup.

Kanggo nindakake iki, sampeyan bisa bali menyang panyimpenan sethitik Panjenengan lan nggawe prototipe ing database sqlite.

Tabel DDL

CREATE TABLE UTL_JOB_STATUS (
/* Table for logging of job execution log. Important that the job has the steps ETL_START and ETL_END or ETL_ERROR */
  UTL_JOB_STATUS_ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
  SID               INTEGER NOT NULL DEFAULT -1, /* Session Identificator. Unique for every Run of job */
  LOG_DT            INTEGER NOT NULL DEFAULT 0,  /* Date time */
  LOG_D             INTEGER NOT NULL DEFAULT 0,  /* Date */
  JOB_NAME          TEXT NOT NULL DEFAULT 'N/A', /* Job name like JOB_STG2DM_GEO */
  STEP_NAME         TEXT NOT NULL DEFAULT 'N/A', /* ETL_START, ... , ETL_END/ETL_ERROR */
  STEP_DESCR        TEXT,                        /* Description of task or error message */
  UNIQUE (SID, JOB_NAME, STEP_NAME)
);
INSERT INTO UTL_JOB_STATUS (UTL_JOB_STATUS_ID) VALUES (-1);

Ndeleng / lapor DDL

CREATE VIEW IF NOT EXISTS UTL_JOB_STATUS_V
AS /* Content: Package Execution Log for last 3 Months. */
WITH SRC AS (
  SELECT LOG_D,
    LOG_DT,
    UTL_JOB_STATUS_ID,
    SID,
	CASE WHEN INSTR(JOB_NAME, 'FTP') THEN 'TRANSFER' /* file transfer */
	     WHEN INSTR(JOB_NAME, 'STG') THEN 'STAGE' /* stage */
	     WHEN INSTR(JOB_NAME, 'CLS') THEN 'CLEANSING' /* cleansing */
	     WHEN INSTR(JOB_NAME, 'DIM') THEN 'DIMENSION' /* dimension */
	     WHEN INSTR(JOB_NAME, 'FCT') THEN 'FACT' /* fact */
		 WHEN INSTR(JOB_NAME, 'ETL') THEN 'STAGE-MART' /* data mart */
	     WHEN INSTR(JOB_NAME, 'RPT') THEN 'REPORT' /* report */
	     ELSE 'N/A' END AS LAYER,
	CASE WHEN INSTR(JOB_NAME, 'ACCESS') THEN 'ACCESS LOG' /* source */
	     WHEN INSTR(JOB_NAME, 'MASTER') THEN 'MASTER DATA' /* source */
	     WHEN INSTR(JOB_NAME, 'AD-HOC') THEN 'AD-HOC' /* source */
	     ELSE 'N/A' END AS SOURCE,
    JOB_NAME,
    STEP_NAME,
    CASE WHEN STEP_NAME='ETL_START' THEN 1 ELSE 0 END AS START_FLAG,
    CASE WHEN STEP_NAME='ETL_END' THEN 1 ELSE 0 END AS END_FLAG,
    CASE WHEN STEP_NAME='ETL_ERROR' THEN 1 ELSE 0 END AS ERROR_FLAG,
    STEP_NAME || ' : ' || STEP_DESCR AS STEP_LOG,
	SUBSTR( SUBSTR(STEP_DESCR, INSTR(STEP_DESCR, '***')+4), 1, INSTR(SUBSTR(STEP_DESCR, INSTR(STEP_DESCR, '***')+4), '***')-2 ) AS AFFECTED_ROWS
  FROM UTL_JOB_STATUS
  WHERE datetime(LOG_D, 'unixepoch') >= date('now', 'start of month', '-3 month')
)
SELECT JB.SID,
  JB.MIN_LOG_DT AS START_DT,
  strftime('%d.%m.%Y %H:%M', datetime(JB.MIN_LOG_DT, 'unixepoch')) AS LOG_DT,
  JB.SOURCE,
  JB.LAYER,
  JB.JOB_NAME,
  CASE
  WHEN JB.ERROR_FLAG = 1 THEN 'ERROR'
  WHEN JB.ERROR_FLAG = 0 AND JB.END_FLAG = 0 AND strftime('%s','now') - JB.MIN_LOG_DT > 0.5*60*60 THEN 'HANGS' /* half an hour */
  WHEN JB.ERROR_FLAG = 0 AND JB.END_FLAG = 0 THEN 'RUNNING'
  ELSE 'OK'
  END AS STATUS,
  ERR.STEP_LOG     AS STEP_LOG,
  JB.CNT           AS STEP_CNT,
  JB.AFFECTED_ROWS AS AFFECTED_ROWS,
  strftime('%d.%m.%Y %H:%M', datetime(JB.MIN_LOG_DT, 'unixepoch')) AS JOB_START_DT,
  strftime('%d.%m.%Y %H:%M', datetime(JB.MAX_LOG_DT, 'unixepoch')) AS JOB_END_DT,
  JB.MAX_LOG_DT - JB.MIN_LOG_DT AS JOB_DURATION_SEC
FROM
  ( SELECT SID, SOURCE, LAYER, JOB_NAME,
           MAX(UTL_JOB_STATUS_ID) AS UTL_JOB_STATUS_ID,
           MAX(START_FLAG)       AS START_FLAG,
           MAX(END_FLAG)         AS END_FLAG,
           MAX(ERROR_FLAG)       AS ERROR_FLAG,
           MIN(LOG_DT)           AS MIN_LOG_DT,
           MAX(LOG_DT)           AS MAX_LOG_DT,
           SUM(1)                AS CNT,
           SUM(IFNULL(AFFECTED_ROWS, 0)) AS AFFECTED_ROWS
    FROM SRC
    GROUP BY SID, SOURCE, LAYER, JOB_NAME
  ) JB,
  ( SELECT UTL_JOB_STATUS_ID, SID, JOB_NAME, STEP_LOG
    FROM SRC
    WHERE 1 = 1
  ) ERR
WHERE 1 = 1
  AND JB.SID = ERR.SID
  AND JB.JOB_NAME = ERR.JOB_NAME
  AND JB.UTL_JOB_STATUS_ID = ERR.UTL_JOB_STATUS_ID
ORDER BY JB.MIN_LOG_DT DESC, JB.SID DESC, JB.SOURCE;

SQL Priksa yen bisa njaluk nomer sesi anyar

SELECT SUM (
  CASE WHEN start_job.JOB_NAME IS NOT NULL AND end_job.JOB_NAME IS NULL /* existed job finished */
	    AND NOT ( 'y' = 'n' ) /* force restart PARAMETER */
       THEN 1 ELSE 0
  END ) AS IS_RUNNING
  FROM
    ( SELECT 1 AS dummy FROM UTL_JOB_STATUS WHERE sid = -1) d_job
  LEFT OUTER JOIN
    ( SELECT JOB_NAME, SID, 1 AS dummy
      FROM UTL_JOB_STATUS
      WHERE JOB_NAME = 'RPT_ACCESS_LOG' /* job name PARAMETER */
	    AND STEP_NAME = 'ETL_START'
      GROUP BY JOB_NAME, SID
    ) start_job /* starts */
  ON d_job.dummy = start_job.dummy
  LEFT OUTER JOIN
    ( SELECT JOB_NAME, SID
      FROM UTL_JOB_STATUS
      WHERE JOB_NAME = 'RPT_ACCESS_LOG'  /* job name PARAMETER */
	    AND STEP_NAME in ('ETL_END', 'ETL_ERROR') /* stop status */
      GROUP BY JOB_NAME, SID
    ) end_job /* ends */
  ON start_job.JOB_NAME = end_job.JOB_NAME
     AND start_job.SID = end_job.SID

Fitur Tabel:

  • wiwitan lan pungkasan prosedur pangolahan data kudu ditindakake kanthi langkah ETL_START lan ETL_END
  • yen ana kesalahan, langkah ETL_ERROR karo katrangane kudu digawe
  • jumlah data sing diproses kudu disorot, contone, karo tanda bintang
  • prosedur sing padha bisa diwiwiti bebarengan karo parameter force_restart = y, tanpa nomer sesi ditanggepi mung kanggo prosedur sing wis rampung.
  • ing mode normal, sampeyan ora bisa mbukak prosedur Processing data padha ing podo karo

Operasi sing dibutuhake kanggo nggarap tabel yaiku:

  • njupuk nomer sesi saka prosedur ETL mlaku
  • lebokake entri log menyang tabel
  • njupuk rekaman sukses pungkasan saka prosedur ETL

Ing basis data kayata Oracle utawa Postgres, operasi kasebut bisa ditindakake minangka fungsi sing dibangun. sqlite mbutuhake mekanisme eksternal, lan ing kasus iki prototipe ing PHP.

kesimpulan

Mangkono, pesen kesalahan ing piranti pangolahan data duwe peran sing penting banget. Nanging angel nyebataken optimal kanggo cepet nemokake sabab saka masalah. Nalika jumlah prosedur nyedhaki satus, banjur proses ngawasi dadi proyek sing kompleks.

Artikel menehi conto solusi sing bisa kanggo masalah ing wangun prototipe. Prototipe repositori cilik kabeh kasedhiya ing gitlab SQLite PHP ETL Utilities.

Source: www.habr.com

Add a comment