Monitoring ETL processes in a small data warehouse

Many people use specialized tools to build routines for extracting, transforming, and loading data into relational databases. The tool's execution is logged, and errors are recorded.

When an error occurs, the log says that the tool failed to complete the task and shows which modules (often Java) stopped and where. The last lines may contain a database error, such as a violation of a table's unique key.

To answer the question of what role ETL error information plays, I classified all the problems that occurred over the past two years in a fairly large warehouse.

Database errors include insufficient space, a lost connection, a hung session, etc.

Logic errors include violations of table keys, invalid objects, missing access to objects, etc.
The task scheduler may fail to start on time, may hang, etc.

Simple errors do not take long to fix. A good ETL can handle most of them on its own.

Complex errors make it necessary to open and inspect the data processing procedures and to examine the data sources. They often lead to the need to test changes and deploy them.

So, half of all problems are related to the database. 48% of all errors are simple errors.
A third of all problems are related to changes in the warehouse logic or data model.

Fewer than a quarter of all problems are related to the task scheduler, and 18% of those are simple errors.

Overall, 22% of all errors that occur are complex and need attention and time to fix. They happen about once a week, whereas simple errors happen almost every day.

Clearly, monitoring of ETL processes is effective when the log pinpoints the location of the error as precisely as possible and finding the source of the problem takes minimal time.

Effective monitoring

What did I want to see in ETL monitoring?

Start at - when the job started,
Source - the data source,
Layer - which warehouse layer is being loaded,
ETL Job Name - the load procedure, which consists of many small steps,
Step Number - the number of the step being executed,
Affected Rows - how much data has already been processed,
Duration sec - how long the run takes,
Status - whether everything is fine or not: OK, ERROR, RUNNING, HANGS
Message - the last successful message or the error description.

Depending on the status of the entries, you can send an email to the other participants. If there are no errors, no email is needed.
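The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary keys `job`, `status` and `message`, the function name `build_alert`, and the choice to treat HANGS as worth an email alongside ERROR are all hypothetical and not taken from the prototype. The sketch only builds the message; sending it is left out.

```python
from email.message import EmailMessage

# Assumption: these two statuses are the ones worth an email.
ALERT_STATUSES = {"ERROR", "HANGS"}


def build_alert(rows, sender, recipients):
    """Return an EmailMessage listing problem jobs, or None if all is fine.

    rows is an iterable of dicts with the hypothetical keys
    'job', 'status' and 'message'.
    """
    bad = [r for r in rows if r["status"] in ALERT_STATUSES]
    if not bad:
        return None  # no errors, so no email is needed
    msg = EmailMessage()
    msg["Subject"] = f"ETL monitoring: {len(bad)} job(s) need attention"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(
        "\n".join(f"{r['job']}: {r['status']} - {r['message']}" for r in bad)
    )
    return msg


report = [
    {"job": "JOB_STG2DM_GEO", "status": "OK", "message": "ETL_END : done"},
    {"job": "RPT_ACCESS_LOG", "status": "ERROR", "message": "ETL_ERROR : unique key violated"},
]
alert = build_alert(report, "etl@example.com", ["dwh-team@example.com"])
```

The returned message could then be handed to whatever mail transport is available, for example `smtplib.SMTP.send_message`.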

This way, when an error occurs, the place where it happened is clearly shown.

Sometimes the monitoring tool itself does not work. In that case, you can query the view on which the report is built directly in the database.

ETL monitoring table

To implement monitoring of ETL processes, one table and one view are enough.

For this, you can go back to your own small warehouse and create a prototype in an sqlite database.

Table DDL

CREATE TABLE UTL_JOB_STATUS (
/* Table for logging of job execution log. Important that the job has the steps ETL_START and ETL_END or ETL_ERROR */
  UTL_JOB_STATUS_ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
  SID               INTEGER NOT NULL DEFAULT -1, /* Session identifier. Unique for every run of a job */
  LOG_DT            INTEGER NOT NULL DEFAULT 0,  /* Date time */
  LOG_D             INTEGER NOT NULL DEFAULT 0,  /* Date */
  JOB_NAME          TEXT NOT NULL DEFAULT 'N/A', /* Job name like JOB_STG2DM_GEO */
  STEP_NAME         TEXT NOT NULL DEFAULT 'N/A', /* ETL_START, ... , ETL_END/ETL_ERROR */
  STEP_DESCR        TEXT,                        /* Description of task or error message */
  UNIQUE (SID, JOB_NAME, STEP_NAME)
);
INSERT INTO UTL_JOB_STATUS (UTL_JOB_STATUS_ID) VALUES (-1);
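To show how rows might land in this table, here is a minimal Python sketch using the standard `sqlite3` module. The helper name `log_step` is hypothetical, and storing `LOG_D` as the Unix timestamp of midnight UTC is an assumption; the DDL only says it holds a date.

```python
import sqlite3
import time

DDL = """
CREATE TABLE UTL_JOB_STATUS (
  UTL_JOB_STATUS_ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
  SID        INTEGER NOT NULL DEFAULT -1,
  LOG_DT     INTEGER NOT NULL DEFAULT 0,
  LOG_D      INTEGER NOT NULL DEFAULT 0,
  JOB_NAME   TEXT NOT NULL DEFAULT 'N/A',
  STEP_NAME  TEXT NOT NULL DEFAULT 'N/A',
  STEP_DESCR TEXT,
  UNIQUE (SID, JOB_NAME, STEP_NAME)
);
INSERT INTO UTL_JOB_STATUS (UTL_JOB_STATUS_ID) VALUES (-1);
"""


def log_step(conn, sid, job_name, step_name, descr=None, now=None):
    """Insert one log row (hypothetical helper).

    LOG_DT is a Unix timestamp; LOG_D is assumed to be the timestamp
    of the start of the same day in UTC.
    """
    ts = int(time.time()) if now is None else int(now)
    day = ts - ts % 86400
    conn.execute(
        "INSERT INTO UTL_JOB_STATUS"
        " (SID, LOG_DT, LOG_D, JOB_NAME, STEP_NAME, STEP_DESCR)"
        " VALUES (?, ?, ?, ?, ?, ?)",
        (sid, ts, day, job_name, step_name, descr),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
log_step(conn, 1, "RPT_ACCESS_LOG", "ETL_START", "started")
log_step(conn, 1, "RPT_ACCESS_LOG", "LOAD", "loaded *** 1500 *** rows")
log_step(conn, 1, "RPT_ACCESS_LOG", "ETL_END", "finished")
steps = [
    r[0]
    for r in conn.execute(
        "SELECT STEP_NAME FROM UTL_JOB_STATUS WHERE SID = 1"
        " ORDER BY UTL_JOB_STATUS_ID"
    )
]
```

Because of the `UNIQUE (SID, JOB_NAME, STEP_NAME)` constraint, logging the same step twice within one session raises `sqlite3.IntegrityError`, so step names have to be distinct within a run.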

View/report DDL

CREATE VIEW IF NOT EXISTS UTL_JOB_STATUS_V
AS /* Content: Package Execution Log for last 3 Months. */
WITH SRC AS (
  SELECT LOG_D,
    LOG_DT,
    UTL_JOB_STATUS_ID,
    SID,
    CASE WHEN INSTR(JOB_NAME, 'FTP') THEN 'TRANSFER' /* file transfer */
         WHEN INSTR(JOB_NAME, 'STG') THEN 'STAGE' /* stage */
         WHEN INSTR(JOB_NAME, 'CLS') THEN 'CLEANSING' /* cleansing */
         WHEN INSTR(JOB_NAME, 'DIM') THEN 'DIMENSION' /* dimension */
         WHEN INSTR(JOB_NAME, 'FCT') THEN 'FACT' /* fact */
         WHEN INSTR(JOB_NAME, 'ETL') THEN 'STAGE-MART' /* data mart */
         WHEN INSTR(JOB_NAME, 'RPT') THEN 'REPORT' /* report */
         ELSE 'N/A' END AS LAYER,
    CASE WHEN INSTR(JOB_NAME, 'ACCESS') THEN 'ACCESS LOG' /* source */
         WHEN INSTR(JOB_NAME, 'MASTER') THEN 'MASTER DATA' /* source */
         WHEN INSTR(JOB_NAME, 'AD-HOC') THEN 'AD-HOC' /* source */
         ELSE 'N/A' END AS SOURCE,
    JOB_NAME,
    STEP_NAME,
    CASE WHEN STEP_NAME='ETL_START' THEN 1 ELSE 0 END AS START_FLAG,
    CASE WHEN STEP_NAME='ETL_END' THEN 1 ELSE 0 END AS END_FLAG,
    CASE WHEN STEP_NAME='ETL_ERROR' THEN 1 ELSE 0 END AS ERROR_FLAG,
    STEP_NAME || ' : ' || STEP_DESCR AS STEP_LOG,
    SUBSTR( SUBSTR(STEP_DESCR, INSTR(STEP_DESCR, '***')+4), 1,
            INSTR(SUBSTR(STEP_DESCR, INSTR(STEP_DESCR, '***')+4), '***')-2 ) AS AFFECTED_ROWS
  FROM UTL_JOB_STATUS
  WHERE datetime(LOG_D, 'unixepoch') >= date('now', 'start of month', '-3 month')
)
SELECT JB.SID,
  JB.MIN_LOG_DT AS START_DT,
  strftime('%d.%m.%Y %H:%M', datetime(JB.MIN_LOG_DT, 'unixepoch')) AS LOG_DT,
  JB.SOURCE,
  JB.LAYER,
  JB.JOB_NAME,
  CASE
  WHEN JB.ERROR_FLAG = 1 THEN 'ERROR'
  WHEN JB.ERROR_FLAG = 0 AND JB.END_FLAG = 0 AND strftime('%s','now') - JB.MIN_LOG_DT > 0.5*60*60 THEN 'HANGS' /* half an hour */
  WHEN JB.ERROR_FLAG = 0 AND JB.END_FLAG = 0 THEN 'RUNNING'
  ELSE 'OK'
  END AS STATUS,
  ERR.STEP_LOG     AS STEP_LOG,
  JB.CNT           AS STEP_CNT,
  JB.AFFECTED_ROWS AS AFFECTED_ROWS,
  strftime('%d.%m.%Y %H:%M', datetime(JB.MIN_LOG_DT, 'unixepoch')) AS JOB_START_DT,
  strftime('%d.%m.%Y %H:%M', datetime(JB.MAX_LOG_DT, 'unixepoch')) AS JOB_END_DT,
  JB.MAX_LOG_DT - JB.MIN_LOG_DT AS JOB_DURATION_SEC
FROM
  ( SELECT SID, SOURCE, LAYER, JOB_NAME,
           MAX(UTL_JOB_STATUS_ID) AS UTL_JOB_STATUS_ID,
           MAX(START_FLAG)       AS START_FLAG,
           MAX(END_FLAG)         AS END_FLAG,
           MAX(ERROR_FLAG)       AS ERROR_FLAG,
           MIN(LOG_DT)           AS MIN_LOG_DT,
           MAX(LOG_DT)           AS MAX_LOG_DT,
           SUM(1)                AS CNT,
           SUM(IFNULL(AFFECTED_ROWS, 0)) AS AFFECTED_ROWS
    FROM SRC
    GROUP BY SID, SOURCE, LAYER, JOB_NAME
  ) JB,
  ( SELECT UTL_JOB_STATUS_ID, SID, JOB_NAME, STEP_LOG
    FROM SRC
    WHERE 1 = 1
  ) ERR
WHERE 1 = 1
  AND JB.SID = ERR.SID
  AND JB.JOB_NAME = ERR.JOB_NAME
  AND JB.UTL_JOB_STATUS_ID = ERR.UTL_JOB_STATUS_ID
ORDER BY JB.MIN_LOG_DT DESC, JB.SID DESC, JB.SOURCE;
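The AFFECTED_ROWS expression in the view is the least obvious part, so here is a small Python sketch that runs the same SUBSTR/INSTR expression against a sample step description. The helper name `affected_rows` and the sample text are hypothetical. The offsets `+4` and `-2` imply that the number is written as `*** N ***`, with exactly one space on each side.

```python
import sqlite3

# Same expression as in the view, with STEP_DESCR replaced by the
# named parameter :d so it can be tried on arbitrary strings.
EXTRACT_SQL = """
SELECT SUBSTR(
         SUBSTR(:d, INSTR(:d, '***') + 4),
         1,
         INSTR(SUBSTR(:d, INSTR(:d, '***') + 4), '***') - 2
       )
"""


def affected_rows(descr):
    """Return the text between the first '*** ' and the next ' ***'."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(EXTRACT_SQL, {"d": descr}).fetchone()[0]
    finally:
        conn.close()


value = affected_rows("Loaded *** 1500 *** rows")
```

The result is text rather than an integer; the view relies on SUM to convert it when it totals the rows per job. A NULL description yields NULL, which the view's IFNULL turns into 0.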

SQL check of whether a new session number can be issued

SELECT SUM (
  CASE WHEN start_job.JOB_NAME IS NOT NULL AND end_job.JOB_NAME IS NULL /* a started run has no end step yet */
        AND NOT ( 'y' = 'n' ) /* force restart PARAMETER */
       THEN 1 ELSE 0
  END ) AS IS_RUNNING
  FROM
    ( SELECT 1 AS dummy FROM UTL_JOB_STATUS WHERE sid = -1) d_job
  LEFT OUTER JOIN
    ( SELECT JOB_NAME, SID, 1 AS dummy
      FROM UTL_JOB_STATUS
      WHERE JOB_NAME = 'RPT_ACCESS_LOG' /* job name PARAMETER */
	    AND STEP_NAME = 'ETL_START'
      GROUP BY JOB_NAME, SID
    ) start_job /* starts */
  ON d_job.dummy = start_job.dummy
  LEFT OUTER JOIN
    ( SELECT JOB_NAME, SID
      FROM UTL_JOB_STATUS
      WHERE JOB_NAME = 'RPT_ACCESS_LOG'  /* job name PARAMETER */
	    AND STEP_NAME in ('ETL_END', 'ETL_ERROR') /* stop status */
      GROUP BY JOB_NAME, SID
    ) end_job /* ends */
  ON start_job.JOB_NAME = end_job.JOB_NAME
     AND start_job.SID = end_job.SID
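As a rough illustration of how this check might be called from a program, the Python sketch below wraps the same query with bound parameters. The function name `is_running` and the parameter names `:job` and `:force` are hypothetical. It assumes the force-restart placeholder in the query is meant to be compared with `'y'`.

```python
import sqlite3

DDL = """
CREATE TABLE UTL_JOB_STATUS (
  UTL_JOB_STATUS_ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
  SID        INTEGER NOT NULL DEFAULT -1,
  LOG_DT     INTEGER NOT NULL DEFAULT 0,
  LOG_D      INTEGER NOT NULL DEFAULT 0,
  JOB_NAME   TEXT NOT NULL DEFAULT 'N/A',
  STEP_NAME  TEXT NOT NULL DEFAULT 'N/A',
  STEP_DESCR TEXT,
  UNIQUE (SID, JOB_NAME, STEP_NAME)
);
INSERT INTO UTL_JOB_STATUS (UTL_JOB_STATUS_ID) VALUES (-1);
"""

# The check query from the article with the two PARAMETER spots bound.
CHECK_SQL = """
SELECT SUM(
  CASE WHEN start_job.JOB_NAME IS NOT NULL AND end_job.JOB_NAME IS NULL
        AND NOT (:force = 'y')
       THEN 1 ELSE 0 END) AS IS_RUNNING
FROM (SELECT 1 AS dummy FROM UTL_JOB_STATUS WHERE SID = -1) d_job
LEFT OUTER JOIN
  (SELECT JOB_NAME, SID, 1 AS dummy FROM UTL_JOB_STATUS
   WHERE JOB_NAME = :job AND STEP_NAME = 'ETL_START'
   GROUP BY JOB_NAME, SID) start_job
  ON d_job.dummy = start_job.dummy
LEFT OUTER JOIN
  (SELECT JOB_NAME, SID FROM UTL_JOB_STATUS
   WHERE JOB_NAME = :job AND STEP_NAME IN ('ETL_END', 'ETL_ERROR')
   GROUP BY JOB_NAME, SID) end_job
  ON start_job.JOB_NAME = end_job.JOB_NAME
     AND start_job.SID = end_job.SID
"""


def is_running(conn, job_name, force_restart=False):
    """True if some run of job_name has a start step but no end step."""
    params = {"job": job_name, "force": "y" if force_restart else "n"}
    return bool(conn.execute(CHECK_SQL, params).fetchone()[0])


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.execute(
    "INSERT INTO UTL_JOB_STATUS (SID, JOB_NAME, STEP_NAME)"
    " VALUES (1, 'RPT_ACCESS_LOG', 'ETL_START')"
)
blocked = is_running(conn, "RPT_ACCESS_LOG")       # started, not ended
forced = is_running(conn, "RPT_ACCESS_LOG", True)  # force restart
```

The dummy row with `SID = -1` inserted by the table DDL is what guarantees that the query returns a single row with 0 even when the job has never been run.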

Table features:

  • the start and end of a data processing procedure must be marked by the steps ETL_START and ETL_END
  • if an error occurs, an ETL_ERROR step must be created with a description of the error
  • the amount of data processed should be highlighted, for example with asterisks
  • the same procedure can be started concurrently with the parameter force_restart = y; without it, a session number is issued only if the previous run has finished
  • in normal mode, the same data processing procedure cannot be run in parallel

The operations needed to work with the table are the following:

  • obtaining a session number for the ETL procedure being started
  • inserting a log entry into the table
  • getting the last successful entry of an ETL procedure
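Two of these operations can be sketched in Python with the standard `sqlite3` module. Both helpers are hypothetical: `new_session_id` assumes a session number is simply one more than the largest SID logged so far, and `last_success` assumes that "last successful entry" means the most recent ETL_END row of the job. The prototype may define either differently.

```python
import sqlite3

DDL = """
CREATE TABLE UTL_JOB_STATUS (
  UTL_JOB_STATUS_ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
  SID        INTEGER NOT NULL DEFAULT -1,
  LOG_DT     INTEGER NOT NULL DEFAULT 0,
  LOG_D      INTEGER NOT NULL DEFAULT 0,
  JOB_NAME   TEXT NOT NULL DEFAULT 'N/A',
  STEP_NAME  TEXT NOT NULL DEFAULT 'N/A',
  STEP_DESCR TEXT,
  UNIQUE (SID, JOB_NAME, STEP_NAME)
);
INSERT INTO UTL_JOB_STATUS (UTL_JOB_STATUS_ID) VALUES (-1);
"""


def new_session_id(conn):
    """Assumed allocation rule: largest SID logged so far, plus one."""
    (max_sid,) = conn.execute("SELECT MAX(SID) FROM UTL_JOB_STATUS").fetchone()
    return max(max_sid or 0, 0) + 1


def last_success(conn, job_name):
    """Return (SID, LOG_DT, STEP_DESCR) of the latest ETL_END row, or None."""
    return conn.execute(
        "SELECT SID, LOG_DT, STEP_DESCR FROM UTL_JOB_STATUS"
        " WHERE JOB_NAME = ? AND STEP_NAME = 'ETL_END'"
        " ORDER BY LOG_DT DESC, UTL_JOB_STATUS_ID DESC LIMIT 1",
        (job_name,),
    ).fetchone()


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
first_sid = new_session_id(conn)
conn.executemany(
    "INSERT INTO UTL_JOB_STATUS (SID, LOG_DT, JOB_NAME, STEP_NAME, STEP_DESCR)"
    " VALUES (?, ?, ?, ?, ?)",
    [
        (first_sid, 100, "RPT_ACCESS_LOG", "ETL_START", "started"),
        (first_sid, 200, "RPT_ACCESS_LOG", "ETL_END", "done"),
    ],
)
second_sid = new_session_id(conn)
latest = last_success(conn, "RPT_ACCESS_LOG")
```

In a real loader, the session number would be requested only after checking that no earlier run of the same job is still in progress, unless a forced restart is asked for.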

In databases such as Oracle or Postgres, these operations can be implemented as stored functions. sqlite needs an external mechanism, and in this case it is prototyped in PHP.

Conclusion

So, error messages in data processing tools play a hugely important role. But they can hardly be called optimal for quickly finding the cause of a problem. When the number of procedures approaches a hundred, process monitoring turns into a complex project.

The article gives an example of a possible solution to the problem in the form of a prototype. The whole small-warehouse prototype is available on gitlab: SQLite PHP ETL Utilities.

Source: www.habr.com
