Kuyang'anira njira za ETL m'nkhokwe yaying'ono ya data

Ambiri amagwiritsa ntchito zida zapadera kuti apange njira zochotsera, kusintha, ndi kuyika deta muzosungirako zokhudzana ndi ubale. Njira ya zida zogwirira ntchito imalowetsedwa, zolakwika zimakonzedwa.

Pakachitika cholakwika, chipikacho chimakhala ndi chidziwitso chomwe chida chinalephera kumaliza ntchitoyo komanso ma module (nthawi zambiri java) adayima pomwe. M'mizere yomaliza, mutha kupeza cholakwika cha database, mwachitsanzo, kuphwanya makiyi apadera patebulo.

Kuti ndiyankhe funso la momwe chidziwitso cha zolakwika za ETL chimachita, ndayika zovuta zonse zomwe zachitika zaka ziwiri zapitazi m'malo akulu kwambiri.

Kuyang'anira njira za ETL m'nkhokwe yaying'ono ya data

Zolakwika mu database zikuphatikiza kusakwanira kwa malo, kutayika kolumikizana, gawo lopachikidwa, ndi zina.

Zolakwa zomveka zimaphatikizapo monga kuphwanya makiyi a tebulo, zinthu zosavomerezeka, kusowa mwayi wopeza zinthu, ndi zina zotero.
Wokonza mapulani sangayambe pa nthawi yake, akhoza kuzizira, ndi zina zotero.

Zolakwa zosavuta sizitenga nthawi kuti zikonzedwe. ETL yabwino imatha kuthana ndi ambiri mwa iwo okha.

Nsikidzi zovuta zimapangitsa kuti pakhale kofunikira kupeza ndikuyesa njira zogwirira ntchito ndi data, kufufuza magwero a data. Nthawi zambiri kumabweretsa kufunikira koyesa kusintha ndi kutumizidwa.

Chifukwa chake, theka la zovuta zonse zimagwirizana ndi database. 48% ya zolakwa zonse ndi zolakwika zosavuta.
Gawo limodzi mwa magawo atatu a mavuto onse okhudzana ndi kusintha malingaliro osungirako kapena chitsanzo, oposa theka la zolakwikazi ndizovuta.

Ndipo zosakwana kotala la mavuto onse okhudzana ndi ndondomeko ya ntchito, 18% yomwe ndi zolakwika zosavuta.

Kawirikawiri, 22% ya zolakwa zonse zomwe zimachitika zimakhala zovuta, ndipo kuwongolera kwawo kumafuna chidwi kwambiri ndi nthawi. Zimachitika kamodzi pa sabata. Pomwe zolakwika zosavuta zimachitika pafupifupi tsiku lililonse.

Mwachiwonekere, kuyang'anira njira za ETL kudzakhala kothandiza pamene malo olakwika akuwonetsedwa mu chipika molondola momwe angathere komanso nthawi yochepa yofunikira kuti mupeze gwero la vuto.

Kuyang'anira kogwira mtima

Kodi ndimafuna kuwona chiyani pakuwunika kwa ETL?

Kuyang'anira njira za ETL m'nkhokwe yaying'ono ya data
Yambani pa - pamene anayamba ntchito,
Gwero - gwero la data,
Layer - ndi mulingo wanji wosungira womwe ukukwezedwa,
ETL Job Name - njira yokwezera, yomwe imakhala ndi masitepe ang'onoang'ono,
Gawo Nambala - nambala ya sitepe yomwe ikuchitika,
Mizere Yokhudzidwa - kuchuluka kwa deta yomwe yasinthidwa kale,
Sec - zimatenga nthawi yayitali bwanji,
Mkhalidwe - kaya zonse zili bwino kapena ayi: CHABWINO, ZOLAKWITSA, KUTHAWA, KUPITA
Uthenga - Uthenga womaliza wopambana kapena kufotokoza zolakwika.

Kutengera momwe zolembazo zilili, mutha kutumiza imelo. kalata kwa mamembala ena. Ngati palibe zolakwika, ndiye kuti kalatayo sikofunikira.

Choncho, pakachitika cholakwika, malo a chochitikacho amasonyezedwa bwino.

Nthawi zina zimachitika kuti chida chowunikira chokha sichigwira ntchito. Pankhaniyi, ndizotheka kuyitanitsa mawonedwe (mawonedwe) mwachindunji mu database, pamaziko omwe lipotilo limamangidwa.

Tebulo lowunikira la ETL

Kukhazikitsa kuwunika kwa njira za ETL, tebulo limodzi ndi mawonekedwe amodzi ndizokwanira.

Kuti muchite izi, mutha kubwerera ku kasungidwe kanu kakang'ono ndikupanga prototype mu database ya sqlite.

Zithunzi za DDL

CREATE TABLE UTL_JOB_STATUS (
/* Table for logging of job execution log. Important that the job has the steps ETL_START and ETL_END or ETL_ERROR */
  UTL_JOB_STATUS_ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
  SID               INTEGER NOT NULL DEFAULT -1, /* Session Identificator. Unique for every Run of job */
  LOG_DT            INTEGER NOT NULL DEFAULT 0,  /* Date time */
  LOG_D             INTEGER NOT NULL DEFAULT 0,  /* Date */
  JOB_NAME          TEXT NOT NULL DEFAULT 'N/A', /* Job name like JOB_STG2DM_GEO */
  STEP_NAME         TEXT NOT NULL DEFAULT 'N/A', /* ETL_START, ... , ETL_END/ETL_ERROR */
  STEP_DESCR        TEXT,                        /* Description of task or error message */
  UNIQUE (SID, JOB_NAME, STEP_NAME)
);
INSERT INTO UTL_JOB_STATUS (UTL_JOB_STATUS_ID) VALUES (-1);

Onani/Nenani za DDL

CREATE VIEW IF NOT EXISTS UTL_JOB_STATUS_V
AS /* Content: Package Execution Log for last 3 Months. */
WITH SRC AS (
  SELECT LOG_D,
    LOG_DT,
    UTL_JOB_STATUS_ID,
    SID,
	CASE WHEN INSTR(JOB_NAME, 'FTP') THEN 'TRANSFER' /* file transfer */
	     WHEN INSTR(JOB_NAME, 'STG') THEN 'STAGE' /* stage */
	     WHEN INSTR(JOB_NAME, 'CLS') THEN 'CLEANSING' /* cleansing */
	     WHEN INSTR(JOB_NAME, 'DIM') THEN 'DIMENSION' /* dimension */
	     WHEN INSTR(JOB_NAME, 'FCT') THEN 'FACT' /* fact */
		 WHEN INSTR(JOB_NAME, 'ETL') THEN 'STAGE-MART' /* data mart */
	     WHEN INSTR(JOB_NAME, 'RPT') THEN 'REPORT' /* report */
	     ELSE 'N/A' END AS LAYER,
	CASE WHEN INSTR(JOB_NAME, 'ACCESS') THEN 'ACCESS LOG' /* source */
	     WHEN INSTR(JOB_NAME, 'MASTER') THEN 'MASTER DATA' /* source */
	     WHEN INSTR(JOB_NAME, 'AD-HOC') THEN 'AD-HOC' /* source */
	     ELSE 'N/A' END AS SOURCE,
    JOB_NAME,
    STEP_NAME,
    CASE WHEN STEP_NAME='ETL_START' THEN 1 ELSE 0 END AS START_FLAG,
    CASE WHEN STEP_NAME='ETL_END' THEN 1 ELSE 0 END AS END_FLAG,
    CASE WHEN STEP_NAME='ETL_ERROR' THEN 1 ELSE 0 END AS ERROR_FLAG,
    STEP_NAME || ' : ' || STEP_DESCR AS STEP_LOG,
	SUBSTR( SUBSTR(STEP_DESCR, INSTR(STEP_DESCR, '***')+4), 1, INSTR(SUBSTR(STEP_DESCR, INSTR(STEP_DESCR, '***')+4), '***')-2 ) AS AFFECTED_ROWS
  FROM UTL_JOB_STATUS
  WHERE datetime(LOG_D, 'unixepoch') >= date('now', 'start of month', '-3 month')
)
SELECT JB.SID,
  JB.MIN_LOG_DT AS START_DT,
  strftime('%d.%m.%Y %H:%M', datetime(JB.MIN_LOG_DT, 'unixepoch')) AS LOG_DT,
  JB.SOURCE,
  JB.LAYER,
  JB.JOB_NAME,
  CASE
  WHEN JB.ERROR_FLAG = 1 THEN 'ERROR'
  WHEN JB.ERROR_FLAG = 0 AND JB.END_FLAG = 0 AND strftime('%s','now') - JB.MIN_LOG_DT > 0.5*60*60 THEN 'HANGS' /* half an hour */
  WHEN JB.ERROR_FLAG = 0 AND JB.END_FLAG = 0 THEN 'RUNNING'
  ELSE 'OK'
  END AS STATUS,
  ERR.STEP_LOG     AS STEP_LOG,
  JB.CNT           AS STEP_CNT,
  JB.AFFECTED_ROWS AS AFFECTED_ROWS,
  strftime('%d.%m.%Y %H:%M', datetime(JB.MIN_LOG_DT, 'unixepoch')) AS JOB_START_DT,
  strftime('%d.%m.%Y %H:%M', datetime(JB.MAX_LOG_DT, 'unixepoch')) AS JOB_END_DT,
  JB.MAX_LOG_DT - JB.MIN_LOG_DT AS JOB_DURATION_SEC
FROM
  ( SELECT SID, SOURCE, LAYER, JOB_NAME,
           MAX(UTL_JOB_STATUS_ID) AS UTL_JOB_STATUS_ID,
           MAX(START_FLAG)       AS START_FLAG,
           MAX(END_FLAG)         AS END_FLAG,
           MAX(ERROR_FLAG)       AS ERROR_FLAG,
           MIN(LOG_DT)           AS MIN_LOG_DT,
           MAX(LOG_DT)           AS MAX_LOG_DT,
           SUM(1)                AS CNT,
           SUM(IFNULL(AFFECTED_ROWS, 0)) AS AFFECTED_ROWS
    FROM SRC
    GROUP BY SID, SOURCE, LAYER, JOB_NAME
  ) JB,
  ( SELECT UTL_JOB_STATUS_ID, SID, JOB_NAME, STEP_LOG
    FROM SRC
    WHERE 1 = 1
  ) ERR
WHERE 1 = 1
  AND JB.SID = ERR.SID
  AND JB.JOB_NAME = ERR.JOB_NAME
  AND JB.UTL_JOB_STATUS_ID = ERR.UTL_JOB_STATUS_ID
ORDER BY JB.MIN_LOG_DT DESC, JB.SID DESC, JB.SOURCE;

SQL Kuyang'ana ngati kuli kotheka kupeza nambala yatsopano ya gawo

SELECT SUM (
  CASE WHEN start_job.JOB_NAME IS NOT NULL AND end_job.JOB_NAME IS NULL /* existed job finished */
	    AND NOT ( 'y' = 'n' ) /* force restart PARAMETER */
       THEN 1 ELSE 0
  END ) AS IS_RUNNING
  FROM
    ( SELECT 1 AS dummy FROM UTL_JOB_STATUS WHERE sid = -1) d_job
  LEFT OUTER JOIN
    ( SELECT JOB_NAME, SID, 1 AS dummy
      FROM UTL_JOB_STATUS
      WHERE JOB_NAME = 'RPT_ACCESS_LOG' /* job name PARAMETER */
	    AND STEP_NAME = 'ETL_START'
      GROUP BY JOB_NAME, SID
    ) start_job /* starts */
  ON d_job.dummy = start_job.dummy
  LEFT OUTER JOIN
    ( SELECT JOB_NAME, SID
      FROM UTL_JOB_STATUS
      WHERE JOB_NAME = 'RPT_ACCESS_LOG'  /* job name PARAMETER */
	    AND STEP_NAME in ('ETL_END', 'ETL_ERROR') /* stop status */
      GROUP BY JOB_NAME, SID
    ) end_job /* ends */
  ON start_job.JOB_NAME = end_job.JOB_NAME
     AND start_job.SID = end_job.SID

Zomwe zili patebulo:

  • kuyambira ndi kutha kwa ndondomeko yokonza deta kuyenera kutsatiridwa ndi masitepe ETL_START ndi ETL_END
  • pakakhala cholakwika, sitepe ya ETL_ERROR yokhala ndi tanthauzo lake iyenera kupangidwa
  • kuchuluka kwa zomwe zasinthidwa ziyenera kuwonetsedwa, mwachitsanzo, ndi nyenyezi
  • njira yomweyi ikhoza kuyambika nthawi yomweyo ndi force_restart=y parameter, popanda nambala ya gawolo imaperekedwa kokha ku ndondomeko yomalizidwa.
  • mumayendedwe abwinobwino, simungayendetse njira yofananira yosinthira deta

Zofunikira zogwirira ntchito ndi tebulo ndi izi:

  • kupeza nambala ya gawo la njira yoyendetsera ETL
  • lowetsani log lolowera mu tebulo
  • kupeza mbiri yabwino yomaliza ya njira ya ETL

M'ma database monga Oracle kapena Postgres, izi zitha kukhazikitsidwa ngati ntchito zomangidwa. sqlite imafuna makina akunja, ndipo pamenepa izo zojambulidwa mu PHP.

Pomaliza

Chifukwa chake, mauthenga olakwika mu zida zosinthira deta amakhala ndi gawo lofunikira kwambiri. Koma ndizovuta kuzitchula kuti mulingo woyenera kwambiri kuti mupeze chomwe chimayambitsa vutoli. Pamene chiwerengero cha ndondomeko chikuyandikira zana, ndiye kuti kuyang'anira ndondomeko kumasanduka ntchito yovuta.

Nkhaniyi ikupereka chitsanzo cha njira yothetsera vutoli mwa mawonekedwe a prototype. Choyimira chaching'ono chonse chosungira chikupezeka mu gitlab SQLite PHP ETL Utilities.

Source: www.habr.com

Kuwonjezera ndemanga