Airflow ื”ื•ื ื›ืœื™ ืœืคื™ืชื•ื— ื•ืชื—ื–ื•ืงื” ื ื•ื—ื” ื•ืžื”ื™ืจื” ืฉืœ ืชื”ืœื™ื›ื™ ืขื™ื‘ื•ื“ ื ืชื•ื ื™ื ืืฆื•ื•ื”

Airflow ื”ื•ื ื›ืœื™ ืœืคื™ืชื•ื— ื•ืชื—ื–ื•ืงื” ื ื•ื—ื” ื•ืžื”ื™ืจื” ืฉืœ ืชื”ืœื™ื›ื™ ืขื™ื‘ื•ื“ ื ืชื•ื ื™ื ืืฆื•ื•ื”

ืฉืœื•ื, ื”ื‘ืจ! ื‘ืžืืžืจ ื–ื” ืื ื™ ืจื•ืฆื” ืœื“ื‘ืจ ืขืœ ื›ืœื™ ืื—ื“ ื ื”ื“ืจ ืœืคื™ืชื•ื— ืชื”ืœื™ื›ื™ ืขื™ื‘ื•ื“ ื ืชื•ื ื™ื ืืฆื•ื•ื”, ืœืžืฉืœ, ื‘ืชืฉืชื™ืช ืฉืœ DWH ืืจื’ื•ื ื™ ืื• DataLake ืฉืœืš. ื ื“ื‘ืจ ืขืœ Apache Airflow (ืœื”ืœืŸ Airflow). ื”ื™ื ื ืฉืœืœืช ื‘ืื•ืคืŸ ืœื ื”ื•ื’ืŸ ืชืฉื•ืžืช ืœื‘ ืขืœ Habrรฉ, ื•ื‘ืขื™ืงืจ ืื ืกื” ืœืฉื›ื ืข ืื•ืชืš ืฉืœืคื—ื•ืช Airflow ืฉื•ื•ื” ืœื”ืกืชื›ืœ ื›ืืฉืจ ืืชื” ื‘ื•ื—ืจ ืžืชื–ืžืŸ ืœืชื”ืœื™ื›ื™ ื”-ETL/ELT ืฉืœืš.

ื‘ืขื‘ืจ ื›ืชื‘ืชื™ ืกื“ืจืช ืžืืžืจื™ื ื‘ื ื•ืฉื DWH ื›ืฉืขื‘ื“ืชื™ ื‘ื‘ื ืง ื˜ื™ื ืงื•ืฃ. ื›ืขืช ื”ืคื›ืชื™ ืœื—ืœืง ืžืฆื•ื•ืช Mail.Ru Group ื•ืื ื™ ืžืคืชื— ืคืœื˜ืคื•ืจืžื” ืœื ื™ืชื•ื— ื ืชื•ื ื™ื ื‘ืชื—ื•ื ื”ืžืฉื—ืงื™ื. ืœืžืขืฉื”, ื›ืฉื™ื•ืคื™ืขื• ื—ื“ืฉื•ืช ื•ืคืชืจื•ื ื•ืช ืžืขื ื™ื™ื ื™ื, ื”ืฆื•ื•ืช ืฉืœื™ ื•ืื ื™ ื ื“ื‘ืจ ื›ืืŸ ืขืœ ื”ืคืœื˜ืคื•ืจืžื” ืฉืœื ื• ืœื ื™ืชื•ื— ื ืชื•ื ื™ื.

ืคืจื•ืœื•ื’

ืื– ื‘ื•ืื• ื ืชื—ื™ืœ. ืžื”ื™ ื–ืจื™ืžืช ืื•ื•ื™ืจ? ื–ื•ื”ื™ ืกืคืจื™ื™ื” (ืื• ืกื˜ ืกืคืจื™ื•ืช) ืœืคืชื—, ืœืชื›ื ืŸ ื•ืœื ื˜ืจ ืชื”ืœื™ื›ื™ ืขื‘ื•ื“ื”. ื”ืชื›ื•ื ื” ื”ืขื™ืงืจื™ืช ืฉืœ Airflow: ืงื•ื“ Python ืžืฉืžืฉ ืœืชื™ืื•ืจ (ืœืคืชื—) ืชื”ืœื™ื›ื™ื. ื™ืฉ ืœื–ื” ื”ืจื‘ื” ื™ืชืจื•ื ื•ืช ืœืืจื’ื•ืŸ ื”ืคืจื•ื™ืงื˜ ื•ื”ืคื™ืชื•ื— ืฉืœืš: ื‘ืขืฆื, ืคืจื•ื™ืงื˜ ื”-ETL ืฉืœืš (ืœืžืฉืœ) ื”ื•ื ืจืง ืคืจื•ื™ืงื˜ Python, ื•ืืชื” ื™ื›ื•ืœ ืœืืจื’ืŸ ืื•ืชื• ื›ืจืฆื•ื ืš, ืชื•ืš ื”ืชื—ืฉื‘ื•ืช ื‘ืคืจื˜ื™ ื”ืชืฉืชื™ืช, ื’ื•ื“ืœ ื”ืฆื•ื•ืช ื• ื“ืจื™ืฉื•ืช ืื—ืจื•ืช. ืžื‘ื—ื™ื ื” ืื™ื ืกื˜ืจื•ืžื ื˜ืœื™ืช ื”ื›ืœ ืคืฉื•ื˜. ื”ืฉืชืžืฉ ืœืžืฉืœ PyCharm + Git. ื–ื” ื ืคืœื ื•ืžืื•ื“ ื ื•ื—!

ืขื›ืฉื™ื• ื‘ื•ืื• ื ืกืชื›ืœ ืขืœ ื”ื™ืฉื•ื™ื•ืช ื”ืขื™ืงืจื™ื•ืช ืฉืœ Airflow. ืขืœ ื™ื“ื™ ื”ื‘ื ืช ื”ืžื”ื•ืช ื•ื”ืžื˜ืจื” ืฉืœื”ื, ืืชื” ื™ื›ื•ืœ ืœืืจื’ืŸ ื‘ืฆื•ืจื” ืžื™ื˜ื‘ื™ืช ืืช ืืจื›ื™ื˜ืงื˜ื•ืจืช ื”ืชื”ืœื™ืš ืฉืœืš. ืื•ืœื™ ื”ื™ืฉื•ืช ื”ืขื™ืงืจื™ืช ื”ื™ื ื”ื’ืจืฃ ื”ืืฆื™ืงืœื™ ื”ืžื›ื•ื•ืŸ (ืœื”ืœืŸ DAG).

DAG

DAG ื”ื•ื ืื™ื–ืฉื”ื• ืฉื™ื•ืš ืžืฉืžืขื•ืชื™ ืฉืœ ื”ืžืฉื™ืžื•ืช ืฉืœืš ืฉืืชื” ืจื•ืฆื” ืœื”ืฉืœื™ื ื‘ืจืฆืฃ ืžื•ื’ื“ืจ ื‘ื”ื—ืœื˜ ืœืคื™ ืœื•ื— ื–ืžื ื™ื ืกืคืฆื™ืคื™. Airflow ืžืกืคืง ืžืžืฉืง ืื™ื ื˜ืจื ื˜ ื ื•ื— ืœืขื‘ื•ื“ื” ืขื DAGs ื•ื™ืฉื•ื™ื•ืช ืื—ืจื•ืช:

Airflow ื”ื•ื ื›ืœื™ ืœืคื™ืชื•ื— ื•ืชื—ื–ื•ืงื” ื ื•ื—ื” ื•ืžื”ื™ืจื” ืฉืœ ืชื”ืœื™ื›ื™ ืขื™ื‘ื•ื“ ื ืชื•ื ื™ื ืืฆื•ื•ื”

ื”-DAG ืขืฉื•ื™ ืœื”ื™ืจืื•ืช ื›ืš:

Airflow ื”ื•ื ื›ืœื™ ืœืคื™ืชื•ื— ื•ืชื—ื–ื•ืงื” ื ื•ื—ื” ื•ืžื”ื™ืจื” ืฉืœ ืชื”ืœื™ื›ื™ ืขื™ื‘ื•ื“ ื ืชื•ื ื™ื ืืฆื•ื•ื”

ื”ื™ื–ื, ื‘ืขืช ืชื›ื ื•ืŸ DAG, ืงื•ื‘ืข ืžืขืจืš ืžืคืขื™ืœื™ื ืฉืขืœื™ื”ื ื™ื™ื‘ื ื• ืžืฉื™ืžื•ืช ื‘ืชื•ืš ื”-DAG. ื›ืืŸ ืื ื• ืžื’ื™ืขื™ื ืœื™ืฉื•ืช ื—ืฉื•ื‘ื” ื ื•ืกืคืช: ืžืคืขื™ืœ ื–ืจื™ืžืช ืื•ื•ื™ืจ.

ืื•ืคืจื˜ื•ืจื™ื

ืžืคืขื™ืœ ื”ื•ื ื™ืฉื•ืช ืฉืขืœ ื‘ืกื™ืกื” ื ื•ืฆืจื™ื ืžื•ืคืขื™ ืขื‘ื•ื“ื”, ื”ืžืชืืจื™ื ืžื” ื™ืงืจื” ื‘ืžื”ืœืš ื‘ื™ืฆื•ืข ืžื•ืคืข ืขื‘ื•ื“ื”. ื–ืจื™ืžืช ืื•ื•ื™ืจ ืžืฉืชื—ืจืจืช ืž-GitHub ื›ื‘ืจ ืžื›ื™ืœ ืงื‘ื•ืฆื” ืฉืœ ืื•ืคืจื˜ื•ืจื™ื ืžื•ื›ื ื™ื ืœืฉื™ืžื•ืฉ. ื“ื•ื’ืžืื•ืช:

  • BashOperator - ืื•ืคืจื˜ื•ืจ ืœื‘ื™ืฆื•ืข ืคืงื•ื“ืช bash.
  • PythonOperator - ืžืคืขื™ืœ ืœืงืจื™ืื” ืœืงื•ื“ Python.
  • EmailOperator โ€” ืื•ืคืจื˜ื•ืจ ืœืฉืœื™ื—ืช ื“ื•ืืจ ืืœืงื˜ืจื•ื ื™.
  • HTTPOperator - ืื•ืคืจื˜ื•ืจ ืœืขื‘ื•ื“ื” ืขื ื‘ืงืฉื•ืช http.
  • SqlOperator - ืื•ืคืจื˜ื•ืจ ืœื‘ื™ืฆื•ืข ืงื•ื“ SQL.
  • ืกื ืกื•ืจ ื”ื•ื ืžืคืขื™ืœ ืœื”ืžืชื ื” ืœืื™ืจื•ืข (ื”ื’ืขืช ื”ืฉืขื” ื”ื ื“ืจืฉืช, ื”ื•ืคืขืช ื”ืงื•ื‘ืฅ ื”ื ื“ืจืฉ, ืฉื•ืจื” ื‘ืžืื’ืจ, ืชื’ื•ื‘ื” ืžื”-API ื•ื›ื•' ื•ื›ื•').

ื™ืฉื ื ืื•ืคืจื˜ื•ืจื™ื ืกืคืฆื™ืคื™ื™ื ื™ื•ืชืจ: DockerOperator, HiveOperator, S3FileTransferOperator, PrestoToMysqlOperator, SlackOperator.

ืืชื” ื™ื›ื•ืœ ื’ื ืœืคืชื— ืื•ืคืจื˜ื•ืจื™ื ืขืœ ืกืžืš ื”ืžืืคื™ื™ื ื™ื ืฉืœืš ื•ืœื”ืฉืชืžืฉ ื‘ื”ื ื‘ืคืจื•ื™ืงื˜ ืฉืœืš. ืœื“ื•ื’ืžื”, ื™ืฆืจื ื• ืืช MongoDBToHiveViaHdfsTransfer, ืื•ืคืจื˜ื•ืจ ืœื™ื™ืฆื•ื ืžืกืžื›ื™ื ืž-MongoDB ืœ-Hive, ื•ื›ืžื” ืื•ืคืจื˜ื•ืจื™ื ืœืขื‘ื•ื“ื” ืขื ืงืœื™ืงื”ืื•ืก: CHLoadFromHiveOperator ื•-CHTableLoaderOperator. ื‘ืขื™ืงืจื• ืฉืœ ื“ื‘ืจ, ื‘ืจื’ืข ืฉืคืจื•ื™ืงื˜ ื”ืฉืชืžืฉ ืœืขืชื™ื ืงืจื•ื‘ื•ืช ื‘ืงื•ื“ ื”ื‘ื ื•ื™ ืขืœ ื”ืฆื”ืจื•ืช ื‘ืกื™ืกื™ื•ืช, ืืชื” ื™ื›ื•ืœ ืœื—ืฉื•ื‘ ืขืœ ื‘ื ื™ื™ืชื• ืœื”ืฆื”ืจื” ื—ื“ืฉื”. ื–ื” ื™ืคืฉื˜ ืืช ื”ืžืฉืš ื”ืคื™ืชื•ื—, ื•ืชืจื—ื™ื‘ ืืช ืกืคืจื™ื™ืช ื”ืžืคืขื™ืœื™ื ืฉืœืš ื‘ืคืจื•ื™ืงื˜.

ืœืื—ืจ ืžื›ืŸ, ื™ืฉ ืœื‘ืฆืข ืืช ื›ืœ ื”ืžืงืจื™ื ื”ืœืœื• ืฉืœ ืžืฉื™ืžื•ืช, ื•ืขื›ืฉื™ื• ื ื“ื‘ืจ ืขืœ ื”ืžืชื–ืžืŸ.

ืžืชื–ืžืŸ

ืžืชื–ืžืŸ ื”ืžืฉื™ืžื•ืช ืฉืœ Airflow ื‘ื ื•ื™ ืกืœืจื™. ืกืœืจื™ ื”ื™ื ืกืคืจื™ื™ืช Python ื”ืžืืคืฉืจืช ืœืš ืœืืจื’ืŸ ืชื•ืจ ื‘ืชื•ืกืคืช ื‘ื™ืฆื•ืข ืืกื™ื ื›ืจื•ื ื™ ื•ืžื‘ื•ื–ืจ ืฉืœ ืžืฉื™ืžื•ืช. ื‘ืฆื“ Airflow, ื›ืœ ื”ืžืฉื™ืžื•ืช ืžื—ื•ืœืงื•ืช ืœื‘ืจื™ื›ื•ืช. ื‘ืจื™ื›ื•ืช ื ื•ืฆืจื•ืช ื‘ืื•ืคืŸ ื™ื“ื ื™. ื‘ื“ืจืš ื›ืœืœ, ืžื˜ืจืชื ื”ื™ื ืœื”ื’ื‘ื™ืœ ืืช ืขื•ืžืก ื”ืขื‘ื•ื“ื” ืฉืœ ื”ืขื‘ื•ื“ื” ืขื ื”ืžืงื•ืจ ืื• ืœืืคื™ื™ืŸ ืžืฉื™ืžื•ืช ื‘ืชื•ืš ื”-DWH. ื ื™ืชืŸ ืœื ื”ืœ ื‘ืจื™ื›ื•ืช ื‘ืืžืฆืขื•ืช ืžืžืฉืง ื”ืื™ื ื˜ืจื ื˜:

Airflow ื”ื•ื ื›ืœื™ ืœืคื™ืชื•ื— ื•ืชื—ื–ื•ืงื” ื ื•ื—ื” ื•ืžื”ื™ืจื” ืฉืœ ืชื”ืœื™ื›ื™ ืขื™ื‘ื•ื“ ื ืชื•ื ื™ื ืืฆื•ื•ื”

ืœื›ืœ ื‘ืจื™ื›ื” ื™ืฉ ื”ื’ื‘ืœื” ืขืœ ืžืกืคืจ ื”ืžืฉื‘ืฆื•ืช. ื‘ืขืช ื™ืฆื™ืจืช DAG, ื”ื•ื ืžืงื‘ืœ ืžืื’ืจ:

ALERT_MAILS =  Variable.get("gv_mail_admin_dwh")
DAG_NAME = 'dma_load'
OWNER = 'Vasya Pupkin'
DEPENDS_ON_PAST = True
EMAIL_ON_FAILURE = True
EMAIL_ON_RETRY = True
RETRIES = int(Variable.get('gv_dag_retries'))
POOL = 'dma_pool'
PRIORITY_WEIGHT = 10

start_dt = datetime.today() - timedelta(1)
start_dt = datetime(start_dt.year, start_dt.month, start_dt.day)

default_args = {
    'owner': OWNER,
    'depends_on_past': DEPENDS_ON_PAST,
    'start_date': start_dt,
    'email': ALERT_MAILS,
    'email_on_failure': EMAIL_ON_FAILURE,
    'email_on_retry': EMAIL_ON_RETRY,
    'retries': RETRIES,
    'pool': POOL,
    'priority_weight': PRIORITY_WEIGHT
}
dag = DAG(DAG_NAME, default_args=default_args)
dag.doc_md = __doc__

ื ื™ืชืŸ ืœืขืงื•ืฃ ืžืื’ืจ ืฉื”ื•ื’ื“ืจ ื‘ืจืžืช DAG ื‘ืจืžืช ื”ืžืฉื™ืžื”.
ืชื”ืœื™ืš ื ืคืจื“, Scheduler, ืื—ืจืื™ ืขืœ ืชื–ืžื•ืŸ ื›ืœ ื”ืžืฉื™ืžื•ืช ื‘-Airflow. ืœืžืขืฉื”, ืžืชื–ืžืŸ ืขื•ืกืง ื‘ื›ืœ ื”ืžื›ื ื™ืงื” ืฉืœ ื”ื’ื“ืจืช ืžืฉื™ืžื•ืช ืœื‘ื™ืฆื•ืข. ื”ืžืฉื™ืžื” ืขื•ื‘ืจืช ืžืกืคืจ ืฉืœื‘ื™ื ืœืคื ื™ ื‘ื™ืฆื•ืข:

  1. ื”ืžืฉื™ืžื•ืช ื”ืงื•ื“ืžื•ืช ื”ื•ืฉืœืžื• ื‘-DAG; ืžืฉื™ืžื•ืช ื—ื“ืฉื•ืช ื ื™ืชืŸ ืœืขืžื•ื“ ื‘ืชื•ืจ.
  2. ื”ืชื•ืจ ืžืžื•ื™ืŸ ื‘ื”ืชืื ืœืขื“ื™ืคื•ืช ื”ืžืฉื™ืžื•ืช (ื ื™ืชืŸ ื’ื ืœืฉืœื•ื˜ ื‘ืกื“ืจื™ ื”ืขื“ื™ืคื•ื™ื•ืช), ื•ืื ื™ืฉ ืžืฉื‘ืฆืช ืคื ื•ื™ื” ื‘ื‘ืจื™ื›ื”, ื ื™ืชืŸ ืœืงื—ืช ืืช ื”ืžืฉื™ืžื” ืœืคืขื•ืœื”.
  3. ืื ื™ืฉ ืกืœืจื™ ืขื•ื‘ื“ ื—ื™ื ื, ื”ืžืฉื™ืžื” ื ืฉืœื—ืช ืืœื™ื•; ื”ืขื‘ื•ื“ื” ืฉืชื›ื ืชืช ื‘ื‘ืขื™ื” ืžืชื—ื™ืœื”, ื‘ืืžืฆืขื•ืช ืื•ืคืจื˜ื•ืจ ื›ื–ื” ืื• ืื—ืจ.

ืคืฉื•ื˜ ืžืกืคื™ืง.

ืžืชื–ืžืŸ ืคื•ืขืœ ืขืœ ื”ืกื˜ ืฉืœ ื›ืœ DAGs ื•ื›ืœ ื”ืžืฉื™ืžื•ืช ื‘ืชื•ืš DAGs.

ื›ื“ื™ ืฉืžืชื–ืžืŸ ื™ืชื—ื™ืœ ืœืขื‘ื•ื“ ืขื DAG, ื”-DAG ืฆืจื™ืš ืœื”ื’ื“ื™ืจ ืœื•ื— ื–ืžื ื™ื:

dag = DAG(DAG_NAME, default_args=default_args, schedule_interval='@hourly')

ื™ืฉ ืงื‘ื•ืฆื” ืฉืœ ื”ื’ื“ืจื•ืช ืžื•ื›ื ื•ืช ืžืจืืฉ: @once, @hourly, @daily, @weekly, @monthly, @yearly.

ืืชื” ื™ื›ื•ืœ ื’ื ืœื”ืฉืชืžืฉ ื‘ื‘ื™ื˜ื•ื™ื™ cron:

dag = DAG(DAG_NAME, default_args=default_args, schedule_interval='*/10 * * * *')

ืชืืจื™ืš ื‘ื™ืฆื•ืข

ื›ื“ื™ ืœื”ื‘ื™ืŸ ื›ื™ืฆื“ ืคื•ืขืœืช Airflow, ื—ืฉื•ื‘ ืœื”ื‘ื™ืŸ ืžื”ื• ืชืืจื™ืš ื‘ื™ืฆื•ืข ืขื‘ื•ืจ DAG. ื‘-Airflow, ืœ-DAG ื™ืฉ ืžืžื“ ืชืืจื™ืš ื‘ื™ืฆื•ืข, ื›ืœื•ืžืจ, ื‘ื”ืชืื ืœืœื•ื— ื”ื–ืžื ื™ื ืฉืœ ื”ืขื‘ื•ื“ื” ืฉืœ ื”-DAG, ื ื•ืฆืจื™ื ืžื•ืคืขื™ ืžืฉื™ืžื•ืช ืขื‘ื•ืจ ื›ืœ ืชืืจื™ืš ื‘ื™ืฆื•ืข. ื•ืœื›ืœ ืชืืจื™ืš ื‘ื™ืฆื•ืข, ื ื™ืชืŸ ืœื‘ืฆืข ืžื—ื“ืฉ ืžืฉื™ืžื•ืช - ืื•, ืœืžืฉืœ, DAG ื™ื›ื•ืœ ืœืขื‘ื•ื“ ื‘ื•-ื–ืžื ื™ืช ื‘ืžืกืคืจ ืชืืจื™ื›ื™ ื‘ื™ืฆื•ืข. ื–ื” ืžื•ืฆื’ ื›ืืŸ ื‘ื‘ื™ืจื•ืจ:

Airflow ื”ื•ื ื›ืœื™ ืœืคื™ืชื•ื— ื•ืชื—ื–ื•ืงื” ื ื•ื—ื” ื•ืžื”ื™ืจื” ืฉืœ ืชื”ืœื™ื›ื™ ืขื™ื‘ื•ื“ ื ืชื•ื ื™ื ืืฆื•ื•ื”

ืœืžืจื‘ื” ื”ืฆืขืจ (ืื• ืื•ืœื™ ืœืžืจื‘ื” ื”ืžื–ืœ: ื–ื” ืชืœื•ื™ ื‘ืžืฆื‘), ืื ื™ื™ืฉื•ื ื”ืžืฉื™ืžื” ื‘-DAG ื™ืชื•ืงืŸ, ืื– ื”ื‘ื™ืฆื•ืข ื‘ืชืืจื™ืš ื”ื‘ื™ืฆื•ืข ื”ืงื•ื“ื ื™ืžืฉื™ืš ืชื•ืš ื”ืชื—ืฉื‘ื•ืช ื‘ื”ืชืืžื•ืช. ื–ื” ื˜ื•ื‘ ืื ืืชื” ืฆืจื™ืš ืœื—ืฉื‘ ืžื—ื“ืฉ ื ืชื•ื ื™ื ื‘ืชืงื•ืคื•ืช ืงื•ื“ืžื•ืช ื‘ืืžืฆืขื•ืช ืืœื’ื•ืจื™ืชื ื—ื“ืฉ, ืื‘ืœ ื–ื” ืจืข ื›ื™ ื™ื›ื•ืœืช ื”ืฉื—ื–ื•ืจ ืฉืœ ื”ืชื•ืฆืื” ืื•ื‘ื“ืช (ื›ืžื•ื‘ืŸ, ืืฃ ืื—ื“ ืœื ืžืคืจื™ืข ืœืš ืœื”ื—ื–ื™ืจ ืืช ื”ื’ืจืกื” ื”ื ื“ืจืฉืช ืฉืœ ืงื•ื“ ื”ืžืงื•ืจ ืž-Git ื•ืœื—ืฉื‘ ืžื” ืืชื” ืฆืจื™ืš ืคืขื ืื—ืช, ื›ืžื• ืฉืืชื” ืฆืจื™ืš ืื•ืชื”).

ื™ืฆื™ืจืช ืžืฉื™ืžื•ืช

ื”ื™ื™ืฉื•ื ืฉืœ ื”-DAG ื”ื•ื ืงื•ื“ ื‘-Python, ื›ืš ืฉื™ืฉ ืœื ื• ื“ืจืš ืžืื•ื“ ื ื•ื—ื” ืœื”ืคื—ื™ืช ืืช ื›ืžื•ืช ื”ืงื•ื“ ื›ืฉืขื•ื‘ื“ื™ื, ืœืžืฉืœ, ืขื ืžืงื•ืจื•ืช ืžืคื•ืฆืœื™ื. ื ื ื™ื— ืฉื™ืฉ ืœืš ืฉืœื•ืฉื” ืจืกื™ืกื™ MySQL ื›ืžืงื•ืจ, ืืชื” ืฆืจื™ืš ืœื˜ืคืก ืœื›ืœ ืื—ื“ ืžื”ื ื•ืœืืกื•ืฃ ื›ืžื” ื ืชื•ื ื™ื. ื–ืืช ื•ืขื•ื“, ื‘ืื•ืคืŸ ืขืฆืžืื™ ื•ื‘ืžืงื‘ื™ืœ. ืงื•ื“ Python ื‘-DAG ืขืฉื•ื™ ืœื”ื™ืจืื•ืช ื›ืš:

connection_list = lv.get('connection_list')

export_profiles_sql = '''
SELECT
  id,
  user_id,
  nickname,
  gender,
  {{params.shard_id}} as shard_id
FROM profiles
'''

for conn_id in connection_list:
    export_profiles = SqlToHiveViaHdfsTransfer(
        task_id='export_profiles_from_' + conn_id,
        sql=export_profiles_sql,
        hive_table='stg.profiles',
        overwrite=False,
        tmpdir='/data/tmp',
        conn_id=conn_id,
        params={'shard_id': conn_id[-1:], },
        compress=None,
        dag=dag
    )
    export_profiles.set_upstream(exec_truncate_stg)
    export_profiles.set_downstream(load_profiles)

ื”-DAG ื ืจืื” ื›ืš:

Airflow ื”ื•ื ื›ืœื™ ืœืคื™ืชื•ื— ื•ืชื—ื–ื•ืงื” ื ื•ื—ื” ื•ืžื”ื™ืจื” ืฉืœ ืชื”ืœื™ื›ื™ ืขื™ื‘ื•ื“ ื ืชื•ื ื™ื ืืฆื•ื•ื”

ื‘ืžืงืจื” ื–ื”, ืืชื” ื™ื›ื•ืœ ืœื”ื•ืกื™ืฃ ืื• ืœื”ืกื™ืจ ืจืกื™ืก ืคืฉื•ื˜ ืขืœ ื™ื“ื™ ื”ืชืืžืช ื”ื”ื’ื“ืจื•ืช ื•ืขื“ื›ื•ืŸ ื”-DAG. ื ื•ึนื—ึท!

ืืชื” ื™ื›ื•ืœ ื’ื ืœื”ืฉืชืžืฉ ื‘ื™ืฆื™ืจืช ืงื•ื“ ืžื•ืจื›ื‘ ื™ื•ืชืจ, ืœืžืฉืœ, ืœืขื‘ื•ื“ ืขื ืžืงื•ืจื•ืช ื‘ืฆื•ืจื” ืฉืœ ืžืกื“ ื ืชื•ื ื™ื ืื• ืœืชืืจ ืžื‘ื ื” ื˜ื‘ืœื”, ืืœื’ื•ืจื™ืชื ืœืขื‘ื•ื“ื” ืขื ื˜ื‘ืœื”, ื•ื‘ื”ืชื—ืฉื‘ ื‘ืชื›ื•ื ื•ืช ืฉืœ ืชืฉืชื™ืช DWH, ืœื™ืฆื•ืจ ืชื”ืœื™ืš ืœื˜ืขื™ื ืช N ื˜ื‘ืœืื•ืช ืœืื—ืกื•ืŸ ืฉืœืš. ืื•, ืœืžืฉืœ, ืขื‘ื•ื“ื” ืขื API ืฉืื™ื ื• ืชื•ืžืš ื‘ืขื‘ื•ื“ื” ืขื ืคืจืžื˜ืจ ื‘ืฆื•ืจืช ืจืฉื™ืžื”, ื ื™ืชืŸ ืœื™ืฆื•ืจ N ืžืฉื™ืžื•ืช ื‘-DAG ืžืชื•ืš ืจืฉื™ืžื” ื–ื•, ืœื”ื’ื‘ื™ืœ ืืช ื”ื”ืงื‘ืœื” ืฉืœ ื‘ืงืฉื•ืช ื‘-API ืœืžืื’ืจ ื•ืœื’ืจื“ ื”ื ืชื•ื ื™ื ื”ื“ืจื•ืฉื™ื ืžื”-API. ื’ึธืžึดื™ืฉื!

ืžืื’ืจ

ืœ-Airflow ื™ืฉ ืžืื’ืจ ืื—ื•ืจื™ ืžืฉืœื”, ืžืกื“ ื ืชื•ื ื™ื (ื™ื›ื•ืœ ืœื”ื™ื•ืช MySQL ืื• Postgres, ื™ืฉ ืœื ื• Postgres), ื”ืžืื—ืกืŸ ืืช ืžืฆื‘ื™ ื”ืžืฉื™ืžื•ืช, DAGs, ื”ื’ื“ืจื•ืช ื—ื™ื‘ื•ืจ, ืžืฉืชื ื™ื ื’ืœื•ื‘ืœื™ื™ื ื•ื›ื•' ื•ื›ื•'. ื›ืืŸ ื”ื™ื™ืชื™ ืจื•ืฆื” ืœื•ืžืจ ืฉ- ืžืื’ืจ ื‘-Airflow ื”ื•ื ืคืฉื•ื˜ ืžืื•ื“ (ื‘ืขืจืš 20 ื˜ื‘ืœืื•ืช) ื•ื ื•ื— ืื ืืชื” ืจื•ืฆื” ืœื‘ื ื•ืช ืชื”ืœื™ื›ื™ื ืžืฉืœืš ืขืœ ื’ื‘ื™ื•. ืื ื™ ื–ื•ื›ืจ ืืช 100500 ื”ื˜ื‘ืœืื•ืช ื‘ืžืื’ืจ ืื™ื ืคื•ืจืžื˜ื™ืงื”, ืฉื”ื™ื” ืฆืจื™ืš ืœืœืžื•ื“ ื”ืจื‘ื” ื–ืžืŸ ืœืคื ื™ ืฉื”ื‘ื™ื ื• ืื™ืš ืœื‘ื ื•ืช ืฉืื™ืœืชื”.

ื ื™ื˜ื•ืจ

ืœืื•ืจ ื”ืคืฉื˜ื•ืช ืฉืœ ื”ืžืื’ืจ, ืืชื” ื™ื›ื•ืœ ืœื‘ื ื•ืช ืชื”ืœื™ืš ื ื™ื˜ื•ืจ ืžืฉื™ืžื•ืช ืฉื ื•ื— ืœืš. ืื ื• ืžืฉืชืžืฉื™ื ื‘ืคื ืงืก ืจืฉื™ืžื•ืช ื‘-Zeppelin, ืฉื‘ื• ืื ื• ื‘ื•ื—ื ื™ื ืืช ืžืฆื‘ ื”ืžืฉื™ืžื•ืช:

Airflow ื”ื•ื ื›ืœื™ ืœืคื™ืชื•ื— ื•ืชื—ื–ื•ืงื” ื ื•ื—ื” ื•ืžื”ื™ืจื” ืฉืœ ืชื”ืœื™ื›ื™ ืขื™ื‘ื•ื“ ื ืชื•ื ื™ื ืืฆื•ื•ื”

ื–ื” ื™ื›ื•ืœ ืœื”ื™ื•ืช ื’ื ืžืžืฉืง ื”ืื™ื ื˜ืจื ื˜ ืฉืœ Airflow ืขืฆืžื•:

Airflow ื”ื•ื ื›ืœื™ ืœืคื™ืชื•ื— ื•ืชื—ื–ื•ืงื” ื ื•ื—ื” ื•ืžื”ื™ืจื” ืฉืœ ืชื”ืœื™ื›ื™ ืขื™ื‘ื•ื“ ื ืชื•ื ื™ื ืืฆื•ื•ื”

ืงื•ื“ ื–ืจื™ืžืช ื”ืื•ื•ื™ืจ ื”ื•ื ืงื•ื“ ืคืชื•ื—, ืื– ื”ื•ืกืคื ื• ื”ืชืจืื” ืœื˜ืœื’ืจื. ื›ืœ ืžื•ืคืข ืคื•ืขืœ ืฉืœ ืžืฉื™ืžื”, ืื ืžืชืจื—ืฉืช ืฉื’ื™ืื”, ืฉื•ืœื— ื“ื•ืืจ ื–ื‘ืœ ืœืงื‘ื•ืฆื” ื‘ื˜ืœื’ืจื, ืฉื‘ื” ืžื•ืจื›ื‘ ืฆื•ื•ืช ื”ืคื™ืชื•ื— ื•ื”ืชืžื™ื›ื” ื›ื•ืœื•.

ืื ื• ืžืงื‘ืœื™ื ืžืขื ื” ืžื”ื™ืจ ื‘ืืžืฆืขื•ืช ื˜ืœื’ืจื (ืื ื ื“ืจืฉ), ื•ื“ืจืš ื–ืคืœื™ืŸ ืื ื• ืžืงื‘ืœื™ื ืชืžื•ื ื” ื›ื•ืœืœืช ืฉืœ ื”ืžืฉื™ืžื•ืช ื‘-Airflow.

ื‘ืกืš ื”ื›ืœ

ื–ืจื™ืžืช ืื•ื•ื™ืจ ื”ื™ื ื‘ืขื™ืงืจ ืงื•ื“ ืคืชื•ื—, ื•ืืชื” ืœื ืฆืจื™ืš ืœืฆืคื•ืช ืžืžื ื” ืœื ื™ืกื™ื. ื”ื™ื• ืžื•ื›ื ื™ื ืœื”ืฉืงื™ืข ื–ืžืŸ ื•ืžืืžืฅ ื›ื“ื™ ืœื‘ื ื•ืช ืคืชืจื•ืŸ ืฉืขื•ื‘ื“. ื”ืžื˜ืจื” ื‘ืจืช ื”ืฉื’ื”, ืชืืžื™ืŸ ืœื™, ื–ื” ืฉื•ื•ื” ืืช ื–ื”. ืžื”ื™ืจื•ืช ืคื™ืชื•ื—, ื’ืžื™ืฉื•ืช, ืงืœื•ืช ื”ื•ืกืคืช ืชื”ืœื™ื›ื™ื ื—ื“ืฉื™ื - ืืชื” ืชืื”ื‘ ืืช ื–ื”. ื›ืžื•ื‘ืŸ, ืืชื” ืฆืจื™ืš ืœื”ืงื“ื™ืฉ ืชืฉื•ืžืช ืœื‘ ืจื‘ื” ืœืืจื’ื•ืŸ ื”ืคืจื•ื™ืงื˜, ื”ื™ืฆื™ื‘ื•ืช ืฉืœ ื–ืจื™ืžืช ื”ืื•ื•ื™ืจ ืขืฆืžื”: ื ื™ืกื™ื ืœื ืงื•ืจื™ื.

ืขื›ืฉื™ื• ื™ืฉ ืœื ื• Airflow ืขื•ื‘ื“ ืžื“ื™ ื™ื•ื ื›-6,5 ืืœืฃ ืžืฉื™ืžื•ืช. ื”ื ื“ื™ ืฉื•ื ื™ื ื‘ืื•ืคื™ื™ื. ื™ืฉื ืŸ ืžืฉื™ืžื•ืช ืฉืœ ื˜ืขื™ื ืช ื ืชื•ื ื™ื ืœ-DWH ื”ืจืืฉื™ ืžืžืงื•ืจื•ืช ืจื‘ื™ื ื•ืฉื•ื ื™ื ื•ืžืื•ื“ ืกืคืฆื™ืคื™ื™ื, ื™ืฉ ืžืฉื™ืžื•ืช ืฉืœ ื—ื™ืฉื•ื‘ ื—ืœื•ื ื•ืช ืจืื•ื•ื” ื‘ืชื•ืš ื”-DWH ื”ืจืืฉื™, ื™ืฉ ืžืฉื™ืžื•ืช ืฉืœ ืคืจืกื•ื ื ืชื•ื ื™ื ืœ-DWH ืžื”ื™ืจ, ื™ืฉ ื”ืจื‘ื” ืžืื•ื“ ืžืฉื™ืžื•ืช ืฉื•ื ื•ืช - ื•-Airflow ืœื•ืขืก ืืช ื›ื•ืœื ื™ื•ื ืื—ืจ ื™ื•ื. ืื ืžื“ื‘ืจื™ื ื‘ืžืกืคืจื™ื, ื–ื”ื• 2,3 ืืœืฃ ืžืฉื™ืžื•ืช ELT ื‘ืžื•ืจื›ื‘ื•ืช ืžืฉืชื ื” ื‘ืชื•ืš DWH (Hadoop), ื‘ืขืจืš. 2,5 ืžืื•ืช ืžืื’ืจื™ ืžื™ื“ืข ืžืงื•ืจื•ืช, ื–ื” ืฆื•ื•ืช ืž 4 ืžืคืชื—ื™ ETL, ืืฉืจ ืžื—ื•ืœืงื™ื ืœืขื™ื‘ื•ื“ ื ืชื•ื ื™ื ETL ื‘-DWH ื•ืขื™ื‘ื•ื“ ื ืชื•ื ื™ื ELT ื‘ืชื•ืš DWH ื•ื›ืžื•ื‘ืŸ ืขื•ื“ ืžื ื”ืœ ืื—ื“, ื”ืขื•ืกืง ื‘ืชืฉืชื™ืช ื”ืฉื™ืจื•ืช.

ืชื•ื›ื ื™ื•ืช ืœืขืชื™ื“

ืžืกืคืจ ื”ืชื”ืœื™ื›ื™ื ื’ื“ืœ ื‘ื”ื›ืจื—, ื•ื”ื“ื‘ืจ ื”ืขื™ืงืจื™ ืฉื ืขืฉื” ื‘ืžื•ื ื—ื™ื ืฉืœ ืชืฉืชื™ืช Airflow ื”ื•ื ืงื ื” ืžื™ื“ื”. ืื ื—ื ื• ืจื•ืฆื™ื ืœื‘ื ื•ืช ืืฉื›ื•ืœ Airflow, ืœื”ืงืฆื•ืช ื–ื•ื’ ืจื’ืœื™ื™ื ืœืขื•ื‘ื“ื™ ืกืœืจื™ ื•ืœื™ืฆื•ืจ ืจืืฉ ืœืฉื›ืคื•ืœ ืขืฆืžื™ ืขื ืชื”ืœื™ื›ื™ ืชื–ืžื•ืŸ ืขื‘ื•ื“ื” ื•ืžืื’ืจ.

ืืคื™ืœื•ื’

ื–ื”, ื›ืžื•ื‘ืŸ, ืœื ื›ืœ ืžื” ืฉื”ื™ื™ืชื™ ืจื•ืฆื” ืœืกืคืจ ืขืœ Airflow, ืื‘ืœ ื ื™ืกื™ืชื™ ืœื”ื“ื’ื™ืฉ ืืช ื”ื ืงื•ื“ื•ืช ื”ืขื™ืงืจื™ื•ืช. ื”ืชื™ืื‘ื•ืŸ ื‘ื ืขื ื”ืื›ื™ืœื”, ื ืกื” ืืช ื–ื” ื•ืชืื”ื‘ ืืช ื–ื” :)

ืžืงื•ืจ: www.habr.com

ื”ื•ืกืคืช ืชื’ื•ื‘ื”