Extending Spark with MLflow

Hello, Khabrovites! As we already wrote, this month OTUS is launching two machine learning courses at once, a basic one and an advanced one. In this regard, we continue to share useful material.

The purpose of this article is to talk about our first experience using MLflow.

We will start our review of MLflow with its tracking server and log all the iterations of an experiment. Then we will share our experience of connecting Spark to MLflow using UDFs.

Context

We at Alpha Health use machine learning and artificial intelligence to empower people to take care of their health and well-being. That is why machine learning models lie at the heart of the data products we develop, and that is why MLflow, an open-source platform covering all aspects of the machine learning lifecycle, caught our attention.

MLflow

The main goal of MLflow is to provide an additional layer on top of machine learning that would let data scientists work with practically any machine learning library (h2o, keras, mleap, pytorch, sklearn and tensorflow), taking their work to the next level.

MLflow provides three components:

  • Tracking – recording and querying experiments: code, data, configuration and results. Tracking the process of creating a model is critically important.
  • Projects – a packaging format for running on any platform (for example, SageMaker).
  • Models – a common format for submitting models to different deployment tools (see the short sketch after this list).
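
To give a feel for what that common model format means in practice, here is a minimal sketch of loading a logged model back through the generic python_function flavor and scoring it locally. Note that mlflow.pyfunc.load_model is the API of more recent MLflow releases than the 0.7.0 used later in this article, and the artifact path and feature names below are placeholders:

import pandas as pd
import mlflow.pyfunc

# Load a previously logged model as a framework-agnostic python_function
# (the S3 path is a placeholder; any artifact location logged by MLflow works).
model = mlflow.pyfunc.load_model("s3://<bucket>/mlflow/artifacts/1/<run_id>/artifacts/model")

# Score a small pandas DataFrame locally, no matter which library trained the model.
sample = pd.DataFrame([{"feature_a": 1.0, "feature_b": 2.0}])
print(model.predict(sample))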

MLflow (in alpha at the time of writing) is an open-source platform that lets you manage the machine learning lifecycle, including experimentation, reuse and deployment.

Setting up MLflow

To use MLflow, you first need to set up your Python environment; for this we will use PyEnv (to install Python on a Mac, look here). This gives us a virtual environment in which we will install all the libraries needed to run it.

```
pyenv install 3.7.0
pyenv global 3.7.0 # Use Python 3.7
mkvirtualenv mlflow # Create a Virtual Env with Python 3.7
workon mlflow
```

Install the required libraries.

```
pip install mlflow==0.7.0 \
            Cython==0.29 \
            numpy==1.14.5 \
            pandas==0.23.4 \
            pyarrow==0.11.0
```

Note: We use PyArrow to run models as UDFs. The versions of PyArrow and Numpy had to be pinned because the latest versions conflicted with each other.

Launching the Tracking UI

MLflow Tracking lets us log and query experiments with Python and the REST API. On top of that, you can define where to store model artifacts (localhost, Amazon S3, Azure Blob Storage, Google Cloud Storage or an SFTP server). Since we use AWS at Alpha Health, S3 will be our artifact storage.

# Running a Tracking Server
mlflow server \
    --file-store /tmp/mlflow/fileStore \
    --default-artifact-root s3://<bucket>/mlflow/artifacts/ \
    --host localhost \
    --port 5000

MLflow recommends using persistent file storage. The file store is where the server will keep run and experiment metadata. When starting the server, make sure it points to a persistent file store. Here, for the sake of the experiment, we will simply use /tmp.

Keep in mind that if we want to use the mlflow server to run old experiments, they have to be present in the file store. However, even without this we could use them in a UDF, since all we need is the path to the model.

Note: Remember that the Tracking UI and the model client must have access to the artifact location. In other words, regardless of the fact that the Tracking UI is hosted on an EC2 instance, when MLflow is run locally the machine must have direct access to S3 in order to write model artifacts.

[Image: the Tracking UI stores artifacts in an S3 bucket]

Running models

Once the Tracking server is running, you can start training models.

As an example, we will use a modified version of the wine example from MLflow in Sklearn.

MLFLOW_TRACKING_URI=http://localhost:5000 python wine_quality.py \
  --alpha 0.9 \
  --l1_ration 0.5 \
  --wine_file ./data/winequality-red.csv

As we already said, MLflow lets you log model parameters, metrics and artifacts so that you can track how they evolve over iterations. This feature is extremely useful, because it lets us reproduce the best model either by contacting the Tracking server or by understanding which code performed the required iteration using the git commit hash logs.

with mlflow.start_run():

    ... model ...

    mlflow.log_param("source", wine_path)
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)

    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    mlflow.set_tag('domain', 'wine')
    mlflow.set_tag('predict', 'quality')
    mlflow.sklearn.log_model(lr, "model")

[Image: wine iterations]
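
To retrieve the best model programmatically rather than through the UI, the runs logged above can be queried back from the Tracking server. A minimal sketch, assuming MLFLOW_TRACKING_URI points at the server; mlflow.search_runs appeared in later MLflow releases than the 0.7.0 used here, so treat it as illustrative:

import mlflow

# Find the run with the lowest RMSE in experiment 1 (the wine experiment above).
best = mlflow.search_runs(
    experiment_ids=["1"],
    order_by=["metrics.rmse ASC"],
    max_results=1,
)
print(best[["run_id", "metrics.rmse", "params.alpha", "params.l1_ratio"]])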

Serving the model

The MLflow tracking server, launched with the "mlflow server" command, has a REST API for tracking runs and writing data to the local file system. You can specify the tracking server address with the "MLFLOW_TRACKING_URI" environment variable, and the MLflow tracking API will automatically contact the tracking server at this address to create/fetch run information, log metrics, and so on.

Source: Docs // Running a tracking server
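
The same can be done programmatically instead of via the environment variable. A small sketch, assuming a recent MLflow client where run.data exposes params and metrics as dictionaries:

import mlflow
from mlflow.tracking import MlflowClient

# Equivalent to exporting MLFLOW_TRACKING_URI before launching the script.
mlflow.set_tracking_uri("http://localhost:5000")

# Fetch the parameters and metrics logged for a specific run.
client = MlflowClient()
run = client.get_run("0f8691808e914d1087cf097a08730f17")
print(run.data.params, run.data.metrics)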

To serve a model from a server, we need a running Tracking server (see the launch above) and the Run ID of the model.

[Image: Run ID]

# Serve a sklearn model through 127.0.0.1:5005
MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow sklearn serve \
  --port 5005 \
  --run_id 0f8691808e914d1087cf097a08730f17 \
  --model-path model

To serve models with the MLflow serve functionality, we need access to the Tracking UI so it can fetch the information about the model simply by specifying --run_id.

Once the model has contacted the Tracking Server, we get a new model endpoint.

# Query Tracking Server Endpoint
curl -X POST \
  http://127.0.0.1:5005/invocations \
  -H 'Content-Type: application/json' \
  -d '[
	{
		"fixed acidity": 3.42,
		"volatile acidity": 1.66,
		"citric acid": 0.48,
		"residual sugar": 4.2,
		"chlorides": 0.229,
		"free sulfur dioxide": 19,
		"total sulfur dioxide": 25,
		"density": 1.98,
		"pH": 5.33,
		"sulphates": 4.39,
		"alcohol": 10.8
	}
]'

> {"predictions": [5.825055635303461]}

Running models from Spark

Even though the Tracking server is powerful enough to serve models in real time, train them and use the serving functionality (source: mlflow // docs // models #local), using Spark (batch or streaming) is an even more powerful solution thanks to its distributed nature.

Imagine that you simply did your training offline and then applied the output model to all of your data. This is where Spark and MLflow come into their own.

Installing PySpark + Jupyter + Spark

Source: Get Started with PySpark - Jupyter

To show how we apply MLflow models to Spark dataframes, we need to set up Jupyter notebooks to work together with PySpark.

Start by installing the latest stable version of Apache Spark:

cd ~/Downloads/
tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
mv ~/Downloads/spark-2.4.3-bin-hadoop2.7 ~/
ln -s ~/spark-2.4.3-bin-hadoop2.7 ~/spark

Install PySpark and Jupyter into the virtual environment:

pip install pyspark jupyter

Set the environment variables:

export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"

Having defined notebook-dir, we can store our notebooks in the folder we want.

Running Jupyter from PySpark

Since we can configure Jupyter as the PySpark driver, we can now run a Jupyter notebook in a PySpark context.

(mlflow) afranzi:~$ pyspark
[I 19:05:01.572 NotebookApp] sparkmagic extension enabled!
[I 19:05:01.573 NotebookApp] Serving notebooks from local directory: /Users/afranzi/Projects/notebooks
[I 19:05:01.573 NotebookApp] The Jupyter Notebook is running at:
[I 19:05:01.573 NotebookApp] http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
[I 19:05:01.573 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:05:01.574 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745


As mentioned above, MLflow provides the ability to log model artifacts to S3. As soon as we have the chosen model in hand, we can import it as a UDF using the mlflow.pyfunc module.

import mlflow.pyfunc

model_path = 's3://<bucket>/mlflow/artifacts/1/0f8691808e914d1087cf097a08730f17/artifacts/model'
wine_path = '/Users/afranzi/Projects/data/winequality-red.csv'
wine_udf = mlflow.pyfunc.spark_udf(spark, model_path)

df = spark.read.format("csv").option("header", "true").option('delimiter', ';').load(wine_path)
columns = [ "fixed acidity", "volatile acidity", "citric acid",
            "residual sugar", "chlorides", "free sulfur dioxide",
            "total sulfur dioxide", "density", "pH",
            "sulphates", "alcohol"
          ]
          
df.withColumn('prediction', wine_udf(*columns)).show(100, False)

[Image: PySpark - predicting wine quality]
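
Because the UDF is just another column expression, the same model can be applied in batch jobs and, with exactly the same code, to streaming dataframes, which is what makes this approach attractive for scoring whole datasets. A minimal sketch reusing df, wine_udf and columns from the snippet above; the input and output paths are hypothetical:

# Batch: score the whole dataset and persist the predictions.
scored = df.withColumn("prediction", wine_udf(*columns))
scored.write.mode("overwrite").option("header", "true").csv("s3://<bucket>/wine/predictions/")

# Streaming: apply the same UDF to a file-based stream of new wine records.
stream_df = (spark.readStream
                  .format("csv")
                  .option("header", "true")
                  .option("delimiter", ";")
                  .schema(df.schema)          # file streams require an explicit schema
                  .load("s3://<bucket>/wine/incoming/"))

(stream_df.withColumn("prediction", wine_udf(*columns))
          .writeStream
          .format("csv")
          .option("header", "true")
          .option("path", "s3://<bucket>/wine/predictions_stream/")
          .option("checkpointLocation", "s3://<bucket>/wine/checkpoints/")
          .start())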

So far we have talked about how to use PySpark together with MLflow, running wine quality prediction over the whole wine dataset. But what if you need to use Python MLflow modules from Scala Spark?

We tested this by splitting the Spark context between Scala and Python. That is, we registered the MLflow UDF in Python and used it from Scala (yes, perhaps not the best solution, but it is what we have).

Scala Spark + MLflow

For this example, we will add the Toree Kernel to the existing Jupyter.

Installing Spark + Toree + Jupyter

```
pip install toree
jupyter toree install --spark_home=${SPARK_HOME} --sys-prefix
jupyter kernelspec list
```
```
Available kernels:
  apache_toree_scala    /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/apache_toree_scala
  python3               /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/python3
```

As you can see from the attached notebook, the UDF is shared between Spark and PySpark. We hope this part will be useful for those who love Scala and want to deploy machine learning models to production.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{Column, DataFrame}
import scala.util.matching.Regex

val FirstAtRe: Regex = "^_".r
val AliasRe: Regex = "[\\s_.:@]+".r

def getFieldAlias(field_name: String): String = {
    FirstAtRe.replaceAllIn(AliasRe.replaceAllIn(field_name, "_"), "")
}

def selectFieldsNormalized(columns: List[String])(df: DataFrame): DataFrame = {
    val fieldsToSelect: List[Column] = columns.map(field =>
        col(field).as(getFieldAlias(field))
    )
    df.select(fieldsToSelect: _*)
}

def normalizeSchema(df: DataFrame): DataFrame = {
    val schema = df.columns.toList
    df.transform(selectFieldsNormalized(schema))
}

FirstAtRe = ^_
AliasRe = [\s_.:@]+

getFieldAlias: (field_name: String)String
selectFieldsNormalized: (columns: List[String])(df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
normalizeSchema: (df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
Out[1]:
[\s_.:@]+
In [2]:
val winePath = "~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv"
val modelPath = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"

winePath = ~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv
modelPath = /tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
Out[2]:
/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
In [3]:
val df = spark.read
              .format("csv")
              .option("header", "true")
              .option("delimiter", ";")
              .load(winePath)
              .transform(normalizeSchema)

df = [fixed_acidity: string, volatile_acidity: string ... 10 more fields]
Out[3]:
[fixed_acidity: string, volatile_acidity: string ... 10 more fields]
In [4]:
%%PySpark
import mlflow
from mlflow import pyfunc

model_path = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"
wine_quality_udf = mlflow.pyfunc.spark_udf(spark, model_path)

spark.udf.register("wineQuality", wine_quality_udf)
Out[4]:
<function spark_udf.<locals>.predict at 0x1116a98c8>
In [6]:
df.createOrReplaceTempView("wines")
In [10]:
%%SQL
SELECT 
    quality,
    wineQuality(
        fixed_acidity,
        volatile_acidity,
        citric_acid,
        residual_sugar,
        chlorides,
        free_sulfur_dioxide,
        total_sulfur_dioxide,
        density,
        pH,
        sulphates,
        alcohol
    ) AS prediction
FROM wines
LIMIT 10
Out[10]:
+-------+------------------+
|quality|        prediction|
+-------+------------------+
|      5| 5.576883967129615|
|      5|  5.50664776916154|
|      5| 5.525504822954496|
|      6| 5.504311247097457|
|      5| 5.576883967129615|
|      5|5.5556903912725755|
|      5| 5.467882654744997|
|      7| 5.710602976324739|
|      7| 5.657319539336507|
|      5| 5.345098606538708|
+-------+------------------+

In [17]:
spark.catalog.listFunctions.filter('name like "%wineQuality%").show(20, false)

+-----------+--------+-----------+---------+-----------+
|name       |database|description|className|isTemporary|
+-----------+--------+-----------+---------+-----------+
|wineQuality|null    |null       |null     |true       |
+-----------+--------+-----------+---------+-----------+

Next steps

Even though MLflow is in alpha at the time of writing, it looks quite promising. The mere ability to run multiple machine learning frameworks and consume them from a single endpoint takes recommender systems to the next level.

In addition, MLflow brings data engineers and data scientists closer together, laying down a common layer between them.

After this exploration of MLflow, we are confident that we will move forward and use it for our Spark pipelines and recommender systems.

It would also be nice to sync the file store with a database instead of the file system. That should give us multiple endpoints that can use the same file store. For example, using several Presto and Athena instances with the same Glue metastore.

To sum up, I would like to thank the MLflow community for making our work with data more interesting.

If you are playing around with MLflow, do not hesitate to write to us and tell us how you use it, and all the more so if you use it in production.

Learn more about the courses:
Machine Learning. Basic course
Machine Learning. Advanced course


Source: www.habr.com
