Extending Spark with MLflow

Hello, Habr readers. As we have already written, this month OTUS is launching two machine learning courses at once: basic and advanced. On that occasion, we continue to share useful material.

The purpose of this article is to share our first experience with MLflow.

We will start our review of MLflow with its tracking server and log every iteration of the study. Then we will share our experience of connecting Spark with MLflow using UDFs.

Context

At Alpha Health, we use machine learning and artificial intelligence to empower people to take charge of their health and well-being. That is why machine learning models sit at the heart of the data science products we develop, and why we were drawn to MLflow, an open source platform that covers every aspect of the machine learning lifecycle.

MLflow

The main goal of MLflow is to provide an additional layer on top of machine learning that lets data scientists work with almost any machine learning library (h2o, keras, mleap, pytorch, sklearn and tensorflow), taking their work to the next level.

MLflow provides three components:

  • Tracking: recording and querying experiments: code, data, configuration and results. Keeping track of the model-building process is very important.
  • Projects: a packaging format for running on any platform (for example, SageMaker).
  • Models: a common format for sending models to various deployment tools.

MLflow (in alpha at the time of writing) is an open source platform that lets you manage the machine learning lifecycle, including experimentation, reuse and deployment.

Setting up MLflow

To use MLflow, you first need to set up your whole Python environment. For this we will use PyEnv (to install Python on a Mac, see here). This way we can create a virtual environment where we will install all the libraries needed to run it.

```
pyenv install 3.7.0
pyenv global 3.7.0 # Use Python 3.7
mkvirtualenv mlflow # Create a Virtual Env with Python 3.7
workon mlflow
```

Let's install the required libraries.

```
pip install mlflow==0.7.0 \
            Cython==0.29 \
            numpy==1.14.5 \
            pandas==0.23.4 \
            pyarrow==0.11.0
```

Note: We use PyArrow to run models as UDFs. The PyArrow and Numpy versions had to be pinned because the latest versions conflicted with each other.

Launching the Tracking UI

MLflow Tracking lets us log and query experiments using the Python and REST APIs. In addition, you can decide where to store model artifacts (localhost, Amazon S3, Azure Blob Storage, Google Cloud Storage or an SFTP server). Since we use AWS at Alpha Health, our artifact store will be S3.

# Running a Tracking Server
mlflow server \
    --file-store /tmp/mlflow/fileStore \
    --default-artifact-root s3://<bucket>/mlflow/artifacts/ \
    --host localhost \
    --port 5000

MLflow recommends using persistent file storage. The file store is where the server keeps run and experiment metadata. When starting the server, make sure it points to the persistent file store. Here, for the sake of the experiment, we will simply use /tmp.

Keep in mind that if we want to use the mlflow server to run old experiments, they must be present in the file store. However, even without that we could still use them in a UDF, since we only need the path to the model.

Note: Keep in mind that the Tracking UI and the model client must both have access to the artifact location. That is, even if the Tracking UI lives on an EC2 instance, when you run MLflow locally the machine must have direct access to S3 to write artifact models.

Tracking UI storing artifacts in an S3 bucket

Running models

As soon as the Tracking server is running, you can start training models.

As an example, we will use the wine modification of the MLflow example in Sklearn.

MLFLOW_TRACKING_URI=http://localhost:5000 python wine_quality.py \
  --alpha 0.9 \
  --l1_ratio 0.5 \
  --wine_file ./data/winequality-red.csv

As we have already discussed, MLflow lets you log model parameters, metrics and artifacts so that you can track how they evolve across iterations. This feature is very useful because it lets us reproduce the best model by contacting the Tracking server, or work out which code produced a given iteration by using the git commit hashes it logs.

with mlflow.start_run():

    ... model ...

    mlflow.log_param("source", wine_path)
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)

    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    mlflow.set_tag('domain', 'wine')
    mlflow.set_tag('predict', 'quality')
    mlflow.sklearn.log_model(lr, "model")

Wine iterations
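
The three metrics logged above (rmse, r2 and mae) are standard regression scores. As a minimal sketch of how they might be computed before being passed to `mlflow.log_metric`, the helper below uses only the Python standard library. The name `eval_metrics` and the sample values are assumptions made for illustration; they are not taken from the wine_quality.py script.

```python
import math


def eval_metrics(actual, predicted):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    # Root mean squared error and mean absolute error
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # Coefficient of determination: 1 - SS_res / SS_tot
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    ss_res = sum(e * e for e in errors)
    # R^2 is undefined when every actual value is identical
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    return rmse, mae, r2


# Hypothetical quality scores and predictions, for illustration only
rmse, mae, r2 = eval_metrics([5, 6, 7], [5.5, 6.0, 6.5])
```

Each returned value could then be logged inside the `with mlflow.start_run():` block in the same way as in the snippet above.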

Serving part of the model

The MLflow tracking server, launched with the "mlflow server" command, has a REST API for tracking runs and writing data to the local file system. You can specify the tracking server address with the "MLFLOW_TRACKING_URI" environment variable, and the MLflow tracking API will automatically contact the tracking server at that address to create or fetch run information, log metrics, and so on.

Source: Docs // Running a tracking server
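
To make the environment variable lookup concrete, here is a minimal sketch of how a script of our own might read the address. It only illustrates the lookup and is not MLflow's own resolution logic; the function name `resolve_tracking_uri` and the fallback address (the local server started earlier) are assumptions.

```python
import os


def resolve_tracking_uri(environ=None, fallback="http://localhost:5000"):
    """Return the address from MLFLOW_TRACKING_URI, or the fallback if unset."""
    # Accept an explicit mapping so the lookup can be tried without
    # touching the real process environment.
    environ = os.environ if environ is None else environ
    uri = environ.get("MLFLOW_TRACKING_URI", "").strip()
    return uri or fallback


# Hypothetical usage with an explicit environment mapping
remote = resolve_tracking_uri({"MLFLOW_TRACKING_URI": "http://0.0.0.0:5000"})
local = resolve_tracking_uri({})
```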

To serve the model, we need a running tracking server (see the launch interface) and the Run ID of the model.

Run ID

# Serve a sklearn model through 127.0.0.1:5005
MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow sklearn serve \
  --port 5005 \
  --run_id 0f8691808e914d1087cf097a08730f17 \
  --model-path model

To serve models with the MLflow serve functionality, we need access to the Tracking UI so that information about the model can be retrieved simply by specifying --run_id.

Once the model is in contact with the Tracking server, we get a new model endpoint.

# Query Tracking Server Endpoint
curl -X POST \
  http://127.0.0.1:5005/invocations \
  -H 'Content-Type: application/json' \
  -d '[
	{
		"fixed acidity": 3.42, 
		"volatile acidity": 1.66, 
		"citric acid": 0.48, 
		"residual sugar": 4.2, 
		"chlorides": 0.229, 
		"free sulfur dioxide": 19, 
		"total sulfur dioxide": 25, 
		"density": 1.98, 
		"pH": 5.33, 
		"sulphates": 4.39, 
		"alcohol": 10.8
	}
]'

> {"predictions": [5.825055635303461]}
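
The same request can be prepared from Python. The sketch below builds the POST request with the standard library only and leaves the actual network call commented out, so it can be inspected without a running model server. The function name `build_invocation_request` is an assumption; the URL, header and feature values are the ones from the curl call above.

```python
import json
import urllib.request


def build_invocation_request(records, url="http://127.0.0.1:5005/invocations"):
    """Build (but do not send) a POST request carrying the records as JSON."""
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


sample = [{
    "fixed acidity": 3.42,
    "volatile acidity": 1.66,
    "citric acid": 0.48,
    "residual sugar": 4.2,
    "chlorides": 0.229,
    "free sulfur dioxide": 19,
    "total sulfur dioxide": 25,
    "density": 1.98,
    "pH": 5.33,
    "sulphates": 4.39,
    "alcohol": 10.8,
}]
request = build_invocation_request(sample)

# To query a running model server, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```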

Running models from Spark

Although the Tracking server is powerful enough to serve models in real time, train them and use the server functionality (source: mlflow // docs // models # local), using Spark (batch or streaming) is an even more powerful solution because of its distributed nature.

Imagine that you did the training offline and then applied the resulting model to all of your data. This is where Spark and MLflow shine.

Install PySpark + Jupyter + Spark

Source: Get started PySpark - Jupyter

To show how we apply MLflow models to Spark dataframes, we need to set up Jupyter notebooks to work together with PySpark.

Start by installing the latest stable version of Apache Spark:

cd ~/Downloads/
tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
mv ~/Downloads/spark-2.4.3-bin-hadoop2.7 ~/
ln -s ~/spark-2.4.3-bin-hadoop2.7 ~/spark

Install PySpark and Jupyter in the virtual environment:

pip install pyspark jupyter

Set the environment variables:

export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"

By defining notebook-dir, we can keep our notebooks in the folder we want.

Launching Jupyter from PySpark

Since we were able to configure Jupyter as the PySpark driver, we can now run a Jupyter notebook in the context of PySpark.

(mlflow) afranzi:~$ pyspark
[I 19:05:01.572 NotebookApp] sparkmagic extension enabled!
[I 19:05:01.573 NotebookApp] Serving notebooks from local directory: /Users/afranzi/Projects/notebooks
[I 19:05:01.573 NotebookApp] The Jupyter Notebook is running at:
[I 19:05:01.573 NotebookApp] http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
[I 19:05:01.573 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:05:01.574 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745

As mentioned above, MLflow provides a feature for logging model artifacts in S3. As soon as we have the chosen model in our hands, we can import it as a UDF using the mlflow.pyfunc module.

import mlflow.pyfunc

model_path = 's3://<bucket>/mlflow/artifacts/1/0f8691808e914d1087cf097a08730f17/artifacts/model'
wine_path = '/Users/afranzi/Projects/data/winequality-red.csv'
wine_udf = mlflow.pyfunc.spark_udf(spark, model_path)

df = spark.read.format("csv").option("header", "true").option('delimiter', ';').load(wine_path)
columns = [ "fixed acidity", "volatile acidity", "citric acid",
            "residual sugar", "chlorides", "free sulfur dioxide",
            "total sulfur dioxide", "density", "pH",
            "sulphates", "alcohol"
          ]
          
df.withColumn('prediction', wine_udf(*columns)).show(100, False)

PySpark: output of the wine quality predictions
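
The `model_path` used above appears to be made up of the artifact root passed to the server, an experiment ID, the Run ID and the artifact name given to `log_model`. The helper below sketches that composition. The layout is inferred only from the paths shown in this article and may differ between MLflow versions, and the name `model_artifact_uri` is hypothetical.

```python
def model_artifact_uri(artifact_root, experiment_id, run_id, artifact_path="model"):
    """Join the parts as <root>/<experiment_id>/<run_id>/artifacts/<artifact_path>."""
    # Drop a trailing slash so the joined URI has no doubled separator.
    root = artifact_root.rstrip("/")
    return f"{root}/{experiment_id}/{run_id}/artifacts/{artifact_path}"


# Values copied from the commands shown earlier in the article
uri = model_artifact_uri(
    "s3://<bucket>/mlflow/artifacts/",
    1,
    "0f8691808e914d1087cf097a08730f17",
)
```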

So far we have talked about how to use PySpark with MLflow, running wine quality predictions over the whole wine dataset. But what if you need to use the Python MLflow modules from Scala Spark?

We tested this as well, by sharing the Spark context between Scala and Python. That is, we registered the MLflow UDF in Python and used it from Scala (yes, perhaps not the best solution, but it is what we have).

Scala Spark + MLflow

For this example we will add the Toree Kernel to the existing Jupyter.

Install Spark + Toree + Jupyter

```
pip install toree
jupyter toree install --spark_home=${SPARK_HOME} --sys-prefix
jupyter kernelspec list
```
```
Available kernels:
  apache_toree_scala    /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/apache_toree_scala
  python3               /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/python3
```

As you can see from the attached notebook, the UDF is shared between Spark and PySpark. We hope this part will be useful for those who love Scala and want to deploy machine learning models in production.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{Column, DataFrame}
import scala.util.matching.Regex

val FirstAtRe: Regex = "^_".r
val AliasRe: Regex = "[\\s_.:@]+".r

def getFieldAlias(field_name: String): String = {
    FirstAtRe.replaceAllIn(AliasRe.replaceAllIn(field_name, "_"), "")
}

def selectFieldsNormalized(columns: List[String])(df: DataFrame): DataFrame = {
    val fieldsToSelect: List[Column] = columns.map(field =>
        col(field).as(getFieldAlias(field))
    )
    df.select(fieldsToSelect: _*)
}

def normalizeSchema(df: DataFrame): DataFrame = {
    val schema = df.columns.toList
    df.transform(selectFieldsNormalized(schema))
}

FirstAtRe = ^_
AliasRe = [\s_.:@]+

getFieldAlias: (field_name: String)String
selectFieldsNormalized: (columns: List[String])(df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
normalizeSchema: (df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
Out[1]:
[\s_.:@]+
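
The Scala helpers above rename the columns so that they can be used as plain identifiers in SQL: runs of whitespace, "_", ".", ":" and "@" collapse into a single underscore, and a leading underscore is dropped. As a rough Python equivalent of the same two regular expressions (a sketch for illustration; the Python names are our own and not part of the notebook):

```python
import re

# Same patterns as AliasRe and FirstAtRe in the Scala cell
ALIAS_RE = re.compile(r"[\s_.:@]+")
FIRST_AT_RE = re.compile(r"^_")


def get_field_alias(field_name):
    """Collapse separator runs into "_" and strip one leading underscore."""
    return FIRST_AT_RE.sub("", ALIAS_RE.sub("_", field_name))


def normalized_columns(columns):
    """Return the SQL-friendly alias for every column name."""
    return [get_field_alias(name) for name in columns]


aliases = normalized_columns(["fixed acidity", "free sulfur dioxide", "pH"])
```

With these aliases, "fixed acidity" becomes fixed_acidity, which matches the column names used in the SQL query further down.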
In [2]:
val winePath = "~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv"
val modelPath = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"

winePath = ~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv
modelPath = /tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
Out[2]:
/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
In [3]:
val df = spark.read
              .format("csv")
              .option("header", "true")
              .option("delimiter", ";")
              .load(winePath)
              .transform(normalizeSchema)

df = [fixed_acidity: string, volatile_acidity: string ... 10 more fields]
Out[3]:
[fixed_acidity: string, volatile_acidity: string ... 10 more fields]
In [4]:
%%PySpark
import mlflow
from mlflow import pyfunc

model_path = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"
wine_quality_udf = mlflow.pyfunc.spark_udf(spark, model_path)

spark.udf.register("wineQuality", wine_quality_udf)
Out[4]:
<function spark_udf.<locals>.predict at 0x1116a98c8>
In [6]:
df.createOrReplaceTempView("wines")
In [10]:
%%SQL
SELECT 
    quality,
    wineQuality(
        fixed_acidity,
        volatile_acidity,
        citric_acid,
        residual_sugar,
        chlorides,
        free_sulfur_dioxide,
        total_sulfur_dioxide,
        density,
        pH,
        sulphates,
        alcohol
    ) AS prediction
FROM wines
LIMIT 10
Out[10]:
+-------+------------------+
|quality|        prediction|
+-------+------------------+
|      5| 5.576883967129615|
|      5|  5.50664776916154|
|      5| 5.525504822954496|
|      6| 5.504311247097457|
|      5| 5.576883967129615|
|      5|5.5556903912725755|
|      5| 5.467882654744997|
|      7| 5.710602976324739|
|      7| 5.657319539336507|
|      5| 5.345098606538708|
+-------+------------------+

In [17]:
spark.catalog.listFunctions.filter('name like "%wineQuality%").show(20, false)

+-----------+--------+-----------+---------+-----------+
|name       |database|description|className|isTemporary|
+-----------+--------+-----------+---------+-----------+
|wineQuality|null    |null       |null     |true       |
+-----------+--------+-----------+---------+-----------+

Next steps

Even though MLflow is in Alpha at the time of writing, it looks very promising. Just being able to run multiple machine learning frameworks and consume them from a single endpoint takes recommender systems to the next level.

In addition, MLflow brings Data Engineers and Data Science specialists closer together by laying a common layer between them.

After this exploration of MLflow, we are confident that we will go further and use it for our Spark pipelines and recommender systems.

It would be great to back the file store with a database instead of the file system. This should give us multiple endpoints that can use the same file store. For example, multiple instances of Presto and Athena using the same Glue metastore.

To sum up, I would like to thank the MLflow community for making our work with data more interesting.

If you are playing around with MLflow, do not hesitate to write to us and tell us how you use it, and even more so if you use it in production.

Find out more about the courses:
Machine Learning. Basic course
Machine Learning. Advanced course

Source: www.habr.com