Extending Spark with MLflow

Hello, Habr readers. As we have already written, this month OTUS is launching two machine learning courses at once, a basic one and an advanced one. With that in mind, we continue to share useful material.

The purpose of this article is to describe our first experience of using MLflow.

We will start the review of MLflow with its tracking server and log every iteration of the study. Then we will share our experience of connecting Spark to MLflow through a UDF.

Context

At Alpha Health we use machine learning and artificial intelligence to empower people to take charge of their health and well-being. That is why machine learning models are at the heart of the data science products we develop, and why we were drawn to MLflow, an open source platform that covers every aspect of the machine learning lifecycle.

MLflow

The main goal of MLflow is to provide an additional layer on top of machine learning that lets data scientists work with almost any machine learning library (h2o, keras, mleap, pytorch, sklearn and tensorflow), taking their work to the next level.

MLflow provides three components:

  • Tracking - recording and querying experiments: code, data, configuration and results. Keeping track of the model-building process is very important.
  • Projects - a packaging format that runs on any platform (e.g. SageMaker).
  • Models - a common format for sending models to a variety of deployment tools.

MLflow (in alpha at the time of writing) is an open source platform that lets you manage the machine learning lifecycle, including experimentation, reproducibility and deployment.

Setting up MLflow

To use MLflow you first need to set up your Python environment. For this we will use PyEnv (to install Python on a Mac, see here). This gives us a virtual environment in which we can install all the libraries needed to run it.

```
pyenv install 3.7.0
pyenv global 3.7.0 # Use Python 3.7
mkvirtualenv mlflow # Create a Virtual Env with Python 3.7
workon mlflow
```

Let's install the required libraries.

```
pip install mlflow==0.7.0 \
            Cython==0.29 \
            numpy==1.14.5 \
            pandas==0.23.4 \
            pyarrow==0.11.0
```

Note: We use PyArrow to run models as UDFs. The PyArrow and Numpy versions had to be pinned because the latest versions conflicted with each other.
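To make the pinning concrete, here is a minimal sketch (not part of the original setup) that compares a mapping of installed versions against the pins listed above. The helper name `find_mismatches` and the idea of passing the installed versions in as a dictionary, for example one parsed from `pip freeze`, are assumptions made for illustration.

```python
# Version pins copied from the pip command above.
PINNED = {
    "mlflow": "0.7.0",
    "Cython": "0.29",
    "numpy": "1.14.5",
    "pandas": "0.23.4",
    "pyarrow": "0.11.0",
}


def find_mismatches(installed, pinned=PINNED):
    """Return {package: (expected, found)} for every pin that is not met.

    `installed` maps package names to version strings; a missing package
    is reported with `found` set to None.
    """
    mismatches = {}
    for name, expected in pinned.items():
        found = installed.get(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


# Hypothetical environment in which numpy has drifted and pyarrow is absent.
example_env = {
    "mlflow": "0.7.0",
    "Cython": "0.29",
    "numpy": "1.16.0",
    "pandas": "0.23.4",
}
print(find_mismatches(example_env))
```

A check like this can run at the start of a notebook so that a version drift shows up before the UDF fails with a less obvious error.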

Launching the Tracking UI

MLflow Tracking lets us log and query experiments using Python and the REST API. In addition, you can choose where model artifacts are stored (localhost, Amazon S3, Azure Blob Storage, Google Cloud Storage or an SFTP server). Since we use AWS at Alpha Health, our artifact store will be S3.

# Running a Tracking Server
mlflow server \
    --file-store /tmp/mlflow/fileStore \
    --default-artifact-root s3://<bucket>/mlflow/artifacts/ \
    --host localhost \
    --port 5000

MLflow recommends using a persistent file store. The file store is where the server keeps run and experiment metadata. When starting the server, make sure it points to a persistent file store. Here, for the sake of the experiment, we simply use /tmp.

Keep in mind that if we want to use the mlflow server to run old experiments, they must be present in the file store. Even without that, however, we could still use them in a UDF, since all we need is the path to the model.

Note: Keep in mind that both the Tracking UI and the model client must have access to the artifact location. That is, even though the Tracking UI lives on an EC2 instance, when MLflow runs locally the machine must have direct access to S3 in order to write the model artifacts.

The Tracking UI stores artifacts in an S3 bucket

Running models

As soon as the tracking server is running, you can start training models.

As an example, we will use a modification of the wine example from the MLflow Sklearn examples.

MLFLOW_TRACKING_URI=http://localhost:5000 python wine_quality.py \
  --alpha 0.9 \
  --l1_ratio 0.5 \
  --wine_file ./data/winequality-red.csv

As already discussed, MLflow lets you log model parameters, metrics and artifacts so that you can track how they evolve from iteration to iteration. This is very useful because it lets us reproduce the best model, either by querying the Tracking server or by using the logged git commit hash to find out which code produced a given iteration.

import mlflow
import mlflow.sklearn

with mlflow.start_run():

    # ... model ...

    mlflow.log_param("source", wine_path)
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)

    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    mlflow.set_tag('domain', 'wine')
    mlflow.set_tag('predict', 'quality')
    mlflow.sklearn.log_model(lr, "model")

Wine iterations
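The snippet above logs `rmse`, `r2` and `mae` without showing how they are obtained. As a minimal sketch, assuming the conventional definitions of these metrics, they can be computed in plain Python as follows. The function name `regression_metrics` is hypothetical, and the sample values are the first five rows of the prediction table shown in the Scala notebook later in this article.

```python
import math


def regression_metrics(actual, predicted):
    """Return (rmse, mae, r2) for two equally long sequences of numbers.

    Assumes the actual values are not all identical; otherwise r2 is
    undefined because the total sum of squares is zero.
    """
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)                   # root mean squared error
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot                     # coefficient of determination
    return rmse, mae, r2


quality = [5, 5, 5, 6, 5]
prediction = [
    5.576883967129615,
    5.50664776916154,
    5.525504822954496,
    5.504311247097457,
    5.576883967129615,
]
rmse, mae, r2 = regression_metrics(quality, prediction)
print("rmse=%.4f mae=%.4f r2=%.4f" % (rmse, mae, r2))
```

In the actual example the values would more likely come from a library such as sklearn; the point here is only to show what the three logged numbers mean.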

Serving the model

The MLflow tracking server, launched with the “mlflow server” command, has a REST API for tracking runs and writing data to the local file system. You can specify the tracking server address with the “MLFLOW_TRACKING_URI” environment variable, and the MLflow Tracking API will automatically contact the tracking server at that address to create/get run information, log metrics, etc.

Source: Docs // Running a tracking server

To serve a model we need a running tracking server (see the Tracking UI launch above) and the Run ID of the model.

Run ID

# Serve a sklearn model through 127.0.0.1:5005
MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow sklearn serve \
  --port 5005 \
  --run_id 0f8691808e914d1087cf097a08730f17 \
  --model-path model

To serve models with the MLflow serve functionality, we need access to the Tracking UI so that information about the model can be retrieved simply by specifying --run_id.

Once the model has contacted the tracking server, we get a new model endpoint.

# Query the served model endpoint
curl -X POST \
  http://127.0.0.1:5005/invocations \
  -H 'Content-Type: application/json' \
  -d '[
	{
		"fixed acidity": 3.42,
		"volatile acidity": 1.66,
		"citric acid": 0.48,
		"residual sugar": 4.2,
		"chlorides": 0.229,
		"free sulfur dioxide": 19,
		"total sulfur dioxide": 25,
		"density": 1.98,
		"pH": 5.33,
		"sulphates": 4.39,
		"alcohol": 10.8
	}
]'

> {"predictions": [5.825055635303461]}
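The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the model server from the previous step is listening on 127.0.0.1:5005 and accepts a JSON list of records, as in the curl call above. The helper names `build_request` and `predict` are hypothetical.

```python
import json
import urllib.request


def build_request(url, rows):
    """Build a POST request that carries the rows as a JSON list of records."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(url, rows):
    """Send the rows to the model endpoint and return the decoded JSON reply."""
    with urllib.request.urlopen(build_request(url, rows)) as response:
        return json.loads(response.read().decode("utf-8"))


# Same feature values as in the curl example above.
wine = {
    "fixed acidity": 3.42,
    "volatile acidity": 1.66,
    "citric acid": 0.48,
    "residual sugar": 4.2,
    "chlorides": 0.229,
    "free sulfur dioxide": 19,
    "total sulfur dioxide": 25,
    "density": 1.98,
    "pH": 5.33,
    "sulphates": 4.39,
    "alcohol": 10.8,
}
request = build_request("http://127.0.0.1:5005/invocations", [wine])
# predict("http://127.0.0.1:5005/invocations", [wine])  # needs the server running
```

Building the request separately from sending it keeps the payload easy to inspect without a running server.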

Running models from Spark

Although the tracking server is powerful enough to serve models in real time, train them and use the serve functionality (source: mlflow // docs // models # local), using Spark (batch or streaming) is an even more powerful solution because of its distributed nature.

Imagine that you simply did the training offline and then applied the resulting model to all of your data. This is where Spark and MLflow shine.

Install PySpark + Jupyter + Spark

Source: Get started PySpark - Jupyter

To show how we apply MLflow models to Spark dataframes, we need to set up Jupyter notebooks to work together with PySpark.

Start by installing the latest stable version of Apache Spark:

cd ~/Downloads/
tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
mv ~/Downloads/spark-2.4.3-bin-hadoop2.7 ~/
ln -s ~/spark-2.4.3-bin-hadoop2.7 ~/spark

Install PySpark and Jupyter in the virtual environment:

pip install pyspark jupyter

Set the environment variables:

export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"

Having defined notebook-dir, we can keep our notebooks in the folder we want.

Launching Jupyter from PySpark

Since we have configured Jupyter as the PySpark driver, we can now run Jupyter notebooks in the PySpark context.

(mlflow) afranzi:~$ pyspark
[I 19:05:01.572 NotebookApp] sparkmagic extension enabled!
[I 19:05:01.573 NotebookApp] Serving notebooks from local directory: /Users/afranzi/Projects/notebooks
[I 19:05:01.573 NotebookApp] The Jupyter Notebook is running at:
[I 19:05:01.573 NotebookApp] http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
[I 19:05:01.573 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:05:01.574 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745

As mentioned above, MLflow can log model artifacts to S3. As soon as we have the selected model in our hands, we can import it as a UDF using the mlflow.pyfunc module.

import mlflow.pyfunc

model_path = 's3://<bucket>/mlflow/artifacts/1/0f8691808e914d1087cf097a08730f17/artifacts/model'
wine_path = '/Users/afranzi/Projects/data/winequality-red.csv'
wine_udf = mlflow.pyfunc.spark_udf(spark, model_path)

df = spark.read.format("csv").option("header", "true").option('delimiter', ';').load(wine_path)
columns = [ "fixed acidity", "volatile acidity", "citric acid",
            "residual sugar", "chlorides", "free sulfur dioxide",
            "total sulfur dioxide", "density", "pH",
            "sulphates", "alcohol"
          ]
          
df.withColumn('prediction', wine_udf(*columns)).show(100, False)

PySpark - Output of the wine quality predictions
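The `model_path` used above follows a visible pattern: the server's `--default-artifact-root`, then the experiment ID, the run ID, `artifacts`, and the name passed to `log_model`. The sketch below rebuilds the path from those parts. The layout is inferred only from the two paths that appear in this article, not from a documented guarantee, and the helper name `model_artifact_uri` is hypothetical.

```python
def model_artifact_uri(artifact_root, experiment_id, run_id, artifact_path="model"):
    """Join the pieces of a model location in the layout seen in this article.

    Assumed layout: <artifact_root>/<experiment_id>/<run_id>/artifacts/<artifact_path>
    """
    root = artifact_root.rstrip("/")
    return "/".join([root, str(experiment_id), run_id, "artifacts", artifact_path])


# Values taken from the tracking server command and the PySpark snippet above.
model_path = model_artifact_uri(
    "s3://<bucket>/mlflow/artifacts/",
    1,
    "0f8691808e914d1087cf097a08730f17",
)
print(model_path)
```

If the layout differs in another MLflow version, the artifact location shown for the run in the Tracking UI remains the reliable source.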

So far we have talked about how to use PySpark with MLflow, running wine quality predictions over the whole wine dataset. But what if you need to use the Python MLflow modules from Scala Spark?

We tested this too, by sharing the Spark context between Scala and Python. That is, we registered the MLflow UDF in Python and used it from Scala (yes, perhaps not the best solution, but it is what we have).

Scala Spark + MLflow

For this example we will add the Toree Kernel to the existing Jupyter.

Install Spark + Toree + Jupyter

```
pip install toree
jupyter toree install --spark_home=${SPARK_HOME} --sys-prefix
jupyter kernelspec list
```
```
Available kernels:
  apache_toree_scala    /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/apache_toree_scala
  python3               /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/python3
```

As you can see from the attached notebook, the UDF is shared between Spark and PySpark. We hope this part will be useful to those who love Scala and want to deploy machine learning models to production.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{Column, DataFrame}
import scala.util.matching.Regex

val FirstAtRe: Regex = "^_".r
val AliasRe: Regex = "[\\s_.:@]+".r

def getFieldAlias(field_name: String): String = {
    FirstAtRe.replaceAllIn(AliasRe.replaceAllIn(field_name, "_"), "")
}

def selectFieldsNormalized(columns: List[String])(df: DataFrame): DataFrame = {
    val fieldsToSelect: List[Column] = columns.map(field =>
        col(field).as(getFieldAlias(field))
    )
    df.select(fieldsToSelect: _*)
}

def normalizeSchema(df: DataFrame): DataFrame = {
    val schema = df.columns.toList
    df.transform(selectFieldsNormalized(schema))
}

FirstAtRe = ^_
AliasRe = [\s_.:@]+

getFieldAlias: (field_name: String)String
selectFieldsNormalized: (columns: List[String])(df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
normalizeSchema: (df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
Out[1]:
[\s_.:@]+
In [2]:
val winePath = "~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv"
val modelPath = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"

winePath = ~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv
modelPath = /tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
Out[2]:
/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
In [3]:
val df = spark.read
              .format("csv")
              .option("header", "true")
              .option("delimiter", ";")
              .load(winePath)
              .transform(normalizeSchema)

df = [fixed_acidity: string, volatile_acidity: string ... 10 more fields]
Out[3]:
[fixed_acidity: string, volatile_acidity: string ... 10 more fields]
In [4]:
%%PySpark
import mlflow
from mlflow import pyfunc

model_path = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"
wine_quality_udf = mlflow.pyfunc.spark_udf(spark, model_path)

spark.udf.register("wineQuality", wine_quality_udf)
Out[4]:
<function spark_udf.<locals>.predict at 0x1116a98c8>
In [6]:
df.createOrReplaceTempView("wines")
In [10]:
%%SQL
SELECT 
    quality,
    wineQuality(
        fixed_acidity,
        volatile_acidity,
        citric_acid,
        residual_sugar,
        chlorides,
        free_sulfur_dioxide,
        total_sulfur_dioxide,
        density,
        pH,
        sulphates,
        alcohol
    ) AS prediction
FROM wines
LIMIT 10
Out[10]:
+-------+------------------+
|quality|        prediction|
+-------+------------------+
|      5| 5.576883967129615|
|      5|  5.50664776916154|
|      5| 5.525504822954496|
|      6| 5.504311247097457|
|      5| 5.576883967129615|
|      5|5.5556903912725755|
|      5| 5.467882654744997|
|      7| 5.710602976324739|
|      7| 5.657319539336507|
|      5| 5.345098606538708|
+-------+------------------+

In [17]:
spark.catalog.listFunctions.filter('name like "%wineQuality%").show(20, false)

+-----------+--------+-----------+---------+-----------+
|name       |database|description|className|isTemporary|
+-----------+--------+-----------+---------+-----------+
|wineQuality|null    |null       |null     |true       |
+-----------+--------+-----------+---------+-----------+
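The notebook normalizes column names before registering the view, so that names such as "fixed acidity" can be written as `fixed_acidity` in the SQL query. As a rough Python equivalent of the Scala `getFieldAlias` helper (a sketch mirroring the two regular expressions in the notebook; the Python function names are our own), the renaming can be expressed as:

```python
import re

# Runs of whitespace, underscores, dots, colons or at-signs become one underscore.
ALIAS_RE = re.compile(r"[\s_.:@]+")
# A single leading underscore is dropped afterwards.
FIRST_UNDERSCORE_RE = re.compile(r"^_")


def get_field_alias(field_name):
    """Return a SQL-friendly alias for a raw column name."""
    return FIRST_UNDERSCORE_RE.sub("", ALIAS_RE.sub("_", field_name))


def normalized_columns(columns):
    """Map every raw column name to its normalized alias."""
    return {name: get_field_alias(name) for name in columns}


print(normalized_columns(["fixed acidity", "free sulfur dioxide", "pH"]))
```

In PySpark the resulting mapping could then be applied with a select that aliases each column, in the same way `selectFieldsNormalized` does in Scala.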

Next steps

Even though MLflow is in alpha at the time of writing, it looks quite promising. The mere ability to run several machine learning frameworks and consume them from a single endpoint takes recommender systems to the next level.

In addition, MLflow brings Data Engineers and Data Scientists closer together by laying a common layer between them.

After this exploration of MLflow, we are confident that we will move forward and use it for our Spark pipelines and recommender systems.

It would be nice to back the file store with a database instead of the file system. That should give us multiple endpoints that can share the same file store, much like using several Presto and Athena instances with the same Glue metastore.

To sum up, I would like to thank the MLflow community for making our work with data more interesting.

If you are playing around with MLflow, feel free to write to us and tell us how you use it, and even more so if you use it in production.

Find out more about the courses:
Machine Learning. Basic course
Machine Learning. Advanced course

Source: www.habr.com
