Extending Spark with MLflow

Hello, Habr readers. As we wrote earlier, this month OTUS is launching two machine learning courses at once: a basic one and an advanced one. With that in mind, we continue to share useful material.

The purpose of this article is to share our first experience of using MLflow.

We will start the review of MLflow with its tracking server and log every iteration of an experiment. Then we will share our experience of connecting Spark with MLflow using a UDF.

Context

At Alpha Health we use machine learning and artificial intelligence to empower people to take charge of their health and well-being. That is why machine learning models are at the heart of the data science products we develop, and that is why we were drawn to MLflow, an open source platform that covers all aspects of the machine learning lifecycle.

MLflow

The main goal of MLflow is to provide an additional layer on top of machine learning that lets data scientists work with almost any machine learning library (h2o, keras, mleap, pytorch, sklearn and tensorflow), taking their work to the next level.

MLflow provides three components:

  • Tracking - recording and querying experiments: code, data, configuration and results. Keeping track of the model-building process is very important.
  • Projects - a packaging format for running on any platform (for example, SageMaker).
  • Models - a common format for sending models to a variety of deployment tools.

MLflow (in alpha at the time of writing) is an open source platform that lets you manage the machine learning lifecycle, including experimentation, reproducibility, and deployment.

Setting up MLflow

To use MLflow you first need to set up your Python environment. For this we will use PyEnv (to install Python on a Mac, check here). This way we can create a virtual environment in which to install all the libraries we need.

```
pyenv install 3.7.0
pyenv global 3.7.0 # Use Python 3.7
# mkvirtualenv and workon are provided by virtualenvwrapper
mkvirtualenv mlflow # Create a Virtual Env with Python 3.7
workon mlflow # Activate the environment
```

Let's install the required libraries.

```
pip install mlflow==0.7.0 \
            Cython==0.29 \
            numpy==1.14.5 \
            pandas==0.23.4 \
            pyarrow==0.11.0
```

Note: we use PyArrow to run models as UDFs. The PyArrow and Numpy versions had to be pinned because the latest versions conflicted with each other.
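Since the setup depends on exact versions, it can be handy to verify the environment before going further. The following is only a sketch and not part of the original setup: `PINS` repeats the versions from the `pip install` command above, and `check_pins` is a hypothetical helper name. It uses `importlib.metadata` from the standard library (Python 3.8+); on the Python 3.7 environment created above it assumes the `importlib_metadata` backport is installed.

```python
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:
    # Assumption: on Python 3.7 the importlib_metadata backport is available
    from importlib_metadata import PackageNotFoundError, version

# Versions copied from the pip install command in this article
PINS = {
    "mlflow": "0.7.0",
    "Cython": "0.29",
    "numpy": "1.14.5",
    "pandas": "0.23.4",
    "pyarrow": "0.11.0",
}


def check_pins(pins):
    """Return {package: installed version or None} for every pin that does not match."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    for name, installed in check_pins(PINS).items():
        print(f"{name}: expected {PINS[name]}, found {installed}")
```

An empty result means every pinned package is installed at the expected version.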

Launching the Tracking UI

MLflow Tracking lets us log and query experiments using Python and the REST API. In addition, you can define where model artifacts are stored (localhost, Amazon S3, Azure Blob Storage, Google Cloud Storage or an SFTP server). Since we use AWS at Alpha Health, our artifact storage will be S3.

# Running a Tracking Server
mlflow server \
    --file-store /tmp/mlflow/fileStore \
    --default-artifact-root s3://<bucket>/mlflow/artifacts/ \
    --host localhost \
    --port 5000

MLflow recommends using a persistent file store. The file store is where the server keeps run and experiment metadata. When starting the server, make sure it points to a persistent file store. Here, for the sake of the experiment, we will simply use /tmp.

Keep in mind that if we want to use the mlflow server to serve runs from old experiments, they must be present in the file store. However, even without this we can use them in a UDF, since we only need the path to the model.

Note: keep in mind that the Tracking UI and the model client must both have access to the artifact location. That is, even though the Tracking UI lives on an EC2 instance, when MLflow runs locally the machine must have direct access to S3 to write the model artifacts.

Tracking UI storing artifacts in an S3 bucket

Running models

Once the Tracking server is running, you can start training models.

As an example, we will use a modified version of the wine example from the MLflow Sklearn examples.

MLFLOW_TRACKING_URI=http://localhost:5000 python wine_quality.py \
  --alpha 0.9 \
  --l1_ration 0.5 \
  --wine_file ./data/winequality-red.csv

As already discussed, MLflow lets you log model parameters, metrics, and artifacts so that you can track how they evolve over iterations. This is very useful because we can reproduce the best model by contacting the Tracking server, or work out which code produced a given iteration from the git commit hashes recorded with each run.

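import mlflow
import mlflow.sklearn

with mlflow.start_run():

    # ... train the model (lr) and compute rmse, r2 and mae here ...

    # Parameters that define this run
    mlflow.log_param("source", wine_path)
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)

    # Evaluation metrics of the trained model
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    # Tags to make runs easier to find, then the model artifact itself
    mlflow.set_tag('domain', 'wine')
    mlflow.set_tag('predict', 'quality')
    mlflow.sklearn.log_model(lr, "model")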
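The `rmse`, `r2` and `mae` values logged above are ordinary regression metrics. Below is a minimal, dependency-free sketch of how they could be computed, assuming the standard definitions. `eval_metrics` is a name chosen here for illustration, and in practice a library such as scikit-learn would normally do this work.

```python
import math


def eval_metrics(actual, predicted):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers.

    Assumes `actual` is not constant, otherwise r2 is undefined
    (the total sum of squares would be zero).
    """
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    # Root mean squared error and mean absolute error
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n

    # Coefficient of determination: 1 - SS_res / SS_tot
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot

    return rmse, mae, r2
```

The three returned values are what would be passed to `mlflow.log_metric` inside the run.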

Wine iterations

Serving the model

The MLflow tracking server, launched with the "mlflow server" command, has a REST API for tracking runs and writing data to the local file system. You can specify the tracking server address with the "MLFLOW_TRACKING_URI" environment variable, and the MLflow tracking API will automatically contact the tracking server at that address to create or fetch run information, log metrics, and so on.

Source: Docs // Running a tracking server
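As a small illustration of the passage above, the same variable can also be set from Python before any tracking call is made. This is only a sketch: `configure_tracking` is a hypothetical helper, and the default host and port are the ones used for the local server in this article.

```python
import os


def configure_tracking(host="localhost", port=5000):
    """Point MLflow at a tracking server by setting MLFLOW_TRACKING_URI.

    Hypothetical helper: it only sets the environment variable, so it
    should be called before the first MLflow tracking call is made.
    """
    uri = f"http://{host}:{port}"
    os.environ["MLFLOW_TRACKING_URI"] = uri
    return uri


# Same address as the server started earlier in this article
tracking_uri = configure_tracking()
```

Setting the variable in the shell, as in the commands in this article, has the same effect for processes started from that shell.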

To serve a model we need a running tracking server (see the section on launching the Tracking UI) and the Run ID of the model.

Run ID

# Serve a sklearn model through 127.0.0.1:5005
MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow sklearn serve \
  --port 5005 \
  --run_id 0f8691808e914d1087cf097a08730f17 \
  --model-path model

To serve models with MLflow's serve functionality, we need access to the Tracking UI so that the information about the model can be retrieved simply by specifying --run_id.

Once the model has been fetched through the Tracking server, we get a new model endpoint.

# Query the served model endpoint
curl -X POST \
  http://127.0.0.1:5005/invocations \
  -H 'Content-Type: application/json' \
  -d '[
    {
        "fixed acidity": 3.42,
        "volatile acidity": 1.66,
        "citric acid": 0.48,
        "residual sugar": 4.2,
        "chlorides": 0.229,
        "free sulfur dioxide": 19,
        "total sulfur dioxide": 25,
        "density": 1.98,
        "pH": 5.33,
        "sulphates": 4.39,
        "alcohol": 10.8
    }
]'

> {"predictions": [5.825055635303461]}
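The same request can be issued from Python using only the standard library. This is a sketch under the assumptions of this article: the endpoint URL and the payload shape (a JSON list of records) are copied from the curl call above, while `build_invocation_request` and `predict` are hypothetical helper names. Other MLflow versions may expect a different payload format.

```python
import json
import urllib.request

# Endpoint of the model served in this article
INVOCATIONS_URL = "http://127.0.0.1:5005/invocations"


def build_invocation_request(rows, url=INVOCATIONS_URL):
    """Build a POST request whose body is a JSON list of feature records."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(rows, url=INVOCATIONS_URL):
    """Send the records to the served model and return the decoded JSON reply."""
    with urllib.request.urlopen(build_invocation_request(rows, url)) as response:
        return json.loads(response.read().decode("utf-8"))


# One record with the same feature values as the curl example
sample = [{
    "fixed acidity": 3.42,
    "volatile acidity": 1.66,
    "citric acid": 0.48,
    "residual sugar": 4.2,
    "chlorides": 0.229,
    "free sulfur dioxide": 19,
    "total sulfur dioxide": 25,
    "density": 1.98,
    "pH": 5.33,
    "sulphates": 4.39,
    "alcohol": 10.8,
}]
```

With the model server running, `predict(sample)` would return the server's reply as a Python object.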

Running models from Spark

Although the Tracking server is powerful enough to serve models in real time, that is, to train them and use the serve functionality (source: mlflow // docs // models # local), using Spark (batch or streaming) is an even more powerful solution because it is distributed.

Imagine that you simply train offline and then apply the resulting model to all of your data. This is where Spark and MLflow shine.

Install PySpark + Jupyter + Spark

Source: Get started PySpark - Jupyter

To show how we apply MLflow models to Spark dataframes, we need to set up Jupyter notebooks to work together with PySpark.

Start by installing the latest stable version of Apache Spark:

cd ~/Downloads/
tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
mv ~/Downloads/spark-2.4.3-bin-hadoop2.7 ~/
ln -s ~/spark-2.4.3-bin-hadoop2.7 ~/spark

Install PySpark and Jupyter in the virtual environment:

pip install pyspark jupyter

Set the environment variables:

export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"

By defining notebook-dir, we can store our notebooks in the folder of our choice.

Launching Jupyter from PySpark

Since we configured Jupyter as the PySpark driver, we can now run Jupyter notebooks in the context of PySpark.

(mlflow) afranzi:~$ pyspark
[I 19:05:01.572 NotebookApp] sparkmagic extension enabled!
[I 19:05:01.573 NotebookApp] Serving notebooks from local directory: /Users/afranzi/Projects/notebooks
[I 19:05:01.573 NotebookApp] The Jupyter Notebook is running at:
[I 19:05:01.573 NotebookApp] http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
[I 19:05:01.573 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:05:01.574 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745


As mentioned above, MLflow can log model artifacts to S3. Once we have the selected model in hand, we can import it as a UDF using the mlflow.pyfunc module.

import mlflow.pyfunc

# `spark` is the SparkSession provided by the PySpark notebook
model_path = 's3://<bucket>/mlflow/artifacts/1/0f8691808e914d1087cf097a08730f17/artifacts/model'
wine_path = '/Users/afranzi/Projects/data/winequality-red.csv'
# Wrap the logged model as a Spark UDF
wine_udf = mlflow.pyfunc.spark_udf(spark, model_path)

# The wine CSV has a header row and uses ';' as the delimiter
df = spark.read.format("csv").option("header", "true").option('delimiter', ';').load(wine_path)
columns = [ "fixed acidity", "volatile acidity", "citric acid",
            "residual sugar", "chlorides", "free sulfur dioxide",
            "total sulfur dioxide", "density", "pH",
            "sulphates", "alcohol"
          ]

# Add a prediction column computed by the model
df.withColumn('prediction', wine_udf(*columns)).show(100, False)
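The model path passed to `spark_udf` follows a visible pattern: artifact root, experiment ID, run ID, then `artifacts/<name>`. The sketch below rebuilds it from those parts. The layout is inferred only from the paths shown in this article, so it is an assumption that may not hold for other MLflow versions or configurations, and `model_artifact_uri` is a hypothetical helper name.

```python
def model_artifact_uri(artifact_root, experiment_id, run_id, artifact_path="model"):
    """Compose the location of a logged model from its parts.

    Assumed layout (taken from the paths in this article):
    <artifact_root>/<experiment_id>/<run_id>/artifacts/<artifact_path>
    """
    root = artifact_root.rstrip("/")  # tolerate a trailing slash on the root
    return f"{root}/{experiment_id}/{run_id}/artifacts/{artifact_path}"


# Rebuilds the S3 path used above from the --default-artifact-root value
model_path = model_artifact_uri(
    "s3://<bucket>/mlflow/artifacts/",
    1,
    "0f8691808e914d1087cf097a08730f17",
)
```

Only the run ID changes between iterations, so this makes it easy to point the UDF at a different run of the same experiment.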

PySpark - wine quality predictions

Up to this point we have talked about how to use PySpark with MLflow, running wine quality predictions over the entire wine dataset. But what if you need to use the Python MLflow modules from Scala Spark?

We tested this too, by sharing the Spark context between Scala and Python. That is, we registered the MLflow UDF in Python and used it from Scala (yes, perhaps not the best solution, but it is what we have).

Scala Spark + MLflow

For this example we will add the Toree kernel to the existing Jupyter installation.

Install Spark + Toree + Jupyter

```
pip install toree
jupyter toree install --spark_home=${SPARK_HOME} --sys-prefix
jupyter kernelspec list
```
```
Available kernels:
  apache_toree_scala    /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/apache_toree_scala
  python3               /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/python3
```

As you can see from the attached notebook, the UDF is shared between Spark and PySpark. We hope this part will be useful for those who love Scala and want to deploy machine learning models to production.

In [1]:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{Column, DataFrame}
import scala.util.matching.Regex

// Matches a leading underscore left over after normalization
val FirstAtRe: Regex = "^_".r
// Matches runs of whitespace and of the characters _ . : @
val AliasRe: Regex = "[\\s_.:@]+".r

// Turns a raw column name such as "fixed acidity" into "fixed_acidity"
def getFieldAlias(field_name: String): String = {
    FirstAtRe.replaceAllIn(AliasRe.replaceAllIn(field_name, "_"), "")
}

// Selects the given columns, aliasing each one to its normalized name
def selectFieldsNormalized(columns: List[String])(df: DataFrame): DataFrame = {
    val fieldsToSelect: List[Column] = columns.map(field =>
        col(field).as(getFieldAlias(field))
    )
    df.select(fieldsToSelect: _*)
}

// Normalizes every column name of the DataFrame
def normalizeSchema(df: DataFrame): DataFrame = {
    val schema = df.columns.toList
    df.transform(selectFieldsNormalized(schema))
}

FirstAtRe = ^_
AliasRe = [\s_.:@]+

getFieldAlias: (field_name: String)String
selectFieldsNormalized: (columns: List[String])(df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
normalizeSchema: (df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
Out[1]:
[\s_.:@]+
In [2]:
val winePath = "~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv"
val modelPath = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"

winePath = ~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv
modelPath = /tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
Out[2]:
/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
In [3]:
val df = spark.read
              .format("csv")
              .option("header", "true")
              .option("delimiter", ";")
              .load(winePath)
              .transform(normalizeSchema)

df = [fixed_acidity: string, volatile_acidity: string ... 10 more fields]
Out[3]:
[fixed_acidity: string, volatile_acidity: string ... 10 more fields]
In [4]:
%%PySpark
import mlflow
from mlflow import pyfunc

model_path = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"
wine_quality_udf = mlflow.pyfunc.spark_udf(spark, model_path)

spark.udf.register("wineQuality", wine_quality_udf)
Out[4]:
<function spark_udf.<locals>.predict at 0x1116a98c8>
In [6]:
df.createOrReplaceTempView("wines")
In [10]:
%%SQL
SELECT 
    quality,
    wineQuality(
        fixed_acidity,
        volatile_acidity,
        citric_acid,
        residual_sugar,
        chlorides,
        free_sulfur_dioxide,
        total_sulfur_dioxide,
        density,
        pH,
        sulphates,
        alcohol
    ) AS prediction
FROM wines
LIMIT 10
Out[10]:
+-------+------------------+
|quality|        prediction|
+-------+------------------+
|      5| 5.576883967129615|
|      5|  5.50664776916154|
|      5| 5.525504822954496|
|      6| 5.504311247097457|
|      5| 5.576883967129615|
|      5|5.5556903912725755|
|      5| 5.467882654744997|
|      7| 5.710602976324739|
|      7| 5.657319539336507|
|      5| 5.345098606538708|
+-------+------------------+

In [17]:
spark.catalog.listFunctions.filter('name like "%wineQuality%").show(20, false)

+-----------+--------+-----------+---------+-----------+
|name       |database|description|className|isTemporary|
+-----------+--------+-----------+---------+-----------+
|wineQuality|null    |null       |null     |true       |
+-----------+--------+-----------+---------+-----------+

Next steps

Even though MLflow is in alpha at the time of writing, it looks very promising. Just being able to run multiple machine learning frameworks and consume them from a single endpoint takes recommender systems to the next level.

In addition, MLflow brings Data Engineers and Data Science specialists closer together by laying a common layer between them.

After this exploration of MLflow, we are confident that we will move forward and use it for our Spark pipelines and recommender systems.

It would be nice to synchronize the file store with a database instead of the file system. This should give us multiple endpoints that can use the same file store. For example, multiple instances of Presto and Athena can share the same Glue metastore.

To sum up, I would like to thank the MLflow community for making our work with data more interesting.

If you are playing around with MLflow, do not hesitate to write to us and tell us how you use it, and even more so if you use it in production.

Find out more about the courses:
Machine Learning. Basic course
Machine Learning. Advanced course


Source: www.habr.com
