Extending Spark with MLflow

Hello, Khabrovites. As we already wrote, this month OTUS launches two machine learning courses at once: basic and advanced. With that in mind, we continue to share useful material.

The purpose of this article is to describe our first experience with MLflow.

We will start our review of MLflow with its tracking server and log every iteration of the study. Then we will share our experience of connecting Spark with MLflow using a UDF.

Context

We at Alpha Health use machine learning and artificial intelligence to empower people to take care of their health and well-being. That is why machine learning models are at the heart of the data products we build, and that is why MLflow, an open source platform that covers every aspect of the machine learning lifecycle, caught our attention.

MLflow

The main goal of MLflow is to provide an additional layer on top of machine learning that lets data scientists work with almost any machine learning library (h2o, keras, mleap, pytorch, sklearn and tensorflow) and take their work to the next level.

MLflow provides three components:

  • Tracking - recording and querying experiments: code, data, configuration and results. Keeping track of how a model was built is very important.
  • Projects - a packaging format for running on any platform (for example, SageMaker).
  • Models - a common format for sending models to a variety of deployment tools.

MLflow (in alpha at the time of writing) is an open source platform that lets you manage the machine learning lifecycle, including experimentation, reuse, and deployment.

Setting up MLflow

To use MLflow, you first need to set up a full Python environment. For this we will use PyEnv (to install Python on a Mac, look here). That way we can create a virtual environment and install into it all the libraries needed for the run.

```
pyenv install 3.7.0
pyenv global 3.7.0 # Use Python 3.7
mkvirtualenv mlflow # Create a Virtual Env with Python 3.7
workon mlflow
```

Install the required libraries.

```
pip install mlflow==0.7.0 \
            Cython==0.29 \
            numpy==1.14.5 \
            pandas==0.23.4 \
            pyarrow==0.11.0
```

Note: we use PyArrow to run models as UDFs. The PyArrow and Numpy versions had to be pinned because the latest releases conflicted with each other.
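Because the pinned versions matter here, a small check can help spot a drifted environment. The sketch below is only an illustration: the helper name `find_mismatches` and the example environment are hypothetical, and the installed versions are passed in as a plain dictionary rather than read from the interpreter.

```python
# Pins copied from the pip install command above.
PINNED = {
    "mlflow": "0.7.0",
    "Cython": "0.29",
    "numpy": "1.14.5",
    "pandas": "0.23.4",
    "pyarrow": "0.11.0",
}


def find_mismatches(installed, pinned=PINNED):
    """Return {package: (installed_or_None, pinned)} for every package that deviates."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = installed.get(name)  # None means the package is missing
        if found != wanted:
            mismatches[name] = (found, wanted)
    return mismatches


# Hypothetical environment: numpy has drifted and pyarrow is missing.
example_env = {
    "mlflow": "0.7.0",
    "Cython": "0.29",
    "numpy": "1.16.0",
    "pandas": "0.23.4",
}
print(find_mismatches(example_env))
```

In a real environment the `installed` dictionary could be filled from the output of `pip freeze`.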

Launching the Tracking UI

MLflow Tracking lets us log and query experiments using Python and the REST API. In addition, you can define where model artifacts are stored (localhost, Amazon S3, Azure Blob Storage, Google Cloud Storage or an SFTP server). Since we use AWS at Alpha Health, S3 will be our artifact storage.

# Running a Tracking Server
mlflow server \
    --file-store /tmp/mlflow/fileStore \
    --default-artifact-root s3://<bucket>/mlflow/artifacts/ \
    --host localhost \
    --port 5000

MLflow recommends using persistent file storage. The file store is where the server keeps run and experiment metadata. When you start the server, make sure it points to persistent file storage. Here, for the sake of the experiment, we simply use /tmp.

Keep in mind that if we want to use the mlflow server to look at old experiments, they must be present in the file store. Even without them, however, we could still use the models in a UDF, since all we need is the path to the model.
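To make "the path to the model" concrete, here is a minimal sketch that assembles such a path. The layout (artifact root, experiment ID, run ID, `artifacts`, artifact name) is inferred from the paths used elsewhere in this article and is an assumption about this MLflow version. The helper `model_artifact_uri` is hypothetical and not part of MLflow.

```python
def model_artifact_uri(artifact_root, experiment_id, run_id, artifact_path="model"):
    """Join the pieces of an artifact location into a single URI (assumed layout)."""
    root = artifact_root.rstrip("/")  # tolerate a trailing slash on the root
    return "{}/{}/{}/artifacts/{}".format(root, experiment_id, run_id, artifact_path)


uri = model_artifact_uri(
    "s3://<bucket>/mlflow/artifacts/",   # --default-artifact-root from the server command
    1,                                   # experiment ID
    "0f8691808e914d1087cf097a08730f17",  # run ID
)
print(uri)
# s3://<bucket>/mlflow/artifacts/1/0f8691808e914d1087cf097a08730f17/artifacts/model
```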

Note: keep in mind that both the Tracking UI and the model client must have access to the artifact location. In other words, even if the Tracking UI lives on an EC2 instance, a machine that runs MLflow locally still needs direct access to S3 to write model artifacts.

Figure: the Tracking UI storing artifacts in an S3 bucket

Running Models

Once the Tracking server is running, you can start training models.

As an example, we will use the wine modification of the MLflow example in Sklearn.

MLFLOW_TRACKING_URI=http://localhost:5000 python wine_quality.py \
  --alpha 0.9 \
  --l1_ratio 0.5 \
  --wine_file ./data/winequality-red.csv

As we said, MLflow lets you log parameters, metrics, and model artifacts so that you can track how they evolve across iterations. This feature is extremely useful: we can reproduce the best model by asking the Tracking server, or find out which code produced a given iteration from the git commit hashes it logs.

with mlflow.start_run():

    ... model ...

    mlflow.log_param("source", wine_path)
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)

    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    mlflow.set_tag('domain', 'wine')
    mlflow.set_tag('predict', 'quality')
    mlflow.sklearn.log_model(lr, "model")
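
Once several runs like this have been logged, choosing the best one comes down to comparing their metrics. The sketch below assumes the runs have already been fetched from the Tracking server into plain dictionaries. The record shape, the helper `best_run`, and all values are made up for illustration and are not results from our experiments.

```python
def best_run(runs, metric="rmse", lower_is_better=True):
    """Return the run record with the best value for the given metric."""
    candidates = [run for run in runs if metric in run["metrics"]]
    if not candidates:
        raise ValueError("no run reports the metric %r" % metric)
    choose = min if lower_is_better else max
    return choose(candidates, key=lambda run: run["metrics"][metric])


# Made-up run records for illustration only.
runs = [
    {"run_id": "run-a", "params": {"alpha": 0.9, "l1_ratio": 0.5},
     "metrics": {"rmse": 0.83, "r2": 0.11}},
    {"run_id": "run-b", "params": {"alpha": 0.5, "l1_ratio": 0.5},
     "metrics": {"rmse": 0.79, "r2": 0.17}},
    {"run_id": "run-c", "params": {"alpha": 0.1, "l1_ratio": 0.9},
     "metrics": {"r2": 0.21}},  # this run did not log rmse
]

print(best_run(runs)["run_id"])                                      # lowest rmse
print(best_run(runs, metric="r2", lower_is_better=False)["run_id"])  # highest r2
```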

Figure: wine iterations

Serving the Model

The MLflow tracking server, launched with the "mlflow server" command, has a REST API for tracking runs and writing data to the local file system. You can specify the tracking server address with the "MLFLOW_TRACKING_URI" environment variable, and the MLflow tracking API will automatically contact the tracking server at that address to create or fetch run information, log metrics, and so on.

Source: Docs // Running a tracking server
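
The lookup described in the quote can be pictured roughly as follows. This is not MLflow's own code: the function `resolve_tracking_uri` and its precedence order are our illustration, and the `./mlruns` fallback is an assumption about where runs end up when no server is configured.

```python
import os

# Assumed fallback: a local "./mlruns" directory when nothing is configured.
DEFAULT_TRACKING_URI = "./mlruns"


def resolve_tracking_uri(explicit_uri=None, env=None):
    """Pick the tracking URI: explicit argument, then environment, then the default."""
    env = os.environ if env is None else env
    if explicit_uri:
        return explicit_uri
    return env.get("MLFLOW_TRACKING_URI") or DEFAULT_TRACKING_URI


print(resolve_tracking_uri(env={"MLFLOW_TRACKING_URI": "http://localhost:5000"}))
print(resolve_tracking_uri(env={}))
```

Passing the environment as a dictionary keeps the sketch easy to try without touching the real shell environment.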

To serve the model we need a running tracking server (see the launch interface) and the Run ID of the model.

Figure: Run ID

# Serve a sklearn model through 127.0.0.1:5005
MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow sklearn serve \
  --port 5005 \
  --run_id 0f8691808e914d1087cf097a08730f17 \
  --model-path model

To serve models with the MLflow serve functionality, we need access to the Tracking UI so that it can look up the model from nothing more than the --run_id.

Once the model has contacted the Tracking server, we get a new model endpoint.

# Query the served model endpoint
curl -X POST \
  http://127.0.0.1:5005/invocations \
  -H 'Content-Type: application/json' \
  -d '[
	{
		"fixed acidity": 3.42,
		"volatile acidity": 1.66,
		"citric acid": 0.48,
		"residual sugar": 4.2,
		"chlorides": 0.229,
		"free sulfur dioxide": 19,
		"total sulfur dioxide": 25,
		"density": 1.98,
		"pH": 5.33,
		"sulphates": 4.39,
		"alcohol": 10.8
	}
]'

> {"predictions": [5.825055635303461]}
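
The same call can be made from Python using only the standard library. This is a minimal sketch: the helper names `build_invocation_request` and `predict` are ours, and the response shape is assumed to match the output shown above. Building the request is kept separate from sending it, so the payload can be inspected without a running server.

```python
import json
import urllib.request


def build_invocation_request(url, rows):
    """Build a POST request carrying the rows as a JSON list of records."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(url, rows):
    """Send the request and decode the JSON response (needs the model server running)."""
    with urllib.request.urlopen(build_invocation_request(url, rows)) as response:
        return json.loads(response.read().decode("utf-8"))


# Same sample row as in the curl call above.
sample = [{
    "fixed acidity": 3.42, "volatile acidity": 1.66, "citric acid": 0.48,
    "residual sugar": 4.2, "chlorides": 0.229, "free sulfur dioxide": 19,
    "total sulfur dioxide": 25, "density": 1.98, "pH": 5.33,
    "sulphates": 4.39, "alcohol": 10.8,
}]

request = build_invocation_request("http://127.0.0.1:5005/invocations", sample)
print(request.get_method(), request.full_url)
# With the server from the previous step running:
# print(predict("http://127.0.0.1:5005/invocations", sample))
```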

Running models from Spark

Although the MLflow server is powerful enough to train models and serve them in real time through the serve functionality (source: mlflow // docs // models # local), using Spark (batch or streaming) is an even more powerful option thanks to its distributed nature.

Imagine that you have just trained a model offline and now want to apply the resulting model to all of your data. This is where Spark and MLflow come into their own.

Install PySpark + Jupyter + Spark

Source: Get started PySpark - Jupyter

To show how we apply MLflow models to Spark dataframes, we need to set up Jupyter notebooks to work with PySpark.

Start by installing the latest stable version of Apache Spark:

cd ~/Downloads/
tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
mv ~/Downloads/spark-2.4.3-bin-hadoop2.7 ~/
ln -s ~/spark-2.4.3-bin-hadoop2.7 ~/spark

Install PySpark and Jupyter in the virtual environment:

pip install pyspark jupyter

Set up the environment variables:

export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"

Having defined notebook-dir, we can keep our notebooks in the folder we want.

Running Jupyter from PySpark

Since we were able to configure Jupyter as the PySpark driver, we can now run a Jupyter notebook in the PySpark context.

(mlflow) afranzi:~$ pyspark
[I 19:05:01.572 NotebookApp] sparkmagic extension enabled!
[I 19:05:01.573 NotebookApp] Serving notebooks from local directory: /Users/afranzi/Projects/notebooks
[I 19:05:01.573 NotebookApp] The Jupyter Notebook is running at:
[I 19:05:01.573 NotebookApp] http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
[I 19:05:01.573 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:05:01.574 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745


As mentioned above, MLflow can log model artifacts to S3. As soon as we have the chosen model in hand, we can import it as a UDF using the mlflow.pyfunc module.

import mlflow.pyfunc

model_path = 's3://<bucket>/mlflow/artifacts/1/0f8691808e914d1087cf097a08730f17/artifacts/model'
wine_path = '/Users/afranzi/Projects/data/winequality-red.csv'
wine_udf = mlflow.pyfunc.spark_udf(spark, model_path)

df = spark.read.format("csv").option("header", "true").option('delimiter', ';').load(wine_path)
columns = [ "fixed acidity", "volatile acidity", "citric acid",
            "residual sugar", "chlorides", "free sulfur dioxide",
            "total sulfur dioxide", "density", "pH",
            "sulphates", "alcohol"
          ]
          
df.withColumn('prediction', wine_udf(*columns)).show(100, False)

Figure: PySpark - predicting wine quality

Up to this point we have talked about using PySpark with MLflow by running the wine quality prediction over the entire wine dataset. But what if you need to use the Python MLflow modules from Scala Spark?

We tested this as well by sharing the Spark context between Scala and Python. That is, we registered the MLflow UDF in Python and used it from Scala (yes, perhaps not the best solution, but it is what we have).

Scala Spark + MLflow

For this example, we will add the Toree kernel to the existing Jupyter.

Install Spark + Toree + Jupyter

```
pip install toree
jupyter toree install --spark_home=${SPARK_HOME} --sys-prefix
jupyter kernelspec list
```
```
Available kernels:
  apache_toree_scala    /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/apache_toree_scala
  python3               /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/python3
```

As you can see from the attached notebook, the UDF is shared between Spark and PySpark. We hope this part will be useful to those who love Scala and want to deploy machine learning models to production.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{Column, DataFrame}
import scala.util.matching.Regex

val FirstAtRe: Regex = "^_".r
val AliasRe: Regex = "[\\s_.:@]+".r

def getFieldAlias(field_name: String): String = {
    FirstAtRe.replaceAllIn(AliasRe.replaceAllIn(field_name, "_"), "")
}

def selectFieldsNormalized(columns: List[String])(df: DataFrame): DataFrame = {
    val fieldsToSelect: List[Column] = columns.map(field =>
        col(field).as(getFieldAlias(field))
    )
    df.select(fieldsToSelect: _*)
}

def normalizeSchema(df: DataFrame): DataFrame = {
    val schema = df.columns.toList
    df.transform(selectFieldsNormalized(schema))
}

FirstAtRe = ^_
AliasRe = [\s_.:@]+

getFieldAlias: (field_name: String)String
selectFieldsNormalized: (columns: List[String])(df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
normalizeSchema: (df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
Out[1]:
[\s_.:@]+
In [2]:
val winePath = "~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv"
val modelPath = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"

winePath = ~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv
modelPath = /tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
Out[2]:
/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
In [3]:
val df = spark.read
              .format("csv")
              .option("header", "true")
              .option("delimiter", ";")
              .load(winePath)
              .transform(normalizeSchema)

df = [fixed_acidity: string, volatile_acidity: string ... 10 more fields]
Out[3]:
[fixed_acidity: string, volatile_acidity: string ... 10 more fields]
In [4]:
%%PySpark
import mlflow
from mlflow import pyfunc

model_path = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"
wine_quality_udf = mlflow.pyfunc.spark_udf(spark, model_path)

spark.udf.register("wineQuality", wine_quality_udf)
Out[4]:
<function spark_udf.<locals>.predict at 0x1116a98c8>
In [6]:
df.createOrReplaceTempView("wines")
In [10]:
%%SQL
SELECT 
    quality,
    wineQuality(
        fixed_acidity,
        volatile_acidity,
        citric_acid,
        residual_sugar,
        chlorides,
        free_sulfur_dioxide,
        total_sulfur_dioxide,
        density,
        pH,
        sulphates,
        alcohol
    ) AS prediction
FROM wines
LIMIT 10
Out[10]:
+-------+------------------+
|quality|        prediction|
+-------+------------------+
|      5| 5.576883967129615|
|      5|  5.50664776916154|
|      5| 5.525504822954496|
|      6| 5.504311247097457|
|      5| 5.576883967129615|
|      5|5.5556903912725755|
|      5| 5.467882654744997|
|      7| 5.710602976324739|
|      7| 5.657319539336507|
|      5| 5.345098606538708|
+-------+------------------+

In [17]:
spark.catalog.listFunctions.filter('name like "%wineQuality%").show(20, false)

+-----------+--------+-----------+---------+-----------+
|name       |database|description|className|isTemporary|
+-----------+--------+-----------+---------+-----------+
|wineQuality|null    |null       |null     |true       |
+-----------+--------+-----------+---------+-----------+

Next steps

Even though MLflow is in alpha at the time of writing, it looks quite promising. The ability to run multiple machine learning frameworks and consume them from a single endpoint alone takes recommender systems to the next level.

In addition, MLflow brings data engineers and data scientists closer together by laying a common layer between them.

After this exploration of MLflow, we are confident that we will go further and use it for our Spark pipelines and recommender systems.

It would be nice to sync the file store with a database instead of the file system. That should give us multiple endpoints that can share the same file store, for example multiple instances of Presto and Athena using the same Glue metastore.

To sum up, I would like to thank the MLflow community for making our work with data more interesting.

If you are playing around with MLflow, feel free to write to us and tell us how you use it, and even more so if you use it in production.

Learn more about the courses:
Machine Learning. Basic course
Machine Learning. Advanced course

Source: www.habr.com
