Extending Spark with MLflow

Hello, Habr readers. As we have already written, this month OTUS is launching two machine learning courses at once: a basic one and an advanced one. With that in mind, we continue to share useful material.

The purpose of this article is to describe our first experience with MLflow.

We will start the review of MLflow with its tracking server and log every iteration of the study. Then we will share our experience of connecting Spark with MLflow using a UDF.

Context

At Alpha Health we use machine learning and artificial intelligence to empower people to take charge of their health and well-being. That is why machine learning models sit at the heart of the data science products we develop, and that is also what drew us to MLflow, an open source platform that covers all aspects of the machine learning lifecycle.

MLflow

The main goal of MLflow is to provide an additional layer on top of machine learning that lets data scientists work with almost any machine learning library (h2o, keras, mleap, pytorch, sklearn and tensorflow), taking their work to the next level.

MLflow provides three components:

  • Tracking - recording and querying experiments: code, data, configuration and results. Keeping track of how a model was built is very important.
  • Projects - a packaging format that runs on any platform (for example, SageMaker).
  • Models - a common format for deploying models to diverse deployment tools.

MLflow (in alpha at the time of writing) is an open source platform that lets you manage the machine learning lifecycle, including experimentation, reproducibility and deployment.

Setting up MLflow

To use MLflow you first need to set up your whole Python environment. For this we will use PyEnv (it can also be used to install Python on a Mac). This way we can create a virtual environment in which to install all the libraries needed to run it.

```
pyenv install 3.7.0
pyenv global 3.7.0 # Use Python 3.7
mkvirtualenv mlflow # Create a Virtual Env with Python 3.7
workon mlflow
```

Let's install the required libraries.

```
pip install mlflow==0.7.0 \
            Cython==0.29 \
            numpy==1.14.5 \
            pandas==0.23.4 \
            pyarrow==0.11.0
```

Note: We use PyArrow to run models as UDFs. The PyArrow and Numpy versions had to be pinned because the latest versions conflicted with each other.

Launching the Tracking UI

MLflow Tracking lets us log and query experiments using the Python and REST APIs. In addition, you can define where model artifacts are stored (localhost, Amazon S3, Azure Blob Storage, Google Cloud Storage or an SFTP server). Since we use AWS at Alpha Health, our artifact storage will be S3.

# Running a Tracking Server
mlflow server \
    --file-store /tmp/mlflow/fileStore \
    --default-artifact-root s3://<bucket>/mlflow/artifacts/ \
    --host localhost \
    --port 5000

MLflow recommends using persistent file storage. The file store is where the server keeps run and experiment metadata. When starting the server, make sure it points to a persistent file store. Here, for the sake of the experiment, we will simply use /tmp.

Keep in mind that if we want to use the mlflow server to run old experiments, they must be present in the file store. However, even without this we could still use them in a UDF, since all we need is the path to the model.

Note: Keep in mind that both the Tracking UI and the model client must have access to the artifact location. That is, even though the Tracking UI resides on an EC2 instance, when MLflow is run locally the machine must have direct access to S3 in order to write artifact models.

[Figure: Tracking UI storing artifacts in an S3 bucket]

Running models

Once the Tracking server is running, you can start training models.

As an example, we will use the wine variant of the MLflow example for Sklearn.

MLFLOW_TRACKING_URI=http://localhost:5000 python wine_quality.py \
  --alpha 0.9 \
  --l1_ratio 0.5 \
  --wine_file ./data/winequality-red.csv

As we have already discussed, MLflow lets you log model parameters, metrics and artifacts so that you can track how they evolve over iterations. This feature is extremely useful: it lets us reproduce the best model by contacting the Tracking server, or find out which code produced a given iteration by using the git commit hashes that are logged.

with mlflow.start_run():

    ... model ...

    mlflow.log_param("source", wine_path)
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)

    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    mlflow.set_tag('domain', 'wine')
    mlflow.set_tag('predict', 'quality')
    mlflow.sklearn.log_model(lr, "model")
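
The `... model ...` placeholder above elides the training and evaluation code, so the snippet does not show where `rmse`, `mae` and `r2` come from. As a minimal sketch, assuming they are the usual regression metrics computed on held-out predictions, they could be derived as follows. The helper name `eval_metrics` and the sample numbers are illustrative assumptions, not the author's code.

```python
import math


def eval_metrics(actual, predicted):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers.

    Hypothetical helper: a plain-Python stand-in for whatever the elided
    training code used to produce the logged metrics.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equally long")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)                   # root mean squared error
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    # R^2 is undefined when all actual values are identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    return rmse, mae, r2


# Made-up quality scores, only to exercise the function.
rmse, mae, r2 = eval_metrics([5.0, 6.0, 7.0], [5.5, 6.0, 6.5])
print(rmse, mae, r2)
```

The three values returned here are what would then be passed to `mlflow.log_metric` inside the `with mlflow.start_run():` block.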

[Figure: Wine iterations]

Serving the model

The MLflow tracking server, launched with the "mlflow server" command, has a REST API for tracking runs and writing data to the local file system. You can specify the address of the tracking server with the "MLFLOW_TRACKING_URI" environment variable, and the MLflow tracking API will automatically contact the tracking server at that address to create or retrieve run information, log metrics, and so on.
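
To make the environment-variable lookup concrete, here is a minimal sketch of how a client script might resolve the tracking address before doing any logging. The helper `resolve_tracking_uri` is hypothetical, and the `./mlruns` fallback is an assumption about MLflow's default local behaviour when no server address is set.

```python
import os


def resolve_tracking_uri(env=None, default="./mlruns"):
    """Pick the tracking address from MLFLOW_TRACKING_URI, else fall back.

    Hypothetical helper: `default` is assumed to mirror the local directory
    MLflow writes to when no tracking server is configured.
    """
    env = os.environ if env is None else env
    uri = env.get("MLFLOW_TRACKING_URI", "").strip()
    return uri or default


# Address used throughout this article.
remote = resolve_tracking_uri({"MLFLOW_TRACKING_URI": "http://localhost:5000"})
# No variable set: fall back to the assumed local default.
local = resolve_tracking_uri({})
print(remote, local)

# With MLflow installed, the resolved value could then be handed to
# mlflow.set_tracking_uri(...) instead of relying on the environment.
```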

Source: Docs // Running a tracking server

To serve the model, we need a running tracking server (see the launch section above) and the Run ID of the model.

[Figure: Run ID]

# Serve a sklearn model through 127.0.0.1:5005
MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow sklearn serve \
  --port 5005 \
  --run_id 0f8691808e914d1087cf097a08730f17 \
  --model-path model

To serve models with the MLflow serve functionality, we need access to the Tracking UI so that it can fetch information about the model simply from the --run_id we specify.

Once the model server has contacted the Tracking server, we get a new model endpoint.

# Query Tracking Server Endpoint
curl -X POST \
  http://127.0.0.1:5005/invocations \
  -H 'Content-Type: application/json' \
  -d '[
	{
		"fixed acidity": 3.42, 
		"volatile acidity": 1.66, 
		"citric acid": 0.48, 
		"residual sugar": 4.2, 
		"chlorides": 0.229, 
		"free sulfur dioxide": 19, 
		"total sulfur dioxide": 25, 
		"density": 1.98, 
		"pH": 5.33, 
		"sulphates": 4.39, 
		"alcohol": 10.8
	}
]'

> {"predictions": [5.825055635303461]}
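
The same request can be prepared from Python. The sketch below uses only the standard library, builds the JSON body in the list-of-records shape shown in the curl call, and checks the column names before anything is sent. The helper `build_invocation_request` and the validation step are illustrative assumptions, not part of MLflow.

```python
import json
import urllib.request

# Feature names taken from the wine-quality columns used in this article.
WINE_FEATURES = [
    "fixed acidity", "volatile acidity", "citric acid",
    "residual sugar", "chlorides", "free sulfur dioxide",
    "total sulfur dioxide", "density", "pH",
    "sulphates", "alcohol",
]


def build_invocation_request(rows, url="http://127.0.0.1:5005/invocations"):
    """Build (but do not send) a JSON POST request for the model endpoint."""
    for row in rows:
        unexpected = sorted(set(row) - set(WINE_FEATURES))
        missing = sorted(set(WINE_FEATURES) - set(row))
        if unexpected or missing:
            # Catch misspelled or absent columns before calling the server.
            raise ValueError(f"unexpected={unexpected}, missing={missing}")
    return urllib.request.Request(
        url,
        data=json.dumps(rows).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


values = [3.42, 1.66, 0.48, 4.2, 0.229, 19, 25, 1.98, 5.33, 4.39, 10.8]
sample = dict(zip(WINE_FEATURES, values))
request = build_invocation_request([sample])

# Sending requires the model server from the previous step to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Validating names on the client side is a design choice for this sketch: a misspelled column then fails early instead of reaching the model.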

Running models from Spark

Although the Tracking server is powerful enough to serve models in real time, to train them and to use the server functionality (source: mlflow // docs // models # local), using Spark (batch or streaming) is an even more powerful solution thanks to its distributed nature.

Imagine that you simply trained offline and then applied the resulting model to all of your data. This is where Spark and MLflow shine.

Installing PySpark + Jupyter + Spark

Source: Get started PySpark - Jupyter

To show how we apply MLflow models to Spark dataframes, we need to set up Jupyter notebooks to work together with PySpark.

Start by installing the latest stable version of Apache Spark:

cd ~/Downloads/
tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
mv ~/Downloads/spark-2.4.3-bin-hadoop2.7 ~/
ln -s ~/spark-2.4.3-bin-hadoop2.7 ~/spark

Install PySpark and Jupyter in the virtual environment:

pip install pyspark jupyter

Set the environment variables:

export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"

Having defined notebook-dir, we can store our notebooks in the desired folder.

Launching Jupyter from PySpark

Since we were able to configure Jupyter as the PySpark driver, we can now run Jupyter notebooks in the context of PySpark.

(mlflow) afranzi:~$ pyspark
[I 19:05:01.572 NotebookApp] sparkmagic extension enabled!
[I 19:05:01.573 NotebookApp] Serving notebooks from local directory: /Users/afranzi/Projects/notebooks
[I 19:05:01.573 NotebookApp] The Jupyter Notebook is running at:
[I 19:05:01.573 NotebookApp] http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
[I 19:05:01.573 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:05:01.574 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745


As mentioned above, MLflow provides a feature for logging model artifacts to S3. As soon as we have the selected model in our hands, we can import it as a UDF using the mlflow.pyfunc module.

import mlflow.pyfunc

model_path = 's3://<bucket>/mlflow/artifacts/1/0f8691808e914d1087cf097a08730f17/artifacts/model'
wine_path = '/Users/afranzi/Projects/data/winequality-red.csv'
wine_udf = mlflow.pyfunc.spark_udf(spark, model_path)

df = spark.read.format("csv").option("header", "true").option('delimiter', ';').load(wine_path)
columns = [ "fixed acidity", "volatile acidity", "citric acid",
            "residual sugar", "chlorides", "free sulfur dioxide",
            "total sulfur dioxide", "density", "pH",
            "sulphates", "alcohol"
          ]
          
df.withColumn('prediction', wine_udf(*columns)).show(100, False)
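
The `model_path` used above appears to follow a fixed layout: artifact root, experiment ID, run ID, the literal `artifacts` folder, then the path the model was logged under. The sketch below composes such a path from its parts. The helper `model_artifact_uri` is hypothetical, and the layout is inferred from the paths shown in this article rather than taken from MLflow's documentation, so it may differ between versions.

```python
def model_artifact_uri(artifact_root, experiment_id, run_id, artifact_path="model"):
    """Compose a model location from the parts visible in the article's paths.

    Hypothetical helper: the layout is an assumption inferred from
    's3://<bucket>/mlflow/artifacts/1/<run_id>/artifacts/model'.
    """
    return "/".join([
        artifact_root.rstrip("/"),   # e.g. the --default-artifact-root value
        str(experiment_id),          # experiment the run belongs to
        run_id,                      # Run ID shown in the Tracking UI
        "artifacts",
        artifact_path,               # name passed to log_model(...)
    ])


model_path = model_artifact_uri(
    "s3://<bucket>/mlflow/artifacts/", 1, "0f8691808e914d1087cf097a08730f17"
)
print(model_path)
```

Building the path this way only needs the Run ID and the artifact root, which matches the earlier remark that a UDF needs nothing more than the path to the model.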

[Figure: PySpark - wine quality predictions output]

Up to this point we have talked about how to use PySpark with MLflow, running wine quality prediction over the entire wine dataset. But what if you need to use the Python MLflow modules from Scala Spark?

We tested this too, by sharing the Spark context between Scala and Python. That is, we registered the MLflow UDF in Python and used it from Scala (yes, perhaps not the best solution, but it is what we have).

Scala Spark + MLflow

For this example we will add the Toree Kernel to the existing Jupyter installation.

Installing Spark + Toree + Jupyter

```
pip install toree
jupyter toree install --spark_home=${SPARK_HOME} --sys-prefix
jupyter kernelspec list
```
```
Available kernels:
  apache_toree_scala    /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/apache_toree_scala
  python3               /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/python3
```

As you can see from the attached notebook, the UDF is shared between Spark and PySpark. We hope this part will be useful to those who love Scala and want to deploy machine learning models to production.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{Column, DataFrame}
import scala.util.matching.Regex

val FirstAtRe: Regex = "^_".r
val AliasRe: Regex = "[\\s_.:@]+".r

def getFieldAlias(field_name: String): String = {
    FirstAtRe.replaceAllIn(AliasRe.replaceAllIn(field_name, "_"), "")
}

def selectFieldsNormalized(columns: List[String])(df: DataFrame): DataFrame = {
    val fieldsToSelect: List[Column] = columns.map(field =>
        col(field).as(getFieldAlias(field))
    )
    df.select(fieldsToSelect: _*)
}

def normalizeSchema(df: DataFrame): DataFrame = {
    val schema = df.columns.toList
    df.transform(selectFieldsNormalized(schema))
}

FirstAtRe = ^_
AliasRe = [\s_.:@]+

getFieldAlias: (field_name: String)String
selectFieldsNormalized: (columns: List[String])(df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
normalizeSchema: (df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
Out[1]:
[\s_.:@]+
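
The Scala helpers in the first cell rename columns such as "fixed acidity" to "fixed_acidity" so they can be referenced from SQL later on. As a rough Python equivalent, assuming the alias pattern is meant to match whitespace, underscores, dots, colons and at-signs, the renaming rule could be sketched like this. The names below are illustrative ports, not part of the original notebook.

```python
import re

# Runs of whitespace, '_', '.', ':' or '@' collapse into a single underscore.
ALIAS_RE = re.compile(r"[\s_.:@]+")
# A leading underscore left over from the substitution is dropped.
FIRST_UNDERSCORE_RE = re.compile(r"^_")


def get_field_alias(field_name):
    """Python port of the Scala getFieldAlias helper (illustrative only)."""
    return FIRST_UNDERSCORE_RE.sub("", ALIAS_RE.sub("_", field_name))


columns = ["fixed acidity", "free sulfur dioxide", "pH", "@timestamp"]
aliases = [get_field_alias(c) for c in columns]
print(aliases)
```

Here "@timestamp" is a made-up column name, included only to show the leading-underscore rule.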
In [2]:
val winePath = "~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv"
val modelPath = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"

winePath = ~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv
modelPath = /tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
Out[2]:
/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
In [3]:
val df = spark.read
              .format("csv")
              .option("header", "true")
              .option("delimiter", ";")
              .load(winePath)
              .transform(normalizeSchema)

df = [fixed_acidity: string, volatile_acidity: string ... 10 more fields]
Out[3]:
[fixed_acidity: string, volatile_acidity: string ... 10 more fields]
In [4]:
%%PySpark
import mlflow
from mlflow import pyfunc

model_path = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"
wine_quality_udf = mlflow.pyfunc.spark_udf(spark, model_path)

spark.udf.register("wineQuality", wine_quality_udf)
Out[4]:
<function spark_udf.<locals>.predict at 0x1116a98c8>
In [6]:
df.createOrReplaceTempView("wines")
In [10]:
%%SQL
SELECT 
    quality,
    wineQuality(
        fixed_acidity,
        volatile_acidity,
        citric_acid,
        residual_sugar,
        chlorides,
        free_sulfur_dioxide,
        total_sulfur_dioxide,
        density,
        pH,
        sulphates,
        alcohol
    ) AS prediction
FROM wines
LIMIT 10
Out[10]:
+-------+------------------+
|quality|        prediction|
+-------+------------------+
|      5| 5.576883967129615|
|      5|  5.50664776916154|
|      5| 5.525504822954496|
|      6| 5.504311247097457|
|      5| 5.576883967129615|
|      5|5.5556903912725755|
|      5| 5.467882654744997|
|      7| 5.710602976324739|
|      7| 5.657319539336507|
|      5| 5.345098606538708|
+-------+------------------+

In [17]:
spark.catalog.listFunctions.filter('name like "%wineQuality%").show(20, false)

+-----------+--------+-----------+---------+-----------+
|name       |database|description|className|isTemporary|
+-----------+--------+-----------+---------+-----------+
|wineQuality|null    |null       |null     |true       |
+-----------+--------+-----------+---------+-----------+

Next steps

Although MLflow is in alpha at the time of writing, it looks quite promising. The ability alone to run multiple machine learning frameworks and consume them from a single endpoint takes recommender systems to the next level.

In addition, MLflow brings Data Engineers and Data Scientists closer together by laying a common layer between them.

After this exploration of MLflow, we are sure that we will move forward and use it for our Spark pipelines and recommender systems.

It would be nice to synchronize the file storage with a database instead of the file system. This should give us multiple endpoints that can use the same file storage, for example multiple instances of Presto and Athena sharing the same Glue metastore.

To sum up, I would like to thank the MLflow community for making our work with data more interesting.

If you are playing around with MLflow, do not hesitate to write to us and tell us how you use it, and even more so if you use it in production.

Learn more about the courses:
Machine Learning. Basic course
Machine Learning. Advanced course


Source: www.habr.com
