Extending Spark with MLflow

Hello, Khabr readers! As we already wrote, this month OTUS is launching two machine learning courses at once: a basic one and an advanced one. On this occasion, we continue to share useful material.

The goal of this article is to talk about our first experience of using MLflow.

We will begin our review of MLflow with its tracking server and log all the iterations of a training run. Then we will share our experience of connecting Spark to MLflow using UDFs.

Context

At Alpha Health we use machine learning and artificial intelligence to empower people to take charge of their health and well-being. That is why machine learning models sit at the heart of the data science products we build, and why we were drawn to MLflow, an open source platform that covers all aspects of the machine learning lifecycle.

MLflow

The main goal of MLflow is to provide an additional layer on top of machine learning that allows data scientists to work with almost any machine learning library (h2o, keras, mleap, pytorch, sklearn and tensorflow), taking their work to the next level.
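As an illustration of that common layer, here is a minimal sketch showing that a model logged from any of those libraries can be loaded back and queried through the shared pyfunc interface (the artifact path is a placeholder; in the 0.x API used throughout this article the loader is called load_pyfunc, while newer releases renamed it to load_model):

```
import pandas as pd
import mlflow.pyfunc

# Placeholder path - point it at any model logged by MLflow,
# regardless of the library it was trained with.
model = mlflow.pyfunc.load_pyfunc("/tmp/mlflow/artifactStore/0/<run_id>/artifacts/model")

# A pyfunc model predicts on a pandas DataFrame whose columns
# match the model's input features.
features = pd.read_csv("./data/winequality-red.csv", sep=";").drop(columns=["quality"])
print(model.predict(features))
```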

MLflow provides three components:

  • Tracking - recording and querying experiments: code, data, configuration and results. Keeping track of the model-building process is critically important.
  • Projects - a packaging format for running the code on any platform (e.g. SageMaker).
  • Models - a common format for submitting models to a variety of deployment tools.

MLflow (in alpha at the time of writing) is an open source platform that lets you manage the machine learning lifecycle, including experimentation, reuse and deployment.

Setting up MLflow

To use MLflow, you first need to set up the whole Python environment; for this we will use PyEnv (to install Python on a Mac, look here). This gives us a virtual environment into which we will install all the libraries needed to run it.

```
pyenv install 3.7.0
pyenv global 3.7.0 # Use Python 3.7
mkvirtualenv mlflow # Create a Virtual Env with Python 3.7
workon mlflow
```

Let's install the required libraries:

```
pip install mlflow==0.7.0 \
            Cython==0.29 \
            numpy==1.14.5 \
            pandas==0.23.4 \
            pyarrow==0.11.0
```

Note: we use PyArrow to run the models as UDFs. The PyArrow and Numpy versions have to be pinned because their latest versions conflict with each other.
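A quick, trivial sanity check (nothing MLflow-specific) that the pinned versions were actually picked up inside the virtual environment:

```
import numpy
import pyarrow

# Should print 1.14.5 and 0.11.0 if the pins above took effect.
print(numpy.__version__)
print(pyarrow.__version__)
```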

Launching the Tracking UI

MLflow Tracking lets us log and query experiments using Python and a REST API. Beyond that, you can choose where to store the model artifacts (localhost, Amazon S3, Azure Blob Storage, Google Cloud Storage or an SFTP server). Since we use AWS at Alpha Health, our artifact store will be S3.
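On the client side, pointing the Python API at such a server takes one call (equivalent to the MLFLOW_TRACKING_URI environment variable used later in this article); a minimal sketch:

```
import mlflow

# Equivalent to exporting MLFLOW_TRACKING_URI=http://localhost:5000
mlflow.set_tracking_uri("http://localhost:5000")

# From here on, all runs, parameters and metrics go to that server.
with mlflow.start_run():
    mlflow.log_param("smoke_test", True)
```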

# Running a Tracking Server
mlflow server \
    --file-store /tmp/mlflow/fileStore \
    --default-artifact-root s3://<bucket>/mlflow/artifacts/ \
    --host localhost \
    --port 5000

MLflow recommends using persistent file storage. The file store is where the server keeps run and experiment metadata. When starting the server, make sure it points at a persistent file store. Here, just for the experiment, we will simply use /tmp.

Keep in mind that if we want to use the mlflow server to access old experiments, they must be present in the file store. But even without it we could use them in a UDF, since all we need is the path to the model.

Note: the Tracking UI and the model client must both have access to the artifact location. That is, regardless of the fact that the Tracking UI lives on an EC2 instance, when MLflow is run locally the machine must have direct access to S3 in order to write model artifacts.
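To make that access requirement concrete: a client that loads a model by its artifact URI talks to S3 directly, so it needs valid AWS credentials of its own; a minimal sketch (the bucket and run ID are placeholders):

```
import mlflow.sklearn

# This reads from S3 directly - the tracking server is not a proxy
# for artifacts, so this machine needs its own AWS credentials.
model = mlflow.sklearn.load_model("s3://<bucket>/mlflow/artifacts/1/<run_id>/artifacts/model")
```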

[Image: the Tracking UI storing artifacts in an S3 bucket]

Training models

As soon as the Tracking server is running, you can start training models.

As an example, we will use a modified version of the wine example from MLflow in Sklearn.

MLFLOW_TRACKING_URI=http://localhost:5000 python wine_quality.py \
  --alpha 0.9 \
  --l1_ratio 0.5 \
  --wine_file ./data/winequality-red.csv

As we already discussed, MLflow lets you log model parameters, metrics and artifacts so you can track how they evolve over the iterations. This feature is extremely useful, because it allows us to reproduce the best model either by contacting the Tracking server or by working out which code performed the required iteration from the logged git hash of the commits.

with mlflow.start_run():

    ... model ...

    mlflow.log_param("source", wine_path)
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)

    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    mlflow.set_tag('domain', 'wine')
    mlflow.set_tag('predict', 'quality')
    mlflow.sklearn.log_model(lr, "model")

[Image: the wine model iterations in the Tracking UI]
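As a sketch of that "reproduce the best model by querying the Tracking server" workflow: MLflow releases newer than the 0.7.0 used here expose a search API for exactly this (the experiment ID is hypothetical):

```
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")

# mlflow.search_runs is available in MLflow 1.x+ and returns a
# pandas DataFrame with one row per run.
runs = mlflow.search_runs(experiment_ids=["1"], order_by=["metrics.rmse ASC"])
best = runs.iloc[0]
print(best["run_id"], best["metrics.rmse"])
```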

Serving models

The MLflow tracking server, launched with the "mlflow server" command, has a REST API for tracking runs and for writing data to the local file system. You can specify the tracking server's address via the "MLFLOW_TRACKING_URI" environment variable, and the MLflow tracking API will automatically contact the tracking server at that address to create/fetch run information, log metrics, and so on.

Source: Docs // Running a tracking server

To serve a model, we need a running tracking server (see "Launching the Tracking UI") and the model's Run ID.
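The Run ID can be copied from the Tracking UI (shown below), or captured programmatically at training time; a small sketch:

```
import mlflow

with mlflow.start_run() as run:
    # Newer releases expose run.info.run_id; the 0.x API called it run_uuid.
    print("Run ID:", run.info.run_id)
```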

[Image: the model's Run ID in the Tracking UI]

# Serve a sklearn model through 127.0.0.1:5005
MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow sklearn serve \
  --port 5005 \
  --run_id 0f8691808e914d1087cf097a08730f17 \
  --model-path model

To serve a model using MLflow's serve functionality, we need access to the Tracking UI so that it can look up the model's information simply from --run_id.

Once the model is talking to the Tracking server, we can hit the new model endpoint.

# Query Tracking Server Endpoint
curl -X POST \
  http://127.0.0.1:5005/invocations \
  -H 'Content-Type: application/json' \
  -d '[
	{
		"fixed acidity": 3.42,
		"volatile acidity": 1.66,
		"citric acid": 0.48,
		"residual sugar": 4.2,
		"chlorides": 0.229,
		"free sulfur dioxide": 19,
		"total sulfur dioxide": 25,
		"density": 1.98,
		"pH": 5.33,
		"sulphates": 4.39,
		"alcohol": 10.8
	}
]'

> {"predictions": [5.825055635303461]}

Running models from Spark

Even though the Tracking server is powerful enough to serve models in real time, train them and use the serve functionality (source: mlflow // docs // models # local), using Spark (batch or streaming) is an even more powerful solution thanks to its distributed nature.

Imagine that you simply do the training offline and then apply the output model to all of your data. This is where Spark and MLflow shine.

Installing PySpark + Jupyter + Spark

Source: Get Started with PySpark - Jupyter

To show how we apply an MLflow model to Spark dataframes, we need to set up a Jupyter notebook that works together with PySpark.

Start by installing the latest stable version of Apache Spark:

cd ~/Downloads/
tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
mv ~/Downloads/spark-2.4.3-bin-hadoop2.7 ~/
ln -s ~/spark-2.4.3-bin-hadoop2.7 ~/spark

Install PySpark and Jupyter into the virtual environment:

pip install pyspark jupyter

Set the environment variables:

export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"

Having specified notebook-dir, we can store our notebooks in the desired folder.

Launching Jupyter from PySpark

Since we were able to configure Jupyter as the PySpark driver, we can now run a Jupyter notebook in a PySpark context.

(mlflow) afranzi:~$ pyspark
[I 19:05:01.572 NotebookApp] sparkmagic extension enabled!
[I 19:05:01.573 NotebookApp] Serving notebooks from local directory: /Users/afranzi/Projects/notebooks
[I 19:05:01.573 NotebookApp] The Jupyter Notebook is running at:
[I 19:05:01.573 NotebookApp] http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
[I 19:05:01.573 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:05:01.574 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745


As mentioned above, MLflow provides a way to log model artifacts to S3. As soon as we have the chosen model in hand, we can import it as a UDF using the mlflow.pyfunc module.

import mlflow.pyfunc

model_path = 's3://<bucket>/mlflow/artifacts/1/0f8691808e914d1087cf097a08730f17/artifacts/model'
wine_path = '/Users/afranzi/Projects/data/winequality-red.csv'
wine_udf = mlflow.pyfunc.spark_udf(spark, model_path)

df = spark.read.format("csv").option("header", "true").option('delimiter', ';').load(wine_path)
columns = [ "fixed acidity", "volatile acidity", "citric acid",
            "residual sugar", "chlorides", "free sulfur dioxide",
            "total sulfur dioxide", "density", "pH",
            "sulphates", "alcohol"
          ]
          
df.withColumn('prediction', wine_udf(*columns)).show(100, False)

[Image: PySpark output of the wine quality predictions]
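Because spark_udf returns an ordinary Spark UDF, the same call also works in Structured Streaming; a minimal sketch reusing wine_udf and columns from the example above, with hypothetical input and output directories:

```
from pyspark.sql.types import DoubleType, StructField, StructType

# Build a schema matching the wine feature columns defined above.
schema = StructType([StructField(c, DoubleType()) for c in columns])

stream = (spark.readStream
               .schema(schema)
               .option("header", "true")
               .option("delimiter", ";")
               .csv("/data/incoming-wines/"))

query = (stream.withColumn("prediction", wine_udf(*columns))
               .writeStream
               .format("parquet")
               .option("path", "/data/wine-predictions/")
               .option("checkpointLocation", "/tmp/checkpoints/wine")
               .start())
```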

Up to this point we have talked about how to use PySpark with MLflow, running wine quality predictions over the entire wine dataset. But what if you need to use the Python MLflow modules from Scala Spark?

We tried this too, by splitting the Spark context between Scala and Python. That is, we registered the MLflow UDF in Python and used it from Scala (yes, perhaps not the best solution, but it is what we have).

Scala Spark + MLflow

For this example we will add the Toree Kernel to the existing Jupyter.

Installing Spark + Toree + Jupyter

```
pip install toree
jupyter toree install --spark_home=${SPARK_HOME} --sys-prefix
jupyter kernelspec list
```
```
Available kernels:
  apache_toree_scala    /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/apache_toree_scala
  python3               /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/python3
```

As you can see from the notebook attached below, the UDF is shared between Spark and PySpark. We hope this part will be useful for those who love Scala and want to deploy machine learning models to production.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{Column, DataFrame}
import scala.util.matching.Regex

val FirstAtRe: Regex = "^_".r
val AliasRe: Regex = "[\\s_.:@]+".r

def getFieldAlias(field_name: String): String = {
    FirstAtRe.replaceAllIn(AliasRe.replaceAllIn(field_name, "_"), "")
}

def selectFieldsNormalized(columns: List[String])(df: DataFrame): DataFrame = {
    val fieldsToSelect: List[Column] = columns.map(field =>
        col(field).as(getFieldAlias(field))
    )
    df.select(fieldsToSelect: _*)
}

def normalizeSchema(df: DataFrame): DataFrame = {
    val schema = df.columns.toList
    df.transform(selectFieldsNormalized(schema))
}

FirstAtRe = ^_
AliasRe = [\s_.:@]+

getFieldAlias: (field_name: String)String
selectFieldsNormalized: (columns: List[String])(df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
normalizeSchema: (df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
Out[1]:
[\s_.:@]+
In [2]:
val winePath = "~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv"
val modelPath = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"

winePath = ~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv
modelPath = /tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
Out[2]:
/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
In [3]:
val df = spark.read
              .format("csv")
              .option("header", "true")
              .option("delimiter", ";")
              .load(winePath)
              .transform(normalizeSchema)

df = [fixed_acidity: string, volatile_acidity: string ... 10 more fields]
Out[3]:
[fixed_acidity: string, volatile_acidity: string ... 10 more fields]
In [4]:
%%PySpark
import mlflow
from mlflow import pyfunc

model_path = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"
wine_quality_udf = mlflow.pyfunc.spark_udf(spark, model_path)

spark.udf.register("wineQuality", wine_quality_udf)
Out[4]:
<function spark_udf.<locals>.predict at 0x1116a98c8>
In [6]:
df.createOrReplaceTempView("wines")
In [10]:
%%SQL
SELECT 
    quality,
    wineQuality(
        fixed_acidity,
        volatile_acidity,
        citric_acid,
        residual_sugar,
        chlorides,
        free_sulfur_dioxide,
        total_sulfur_dioxide,
        density,
        pH,
        sulphates,
        alcohol
    ) AS prediction
FROM wines
LIMIT 10
Out[10]:
+-------+------------------+
|quality|        prediction|
+-------+------------------+
|      5| 5.576883967129615|
|      5|  5.50664776916154|
|      5| 5.525504822954496|
|      6| 5.504311247097457|
|      5| 5.576883967129615|
|      5|5.5556903912725755|
|      5| 5.467882654744997|
|      7| 5.710602976324739|
|      7| 5.657319539336507|
|      5| 5.345098606538708|
+-------+------------------+

In [17]:
spark.catalog.listFunctions.filter('name like "%wineQuality%").show(20, false)

+-----------+--------+-----------+---------+-----------+
|name       |database|description|className|isTemporary|
+-----------+--------+-----------+---------+-----------+
|wineQuality|null    |null       |null     |true       |
+-----------+--------+-----------+---------+-----------+

Next steps

Even though MLflow was in its Alpha version at the time of writing, it looks quite promising. The mere ability to run multiple machine learning frameworks and consume them from a single endpoint takes recommender systems to the next level.

In addition, MLflow brings Data Engineers and Data Science specialists closer together by laying down a common layer between them.

After this exploration of MLflow, we are confident that we will move forward and use it for our Spark pipelines and recommender systems.

It would be nice to sync the file store with a database instead of the file system. That way we should get multiple endpoints able to use the same file store; for example, several instances of Presto and Athena with the same Glue metastore.

To sum up, I would like to thank the MLflow community for making our work with data more interesting.

If you are playing around with MLflow, do not hesitate to write to us and tell us how you use it, and even more so if you use it in production.

Learn more about the courses:
Machine Learning. Basic course
Machine Learning. Advanced course


Source: www.habr.com
