Extending Spark with MLflow

Hello, Habr readers. As we wrote earlier, this month OTUS is launching two machine learning courses at once: basic and advanced. With that in mind, we continue to share useful material.

The purpose of this article is to describe our first experience using MLflow.

We will start our review of MLflow with its tracking server and log every iteration of the study. Then we will share our experience of connecting Spark with MLflow using a UDF.

Context

At Alpha Health we use machine learning and artificial intelligence to empower people to take care of their health and well-being. That is why machine learning models are at the heart of the data science products we build, and why we were drawn to MLflow, an open source platform that covers every aspect of the machine learning lifecycle.

MLflow

The main goal of MLflow is to provide an additional layer on top of machine learning that lets data scientists work with almost any machine learning library (h2o, keras, mleap, pytorch, sklearn and tensorflow), taking their work to the next level.

MLflow provides three components:

  • Tracking - recording and querying experiments: code, data, configuration and results. Keeping track of how a model is built is very important.
  • Projects - a packaging format for running on any platform (for example, SageMaker).
  • Models - a common format for sending models to a variety of deployment tools.

MLflow (in alpha at the time of writing) is an open source platform that lets you manage the machine learning lifecycle, including experimentation, reuse and deployment.

Setting up MLflow

To use MLflow you first need to set up your whole Python environment. For this we will use PyEnv (to install Python on a Mac, see the PyEnv documentation). This way we can create a virtual environment in which we install all the libraries needed to run it.

```
pyenv install 3.7.0
pyenv global 3.7.0 # Use Python 3.7
mkvirtualenv mlflow # Create a Virtual Env with Python 3.7
workon mlflow
```

Let's install the required libraries.

```
pip install mlflow==0.7.0 \
            Cython==0.29 \
            numpy==1.14.5 \
            pandas==0.23.4 \
            pyarrow==0.11.0
```

Note: We use PyArrow to run models as UDFs. The PyArrow and Numpy versions had to be pinned because the latest versions conflicted with each other.
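Because the note above depends on exact versions, it can help to verify the environment before running anything. The following is a minimal sketch and not part of the original setup: the `PINNED` table simply repeats the versions from the pip command, `find_mismatches` and `installed_versions` are hypothetical helper names, and the lookup assumes `importlib.metadata` is available (Python 3.8+; the Python 3.7 used in this article would need the `importlib_metadata` backport instead).

```python
# Versions copied from the pip command in this article.
PINNED = {
    "mlflow": "0.7.0",
    "Cython": "0.29",
    "numpy": "1.14.5",
    "pandas": "0.23.4",
    "pyarrow": "0.11.0",
}


def find_mismatches(installed, pinned=PINNED):
    """Return {package: (expected, found)} for every pin that is not met.

    `installed` maps a package name to its version string, or to None
    when the package is missing.
    """
    mismatches = {}
    for name, expected in pinned.items():
        found = installed.get(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


def installed_versions(names):
    """Look up installed versions (assumes importlib.metadata is available)."""
    from importlib import metadata

    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    found = installed_versions(PINNED)
    for name, (expected, actual) in find_mismatches(found).items():
        print("{}: expected {}, found {}".format(name, expected, actual))
```

Running the script inside the `mlflow` virtual environment should print nothing when every pin is satisfied.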

Launching the Tracking UI

MLflow Tracking lets us log and query experiments using Python and the REST API. In addition, you can choose where to store model artifacts (localhost, Amazon S3, Azure Blob Storage, Google Cloud Storage or an SFTP server). Since we use AWS at Alpha Health, our artifact storage will be S3.

# Running a Tracking Server
mlflow server \
    --file-store /tmp/mlflow/fileStore \
    --default-artifact-root s3://<bucket>/mlflow/artifacts/ \
    --host localhost \
    --port 5000

MLflow recommends using persistent file storage. The file store is where the server keeps run and experiment metadata. When starting the server, make sure it points to a persistent file store. Here, for the sake of the experiment, we will simply use /tmp.

Keep in mind that if we want to use the mlflow server to run old experiments, they must be present in the file store. However, even without this we could still use them in a UDF, since all we need is the path to the model.

Note: Keep in mind that the Tracking UI and the model client must both have access to the artifact location. That is, even though the Tracking UI lives on an EC2 instance, when MLflow runs locally the machine must have direct access to S3 in order to write artifact models.

Tracking UI storing artifacts in an S3 bucket

Running models

Once the Tracking server is running, you can start training models.

As an example, we will use the wine modification of the MLflow example in Sklearn.

MLFLOW_TRACKING_URI=http://localhost:5000 python wine_quality.py \
  --alpha 0.9 \
  --l1_ratio 0.5 \
  --wine_file ./data/winequality-red.csv

As we have already discussed, MLflow lets you log model parameters, metrics and artifacts so that you can track how they evolve across iterations. This feature is extremely useful because it lets us reproduce the best model by contacting the Tracking server, or work out which code produced a given iteration by using the git commit hashes recorded in the logs.

import mlflow
import mlflow.sklearn

with mlflow.start_run():

    # ... model ... (train `lr` and compute rmse, r2 and mae here)

    mlflow.log_param("source", wine_path)
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)

    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    mlflow.set_tag('domain', 'wine')
    mlflow.set_tag('predict', 'quality')
    mlflow.sklearn.log_model(lr, "model")

Wine iterations
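The logging snippet above records `rmse`, `r2` and `mae` without showing where they come from. The article does not say how they were computed, and the upstream example presumably relies on `sklearn.metrics`. The sketch below is a dependency-free illustration of the three formulas; `regression_metrics` is a hypothetical helper name, not part of MLflow or of the original script.

```python
import math


def regression_metrics(actual, predicted):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    # Root mean squared error and mean absolute error.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n

    # Coefficient of determination: 1 - residual sum of squares / total sum of squares.
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot

    return rmse, mae, r2


rmse, mae, r2 = regression_metrics([5, 6, 7], [5.5, 5.5, 6.0])
```

The resulting values could then be passed to `mlflow.log_metric` exactly as in the snippet above.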

Serving the model

The MLflow tracking server, launched with the "mlflow server" command, exposes a REST API for tracking runs and writing data to the local file system. You can specify the tracking server address with the "MLFLOW_TRACKING_URI" environment variable, and the MLflow tracking API will automatically contact the tracking server at that address to create or fetch run information, log metrics, and so on.
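To make the lookup described above concrete, here is a simplified illustration of how a client could resolve the tracking address. It is a sketch under stated assumptions and not MLflow's own code: `resolve_tracking_uri` and `is_remote_server` are hypothetical names, and the `./mlruns` fallback reflects our understanding of MLflow's local default when the variable is not set.

```python
import os
from urllib.parse import urlparse

# Assumption: MLflow falls back to a local ./mlruns directory when nothing is configured.
DEFAULT_LOCAL_STORE = "./mlruns"


def resolve_tracking_uri(environ=None):
    """Return the tracking URI from the environment, or the local default."""
    env = os.environ if environ is None else environ
    return env.get("MLFLOW_TRACKING_URI") or DEFAULT_LOCAL_STORE


def is_remote_server(uri):
    """True when the URI points at an HTTP(S) server such as `mlflow server`."""
    return urlparse(uri).scheme in ("http", "https")


uri = resolve_tracking_uri({"MLFLOW_TRACKING_URI": "http://localhost:5000"})
```

With the variable exported as in the commands of this article, `uri` is the address of the local tracking server; without it, runs would be written to the local directory instead.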

Source: Docs // Running a tracking server

To serve a model we need a running tracking server (see Launching the Tracking UI) and the Run ID of the model.

Run ID

# Serve a sklearn model through 127.0.0.1:5005
MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow sklearn serve \
  --port 5005 \
  --run_id 0f8691808e914d1087cf097a08730f17 \
  --model-path model

To serve models with the MLflow serve functionality, we need access to the Tracking UI so that information about the model can be retrieved simply by specifying --run_id.

Once the model has contacted the Tracking server, we get a new model endpoint.

# Query Tracking Server Endpoint
curl -X POST \
  http://127.0.0.1:5005/invocations \
  -H 'Content-Type: application/json' \
  -d '[
	{
		"fixed acidity": 3.42,
		"volatile acidity": 1.66,
		"citric acid": 0.48,
		"residual sugar": 4.2,
		"chlorides": 0.229,
		"free sulfur dioxide": 19,
		"total sulfur dioxide": 25, 
		"density": 1.98, 
		"pH": 5.33, 
		"sulphates": 4.39, 
		"alcohol": 10.8
	}
]'

> {"predictions": [5.825055635303461]}
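The same request can be sent from Python. The sketch below uses only the standard library; the URL and the feature values are taken from the curl call above, while `build_request` and `predict` are hypothetical helper names. It assumes the model server from the previous step is still listening on port 5005.

```python
import json
import urllib.request


def build_request(url, rows):
    """Build a JSON POST request carrying a list of feature dictionaries."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(url, rows, timeout=10):
    """Send the rows to the model endpoint and return the decoded JSON reply."""
    with urllib.request.urlopen(build_request(url, rows), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


sample = [{
    "fixed acidity": 3.42, "volatile acidity": 1.66, "citric acid": 0.48,
    "residual sugar": 4.2, "chlorides": 0.229, "free sulfur dioxide": 19,
    "total sulfur dioxide": 25, "density": 1.98, "pH": 5.33,
    "sulphates": 4.39, "alcohol": 10.8,
}]
request = build_request("http://127.0.0.1:5005/invocations", sample)
# predict("http://127.0.0.1:5005/invocations", sample)  # needs the running server
```

Keeping request construction separate from the network call makes the payload easy to inspect before anything is sent.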

Running models from Spark

Although the Tracking server is powerful enough to serve models in real time, train them and use the serving functionality (source: mlflow // docs // models # local), using Spark (batch or streaming) is an even more powerful solution because of its distributed nature.

Imagine that you did the training offline and then applied the resulting model to all of your data. This is where Spark and MLflow shine.

Install PySpark + Jupyter + Spark

Source: Get started PySpark - Jupyter

To show how we apply MLflow models to Spark dataframes, we need to set up Jupyter notebooks to work together with PySpark.

Start by installing the latest stable version of Apache Spark:

cd ~/Downloads/
tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
mv ~/Downloads/spark-2.4.3-bin-hadoop2.7 ~/
ln -s ~/spark-2.4.3-bin-hadoop2.7 ~/spark

Install PySpark and Jupyter in the virtual environment:

pip install pyspark jupyter

Set up the environment variables:

export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"

By defining notebook-dir, we can keep our notebooks in the folder we want.

Launching Jupyter from PySpark

Since we were able to configure Jupyter as the PySpark driver, we can now run Jupyter notebooks in the PySpark context.

(mlflow) afranzi:~$ pyspark
[I 19:05:01.572 NotebookApp] sparkmagic extension enabled!
[I 19:05:01.573 NotebookApp] Serving notebooks from local directory: /Users/afranzi/Projects/notebooks
[I 19:05:01.573 NotebookApp] The Jupyter Notebook is running at:
[I 19:05:01.573 NotebookApp] http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
[I 19:05:01.573 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:05:01.574 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745


As mentioned above, MLflow provides a feature for logging model artifacts to S3. Once we have the selected model in our hands, we can import it as a UDF using the mlflow.pyfunc module.

import mlflow.pyfunc

model_path = 's3://<bucket>/mlflow/artifacts/1/0f8691808e914d1087cf097a08730f17/artifacts/model'
wine_path = '/Users/afranzi/Projects/data/winequality-red.csv'
wine_udf = mlflow.pyfunc.spark_udf(spark, model_path)

df = spark.read.format("csv").option("header", "true").option('delimiter', ';').load(wine_path)
columns = [ "fixed acidity", "volatile acidity", "citric acid",
            "residual sugar", "chlorides", "free sulfur dioxide",
            "total sulfur dioxide", "density", "pH",
            "sulphates", "alcohol"
          ]
          
df.withColumn('prediction', wine_udf(*columns)).show(100, False)

PySpark - Output of wine quality predictions
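One detail worth noting: a CSV read without schema inference yields string columns. The article's output suggests the UDF coped with that as is, but making the types explicit removes any ambiguity. The helper below is a minimal sketch; `cast_to_double_exprs` is a hypothetical name, and the commented usage assumes the `df`, `columns` and `wine_udf` objects from the PySpark snippet above.

```python
def cast_to_double_exprs(columns):
    """Build Spark SQL expressions that cast each column to DOUBLE, keeping its name.

    Backticks are needed because the wine columns contain spaces.
    """
    return ["CAST(`{0}` AS DOUBLE) AS `{0}`".format(name) for name in columns]


feature_exprs = cast_to_double_exprs(["fixed acidity", "pH"])

# Usage with the objects defined earlier (requires a Spark session):
# typed_df = df.selectExpr(*cast_to_double_exprs(columns))
# typed_df.withColumn('prediction', wine_udf(*columns)).show(100, False)
```

Passing `.option("inferSchema", "true")` to the CSV reader would be another way to obtain numeric columns, at the cost of an extra pass over the file.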

Up to this point we have talked about how to use PySpark with MLflow, running wine quality predictions on the whole wine dataset. But what if you need to use Python MLflow modules from Scala Spark?

We tested this too, by sharing the Spark context between Scala and Python. That is, we registered the MLflow UDF in Python and used it from Scala (yes, perhaps not the best solution, but it is what we have).

Scala Spark + MLflow

For this example we will add the Toree kernel to the existing Jupyter.

Install Spark + Toree + Jupyter

```
pip install toree
jupyter toree install --spark_home=${SPARK_HOME} --sys-prefix
jupyter kernelspec list
```
```
Available kernels:
  apache_toree_scala    /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/apache_toree_scala
  python3               /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/python3
```

As you can see from the attached notebook, the UDF is shared between Spark and PySpark. We hope this part will be useful to those who love Scala and want to deploy machine learning models to production.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{Column, DataFrame}
import scala.util.matching.Regex

val FirstAtRe: Regex = "^_".r
val AliasRe: Regex = """[\s_.:@]+""".r

def getFieldAlias(field_name: String): String = {
    FirstAtRe.replaceAllIn(AliasRe.replaceAllIn(field_name, "_"), "")
}

def selectFieldsNormalized(columns: List[String])(df: DataFrame): DataFrame = {
    val fieldsToSelect: List[Column] = columns.map(field =>
        col(field).as(getFieldAlias(field))
    )
    df.select(fieldsToSelect: _*)
}

def normalizeSchema(df: DataFrame): DataFrame = {
    val schema = df.columns.toList
    df.transform(selectFieldsNormalized(schema))
}

FirstAtRe = ^_
AliasRe = [\s_.:@]+

getFieldAlias: (field_name: String)String
selectFieldsNormalized: (columns: List[String])(df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
normalizeSchema: (df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
Out[1]:
[\s_.:@]+
In [2]:
val winePath = "~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv"
val modelPath = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"

winePath = ~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv
modelPath = /tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
Out[2]:
/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
In [3]:
val df = spark.read
              .format("csv")
              .option("header", "true")
              .option("delimiter", ";")
              .load(winePath)
              .transform(normalizeSchema)

df = [fixed_acidity: string, volatile_acidity: string ... 10 more fields]
Out[3]:
[fixed_acidity: string, volatile_acidity: string ... 10 more fields]
In [4]:
%%PySpark
import mlflow
from mlflow import pyfunc

model_path = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"
wine_quality_udf = mlflow.pyfunc.spark_udf(spark, model_path)

spark.udf.register("wineQuality", wine_quality_udf)
Out[4]:
<function spark_udf.<locals>.predict at 0x1116a98c8>
In [6]:
df.createOrReplaceTempView("wines")
In [10]:
%%SQL
SELECT 
    quality,
    wineQuality(
        fixed_acidity,
        volatile_acidity,
        citric_acid,
        residual_sugar,
        chlorides,
        free_sulfur_dioxide,
        total_sulfur_dioxide,
        density,
        pH,
        sulphates,
        alcohol
    ) AS prediction
FROM wines
LIMIT 10
Out[10]:
+-------+------------------+
|quality|        prediction|
+-------+------------------+
|      5| 5.576883967129615|
|      5|  5.50664776916154|
|      5| 5.525504822954496|
|      6| 5.504311247097457|
|      5| 5.576883967129615|
|      5|5.5556903912725755|
|      5| 5.467882654744997|
|      7| 5.710602976324739|
|      7| 5.657319539336507|
|      5| 5.345098606538708|
+-------+------------------+

In [17]:
spark.catalog.listFunctions.filter('name like "%wineQuality%").show(20, false)

+-----------+--------+-----------+---------+-----------+
|name       |database|description|className|isTemporary|
+-----------+--------+-----------+---------+-----------+
|wineQuality|null    |null       |null     |true       |
+-----------+--------+-----------+---------+-----------+

Next steps

Although MLflow is in alpha at the time of writing, it looks quite promising. The ability to run multiple machine learning frameworks and consume them from a single endpoint alone takes recommender systems to the next level.

In addition, MLflow brings data engineers and data scientists closer together by laying a common layer between them.

After this exploration of MLflow, we are confident that we will go ahead and use it for our Spark pipelines and recommender systems.

It would be nice to synchronize the file store with a database instead of the file system. This should give us multiple endpoints that can use the same file store, for example several Presto and Athena instances sharing the same Glue metastore.

To sum up, I would like to thank the MLFlow community for making our work with data more interesting.

If you are playing around with MLflow, don't hesitate to write to us and tell us how you use it, and even more so if you use it in production.

Learn more about the courses:
Machine Learning. Basic course
Machine Learning. Advanced course


Source: www.habr.com
