Extending Spark with MLflow

Hello, Habr readers. As we have already written, this month OTUS is launching two machine learning courses at once: basic and advanced. With that in mind, we continue to share useful material.

The purpose of this article is to describe our first experience of using MLflow.

We will start the review of MLflow with its tracking server and log all iterations of the study. Then we will share our experience of connecting Spark with MLflow using a UDF.

Context

At Alpha Health we use machine learning and artificial intelligence to empower people to take care of their health and well-being. That is why machine learning models are at the core of the data science products we build, and why we were drawn to MLflow, an open-source platform that covers all aspects of the machine learning lifecycle.

MLflow

The main goal of MLflow is to provide an additional layer on top of machine learning that lets data scientists work with almost any machine learning library (h2o, keras, mleap, pytorch, sklearn, and tensorflow), taking their work to the next level.

MLflow provides three components:

  • Tracking - recording and querying experiments: code, data, configuration, and results. Keeping track of the modelling process is very important.
  • Projects - a packaging format for running on any platform (for example, SageMaker).
  • Models - a common format for handing models over to various deployment tools.

MLflow (in alpha at the time of writing) is an open-source platform that lets you manage the machine learning lifecycle, including experimentation, reuse, and deployment.

Setting up MLflow

To use MLflow you first need to set up your whole Python environment. For this we will use PyEnv (to install Python on a Mac, check here). This way we can create a virtual environment in which we install all the libraries needed to run it.

```
pyenv install 3.7.0
pyenv global 3.7.0 # Use Python 3.7
mkvirtualenv mlflow # Create a Virtual Env with Python 3.7
workon mlflow
```

Let's install the required libraries.

```
pip install mlflow==0.7.0 \
            Cython==0.29 \
            numpy==1.14.5 \
            pandas==0.23.4 \
            pyarrow==0.11.0
```

Note: We use PyArrow to run models as UDFs. The PyArrow and Numpy versions had to be pinned because the latest versions conflicted with each other.

Launching the Tracking UI

MLflow Tracking lets us log and query experiments using Python and the REST API. In addition, you can define where model artifacts are stored (localhost, Amazon S3, Azure Blob Storage, Google Cloud Storage, or an SFTP server). Since we use AWS at Alpha Health, our artifact store will be S3.

# Running a Tracking Server
mlflow server \
    --file-store /tmp/mlflow/fileStore \
    --default-artifact-root s3://<bucket>/mlflow/artifacts/ \
    --host localhost \
    --port 5000

MLflow recommends using persistent file storage. The file store is where the server keeps run and experiment metadata. When starting the server, make sure it points to a persistent file store. Here, for the sake of the experiment, we will simply use /tmp.

Keep in mind that if we want to use the mlflow server to run old experiments, they must be present in the file store. However, even without this we could still use them in a UDF, since we only need the path to the model.

Note: Keep in mind that both the Tracking UI and the model client must have access to the artifact location. That is, even though the Tracking UI lives on an EC2 instance, when you run MLflow locally your machine must have direct access to S3 in order to write model artifacts.

The Tracking UI stores artifacts in an S3 bucket

Running models

As soon as the Tracking server is running, you can start training models.

As an example, we will use a modified version of the wine example from the MLflow Sklearn examples.

MLFLOW_TRACKING_URI=http://localhost:5000 python wine_quality.py \
  --alpha 0.9 \
  --l1_ratio 0.5 \
  --wine_file ./data/winequality-red.csv

As already discussed, MLflow lets you log model parameters, metrics, and artifacts so that you can track how they evolve across iterations. This is extremely useful: we can reproduce the best model by querying the Tracking server, or find out which code produced a given iteration from the git commit hashes that are logged.

with mlflow.start_run():

    ... model ...

    mlflow.log_param("source", wine_path)
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)

    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    mlflow.set_tag('domain', 'wine')
    mlflow.set_tag('predict', 'quality')
    mlflow.sklearn.log_model(lr, "model")
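The snippet above logs rmse, r2, and mae but elides how they are obtained. As a minimal sketch, assuming the standard textbook definitions of these three metrics, they could be computed in plain Python as follows. The function name eval_metrics and the sample values are illustrative assumptions, not taken from the original script.

```python
import math


def eval_metrics(actual, predicted):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)                   # root mean squared error
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    # R^2 is undefined when all actual values are identical; return NaN then.
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    return rmse, mae, r2


# Illustrative values only (not results from the article).
rmse, mae, r2 = eval_metrics([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

The three returned values are what would be passed to mlflow.log_metric inside the run.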

Wine iterations

Serving the model

The MLflow tracking server, launched with the "mlflow server" command, has a REST API for tracking runs and writing data to the local file system. You can specify the tracking server address with the "MLFLOW_TRACKING_URI" environment variable, and the MLflow tracking API will automatically contact the tracking server at that address to create or retrieve run information, log metrics, and so on.

Source: Docs// Running a tracking server
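When the training script is launched from Python rather than from a shell, the same variable can be set on the child process environment. The following is only a sketch: the helper name training_env and the default address are assumptions based on the server command shown earlier in this article.

```python
import os

# Assumption: the tracking server was started on localhost:5000 as shown earlier.
DEFAULT_TRACKING_URI = "http://localhost:5000"


def training_env(tracking_uri=DEFAULT_TRACKING_URI, base_env=None):
    """Return a copy of the environment with MLFLOW_TRACKING_URI set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MLFLOW_TRACKING_URI"] = tracking_uri
    return env


env = training_env()
# Hypothetical usage: hand env to a child process that runs the training script,
# e.g. subprocess.run(["python", "wine_quality.py"], env=env, check=True)
```

Copying the environment instead of mutating os.environ keeps the setting local to the child process.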

To serve the model, we need a running tracking server (see the launch setup above) and the model's Run ID.

Run ID

# Serve a sklearn model through 127.0.0.1:5005
MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow sklearn serve \
  --port 5005 \
  --run_id 0f8691808e914d1087cf097a08730f17 \
  --model-path model

To serve models with the MLflow serve functionality, we need access to the Tracking UI so that information about the model can be retrieved simply by specifying --run_id.

Once the model has contacted the Tracking server, we get a new model endpoint.

# Query Tracking Server Endpoint
curl -X POST \
  http://127.0.0.1:5005/invocations \
  -H 'Content-Type: application/json' \
  -d '[
    {
        "fixed acidity": 3.42,
        "volatile acidity": 1.66,
        "citric acid": 0.48,
        "residual sugar": 4.2,
        "chlorides": 0.229,
        "free sulfur dioxide": 19,
        "total sulfur dioxide": 25,
        "density": 1.98,
        "pH": 5.33,
        "sulphates": 4.39,
        "alcohol": 10.8
    }
]'

> {"predictions": [5.825055635303461]}
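The same request can be prepared from Python with only the standard library. This is a sketch that mirrors the curl call above; the helper name build_invocation_request is an assumption, and the request is only built here, not sent, because sending requires the model server from the previous step to be running.

```python
import json
import urllib.request


def build_invocation_request(rows, host="127.0.0.1", port=5005):
    """Build (but do not send) a POST request for the /invocations endpoint."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/invocations",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same feature values as in the curl example.
sample = [{
    "fixed acidity": 3.42, "volatile acidity": 1.66, "citric acid": 0.48,
    "residual sugar": 4.2, "chlorides": 0.229, "free sulfur dioxide": 19,
    "total sulfur dioxide": 25, "density": 1.98, "pH": 5.33,
    "sulphates": 4.39, "alcohol": 10.8,
}]
request = build_invocation_request(sample)

# To actually query a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```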

Running models from Spark

Although the Tracking server is powerful enough to serve models in real time, training them and using the serve functionality (source: mlflow // docs // models # local), using Spark (batch or streaming) is an even more powerful solution because of its distributed nature.

Imagine that you trained a model offline and then applied the resulting model to all of your data. This is where Spark and MLflow shine.

Install PySpark + Jupyter + Spark

Source: Get started PySpark - Jupyter

To show how we apply MLflow models to Spark dataframes, we need to set up Jupyter notebooks to work together with PySpark.

Start by installing the latest stable version of Apache Spark:

cd ~/Downloads/
tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
mv ~/Downloads/spark-2.4.3-bin-hadoop2.7 ~/
ln -s ~/spark-2.4.3-bin-hadoop2.7 ~/spark

Install PySpark and Jupyter in the virtual environment:

pip install pyspark jupyter

Set the environment variables:

export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"

By setting notebook-dir, we can keep our notebooks in the folder of our choice.

Launching Jupyter from PySpark

Since we have configured Jupyter as the PySpark driver, we can now run Jupyter notebooks in the PySpark context.

(mlflow) afranzi:~$ pyspark
[I 19:05:01.572 NotebookApp] sparkmagic extension enabled!
[I 19:05:01.573 NotebookApp] Serving notebooks from local directory: /Users/afranzi/Projects/notebooks
[I 19:05:01.573 NotebookApp] The Jupyter Notebook is running at:
[I 19:05:01.573 NotebookApp] http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
[I 19:05:01.573 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:05:01.574 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745

As mentioned above, MLflow provides a feature for logging model artifacts to S3. As soon as we have the selected model in our hands, we can import it as a UDF using the mlflow.pyfunc module.

import mlflow.pyfunc

model_path = 's3://<bucket>/mlflow/artifacts/1/0f8691808e914d1087cf097a08730f17/artifacts/model'
wine_path = '/Users/afranzi/Projects/data/winequality-red.csv'
wine_udf = mlflow.pyfunc.spark_udf(spark, model_path)

df = spark.read.format("csv").option("header", "true").option('delimiter', ';').load(wine_path)
columns = [ "fixed acidity", "volatile acidity", "citric acid",
            "residual sugar", "chlorides", "free sulfur dioxide",
            "total sulfur dioxide", "density", "pH",
            "sulphates", "alcohol"
          ]
          
df.withColumn('prediction', wine_udf(*columns)).show(100, False)

PySpark - Output of the wine quality predictions
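The model_path passed to spark_udf above follows a layout that can be read off the paths shown in this article: artifact root, then experiment ID, then run ID, then "artifacts/model". The sketch below only reproduces that observed pattern. The helper name model_artifact_uri is hypothetical, and the layout is an inference from these examples rather than a documented guarantee.

```python
def model_artifact_uri(artifact_root, experiment_id, run_id, artifact_path="model"):
    """Compose a model location following the path pattern seen in this article."""
    root = artifact_root.rstrip("/")  # tolerate a trailing slash on the root
    return f"{root}/{experiment_id}/{run_id}/artifacts/{artifact_path}"


# Reproduces the S3 path used in the PySpark example (bucket placeholder kept as-is).
uri = model_artifact_uri(
    "s3://<bucket>/mlflow/artifacts/",
    1,
    "0f8691808e914d1087cf097a08730f17",
)
```

Building the path from the run ID avoids copying long URIs by hand when switching between runs.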

Up to this point, we have talked about how to use PySpark with MLflow, running wine quality predictions over the whole wine dataset. But what if you need to use Python MLflow modules from Scala Spark?

We tested this too, by sharing the Spark context between Scala and Python. That is, we registered the MLflow UDF in Python and used it from Scala (yes, perhaps not the best solution, but it is what we have).

Scala Spark + MLflow

For this example we will add the Toree Kernel to the existing Jupyter installation.

Install Spark + Toree + Jupyter

```
pip install toree
jupyter toree install --spark_home=${SPARK_HOME} --sys-prefix
jupyter kernelspec list
```
```
Available kernels:
  apache_toree_scala    /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/apache_toree_scala
  python3               /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/python3
```

As you can see from the attached notebook, the UDF is shared between Spark and PySpark. We hope this part will be useful to those who love Scala and want to deploy machine learning models in production.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{Column, DataFrame}
import scala.util.matching.Regex

val FirstAtRe: Regex = "^_".r
val AliasRe: Regex = "[\\s_.:@]+".r

def getFieldAlias(field_name: String): String = {
    FirstAtRe.replaceAllIn(AliasRe.replaceAllIn(field_name, "_"), "")
}

def selectFieldsNormalized(columns: List[String])(df: DataFrame): DataFrame = {
    val fieldsToSelect: List[Column] = columns.map(field =>
        col(field).as(getFieldAlias(field))
    )
    df.select(fieldsToSelect: _*)
}

def normalizeSchema(df: DataFrame): DataFrame = {
    val schema = df.columns.toList
    df.transform(selectFieldsNormalized(schema))
}

FirstAtRe = ^_
AliasRe = [\s_.:@]+

getFieldAlias: (field_name: String)String
selectFieldsNormalized: (columns: List[String])(df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
normalizeSchema: (df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
Out[1]:
[\s_.:@]+
In [2]:
val winePath = "~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv"
val modelPath = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"

winePath = ~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv
modelPath = /tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
Out[2]:
/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
In [3]:
val df = spark.read
              .format("csv")
              .option("header", "true")
              .option("delimiter", ";")
              .load(winePath)
              .transform(normalizeSchema)

df = [fixed_acidity: string, volatile_acidity: string ... 10 more fields]
Out[3]:
[fixed_acidity: string, volatile_acidity: string ... 10 more fields]
In [4]:
%%PySpark
import mlflow
from mlflow import pyfunc

model_path = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"
wine_quality_udf = mlflow.pyfunc.spark_udf(spark, model_path)

spark.udf.register("wineQuality", wine_quality_udf)
Out[4]:
<function spark_udf.<locals>.predict at 0x1116a98c8>
In [6]:
df.createOrReplaceTempView("wines")
In [10]:
%%SQL
SELECT 
    quality,
    wineQuality(
        fixed_acidity,
        volatile_acidity,
        citric_acid,
        residual_sugar,
        chlorides,
        free_sulfur_dioxide,
        total_sulfur_dioxide,
        density,
        pH,
        sulphates,
        alcohol
    ) AS prediction
FROM wines
LIMIT 10
Out[10]:
+-------+------------------+
|quality|        prediction|
+-------+------------------+
|      5| 5.576883967129615|
|      5|  5.50664776916154|
|      5| 5.525504822954496|
|      6| 5.504311247097457|
|      5| 5.576883967129615|
|      5|5.5556903912725755|
|      5| 5.467882654744997|
|      7| 5.710602976324739|
|      7| 5.657319539336507|
|      5| 5.345098606538708|
+-------+------------------+

In [17]:
spark.catalog.listFunctions.filter('name like "%wineQuality%").show(20, false)

+-----------+--------+-----------+---------+-----------+
|name       |database|description|className|isTemporary|
+-----------+--------+-----------+---------+-----------+
|wineQuality|null    |null       |null     |true       |
+-----------+--------+-----------+---------+-----------+
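The SQL query in the notebook refers to columns such as fixed_acidity because normalizeSchema renames the CSV headers first. For readers more at home in Python, here is a sketch of the same renaming rule, ported from the Scala getFieldAlias above. The Python function names are our own, and the port assumes the whitespace-matching pattern [\s_.:@]+ is what the Scala code intends.

```python
import re

ALIAS_RE = re.compile(r"[\s_.:@]+")       # runs of whitespace, _, ., :, @
FIRST_UNDERSCORE_RE = re.compile(r"^_")   # a single leading underscore


def get_field_alias(field_name):
    """Collapse separator runs into "_" and drop one leading underscore."""
    return FIRST_UNDERSCORE_RE.sub("", ALIAS_RE.sub("_", field_name))


def normalize_columns(columns):
    """Apply get_field_alias to every column name, keeping the order."""
    return [get_field_alias(name) for name in columns]


normalized = normalize_columns(["fixed acidity", "free sulfur dioxide", "pH"])
```

Names without separators, such as pH, pass through unchanged, which is why the query can use them as they appear in the CSV header.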

Next steps

Although MLflow is in Alpha at the time of writing, it looks very promising. Just the ability to run multiple machine learning frameworks and consume them from a single endpoint takes recommender systems to the next level.

In addition, MLflow brings Data Engineers and Data Science specialists closer together, laying a common layer between them.

After this exploration of MLflow, we are confident that we will move forward and use it for our Spark pipelines and recommender systems.

It would be nice to synchronize the file store with a database instead of the file system. This should give us multiple endpoints that can use the same file store. For example, using multiple instances of Presto and Athena with the same Glue metastore.

To sum up, I would like to thank the MLFlow community for making our work with data more interesting.

If you are playing around with MLflow, do not hesitate to write to us and tell us how you use it, and even more so if you use it in production.

Learn more about the courses:
Machine Learning. Basic course
Machine Learning. Advanced course


Source: www.habr.com
