Extending Spark with MLflow

Hello, Habr readers! As we have already written, this month OTUS is launching two machine learning courses at once: a basic one and an advanced one. On that occasion, we continue to share useful material.

The goal of this article is to describe our first experience of using MLflow.

We will start with an overview of MLflow and its tracking server, logging every iteration of the study along the way. Then we will share our experience of connecting Spark to MLflow using UDFs.

Context

At Alpha Health we use machine learning and artificial intelligence to empower people to take charge of their health and well-being. That is why machine learning models sit at the heart of the data science products we build, and why MLflow, an open-source platform covering all aspects of the machine learning lifecycle, caught our attention.

MLflow

MLflow's main goal is to provide an extra layer on top of machine learning that lets data scientists work with almost any machine learning library (h2o, keras, mleap, pytorch, sklearn and tensorflow), taking their work to the next level.

MLflow provides three components:

  • Tracking – recording and querying experiments: code, data, configuration and results. Tracking the model creation process is critically important.
  • Projects – a packaging format for running on any platform (e.g. SageMaker)
  • Models – a common format for submitting models to different deployment tools.

MLflow (in alpha at the time of writing) is an open-source platform that lets you manage the machine learning lifecycle, including experimentation, reuse and deployment.

Setting up MLflow

To use MLflow, you first need to set up your entire Python environment; for this we will use PyEnv (to install Python on a Mac, see here). This way we can create a virtual environment with all the libraries needed to run it.

```
pyenv install 3.7.0
pyenv global 3.7.0 # Use Python 3.7
mkvirtualenv mlflow # Create a Virtual Env with Python 3.7
workon mlflow
```

Let's install the required libraries.

```
pip install mlflow==0.7.0 \
            Cython==0.29 \
            numpy==1.14.5 \
            pandas==0.23.4 \
            pyarrow==0.11.0
```

Note: we use PyArrow to run models as UDFs. The PyArrow and Numpy versions had to be pinned because the latest versions conflicted with each other.

Launching the Tracking UI

MLflow Tracking lets us log and query experiments via the Python and REST APIs. Besides that, you can choose where to store model artifacts (localhost, Amazon S3, Azure Blob Storage, Google Cloud Storage or an SFTP server). Since we use AWS at Alpha Health, S3 will be our artifact store.

```
# Running a Tracking Server
mlflow server \
    --file-store /tmp/mlflow/fileStore \
    --default-artifact-root s3://<bucket>/mlflow/artifacts/ \
    --host localhost \
    --port 5000
```

MLflow recommends using persistent file storage. The file store is where the server will keep run and experiment metadata. When starting the server, make sure it points to a persistent file store. Here, for the sake of the experiment, we will simply use /tmp.

Keep in mind that if we want to use the mlflow server to serve old experiments, they must be present in the file store. However, even without that we could use them in a UDF, since all we need is the path to the model.
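Since a UDF only needs the model path, that path can be assembled from the experiment and run IDs under the artifact root. A minimal sketch — `model_artifact_path` is our own illustrative helper, not an MLflow API; it assumes the default `<experiment_id>/<run_id>/artifacts/<model>` layout that MLflow writes:

```
def model_artifact_path(artifact_root, experiment_id, run_id, model_name="model"):
    """Build the path of a logged model under the artifact root (hypothetical helper)."""
    return "{}/{}/{}/artifacts/{}".format(
        artifact_root.rstrip("/"), experiment_id, run_id, model_name)

path = model_artifact_path("s3://<bucket>/mlflow/artifacts", "1",
                           "0f8691808e914d1087cf097a08730f17")
print(path)  # s3://<bucket>/mlflow/artifacts/1/0f8691808e914d1087cf097a08730f17/artifacts/model
```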

Note: keep in mind that the Tracking UI and the model client must have access to the artifact location. That is, regardless of the fact that the Tracking UI lives on an EC2 instance, when MLflow is run locally the machine must have direct access to S3 in order to write the model artifacts.

(Screenshot: the Tracking UI storing artifacts in an S3 bucket)

Running models

As soon as the tracking server is up, you can start training models.

As an example, we will use a modified wine example from MLflow in Sklearn.

```
MLFLOW_TRACKING_URI=http://localhost:5000 python wine_quality.py \
  --alpha 0.9 \
  --l1_ratio 0.5 \
  --wine_file ./data/winequality-red.csv
```

As we already mentioned, MLflow lets you log model parameters, metrics and artifacts so you can track how they evolve over iterations. This feature is extremely useful, since it lets us reproduce the best model by querying the tracking server, or understand which code performed the required iteration via the git hash logs of the commits.

```
with mlflow.start_run():

    ... model ...

    mlflow.log_param("source", wine_path)
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)

    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    mlflow.set_tag('domain', 'wine')
    mlflow.set_tag('predict', 'quality')
    mlflow.sklearn.log_model(lr, "model")
```

(Screenshot: iterations of the wine model in the Tracking UI)

Serving models

The MLflow tracking server, launched via the `mlflow server` command, has a REST API for tracking runs and writing data to the local file system. You can specify the tracking server address with the `MLFLOW_TRACKING_URI` environment variable, and the MLflow tracking API will automatically contact the tracking server at that address to create/fetch run information, log metrics and so on.

Source: Docs // Running a tracking server

To serve a model, we need a running tracking server (see the launch above) and the model's Run ID.

(Screenshot: the model's Run ID in the Tracking UI)

```
# Serve a sklearn model through 127.0.0.1:5005
MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow sklearn serve \
  --port 5005 \
  --run_id 0f8691808e914d1087cf097a08730f17 \
  --model-path model
```

To serve models with MLflow's serve functionality, we need access to the Tracking UI so that it can look up information about the model simply from the --run_id we specify.

Once the model contacts the tracking server, we get a new model endpoint.

```
# Query Tracking Server Endpoint
curl -X POST \
  http://127.0.0.1:5005/invocations \
  -H 'Content-Type: application/json' \
  -d '[
	{
		"fixed acidity": 3.42,
		"volatile acidity": 1.66,
		"citric acid": 0.48,
		"residual sugar": 4.2,
		"chlorides": 0.229,
		"free sulfur dioxide": 19,
		"total sulfur dioxide": 25,
		"density": 1.98,
		"pH": 5.33,
		"sulphates": 4.39,
		"alcohol": 10.8
	}
]'
```
```
> {"predictions": [5.825055635303461]}
```
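The same endpoint can also be queried from Python. A sketch using only the standard library — the URL and port assume the serve command above, and `build_payload`/`predict` are illustrative names of our own, not MLflow API:

```
import json
from urllib import request

def build_payload(records):
    """Serialize a list of feature dicts into the JSON body the endpoint expects."""
    return json.dumps(records).encode("utf-8")

def predict(records, url="http://127.0.0.1:5005/invocations"):
    """POST the records to the invocations endpoint and decode the response."""
    req = request.Request(url, data=build_payload(records),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

wine = {"fixed acidity": 3.42, "volatile acidity": 1.66, "citric acid": 0.48,
        "residual sugar": 4.2, "chlorides": 0.229, "free sulfur dioxide": 19,
        "total sulfur dioxide": 25, "density": 1.98, "pH": 5.33,
        "sulphates": 4.39, "alcohol": 10.8}

if __name__ == "__main__":
    print(predict([wine]))  # a dict with a "predictions" list, as in the curl example
```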

Running models on Spark

Even though the tracking server is powerful enough to serve models in real time, train them and use the serve functionality (source: mlflow // docs // models # local), using Spark (batch or streaming) is an even more powerful solution thanks to its distribution.

Just imagine you did the training offline and then applied the output model to all of your data. This is where Spark and MLflow shine.

Installing PySpark + Jupyter + Spark

Source: Get Started PySpark - Jupyter

To show how we apply MLflow models to Spark dataframes, we need to set up Jupyter notebooks to work together with PySpark.

Start by installing the latest stable version of Apache Spark:

```
cd ~/Downloads/
tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
mv ~/Downloads/spark-2.4.3-bin-hadoop2.7 ~/
ln -s ~/spark-2.4.3-bin-hadoop2.7 ~/spark
```

Install PySpark and Jupyter in the virtual environment:

```
pip install pyspark jupyter
```

Set the environment variables:

```
export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"
```

By setting notebook-dir, we can store our notebooks in the desired folder.

Launching Jupyter from PySpark

Since we have configured Jupyter as the PySpark driver, we can now run Jupyter notebooks in a PySpark context.

```
(mlflow) afranzi:~$ pyspark
[I 19:05:01.572 NotebookApp] sparkmagic extension enabled!
[I 19:05:01.573 NotebookApp] Serving notebooks from local directory: /Users/afranzi/Projects/notebooks
[I 19:05:01.573 NotebookApp] The Jupyter Notebook is running at:
[I 19:05:01.573 NotebookApp] http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
[I 19:05:01.573 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:05:01.574 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
```

Extending Spark with MLflow

As noted above, MLflow provides a way to log model artifacts to S3. As soon as we have the chosen model in hand, we can import it as a UDF using the mlflow.pyfunc module.

```
import mlflow.pyfunc

model_path = 's3://<bucket>/mlflow/artifacts/1/0f8691808e914d1087cf097a08730f17/artifacts/model'
wine_path = '/Users/afranzi/Projects/data/winequality-red.csv'
wine_udf = mlflow.pyfunc.spark_udf(spark, model_path)

df = spark.read.format("csv").option("header", "true").option('delimiter', ';').load(wine_path)
columns = [ "fixed acidity", "volatile acidity", "citric acid",
            "residual sugar", "chlorides", "free sulfur dioxide",
            "total sulfur dioxide", "density", "pH",
            "sulphates", "alcohol"
          ]

df.withColumn('prediction', wine_udf(*columns)).show(100, False)
```

(Screenshot: PySpark outputting wine quality predictions)

So far we have talked about how to use PySpark with MLflow, running wine quality predictions on the entire wine dataset. But what if you need to use the Python MLflow modules from Scala Spark?

We tested this by splitting the Spark context between Scala and Python. That is, we registered the MLflow UDF in Python, and used it from Scala (yes, perhaps not the best solution, but it is what we have).

Scala Spark + MLflow

In this example we will add the Toree kernel to the existing Jupyter.

Install Spark + Toree + Jupyter

```
pip install toree
jupyter toree install --spark_home=${SPARK_HOME} --sys-prefix
jupyter kernelspec list
```
```
Available kernels:
  apache_toree_scala    /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/apache_toree_scala
  python3               /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/python3
```

As you can see from the attached notebook, the UDF is shared between Spark and PySpark. We hope this part will be useful to those who love Scala and want to deploy machine learning models to production.

```
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{Column, DataFrame}
import scala.util.matching.Regex

val FirstAtRe: Regex = "^_".r
val AliasRe: Regex = "[\\s_.:@]+".r

def getFieldAlias(field_name: String): String = {
    FirstAtRe.replaceAllIn(AliasRe.replaceAllIn(field_name, "_"), "")
}

def selectFieldsNormalized(columns: List[String])(df: DataFrame): DataFrame = {
    val fieldsToSelect: List[Column] = columns.map(field =>
        col(field).as(getFieldAlias(field))
    )
    df.select(fieldsToSelect: _*)
}

def normalizeSchema(df: DataFrame): DataFrame = {
    val schema = df.columns.toList
    df.transform(selectFieldsNormalized(schema))
}
```

```
FirstAtRe = ^_
AliasRe = [\s_.:@]+

getFieldAlias: (field_name: String)String
selectFieldsNormalized: (columns: List[String])(df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
normalizeSchema: (df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
Out[1]:
[\s_.:@]+
```
In [2]:
```
val winePath = "~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv"
val modelPath = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"
```
```
winePath = ~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv
modelPath = /tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
Out[2]:
/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
```
In [3]:
```
val df = spark.read
              .format("csv")
              .option("header", "true")
              .option("delimiter", ";")
              .load(winePath)
              .transform(normalizeSchema)
```
```
df = [fixed_acidity: string, volatile_acidity: string ... 10 more fields]
Out[3]:
[fixed_acidity: string, volatile_acidity: string ... 10 more fields]
```
In [4]:
```
%%PySpark
import mlflow
from mlflow import pyfunc

model_path = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"
wine_quality_udf = mlflow.pyfunc.spark_udf(spark, model_path)

spark.udf.register("wineQuality", wine_quality_udf)
```
```
Out[4]:
<function spark_udf.<locals>.predict at 0x1116a98c8>
```
In [6]:
```
df.createOrReplaceTempView("wines")
```
In [10]:
```
%%SQL
SELECT
    quality,
    wineQuality(
        fixed_acidity,
        volatile_acidity,
        citric_acid,
        residual_sugar,
        chlorides,
        free_sulfur_dioxide,
        total_sulfur_dioxide,
        density,
        pH,
        sulphates,
        alcohol
    ) AS prediction
FROM wines
LIMIT 10
```
```
Out[10]:
+-------+------------------+
|quality|        prediction|
+-------+------------------+
|      5| 5.576883967129615|
|      5|  5.50664776916154|
|      5| 5.525504822954496|
|      6| 5.504311247097457|
|      5| 5.576883967129615|
|      5|5.5556903912725755|
|      5| 5.467882654744997|
|      7| 5.710602976324739|
|      7| 5.657319539336507|
|      5| 5.345098606538708|
+-------+------------------+
```

In [17]:
```
spark.catalog.listFunctions.filter('name like "%wineQuality%").show(20, false)

+-----------+--------+-----------+---------+-----------+
|name       |database|description|className|isTemporary|
+-----------+--------+-----------+---------+-----------+
|wineQuality|null    |null       |null     |true       |
+-----------+--------+-----------+---------+-----------+
```
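For readers who prefer to keep the column normalization on the Python side, the Scala `getFieldAlias` defined earlier can be mirrored with `re`. A sketch — `get_field_alias` is our own naming, and the two regexes are intended to match `FirstAtRe` and `AliasRe` above:

```
import re

ALIAS_RE = re.compile(r"[\s_.:@]+")  # runs of separator characters become "_"
FIRST_AT_RE = re.compile(r"^_")      # then strip a leading underscore

def get_field_alias(field_name):
    """Normalize a CSV header such as 'fixed acidity' to 'fixed_acidity'."""
    return FIRST_AT_RE.sub("", ALIAS_RE.sub("_", field_name))

print(get_field_alias("fixed acidity"))        # fixed_acidity
print(get_field_alias(" free sulfur dioxide")) # free_sulfur_dioxide
```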

Next steps

Even though MLflow was in alpha at the time of writing, it looks quite promising. The mere ability to run multiple machine learning frameworks and consume them from a single endpoint takes recommender systems to the next level.

Besides that, MLflow brings data engineers and data scientists closer together, laying a common layer between them.

After this exploration of MLflow, we are confident that we will move forward and use it for our Spark pipelines and recommender systems.

It would be nice to synchronize the file store with a database instead of the file system. This should give us multiple endpoints that can share the same file store. For example, use several Presto and Athena instances with the same Glue metastore.

To sum up, I would like to thank the MLflow community for making our work with data more interesting.

If you are playing around with MLflow, don't hesitate to write to us and tell us how you use it, and even more so if you use it in production.

Learn more about the courses:
Machine Learning. Basic course
Machine Learning. Advanced course


Source: www.habr.com
