Extending Spark with MLflow

Hello, Habr readers. As we already wrote, this month OTUS is launching two machine learning courses at once: basic and advanced. On that occasion, we continue to share useful material.

The goal of this article is to describe our first experience using MLflow.

We will begin our review of MLflow with its tracking server and log all the iterations of the study. Then we will share our experience connecting Spark to MLflow using a UDF.

Background

At Alpha Health, we use machine learning and artificial intelligence to empower people to take charge of their health and well-being. That is why machine learning models sit at the heart of the data science products we build, and why we were drawn to MLflow, an open-source platform that covers every aspect of the machine learning lifecycle.

MLflow

MLflow's main goal is to provide an additional layer on top of machine learning that lets data scientists work with almost any machine learning library (H2O, Keras, MLeap, PyTorch, sklearn, and TensorFlow), taking their work to the next level.

MLflow provides three components:

  • Tracking – recording and querying experiments: code, data, configuration, and results. Keeping track of the model-building process is critically important.
  • Projects – a packaging format for running on any platform (for example, SageMaker)
  • Models – a common format for deploying models to different deployment tools.

MLflow (in alpha at the time of writing) is an open-source platform that lets you manage the machine learning lifecycle, including experimentation, reuse, and deployment.

Setting up MLflow

To use MLflow, you first need to set up your Python environment; for this we will use PyEnv (to install Python on a Mac, see here). This lets us create a virtual environment with all the libraries needed to run it.

```
pyenv install 3.7.0
pyenv global 3.7.0 # Use Python 3.7
mkvirtualenv mlflow # Create a Virtual Env with Python 3.7
workon mlflow
```

Let's install the required libraries.

```
pip install mlflow==0.7.0 \
            Cython==0.29 \
            numpy==1.14.5 \
            pandas==0.23.4 \
            pyarrow==0.11.0
```

Note: we use PyArrow to run models as UDFs. The PyArrow and NumPy versions are pinned because the latest versions conflicted with each other.

Launching the Tracking UI

MLflow Tracking lets us log and query experiments using the Python and REST APIs. In addition, you can define where to store model artifacts (localhost, Amazon S3, Azure Blob Storage, Google Cloud Storage, or an SFTP server). Since we use AWS at Alpha Health, our artifact storage will be S3.

# Running a Tracking Server
mlflow server \
    --file-store /tmp/mlflow/fileStore \
    --default-artifact-root s3://<bucket>/mlflow/artifacts/ \
    --host localhost \
    --port 5000

MLflow recommends using persistent file storage. The file store is where the server keeps run and experiment metadata. When starting the server, make sure it points to persistent file storage. Here, for the experiment, we will simply use /tmp.

Keep in mind that if we want to use the MLflow server to work with old experiments, they must be present in the file store. However, even without this we could use them in a UDF, since all we need is the path to the model.

Note: the Tracking UI and the model client must have access to the artifact location. That is, regardless of the fact that the Tracking UI lives on an EC2 instance, when running MLflow locally the machine must have direct access to S3 to write model artifacts.

[Image: the Tracking UI stores artifacts in an S3 bucket]

Running models

As soon as the Tracking server is running, you can start training models.

As an example, we will use the wine-quality example from MLflow's sklearn examples.

MLFLOW_TRACKING_URI=http://localhost:5000 python wine_quality.py \
  --alpha 0.9 \
  --l1_ratio 0.5 \
  --wine_file ./data/winequality-red.csv

As we already discussed, MLflow lets you log model parameters, metrics, and artifacts so you can track how they evolve across iterations. This feature is extremely useful, because it lets us reproduce the best model by querying the Tracking server, or understand which code performed the required iteration using the git commit hash logs.

with mlflow.start_run():

    ... model ...

    mlflow.log_param("source", wine_path)
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)

    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    mlflow.set_tag('domain', 'wine')
    mlflow.set_tag('predict', 'quality')
    mlflow.sklearn.log_model(lr, "model")

[Image: wine quality runs in the Tracking UI]

Serving the model

The MLflow tracking server, launched with the `mlflow server` command, has a REST API for tracking runs and writing data to the local file system. You can specify the tracking server address with the `MLFLOW_TRACKING_URI` environment variable, and the MLflow tracking API will automatically contact the server at that address to create/fetch run information, log metrics, and so on.

Source: Docs // Running a tracking server
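Because the tracking interface is plain REST, it can also be queried with nothing but the Python standard library. Below is a minimal sketch, assuming the server from above is listening on localhost:5000; note that the exact `/api/2.0/mlflow/experiments/search` route has changed across MLflow versions, so treat it as an assumption rather than a guarantee:

```python
import json
import os
import urllib.request

# Resolve the tracking server address the same way the MLflow client does.
# (localhost:5000 matches the `mlflow server` command above.)
tracking_uri = os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000")

# REST routes live under /api/2.0/...; the exact path has varied across
# MLflow versions, so this route is an assumption.
experiments_url = tracking_uri + "/api/2.0/mlflow/experiments/search"

def list_experiments(url: str = experiments_url) -> dict:
    """Query the tracking server (requires the server to be running)."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())
```

In practice you would use the official `mlflow` Python client instead; the sketch just illustrates that everything the client does goes through this REST surface.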

To serve a model with the server, we need a running tracking server (see the launch section above) and the model's Run ID.

[Image: Run ID]

# Serve a sklearn model through 127.0.0.1:5005
MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow sklearn serve \
  --port 5005 \
  --run_id 0f8691808e914d1087cf097a08730f17 \
  --model-path model

To serve models using the MLflow serve functionality, we need access to the Tracking UI so it can fetch information about the model simply from the --run_id we specify.

Once the model has contacted the Tracking server, we get a new model endpoint.

# Query Tracking Server Endpoint
curl -X POST \
  http://127.0.0.1:5005/invocations \
  -H 'Content-Type: application/json' \
  -d '[
	{
		"fixed acidity": 3.42,
		"volatile acidity": 1.66,
		"citric acid": 0.48,
		"residual sugar": 4.2,
		"chlorides": 0.229,
		"free sulfur dioxide": 19,
		"total sulfur dioxide": 25,
		"density": 1.98,
		"pH": 5.33,
		"sulphates": 4.39,
		"alcohol": 10.8
	}
]'

> {"predictions": [5.825055635303461]}

Running models from Spark

Even though the Tracking server is powerful enough to serve models in real time, train them, and use the serve functionality (source: mlflow // docs // models # local), using Spark (batch or streaming) is an even more powerful solution thanks to its distributed nature.

Imagine you simply trained offline and then applied the resulting model to all of your data. This is where Spark and MLflow shine.

Installing PySpark + Jupyter + Spark

Source: Get Started PySpark - Jupyter

To show how we apply MLflow models to Spark dataframes, we need to set up Jupyter notebooks to work with PySpark.

Start by installing the latest stable version of Apache Spark:

cd ~/Downloads/
tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
mv ~/Downloads/spark-2.4.3-bin-hadoop2.7 ~/
ln -s ~/spark-2.4.3-bin-hadoop2.7 ~/spark

Install PySpark and Jupyter in the virtual environment:

pip install pyspark jupyter

Set the environment variables:

export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"

By specifying notebook-dir, we can store our notebooks in the folder we want.

Launching Jupyter from PySpark

Since we configured Jupyter as the PySpark driver, we can now run Jupyter notebooks in a PySpark context.

(mlflow) afranzi:~$ pyspark
[I 19:05:01.572 NotebookApp] sparkmagic extension enabled!
[I 19:05:01.573 NotebookApp] Serving notebooks from local directory: /Users/afranzi/Projects/notebooks
[I 19:05:01.573 NotebookApp] The Jupyter Notebook is running at:
[I 19:05:01.573 NotebookApp] http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
[I 19:05:01.573 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:05:01.574 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745


As mentioned above, MLflow provides the ability to log model artifacts to S3. As soon as we have the chosen model in hand, we can import it as a UDF using the mlflow.pyfunc module.

import mlflow.pyfunc

model_path = 's3://<bucket>/mlflow/artifacts/1/0f8691808e914d1087cf097a08730f17/artifacts/model'
wine_path = '/Users/afranzi/Projects/data/winequality-red.csv'
wine_udf = mlflow.pyfunc.spark_udf(spark, model_path)

df = spark.read.format("csv").option("header", "true").option('delimiter', ';').load(wine_path)
columns = [ "fixed acidity", "volatile acidity", "citric acid",
            "residual sugar", "chlorides", "free sulfur dioxide",
            "total sulfur dioxide", "density", "pH",
            "sulphates", "alcohol"
          ]
          
df.withColumn('prediction', wine_udf(*columns)).show(100, False)

[Image: PySpark — wine quality prediction output]

Up to this point, we have talked about how to use PySpark with MLflow, running wine-quality predictions over the entire wine dataset. But what if you need to use the Python MLflow modules from Scala Spark?

We tested this too, by splitting the Spark context between Scala and Python. That is, we registered the MLflow UDF in Python and used it from Scala (yes, perhaps not the best solution, but it's what we have).

Scala Spark + MLflow

For this example we will add the Toree kernel to the existing Jupyter.

Installing Spark + Toree + Jupyter

```
pip install toree
jupyter toree install --spark_home=${SPARK_HOME} --sys-prefix
jupyter kernelspec list
```
```
Available kernels:
  apache_toree_scala    /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/apache_toree_scala
  python3               /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/python3
```

As you can see from the attached notebook, the UDF is shared between Spark and PySpark. We hope this part will be useful for those who love Scala and want to deploy machine learning models to production.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{Column, DataFrame}
import scala.util.matching.Regex

val FirstAtRe: Regex = "^_".r
val AliasRe: Regex = "[\\s_.:@]+".r

def getFieldAlias(field_name: String): String = {
    FirstAtRe.replaceAllIn(AliasRe.replaceAllIn(field_name, "_"), "")
}

def selectFieldsNormalized(columns: List[String])(df: DataFrame): DataFrame = {
    val fieldsToSelect: List[Column] = columns.map(field =>
        col(field).as(getFieldAlias(field))
    )
    df.select(fieldsToSelect: _*)
}

def normalizeSchema(df: DataFrame): DataFrame = {
    val schema = df.columns.toList
    df.transform(selectFieldsNormalized(schema))
}

FirstAtRe = ^_
AliasRe = [\s_.:@]+

getFieldAlias: (field_name: String)String
selectFieldsNormalized: (columns: List[String])(df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
normalizeSchema: (df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
Out[1]:
[\s_.:@]+
In [2]:
val winePath = "~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv"
val modelPath = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"

winePath = ~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv
modelPath = /tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
Out[2]:
/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
In [3]:
val df = spark.read
              .format("csv")
              .option("header", "true")
              .option("delimiter", ";")
              .load(winePath)
              .transform(normalizeSchema)

df = [fixed_acidity: string, volatile_acidity: string ... 10 more fields]
Out[3]:
[fixed_acidity: string, volatile_acidity: string ... 10 more fields]
In [4]:
%%PySpark
import mlflow
from mlflow import pyfunc

model_path = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"
wine_quality_udf = mlflow.pyfunc.spark_udf(spark, model_path)

spark.udf.register("wineQuality", wine_quality_udf)
Out[4]:
<function spark_udf.<locals>.predict at 0x1116a98c8>
In [6]:
df.createOrReplaceTempView("wines")
In [10]:
%%SQL
SELECT 
    quality,
    wineQuality(
        fixed_acidity,
        volatile_acidity,
        citric_acid,
        residual_sugar,
        chlorides,
        free_sulfur_dioxide,
        total_sulfur_dioxide,
        density,
        pH,
        sulphates,
        alcohol
    ) AS prediction
FROM wines
LIMIT 10
Out[10]:
+-------+------------------+
|quality|        prediction|
+-------+------------------+
|      5| 5.576883967129615|
|      5|  5.50664776916154|
|      5| 5.525504822954496|
|      6| 5.504311247097457|
|      5| 5.576883967129615|
|      5|5.5556903912725755|
|      5| 5.467882654744997|
|      7| 5.710602976324739|
|      7| 5.657319539336507|
|      5| 5.345098606538708|
+-------+------------------+

In [17]:
spark.catalog.listFunctions.filter('name like "%wineQuality%").show(20, false)

+-----------+--------+-----------+---------+-----------+
|name       |database|description|className|isTemporary|
+-----------+--------+-----------+---------+-----------+
|wineQuality|null    |null       |null     |true       |
+-----------+--------+-----------+---------+-----------+
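As an aside, the Scala column-name normalization above can be mirrored in plain Python with the standard `re` module. This is just a sketch; `get_field_alias` and `normalize_columns` are our hypothetical Python counterparts of the Scala helpers:

```python
import re

# Mirror of the Scala helpers: collapse runs of whitespace and `_.:@`
# into a single underscore, then strip a leading underscore.
FIRST_AT_RE = re.compile(r"^_")
ALIAS_RE = re.compile(r"[\s_.:@]+")

def get_field_alias(field_name: str) -> str:
    return FIRST_AT_RE.sub("", ALIAS_RE.sub("_", field_name))

def normalize_columns(columns):
    """Normalize a whole header row, e.g. from the wine CSV."""
    return [get_field_alias(c) for c in columns]

print(normalize_columns(["fixed acidity", "free sulfur dioxide", "pH"]))
# → ['fixed_acidity', 'free_sulfur_dioxide', 'pH']
```

This keeps the column names identical on both sides of the shared Spark context, so the UDF registered in Python sees the same schema the Scala code produced.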

Next steps

Even though MLflow was in its alpha version at the time of writing, it looks very promising. Just the ability to run multiple machine learning frameworks and consume them from a single endpoint takes recommender systems to the next level.

In addition, MLflow brings data engineers and data scientists closer together, laying a common layer between them.

After this exploration of MLflow, we are confident we will move forward and use it for our Spark pipelines and recommender systems.

It would be nice to synchronize the file store with a database instead of the file system. This should give us multiple endpoints that can use the same file store, for example, several Presto and Athena instances sharing the same Glue metastore.
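In more recent MLflow releases this is possible via the `--backend-store-uri` flag, which swaps the file store for a database. A sketch, with a hypothetical PostgreSQL DSN and the same S3 artifact root as before:

```shell
mlflow server \
    --backend-store-uri postgresql://user:password@db-host:5432/mlflow \
    --default-artifact-root s3://<bucket>/mlflow/artifacts/ \
    --host 0.0.0.0 \
    --port 5000
```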

To wrap up, I would like to thank the MLflow community for making our work with data more interesting.

If you are playing around with MLflow, don't hesitate to write to us and tell us how you use it, and all the more so if you use it in production.

Find out more about the courses:
Machine Learning. Basic course
Machine Learning. Advanced course
