Extending Spark with MLflow

Hello, Habr readers. As we've already written, this month OTUS is launching two machine learning courses at once: a basic one and an advanced one. On that occasion, we continue to share useful material.

The goal of this article is to describe our first experience using MLflow.

We'll start our review of MLflow from its tracking server and log all iterations of a study. Then we'll share our experience connecting Spark to MLflow using UDFs.

Context

Here at Alpha Health, we use machine learning and artificial intelligence to empower people to take charge of their health and well-being. That's why machine learning models sit at the heart of the data science products we build, and that's why we became interested in MLflow, an open-source platform covering all aspects of the machine learning lifecycle.

MLflow

MLflow's main goal is to provide an additional layer on top of machine learning that lets data scientists work with almost any machine learning library (h2o, keras, mleap, pytorch, sklearn, and tensorflow), taking their work to the next level.

MLflow offers three components:

  • Tracking – recording and querying experiments: code, data, configuration, and results. Tracking the model-building process is critically important.
  • Projects – a packaging format for running on any platform (e.g., SageMaker)
  • Models – a common format for submitting models to different deployment tools.

MLflow (in alpha at the time of writing) is an open-source platform for managing the machine learning lifecycle, including experimentation, reuse, and deployment.

Setting up MLflow

To use MLflow, you first need to set up your Python environment; for this we use PyEnv (to get Python installed on a Mac, check here). That lets us create a virtual environment in which we'll install all the libraries needed to run MLflow.

```
pyenv install 3.7.0
pyenv global 3.7.0 # Use Python 3.7
mkvirtualenv mlflow # Create a Virtual Env with Python 3.7
workon mlflow
```

Install the required libraries.

```
pip install mlflow==0.7.0 \
            Cython==0.29 \
            numpy==1.14.5 \
            pandas==0.23.4 \
            pyarrow==0.11.0
```

Note: We use PyArrow to run models as UDFs. The PyArrow and Numpy versions had to be pinned, because the latest versions conflicted with each other.

Launching the Tracking UI

MLflow Tracking lets us log and query experiments using Python and the REST API. Besides that, you can choose where to store model artifacts (localhost, Amazon S3, Azure Blob Storage, Google Cloud Storage, or an SFTP server). Since we use AWS at Alpha Health, S3 will be our artifact store.

```
# Running a Tracking Server
mlflow server \
    --file-store /tmp/mlflow/fileStore \
    --default-artifact-root s3://<bucket>/mlflow/artifacts/ \
    --host localhost \
    --port 5000
```

MLflow recommends using persistent file storage. The file store is where the server keeps run and experiment metadata. When starting the server, make sure it points to a persistent file store. Here, just for the experiment, we'll simply use /tmp.

Keep in mind that if we want to use the mlflow server to work with old experiments, they must be present in the file store. However, even without that we could use them in a UDF, since we only need the path to the model.

Note: The Tracking UI and the model client must have access to the artifact location. That is, regardless of the fact that the Tracking UI lives on an EC2 instance, when MLflow is run locally the machine must have direct access to S3 in order to write artifact models.

[Image: the Tracking UI stores artifacts in an S3 bucket]

Running models

As soon as the tracking server is running, you can start training models.

As an example, we'll use the wine example from MLflow, modified for Sklearn.

```
MLFLOW_TRACKING_URI=http://localhost:5000 python wine_quality.py \
  --alpha 0.9 \
  --l1_ratio 0.5 \
  --wine_file ./data/winequality-red.csv
```

As mentioned above, MLflow lets you log model parameters, metrics, and artifacts so you can track how they evolve over iterations. This feature is extremely useful, since it lets us reproduce the best model either by consulting the Tracking server or by working out which code performed the desired iteration via the git commit hash logs.

```
with mlflow.start_run():

    ... model ...

    mlflow.log_param("source", wine_path)
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)

    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    mlflow.set_tag('domain', 'wine')
    mlflow.set_tag('predict', 'quality')
    mlflow.sklearn.log_model(lr, "model")
```

[Image: wine model iterations in the Tracking UI]

Serving the model

The MLflow tracking server launched with the "mlflow server" command has a REST API for tracking runs and writing data to the local file system. You can specify the tracking server address with the "MLFLOW_TRACKING_URI" environment variable, and the MLflow tracking API will automatically contact the tracking server at that address to create/fetch run information, log metrics, and so on.
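For instance, the variable can be set from Python before any tracking calls are made (a minimal sketch; the URL assumes the local tracking server started earlier in this section):

```python
import os

# Point MLflow clients in this process at the local tracking server.
# The MLflow tracking API reads this variable when it needs a backend.
os.environ["MLFLOW_TRACKING_URI"] = "http://localhost:5000"

print(os.environ["MLFLOW_TRACKING_URI"])
```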

Source: Docs // Running a tracking server

To serve a model, we need a running tracking server (see the launched UI above) and the model's Run ID.

[Image: Run ID in the Tracking UI]

```
# Serve a sklearn model on 127.0.0.1:5005
MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow sklearn serve \
  --port 5005 \
  --run_id 0f8691808e914d1087cf097a08730f17 \
  --model-path model
```

To serve models using MLflow's serve functionality, we need access to the Tracking UI so it can look up model information by --run_id.

As soon as the model has contacted the Tracking server, we can query the new model endpoint.

```
# Query Tracking Server Endpoint
curl -X POST \
  http://127.0.0.1:5005/invocations \
  -H 'Content-Type: application/json' \
  -d '[
	{
		"fixed acidity": 3.42,
		"volatile acidity": 1.66,
		"citric acid": 0.48,
		"residual sugar": 4.2,
		"chlorides": 0.229,
		"free sulfur dioxide": 19,
		"total sulfur dioxide": 25,
		"density": 1.98,
		"pH": 5.33,
		"sulphates": 4.39,
		"alcohol": 10.8
	}
]'

> {"predictions": [5.825055635303461]}
```
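The same call can be issued from Python. The sketch below only builds the request (the endpoint URL assumes the serve command above; `urllib.request.urlopen(req)` would actually send it and fails if the server is not running):

```python
import json
import urllib.request

# One wine sample; the keys must match the columns the model was trained on.
sample = {
    "fixed acidity": 3.42, "volatile acidity": 1.66, "citric acid": 0.48,
    "residual sugar": 4.2, "chlorides": 0.229, "free sulfur dioxide": 19,
    "total sulfur dioxide": 25, "density": 1.98, "pH": 5.33,
    "sulphates": 4.39, "alcohol": 10.8,
}
payload = json.dumps([sample]).encode("utf-8")

# Build the POST request against the model endpoint (not sent here).
req = urllib.request.Request(
    "http://127.0.0.1:5005/invocations",
    data=payload,
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would return {"predictions": [...]}
print(req.full_url)
```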

Running models from Spark

Even though the Tracking server is powerful enough to serve models in real time, train them, and use the serve functionality (source: mlflow // docs // models # local), using Spark (batch or streaming) is an even more powerful solution thanks to its distributed nature.

Imagine you simply did the training offline and then applied the resulting model to all of your data. This is where Spark and MLflow shine.

Install PySpark + Jupyter + Spark

Source: Get PySpark – Jupyter running

To show how we apply MLflow models to Spark dataframes, we need to set up Jupyter notebooks to work together with PySpark.

Start by installing the latest stable version of Apache Spark:

```
cd ~/Downloads/
tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
mv ~/Downloads/spark-2.4.3-bin-hadoop2.7 ~/
ln -s ~/spark-2.4.3-bin-hadoop2.7 ~/spark
```

Install PySpark and Jupyter in the virtual environment:

```
pip install pyspark jupyter
```

Set the environment variables:

```
export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"
```

By specifying notebook-dir, we can store our notebooks in the desired folder.

Launching Jupyter from PySpark

Since we were able to configure Jupyter as the PySpark driver, we can now run Jupyter notebooks in a PySpark context.

```
(mlflow) afranzi:~$ pyspark
[I 19:05:01.572 NotebookApp] sparkmagic extension enabled!
[I 19:05:01.573 NotebookApp] Serving notebooks from local directory: /Users/afranzi/Projects/notebooks
[I 19:05:01.573 NotebookApp] The Jupyter Notebook is running at:
[I 19:05:01.573 NotebookApp] http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
[I 19:05:01.573 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:05:01.574 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
```


As mentioned above, MLflow can log model artifacts to S3. As soon as we have the chosen model in hand, we can import it as a UDF using the mlflow.pyfunc module.

```
import mlflow.pyfunc

model_path = 's3://<bucket>/mlflow/artifacts/1/0f8691808e914d1087cf097a08730f17/artifacts/model'
wine_path = '/Users/afranzi/Projects/data/winequality-red.csv'
wine_udf = mlflow.pyfunc.spark_udf(spark, model_path)

df = spark.read.format("csv").option("header", "true").option('delimiter', ';').load(wine_path)
columns = [ "fixed acidity", "volatile acidity", "citric acid",
            "residual sugar", "chlorides", "free sulfur dioxide",
            "total sulfur dioxide", "density", "pH",
            "sulphates", "alcohol"
          ]

df.withColumn('prediction', wine_udf(*columns)).show(100, False)
```

[Image: PySpark – predicting wine quality]

So far we've covered how to use MLflow with PySpark, predicting wine quality over the entire wine dataset. But what if you need to use Python MLflow modules from Scala Spark?

We tested this as well, by splitting the Spark context between Scala and Python. That is, we registered the MLflow UDF in Python and used it from Scala (yes, perhaps not the best solution, but it's what we have).

Scala Spark + MLflow

For this example, we'll add the Toree kernel to the existing Jupyter.

Install Spark + Toree + Jupyter

```
pip install toree
jupyter toree install --spark_home=${SPARK_HOME} --sys-prefix
jupyter kernelspec list
```
```
Available kernels:
  apache_toree_scala    /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/apache_toree_scala
  python3               /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/python3
```

As you can see from the attached notebook, the UDF is shared between Spark and PySpark. We hope this part will be useful for those who love Scala and want to deploy machine learning models to production.

In [1]:
```
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{Column, DataFrame}
import scala.util.matching.Regex

val FirstAtRe: Regex = "^_".r
val AliasRe: Regex = "[\\s_.:@]+".r

def getFieldAlias(field_name: String): String = {
    FirstAtRe.replaceAllIn(AliasRe.replaceAllIn(field_name, "_"), "")
}

def selectFieldsNormalized(columns: List[String])(df: DataFrame): DataFrame = {
    val fieldsToSelect: List[Column] = columns.map(field =>
        col(field).as(getFieldAlias(field))
    )
    df.select(fieldsToSelect: _*)
}

def normalizeSchema(df: DataFrame): DataFrame = {
    val schema = df.columns.toList
    df.transform(selectFieldsNormalized(schema))
}
```
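To make the renaming explicit, here is a hypothetical plain-Python port of `getFieldAlias` (for illustration only, not part of the notebook): collapse runs of whitespace and punctuation into `_`, then drop a leading `_`:

```python
import re

# Same patterns as the Scala AliasRe / FirstAtRe above.
ALIAS_RE = re.compile(r"[\s_.:@]+")
FIRST_AT_RE = re.compile(r"^_")

def get_field_alias(field_name: str) -> str:
    # e.g. "fixed acidity" -> "fixed_acidity"
    return FIRST_AT_RE.sub("", ALIAS_RE.sub("_", field_name))

print(get_field_alias("free sulfur dioxide"))  # free_sulfur_dioxide
```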
In [2]:
```
val winePath = "~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv"
val modelPath = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"
```
In [3]:
```
val df = spark.read
              .format("csv")
              .option("header", "true")
              .option("delimiter", ";")
              .load(winePath)
              .transform(normalizeSchema)
```
```
df = [fixed_acidity: string, volatile_acidity: string ... 10 more fields]
```
In [4]:
```
%%PySpark
import mlflow
from mlflow import pyfunc

model_path = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"
wine_quality_udf = mlflow.pyfunc.spark_udf(spark, model_path)

spark.udf.register("wineQuality", wine_quality_udf)
```
```
<function spark_udf.<locals>.predict at 0x1116a98c8>
```
In [6]:
```
df.createOrReplaceTempView("wines")
```
In [10]:
```
%%SQL
SELECT
    quality,
    wineQuality(
        fixed_acidity,
        volatile_acidity,
        citric_acid,
        residual_sugar,
        chlorides,
        free_sulfur_dioxide,
        total_sulfur_dioxide,
        density,
        pH,
        sulphates,
        alcohol
    ) AS prediction
FROM wines
LIMIT 10
```
```
+-------+------------------+
|quality|        prediction|
+-------+------------------+
|      5| 5.576883967129615|
|      5|  5.50664776916154|
|      5| 5.525504822954496|
|      6| 5.504311247097457|
|      5| 5.576883967129615|
|      5|5.5556903912725755|
|      5| 5.467882654744997|
|      7| 5.710602976324739|
|      7| 5.657319539336507|
|      5| 5.345098606538708|
+-------+------------------+
```

In [17]:
```
spark.catalog.listFunctions.filter('name like "%wineQuality%").show(20, false)
```
```
+-----------+--------+-----------+---------+-----------+
|name       |database|description|className|isTemporary|
+-----------+--------+-----------+---------+-----------+
|wineQuality|null    |null       |null     |true       |
+-----------+--------+-----------+---------+-----------+
```

Next steps

Even though MLflow was in alpha at the time of writing, it looks quite promising. The mere ability to run multiple machine learning frameworks and consume them from a single endpoint takes recommender systems to the next level.

Moreover, MLflow brings data engineers and data scientists closer together, laying a common layer between them.

After this exploration of MLflow, we're confident that we'll move forward and use it for our Spark pipelines and recommender systems.

It would be nice to synchronize the file store with a database instead of the file system. This should give us multiple endpoints that can use the same file store. For example, using multiple Presto and Athena instances with the same Glue metastore.

To wrap up, I'd like to thank the MLflow community for making our day-to-day work with data more interesting.

If you're experimenting with MLflow, don't hesitate to write to us and tell us how you're using it, and all the more so if you're using it in production.

Learn more about the courses:
Machine Learning. Basic course
Machine Learning. Advanced course


Source: www.habr.com
