Hello, Habr readers. As we have already written, this month OTUS is launching two machine learning courses at once: a basic one and an advanced one. With that in mind, we continue to share useful material.
The purpose of this article is to describe our first experience of using MLflow.
We will start the review of MLflow with its tracking server and log every iteration of a study. Then we will share our experience of connecting Spark to MLflow through a UDF.
Context
At Alpha Health we use machine learning and artificial intelligence to help people take care of their health and well-being. That is why machine learning models sit at the core of the data science products we build, and why we were drawn to MLflow, an open source platform that covers every stage of the machine learning lifecycle.
MLflow
The main goal of MLflow is to provide an additional layer on top of machine learning that lets data scientists work with almost any machine learning library (h2o, keras, mleap, pytorch, sklearn and tensorflow) and take their work to the next level.
Note: We use PyArrow to run models as UDFs. The PyArrow and Numpy versions had to be pinned because the latest releases conflicted with each other.
Launching the Tracking UI
MLflow Tracking lets us log and query experiments through the Python and REST APIs. You can also choose where model artifacts are stored (localhost, Amazon S3, Azure Blob Storage, Google Cloud Storage or an SFTP server). Since we use AWS at Alpha Health, our artifact store will be S3.
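# Running a Tracking Server
# (--file-store is the flag of the MLflow release used here; later releases use --backend-store-uri)
mlflow server \
    --file-store /tmp/mlflow/fileStore \
    --default-artifact-root s3://<bucket>/mlflow/artifacts/ \
    --host localhost \
    --port 5000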
MLflow recommends using persistent file storage. The file store is where the server keeps run and experiment metadata, so when you start the server, make sure it points to a persistent location. Here, for the sake of the experiment, we simply use /tmp.
Keep in mind that if we want to use the mlflow server to serve old experiments, they must be present in the file store. Even without that, however, we could still use them in a UDF, because all we need is the path to the model.
Note: The Tracking UI and the model client must both have access to the artifact location. In other words, even though the Tracking UI lives on an EC2 instance, a machine that runs MLflow locally needs direct access to S3 in order to write model artifacts.
Tracking UI storing artifacts in an S3 bucket
As we have already discussed, MLflow lets you log model parameters, metrics and artifacts so that you can track how they change from one iteration to the next. This is extremely useful: we can reproduce the best model by asking the tracking server for it, or find out which code produced a given iteration by looking at the git commit hash logged with the run.
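The logging step described above can be sketched roughly as follows. This is a minimal illustration, not the code we ran: the helper names eval_metrics and log_run are hypothetical, the default tracking URI simply matches the server started earlier, and the artifact path "model" is assumed so that it lines up with the --model-path used when serving.

```python
import math


def eval_metrics(actual, predicted):
    """Return RMSE and MAE for two equally long sequences of numbers."""
    errors = [a - p for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"rmse": rmse, "mae": mae}


def log_run(params, metrics, model=None, tracking_uri="http://localhost:5000"):
    """Record one training iteration on the tracking server and return its run ID."""
    # Imported here so that eval_metrics stays usable without MLflow installed.
    import mlflow

    mlflow.set_tracking_uri(tracking_uri)
    with mlflow.start_run() as run:
        mlflow.log_params(params)    # e.g. hyperparameters of this iteration
        mlflow.log_metrics(metrics)  # e.g. the dictionary built by eval_metrics
        if model is not None:
            # Assumption: a scikit-learn model, stored under the artifact path "model".
            import mlflow.sklearn

            mlflow.sklearn.log_model(model, "model")
        return run.info.run_id
```

A call might look like log_run({"alpha": 0.5}, eval_metrics(y_test, y_pred), model), where alpha, y_test, y_pred and model stand in for whatever the training step produced. The returned run ID is what the serving command needs.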
To serve a model, we need a running tracking server (see Launching the Tracking UI) and the Run ID of the model.
Run ID
# Serve a sklearn model through 127.0.0.1:5005
MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow sklearn serve \
    --port 5005 \
    --run_id 0f8691808e914d1087cf097a08730f17 \
    --model-path model
To serve models with MLflow's serve functionality, we need access to the Tracking UI, which supplies the information about the model once we specify --run_id.
Once the model has been resolved through the tracking server, we get a new model endpoint that we can query.
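Querying that endpoint could look roughly like the sketch below. It uses only the standard library. The /invocations path and port 5005 follow the serve command above, but the body layout is an assumption: here we send a records-oriented JSON list, while other MLflow releases expect a different layout or Content-Type, so check the documentation of the version you run. The function names and the feature names in the usage note are hypothetical.

```python
import json
import urllib.request


def build_payload(rows):
    """Encode a list of {column: value} dictionaries as a JSON string."""
    return json.dumps([dict(row) for row in rows])


def predict(rows, url="http://127.0.0.1:5005/invocations"):
    """POST the rows to the served model and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=build_payload(rows).encode("utf-8"),
        # Assumption: the server accepts plain JSON in records orientation.
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, predict([{"alcohol": 9.4, "pH": 3.51}]) would send one row with two illustrative wine features. A real request has to carry every column the model was trained on.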
Since Jupyter can be configured as the PySpark driver, we can now run a Jupyter notebook in the context of PySpark.
(mlflow) afranzi:~$ pyspark
[I 19:05:01.572 NotebookApp] sparkmagic extension enabled!
[I 19:05:01.573 NotebookApp] Serving notebooks from local directory: /Users/afranzi/Projects/notebooks
[I 19:05:01.573 NotebookApp] The Jupyter Notebook is running at:
[I 19:05:01.573 NotebookApp] http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
[I 19:05:01.573 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:05:01.574 NotebookApp]
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
As mentioned above, MLflow can log model artifacts to S3. Once we have the chosen model in hand, we can import it as a UDF with the mlflow.pyfunc module.
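A minimal sketch of that import is shown below. It assumes MLflow 1.0 or later, where a logged model can be addressed with a runs:/ URI (earlier releases passed the run ID as a separate argument), and it assumes the model was logged under the artifact path "model". The helper names model_uri and add_predictions are hypothetical.

```python
def model_uri(run_id, artifact_path="model"):
    """Build a runs:/ URI that MLflow resolves through the tracking server."""
    return f"runs:/{run_id}/{artifact_path}"


def add_predictions(spark, df, run_id, feature_columns, output_column="prediction"):
    """Return df with one extra column holding the model's prediction per row."""
    # Imported here because it needs mlflow and pyspark, which must be installed
    # on the driver and on every executor.
    import mlflow.pyfunc

    predict = mlflow.pyfunc.spark_udf(spark, model_uri(run_id), result_type="double")
    # The UDF receives the feature columns in the order the model was trained on.
    return df.withColumn(output_column, predict(*feature_columns))
```

From the notebook this would be called as add_predictions(spark, wine_df, "0f8691808e914d1087cf097a08730f17", wine_df.columns), where wine_df is a placeholder for a DataFrame that contains only the feature columns.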
So far we have talked about using PySpark with MLflow to run wine quality predictions over the whole wine dataset. But what if you need to use the Python MLflow modules from Scala Spark?
We tested this as well, by sharing the Spark context between Scala and Python. That is, we registered the MLflow UDF in Python and used it from Scala (yes, perhaps not the best solution, but it is what we have).
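The Python half of that arrangement can be sketched as follows. It assumes both languages share one Spark session (for example, in a notebook environment that supports this), Spark 2.3 or later so that a UDF object can be registered by name, and MLflow 1.0 or later for the runs:/ URI. The UDF name predict_quality and both helper names are hypothetical.

```python
def prediction_expression(udf_name, feature_columns):
    """Build a SQL expression such as predict_quality(`alcohol`, `pH`)."""
    # Backticks keep column names with spaces (e.g. "fixed acidity") valid in Spark SQL.
    quoted = ", ".join(f"`{name}`" for name in feature_columns)
    return f"{udf_name}({quoted})"


def register_model_udf(spark, run_id, udf_name="predict_quality"):
    """Register the MLflow model as a named UDF on the shared Spark session."""
    import mlflow.pyfunc  # needs mlflow and pyspark installed

    udf = mlflow.pyfunc.spark_udf(spark, f"runs:/{run_id}/model", result_type="double")
    spark.udf.register(udf_name, udf)
    return udf_name
```

Because the UDF is registered by name, the Scala side never touches Python directly. It only needs the expression string, for instance through org.apache.spark.sql.functions.expr applied to the text that prediction_expression returns.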
Although MLflow is still in Alpha at the time of writing, it looks quite promising. The ability to run several machine learning frameworks and consume them all from a single endpoint is enough on its own to take recommender systems to the next level.
MLflow also brings data engineers and data scientists closer together by laying a common layer between them.
After this exploration of MLflow, we are confident that we will move forward and use it for our Spark pipelines and recommender systems.