Hello, Habr readers. As we wrote earlier, this month OTUS is launching two machine learning courses at once, a basic one and an advanced one. In that connection, we continue to share useful material.
The goal of this article is to describe our first experience with MLflow.
We will start our review of MLflow with its tracking server and log every iteration of an experiment. Then we will share our experience of connecting Spark to MLflow through a UDF.
Context
At Alpha Health we use machine learning and artificial intelligence to help people take care of their health and well-being. That is why machine learning models sit at the heart of the data science products we build, and why MLflow, an open source platform covering every aspect of the machine learning lifecycle, caught our attention.
MLflow
The main goal of MLflow is to provide an extra layer on top of machine learning that lets data scientists work with almost any machine learning library (h2o, keras, mleap, pytorch, sklearn and tensorflow), taking their work to another level.
MLflow provides three components:
Tracking - recording and querying experiments: code, data, configuration and results. Keeping track of the modeling process is very important.
Projects - a packaging format for reproducible runs on any platform.
Models - a common format for sending models to various deployment tools.
MLflow (in alpha at the time of writing) is an open source platform that lets you manage the machine learning lifecycle, including experimentation, reuse and deployment.
Setting up MLflow
To use MLflow you first need to set up your whole Python environment. For this we will use PyEnv (which can also be used to install Python on a Mac). That way we can create a virtual environment and install into it all the libraries needed to run MLflow.
```
pyenv install 3.7.0
pyenv global 3.7.0 # Use Python 3.7
mkvirtualenv mlflow # Create a Virtual Env with Python 3.7
workon mlflow
```
Note: We use PyArrow to run models as UDFs. The PyArrow and Numpy versions had to be pinned because the latest releases conflicted with each other.
Launching the Tracking UI
MLflow Tracking lets us log and query experiments using Python and the REST API. In addition, you can decide where to store model artifacts (localhost, Amazon S3, Azure Blob Storage, Google Cloud Storage or an SFTP server). Since we use AWS at Alpha Health, our artifact store will be S3.
```
# Running a Tracking Server
mlflow server \
    --file-store /tmp/mlflow/fileStore \
    --default-artifact-root s3://<bucket>/mlflow/artifacts/ \
    --host localhost \
    --port 5000
```
MLflow recommends using a persistent file store. The file store is where the server keeps run and experiment metadata. When starting the server, make sure it points to a persistent file store. Here, for the sake of the experiment, we simply use /tmp.
Remember that if we want to use the mlflow server to run old experiments, they must be present in the file store. Even without it, however, we could still use them in a UDF, since all we need is the path to the model.
Note: Keep in mind that the Tracking UI and the model client must both have access to the artifact location. That is, even though the Tracking UI lives on an EC2 instance, when MLflow runs locally the machine must have direct access to S3 in order to write model artifacts.
The Tracking UI storing artifacts in an S3 bucket
Running models
Once the Tracking server is running, you can start training models.
As an example, we will use the wine modification from the MLflow example for Sklearn.
As we have already discussed, MLflow lets you log model parameters, metrics and artifacts so that you can track how they evolve across iterations. This feature is extremely useful because it lets us reproduce the best model, either by querying the Tracking server or by finding out which code produced the desired iteration from the logged git commit hashes.
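A minimal sketch of what this logging might look like with the MLflow Python API, assuming a tracking server reachable at http://localhost:5000 as started earlier. The helper names (rmse, log_run) and any parameter values are illustrative and not taken from the original training script.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, computed with the standard library only."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def log_run(params, metrics, tracking_uri="http://localhost:5000"):
    """Record one training iteration on the tracking server.

    The default URI is an assumption that matches the server started earlier.
    """
    import mlflow  # imported lazily; requires `pip install mlflow`

    mlflow.set_tracking_uri(tracking_uri)
    with mlflow.start_run():
        for name, value in params.items():
            mlflow.log_param(name, value)
        for name, value in metrics.items():
            mlflow.log_metric(name, value)
```

A call such as log_run({"alpha": 0.5, "l1_ratio": 0.5}, {"rmse": rmse(y_true, y_pred)}) would then show up as one run in the Tracking UI, where y_true and y_pred stand for your own validation labels and predictions.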
The MLflow tracking server, launched with the "mlflow server" command, exposes a REST API for tracking runs and writing data to the local file system. You can specify the tracking server address with the "MLFLOW_TRACKING_URI" environment variable, and the MLflow tracking API will automatically contact the tracking server at that address to create or fetch run information, log metrics, and so on.
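To make the REST side more concrete, here is a hedged sketch that builds (but does not send) a request to log a single metric, using only the standard library. The endpoint path and field names follow the current MLflow REST API documentation; older releases, including the alpha discussed here, used a different prefix and field names, so treat them as assumptions and check the documentation for your version.

```python
import json
import os
import time
import urllib.request


def tracking_uri():
    """Read the server address from MLFLOW_TRACKING_URI.

    Falling back to localhost:5000 is an assumption made for this article;
    the real MLflow client falls back to a local ./mlruns directory instead.
    """
    return os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000")


def log_metric_request(run_id, key, value, base_uri=None):
    """Build a POST request that would log one metric for the given run."""
    base = (base_uri or tracking_uri()).rstrip("/")
    body = {
        "run_id": run_id,
        "key": key,
        "value": value,
        "timestamp": int(time.time() * 1000),  # milliseconds since epoch
    }
    return urllib.request.Request(
        url=base + "/api/2.0/mlflow/runs/log-metric",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would perform the call against a running server. In practice the Python client does this for you.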
To serve a model we need a running tracking server (see the launch section above) and the Run ID of the model.
Run ID
```
# Serve a sklearn model through 127.0.0.1:5005
MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow sklearn serve \
    --port 5005 \
    --run_id 0f8691808e914d1087cf097a08730f17 \
    --model-path model
```
To serve models with the MLflow serve functionality, we need access to the Tracking UI so that information about the model can be retrieved simply by specifying --run_id.
Once the model has contacted the Tracking server, we get a new model endpoint.
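As a rough illustration of querying that endpoint, the sketch below builds a JSON request for the /invocations route of the served model on port 5005. The column names follow the public wine-quality dataset and the sample values are illustrative. The exact payload shape is an assumption: early MLflow releases accepted a plain list of records, while newer ones expect it wrapped in a key such as dataframe_records.

```python
import json
import urllib.request

# Hypothetical sample row; column names follow the public wine-quality dataset.
SAMPLE_WINE = {
    "fixed acidity": 7.4, "volatile acidity": 0.7, "citric acid": 0.0,
    "residual sugar": 1.9, "chlorides": 0.076, "free sulfur dioxide": 11.0,
    "total sulfur dioxide": 34.0, "density": 0.9978, "pH": 3.51,
    "sulphates": 0.56, "alcohol": 9.4,
}


def invocation_request(rows, host="127.0.0.1", port=5005):
    """Build a POST request for the served model's /invocations endpoint."""
    return urllib.request.Request(
        url=f"http://{host}:{port}/invocations",
        data=json.dumps(rows).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(rows, **kwargs):
    """Send the request and decode the JSON reply (needs the server running)."""
    with urllib.request.urlopen(invocation_request(rows, **kwargs)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the serve command above running, predict([SAMPLE_WINE]) should return the model's quality estimate for that one wine.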
Although the Tracking server is powerful enough to serve models in real time, train them and use the serving functionality (source: the MLflow docs, Models section, local deployment), using Spark (batch or streaming) is an even more powerful solution because it is distributed.
Imagine that you trained a model offline and then applied it to all of your data. This is where Spark and MLflow shine.
With notebook-dir defined, we can store our notebooks in the desired folder.
Launching Jupyter from PySpark
Since we were able to configure Jupyter as the PySpark driver, we can now run Jupyter notebooks in a PySpark context.
```
(mlflow) afranzi:~$ pyspark
[I 19:05:01.572 NotebookApp] sparkmagic extension enabled!
[I 19:05:01.573 NotebookApp] Serving notebooks from local directory: /Users/afranzi/Projects/notebooks
[I 19:05:01.573 NotebookApp] The Jupyter Notebook is running at:
[I 19:05:01.573 NotebookApp] http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
[I 19:05:01.573 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:05:01.574 NotebookApp]
    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
```
As mentioned above, MLflow can log model artifacts to S3. Once we have the chosen model in hand, we can import it as a UDF using the mlflow.pyfunc module.
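A minimal sketch of that import, assuming an existing SparkSession and a DataFrame whose columns match the model's features. The function names are illustrative. The runs:/ URI scheme and the spark_udf(spark, model_uri) call follow current MLflow releases; older releases, including the alpha used here, had a different signature, so check the documentation for your version.

```python
def model_uri(run_id, artifact_path="model"):
    """Build a `runs:/` URI pointing at a model logged under the given run."""
    return f"runs:/{run_id}/{artifact_path}"


def add_predictions(spark, df, run_id, feature_columns):
    """Return `df` with a `prediction` column computed by the MLflow model."""
    import mlflow.pyfunc  # lazy import; needs mlflow, pyspark and pyarrow

    # Wrap the logged model as a Spark UDF and apply it to the feature columns.
    predict = mlflow.pyfunc.spark_udf(spark, model_uri(run_id))
    return df.withColumn("prediction", predict(*feature_columns))
```

For the URI to resolve, MLFLOW_TRACKING_URI has to point at the tracking server, and the Spark workers need read access to the S3 artifact location.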
So far we have talked about how to use PySpark with MLflow, running wine quality predictions over the entire wine dataset. But what if you need to use the Python MLflow modules from Scala Spark?
We tested this as well by sharing the Spark context between Scala and Python. That is, we registered the MLflow UDF in Python and then used it from Scala (yes, perhaps not the best solution, but it is what we have).
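The Python half of that arrangement might look roughly like the sketch below: the model is registered under a name in the Spark session, and a SQL query that calls it by name is prepared. The UDF name, table name and helper functions are hypothetical, and the approach only works if both kernels really share one Spark session.

```python
def quote(column):
    """Backtick-quote a column name so names with spaces work in Spark SQL."""
    return "`" + column.replace("`", "``") + "`"


def prediction_query(udf_name, table, feature_columns):
    """Build a Spark SQL query that calls a registered UDF on every row."""
    args = ", ".join(quote(c) for c in feature_columns)
    return f"SELECT *, {udf_name}({args}) AS prediction FROM {table}"


def register_model_udf(spark, udf_name, model_uri):
    """Register the MLflow model under `udf_name` in the shared Spark session."""
    import mlflow.pyfunc  # lazy import; needs mlflow and pyspark

    spark.udf.register(udf_name, mlflow.pyfunc.spark_udf(spark, model_uri))
```

On the Scala side, the string produced by prediction_query could then be passed to spark.sql, since a UDF registered by name is visible to SQL regardless of the language that registered it.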
Scala Spark + MLflow
For this example, we will add the Toree kernel to the existing Jupyter installation.
Installing Spark + Toree + Jupyter
```
pip install toree
jupyter toree install --spark_home=${SPARK_HOME} --sys-prefix
jupyter kernelspec list
```
```
Available kernels:
apache_toree_scala /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/apache_toree_scala
python3 /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/python3
```
As you can see from the attached notebook, the UDF is shared between Spark and PySpark. We hope this part will be useful to those who love Scala and want to deploy machine learning models to production.
Even though MLflow is in alpha at the time of writing, it looks quite promising. The ability to run multiple machine learning frameworks and consume them from a single endpoint alone takes recommender systems to the next level.
In addition, MLflow brings data engineers and data scientists closer together by laying a common layer between them.
After this exploration of MLflow, we are confident that we will move forward and use it for our Spark pipelines and recommender systems.
It would be nice to sync the file store with a database instead of the file system. That should give us multiple endpoints that can use the same file store, for example several Presto and Athena instances sharing one Glue metastore.
To sum up, I would like to thank the MLflow community for making our work with data more interesting.
If you are playing around with MLflow, do not hesitate to write to us and tell us how you use it, and even more so if you use it in production.