Extending Spark with MLflow

Hello, Habr readers. As we have already written, this month OTUS is launching two machine learning courses at once: a basic one and an advanced one. With that in mind, we continue to share useful material.

The purpose of this article is to describe our first experience with MLflow.

We will begin our review of MLflow with its tracking server and log every iteration of a study. Then we will share our experience of connecting Spark to MLflow using a UDF.

Context

At Alpha Health we use machine learning and artificial intelligence to empower people to take charge of their health and well-being. That is why machine learning models are at the heart of the data science products we build, and why we were drawn to MLflow, an open source platform that covers every aspect of the machine learning lifecycle.

MLflow

The main goal of MLflow is to provide an additional layer on top of machine learning that lets data scientists work with almost any machine learning library (h2o, keras, mleap, pytorch, sklearn and tensorflow) and take their work to the next level.

MLflow provides three components:

  • Tracking - recording and querying experiments: code, data, configuration and results. Keeping track of how a model is built is very important.
  • Projects - a packaging format for running on any platform (e.g. SageMaker).
  • Models - a common format for sending models to a variety of deployment tools.

MLflow (in alpha at the time of writing) is an open source platform for managing the machine learning lifecycle, including experimentation, reuse and deployment.

Setting up MLflow

To use MLflow you first need to set up your Python environment. For this we will use PyEnv (to install Python on a Mac, see the guide linked in the original post). This lets us create a virtual environment in which we install all the libraries needed to run it.

```
pyenv install 3.7.0
pyenv global 3.7.0 # Use Python 3.7
mkvirtualenv mlflow # Create a Virtual Env with Python 3.7
workon mlflow
```

Let's install the required libraries.

```
pip install mlflow==0.7.0 \
            Cython==0.29 \
            numpy==1.14.5 \
            pandas==0.23.4 \
            pyarrow==0.11.0
```

Note: we use PyArrow to run models as UDFs. The PyArrow and Numpy versions had to be pinned because the latest releases conflicted with each other.

Launching the Tracking UI

MLflow Tracking lets us log and query experiments using Python and the REST API. In addition, you can choose where model artifacts are stored (localhost, Amazon S3, Azure Blob Storage, Google Cloud Storage or an SFTP server). Since we use AWS at Alpha Health, our artifact store will be S3.

```
# Running a Tracking Server
mlflow server \
    --file-store /tmp/mlflow/fileStore \
    --default-artifact-root s3://<bucket>/mlflow/artifacts/ \
    --host localhost \
    --port 5000
```

MLflow recommends using persistent file storage. The file store is where the server keeps run and experiment metadata. When starting the server, make sure it points to a persistent file store. Here, for the sake of the experiment, we simply use /tmp.

Keep in mind that if we want to use the mlflow server to run old experiments, they must be present in the file store. Even without them, however, we can still use those models in a UDF, since all we need is the path to the model.

Note: keep in mind that the Tracking UI and the model client must both have access to the artifact location. That is, even if the Tracking UI lives on an EC2 instance, a machine running MLflow locally must have direct access to S3 in order to write model artifacts.

Figure: the Tracking UI stores artifacts in an S3 bucket

Running models

Once the tracking server is running, you can start training models.

As an example, we will use the wine modification of the MLflow example for Sklearn.

```
MLFLOW_TRACKING_URI=http://localhost:5000 python wine_quality.py \
  --alpha 0.9 \
  --l1_ratio 0.5 \
  --wine_file ./data/winequality-red.csv
```

As already discussed, MLflow lets you log model parameters, metrics and artifacts, so you can track how they evolve across iterations. This is extremely useful: we can reproduce the best model by asking the tracking server, or find out which code produced a given iteration from the logged git commit hashes.

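```
import mlflow
import mlflow.sklearn

with mlflow.start_run():

    # ... model training is elided in the original; it defines
    # lr, rmse, r2, mae, alpha, l1_ratio and wine_path ...

    # Parameters describe how this run was configured
    mlflow.log_param("source", wine_path)
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)

    # Metrics describe how well the run performed
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    # Tags make runs easier to find; the model itself is stored as an artifact
    mlflow.set_tag('domain', 'wine')
    mlflow.set_tag('predict', 'quality')
    mlflow.sklearn.log_model(lr, "model")
```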

Figure: wine iterations
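The logging code records rmse, r2 and mae but does not show how they are computed. As a minimal sketch, assuming the standard textbook definitions, they could be derived from the true and predicted quality scores as follows. The function name `regression_metrics` and the toy values are illustrative assumptions and are not taken from the original script; a real script would typically use a library such as scikit-learn instead.

```python
import math


def regression_metrics(actual, predicted):
    """Return RMSE, MAE and R^2 for two equally long sequences of numbers."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and of equal length")

    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares

    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # R^2 is undefined when every actual value is identical
        "r2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
    }


# Toy values, for illustration only
metrics = regression_metrics([3, 5, 7], [2, 5, 8])
print(metrics)
```

Each value in the returned dictionary could then be passed to `mlflow.log_metric` under the same key.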

Serving the model

The MLflow tracking server, launched with the "mlflow server" command, exposes a REST API for tracking runs and writing data to the local file system. You can specify the tracking server address with the "MLFLOW_TRACKING_URI" environment variable, and the MLflow tracking API will automatically contact the tracking server at that address to create or retrieve run information, log metrics, and so on.

Source: Docs // Running a tracking server
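To make the role of the environment variable concrete, here is a simplified sketch of how a client might decide where to send its tracking calls. It is not MLflow's actual implementation: the helper name `resolve_tracking_uri` and the `./mlruns` fallback are assumptions made for illustration.

```python
import os

# Assumed fallback when no server is configured (a local directory)
DEFAULT_LOCAL_STORE = "./mlruns"


def resolve_tracking_uri(environ=None):
    """Pick the tracking URI from the environment, or fall back to a local store."""
    environ = os.environ if environ is None else environ
    uri = environ.get("MLFLOW_TRACKING_URI", "").strip()
    # Drop a trailing slash so that paths can be appended consistently
    return uri.rstrip("/") if uri else DEFAULT_LOCAL_STORE


print(resolve_tracking_uri({"MLFLOW_TRACKING_URI": "http://localhost:5000/"}))
print(resolve_tracking_uri({}))
```

This is why prefixing a command with `MLFLOW_TRACKING_URI=...` is enough to redirect a script's logging to a remote server without changing its code.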

To serve a model, we need a running tracking server (see the launch section above) and the model's Run ID.

Figure: Run ID

```
# Serve a sklearn model through 127.0.0.1:5005
MLFLOW_TRACKING_URI=http://0.0.0.0:5000 mlflow sklearn serve \
  --port 5005 \
  --run_id 0f8691808e914d1087cf097a08730f17 \
  --model-path model
```

To serve models with the MLflow serve functionality, we need access to the Tracking UI, so that the model information can be retrieved simply by specifying --run_id.

Once the model server has contacted the tracking server, we can query the new model endpoint.

```
# Query the model server endpoint
curl -X POST \
  http://127.0.0.1:5005/invocations \
  -H 'Content-Type: application/json' \
  -d '[
    {
        "fixed acidity": 3.42,
        "volatile acidity": 1.66,
        "citric acid": 0.48,
        "residual sugar": 4.2,
        "chlorides": 0.229,
        "free sulfur dioxide": 19,
        "total sulfur dioxide": 25,
        "density": 1.98,
        "pH": 5.33,
        "sulphates": 4.39,
        "alcohol": 10.8
    }
]'

> {"predictions": [5.825055635303461]}
```
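The same call can be made from Python. The sketch below only builds the request with the standard library, mirroring the curl command; the helper name `build_invocation_request` is an assumption for illustration, and the commented-out lines show how it could be sent to a server that is actually running.

```python
import json
import urllib.request


def build_invocation_request(base_url, rows):
    """Build (but do not send) a POST request for the /invocations endpoint."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/invocations",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One wine sample, with the same values as in the curl call
wine = {
    "fixed acidity": 3.42, "volatile acidity": 1.66, "citric acid": 0.48,
    "residual sugar": 4.2, "chlorides": 0.229, "free sulfur dioxide": 19,
    "total sulfur dioxide": 25, "density": 1.98, "pH": 5.33,
    "sulphates": 4.39, "alcohol": 10.8,
}
request = build_invocation_request("http://127.0.0.1:5005", [wine])
print(request.get_method(), request.full_url)

# With a model server listening on port 5005:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.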

Running models from Spark

Although being able to serve models in real time, just by training them and using the serve functionality (source: mlflow // docs // models # local), is very powerful, applying models with Spark (batch or streaming) is even more powerful thanks to its distributed nature.

Imagine that you simply train offline and then apply the resulting model to all of your data. This is where Spark and MLflow shine.

Install PySpark + Jupyter + Spark

Source: Get started PySpark - Jupyter

To show how we apply MLflow models to Spark dataframes, we need to set up Jupyter notebooks to work together with PySpark.

Start by installing the latest stable version of Apache Spark:

```
cd ~/Downloads/
tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
mv ~/Downloads/spark-2.4.3-bin-hadoop2.7 ~/
ln -s ~/spark-2.4.3-bin-hadoop2.7 ~/spark
```

Install PySpark and Jupyter in the virtual environment:

```
pip install pyspark jupyter
```

Set up the environment variables:

```
export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"
```

Having set notebook-dir, we can keep our notebooks in the folder of our choice.

Launching Jupyter from PySpark

Since we have configured Jupyter as the PySpark driver, we can now run Jupyter notebooks in the PySpark context.

```
(mlflow) afranzi:~$ pyspark
[I 19:05:01.572 NotebookApp] sparkmagic extension enabled!
[I 19:05:01.573 NotebookApp] Serving notebooks from local directory: /Users/afranzi/Projects/notebooks
[I 19:05:01.573 NotebookApp] The Jupyter Notebook is running at:
[I 19:05:01.573 NotebookApp] http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
[I 19:05:01.573 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:05:01.574 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=c06252daa6a12cfdd33c1d2e96c8d3b19d90e9f6fc171745
```

As mentioned above, MLflow can log model artifacts to S3. Once we have the chosen model in hand, we can import it as a UDF using the mlflow.pyfunc module.

```
import mlflow.pyfunc

model_path = 's3://<bucket>/mlflow/artifacts/1/0f8691808e914d1087cf097a08730f17/artifacts/model'
wine_path = '/Users/afranzi/Projects/data/winequality-red.csv'

# Wrap the logged model as a Spark UDF; `spark` is the session provided by the PySpark notebook
wine_udf = mlflow.pyfunc.spark_udf(spark, model_path)

df = spark.read.format("csv").option("header", "true").option('delimiter', ';').load(wine_path)
columns = [ "fixed acidity", "volatile acidity", "citric acid",
            "residual sugar", "chlorides", "free sulfur dioxide",
            "total sulfur dioxide", "density", "pH",
            "sulphates", "alcohol"
          ]

# Apply the model to every row and show the predictions
df.withColumn('prediction', wine_udf(*columns)).show(100, False)
```

Figure: PySpark - wine quality predictions output
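The model path used for the UDF follows a regular pattern: artifact root, experiment ID, run ID, then `artifacts/model`. The sketch below reconstructs it from those parts. The layout is inferred only from the paths shown in this article and may differ in other MLflow versions or configurations, and the helper name `model_artifact_path` is hypothetical.

```python
def model_artifact_path(artifact_root, experiment_id, run_id, artifact_name="model"):
    """Join the pieces of a logged model's location into a single path.

    Assumed layout: <artifact_root>/<experiment_id>/<run_id>/artifacts/<artifact_name>
    """
    root = artifact_root.rstrip("/")
    return f"{root}/{experiment_id}/{run_id}/artifacts/{artifact_name}"


# The S3 location passed to spark_udf in the PySpark example
s3_model = model_artifact_path(
    "s3://<bucket>/mlflow/artifacts/", 1, "0f8691808e914d1087cf097a08730f17"
)
print(s3_model)
```

The artifact name defaults to `model` because that is the name passed to `mlflow.sklearn.log_model` in the training snippet.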

So far we have looked at how to use PySpark with MLflow by running wine quality predictions over the entire wine dataset. But what if you need to use the Python MLflow modules from Scala Spark?

We tested this as well, by sharing the Spark context between Scala and Python. That is, we registered the MLflow UDF in Python and used it from Scala (yes, perhaps not the best solution, but it is what we have).

Scala Spark + MLflow

For this example we will add the Toree kernel to the existing Jupyter installation.

Install Spark + Toree + Jupyter

```
pip install toree
jupyter toree install --spark_home=${SPARK_HOME} --sys-prefix
jupyter kernelspec list
```
```
Available kernels:
  apache_toree_scala    /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/apache_toree_scala
  python3               /Users/afranzi/.virtualenvs/mlflow/share/jupyter/kernels/python3
```

As you can see from the attached notebook, the UDF is shared between Spark and PySpark. We hope this part will be useful to those who love Scala and want to put machine learning models into production.

In [1]:
```
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{Column, DataFrame}
import scala.util.matching.Regex

val FirstAtRe: Regex = "^_".r
// Triple quotes keep the backslash of \s intact
val AliasRe: Regex = """[\s_.:@]+""".r

// Replace runs of whitespace, '_', '.', ':' and '@' with '_' and drop a leading '_'
def getFieldAlias(field_name: String): String = {
    FirstAtRe.replaceAllIn(AliasRe.replaceAllIn(field_name, "_"), "")
}

// Select the given columns under their normalized aliases
def selectFieldsNormalized(columns: List[String])(df: DataFrame): DataFrame = {
    val fieldsToSelect: List[Column] = columns.map(field =>
        col(field).as(getFieldAlias(field))
    )
    df.select(fieldsToSelect: _*)
}

// Normalize every column name of the DataFrame
def normalizeSchema(df: DataFrame): DataFrame = {
    val schema = df.columns.toList
    df.transform(selectFieldsNormalized(schema))
}
```
```
FirstAtRe = ^_
AliasRe = [\s_.:@]+

getFieldAlias: (field_name: String)String
selectFieldsNormalized: (columns: List[String])(df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
normalizeSchema: (df: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
Out[1]:
[\s_.:@]+
```
In [2]:
```
val winePath = "~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv"
val modelPath = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"
```
```
winePath = ~/Research/mlflow-workshop/examples/wine_quality/data/winequality-red.csv
modelPath = /tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
Out[2]:
/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model
```
In [3]:
```
val df = spark.read
              .format("csv")
              .option("header", "true")
              .option("delimiter", ";")
              .load(winePath)
              .transform(normalizeSchema)
```
```
df = [fixed_acidity: string, volatile_acidity: string ... 10 more fields]
Out[3]:
[fixed_acidity: string, volatile_acidity: string ... 10 more fields]
```
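The schema normalization is what turns a header such as "fixed acidity" into fixed_acidity, a name that can be used as a plain identifier in the SQL query further down. For readers more at home in Python, here is a sketch of the same renaming rule as a port of the Scala `getFieldAlias`. The Python names are assumptions for illustration and are not part of the original notebook.

```python
import re

# Same patterns as the Scala notebook: runs of whitespace, '_', '.', ':' and '@'
ALIAS_RE = re.compile(r"[\s_.:@]+")
# A single leading underscore left over after the replacement
FIRST_UNDERSCORE_RE = re.compile(r"^_")


def get_field_alias(field_name):
    """Collapse separator runs into '_' and strip one leading underscore."""
    return FIRST_UNDERSCORE_RE.sub("", ALIAS_RE.sub("_", field_name))


columns = ["fixed acidity", "free sulfur dioxide", "pH", "_meta.id"]
print([get_field_alias(c) for c in columns])
```

Names without separators, such as pH, pass through unchanged, so the rule is safe to apply to every column.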
In [4]:
```
%%PySpark
import mlflow
from mlflow import pyfunc

model_path = "/tmp/mlflow/artifactStore/0/96cba14c6e4b452e937eb5072467bf79/artifacts/model"
wine_quality_udf = mlflow.pyfunc.spark_udf(spark, model_path)

spark.udf.register("wineQuality", wine_quality_udf)
```
```
Out[4]:
<function spark_udf.<locals>.predict at 0x1116a98c8>
```
In [6]:
```
df.createOrReplaceTempView("wines")
```
In [10]:
```
%%SQL
SELECT 
    quality,
    wineQuality(
        fixed_acidity,
        volatile_acidity,
        citric_acid,
        residual_sugar,
        chlorides,
        free_sulfur_dioxide,
        total_sulfur_dioxide,
        density,
        pH,
        sulphates,
        alcohol
    ) AS prediction
FROM wines
LIMIT 10
```
```
Out[10]:
+-------+------------------+
|quality|        prediction|
+-------+------------------+
|      5| 5.576883967129615|
|      5|  5.50664776916154|
|      5| 5.525504822954496|
|      6| 5.504311247097457|
|      5| 5.576883967129615|
|      5|5.5556903912725755|
|      5| 5.467882654744997|
|      7| 5.710602976324739|
|      7| 5.657319539336507|
|      5| 5.345098606538708|
+-------+------------------+
```
In [17]:
```
spark.catalog.listFunctions.filter('name like "%wineQuality%").show(20, false)
```
```
+-----------+--------+-----------+---------+-----------+
|name       |database|description|className|isTemporary|
+-----------+--------+-----------+---------+-----------+
|wineQuality|null    |null       |null     |true       |
+-----------+--------+-----------+---------+-----------+
```

Next steps

Even though MLflow is in alpha at the time of writing, it looks quite promising. The ability to run multiple machine learning frameworks and consume them from a single endpoint alone takes recommender systems to the next level.

In addition, MLflow brings data engineers and data scientists closer together by laying a common layer between them.

After this exploration of MLflow, we are confident that we will go ahead and use it for our Spark pipelines and recommender systems.

It would be nice to synchronize the file store with a database instead of the file system. This should give us multiple endpoints that can use the same file store, for example multiple instances of Presto and Athena sharing the same Glue metastore.

To sum up, I would like to thank the MLflow community for making our work with data more interesting.

If you are playing around with MLflow, do not hesitate to write to us and tell us how you use it, all the more so if you use it in production.

Learn more about the courses:
Machine Learning. Basic course
Machine Learning. Advanced course

Source: www.habr.com