MLOps Cookbook, Chapter 1

Hi all! I am a CV developer at CROC. For the past three years we have been implementing computer-vision projects. During this time we have done all sorts of things, for example: monitored drivers so that while driving they don't drink, don't smoke, don't talk on the phone and look at the road rather than at their dreams or the clouds; recorded those who like to drive in dedicated lanes and occupy several parking spaces at once; made sure that employees wear helmets, gloves, etc.; identified employees who want to enter a facility; and counted everything that could be counted.

Why am I telling you all this?

In the course of these projects we hit a lot of bumps. You are either already familiar with some of these problems or will run into them in the future.

Let's simulate the situation

Imagine that we got a job at a young company "N" whose business involves ML. We work on an ML (DL, CV) project, then for some reason switch to other work, take a break, and later return to our own or someone else's neural network.

  1. The moment of truth comes: you need to somehow remember where you left off, which hyperparameters you tried and, most importantly, what results they led to. There are many options for where the information about all the runs could be stored: in someone's head, in configs, in a notepad, in a working environment in the cloud. I happened to see a variant where hyperparameters were stored as commented-out lines in the code; in short, a flight of fancy. Now imagine that you returned not to your own project but to the project of a person who has left the company, and you inherited the code and a model called model_1.pb. To complete the picture and convey all the pain, let's imagine that you are also a novice specialist.
  2. Moving on. To run the code, we and everyone who will work with it need to create an environment. It often happens that, for some reason, none was left to us as a legacy. This, too, can be a non-trivial task. You don't want to waste time on this step, do you?
  3. We train a model (for example, a car detector). At some point it gets pretty good, and it is time to save the result. Let's call it car_detection_v1.pb. Then we train another one: car_detection_v2.pb. Some time later, our colleagues (or we ourselves) train more and more models using different architectures. As a result, a pile of artifacts builds up, and information about them has to be painstakingly collected (but we will do that later, because right now we have higher-priority things to do).
  4. OK, we're done! We have a model! Can we start training the next model, developing an architecture for a new problem, or can we go get some tea? And who is going to deploy it?

Identifying the problems

Working on a project or product is the work of many people. People come and go, projects multiply, and the projects themselves become more complex. One way or another, situations from the cycle described above (and not only those) will occur in various combinations from iteration to iteration. All this adds up to wasted time, confusion, nerves, possibly customer dissatisfaction, and ultimately lost money. Although we all tend to step on the same old rake, I believe that nobody wants to relive these moments over and over again.


So, we have gone through one development cycle and we see that there are problems that need to be solved. To do this we need to:

  • conveniently store the results of work;
  • make onboarding new employees easy;
  • simplify the process of deploying the development environment;
  • set up a model versioning process;
  • have a convenient way to validate models;
  • find a tool for managing model state;
  • find a way to deliver models to production.

Apparently we need a workflow that would make managing this life cycle easy and convenient. This practice is called MLOps.

MLOps, or DevOps for Machine Learning, enables data scientists and IT teams to collaborate and accelerate model development and deployment through monitoring, validation, and governance of machine learning models.

You can also read what the guys at Google think about all this. It is clear from their article that MLOps is quite a voluminous thing.


Further in this article I will describe only part of the process. For the implementation I will use the MLflow tool, because it is an open-source project, requires only a small amount of code to hook up, and has integrations with popular ML frameworks. You can search the web for other tools, such as Kubeflow, SageMaker, Trains, etc., and perhaps find one that better suits your needs.

"Building" MLOps on the example of using the MLFlow tool

MLflow is an open-source platform for managing the life cycle of ML models (https://mlflow.org/).

MLflow includes four components:

  • MLflow Tracking - covers recording results and the parameters that led to them;
  • MLflow Projects - lets you package code and reproduce it on any platform;
  • MLflow Models - responsible for deploying models to production;
  • MLflow Registry - lets you store models and manage their state in a centralized repository.

MLflow operates with two entities:

  • a run is one complete training cycle, the parameters and metrics of which we want to record;
  • an experiment is the "topic" that runs are grouped into (see the sketch below).
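To make these two terms concrete, here is a minimal sketch using mlflow.tracking.MlflowClient; it assumes a running tracking server, and the server address is a placeholder:

import mlflow
from mlflow.tracking import MlflowClient

# point the client at the tracking server (placeholder address)
mlflow.set_tracking_uri("http://server_host:server_port")
client = MlflowClient()

# an experiment is the "topic" that groups runs
experiment_id = client.create_experiment("My_experiment")

# a run is one complete training cycle inside that experiment
run = client.create_run(experiment_id)
print(run.info.run_id)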

All steps of the example are implemented on the Ubuntu 18.04 operating system.

1. Deploying the server

In order to easily manage our project and receive all the necessary information, let's deploy a server. The MLflow tracking server has two main components:

  • backend store - responsible for storing information about registered models (supports four DBMSs: MySQL, MSSQL, SQLite, and PostgreSQL);
  • artifact store - responsible for storing artifacts (supports seven storage options: Amazon S3, Azure Blob Storage, Google Cloud Storage, FTP server, SFTP server, NFS, HDFS).

For simplicity, let's take an SFTP server as the artifact store.

  • create a group
    $ sudo groupadd sftpg
  • add a user and set a password
    $ sudo useradd -g sftpg mlflowsftp
    $ sudo passwd mlflowsftp 
  • adjust a couple of access settings
    $ sudo mkdir -p /data/mlflowsftp/upload
    $ sudo chown -R root.sftpg /data/mlflowsftp
    $ sudo chown -R mlflowsftp.sftpg /data/mlflowsftp/upload
  • add some lines to /etc/ssh/sshd_config
    Match Group sftpg
     ChrootDirectory /data/%u
     ForceCommand internal-sftp
  • restart the service
    $ sudo systemctl restart sshd
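Before moving on, it is worth checking that the artifact store is reachable. Here is a minimal sketch with pysftp, using the user we just created (the host name is a placeholder; depending on your known_hosts setup you may need to adjust pysftp.CnOpts):

import pysftp

# connect as the sftp user created above (host is a placeholder)
with pysftp.Connection("sftp_host", username="mlflowsftp", password="mlflow") as sftp:
    # the chrooted user should see the upload directory
    print(sftp.listdir("."))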

As the backend store, we'll take PostgreSQL.

$ sudo apt update
$ sudo apt-get install -y postgresql postgresql-contrib postgresql-server-dev-all
$ sudo apt install gcc
$ pip install psycopg2
$ sudo -u postgres -i
# Create new user: mlflow_user
[postgres@user_name~]$ createuser --interactive -P
Enter name of role to add: mlflow_user
Enter password for new role: mlflow
Enter it again: mlflow
Shall the new role be a superuser? (y/n) n
Shall the new role be allowed to create databases? (y/n) n
Shall the new role be allowed to create more new roles? (y/n) n
# Create database mlflow_db owned by mlflow_user
$ createdb -O mlflow_user mlflow_db
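To verify that the backend store is ready, you can connect to it with psycopg2 using the role and database created above; a minimal sketch:

import psycopg2

# connect with the role and database created above
conn = psycopg2.connect(dbname="mlflow_db", user="mlflow_user",
                        password="mlflow", host="localhost")
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
conn.close()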

To start the server, you need to install the following Python packages (I advise creating a separate virtual environment):

pip install mlflow
pip install pysftp

Start the server:

$ mlflow server \
    --backend-store-uri postgresql://mlflow_user:mlflow@localhost/mlflow_db \
    --default-artifact-root sftp://mlflowsftp:mlflow@sftp_host/upload \
    --host server_host \
    --port server_port
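Once the server is up, you can check from Python that the tracking API responds. A minimal sketch for the MLflow version pinned in this example (1.8; in later versions list_experiments was replaced by search_experiments):

import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://server_host:server_port")
client = MlflowClient()

# a fresh server should return at least the "Default" experiment
for experiment in client.list_experiments():
    print(experiment.experiment_id, experiment.name)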

2. Adding tracking

So that the results of our training do not disappear, so that future generations of developers can understand what was going on, and so that senior colleagues and you can calmly analyze the training process, we need to add tracking. Tracking means saving the parameters, metrics, artifacts and any additional information about a training run, in our case on the server.

As an example, I created a small project on GitHub in Keras for segmenting everything in the COCO dataset. To add tracking, I created the file mlflow_training.py.

Here are the lines where the most interesting things happen:

def run(self, epochs, lr, experiment_name):
    # get the experiment id, creating the experiment if it does not exist
    remote_experiment_id = self.remote_server.get_experiment_id(name=experiment_name)
    # create a run and get its id
    remote_run_id = self.remote_server.get_run_id(remote_experiment_id)

    # indicate that we want to save the results on the remote server
    mlflow.set_tracking_uri(self.tracking_uri)
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run(run_id=remote_run_id, nested=False):
        # automatically log Keras parameters, metrics and the model
        mlflow.keras.autolog()
        self.train_pipeline.train(lr=lr, epochs=epochs)

    try:
        # register tags and input parameters for the run on the server
        self.log_tags_and_params(remote_run_id)
    except mlflow.exceptions.RestException as e:
        print(e)

Here self.remote_server is a small wrapper over mlflow.tracking.MlflowClient (I made it for convenience), with which I create an experiment and a run on the server. Next, I specify where the run results should go (mlflow.set_tracking_uri(self.tracking_uri)) and enable automatic logging with mlflow.keras.autolog(). Currently, MLflow Tracking supports automatic logging for TensorFlow, Keras, Gluon, XGBoost, LightGBM and Spark. If your framework or library is not among them, you can always log explicitly. We start the training and register tags and input parameters on the remote server.
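For the explicit-logging case, here is a minimal sketch; the parameter names, metric values and file name are purely illustrative:

import mlflow

mlflow.set_tracking_uri("http://server_host:server_port")
mlflow.set_experiment("My_experiment")

with mlflow.start_run():
    # hyperparameters that led to the result
    mlflow.log_param("lr", 0.001)
    mlflow.log_param("backbone_name", "resnet18")
    # metrics, with an explicit step to get a per-epoch history
    for epoch, iou in enumerate([0.52, 0.61, 0.64]):
        mlflow.log_metric("val_iou", iou, step=epoch)
    # arbitrary files: weights, plots, configs
    mlflow.log_artifact("model_weights.h5")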

A couple of lines, and you, like everyone else, have access to information about all the runs. Cool, right?

3. Packaging the project

Now let's make launching the project as easy as possible. To do this, add an MLproject file and a conda.yaml file to the project root.
MLproject

name: flow_segmentation
conda_env: conda.yaml

entry_points:
  main:
    parameters:
        categories: {help: 'list of categories from coco dataset'}
        epochs: {type: int, help: 'number of epochs in training'}

        lr: {type: float, default: 0.001, help: 'learning rate'}
        batch_size: {type: int, default: 8}
        model_name: {type: str, default: 'Unet', help: 'Unet, PSPNet, Linknet, FPN'}
        backbone_name: {type: str, default: 'resnet18', help: 'example resnet18, resnet50, mobilenetv2 ...'}

        tracking_uri: {type: str, help: 'the server address'}
        experiment_name: {type: str, default: 'My_experiment', help: 'remote and local experiment name'}
    command: "python mlflow_training.py 
            --epochs={epochs}
            --categories={categories}
            --lr={lr}
            --tracking_uri={tracking_uri}
            --model_name={model_name}
            --backbone_name={backbone_name}
            --batch_size={batch_size}
            --experiment_name={experiment_name}"

An MLflow Project has several properties:

  • Name - the name of your project;
  • Environment - in my case, conda_env indicates that Anaconda is used to run the project and that the dependencies are described in the conda.yaml file;
  • Entry Points - specify which files we can run and with which parameters (all parameters are automatically logged when training starts).

conda.yaml

name: flow_segmentation
channels:
  - defaults
  - anaconda
dependencies:
  - python==3.7
  - pip:
    - mlflow==1.8.0
    - pysftp==0.2.9
    - Cython==0.29.19
    - numpy==1.18.4
    - pycocotools==2.0.0
    - requests==2.23.0
    - matplotlib==3.2.1
    - segmentation-models==1.0.1
    - Keras==2.3.1
    - imgaug==0.4.0
    - tqdm==4.46.0
    - tensorflow-gpu==1.14.0

You can also use Docker as the runtime environment; for more information, see the documentation.

4. Starting the training

Clone the project and go to the project directory:

git clone https://github.com/simbakot/mlflow_example.git
cd mlflow_example/

To run it, you need to install the libraries:

pip install mlflow
pip install pysftp

Since the example uses conda_env, Anaconda must be installed on your machine (though you can get around this by installing all the necessary packages yourself and playing with the launch options).

All the preparatory steps are done, and we can launch the training. From the project root:

$ mlflow run -P epochs=10 -P categories=cat,dog -P tracking_uri=http://server_host:server_port .

After you enter the command, a conda environment will be created automatically and the training will start.
In the example above, I passed the number of training epochs, the categories we want to segment (the full list can be viewed here), and the address of our remote server.
The complete list of possible parameters can be found in the MLproject file.
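The same launch can also be triggered from Python instead of the CLI; here is a sketch with mlflow.projects.run (parameter values are placeholders):

import mlflow.projects

# equivalent of "mlflow run ." with parameters from the MLproject file
mlflow.projects.run(
    uri=".",
    entry_point="main",
    parameters={
        "epochs": 10,
        "categories": "cat,dog",
        "tracking_uri": "http://server_host:server_port",
    },
)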

5. Evaluating the training results

After the training is complete, we can open the address of our server in a browser: http://server_host:server_port

(Screenshot: MLflow UI, list of experiments and runs)

Here we see a list of all experiments (top left) as well as information about the runs (middle). For each run we can view more detailed information: parameters, metrics, artifacts and some additional data.

(Screenshot: MLflow UI, detailed run information)

For each metric, we can observe the history of its changes:

(Screenshot: MLflow UI, metric history chart)

That is, at the moment we can analyze the results in "manual" mode; you can also set up automatic validation using the MLflow API.
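As a minimal sketch of such automatic validation, you can pull the metrics of a finished run through the API and compare them against a threshold (the metric name and threshold here are illustrative):

from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://server_host:server_port")

def run_passes_validation(run_id, metric="val_loss", threshold=0.3):
    # metrics of a finished run are available through the tracking API
    run = client.get_run(run_id)
    value = run.data.metrics.get(metric)
    return value is not None and value < threshold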

6. Registering the model

After we have analyzed our model and decided that it is ready for battle, we proceed to register it. To do this, select the run we need (as shown in the previous section) and scroll down.

(Screenshot: MLflow UI, model registration form)

After we give our model a name, it gets a version. If we save another model with the same name, the version is bumped automatically.

(Screenshot: MLflow UI, registered model with a version)

For each model, we can add a description and select one of three states (Staging, Production, Archived); later we can access these states through the API, which, along with versioning, gives additional flexibility.

(Screenshot: MLflow UI, model stage selection)
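These state transitions are also scriptable. Here is a sketch with MlflowClient (the model name and version are placeholders; transition_model_version_stage is available in recent MLflow versions):

from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://server_host:server_port")

# promote version 2 of the model to Production
client.transition_model_version_stage(
    name="flow_segmentation", version=2, stage="Production"
)

# fetch the current Production version(s)
for mv in client.get_latest_versions("flow_segmentation", stages=["Production"]):
    print(mv.name, mv.version, mv.current_stage)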

We also have easy access to all models:

(Screenshot: MLflow UI, list of registered models)

and their versions

(Screenshot: MLflow UI, model version list)

As in the previous section, all these operations can also be performed through the API.
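For instance, registration itself does not require the UI: mlflow.register_model takes the artifact URI of a run and a model name, and the version is assigned automatically (the run id below is a placeholder):

import mlflow

mlflow.set_tracking_uri("http://server_host:server_port")

# register the "model" artifact of a finished run under a name
result = mlflow.register_model("runs:/<run_id>/model", "flow_segmentation")
print(result.name, result.version)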

7. Deploying the model

At this stage, we already have a trained (Keras) model. An example of how it can be used:

import cv2
import numpy as np
import mlflow.keras

# RemoteRegistry is the author's small wrapper over mlflow.tracking.MlflowClient
# (see the project repository)


class SegmentationModel:
    def __init__(self, tracking_uri, model_name):
        self.registry = RemoteRegistry(tracking_uri=tracking_uri)
        self.model_name = model_name
        self.model = self.build_model(model_name)

    def get_latest_model(self, model_name):
        # find the registered model and take its latest production version
        registered_models = self.registry.get_registered_model(model_name)
        last_model = self.registry.get_last_model(registered_models)
        # download the "model" artifact of that run to the current directory
        local_path = self.registry.download_artifact(last_model.run_id, 'model', './')
        return local_path

    def build_model(self, model_name):
        local_path = self.get_latest_model(model_name)
        return mlflow.keras.load_model(local_path)

    def predict(self, image):
        image = self.preprocess(image)
        result = self.model.predict(image)
        return self.postprocess(result)

    def preprocess(self, image):
        # resize and normalize to match the training input
        image = cv2.resize(image, (256, 256))
        image = image / 255.
        image = np.expand_dims(image, 0)
        return image

    def postprocess(self, result):
        return result

Here self.registry is again a small wrapper over mlflow.tracking.MlflowClient, made for convenience. The idea is that I go to the remote server and look up the model with the specified name there, namely its latest production version. Then I download the artifact locally to the ./model folder and build the model from that directory with mlflow.keras.load_model(local_path). Now we can use our model, and CV (ML) developers can easily improve it and publish new versions.
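Usage then boils down to a couple of lines; the image path and server address below are placeholders:

import cv2

model = SegmentationModel(
    tracking_uri="http://server_host:server_port",
    model_name="flow_segmentation",
)
image = cv2.imread("example.jpg")
mask = model.predict(image)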

In conclusion

I have presented a system that allows you to:

  • centrally store information about ML models and the course and results of training;
  • quickly deploy a development environment;
  • monitor and analyze the progress of work on models;
  • conveniently version models and manage their state;
  • easily deploy the resulting models.

This example is a toy one and serves as a starting point for building your own system, which might include automated evaluation of results and model registration (steps 5 and 6, respectively), or you might add dataset versioning, or maybe something else. The point I was trying to get across is that you need MLOps as a whole; MLflow is just a means to an end.

Write in the comments: which problems have you run into that I did not cover?
What would you add to the system to meet your needs?
What tools and approaches do you use to solve all or some of these problems?

P.S. I'll leave a couple of links:
github project - https://github.com/simbakot/mlflow_example
MLflow- https://mlflow.org/
My work email, for questions - [email protected]

We periodically hold various events for IT specialists at our company; for example, on July 8 at 19:00 Moscow time there will be an online CV meetup. If you are interested, you can take part; register here.

Source: habr.com
