Industrial Machine Learning: 10 Design Principles

New services, applications, and other programs appear every day that let you build remarkable things: from software that controls a SpaceX rocket to an app for talking to the kettle in the next room from your smartphone.

And sooner or later every novice programmer, whether a passionate startup founder or an ordinary full-stack developer or data scientist, realizes that there are certain rules for programming and building software that make life much easier.

In this article, I will briefly describe 10 principles for building industrial machine learning so that it can be easily embedded into an application or service, based on the Twelve-Factor App methodology proposed by the Heroku team. My goal is to raise awareness of this methodology, which can help many developers and data science practitioners.

This article is a prologue to a series of articles on industrial machine learning. In them, I will go on to show how to actually build a model and run it in production, how to create an API for it, and give examples from various domains and companies that have ML built into their systems.

Principle 1. One codebase

In the early stages, some programmers, out of laziness to figure it out (or for reasons of their own), forget about Git. They either forget about it entirely, passing files to each other over a shared drive, pasting raw text, or practically sending them by carrier pigeon, or they fail to think through their workflow: everyone commits to their own branch and then straight to master.

This principle says: have one codebase and many deployments.

Git can be used both in production and in research and development (R&D), where it is used less often.

For example, during the R&D phase you can keep commits with different data-processing methods and models, then pick the best one and easily continue working from it.

Second, in production it is indispensable: you will constantly need to see how your code changes, know which model produced the best results, which code was running in the end, and what change caused it to break or start returning incorrect results. That is what commits are for!

You can also turn your project into a package, host it, for example, on Gemfury, and then simply import its functions into other projects, so that you don't rewrite them a thousand times; but more on that later.

Principle 2: Clearly declare and isolate dependencies

Every project imports external libraries in order to use them somewhere. Whether they are Python libraries, libraries of other languages for various purposes, or system tools, your task is to:

  • Clearly declare dependencies: keep a file that lists all the libraries and tools, with their versions, that are used in your project and must be installed (in Python this can be done with a Pipfile or requirements.txt; a good guide: realpython.com/pipenv-guide).
  • Isolate dependencies for your program during development, in a virtual environment. You don't want to constantly switch versions and reinstall, say, TensorFlow.

This way, developers who join your team later can quickly get familiar with the libraries and versions your project uses, and you gain control over which libraries and versions are installed for a particular project, which helps you avoid incompatibilities between libraries or their versions.

Your application also should not rely on system tools that happen to be installed on a particular OS. These tools must be declared in the dependency manifest as well, to avoid situations where a tool's version (or its very availability) does not match what a particular OS provides.

So even though curl is available on almost every machine, you should still declare it as a dependency: when you migrate to another platform it may be missing, or its version may not be the one you originally needed.

For example, your requirements.txt might look like this:

# Model Building Requirements
numpy>=1.18.1,<1.19.0
pandas>=0.25.3,<0.26.0
scikit-learn>=0.22.1,<0.23.0
joblib>=0.14.1,<0.15.0

# testing requirements
pytest>=5.3.2,<6.0.0

# packaging
setuptools>=41.4.0,<42.0.0
wheel>=0.33.6,<0.34.0

# fetching datasets
kaggle>=1.5.6,<1.6.0

Principle 3: Configurations

Many have heard stories of developers accidentally pushing code with passwords and AWS keys to public GitHub repositories and waking up the next day $6,000, or even $50,000, in debt.

These cases are extreme, of course, but very telling. If you store credentials or other configuration data inside the code, you are making a mistake, and I don't think it needs explaining why.

The alternative is to store configuration in environment variables.

Examples of data that is usually stored in environment variables:

  • Domain names
  • API URLs/URIs
  • Public and private keys
  • Contacts (mail, phones, etc.)

This way you don't have to change the code every time your configuration values change, which saves time, effort, and money.

For example, if you use the Kaggle API in your tests (say, downloading a dataset and running the model over it to verify at run time that the model works well), then private Kaggle keys such as KAGGLE_USERNAME and KAGGLE_KEY must be stored in environment variables.
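As a minimal sketch (the helper function is hypothetical; KAGGLE_USERNAME and KAGGLE_KEY are the variables the Kaggle API itself expects), reading such values in Python might look like this:

```python
import os

def get_kaggle_credentials(env=os.environ):
    # Read Kaggle credentials from the environment instead of the code.
    # Missing variables come back as None so the caller can fail loudly.
    return env.get("KAGGLE_USERNAME"), env.get("KAGGLE_KEY")
```

If either value comes back as None, the program should stop with a clear error rather than fall back to a secret hard-coded in the source.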

Principle 4: Third Party Services

The idea here is to design the program so that, at the code level, there is no difference between local and third-party resources. For example, you can connect either a local MySQL database or a third-party one. The same goes for various APIs such as Google Maps or the Twitter API.

To disconnect a third-party service or connect a different one, you only need to change the keys in the configuration stored in environment variables, as discussed in the previous section.

So, for example, instead of hard-coding the path to dataset files inside the code, it is better to use the pathlib library and declare the path to the datasets in config.py. Then, whatever service you run on (for example, CircleCI), the program can resolve the dataset path correctly given the file-system layout of the new service.
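A minimal config.py sketch (the directory and file names here are hypothetical):

```python
from pathlib import Path

# config.py: derive every path from the location of this file, so the
# project resolves its datasets the same way locally and on CI (e.g. CircleCI).
PACKAGE_ROOT = Path(__file__).resolve().parent
DATASET_DIR = PACKAGE_ROOT / "datasets"
TRAINING_DATA_FILE = DATASET_DIR / "train.csv"
```

The rest of the code then imports TRAINING_DATA_FILE from config.py instead of spelling out the path itself.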

Principle 5. Build, release, runtime

It is useful for many people in data science to level up their software-writing skills. If we want our program to crash as rarely as possible and run without failures for as long as possible, we need to split the process of releasing a new version into three stages:

  1. The build stage. You convert your bare code and its resources into a package that contains all the necessary code and data. This package is called a build.
  2. The release stage. Here we attach our config to the build; without it we could not release the program. The result is a release, fully ready to launch.
  3. The run stage. Here we run the application by starting the necessary processes from our release.
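The three stages above can be sketched as a toy Python example (a dict stands in for a real build artifact such as a wheel or a Docker image; all names are hypothetical):

```python
def build(source_dir, version):
    # Build stage: turn bare code into an immutable, versioned artifact.
    return {"source": source_dir, "version": version}

def release(build_artifact, config):
    # Release stage: combine the build with this deploy's config;
    # the result is ready to launch as-is.
    return {**build_artifact, "config": config}

def run(release_artifact):
    # Run stage: start processes from the release without modifying it.
    return "running {version} in {env}".format(
        version=release_artifact["version"],
        env=release_artifact["config"]["env"],
    )
```

The key property is one-way flow: a release is never edited in place; to change anything, you produce a new build or a new config and cut a new release.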

Such a system for releasing new versions of a model, or of the entire pipeline, lets you separate the roles of administrators and developers, lets you track versions, and prevents unwanted downtime.

Many services have been created for the release task, in which you describe the processes to run in a .yml file (in CircleCI, for example, this is config.yml). The wheel format works great for creating packages for projects.

You will be able to build packages with different versions of your machine learning model, then reference the required packages and their versions in order to use the functions you wrote there. This will help you create an API for your model, and the package can be hosted on Gemfury, for example.

Principle 6. Run your model as one or more processes

Moreover, processes should not share data. Each process must exist separately, and all data must live separately as well, for example on backing services like MySQL or others, depending on your needs.

In other words, data should definitely not be stored inside the process's file system; otherwise it may be wiped during the next release, a configuration change, or a migration of the system the program runs on.

There is one exception, though: for machine learning projects you can cache libraries so that you don't reinstall them on every launch of a new version, provided no libraries were added and no versions changed. This shortens the time it takes to launch your model in production.

To run the model as several processes, you can create a .yml file in which you specify the necessary processes and their sequence.

Principle 7: Disposability

The processes of the application that runs your model should be easy to start and stop. This lets you deploy code and configuration changes quickly, scale quickly and flexibly, and prevents breakage of the production version.

That is, your process with the model should:

  • Minimize startup time. Ideally, startup (from the moment the start command is given to the moment the process is responsive) should take no more than a few seconds. The library caching described above is one technique for reducing startup time.
  • Shut down correctly. That is, listening on the service port actually stops, and new requests sent to that port are no longer processed. Here you either need to build a good relationship with your DevOps engineers or understand how this works yourself (preferably the second, though you should always stay in touch, on any project!)
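A sketch of a graceful-shutdown loop in Python (all names are hypothetical): a SIGTERM handler sets a flag, and the loop finishes the request currently in flight instead of dropping it:

```python
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    # Don't die mid-request: just mark the process for shutdown.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def serve(get_request, handle_request):
    # Process requests until shutdown is requested or the source is empty;
    # the request currently being handled is always completed.
    while not shutting_down:
        request = get_request()
        if request is None:
            break
        handle_request(request)
```

Real servers (gunicorn, uvicorn and others) implement this pattern for you, but it is worth knowing what happens under the hood when the platform sends your process a SIGTERM.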

Principle 8: Continuous Deployment/Integration

Many companies separate the application development team from the deployment team (the one that makes the application available to end users). This can greatly slow down software development and progress on improving it. It also erodes the DevOps culture, in which development and integration are, roughly speaking, combined.

Therefore, this principle says that your development environment should be as close as possible to your production environment.

This will allow:

  1. Cut release time tenfold.
  2. Reduce the number of errors caused by code incompatibility.
  3. Reduce the burden on staff, since the developers and the people deploying the application are now one team.

Tools that allow you to work with this are CircleCI, Travis CI, GitLab CI and others.

You can quickly make additions to the model, update it, and immediately run, while it will be easy, in case of failures, to return very quickly to the working version, without the end user even noticing. This can be done especially easily and quickly if you have good tests.

Minimize the differences!!!

Principle 9. Your logs

Logs are recorded events, usually in text form, that occur inside the application (the event stream). A simple example: "2020-02-02 - system level - process name". They exist so that the developer can literally see what is happening while the program runs, follow the flow of its processes, and check whether it matches what the developer intended.

This principle says that you should not store logs inside your file system; instead, simply "print them to the screen", that is, write them to the system's standard output, stdout. The stream can then be watched in the terminal during development.

Does this mean you don't need to save logs at all? Of course not. It's just that your application should not be the one doing it; leave that to third-party services. Your application may only redirect the logs to a specific file, to the terminal for live viewing, or to a general-purpose data storage system (e.g., Hadoop). The application itself should not store or manage logs.

Principle 10. Test!

For industrial machine learning this phase is extremely important, since you need to be sure that the model works correctly and produces what you expected.

Tests can be written with pytest and run against a small dataset if you have a regression or classification task.

Don't forget to fix the same random seed for deep learning models so that they don't produce different results on every run.
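A small pytest-style sketch of the seed idea (the function names are hypothetical): with a fixed seed, the train/test split, and therefore every result downstream of it, comes out identical on every run:

```python
import numpy as np

def train_test_indices(n_samples, train_fraction=0.8, seed=42):
    # A fixed seed makes the shuffle, and therefore every downstream
    # result, reproducible from run to run.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    cut = int(n_samples * train_fraction)
    return idx[:cut], idx[cut:]

def test_split_is_reproducible():
    train_a, test_a = train_test_indices(100)
    train_b, test_b = train_test_indices(100)
    assert (train_a == train_b).all()
    assert (test_a == test_b).all()
    assert len(train_a) == 80 and len(test_a) == 20
```

pytest picks up any function whose name starts with test_, so dropping this file into the test directory is enough for it to run in CI.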

That was a brief overview of the 10 principles. Of course, it is hard to internalize them without trying them and seeing how they work, so this article is just the prologue to a series of articles in which I will show how to create industrial machine learning models, integrate them into systems, and use these principles to make life easier for all of us.

I will also try to cover other good principles that readers leave in the comments.

Source: habr.com
