The Open Data Hub project is an open machine learning platform powered by Red Hat OpenShift

The future is here: AI and machine learning technologies are already used successfully by your favorite shops, transport companies, and even turkey farms.

And if something exists, there is already an open source project for it on the Internet. See how the Open Data Hub helps you scale new technologies and avoid the hassle of implementing them.

For all the advantages of artificial intelligence (AI) and machine learning (ML), organizations often have difficulty scaling these technologies. The main problems are usually the following:

  • Information exchange and collaboration: sharing information and collaborating in rapid iterations is next to impossible.
  • Data access: it has to be set up anew and by hand for each task, which takes a lot of time.
  • On-demand access: there is no way to get on-demand access to machine learning tools and platforms, or to the computing infrastructure.
  • Production: models remain at the prototype stage and never reach production.
  • Tracking and explaining AI results: reproducing, tracking, and explaining AI/ML results is difficult.

Left unresolved, these problems hurt the speed, efficiency, and productivity of valuable data scientists. They become frustrated and disillusioned with their work, and as a result the business's expectations of AI/ML go down the drain.

Responsibility for solving these problems lies with IT professionals, who must provide data analysts with, that's right, something like a cloud. In more detail, we need a platform that offers freedom of choice and convenient, easy access. It must also be fast, easily reconfigured, scalable on demand, and fault tolerant. Building such a platform on open source technologies helps you avoid vendor lock-in and maintains a long-term strategic advantage in cost control.

A few years ago, something similar happened in application development and led to the emergence of microservices, hybrid clouds, IT automation, and agile processes. To cope with all this, IT professionals began to use containers, Kubernetes and open hybrid clouds.

Now this experience is being applied to the challenges of AI. IT professionals are therefore creating platforms that are based on containers, make it possible to build AI/ML services within agile processes, accelerate innovation, and are designed with the hybrid cloud in mind.

We'll start building such a platform with Red Hat OpenShift, our container-based Kubernetes platform for the hybrid cloud, which has a rapidly growing ecosystem of ML software and hardware solutions (NVIDIA, H2O.ai, Starburst, PerceptiLabs, etc.). Some of Red Hat's customers, such as BMW Group, ExxonMobil, and others, have already deployed containerized ML toolchains and DevOps processes on the platform and its ecosystem to bring their ML architectures into production and speed up the work of their data analysts.

Another reason why we launched the Open Data Hub project is to demonstrate an example architecture based on several open source projects and show how to implement the entire life cycle of an ML solution based on the OpenShift platform.

Open Data Hub project

This is an open source project, developed within its own community, that implements a full cycle of operations for AI/ML problems, from loading and transforming the initial data to generating, training, and maintaining a model, using containers and Kubernetes on the OpenShift platform. The project can be regarded as a reference implementation: an example of how to build an open source AI/ML-as-a-Service solution based on OpenShift and related open source tools such as TensorFlow, JupyterHub, Spark, and others. It is important to note that Red Hat itself uses this project to provide its AI/ML services. In addition, OpenShift integrates with key software and hardware ML solutions from NVIDIA, Seldon, Starburst, and other vendors, making it easier to build and run your own machine learning systems.

The Open Data Hub project targets the following user categories and use cases:

  • A data analyst who needs a self-service, cloud-based solution for ML projects.
  • A data analyst who wants the widest possible choice of the latest open source AI/ML tools and platforms.
  • A data analyst who needs access to data sources when training models.
  • A data analyst who needs access to computing resources (CPU, GPU, memory).
  • A data analyst who needs to collaborate and share work with peers, get feedback, and make improvements in rapid iterations.
  • A data analyst who wants to work with developers (and DevOps teams) to get their ML models and results into production.
  • A data engineer who needs to give data analysts access to a variety of data sources while complying with regulations and security requirements.
  • An IT system administrator or operator who needs to control the life cycle (installation, configuration, updates) of open source components and technologies with little effort, and who also needs appropriate management and quota tools.

The Open Data Hub project combines a range of open source tools to implement the full cycle of AI / ML operations. Jupyter Notebook is used here as the main working tool for data analytics. This toolkit is very popular among data scientists today, and the Open Data Hub allows them to easily create and manage Jupyter Notebook workspaces using the built-in JupyterHub. In addition to creating and importing Jupyter notebooks, the Open Data Hub project also contains a number of ready-made notebooks in the form of an AI Library.

This library is a collection of open source machine learning components and solutions for common scenarios that make rapid prototyping easier. JupyterHub is integrated with the OpenShift RBAC access model, which lets you use existing OpenShift accounts and implement single sign-on. In addition, JupyterHub offers a convenient user interface called the spawner, with which the user can easily adjust the amount of computing resources (processor cores, memory, GPU) for the selected Jupyter Notebook.

After the data analyst creates and configures the notebook, the Kubernetes scheduler, which is part of OpenShift, takes care of everything else. Users only need to run their experiments and save and share the results of their work. In addition, advanced users can access the OpenShift CLI directly from Jupyter notebooks to use Kubernetes primitives such as Job, or OpenShift functionality such as Tekton or Knative. Alternatively, they can use the convenient OpenShift GUI for this, the OpenShift Web Console.
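
As a rough illustration of the Job primitive mentioned above, the following sketch builds a Kubernetes `batch/v1` Job manifest as a plain Python dictionary and prints it as JSON. The helper name `build_job_manifest`, the job name, the image, and the command are hypothetical placeholders, not something Open Data Hub ships; the `nvidia.com/gpu` resource name assumes a cluster where NVIDIA GPUs are exposed under that name.

```python
import json


def build_job_manifest(name, image, command, cpu="1", memory="2Gi", gpus=0):
    """Return a Kubernetes batch/v1 Job manifest as a plain dict."""
    limits = {"cpu": cpu, "memory": memory}
    if gpus:
        # Assumption: GPUs are exposed as the extended resource nvidia.com/gpu.
        limits["nvidia.com/gpu"] = gpus
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,  # retry a failed pod at most twice
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "command": command,
                            "resources": {"limits": limits},
                        }
                    ],
                }
            },
        },
    }


manifest = build_job_manifest(
    name="train-model",                      # hypothetical job name
    image="example.com/ml/trainer:latest",   # hypothetical image
    command=["python", "train.py"],          # hypothetical entry point
    gpus=1,
)
print(json.dumps(manifest, indent=2))
```

Saved to a file, a manifest like this could be submitted from a notebook terminal with `oc create -f job.json`, assuming the user's account is allowed to create Jobs in the project.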

Moving on to the next stage, Open Data Hub lets you manage data pipelines. For this it uses Ceph, which is provided as an S3-compatible object store. Apache Spark streams data from external sources or from the built-in Ceph S3 storage, and also lets you perform preliminary data transformations. Apache Kafka provides advanced data pipeline management (where you can perform multiple loads, as well as transform, parse, and store data).

So, the data analyst has gained access to the data and built a model. Now they want to share the results with colleagues or application developers and offer the model to them as a service. This requires an inference server, and the Open Data Hub has one called Seldon, which lets you publish the model as a RESTful service.
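
To make the idea of a model published as a RESTful service more concrete, here is a minimal sketch of a client building a prediction request. The URL path and the `{"data": {"ndarray": ...}}` payload follow the Seldon Core REST convention as we understand it, and both may differ between Seldon versions and ingress setups; the host, namespace, deployment name, and feature values are hypothetical. The request is only constructed, not sent.

```python
import json
import urllib.request


def build_prediction_request(base_url, namespace, deployment, rows):
    """Build (but do not send) a POST request for a model prediction."""
    # Assumption: the model is exposed under a Seldon-style path; check the
    # path that your Seldon version and ingress actually use.
    url = f"{base_url}/seldon/{namespace}/{deployment}/api/v1.0/predictions"
    body = json.dumps({"data": {"ndarray": rows}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_prediction_request(
    base_url="http://seldon.example.com",  # hypothetical ingress host
    namespace="my-project",                # hypothetical OpenShift project
    deployment="my-model",                 # hypothetical deployment name
    rows=[[5.1, 3.5, 1.4, 0.2]],           # one row of input features
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Against a live endpoint, the request could be sent with `urllib.request.urlopen(request)` and the JSON response decoded with `json.load`.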

At some point, several such models will be running on the Seldon server, and you will need to monitor how they are used. For this, the Open Data Hub offers metrics collection and a reporting engine based on the widely used open source monitoring tools Prometheus and Grafana. This gives us feedback on how AI models are used, particularly in the production environment.
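
As a sketch of how such usage metrics might be read programmatically, the code below builds a URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`) and parses a response in the instant-vector format that endpoint returns. The Prometheus host, the metric name `model_requests_total`, the label `model_name`, and the sample response are all hypothetical; the metrics actually exported depend on how the models are deployed.

```python
import json
import urllib.parse


def build_query_url(prometheus_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    return f"{prometheus_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(response_text, label="model_name"):
    """Map the chosen label of each sample to its numeric value."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    return {
        sample["metric"].get(label, "unknown"): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


# Hypothetical metric and label names, for illustration only.
promql = "sum(rate(model_requests_total[5m])) by (model_name)"
url = build_query_url("http://prometheus.example.com", promql)

# Hand-written sample response in the instant-vector format.
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"model_name": "churn"}, "value": [1700000000, "12.5"]},
            {"metric": {"model_name": "fraud"}, "value": [1700000000, "3"]},
        ],
    },
})
rates = parse_instant_vector(sample_response)
print(url)
print(rates)
```

In practice Grafana dashboards cover most of this, so a script like this is mainly useful for automated checks or reports.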

Thus, the Open Data Hub provides a cloud-like approach throughout the entire life cycle of AI/ML operations, from data access and preparation to training the model and running it in production.

Putting it all together

Now the question arises of how the OpenShift administrator should organize all this. This is where a special Kubernetes operator for Open Data Hub projects comes into play.

This operator manages the installation, configuration, and life cycle of the Open Data Hub project, including deployment of the aforementioned tools such as JupyterHub, Ceph, Spark, Kafka, Seldon, Prometheus, and Grafana. The Open Data Hub project can be found in the OpenShift web console, in the community-operators section. The OpenShift administrator can thus specify that the corresponding OpenShift projects are categorized as an "Open Data Hub project". This is done once. After that, the data analyst enters their project space through the OpenShift web console and sees that the corresponding Kubernetes operator is installed and available for their projects. They then instantiate the Open Data Hub project with a single click and immediately have access to the tools described above. All of this can be configured for high availability and fault tolerance.

If you want to try the Open Data Hub project yourself, start with the installation instructions and the introductory tutorial. The project also publishes technical details of the Open Data Hub architecture and its development plans. Future plans include additional integration with Kubeflow, solutions to a number of data regulation and security issues, and integration with the Drools rules engine and OptaPlanner. To share your opinion and become a participant in the Open Data Hub project, visit the community page.

To recap, serious scaling challenges prevent organizations from realizing the full potential of artificial intelligence and machine learning. Red Hat OpenShift has long been used successfully to solve similar problems in the software industry. The Open Data Hub project, developed by an open source community, offers a reference architecture for organizing a full cycle of AI/ML operations on the OpenShift hybrid cloud. We have a clear and well-thought-out plan for the development of this project, and we are serious about building an active and productive community around it to develop open AI solutions on the OpenShift platform.

Source: habr.com