The story of building a cloud service, seasoned with cyberpunk


As you gain experience in IT, you start to notice that systems have a character of their own. They can be accommodating, silent, eccentric, or harsh; attractive or repulsive. One way or another, you have to "negotiate" with them, maneuver around the pitfalls, and build chains of interaction between them.

We had the honor of building a cloud platform, and to do that we needed to "persuade" a couple of subsystems to work with us. Fortunately, we had the "language of APIs", skilled hands, and plenty of enthusiasm.

This article is not technical hardcore; rather, it describes the problems we ran into while building the cloud. I decided to describe our path as a light technical fantasy about how we searched for a common language with these systems and what came of it.

Welcome under the cut.

Beginning of a journey

Some time ago, our team was tasked with launching a cloud platform for our clients. At our disposal were management support, resources, a hardware stack, and the freedom to choose technologies for implementing the software part of the service.

There were also a number of requirements:

  • the service needs a convenient personal account;
  • the platform must be integrated into the existing billing system;
  • hardware and software: OpenStack + Tungsten Fabric (Open Contrail), which our engineers have learned to “cook” quite well.

We will tell the story of how the team came together, developed the personal account interface, and made design decisions another time, if the Habr community is interested.
The tools we chose to use:

  • Python + Flask + Swagger + SQLAlchemy - a fairly standard Python stack;
  • Vue.js for the frontend;
  • interaction between components and services is handled via Celery over AMQP.

Anticipating questions about the choice of Python, let me explain. The language has found its place in our company, and a small but real culture has grown up around it, so we decided to build the service on it. Besides, development speed is often the deciding factor in tasks like this.
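Wiring these pieces together does not require anything exotic. Here is a minimal sketch of the skeleton, assuming Flasgger for the Swagger part (the real project may use a different Swagger integration) and with purely illustrative connection strings:

```python
# A minimal sketch of the backend skeleton: Flask + Swagger + SQLAlchemy,
# with Celery over AMQP for background work. Connection strings are
# illustrative, and Flasgger is only one possible Swagger integration.
from flask import Flask
from flask_sqlalchemy import SQLAlchemy
from flasgger import Swagger
from celery import Celery

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://clo:clo@db/clo"
db = SQLAlchemy(app)
swagger = Swagger(app)  # serves the API description for the frontend

celery = Celery(app.name, broker="amqp://guest:guest@rabbitmq:5672//")

class Service(db.Model):
    """A client's service as we see it in the personal account."""
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(64), nullable=False)
    status = db.Column(db.String(32), default="creating")
```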

So, let's start our acquaintance.

Silent Bill - billing

We have known this guy for a long time. He always sat nearby, silently counting something. From time to time he forwarded user requests to us, issued invoices to clients, and managed services. An ordinary hard worker. True, there were difficulties: he is silent, sometimes lost in thought, and often keeps things to himself.


Billing is the first system we tried to make friends with. And the first difficulty we encountered was in processing services.

For example, when a service is created or deleted, a task is placed in billing's internal queue; this is how asynchronous work with services is implemented. To process our own service types, we needed to "add" our tasks to this queue. And here we ran into a problem: the lack of documentation.


Judging by the API description, this problem can still be solved, but we did not have time for reverse engineering, so we moved the logic out and organized our own task queue on top of RabbitMQ. An operation on a service is initiated by the client from the personal account, becomes a Celery task on the backend, and is carried out on the billing and OpenStack side. Celery makes it convenient to manage tasks, organize retries, and monitor their status. You can read more about Celery, for example, here.
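To give a feel for what this looks like, here is a minimal sketch of such a task. The broker URL and the two helper functions are illustrative assumptions, not our real integration code:

```python
# tasks.py - a minimal sketch of how a service operation becomes a Celery task.
# The broker URL and the two helper functions are illustrative placeholders.
from celery import Celery

app = Celery("cloud", broker="amqp://guest:guest@rabbitmq:5672//")

def register_in_billing(client_id, params):
    """Placeholder for the call that registers the service in billing."""
    return "service-42"

def create_in_openstack(service_id, params):
    """Placeholder for the call that creates resources in OpenStack."""
    return True

@app.task(bind=True, max_retries=5, default_retry_delay=30)
def create_service(self, client_id, params):
    """Create a service: first in billing, then in OpenStack."""
    try:
        service_id = register_in_billing(client_id, params)
        create_in_openstack(service_id, params)
        return service_id
    except Exception as exc:
        # Retry later if one of the systems is temporarily unavailable.
        raise self.retry(exc=exc)
```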

Billing also would not stop a project whose money had run out. Talking to the developers, we found out that when billing is calculated from usage statistics (which is exactly the logic we needed), the stopping rules are interrelated in a complex way, and those models did not fit our reality well. So we implemented this, too, through Celery tasks, moving the service-management logic to the backend side.
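In essence, it is a periodic task that checks balances and suspends services. The schedule and the helper functions below are assumptions for the sake of the sketch:

```python
# A sketch of the periodic "stop services when the money runs out" check.
# The schedule and both helper functions are hypothetical placeholders.
from celery import Celery

app = Celery("cloud", broker="amqp://guest:guest@rabbitmq:5672//")
app.conf.beat_schedule = {
    "check-balances-every-5-min": {
        "task": "billing_checks.check_balances",
        "schedule": 300.0,
    }
}

def get_projects_with_negative_balance():
    """Placeholder: ask billing which projects have run out of money."""
    return []

def suspend_service(service_id):
    """Placeholder: suspend the service on the OpenStack side."""
    pass

@app.task(name="billing_checks.check_balances")
def check_balances():
    for project in get_projects_with_negative_balance():
        for service_id in project["services"]:
            suspend_service(service_id)
```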

Both of the problems above meant that the code became somewhat bloated, and in the future we will have to refactor it, moving the task-handling logic into a separate service. To support this logic we also have to store some information about users and their services in our own tables.

Another problem is silence.

Billy silently replies "OK" to some of the API requests. That, for example, is what happened when we credited promised payments for the duration of the test (more on that later). The requests appeared to execute correctly and we saw no errors.


I had to study the logs while working with the system through its UI. It turned out that billing itself performs such requests by switching the scope to a specific user (for example, admin), passing it in the su parameter.
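In our client code this boiled down to adding one extra parameter to the request. The endpoint and payload below are hypothetical, since the billing API is proprietary:

```python
# A sketch of a billing API call that switches scope via the `su` parameter.
# The URL, endpoint and payload fields are hypothetical -- the real billing
# API is proprietary and differs in detail.
import requests

BILLING_URL = "https://billing.example.com/api"

def credit_promised_payment(session_token, user_id, amount):
    response = requests.post(
        f"{BILLING_URL}/payments/promised",
        json={"user": user_id, "amount": amount},
        # Passing `su` makes billing execute the request on behalf of
        # the given user instead of silently answering "OK".
        params={"su": user_id},
        headers={"Authorization": f"Bearer {session_token}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```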

In general, despite the gaps in the documentation and small flaws in the API, everything went pretty well. Logs can be read even under heavy load, if you understand how they are arranged and what to look for. The database structure is ornate, but quite logical and in some ways even attractive.

So, summing up, the main problems we ran into at the integration stage stem from the implementation specifics of this particular system:

  • undocumented "features" that affected us in one way or another;
  • closed source code (billing is written in C++), and as a result no way to solve problem 1 other than by "trial and error".

Fortunately, the product has a fairly extensive API, and we integrated the following subsystems into our personal account:

  • technical support module — requests from the personal account are “proxied” to the billing system transparently for the service clients;
  • financial module - allows you to issue invoices to current customers, make write-offs and generate payment documents;
  • service control module - for this one we had to implement our own handler. The system's extensibility played into our hands, and we "taught" Billy a new type of service.

It took a while, but one way or another, I think Billy and I will get along.

Walks in the tungsten fields – Tungsten Fabric

Tungsten fields, studded with hundreds of wires, run thousands of bits of information through them. The information is gathered into "packets" and sorted, building complex routes as if by magic.


This is the domain of the second system we had to make friends with: Tungsten Fabric (TF), formerly OpenContrail. Its job is to manage network hardware, providing a software abstraction to us as users. TF is an SDN; it encapsulates the complex logic of working with network equipment. There is a good article about the technology itself, for example, here.

The system is integrated with OpenStack (which will be discussed below) through the Neutron plugin.

Interaction of OpenStack services.

We were introduced to this system by the guys from the Operations Department. We use the system's API to manage the network stack of our services. It hasn't caused us any serious problems or inconveniences yet (I can't speak for the Operations folks), but there were some oddities in the interaction.
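Before getting to the oddities: day to day, managing the network stack boils down to ordinary Neutron calls, with TF doing the heavy lifting behind them. A minimal sketch using openstacksdk, where the cloud name, network name and CIDR are illustrative assumptions:

```python
# A sketch of managing the network stack through the Neutron API
# (Tungsten Fabric sits behind it). The cloud entry "clo", the network
# name and the CIDR are assumptions for illustration.
import openstack

conn = openstack.connect(cloud="clo")

# Create a tenant network and a subnet for a client's service.
network = conn.network.create_network(name="client-42-net")
subnet = conn.network.create_subnet(
    network_id=network.id,
    ip_version=4,
    cidr="10.42.0.0/24",
    name="client-42-subnet",
)
print(network.id, subnet.cidr)
```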

The first one looked like this: over SSH, commands that output a large amount of data to the instance console simply "hung" the connection, while over VNC everything worked correctly.


For those who are not familiar with the problem, it looks rather funny: ls /root works fine, while top, for example, "hangs" solidly. Fortunately, we had already run into similar problems. It was solved by tuning the MTU on the route from the compute nodes to the routers. By the way, this is not a TF problem.

The next problem was just around the corner. At one "fine" moment the magic of routing disappeared, just like that: TF stopped managing routing on the hardware.


We were working with OpenStack at the admin level and then switching down to the level of the required user. SDN appears to "hijack" the scope of the user performing the actions. The thing is, the same admin account was used to connect TF and OpenStack, and at the moment of switching to the user, the "magic" disappeared. We decided to create a separate account for working with the system. This allowed us to work without breaking the integration.

Silicon Lifeforms - OpenStack

A bizarre silicon creature lives next to the tungsten fields. More than anything, it resembles an overgrown child that could crush us with a single blow, though it shows no obvious aggression. It does not provoke fear, but its size inspires caution, as does the complexity of everything happening around it.


OpenStack is the core of our platform.

OpenStack has several subsystems; so far, we use Nova, Glance and Cinder the most actively. Each has its own API. Nova is responsible for compute resources and creating instances, Cinder manages volumes and their snapshots, and Glance is an image service that manages OS templates and their meta-information.

Each service runs in a container, and the message broker is the "white rabbit" - RabbitMQ.
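From the backend's point of view, working with all three subsystems looks roughly the same. A minimal sketch using openstacksdk, where the cloud name, flavor, image and sizes are illustrative assumptions:

```python
# A sketch of talking to Nova, Cinder and Glance through openstacksdk.
# The cloud entry "clo", names, sizes and the network UUID are illustrative.
import openstack

conn = openstack.connect(cloud="clo")

# Glance: pick an OS template.
image = conn.image.find_image("ubuntu-20.04")

# Nova: create an instance from that image.
flavor = conn.compute.find_flavor("m1.small")
server = conn.compute.create_server(
    name="demo-server",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": "NETWORK-UUID"}],  # placeholder network
)
server = conn.compute.wait_for_server(server)

# Cinder: create an additional volume.
volume = conn.block_storage.create_volume(size=10, name="demo-volume")
```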

This system gave us the most unexpected trouble.

The first problem was not long in coming: when we tried to attach an additional volume to a server, the Cinder API flatly refused to perform the task. More precisely, according to OpenStack itself the attachment was made, but there was no disk device inside the virtual server.


We decided to "take a detour" and requested the same action from the Nova API. The result: the device attaches correctly and is available inside the server. It looks like the problem occurs when block-storage gets out of sync with Cinder.
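In openstacksdk terms, the "detour" looks roughly like the call below; the server and volume IDs are placeholders:

```python
# A sketch of the "detour": attaching a volume through the Nova API
# (os-volume_attachments) instead of the Cinder attach call.
# server_id and volume_id are placeholders.
import openstack

conn = openstack.connect(cloud="clo")

server_id = "SERVER-UUID"
volume_id = "VOLUME-UUID"

# Going through Nova: the attachment shows up as a device inside the guest.
attachment = conn.compute.create_volume_attachment(
    server_id,
    volume_id=volume_id,
)
print(attachment.device)  # e.g. /dev/vdb
```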

Another difficulty awaited us when working with disks. The system volume could not be detached from the server.

Again, OpenStack itself "swears" that it has severed the attachment and that the volume can now be worked with separately. But the API categorically refused to perform any operations on the disk.


Here we decided not to fight it too hard and instead changed our view of the service logic: if there is an instance, there must be a system volume. So for now the user cannot remove or detach the system "disk" without deleting the "server".
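On the backend this rule turned into a simple guard in front of the detach operation. A sketch of the idea, with hypothetical fields and error type:

```python
# A sketch of the guard that forbids detaching the system volume.
# The Volume class, its is_system field and the error class are
# hypothetical, shown only to illustrate the rule.
class ServiceLogicError(Exception):
    pass

class Volume:
    def __init__(self, volume_id, server_id, is_system):
        self.volume_id = volume_id
        self.server_id = server_id
        self.is_system = is_system

def detach_volume(volume):
    if volume.is_system:
        raise ServiceLogicError(
            "The system volume cannot be detached while the server exists; "
            "delete the server instead."
        )
    # ... otherwise proceed with the usual Nova/Cinder detach flow ...
```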

OpenStack is a fairly complex collection of systems with its own interaction logic and an ornate API. We are saved by rather detailed documentation and, of course, trial and error (there is no way around it).

Test run

We carried out a test launch in December of last year. The main goal was to test our project in "combat" mode, both from the technical side and from the UX side. The audience was invited selectively and the testing was closed. However, we also kept the option of requesting access to testing on our website.

The test itself, of course, was not without its curious moments, and this is where our adventures really begin.

First, we somewhat underestimated interest in the project and had to quickly add compute nodes right in the middle of the test. A routine case for a cluster, but there were nuances here too. The documentation for a given version of TF specifies the exact kernel version against which vRouter was tested. We decided to run the nodes on more recent kernels, and as a result TF did not receive routes from those nodes. We had to urgently roll the kernels back.


Another curiosity involved the "change password" button in the personal account.

We decided to use JWT to organize access to the personal account so as not to deal with sessions. Since the systems are diverse and widely scattered, we manage our own token, in which we "wrap" the billing session and the OpenStack token. When the password is changed, the token naturally "goes bad", because the user's credentials are no longer valid and it has to be reissued.
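Roughly, the wrapper token looks like this. A minimal sketch with PyJWT, where the claim names, secret and lifetime are assumptions rather than our exact production format:

```python
# A sketch of the "wrapper" token: our own JWT that carries the billing
# session and the OpenStack token as claims. Claim names, the secret and
# the lifetime are illustrative assumptions.
from datetime import datetime, timedelta, timezone
import jwt  # PyJWT

SECRET = "change-me"

def issue_token(user_id, billing_session_id, openstack_token):
    payload = {
        "sub": str(user_id),
        "billing_session": billing_session_id,
        "os_token": openstack_token,
        "exp": datetime.now(timezone.utc) + timedelta(hours=1),
    }
    return jwt.encode(payload, SECRET, algorithm="HS256")

def read_token(token):
    # Raises jwt.ExpiredSignatureError / jwt.InvalidTokenError when the
    # token itself is no longer valid; after a password change the wrapped
    # billing session and OpenStack token inside it also go stale.
    return jwt.decode(token, SECRET, algorithms=["HS256"])
```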


We had lost sight of this point, and there simply were not enough resources to finish the piece quickly, so we had to cut the functionality out right before the test launch.
Currently, we simply log the user out if the password has been changed.

Despite these nuances, the testing went well. In a couple of weeks about 300 people visited us. We were able to look at the product through our users' eyes, test it in action, and gather quality feedback.

To be continued

For many of us, this is the first project of this scale. We learned a number of valuable lessons about working as a team and making architectural and design decisions, and about integrating complex systems with limited resources and rolling them out to production.

Of course, there is still work to do, both in the code and at the seams between the integrated systems. The project is quite young, but we are full of ambition to grow a reliable and convenient service out of it.

We have already managed to persuade the systems. Bill obediently handles metering, invoicing, and user requests in his back room. The "magic" of the tungsten fields gives us a stable connection. Only OpenStack sometimes acts up, shouting something like "WSREP has not yet prepared node for application use". But that's a completely different story...

We recently launched the service.
You can find all the details on our website.

CLO development team

Useful links

OpenStack

Tungsten Fabric

Source: habr.com
