Orphan Services: The Other Side of a (Micro)Service Architecture

Andrey Nikolsky, Operations Director of the Banki.ru portal, spoke at last year's DevOpsDays Moscow conference about orphan services: how to spot an orphan in your infrastructure, why orphan services are bad, what to do with them, and what to do if nothing helps.

Below the cut is the text version of the talk.


Hello, colleagues! My name is Andrey, and I run operations at Banki.ru.

We have large services that are essentially monoliths, services in the more classical sense, and very small ones. In my down-to-earth terminology, if a service is simple and small, it is a microservice; if it is neither simple nor small, it is just a service.

Pros of services

I will quickly go over the pros of services.

The first is scaling. You can quickly build something as a service and launch it in production. Traffic arrives, you clone the service. More traffic arrives, you clone it again and carry on. This is a nice bonus, and when we started out, it was considered the most important reason for doing all this in the first place.

The second is isolated development: you have several development teams, with several different developers on each, and each team works on its own service.

With teams there is a nuance. Developers differ. There are, for example, snowflake people. I first came across this idea from Maxim Dorofeev. Some teams have snowflake people and some do not. This makes the services used across the company a little uneven.

Look at the picture: this is a good developer. He has big hands and can do a lot. The main question is where those hands grow from.

Services make it possible to use different programming languages, each better suited to a particular task. One service in Go, another in Erlang, others in Ruby, PHP, or Python. In general, there is a lot of room to spread out. Here, too, there are nuances.

Service-oriented architecture is first and foremost about DevOps. If you have no automation and no deployment process, if you set things up by hand, if configurations can differ from one service instance to another and you have to log in to each one to do anything, then you are in hell.
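To make the point about configurations differing from instance to instance concrete, here is a minimal sketch of a drift check. It assumes you can already fetch each instance's config as text; the instance names and the `workers` setting are hypothetical and do not come from the talk.

```python
import hashlib
from collections import defaultdict


def group_by_config(configs):
    """Group instance names by the SHA-256 digest of their config text."""
    groups = defaultdict(list)
    for instance, text in configs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        groups[digest].append(instance)
    # Sort for a stable, readable result.
    return sorted(sorted(names) for names in groups.values())


def has_drift(configs):
    """True if at least two instances run different configs."""
    return len(group_by_config(configs)) > 1


if __name__ == "__main__":
    # Hypothetical instances of one service; app-3 was edited by hand.
    configs = {
        "app-1": "workers=4\n",
        "app-2": "workers=4\n",
        "app-3": "workers=8\n",
    }
    print(group_by_config(configs))
    print("drift detected:", has_drift(configs))
```

More than one group means somebody changed something by hand, which is exactly the situation automation is supposed to rule out.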

For example, you have 20 services and deploy by hand: you open 20 consoles and press Enter in all of them at once, like a ninja. That is not great.

If a service has passed testing (if there is testing, of course) and you still need to file it down by hand to make it work in production, I also have bad news for you.

If you rely on specific Amazon services and operate in Russia, then two months ago you also had an "everything is on fire, but I'm fine, everything is cool" moment.

We use Ansible and Bamboo for deployment automation, Puppet for convergence, and Confluence to document it all somehow.

I will not dwell on this in detail, because the talk is more about interaction practices than about technical implementation.

For example, we had a problem where Puppet on a server runs on Ruby 2 while some application is written for Ruby 1.8, and the two do not work together. Some sort of bug shows up. When you need to keep multiple versions of Ruby on the same machine, you usually run into problems.

For example, we give each developer a platform that contains roughly everything we have, all the services that can be developed, so that he has an isolated environment he can break and rebuild as he likes.

Sometimes you need a specially compiled package with support for something specific. That is hard enough. I have heard a talk where a Docker image weighed 45 GB. On Linux, of course, things are simpler and everything is smaller, but you will still run out of disk space.

And then there are conflicting dependencies: one part of the project depends on one version of a library, another part depends on a different version, and the two versions cannot be installed together at all.

We have sites and services on PHP 5.6. We are ashamed of them, but what can we do. That is one platform. There are sites and services on PHP 7; there are more of them, and we are not ashamed of those. And each developer has his own database, where he happily tinkers away.

If your company writes in one language, then three virtual machines per developer sounds fine. If you have several programming languages, the situation gets worse.

You have sites and services on one stack, then on another, then one more site in Go, one site in Ruby, and some Redis on the side. As a result, all of this turns into a large surface to support, and something in it can break at any time.

So we traded the benefits of multiple programming languages for the use of different frameworks: PHP frameworks differ quite a lot, with different capabilities, communities, and levels of support. You can pick one for a service so that much of what you need is already there.

Each service has its own team

Our main advantage, which has crystallized over several years, is that each service has its own team. This is convenient for a large project: you save time on documentation, and managers know their project well.

Tasks from support are easy to route. For example, the insurance service breaks, and the team responsible for insurance immediately goes to fix it.

New features ship quickly, because when you have an atomic service, you can quickly bolt something onto it.

And when you break your own service, which inevitably happens, you do not hurt other people's services, and developers from other teams do not come running with baseball bats saying, “Hey, don't do that.”

As always, there are nuances. We have stable teams, and managers are firmly attached to a team. There are clear documents, and managers keep a close eye on all of this. Each team, with its manager, owns several services, and there is a specific point of competence.

If teams are fluid (we sometimes use this too), there is a good method called the "star map".

You have a list of services and people. A star means that a person is an expert in the service; a book means that the person is studying it. The person's task is to turn the book into a star. And if there is nothing at all next to a service, the problems begin, which I will talk about next.
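A star map is easy to keep as plain data and check automatically. The sketch below is only an illustration: the service names, the people, and the "expert"/"learning" labels (standing in for the star and the book) are all hypothetical.

```python
# Hypothetical star map: service -> {person: "expert" | "learning"}.
# "expert" plays the role of the star, "learning" the role of the book.
STAR_MAP = {
    "insurance": {"anna": "expert", "boris": "learning"},
    "link-shortener": {"boris": "learning"},
    "image-resizer": {},
}


def find_at_risk(star_map):
    """Split services into orphans (nobody listed) and those without an expert."""
    orphans = []
    no_expert = []
    for service, people in sorted(star_map.items()):
        if not people:
            orphans.append(service)
        elif "expert" not in people.values():
            no_expert.append(service)
    return orphans, no_expert


if __name__ == "__main__":
    orphans, no_expert = find_at_risk(STAR_MAP)
    print("Orphans:", orphans)
    print("No expert yet:", no_expert)
```

Running a check like this on every change to the map gives an early warning before a service ends up with an empty row.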

How do orphan services appear?

The first problem, the first way to get an orphan service in your infrastructure, is people leaving. Has anyone had deadlines arrive from the business before the tasks were even estimated? Sometimes deadlines are tight and there is simply no time for documentation. “We need to ship the service to production; we will add the docs later.”

If the team is small, it often has one developer who writes everything while the rest assist. “I wrote the core architecture; you just give me the interfaces.” Then at some point the manager, for example, leaves. In the period after the manager has left and before a new one is appointed, the developers decide for themselves where the service is heading and what happens in it. And as we know (let's go back a few slides), some teams have snowflake people, and sometimes the team lead is a snowflake. Then he quits, and we get an orphan service.

Meanwhile, tasks from support and the business do not go away; they pile up in the backlog. If architectural mistakes were made while developing the service, they pile up in the backlog too. The service slowly degrades.

How to identify an orphan?

The list on the slide describes the situation well. Does anyone recognize their own infrastructure in it?

About documented workarounds: there is a service and, by and large, it works; it has a two-page manual on how to operate it, but no one knows how it works inside.

Or take link shorteners. We currently have three link shorteners in use for different purposes in different services. These are exactly the consequences.

Now I will be Captain Obvious. What needs to be done? First, you need to hand the service over to another manager and another team. If your team lead has not quit yet, then as soon as you realize the service is becoming an orphan, you need to bring someone who understands at least something about it into that other team.

The main thing: you must have handover procedures written in blood. In our case, I am usually the one who enforces this, because I need the service to work. Managers need it handed over quickly, and what happens to it afterwards is not so important to them.

The next way to create an orphan is “Let's outsource it, it will be faster, and then we will hand it over to the team.” Understandably, everyone on the team has their own plans and queue. A business customer often assumes the outsourcer will do the same job as the company's own technical department, although their motivations differ. Outsourced work comes with strange technological decisions and strange algorithmic decisions.

For example, we had a service that used Sphinx in various unexpected places. I will tell you later what we had to do about it.

Outsourcers have home-grown frameworks. Often it is just bare PHP copy-pasted from a previous project, where you can find anything. There are big hacks in deployment scripts: a few lines in some file have to be changed by complex Bash scripts, and those deployment scripts are called by yet another script. As a result, you change the deployment system, pick something else, and oops, the service no longer works, because it needed eight more links between various folders. Or it turns out that the service works with a thousand records but no longer with a hundred thousand.

I will keep being Captain Obvious. Accepting a service from an outsourcer is a mandatory procedure. Has anyone had an outsourced service arrive without being formally accepted anywhere? It is, of course, not as common as an orphan service, but still.

The service needs to be checked, the service needs to be reviewed, and passwords need to be changed. We had a case where a service was handed to us with an admin panel guarded by “if login == 'admin' && password == 'admin'…”, written right in the code. We sat there wondering: people still write this in 2018?
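Part of that review can be automated. The sketch below is a deliberately naive scan, not a replacement for a proper secret scanner or a human review: it flags lines where a password-like identifier is compared with or assigned a string literal. The regular expression, the function name, and the PHP sample are assumptions made for illustration.

```python
import re

# Assumed heuristic: a password-like identifier followed by =, == or ===
# and then a quoted string literal.
HARDCODED = re.compile(
    r"""\b(password|passwd|pwd|secret)\w*\s*(===?|=)\s*['"][^'"]+['"]""",
    re.IGNORECASE,
)


def find_hardcoded_credentials(source):
    """Return (line_number, line) pairs that look like hard-coded credentials."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if HARDCODED.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical PHP fragment similar to the one described in the talk.
    sample = (
        "if ($login == 'admin' && $password == 'admin') {\n"
        "    grant();\n"
        "}\n"
        "$password = getenv('APP_PASSWORD');\n"
    )
    for number, line in find_hardcoded_credentials(sample):
        print(f"line {number}: {line}")
```

The comparison with a literal is flagged, while the line that reads the password from the environment is not.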

Testing with realistic data volumes is also necessary. You need to see what happens at a hundred thousand records before you put the service into production anywhere.
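A volume check does not have to be elaborate. Here is a rough sketch that uses an in-memory SQLite database as a stand-in for the real storage, which is an assumption: the table, the query, and the page size are hypothetical. It times the same paginated query at a thousand and at a hundred thousand records.

```python
import sqlite3
import time


def build_dataset(rows):
    """Create an in-memory table with `rows` synthetic records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO items (title) VALUES (?)",
        ((f"item {i}",) for i in range(rows)),
    )
    conn.commit()
    return conn


def time_last_page(conn, page_size=20):
    """Time fetching the last page with OFFSET pagination; return (seconds, rows)."""
    total = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    offset = max(total - page_size, 0)
    started = time.perf_counter()
    page = conn.execute(
        "SELECT id, title FROM items ORDER BY title LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()
    return time.perf_counter() - started, len(page)


if __name__ == "__main__":
    for rows in (1_000, 100_000):
        elapsed, fetched = time_last_page(build_dataset(rows))
        print(f"{rows} rows: {fetched} fetched in {elapsed:.4f}s")
```

The absolute numbers mean little; what matters is how the timing changes between the two sizes when the same check is run against the service's real storage.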

There should be no shame in sending a service back for rework. Saying “We will not accept this service; here are 20 tasks, complete them, and then we will accept it” is normal. Your conscience should not trouble you for letting a manager down or making the business spend money. Otherwise the business will spend more.

We had a case when we decided to outsource a pilot project.

It was delivered on time, and that was the only quality criterion. So we ordered another pilot project, which was not even really a pilot anymore. These services were accepted and assigned by administrative order: here is your code, here is the team, here is your manager. The services have already started to bring in profit. Yet in practice they are still orphans: no one understands how they work, and managers do everything they can to avoid tasks on them.

There is another great concept: guerrilla development. Some department, usually marketing, wants to test a hypothesis and orders an entire service from an outsourcer. Traffic starts pouring in, they close out the paperwork, sign the acceptance acts with the contractor, then come to operations and say: “Guys, we have a service here, it already has traffic, it brings us money, let's accept it.” And we are like, “Whoa, how did that happen?”

One more way to get an orphan service: when a team suddenly turns out to be overloaded, management says, “Let's transfer this team's service to another team with a lighter load.” Then we transfer it to a third team and change the manager. And in the end, we again have an orphan.

What is the problem with orphans?

For those who do not know, this is the warship Vasa, raised in Sweden and famous for sinking minutes into her maiden voyage. The king of Sweden, by the way, did not execute anyone for this. She was built by two generations of engineers who did not know how to build such ships. The natural result.

The ship could have sunk in a much worse way, by the way, for example with the king already aboard, sailing somewhere in a storm. Instead, she sank right away. According to agile, it is good to fail early.

If we fail early, there is usually no problem: for example, the service is sent back for rework at acceptance. But if we fail in production, after the money has been invested, there can be problems. Consequences, as they are called in business.

Why orphan services are dangerous:

  • The service may break suddenly.
  • The service takes a long time to repair or is not repaired at all.
  • Security problems.
  • Problems with improvements and updates.
  • If an important service breaks down, the company's reputation suffers.

What to do with orphan services?

Let me repeat what to do. First, there must be documentation. Seven years at Banki.ru have taught me that testers should not take developers at their word, and operations should not take anyone at their word. We must verify.

Second, you need to draw interaction diagrams, because services that were not properly accepted often contain dependencies no one mentioned. For example, the developers hooked the service up to Yandex.Maps or Dadata using their own key. The free limit runs out, everything breaks, and you have no idea what happened. All such pitfalls should be documented: the service uses Dadata, SMS, something else.
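Alongside a diagram, such dependencies can be kept as a small machine-readable inventory. The following is a sketch under assumptions: the field names, the sample services, and the rule for what counts as risky are hypothetical; only Dadata and Yandex.Maps are mentioned in the talk.

```python
from dataclasses import dataclass


@dataclass
class ExternalDependency:
    service: str     # our service
    provider: str    # external API it calls
    key_owner: str   # "company" or the name of the person who owns the API key
    has_quota: bool  # True if the plan has a free-tier or request limit


def risky_dependencies(deps):
    """Return dependencies whose key is not company-owned or whose plan has a quota."""
    return [d for d in deps if d.key_owner != "company" or d.has_quota]


if __name__ == "__main__":
    # Hypothetical inventory entries.
    inventory = [
        ExternalDependency("branch-locator", "Yandex.Maps", "a developer", True),
        ExternalDependency("address-form", "Dadata", "company", True),
        ExternalDependency("notifications", "SMS gateway", "company", False),
    ]
    for dep in risky_dependencies(inventory):
        print(f"{dep.service} -> {dep.provider} (key owner: {dep.key_owner})")
```

An entry with a personal key or a quota is a candidate for the surprise described above, so it is worth reviewing such entries at acceptance time.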

Third, work with technical debt. When you put in hacks, or accept a service on the condition that something gets fixed, you need to make sure it actually gets done. Otherwise it may turn out that the little hole is not so little, and you will fall into it.

With architectural tasks, we had the Sphinx story. In one of the services, Sphinx was used to display lists. Just a paginated list, but it was reindexed every night. It was assembled from two indexes: a large one that was rebuilt every night, plus a small index that was attached to it. Every day, with a probability of 50%, it either blew up or it did not: the index fell over during the rebuild, and the news on our main page stopped updating. At first reindexing took 5 minutes, then the index grew, and at some point reindexing started taking 40 minutes. When we finally cut it out, we breathed a sigh of relief, because it was clear that a little more time would pass and reindexing would take a full working day. That would have been a failure for our portal: eight hours without news, and the business grinds to a halt.

Orphan Service Plan

In practice this is very hard to do, because DevOps is about communication. You want to be on good terms with your colleagues, and when you hit colleagues and managers over the head with regulations, they may develop mixed feelings toward the people doing it.

Beyond all these points, there is another important thing: specific people should be responsible for each specific service and for each specific part of the deployment procedure. When those people are gone and you have to bring in others, getting up to speed on the whole thing becomes hard.

If none of this helped and your orphan service is still an orphan, no one wants to adopt it, the documentation is not written, and the team assigned to it refuses to do anything, there is a simple way out: redo everything.

That is, you gather the requirements for the service again and write a new service, a better one, on a better platform, without strange technological decisions. And you migrate to it in production.

We had a situation where we took over a service on Yii 1 and realized we could not develop it further, because we had run out of developers who could write well in Yii 1. All our developers write well in Symfony 3. What to do? We allocated time, a team, and a manager, rewrote the project, and smoothly switched traffic over to it.

After that, the old service can be deleted. This is my favorite procedure: you remove the service from the configuration management system and then go through all the production machines to check that everything has been cleaned up, so that the developers leave no traces behind. The repository stays in Git.

That is all I wanted to say. I am ready for discussion; the topic is a hot one, and many have been through it.

The slides said that you unified the languages. The example was image resizing. Is it really necessary to standardize strictly on one language? Image resizing in PHP, well, could really have been done in Go.

In fact, it is optional, like all practices. In some cases it may even be undesirable. But you need to understand that if your technical department has 50 people, 45 of them PHP specialists, 3 more DevOps engineers who know Python, Ansible, Puppet and the like, and only one person writes an image resizing service in Go, then when he leaves, the expertise leaves with him. You will then have to look on the market for a developer who knows that specific language, which is harder if it is rare. So from an organizational point of view it is problematic. From a DevOps point of view, you will not be able to just clone a ready-made set of playbooks that you use to deploy services; you will have to write them again.

We are now building a service on Node.js, and that will mean one more platform next to each developer, with a separate language. But we sat down, thought it over, and decided it was worth it. So the point is to sit down and think.

How do you monitor your services? How do you collect and track logs?

We collect logs in Elasticsearch and view them in Kibana, and different collectors are used depending on whether it is a production or a test environment. Lumberjack in some places, something else in others, I do not remember. And in certain services there are a few more places where we install Telegraf and ship data somewhere else separately.

How to live with Puppet and Ansible in the same environment?

In fact, we now have two environments: one on Puppet, the other on Ansible. We are working on merging them. Ansible is a good tool for initial setup; Puppet is a poor one for initial setup because it requires manual work directly on the host. Puppet, on the other hand, ensures configuration convergence: the host keeps itself up to date, whereas to keep an Ansible-managed machine up to date you have to run playbooks against it regularly. That is the difference.
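The idea of convergence can be shown with a toy model. This is not how Puppet or Ansible work internally; it is a minimal sketch in which a host's state is a plain dictionary, and the package names and versions are hypothetical.

```python
def plan_changes(desired, actual):
    """Return the settings that must change for `actual` to match `desired`."""
    return {key: value for key, value in desired.items() if actual.get(key) != value}


def converge(desired, actual):
    """Apply the plan to `actual` in place and return what was changed."""
    changes = plan_changes(desired, actual)
    actual.update(changes)
    return changes


if __name__ == "__main__":
    # Hypothetical desired state and a host that has drifted from it.
    desired = {"nginx": "1.14", "php": "7.2"}
    host = {"nginx": "1.14", "php": "5.6"}
    print("first run:", converge(desired, host))
    print("second run:", converge(desired, host))
```

The first run fixes the drift and the second changes nothing. The difference described above is about who triggers such runs: an agent on the host doing it periodically by itself, or someone scheduling playbook runs from outside.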

How do you maintain compatibility? Do you have configs in both Ansible and Puppet?

This is our big pain: we maintain compatibility by hand and are thinking about how to move on from all this. As it stands, Puppet rolls out packages and maintains some links, while Ansible, for example, rolls out the code and applies fresh application configs.

The presentation mentioned different versions of Ruby. What was the solution?

We ran into this in one place, and we have to keep it in mind all the time. We simply switched off the part that ran on the Ruby version incompatible with the applications and kept it separate.

This year's DevOpsDays Moscow conference will be held on December 7 at Technopolis. We are accepting talk proposals until November 11. Write to us if you would like to speak.

Registration for participants is open, join us!

Source: habr.com
