What do we know about microservices

Hello! My name is Vadim Madison, and I lead development of the Avito System Platform. It has been said more than once how our company is moving from a monolithic architecture to microservices. It's time to share how we transformed our infrastructure to get the most out of microservices without getting lost in them: how PaaS helps us here, how we simplified deployment and reduced creating a microservice to a single click. Not everything I describe below is fully implemented in Avito; part of it is how we are evolving our platform.

(And at the end of this article, I will tell you about an opportunity to attend a three-day seminar by Chris Richardson, an expert in microservice architecture.)


How we came to microservices

Avito is one of the largest classifieds sites in the world; it publishes more than 15 million new ads per day. Our backend handles more than 20 thousand requests per second. We currently have several hundred microservices.

We have been building a microservice architecture for more than a year. Our colleagues described exactly how in detail in our track at RIT++ 2017. At CodeFest 2017 (see the video), Sergey Orlov and Mikhail Prokopchuk explained in detail why we needed the transition to microservices at all and what role Kubernetes played here. Now we are doing everything we can to minimize the scaling costs that are inherent in such an architecture.

Initially, we did not create an ecosystem that would comprehensively help us develop and launch microservices. We simply collected sensible open source solutions, ran them in-house, and left the developer to figure them out. As a result, the developer had to visit a dozen places (dashboards, internal services), which only strengthened the desire to keep writing code the old way, in the monolith. In the diagrams below, green indicates what the developer does with their own hands in one way or another, and yellow indicates automation.


Now, in the PaaS CLI utility, one command creates a new service, and two more add a new database to it and deploy it to Stage.


How to overcome the era of "microservice fragmentation"

With the monolithic architecture, for the sake of consistent product changes, developers had to keep track of what their neighbors were doing. With the new architecture, service contexts no longer depend on each other.

In addition, for the microservice architecture to be effective, it is necessary to establish many processes, namely:

• logging;
• request tracing (Jaeger);
• error aggregation (Sentry);
• statuses, messages, events from Kubernetes (Event Stream Processing);
• rate limiting / circuit breaker (Hystrix can be used);
• service connectivity control (we use Netramesh);
• monitoring (Grafana);
• assembly (TeamCity);
• communication and notification (Slack, email);
• task tracking (Jira);
• preparation of documentation.

To ensure that the system does not lose integrity and remains efficient as it scales, we rethought how microservices are organized in Avito.

How we manage microservices

A single, unified policy across the many microservices in Avito is maintained by:

  • division of infrastructure into layers;
  • the concept of Platform as a Service (PaaS);
  • monitoring everything that happens with microservices.

The infrastructure is divided into three layers of abstraction. Let's go through them from top to bottom.

A. Top layer - the service mesh. At first we tried Istio, but it turned out to use too many resources, which is too expensive at our volumes. So Alexander Lukyanchenko, a senior engineer on the architecture team, developed our own solution, Netramesh (available as open source), which we currently use in production and which consumes several times fewer resources than Istio (though it does not do everything Istio can boast of).
B. Middle layer - Kubernetes. We deploy and run microservices on it.
C. Bottom layer - bare metal. We don't use clouds or things like OpenStack; we run entirely on bare metal.

All the layers are tied together by PaaS. And this platform, in turn, consists of three parts.

I. Generators, managed through the CLI utility. It helps the developer create a microservice the right way and with minimal effort.

II. A consolidated collector with control over all the tools through a common dashboard.

III. Storage. It connects to schedulers that automatically set triggers for meaningful actions. Thanks to this system, not a single task is missed just because someone forgot to create a ticket in Jira. We use an internal tool called Atlas for this.


Microservices in Avito are also implemented according to a single scheme, which simplifies control over them at each stage of development and release.

How the standard microservice development pipeline works

In general, the microservice creation chain looks like this:

CLI-push → Continuous Integration → Bake → Deploy → Synthetic tests → Canary tests → Squeeze Testing → Production → Maintenance.

Let's go through it exactly in this order.

CLI-push

• Creating a microservice.
We struggled for a long time to teach every developer how to make microservices, including writing detailed instructions in Confluence. But the schemes kept changing and being extended. The result: a bottleneck formed at the very start of the journey - launching a microservice took much more time than acceptable, and there were still frequent problems when creating one.

In the end, we built a simple CLI utility that automates the basic steps of creating a microservice. In effect, it replaces the first git push. Here is exactly what it does.

- Creates a service from a template - step by step, in "wizard" mode. We have templates for the main programming languages in the Avito backend: PHP, Golang and Python.

- With a single command it deploys a local development environment on a given machine: Minikube starts, Helm charts are automatically generated and run in the local Kubernetes.

- Connects the required database. The developer does not need to know an IP, login or password to get access to the database they need, whether locally, in Stage or in production. Moreover, the database is deployed immediately in a fault-tolerant configuration and with balancing.

- Performs live rebuilds itself. Say the developer has fixed something in the microservice through their IDE. The utility sees the changes in the file system and, based on them, rebuilds the application (for Golang) and restarts it. For PHP we simply mount the directory into the pod and live reload happens "automatically".

- Generates autotests. As stubs, but quite usable ones.

• Deploying the microservice.

Deploying a microservice with us used to be a somewhat dreary process. The following were mandatory:

I. Dockerfile.

II. Config.
III. Helm-chart, which is cumbersome in itself and includes:

- the charts themselves;
- templates;
- specific values that take different environments into account.

We have taken the pain out of rewriting Kubernetes manifests: they are now generated automatically. Most importantly, we simplified deployment to the limit. From now on there is a Dockerfile, and the developer writes the entire config in a single short app.toml file.


Even filling in app.toml is now a matter of a minute. We specify where and how many copies of the service to run (on the dev server, on staging, in production) and list its dependencies. Note the line size = "small" in the [engine] block: this is the limit that will be allocated to the service via Kubernetes.
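
For illustration, here is a minimal sketch of what such a config might look like. Apart from size = "small" in the [engine] block, the section and field names are hypothetical, not Avito's actual schema:

    # Hypothetical app.toml sketch: only size = "small" in [engine] comes from
    # the article, the other sections and fields are illustrative.
    [engine]
    size = "small"      # resource tier that PaaS maps to Kubernetes limits

    [replicas]          # how many copies of the service to run in each environment
    dev = 1
    staging = 2
    production = 6

    [dependencies]      # databases and caches the service depends on
    postgres = "main"
    redis = "cache"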

Then, based on this config, all the necessary Helm charts are generated automatically and connections to the databases are created.

• Basic validation. Such checks are also automated.
We need to track:
- whether there is a Dockerfile;
- whether there is an app.toml;
- whether there is documentation;
- whether the dependencies are in order;
- whether alert rules are set.
On the last point: the owner of the service specifies which product metrics to monitor.

• Preparing documentation.
This is still a problem area. It seems the most obvious requirement, yet at the same time it is the record holder for "often forgotten", and therefore the most vulnerable link in the chain.
Every microservice must have documentation. It includes the following blocks.

I. A brief description of the service. Just a few sentences about what it does and why it is needed.

II. A link to the architecture diagram. It is important that it is easy to understand at a glance whether, for example, you use Redis for caching or as the main data store in persistent mode. In Avito, for now, this is a link to Confluence.

III. Runbook. A short guide to launching the service and the subtleties of operating it.

IV. FAQ, where it is worth anticipating the problems your colleagues may run into when working with the service.

V. A description of the API endpoints. If you fail to specify them, colleagues whose microservices are connected to yours will almost certainly pay for it. Now we use Swagger for this, plus our own solution called brief.

VI. Labels. Markers that show which product, functionality or structural division of the company the service belongs to. They help you quickly see, for example, whether you are building the same functionality that your colleagues rolled out for the same business unit a week ago.

VII. The service owner or owners. In most cases they can be determined automatically via PaaS, but to be safe we require the developer to specify them manually as well.

Finally, it's good practice to conduct documentation reviews, similar to code reviews.

Continuous Integration

  • Preparing repositories.
  • Creating a pipeline in TeamCity.
  • Granting access rights.
  • Searching for service owners. Here we use a hybrid scheme: manual marking plus minimal automation from PaaS. A fully automatic scheme fails when services are handed over to another development team for support or, for example, when the service's developer quits.
  • Registering the service in Atlas (see above), with all its owners and dependencies.
  • Checking migrations. We check whether any of them are potentially dangerous: for example, an ALTER TABLE pops up in one of them, or something else that could break data schema compatibility between different versions of the service. In that case the migration is not run but put into a subscription - PaaS should signal the service owner when it becomes safe to apply it (a sketch of such a check follows this list).
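
A minimal sketch of what such a check could look like, assuming migrations arrive as plain SQL text; the set of dangerous patterns and the function names are illustrative, not Avito's actual implementation:

    package migrations

    import "regexp"

    // dangerousPatterns lists statements that can break data schema
    // compatibility between service versions (illustrative set).
    var dangerousPatterns = []*regexp.Regexp{
        regexp.MustCompile(`(?i)\bALTER\s+TABLE\b`),
        regexp.MustCompile(`(?i)\bDROP\s+(TABLE|COLUMN)\b`),
        regexp.MustCompile(`(?i)\bRENAME\s+COLUMN\b`),
    }

    // IsPotentiallyDangerous reports whether a migration should be held back
    // and put into a "subscription" instead of being applied right away.
    func IsPotentiallyDangerous(sql string) bool {
        for _, p := range dangerousPatterns {
            if p.MatchString(sql) {
                return true
            }
        }
        return false
    }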

Bake

The next stage is packaging services before deployment.

  • Building the application. Classically, into a Docker image.
  • Generating Helm charts for the service itself and its related resources, including databases and caches. They are created automatically according to the app.toml config generated at the CLI-push stage.
  • Creating tickets for admins to open ports (when required).
  • Running unit tests and calculating code coverage. If coverage is below the specified threshold, the service most likely will not go any further, to deployment. If it is on the verge of acceptable, the service is assigned a "pessimizing" coefficient: if the metric does not improve over time, the developer will receive a notification that there is no progress on tests (and something should be done about it).
  • Accounting for memory and CPU limits. We mostly write microservices in Golang and run them in Kubernetes, which brings one subtlety related to the Go runtime: by default, all cores of the machine are used at startup unless the GOMAXPROCS variable is set explicitly, and when several such services run on the same machine they start competing for resources and interfering with each other. The graphs below show how execution time changes when an application runs without contention versus in a resource race. (The sources of the charts are here.)

Execution time (less is better): maximum 643 ms, minimum 42 ms.

Time per operation (less is better): maximum 14091 ns, minimum 151 ns.

At the build preparation stage, you can either set this variable explicitly or use the automaxprocs library from the folks at Uber.
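
In Go code this roughly comes down to either setting the limit explicitly or pulling in that library, whose blank import caps GOMAXPROCS at the container's CPU quota (a minimal sketch):

    package main

    import (
        "fmt"
        "runtime"

        _ "go.uber.org/automaxprocs" // on import, caps GOMAXPROCS at the container CPU quota
    )

    func main() {
        // With the blank import above, GOMAXPROCS already matches the cgroup limit;
        // the explicit alternative would be runtime.GOMAXPROCS(2) or similar.
        fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
    }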

Deploy

• Checking conventions. Before you start delivering service builds to the target environments, you need to check the following:
- API endpoints.
- Whether API endpoint responses match the schema.
- The log format.
- Whether headers are set for requests to the service (this is now done by Netramesh).
- Whether the owner marker is set when sending messages to the bus (event bus). This is needed to track service connectivity through the bus. You can send idempotent data to the bus, which does not increase the coupling between services (which is good), or business data, which does increase coupling (which is very bad!). And at the moment when this coupling becomes a problem, understanding who writes to and reads from the bus helps separate the services properly (a minimal sketch of such a marker is shown below).
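
A minimal sketch of what such an owner marker might look like, assuming a hypothetical Event envelope and Publish helper (the names and fields are illustrative, not Avito's actual bus API):

    package bus

    import (
        "context"
        "errors"
    )

    // Event is a hypothetical envelope for messages sent to the event bus.
    // Owner identifies the producing service, so that writers and readers of
    // the bus can be traced when coupling becomes a problem.
    type Event struct {
        Owner   string // e.g. "service-invoices"
        Type    string
        Payload []byte
    }

    // Publish enforces the convention: events without an owner marker are rejected.
    func Publish(ctx context.Context, send func(context.Context, Event) error, e Event) error {
        if e.Owner == "" {
            return errors.New("event rejected: owner marker is not set")
        }
        return send(ctx, e)
    }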

So far there are not very many such conventions in Avito, but their pool is expanding. The more such agreements exist in a form that is understandable and convenient for the team, the easier it is to maintain consistency between microservices.

Synthetic tests

• Closed-loop testing. For this we currently use the open source hoverfly.io. First it records the real load on the service, and then, in a closed loop, it emulates it.

• Load testing. We try to bring every service to optimal performance, and every version of every service should go through load testing: this way we understand the current performance of the service and how it differs from previous versions of the same service. If a service's performance has dropped by one and a half times after an update, that is a clear signal for its owners: they need to dig into the code and fix the situation.
We also use the collected data, for example, to implement autoscaling correctly and, ultimately, to understand how scalable the service is in general.

During load testing, we check whether resource consumption stays within the set limits, and we focus primarily on the extremes.

a) We look at the total load.
- Too small: most likely something is not working at all if the load has suddenly dropped severalfold.
- Too large: optimization is needed.

b) We look at the RPS cutoff.
Here we look at the difference between the current version and the previous one, and at the absolute number. For example, if a service delivers 100 RPS, it is either poorly written or this is its specific nature; either way, it is a reason to look at the service very closely.
If, on the contrary, there are too many RPS, then there is possibly some kind of bug and some of the endpoints have stopped executing their payload and simply return true;
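
Schematically, these checks against the previous version and against the extremes could look like the sketch below; the thresholds and the function itself are illustrative, not the actual PaaS logic:

    package loadtest

    import "fmt"

    // Verdicts compares a fresh load-test result with the previous version of the
    // same service and flags the extremes described above (thresholds are illustrative).
    func Verdicts(currentRPS, previousRPS float64) []string {
        var v []string
        if previousRPS > 0 && currentRPS < previousRPS/1.5 {
            v = append(v, "performance dropped by 1.5x or more: owners should dig into the code")
        }
        if currentRPS < 100 {
            v = append(v, fmt.Sprintf("only %.0f RPS: either poorly written or a known specific of the service", currentRPS))
        }
        if previousRPS > 0 && currentRPS > previousRPS*2 {
            v = append(v, "suspicious jump in RPS: some endpoints may have stopped doing real work")
        }
        return v
    }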

Canary tests

After the synthetic tests pass, we run the microservice on a small number of users. We start carefully, with a tiny share of the service's intended audience: less than 0.1%. At this stage it is very important that the right technical and product metrics are set up in monitoring, so that they surface a problem in the service as quickly as possible. The minimum canary test time is 5 minutes, the standard one is 2 hours. For complex services we set the duration manually.
We analyze:
- language-specific metrics, in particular php-fpm workers;
- errors in Sentry;
- response statuses;
- response time, exact and average;
- latency;
- exceptions, handled and unhandled;
- product metrics.

Squeeze Testing

Squeeze testing is testing "by squeezing". The technique got its name at Netflix. Its essence is that first we fill one instance with real traffic to the point of failure and thus determine its limit. Then we add another instance and load the pair, again to the maximum; we see their ceiling and the delta against the first "squeeze". And so, connecting one instance per step, we calculate the pattern in the changes.
Data from these "squeeze" tests also flows into a common metrics database, where we either enrich the artificial-load results with it or even replace the "synthetics" with it.
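
The procedure itself can be expressed as a simple loop; a minimal sketch, assuming a loadToFailure helper that drives real traffic at a group of N instances and returns the RPS ceiling it reached (everything here is illustrative, not the actual tooling):

    package squeeze

    // Result records the ceiling reached with a given number of instances
    // and the delta against the previous step.
    type Result struct {
        Instances  int
        CeilingRPS float64
        DeltaRPS   float64
    }

    // Run adds one instance per step, loads the group to failure each time
    // and collects the pattern of how the ceiling changes.
    func Run(maxInstances int, loadToFailure func(instances int) float64) []Result {
        results := make([]Result, 0, maxInstances)
        prev := 0.0
        for n := 1; n <= maxInstances; n++ {
            ceiling := loadToFailure(n)
            results = append(results, Result{Instances: n, CeilingRPS: ceiling, DeltaRPS: ceiling - prev})
            prev = ceiling
        }
        return results
    }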

Production

• Scaling. When rolling out a service to production, we monitor how it scales. In our experience, monitoring only CPU metrics is ineffective. Autoscaling benchmarked by RPS works in its pure form, but only for certain services, such as online streaming. So we look first of all at application-specific product metrics.

As a result, when scaling, we analyze:
- CPU and RAM indicators,
- the number of requests in the queue,
- response time,
— forecast based on accumulated historical data.

When scaling a service, it is also important to keep track of its dependencies, so that we do not end up scaling the first service in a chain while the ones it calls collapse under the load. To establish an acceptable load for the entire pool of services, we look at the historical data of the "nearest" dependent service (CPU and RAM combined with app-specific metrics) and compare it with the historical data of the initiating service, and so on along the entire "dependency chain", from top to bottom.

Maintenance

After the microservice is put into operation, we can hang triggers on it.

Here are typical situations in which triggers fire.
- Potentially dangerous migrations are detected.
- Security updates have been released.
- The service itself has not been updated for a long time.
- The load on the service has dropped noticeably, or some of its product metrics are out of range.
- The service no longer meets the new requirements of the platform.

Some of the triggers are responsible for operational stability, others act as a system maintenance function: for example, a service has not been redeployed for a long time and its base image has stopped passing security checks.

Dashboard

In short, the dashboard is the control panel of our entire PaaS.

  • A single point of information about a service, with data on its test coverage, the number of its images, the number of production copies, versions, etc.
  • A tool for filtering data by services and labels (markers of belonging to business units, product functionality, etc.).
  • A tool for integration with the infrastructure tools for tracing, logging and monitoring.
  • A single point of documentation for services.
  • A single view of all events for a service.


Summary

Before PaaS was introduced, a new developer could spend several weeks understanding all the tools needed to launch a microservice in production: Kubernetes, Helm, our internal TeamCity features, setting up connections to databases and caches in a fault-tolerant way, and so on. Now it takes a couple of hours to read the quickstart and create the service itself.

I gave a talk on this topic at HighLoad++ 2018; the video and the presentation are available.

Bonus track for those who read to the end

At Avito we are organizing an internal three-day training for developers with Chris Richardson, an expert in microservice architecture. We want to give one of the readers of this post the opportunity to take part in it. The training program has been posted here.

The training will be held from August 5 to 7 in Moscow. These are working days that will be fully occupied. Lunch and the training itself will be in our office; the selected participant pays for travel and accommodation.

You can apply to participate via this Google form. From you: an answer to the question of why you need to attend the training, and information on how to contact you. Answer in English, because Chris himself will choose the participant who gets into the training.
We will announce the name of the training participant in an update to this post and on Avito's social networks for developers (AvitoTech on Facebook, VKontakte and Twitter) no later than July 19.

Source: habr.com
