Testing on Prod: Canary Deployment

The canary is a small bird that sings constantly. Canaries are sensitive to methane and carbon monoxide: even a small concentration of these gases in the air makes them lose consciousness or die. Miners took the birds down into the mines: while the canary sang, it was safe to work; if it fell silent, there was gas in the mine and it was time to leave. The miners sacrificed a small bird to get out of the mines alive.


A similar practice has found its way into IT, for example in the standard task of deploying a new version of a service or application to production after testing. A test environment can be too expensive, automated tests don't cover everything you want, and skipping tests and sacrificing quality is risky. This is where the Canary Deployment approach comes in handy: some real production traffic is directed at the new version. The approach helps safely validate a new version in production, sacrificing a little for a greater cause. Andrey Markelov (Andrey_V_Markelov) explains how the approach works, why it is useful, and how to implement it, using its implementation at Infobip as an example.

Andrey Markelov is a Lead Software Engineer at Infobip who has been developing Java applications in finance and telecommunications for 11 years. He develops open source products, actively participates in the Atlassian Community, writes plugins for Atlassian products, and evangelizes Prometheus, Docker, and Redis.


About Infobip

Infobip is a global telecommunications platform that lets banks, retailers, online stores, and transport companies send messages to their customers via SMS, push notifications, email, and voice. In this business, stability and reliability matter so that customers receive their messages on time.

Infobip IT infrastructure in numbers:

  • 15 data centers around the world;
  • 500 unique services in operation;
  • 2,500 service instances, far more than the number of teams;
  • 4.5 TB of monthly traffic;
  • 4.5 billion phone numbers.

The business is growing, and with it the number of releases. We make up to 60 releases per day because customers want more features and capacity. But this is difficult: there are many services and few teams. You have to write code quickly, and it has to work in production without errors.

Releases

A typical release goes like this. For example, there are services A, B, C, D and E, each of them is developed by a separate team.


At some point, the team behind service A decides to deploy a new version, but the teams behind services B, C, D, and E do not know about it. There are two options for how team A can act.

It can do an incremental release: replace one instance with the new version first, and then the other.


But there is a second option: the team finds additional capacity and machines, deploys the new version there, and then switches the router so that the new version starts serving production traffic.


In any case, there are almost always problems after a deployment, even if the version was tested. Whether you test manually, automatically, or not at all, problems will arise. The easiest and most correct way to deal with them is to roll back to the working version. Only then should you assess the damage, find the causes, and fix them.

So what do we want?

We don't need problems. If customers find them before we do, it hurts our reputation, so we must find problems faster than customers. By working proactively, we minimize the damage.

At the same time, we want to speed up deployment so that it happens quickly, easily, naturally, and without pressure on the team. Engineers, DevOps engineers, and programmers must be protected: releasing a new version is stressful. The team is not expendable; we strive for rational use of human resources.

Deployment problems

Client traffic is unpredictable. It is impossible to predict when client traffic will be at its lowest. We don't know where or when clients will start their campaigns - maybe tonight in India, tomorrow in Hong Kong. Given the large time difference, deploying even at 2 am does not guarantee that customers will not be affected.

Provider problems. Messengers and providers are our partners. Sometimes they have crashes that cause errors during the deployment of new versions.

Distributed teams. The teams that develop the client side and the backend are in different time zones. Because of this, they often cannot agree among themselves.

Data centers cannot be replicated in staging. There are 200 racks in one data center; you can't even approximately reproduce that in a sandbox.

Downtime is unacceptable! We have an error budget: if we target, say, 99.99% uptime, the remaining fraction is our "margin for error". Achieving 100% reliability is impossible, but it is important to constantly monitor outages and downtime.
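The error budget is simple arithmetic: whatever availability target you commit to, the remainder is the downtime you are allowed to spend. A minimal sketch (class and method names are mine, not from the talk):

```java
// Sketch: turning an availability target into an "error budget".
// Names here are illustrative, not from Infobip's tooling.
public class ErrorBudget {
    // Minutes of allowed downtime in a period for a given availability target.
    static double allowedDowntimeMinutes(double availabilityPercent, double periodMinutes) {
        return periodMinutes * (100.0 - availabilityPercent) / 100.0;
    }

    public static void main(String[] args) {
        // 99.99% over a 30-day month leaves roughly 4.3 minutes of downtime.
        System.out.printf("%.2f minutes%n", allowedDowntimeMinutes(99.99, 30 * 24 * 60));
    }
}
```

At 99.99%, a 30-day month leaves only about 4.3 minutes of downtime to spend on failed deployments, which is why every rollback minute counts.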

Classic solutions

Write code without bugs. When I was a young developer, managers would ask me to release without bugs, but that is not always possible.

Write tests. Tests work, but not always the way the business wants; making money is not a test's job.

Test on staging. In 3.5 years at Infobip, I have never seen the state of staging match production even partially.


We even tried to develop this idea: first we had staging, then pre-production, and then another pre-production on top of that. It did not help either: the environments did not even match production in capacity. With staging we can guarantee basic functionality, but we do not know how it behaves under load.

The release is made by the one who developed it. This is a good practice: even someone who only renames something in a comment rolls the change out to production immediately. This helps develop responsibility and keeps the changes fresh in mind.

There are additional complications, too: manually checking everything takes a lot of time and is stressful for the developer.

Coordinated releases. Management usually proposes this option: "Let's agree that every day you test and roll out new versions." It does not work: there is always one team waiting for everyone else, or vice versa.

Smoke tests

Another way to solve our deployment problems. Consider how smoke tests work in the previous example, when team A wants to deploy a new version.

First, the team deploys one instance to production. Mocks send messages to this instance, simulating real traffic that matches normal daily traffic. If all is well, the team switches the new version onto user traffic.


The second option is to deploy on extra hardware. The team tests it against production traffic, then switches over, and everything works.


Disadvantages of smoke tests:

  • Tests cannot be trusted. Where do you get the same traffic as production? You can replay yesterday's or last week's, but it does not always match today's.
  • Difficult to maintain. You have to maintain test accounts and constantly reset them before each deployment so that test records don't end up in the production store. This is harder than writing a test in your own sandbox.

The only bonus here is that performance can be checked.

Canary releases

Due to the shortcomings of smoke tests, we started using canary releases.

A practice similar to how miners used canaries to detect gas has found its way into IT. We let some real production traffic hit the new version while trying to stay within the Service Level Agreement (SLA). The SLA is our "right to make a mistake", which we can spend once a year (or over some other period). If everything goes well, we add more traffic; if not, we return to the previous version.


Implementation and nuances

How did we implement canary releases? Suppose a group of clients sends messages through our service.


The deployment goes like this: we remove one node from under the balancer (1), change its version (2), and route a small, separate share of traffic to it (3).


On the whole, the group stays happy even if one user is unhappy. If everything is fine, we update all the remaining nodes.


I will show schematically how it looks for microservices in most cases.

There is Service Discovery and two services. The first service has one node (S1N1), which notifies Service Discovery when it starts, and Service Discovery remembers it. The second service has two nodes (S2N1 and S2N2), which also notify Service Discovery when they start.


The second service acts as a server for the first. The first service asks Service Discovery for information about its servers, and when it receives the list, it finds and health-checks them. Once the checks pass, it starts sending them messages.

When someone wants to deploy a new version of the second service, they tell Service Discovery that the second node will be a canary node: less traffic will be sent to it because a deployment is about to happen. We remove the canary node from under the balancer, and the first service stops sending traffic to it.


We change the version, and Service Discovery knows that the second node is now a canary: it can be given less load (5%). If everything is fine, we roll out the new version everywhere, restore the load, and carry on.
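The 5% figure suggests a simple client-side rule: while a node is flagged as canary in Service Discovery, route only a small share of requests to it. A minimal sketch under that assumption (the class, the node names, and the exact share are illustrative, not Infobip's code):

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of client-side balancing that honors the canary flag published
// in service discovery. The 5% share and node names are illustrative.
public class CanaryBalancer {
    record Node(String address, boolean canary) {}

    static final double CANARY_SHARE = 0.05; // 5% of traffic during the deploy

    // roll is a random number in [0, 1): below CANARY_SHARE the request
    // goes to the canary; otherwise it is spread over the regular nodes.
    static String pick(List<Node> nodes, double roll) {
        List<Node> regular = nodes.stream().filter(n -> !n.canary()).toList();
        Node canary = nodes.stream().filter(Node::canary).findFirst().orElse(null);
        if (canary != null && roll < CANARY_SHARE) return canary.address();
        return regular.get((int) (roll * regular.size())).address();
    }

    public static void main(String[] args) {
        List<Node> s2 = List.of(new Node("S2N1", false), new Node("S2N2", true));
        System.out.println("routing to " + pick(s2, ThreadLocalRandom.current().nextDouble()));
    }
}
```

In a real system the share would come from Service Discovery itself, so it can be raised gradually as confidence in the new version grows.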

To implement all this, we need:

  • balancing;
  • monitoring, because it is important to know what each user expects and how our services work in detail;
  • version analysis, to understand how well the new version will work in production;
  • automation - we script the deployment sequence (a deployment pipeline).


Balancing

This is the first thing we should think about. There are two balancing strategies.

The simplest option is when one node is always the canary. It always receives less traffic, and we start every deployment from it. In case of problems, we compare its behavior before and during the deployment. For example, if there are 2 times more errors, the damage has doubled.

A canary node is designated during the deployment process. When the deployment ends and we remove the canary status from it, the traffic balance is restored. With fewer machines, we get a fair distribution.

Monitoring

The cornerstone of canary releases. We must understand exactly why we are doing this and what metrics we want to collect.

Examples of metrics we collect from our services.

  • Number of errors written to the logs. This is a clear indicator of whether everything is working as it should. In general, this is a good metric.
  • Query Execution Time (latency). Everyone monitors this metric because everyone wants to work fast.
  • Queue size (throughput).
  • Number of successful responses per second.
  • Execution time of 95% of all requests.
  • Business metrics: how much money a business makes in a given amount of time or user churn. These metrics for our new version may be more important than those added by the engineers.

Examples of metric types in the most popular monitoring systems.

Counter. This is a monotonically increasing value, for example the number of errors. The metric is easy to interpolate and read on a chart: yesterday there were 2 errors, and today 500, which means something went wrong.

The number of errors per minute or per second is the most important indicator that can be computed from a Counter. This data gives a clear picture of how the system behaves over time. Consider an example graph of errors per second for two versions of a production system.
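The calculation described here can be sketched directly: take two samples of the counter and divide the difference by the window length. This is a simplified model of what Prometheus' rate() does over a window (rate() also handles counter resets, which are omitted here):

```java
// Sketch: errors-per-second from two samples of a monotonically
// increasing counter. A simplified model of Prometheus' rate();
// counter resets are ignored for brevity.
public class CounterRate {
    static double perSecond(long earlierValue, long laterValue, long windowSeconds) {
        return (laterValue - earlierValue) / (double) windowSeconds;
    }

    public static void main(String[] args) {
        // 500 new errors accumulated over a 5-minute window:
        System.out.printf("%.2f errors/s%n", perSecond(2, 502, 300));
    }
}
```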


There were few errors in the first version; perhaps the audit did not even fire. In the second version, everything is much worse. We can say for sure that there are problems, so this version should be rolled back.

Gauge. Similar to Counter, but we record values that can both increase and decrease, for example query execution time or queue size.

The graph shows an example of latency. The two versions look similar and workable. But if you look closely, you can see how the value changes: if query execution time grows as users are added, it is immediately clear there is a problem - this was not the case before.


Summary. One of the most important indicators for business is percentiles. The metric shows that in 95% of cases our system works the way we want. We can tolerate problems somewhere, because we understand the overall trend of how good or bad things are.
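The "execution time of 95% of all requests" from the metrics list is a percentile over a window of samples. A minimal nearest-rank sketch (the sample values are illustrative):

```java
import java.util.Arrays;

// Sketch: "execution time of 95% of all requests" as a nearest-rank
// percentile over a window of latency samples.
public class Percentile {
    static double nearestRank(double[] samples, double percentile) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        // Nearest-rank method: the value below which the requested
        // fraction of samples falls.
        int rank = (int) Math.ceil(percentile / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        double[] latenciesMs = {12, 15, 11, 90, 14, 13, 16, 12, 11, 300};
        System.out.println("p95 = " + nearestRank(latenciesMs, 95) + " ms");
    }
}
```

Note how one slow outlier dominates p95 while leaving the median almost untouched; that is exactly why percentiles reveal problems that averages hide.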

Tools

ELK Stack. You can implement canary analysis using Elasticsearch: we write errors to it as events occur. With the simplest API call, you can get the number of errors at any given time and compare it with past intervals: GET /applog/_count?q=level:error.

Prometheus. It has proven itself well at Infobip. It makes multidimensional metrics possible because labels are used.

We can use level, instance, and service labels and combine them in one system. Using offset, you can see, for example, a value from a week ago with a single call to GET /api/v1/query?query={query}, where {query} is:

rate(logback_appender_total{ 
    level="error",  
    instance=~"$instance" 
}[5m] offset $offset_value)

Version Analysis

There are several versioning strategies.

View metrics of canary nodes only. One of the simplest options: deploy a new version and study only its metrics. But if the engineer spends that time poring over the logs, nervously reloading pages, this solution is no different from the others.

Canary node is compared to any other node. This is a comparison with other instances running on full traffic. For example, if things are worse on a small share of traffic, or no better than on the real instances, then something is wrong.

The canary node is compared to itself in the past. The nodes designated as canary can be compared with historical data. For example, if everything was fine a week ago, that data can serve as a baseline for understanding the current situation.

Automation

We want to free engineers from manual comparison, so it's important to implement automation. The deployment pipeline usually looks like this:

  • start the deployment;
  • remove the node from under the balancer;
  • set up the canary node;
  • turn the balancer back on with a limited share of traffic;
  • compare.


At this stage, we implement automatic comparison. Let's look at a Jenkins example of what it can look like and why it is better than checking after the deployment.

This is the pipeline in Groovy.

while (System.currentTimeMillis() < endCanaryTs) {
    def isOk = compare(srv, canary, time, base, offset, metrics)
    if (isOk) {
        sleep DEFAULT_SLEEP
    } else {
        echo "Canary failed, need to revert"
        return false
    }
}

Here the loop says that we will compare the new node for an hour. While the canary period has not ended, we call a function that reports whether everything is fine: def isOk = compare(srv, canary, time, base, offset, metrics).

If all is good, we sleep DEFAULT_SLEEP (for example, a second) and continue. If not, we exit: the deployment failed.

Description of the metric. Let's see what the compare function might look like, using a DSL example.

metric(
    'errorCounts',
    'rate(errorCounts{node=~"$canaryInst"}[5m] offset $offset)',
    { baseValue, canaryValue ->
        if (canaryValue > baseValue * 1.3) return false
        return true
    }
)

Let's say we are comparing the number of errors and we want to know the number of errors per second for the last 5 minutes.

We have two values: from the base node and from the canary node. The canary value is the current one; the base value, baseValue, comes from any other, non-canary node. We compare the values according to a formula that we set based on our experience and observations. If canaryValue is bad, the deployment failed, and we roll back.
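The rule in the DSL boils down to one comparison. A sketch in Java, keeping the 1.3 threshold from the snippet (it allows the canary up to 30% more errors than the base node; the class and method names are illustrative):

```java
// Sketch of the comparison rule: the canary fails when its value
// exceeds the base node's value by more than 30% (the 1.3 factor
// from the DSL example above).
public class CanaryCheck {
    static final double TOLERANCE = 1.3;

    static boolean isOk(double baseValue, double canaryValue) {
        return canaryValue <= baseValue * TOLERANCE;
    }

    public static void main(String[] args) {
        System.out.println(isOk(10.0, 12.0)); // +20% errors: still within budget
        System.out.println(isOk(10.0, 20.0)); // twice the errors: roll back
    }
}
```

The threshold itself is a judgment call tuned from experience, which is exactly what the text says: the formula comes from observation, not from theory.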

Why is all this necessary?

A person cannot check hundreds or thousands of metrics, especially quickly. Automatic comparison checks all the metrics and notifies you of problems quickly. The timing of the alert is critical: if something happened in the last 2 seconds, the damage will not be as great as if it happened 15 minutes ago. By the time someone notices a problem, writes to support, and support asks us to roll back, we may lose customers.

If the process completes and everything is fine, we deploy all the other nodes automatically. Engineers do nothing during this time. Only when launching the canary do they decide which metrics to take, how long to run the comparison, and which strategy to use.


If there are problems, we automatically roll the canary node back, keep running the previous version, and fix the errors we found. The metrics make it easy to find them and to see the damage from the new version.

Obstacles

Of course, this is not easy to implement. First of all, you need a common monitoring system. Engineers have their own metrics, support and analysts have others, and the business has a third set. The common system is the shared language that business and development speak.

Metric stability needs to be tested in practice. Checking helps you understand the minimum set of metrics needed to guarantee quality.

How to achieve this? Use the canary service outside of deployments as well. We add a service that, on the old version, can at any time take any dedicated node and reduce its traffic without a deployment. Afterwards we compare: we study the errors and look for the threshold at which quality is still achieved.


How did we benefit from canary releases?

Minimized the damage from bugs. Most deployment errors are due to inconsistency in some data or in priorities. There are far fewer such errors now, because we can address a problem within the first seconds.

Optimized team work. Beginners have a "right to make a mistake": they can deploy to production without fear, which adds initiative and an incentive to work. If they break something, it will not be critical, and whoever made the mistake will not be fired.

Automated deployment. It is no longer a manual process, as before, but a genuinely automated one. However, it takes longer.

Highlighted important metrics. The whole company, from business to engineering, understands what really matters in our product and which metrics reflect it, for example the outflow and inflow of users. We control the process: we test metrics, introduce new ones, and watch how the old ones behave in order to build a system that earns money more productively.

We have a lot of cool practices and systems that help us. Despite this, we strive to be professional and do our job well, regardless of whether we have a system that will help us or not.

Engineering approaches and practices are the main focus of TechLead Conf. If you have achieved success on the path to technical excellence and are ready to share what helped you get there, apply for a talk.

TechLead Conf is planned for June 8. We understand that it is difficult to make decisions about conference participation right now. At the same time, we believe that quarantine is no reason to stop professional communication and development. One way or another, we will find a way to discuss a tech lead's tasks and approaches to solving them - if necessary, we will go online and set up the networking there!

Source: habr.com
