Top Cyan fails

Hi everyone!

My name is Nikita, and I lead the Cyan engineering team. One of my responsibilities in the company is to reduce the number of production infrastructure incidents to zero.
Everything described below caused us a lot of pain, and the purpose of this article is to keep other people from repeating our mistakes, or at least to minimize their impact.

Preamble

A long time ago, when Cyan consisted of monoliths and there was no hint of microservices yet, we measured the availability of the resource by checking 3–5 pages.

If they responded, everything was fine; if they did not respond for a long time, an alert was raised. How long they had to be down for it to count as an incident was decided by people at meetings. The engineering team was always involved in investigating an incident. When the investigation was completed, they wrote a post-mortem: a kind of report sent by email in the format of what happened, how long it lasted, what we did in the moment, and what we will do in the future.

The main pages of the site, or how we know we have hit rock bottom

 
To be able to prioritize an error somehow, we identified the pages most critical for the business functionality of the site. Based on them, we count the number of successful and failed requests, plus timeouts. This is how we measure uptime.

Let's say we know that there is a set of super-important sections of the site responsible for the core service: searching for and submitting ads. If more than 1% of requests to them fail, that is a critical incident. If the error rate exceeds 0.1% for 15 minutes during prime time, that is also considered a critical incident. These criteria cover most incidents; the rest are beyond the scope of this article.
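
A minimal sketch of how such thresholds can be expressed, assuming per-window counters of successful and failed requests are already collected (the counter values and window handling below are illustrative, not our actual monitoring code):

# Hypothetical incident-detection sketch in Python: the thresholds mirror
# the criteria above, but the data source and names are made up.
from dataclasses import dataclass

@dataclass
class WindowStats:
    ok: int        # successful requests in the window
    failed: int    # failed requests plus timeouts in the window

def is_critical(stats: WindowStats, prime_time: bool) -> bool:
    total = stats.ok + stats.failed
    if total == 0:
        return False
    error_rate = stats.failed / total
    # More than 1% of failed requests is always a critical incident.
    if error_rate > 0.01:
        return True
    # During prime time, 15 minutes above 0.1% is also critical
    # (the window is assumed to already be 15 minutes long).
    return prime_time and error_rate > 0.001

print(is_critical(WindowStats(ok=99_000, failed=1_500), prime_time=False))  # True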

Top Cyan incidents

So, we learned to reliably determine that an incident has happened.

Now every incident is described in detail and recorded as a Jira epic. By the way, we created a separate project for this and called it FAIL; only epics can be created in it.

If we collect all the fails of the past few years, the leaders are:

  • MSSQL-related incidents;
  • incidents caused by external factors;
  • admin errors.

Let's dwell in more detail on admin errors, as well as on some other interesting fails.

Fifth place - "Cleaning up the DNS"

It was a stormy Tuesday. We decided to clean up the DNS cluster.

We wanted to move the internal DNS servers from BIND to PowerDNS, allocating completely separate servers for this, with nothing on them but DNS.

We placed one DNS server in each location of our DCs, and the moment came to move the zones from BIND to PowerDNS and switch the infrastructure over to the new servers.

In the midst of the move, of all the servers listed in the local caching BINDs on all machines, only one remained, and it was in the data center in St. Petersburg. That DC had originally been declared non-critical for us, but it suddenly became a single point of failure.
Right during this migration window, the link between Moscow and St. Petersburg went down. We were effectively left without DNS for five minutes and came back up when the hosting provider fixed the problem.

Conclusions:

We used to neglect external factors when preparing for maintenance; now they are also on the list of things we prepare for. We now strive for all components to be redundant at the N-2 level, and for the duration of the work we allow lowering this to N-1.

  • When drawing up an action plan, mark the points where the service can fail, and think through in advance a scenario where everything goes as badly as it possibly can.
  • Distribute internal DNS servers across different geolocations/data centers/racks/switches/inputs; a simple redundancy check is sketched after this list.
  • Install a local caching DNS server on every machine that forwards requests to the main DNS servers and, if they are unavailable, answers from its cache.
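
To illustrate the second point, a hypothetical pre-maintenance check in Python (using the dnspython package; the server addresses and probe record are placeholders) that verifies every internal resolver in every location still answers:

# Hypothetical pre-maintenance check: confirms each internal DNS server
# still resolves a known record. Addresses below are placeholders.
import dns.resolver  # pip install dnspython

INTERNAL_DNS = {
    "moscow-dc1": "10.0.1.53",
    "moscow-dc2": "10.0.2.53",
    "spb-dc1": "10.1.1.53",
}
PROBE_NAME = "example.internal."  # a record every server must be able to resolve

def check_server(address: str) -> bool:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [address]
    try:
        resolver.resolve(PROBE_NAME, "A", lifetime=2.0)
        return True
    except Exception:
        return False

alive = {name: check_server(addr) for name, addr in INTERNAL_DNS.items()}
for name, ok in alive.items():
    print(f"{name}: {'OK' if ok else 'FAILED'}")
if sum(alive.values()) < 2:
    raise SystemExit("Fewer than two DNS servers alive; do not start the maintenance")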

Fourth place - "Cleaning up Nginx"

One fine day, our team decided that "enough is enough," and the refactoring of the nginx configs began. The main goal was to bring the configs into an intuitive structure. Previously, everything had "grown historically" and carried no logic of its own. Now every server_name has been moved into a file of the same name, and all configs have been distributed into folders. By the way, the config contains 253,949 lines or 7,836,520 characters and takes up almost 7 megabytes. Top-level structure:

nginx structure

├── access
│   ├── allow.list
...
│   └── whitelist.conf
├── geobase
│   ├── exclude.conf
...
│   └── geo_ip_to_region_id.conf
├── geodb
│   ├── GeoIP.dat
│   ├── GeoIP2-Country.mmdb
│   └── GeoLiteCity.dat
├── inc
│   ├── error.inc
...
│   └── proxy.inc
├── lists.d
│   ├── bot.conf
...
│   ├── dynamic
│   └── geo.conf
├── lua
│   ├── cookie.lua
│   ├── log
│   │   └── log.lua
│   ├── logics
│   │   ├── include.lua
│   │   ├── ...
│   │   └── utils.lua
│   └── prom
│       ├── stats.lua
│       └── stats_prometheus.lua
├── map.d
│   ├── access.conf
│   ├── .. 
│   └── zones.conf
├── nginx.conf
├── robots.txt
├── server.d
│   ├── cian.ru
│   │   ├── cian.ru.conf
│   │   ├── ...
│   │   └── my.cian.ru.conf
├── service.d
│   ├── ...
│   └── status.conf
└── upstream.d
    ├── cian-mcs.conf
    ├── ...
    └── wafserver.conf

Things got much better, but in the process of renaming and moving the configs, some of them ended up with the wrong extension and were not picked up by the include *.conf directive. As a result, some hosts became unavailable and returned a 301 to the main page. Because the response code was not 5xx/4xx, this was not noticed immediately, but only in the morning. After that, we started writing tests that check infrastructure components.

Conclusions: 

  • Structure your configs properly (not just nginx) and think about the structure early in the project. That makes them more understandable to the team, which in turn reduces TTM.
  • Write tests for some infrastructure components, for example a check that all key server_names return the correct status and response body (a sketch follows this list). It is enough to have a few scripts at hand that check the component's main functions, so you don't have to frantically remember at 3 a.m. what else needs checking.
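
A minimal sketch of such a test, assuming a list of key hosts with the expected status and a body marker for each (the hostnames, markers and edge address are placeholders, not our real checklist):

# Hypothetical smoke test for nginx virtual hosts: every key server_name
# must answer with the expected status and contain a marker string.
import requests

CHECKS = [
    # (host, path, expected status, substring the body must contain)
    ("www.cian.ru", "/", 200, "<title>"),
    ("my.cian.ru", "/", 200, "<title>"),
]
NGINX_EDGE = "http://127.0.0.1"  # the edge that was just reconfigured

failures = []
for host, path, status, marker in CHECKS:
    resp = requests.get(NGINX_EDGE + path, headers={"Host": host},
                        timeout=5, allow_redirects=False)
    if resp.status_code != status or marker not in resp.text:
        failures.append((host, resp.status_code))

if failures:
    raise SystemExit(f"Broken vhosts: {failures}")
print("All key server_names look healthy")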

Third place - "Suddenly ran out of space in Cassandra"

The data grew steadily, and everything was fine until repairs of large keyspaces in the Cassandra cluster started failing because compaction could not run on them.

On one rainy day, the cluster almost turned into a pumpkin, namely:

  • about 20% of space remained in the cluster overall;
  • it was impossible to fully add nodes, because cleanup failed after adding a node due to the lack of space on the partitions;
  • performance gradually degraded, since compaction was not running;
  • the cluster was in emergency mode.

The way out: we added 5 more nodes without running cleanup, after which we began to systematically take the nodes that had run out of space out of the cluster and re-add them as empty nodes. Far more time was spent than we would have liked. There was a risk of partial or complete unavailability of the cluster.

Conclusions:

  • No Cassandra server should have more than 60% of space occupied on any partition (a simple guard for this is sketched after the list).
  • CPU load on them should stay below 50%.
  • Don't forget about capacity planning; it needs to be thought through for each component based on its specifics.
  • The more nodes in the cluster, the better: servers that hold a small amount of data stream their data faster, and such a cluster is easier to bring back to life.
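
A trivial sketch of the first rule as a per-host guard, assuming it runs from cron on every Cassandra server (the data path and the plain print instead of real alerting are placeholders):

# Hypothetical per-host guard: warn before a Cassandra data partition
# crosses the 60% threshold mentioned above.
import shutil

DATA_PATHS = ["/var/lib/cassandra/data"]  # adjust to the real data directories
THRESHOLD = 0.60

for path in DATA_PATHS:
    usage = shutil.disk_usage(path)
    used_ratio = (usage.total - usage.free) / usage.total
    if used_ratio > THRESHOLD:
        # In real life, send this to the alerting system instead of printing.
        print(f"WARNING: {path} is {used_ratio:.0%} full (limit {THRESHOLD:.0%})")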

Second place - "Data disappeared from consul key-value storage"

For service discovery, we, like many others, use Consul. We also use its key-value store for the blue-green deployment of the monolith: it stores information about active and inactive upstreams, which are swapped during deployment. A deployment service was written to interact with the KV. At some point, the data in the KV disappeared. We restored it from memory, but with a number of errors. As a result, during deployments the load was distributed unevenly across the upstreams, and we got a lot of 502 errors because the backends were overloaded on CPU. In the end, we moved from Consul KV to Postgres, from where the data is not so easy to delete.

Conclusions:

  • Services without any authorization should not contain data that is critical for the operation of the site. For example, if you do not have authorization in ES, it is better to forbid access at the network level from everywhere it is not needed, allow only what is necessary, and also set action.destructive_requires_name: true.
  • Rehearse the backup and restore mechanism in advance. For example, write a script ahead of time (say, in Python) that can do both backup and restore; one possible shape of it is sketched below.
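
One possible shape of such a script, using Consul's standard /v1/kv HTTP API through requests (the Consul address and dump file name are assumptions):

# Hypothetical Consul KV backup/restore helper built on the /v1/kv HTTP API.
# The Consul address and dump file name are placeholders.
import base64
import json
import sys

import requests

CONSUL = "http://127.0.0.1:8500"
DUMP_FILE = "consul_kv_dump.json"

def backup() -> None:
    resp = requests.get(f"{CONSUL}/v1/kv/?recurse=true", timeout=10)
    resp.raise_for_status()
    with open(DUMP_FILE, "w") as f:
        json.dump(resp.json(), f, indent=2)

def restore() -> None:
    with open(DUMP_FILE) as f:
        entries = json.load(f)
    for entry in entries:
        value = base64.b64decode(entry["Value"] or "")  # Value is base64 or null
        resp = requests.put(f"{CONSUL}/v1/kv/{entry['Key']}", data=value, timeout=10)
        resp.raise_for_status()

if __name__ == "__main__":
    {"backup": backup, "restore": restore}[sys.argv[1]]()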

First place - "Captain Obvious" 

At some point, we noticed an uneven load distribution across nginx upstreams whenever a backend had 10+ servers. Because round-robin sent requests from the first to the last upstream in order, and every nginx reload started from the beginning again, the first upstreams always received more requests than the rest. As a result, they ran slower and the whole site suffered, and this became more and more noticeable as traffic grew. Simply upgrading nginx to enable random did not work: we would have had to rework a pile of Lua code that did not run on version 1.15 (at that time). So we patched our nginx 1.14.2, adding random support to it. This solved the problem. This bug wins the "Captain Obvious" category.
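
To make the effect concrete, here is a small illustrative simulation in Python (not nginx code): if the round-robin pointer resets to the first upstream on every reload, the upstreams at the start of the list accumulate more requests:

# Illustrative simulation of the bias described above: a round-robin pointer
# that resets on every "reload" skews traffic towards the first upstreams.
from collections import Counter
import random

UPSTREAMS = [f"backend-{i:02d}" for i in range(12)]
hits = Counter()

for _reload in range(200):                          # frequent config reloads
    pointer = 0                                     # round-robin starts over
    for _request in range(random.randint(5, 40)):   # requests before the next reload
        hits[UPSTREAMS[pointer]] += 1
        pointer = (pointer + 1) % len(UPSTREAMS)

for name in UPSTREAMS:
    print(name, hits[name])
# The first few backends end up with noticeably more requests than the last one.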

Conclusions:

It was very interesting and exciting to dig into this bug :)

  • Set up your monitoring so that it helps you find such fluctuations quickly. For example, you can use ELK to watch the rps on each backend of every upstream and their response time from nginx's point of view. In our case, this is what helped us identify the problem.

In the end, most of these fails could have been avoided with a more scrupulous approach to what you are doing. We should always remember Murphy's Law ("Anything that can go wrong will go wrong") and build components with it in mind.

Source: habr.com
