Slurm SRE. Continuous experiment with experts from Booking.com and Google.com

Our team loves experiments. Each Slurm is not a static repetition of the previous ones, but a reflection on experience and a transition from good to better. But with Slurm SRE we decided to apply a completely new format - to give the participants conditions that are as close as possible to "combat".

To briefly describe what we did at the intensive: "We build, break, repair and learn." SRE is worth little as pure theory - only practice counts: real solutions, real problems.

The participants were divided into teams so that a lively competitive spirit would not let anyone fall asleep or launch "Angry Birds" on their iPhone, following the example of Dmitry Anatolyevich.

Four mentors supplied the participants with problems, glitches, bugs and tasks: Ivan Kruglov, Principal Developer at Booking.com (Netherlands); Ben Tyler, Principal Developer at Booking.com (USA); Eduard Medvedev, CTO at Tungsten Labs (Germany); and Eugene Varavva, General Developer at Google (San Francisco).

On top of that, the participants are split into teams and compete with each other. Interesting?

Ivan, Ben, Eduard and Eugene watch the poor Slurm SRE participants with a kindly Leninist squint before the competition starts.

So the task is:

We are ours, we will build a new world ...

There is a movie ticket aggregator website. The mentors stage incidents according to a pre-designed scenario (although nobody rules out some particularly refined and insidious improvisation), and the health of the site is described by various metrics. The problems vary widely: Moulin Rouge theater tickets are not loaded into the database; posters for films and shows take more than 10 seconds to load into the database; the description of a particular movie freezes; 0.1% of orders land on seats that are already reserved; the payment processing system periodically drops out for a minute or two. And many, many more unpleasant things that can befall a Slurm SRE participant in their real job.

We are ready to handle everything... and everyone.

Our long-suffering site consists of several microservices. It aggregates data on screenings, prices and free seats from all cinemas, shows movie announcements, lets you choose a cinema, screening, hall and seat, and lets you book and pay for tickets. In short, everything a moviegoer could dream of. The user, meanwhile, does not even suspect what a titanic struggle for the site's stability and availability is going on inside.

For the intensive's site, we defined SLOs, SLIs and an SLA, designed the architecture and infrastructure, deployed the site, and set up monitoring and alerting. And off we went.

SLO, SLI, SLA

SLI stands for Service Level Indicator, SLO for Service Level Objective, and SLA for Service Level Agreement.

SLA is an ITIL methodology term that refers to a formal contract between a service customer and its provider, containing a description of the service, the rights and obligations of the parties, and, most importantly, an agreed level of quality for the provision of this service.

An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural form for an SLO is "SLI ≤ target" or "lower bound ≤ SLI ≤ upper bound".
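The two forms of SLO quoted above boil down to a range check on a measured SLI. Here is a minimal sketch of that idea. The `SLO` class, its field names and the numeric targets are all assumptions made up for illustration; the article does not say which targets were actually used at the intensive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SLO:
    """A target range for one SLI; either bound may be left open."""
    name: str
    lower: Optional[float] = None  # hypothetical lower bound, None = no bound
    upper: Optional[float] = None  # hypothetical upper bound, None = no bound

    def is_met(self, sli: float) -> bool:
        # "lower bound <= SLI <= upper bound", skipping any open side
        if self.lower is not None and sli < self.lower:
            return False
        if self.upper is not None and sli > self.upper:
            return False
        return True


# "SLI <= target": an assumed latency objective of 0.5 seconds
latency_slo = SLO(name="p99 latency, seconds", upper=0.5)
# "lower bound <= SLI <= upper bound": an assumed utilisation band
load_slo = SLO(name="CPU utilisation", lower=0.2, upper=0.8)

print(latency_slo.is_met(0.31))  # True
print(load_slo.is_met(0.93))     # False
```

Leaving a bound as `None` lets a single class cover both the one-sided and the two-sided form.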

SLI is a service level indicator, a carefully defined quantitative measure of one aspect of the level of service being delivered. For most services, the key SLI is request latency - how long it takes to return a response to a request. Other common SLIs include error rates, often expressed as a fraction of all requests received, and system throughput, usually measured in requests per second.
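To make the three common SLIs concrete, here is a rough sketch that derives latency, error rate and throughput from a list of request records. The `Request` record, the function names and the sample numbers are invented for the example and are not taken from the intensive's monitoring setup. The percentile uses the simple nearest-rank method; real monitoring systems often use histogram approximations instead.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Request:
    latency_ms: int  # time taken to return the response, in milliseconds
    ok: bool         # False if the request ended in an error


def latency_percentile(requests: List[Request], pct: float) -> int:
    """Nearest-rank percentile of request latency."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(requests: List[Request]) -> float:
    """Failed requests as a fraction of all requests received."""
    return sum(1 for r in requests if not r.ok) / len(requests)


def throughput(requests: List[Request], window_s: float) -> float:
    """Requests per second over the observation window."""
    return len(requests) / window_s


# Made-up sample: 10 requests observed over a 5-second window, one failure
latencies = [120, 80, 95, 300, 110, 90, 2500, 105, 100, 130]
sample = [Request(ms, ok=(ms != 2500)) for ms in latencies]

print(latency_percentile(sample, 50))     # 105
print(latency_percentile(sample, 90))     # 300
print(error_rate(sample))                 # 0.1
print(throughput(sample, window_s=5.0))   # 2.0
```

Note how the single slow request barely moves the median but dominates the high percentiles, which is why latency SLIs are usually stated as percentiles rather than averages.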

First of all we break the planes, and the girls, well, the girls come later ...

From the very first minutes, internal and external factors began to "spoil" the SLOs. Everything came down on the admins' heads: developer errors, infrastructure failures, a surge of visitors, DDoS attacks. Anything that degrades an SLO.

"- Dear participants, I am happy to tell you that the first thing to go down will be ... everything!"

Along the way, the speakers discussed stability, error budget, testing practices, interrupt management and operational load.
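For readers new to the term, an error budget is simply the amount of unreliability an SLO still permits. Here is a minimal sketch of the arithmetic, assuming a request-based availability SLO. The function names, the 99.9% target and the request counts are illustrative assumptions, not figures from the intensive.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests that may fail without breaching the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the budget still unspent (negative once the SLO is breached)."""
    budget = error_budget(slo_target, total_requests)
    return (budget - failed_requests) / budget


# Assumed numbers: a 99.9% target over one million requests
print(round(error_budget(0.999, 1_000_000)))              # 1000
print(round(budget_remaining(0.999, 1_000_000, 250), 2))  # 0.75
```

While the remaining fraction is positive, a team can afford risky releases; once it goes negative, the usual SRE practice is to slow down changes and spend effort on stability.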

We are not stokers, we are not carpenters ...

Then the participants started on repairs - the main thing was to work out what to tackle first.

"- Lord, I have never seen it break like this, in such a form and in such a pose!"

So, there has been an incident in the payment processing service. How should you act to restore service in the shortest possible time?

The experts, gazing fondly at the participants, prepare another trick.

Each team organizes its incident response group: it brings in colleagues, notifies the stakeholders and sets priorities. This way the participants practiced working under pressure with extremely limited time.

"- What kind of horror just crawled out?!"

Exhale ... and finish the exercise

After each problem was solved and the site was temporarily stabilized, the teams, together with the speakers, studied the incidents from the SRE point of view. They analyzed the problems in detail: the causes and the course of the fix. Then, both team by team and collectively, they decided how to prevent a recurrence: how to improve monitoring, how to change the architecture sensibly, how to adjust the approach to development and operations, how to amend the procedures. The speakers demonstrated how a post-mortem is conducted.

"Who else wants some torment?" - "Me!"

The successes of the teams were strictly and clearly recorded on the electronic scoreboard.

For the top places, there was a bonus from the stakeholders.

Source: habr.com
