"Hope is a bad strategy." SRE intensive in Moscow, February 3-5
We are announcing the first practical course on SRE in Russia: Slurm SRE.
Over the three days of the intensive, we will build, break, repair, and improve a movie ticket aggregator site.
We chose a ticket aggregator because it has many failure scenarios: a surge of visitors or a DDoS attack, the failure of one of its many critical microservices (authorization, reservation, payment processing), the unavailability of one of the many cinemas (which exchange information about free seats and reservations), and so on.
We will define the Reliability concept for our aggregator Site, which we will then Engineer: analyze the design from an SRE point of view, select metrics, set up monitoring for them, resolve incidents as they arise, run team incident-response drills in conditions close to real production, and hold debriefings.
The program is run by Booking.com and Google employees.
This time there will be no remote participation: the course is built on personal interaction and teamwork.
Speakers
Ivan Kruglov
Principal Developer at Booking.com (Netherlands)
Since joining Booking.com in 2013, he has worked on infrastructure projects such as distributed message delivery and processing, big data, the web stack, and search.
Now he is working on building an internal cloud and Service Mesh.
Ben Tyler
Principal Developer at Booking.com (USA)
Works on internal development of the Booking.com platform.
Specializes in service mesh and service discovery, batch job scheduling, incident response, and the postmortem process.
Speaks and teaches in Russian.
Eugene Barabbas
Generalist developer at Google (San Francisco)
His experience ranges from high-load web projects to research in computer vision and robotics.
Since 2011, he has been building and operating distributed systems at Google, taking part in the full life cycle of a project: conceptualization, design and architecture, launch, decommissioning, and all the stages in between.
Eduard Medvedev
CTO at Tungsten Labs (Germany)
He previously worked as an engineer at StackStorm, where he was responsible for the platform's ChatOps functionality. He has developed and deployed ChatOps for data center automation and speaks at Russian and international conferences.
Program
The program is still under active development. This is how it looks now; by February it may be refined and expanded.
Topic #1: Basic principles and methods of SRE
What does it take to become an SRE?
DevOps vs SRE
Why developers value SREs and miss them badly when a project has none
SLI, SLO and SLA
Error budget and its role in SRE
Topic #2: Design of distributed systems
Application architecture and functionality
Non-Abstract Large System Design
Operability / Design for failure
gRPC or REST
Versioning and backwards compatibility
Topic #3: How SRE takes a project on
Best practices from SRE
Project acceptance checklist
Logging, metrics, tracing
Taking CI/CD into our own hands
Topic #4: Designing and Running a Distributed System
Reverse engineering - how does the system work?
Agreeing on SLIs and SLOs
Capacity planning practice
Sending traffic to the application: our users begin to “use” it
Launching Prometheus, Grafana, Elastic
Topic #5: Monitoring, Observability and Alerting
Monitoring vs. observability
Setting up monitoring and alerting with Prometheus
Practical monitoring of SLI and SLO
Symptoms vs. Causes
Black box vs. White box monitoring
Distributed monitoring of application and server availability
4 golden signals (anomaly detection)
Topic #6: The Practice of System Reliability Testing
If you have a more complex participation scenario in mind, for example sending a CEO, a CTO, and a development team to the course together, with training tailored to the management hierarchy, send me a private message.