"Hope is a bad strategy." SRE intensive in Moscow, February 3-5

We are announcing the first practical course on SRE in Russia: Slurm SRE.

Over the three days of the intensive, we will build, break, repair, and improve a movie-ticket aggregator site.

We chose a ticket aggregator because it has many failure scenarios: a surge of visitors and DDoS attacks, failure of one of the many critical microservices (authorization, reservation, payment processing), unavailability of one of the many cinemas (which exchange data on free seats and reservations), and so on down the list.

We will formulate a reliability concept for our aggregator site and then engineer it: analyze the design from an SRE point of view, select metrics, set up monitoring for them, resolve incidents as they arise, run team incident-response drills under near-production conditions, and hold debriefings.

The program is run by Booking.com and Google employees.
This time there will be no remote participation: the course is built on personal interaction and teamwork.

Speakers

Ivan Kruglov
Principal Developer at Booking.com (Netherlands)
Since joining Booking.com in 2013, he has worked on infrastructure projects such as distributed message delivery and processing, BigData and web-stack, and search.
Now he is working on building an internal cloud and Service Mesh.

Ben Tyler
Principal Developer at Booking.com (USA)
Works on internal development of the Booking.com platform.
Specializes in service mesh and service discovery, batch job scheduling, incident response, and the postmortem process.
Speaks and teaches in Russian.

Eugene Barabbas
General Developer at Google (San Francisco).
His experience ranges from high-load web projects to research in computer vision and robotics.
Since 2011, he has been building and operating distributed systems at Google, taking part in the full life cycle of a project: conceptualization, design and architecture, launch, shutdown, and all the stages in between.

Eduard Medvedev
CTO at Tungsten Labs (Germany)
Previously an engineer at StackStorm, where he was responsible for the platform's ChatOps functionality. Developed and implemented ChatOps for data center automation. Speaker at Russian and international conferences.

Program

The program is still being actively developed. This is how it looks now; by February it may be improved and expanded.

Topic #1: Basic principles and methods of SRE

  • What does it take to become an SRE?
  • DevOps vs SRE
  • Why developers value SREs and miss them badly when a project has none
  • SLI, SLO and SLA
  • Error budget and its role in SRE
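
As an illustration of the last two items, here is a minimal sketch of how an error budget follows from an SLO. It is not course material: the function names, the request-based SLI, and the numbers in the usage lines are assumptions chosen for the example.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Requests that may fail in the window without breaching the SLO.

    Assumes a request-based SLI: good requests / total requests.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    return round((1 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        raise ValueError("window too small to allow any failures at this SLO")
    return 1 - failed_requests / budget


# Hypothetical numbers: a 99.9% SLO over one million requests.
allowed = error_budget(0.999, 1_000_000)
left = budget_remaining(0.999, 1_000_000, 250)
print(f"allowed failures: {allowed}, budget left: {left:.0%}")
```

A team might, for example, pause risky releases once the remaining budget approaches zero; that policy is a common convention, not something stated in the program.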

Topic #2: Design of distributed systems

  • Application architecture and functionality
  • Non-Abstract Large System Design
  • Operability / Design for failure
  • gRPC or REST
  • Versioning and backwards compatibility

Topic #3: How SRE accepts a project

  • Best practices from SRE
  • Project acceptance checklist
  • Logging, metrics, tracing
  • Taking CI/CD into our own hands

Topic #4: Designing and Running a Distributed System

  • Reverse engineering - how does the system work?
  • Coordinating SLI and SLO
  • Capacity planning practice
  • Sending traffic to the application: our users begin to “use” it
  • Launching Prometheus, Grafana, Elastic

Topic #5: Monitoring, Observability and Alerting

  • Monitoring vs. observability
  • Setting up monitoring and alerting with Prometheus
  • Practical monitoring of SLI and SLO
  • Symptoms vs. Causes
  • Black box vs. White box monitoring
  • Distributed monitoring of application and server availability
  • 4 golden signals (anomaly detection)

Topic #6: The Practice of System Reliability Testing

  • Work under pressure
  • Failure injection
  • Chaos monkey

Topic #7: Practice of incident response

  • Stress management algorithm
  • Interaction between incident participants
  • Postmortem
  • Knowledge sharing
  • Formation of culture
  • Fault monitoring
  • Conducting a blameless debriefing

Topic #8: Load Management Practices

  • Load balancing
  • Application fault tolerance: retry, timeout, failure injection, circuit breaker
  • DDoS (we create a load) + Cascading Failures
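
To make the fault-tolerance item concrete, here is a minimal circuit-breaker sketch. It is an illustration under stated assumptions, not the implementation used in the course: the class name, the thresholds, and the injectable `clock` parameter are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    """Stops calling a failing dependency until a cool-down has passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: half-open, allow a single trial call.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

Failing fast while the circuit is open keeps a struggling service from being hit with more requests, which is one way to limit the cascading failures mentioned in the list above.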

Topic #9: Incident Response

  • Debriefing
  • On-call practice
  • Various types of outages (testing, configuration change, hardware failure)
  • Incident Management Protocols

Topic #10: Diagnosing and Solving Problems

  • Logging
  • Debugging
  • Hands-on analysis and debugging of our application

Topic #11: System Reliability Testing

  • Stress testing
  • Configuration testing
  • Performance testing
  • Canary release

Topic #12: Independent work and review

Recommendations and requirements for participants

SRE is teamwork, so we strongly recommend taking the course as a team and offer large discounts to ready-made teams.

The price of the course is 60 ₽ per person.
If a company sends a group of 5 or more people, the price is 40 ₽ per person.

The course is built on Kubernetes. To complete it, you need to know Kubernetes at a basic level. If you do not work with it, you can take Slurm Basic first (as an online course or at the intensive on November 18-20).
In addition, you need to be proficient in Linux and know GitLab and Prometheus.

Register

If you have a more complex participation scenario in mind, for example a CEO, a CTO, and a development team attending together, with practical work that reflects the management hierarchy, send me a private message.

Source: habr.com