“Hope is a bad strategy.” SRE intensive in Moscow, February 3-5

We are announcing the first hands-on SRE course in Russia: Slurm SRE.

During the intensive, we will spend three days building, breaking, repairing, and improving a movie-ticket aggregator site.

We chose a ticket aggregator because it has many failure scenarios: a surge of visitors and DDoS attacks, the failure of one of many critical microservices (authorization, reservation, payment processing), the unavailability of one of many cinemas (which exchange data about available seats and reservations), and so on.

We will formulate a Reliability concept for our aggregator Site and then carry it into Engineering: we will analyze the design from an SRE perspective, select metrics, set up monitoring for them, resolve incidents as they occur, train the team to handle incidents under near-production conditions, and hold a debriefing.

The program is taught by engineers from Booking.com and Google.

This time there will be no remote participation: the course is built on personal interaction and teamwork.

Speakers

Ivan Kruglov

Principal Developer at Booking.com (Netherlands)

Since joining Booking.com in 2013, he has worked on infrastructure projects such as distributed message delivery and processing, Big Data, the web stack, and search.

He now works on building an internal cloud and service mesh.

Ben Tyler

Principal Developer at Booking.com (USA)

Works on internal development of the Booking.com platform.

Specializes in service mesh / service discovery, batch job scheduling, incident response and postmortem process.

Speaks and teaches in Russian.

Eugene Varavva

Generalist developer at Google (San Francisco).

His experience ranges from high-load web projects to research in computer vision and robotics.

Since 2011, he has developed and operated distributed systems at Google, taking part in the full project life cycle: conceptualization, design and architecture, launch, decommissioning, and every stage in between.

Eduard Medvedev

CTO at Tungsten Labs (Germany)

He worked as an engineer at StackStorm, where he was responsible for the platform's ChatOps functionality. He developed and implemented ChatOps for data center automation. He speaks at Russian and international conferences.

Program

The program is still being actively developed. This is how it looks now; by February it may be refined and expanded.

Topic #1: Basic principles and methods of SRE

What does it take to become an SRE?
DevOps vs SRE
Why developers value SREs and sorely miss them when a project has none
SLI, SLO and SLA
Error budget and its role in SRE

Topic #2: Design of distributed systems

Application Architecture and Functionality
Non-Abstract Large System Design
Operability / Design for failure
gRPC or REST
Versioning and Backward Compatibility

Topic #3: How SRE takes on a project

Best Practices from SRE
Project Admission Checklist
Logging, metrics, tracing
Taking CI/CD into our own hands

Topic #4: Design and launch of a distributed system

Reverse engineering - how does the system work?
Agreeing on SLIs and SLOs
Capacity planning practice
Sending traffic to the application: our users start to "use" it
Launching Prometheus, Grafana, Elastic

Topic #5: Monitoring, Observability and Alerting

Monitoring vs. Observability
Set up monitoring and alerts with Prometheus
Practical monitoring of SLI and SLO
Symptoms vs. Causes
Black-box vs. white-box monitoring
Distributed application and server availability monitoring
The four golden signals (anomaly detection)

Topic #6: The practice of testing system reliability

Work under pressure
Failure injection
Chaos Monkey

Topic #7: Incident response practice

Stress management algorithm
Interaction between incident participants
Postmortem
Knowledge sharing
Culture formation
Fault monitoring
Running a blameless debriefing

Topic #8: Workload Management Practice

Load balancing
Application Fault Tolerance: retry, timeout, failure injection, circuit breaker
DDoS (generating load) + cascading failures

Topic #9: Incident Response

Debriefing
On-Call Practice
Different types of failures (testing, configuration changes, hardware failures)
Incident Management Protocols

Topic #10: Diagnosis and problem solving

Logging
Debugging
Analysis and debugging practice on our application

Topic #11: System Reliability Testing

Stress Testing
Configuration testing
Performance testing
Canary release

Topic #12: Independent work and review

Recommendations and requirements for participants

SRE is a team effort. We strongly recommend that the whole team take the course, so we offer large discounts to existing teams.

Course price: 60 000 ₽ per person.

If a company sends a group of 5 or more people: 40 000 ₽.

The course is built on Kubernetes. To take it, you need to know Kubernetes at a basic level. If you don’t work with it, you can take Slurm Basic (online or the intensive on November 18-20).

In addition, you need a good command of Linux and familiarity with GitLab and Prometheus.

Registration

If you have a more complex participation scenario in mind, for example sending the CEO, the technical director, and the development team to the course together so that they practice along their management hierarchy, send me a private message.
