We are announcing the first hands-on SRE course in Russia: Slurm SRE.
During the three-day intensive, we will build, break, repair and improve an aggregator site that sells movie tickets.
We chose a ticket aggregator because it offers plenty of failure scenarios: a surge of visitors and DDoS attacks, the failure of one of many critical microservices (authorization, reservation, payment processing), the unavailability of one of many cinemas (which exchange data on available seats and reservations), and so on down the list.
We will formulate the Reliability concept for our aggregator Site and then put it into practice in Engineering: we will analyze the design from an SRE point of view, select metrics, set up monitoring for them, resolve the incidents that occur, run team incident-response drills in conditions close to real ones, and hold debriefings.
The program is taught by engineers from Booking.com and Google.
This time there will be no remote participation: the course is built on personal interaction and teamwork.
Speakers
Ivan Kruglov
Principal Developer at Booking.com (Netherlands)
Since joining Booking.com in 2013, he has worked on infrastructure projects such as distributed message delivery and processing, big data, the web stack, and search.
He currently works on building an internal cloud and service mesh.
Ben Tyler
Principal Developer at Booking.com (USA)
Works on internal development of the Booking.com platform.
Specializes in service mesh / service discovery, batch job scheduling, incident response and postmortem process.
Speaks and teaches in Russian.
Eugene Varavva
Generalist Developer at Google (San Francisco)
His experience ranges from high-load web projects to research in computer vision and robotics.
Since 2011, he has developed and operated distributed systems at Google, taking part in the full project life cycle: conceptualization, design and architecture, launch, shutdown, and all the stages in between.
Eduard Medvedev
CTO at Tungsten Labs (Germany)
He worked as an engineer at StackStorm, where he was responsible for the platform's ChatOps functionality. He developed and implemented ChatOps for data center automation. He speaks at Russian and international conferences.
Program
The program is still being actively developed. This is how it looks now; by February it may be refined and expanded.
Topic #1: Basic principles and methods of SRE
- What does it take to become an SRE?
- DevOps vs SRE
- Why developers value SREs and miss them when a project has none
- SLI, SLO and SLA
- Error budget and its role in SRE
Topic #2: Design of distributed systems
- Application Architecture and Functionality
- Non-Abstract Large System Design
- Operability / Design for failure
- gRPC or REST
- Versioning and Backward Compatibility
Topic #3: How to accept a project into SRE
- Best practices from SRE
- Project acceptance checklist
- Logging, metrics, tracing
- Taking CI/CD into our own hands
Topic #4: Design and launch of a distributed system
- Reverse engineering: how does the system work?
- Agreeing on SLIs and SLOs
- Capacity planning practice
- Sending traffic to the application: our users begin to "use" it
- Launching Prometheus, Grafana, Elastic
Topic #5: Monitoring, Observability and Alerting
- Monitoring vs. Observability
- Set up monitoring and alerts with Prometheus
- Practical monitoring of SLI and SLO
- Symptoms vs. Causes
- Black-box vs. white-box monitoring
- Distributed application and server availability monitoring
- The four golden signals (anomaly detection)
Topic #6: The practice of testing the reliability of systems
- Work under pressure
- Failure injection
- Chaos Monkey
Topic #7: Incident response practice
- Stress management algorithm
- Interaction between incident participants
- Postmortem
- Knowledge sharing
- Culture formation
- Fault monitoring
- Running blameless debriefings
Topic #8: Workload Management Practice
- Load balancing
- Application Fault Tolerance: retry, timeout, failure injection, circuit breaker
- DDoS (create load) + Cascading Failures
Topic #9: Incident Response
- Debriefing
- On-Call Practice
- Different types of failures (testing, configuration changes, hardware failures)
- Incident Management Protocols
Topic #10: Diagnosis and problem solving
- Logging
- Debugging
- Analysis and debugging practice on our application
Topic #11: System Reliability Testing
- Stress Testing
- Configuration testing
- Performance testing
- Canary release
Topic #12: Independent work and review
Recommendations and requirements for participants
SRE is a team effort. We strongly recommend that the whole team take the course, so we offer large discounts for whole teams.
Course price: 60,000 ₽ per person.
If a company sends a group of 5 or more people: 40,000 ₽.
The course is built on Kubernetes. To take it, you need to know Kubernetes at a basic level. If you don't work with it, you can take Slurm Basic (online or the intensive on November 18-20).
In addition, you need a good command of Linux and familiarity with GitLab and Prometheus.
Register
If you have a more complex participation scenario in mind, for example, sending the CEO, the technical director, and the development team to the course together so that they practice within their management hierarchy, write to me in a private message.