Slurm DevOps: from Git to SRE, with every stop along the way

On September 4-6, a three-day Slurm DevOps course will be held in St. Petersburg, in the Selectel conference hall.

We built the program on the premise that anyone can read DevOps theory and tool manuals on their own. What is worth coming for is experience and practice: an explanation of what to do and what not to do, and an account of how we do it ourselves.

Every company, administrator, and developer is at a different DevOps level. Some are still misusing Git, others are already implementing SRE. The course is structured so that everyone finds something relevant that they can put into practice right away.

We start with Git, then look at application development and the interaction of code and infrastructure, build CI / CD, describe the infrastructure as code (IaC), test the resulting solution, set up monitoring, collect and study logs, and finally get to SRE: turning reliability into something measurable and manageable.

Git

These days the only people not using Git are those who bought their first laptop yesterday. It is a commonplace, ubiquitous tool, and yet we often see it misused: from force pushes to master to copying files from Git to the server with Ctrl-C, Ctrl-V.

We explain what not to do, what to do, and how we do it at Southbridge.

Then comes hands-on practice: Git basics and teamwork.

Topic # 1: Git Basics

Basic commands: git init, commit, add, diff, log, status, pull, push
Git flow, branches and tags, merge strategies
Working with multiple remote repositories

Topic # 2: Teamwork with Git

GitHub flow
Fork, remote, pull request
Conflicts, releases, and another look at Git flow and other flows as they apply to teams

The material is organized so that administrators and developers can put all the practices to work immediately.

From a DevOps point of view, using Git properly organizes and automates development and administration processes, eliminates a number of recurring problems, and increases productivity.

DevOps for the Developer

We look at DevOps through the eyes of the developer: we launch a local environment, write an application, configure its monitoring and logging, test it locally, organize the storage of variables / secrets and service discovery, and look at tracing (OpenTracing).

Topic # 3: Working with the application from a development point of view

Setting Up Your Local Environment: Best Practices
We write a microservice in Python (including tests)
Using docker-compose in development

Topic # 4: Code and Infrastructure Interoperability

Config practice

As a result, developers will see how the code should send logs, how to test it, and how it will be debugged later. Administrators will understand what developers need: what errors occur in the code, how to organize testing for developers, and how to test the project themselves.

At this stage the main task of DevOps is solved: mutual understanding and collaboration between devs and ops are established. This is a key step in moving from throwing tasks back and forth to responsible collaboration.

As a result, both the speed and the quality of work improve.

CI / CD

Modern automation means CI / CD. We will start with hand-built automation: makefiles, Git hooks, scripts. We will work out when these tools are still relevant and when they should not be used.

Then we will look at modern CI best practices, using GitLab as an example.

Topic # 5: CI / CD: Introduction to Automation

Introduction to Automation
Tools (bash, make, gradle)
Using git-hooks to automate processes
Factory assembly lines and their application in IT
An example of building a “common” pipeline
Modern CI / CD software: Drone CI, BitBucket Pipelines, Travis, etc.

Topic # 6: CI / CD: Working with GitLab

GitLab CI: general overview
GitLab Runners: types and uses
GitLab CI: configuration features, best practices
GitLab CI stages
GitLab CI variables
Build, test, deploy
Control and execution restrictions: only, when
Working with artifacts
Templates inside .gitlab-ci.yml, reusing actions in different parts of the pipeline
include sections
Centralized management of .gitlab-ci.yml (one file and automatic push to other repositories)

The joint work of administrators and developers moves to a new level: the administrator writes the CI template, and developers adapt it, building their own CI independently of the administrator.

Developers depend less on administrators, the amount of manual work shrinks, and the problem of "the only person who knows how the makefile works" goes away. Rollouts become reliable and fast.

IaC

Alexei Stepanenko, an administrator of the Selectel cloud, will cover Infrastructure as Code using Terraform as an example. He will show how to deploy and scale servers quickly and automatically, how to build images automatically, and how to use configuration templates to get preconfigured machines right away.

A person who has built thousands of IaC solutions will explain how to do it right and how not to do it.

With minimal changes, the solution built for the Selectel cloud also works for the Google and Amazon clouds.

Nikolay Mesropyan of Southbridge will show how to deploy a working application without downtime and test that it functions correctly.

If you manage infrastructure by hand (configuring servers, installing libraries and packages as needed), then raising a copy of the environment means remembering and reproducing every action you took. That task easily takes 3-5 days. Treating infrastructure as code gives you an up-to-date description of the environment that can be deployed in minutes.

Nikolay will explain how to write playbooks, which mistakes are common, and why playbooks sometimes run slowly or not as expected. This is the experience of many years of using IaC at Southbridge.

Topic # 7: Infrastructure as Code

IaC: treating infrastructure as code
Cloud providers as infrastructure providers
System initialization tools, image building (Packer)
IaC with Terraform as an example
Configuration storage, collaboration, application automation
Practice: creating Ansible playbooks
Idempotency, declarativeness
IaC with Ansible as an example
Database as Code / PostgreSQL failover

Infrastructure becomes declarative and idempotent.
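
To make "declarative and idempotent" concrete, here is a minimal Python sketch. It is not Ansible or Terraform code; `ensure_line` is a hypothetical helper invented for this illustration. You describe the desired state (the line must be present), and running the step a second time changes nothing.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file; return True only if the file changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # desired state already reached, nothing to do
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conf = Path(tmp) / "app.conf"  # hypothetical config file
        print(ensure_line(conf, "max_connections = 100"))  # first run changes the file
        print(ensure_line(conf, "max_connections = 100"))  # second run is a no-op
```

Returning a "changed" flag is one common way to let the caller see whether a run actually modified anything, which is what makes repeated runs safe to automate.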

The administrator learns to manage a complex infrastructure: to create new environments quickly, keep all environments consistent, and see the history of changes, which is critical when several teams work on one project.

Developers can study the infrastructure and deploy their own environment independently.

Section bonus: creating and configuring a fault-tolerant PostgreSQL database cluster. We will hand out the ready-made playbook that we use at Southbridge; you will deploy a cluster on the training stand and can then use this solution in your own company.

Infrastructure Testing and Monitoring

Automation lets you roll out a mistake to a thousand servers at once, so every change requires testing. On the other hand, manual testing takes so long that it negates the benefits of automation.

We will show in practice how to write tests for roles. Afterwards you will be able to write tests for your own company. You no longer need to remember which settings were made: describe them in tests and automatically check that all past decisions and workarounds are still in place.

Then we will learn how to add all new servers to monitoring automatically. We will look separately at infrastructure monitoring and application monitoring, and show both bad and good practices.

Topic # 8: Testing Infrastructure

Testing and continuous integration with Molecule and GitLab CI
Using Vagrant

Topic # 9: Monitoring Infrastructure with Prometheus

Why monitoring is needed
Types of monitoring
Notifications in the monitoring system
How to build a healthy monitoring system
Human-readable notifications for everyone
Health Check: what you should pay attention to
Automation based on monitoring data

Bad monitoring is the same as no monitoring. It does not matter to the business that the main page of an online store is up if the payment form returns an error.
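
The point about the payment form can be sketched in a few lines of Python. This is an illustration under assumptions, not the course material: the check names and the stand-in probe functions are hypothetical, and a real check would make an HTTP request or run a test transaction. The service counts as healthy only when every business-critical check passes.

```python
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every named check; a check that raises counts as failed."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def is_healthy(results: Dict[str, bool]) -> bool:
    """Healthy only if every business-critical check passed."""
    return all(results.values())


def main_page_ok() -> bool:
    return True  # stand-in for a real probe of the main page


def payment_form_ok() -> bool:
    raise RuntimeError("payment form returned an error")  # simulated failure


if __name__ == "__main__":
    results = run_checks({"main_page": main_page_ok, "payment_form": payment_form_ok})
    print(results, "healthy:", is_healthy(results))
```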

Developers and administrators should be equally involved in setting up monitoring and in troubleshooting. Yet traditionally, monitoring tasks fall on administrators. Our course will show developers what role they play in creating effective monitoring. Administrators will receive Southbridge best practices. As a result, losses caused by outages and slowdowns of the site or application will quickly decline.

Section bonus: monitoring-based automation. For example, monitoring reports that load on the site has increased, and scaling of the web servers starts automatically.
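
A minimal sketch of such a scaling decision, assuming the monitoring system reports the average requests per second per web server. The function name, the target load, and the server limits are all hypothetical values chosen for the illustration; the actual scaling call to a cloud provider is left out.

```python
import math


def desired_servers(
    current: int,
    avg_rps_per_server: float,
    target_rps_per_server: float,
    min_servers: int = 2,
    max_servers: int = 20,
) -> int:
    """Return how many web servers are needed to bring per-server load back to the target."""
    if current < 1:
        raise ValueError("current must be at least 1")
    if target_rps_per_server <= 0:
        raise ValueError("target_rps_per_server must be positive")
    total_rps = current * avg_rps_per_server
    needed = math.ceil(total_rps / target_rps_per_server)
    # Clamp so that a metric spike or an idle night never scales past safe limits.
    return max(min_servers, min(max_servers, needed))


if __name__ == "__main__":
    print(desired_servers(current=4, avg_rps_per_server=300, target_rps_per_server=200))
```

The lower and upper bounds matter as much as the formula: they keep a faulty metric from scaling the fleet to zero or to an unbounded size.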

Logging

The main mistake in working with logs is that administrators and developers read them directly on the servers. If you have more than one server, this takes a long time. It is also insecure: the developer logs in to a server where they should not be.

DevOps requires centralized collection, processing and analytics of logs.
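
On the application side, centralized collection is easier when each log record is a single machine-readable line. The sketch below shows one possible way to do this with the Python standard library; the field names and the `build_logger` helper are assumptions made for the example, and shipping the lines to a central store is outside its scope.

```python
import io
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate output
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("shop")  # "shop" is a hypothetical service name
    log.info("order %s created", 42)
```

Writing to standard output rather than to a file on the server leaves the decision about where logs end up to the infrastructure, so nobody needs shell access to read them.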

Topic # 10: Application Logging with ELK

Main uses and features of Elastic (search, storage, scaling, configuration flexibility)
Overview of Kibana (main features, query language, dashboard management, charting)
Overview of Elastic products and their uses
Collecting metrics in APM (application tracing)
Optional: overview of a new product, SIEM

Adopting this approach turns logs into a simple, understandable tool for analyzing, configuring, and tuning the application and infrastructure.

SRE

Finally we reach a topic that Southbridge itself is only beginning to explore, and for which the other speakers want to stay on for Slurm's last day. We are glad that Ivan Kruglov from Booking.com agreed to present it.

A project lives in the real world, where reliability is never absolute, and every solution costs money.

What does an SLA mean for a complex project? For example, how do you rate a site that is available but loads its images with a delay? What are the SLA metrics, and where and how do you collect them?

How do you set an SLA? How do you meet it?

Topic # 11: SRE

Definition of SLA, SLO, Error Budget and other scary terms from the world of SRE

SRE: Practice Monitoring SLI and SLO

SRE: Error Budget Practice

SRE: Interrupt and operational load management (API gateway, service mesh, circuit breakers)
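
As a rough illustration of the terms in this list, the arithmetic behind an error budget fits in one small function. This sketch assumes a request-based SLI (the share of successful requests) and a hypothetical SLO of 99.9%; the function and field names are invented for the example.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Compare the measured SLI with the SLO and report how much error budget is left."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    sli = (total_requests - failed_requests) / total_requests
    allowed_failures = (1 - slo) * total_requests  # the error budget, in requests
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": failed_requests / allowed_failures,
        "slo_met": sli >= slo,
    }


if __name__ == "__main__":
    # Hypothetical month: one million requests, 400 of them failed, SLO of 99.9%.
    report = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=400)
    for key, value in report.items():
        print(key, value)
```

With these numbers the budget is about 1000 failed requests, so 400 failures consume roughly 40% of it. How much budget remains is the kind of measurable figure that can drive a decision such as whether to ship a risky release now or later.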

Business wants SRE, at least at the simplest level: keep a standby server or restore from backup? A single database or a cluster? DDoS protection that is always on, or only during an attack?

A director is not satisfied with hearing that “the site is working” when a client calls him to say that the order form will not open.

Therefore, it is important for a DevOps engineer to understand SRE at least superficially, in order to talk competently with the business about its needs.

Summary

During Slurm DevOps, administrators and developers will learn to:

- work correctly with Git;

- organize local development;

- configure (administrators) and use (developers) CI / CD;

- work with infrastructure as code;

- test the infrastructure;

- monitor infrastructure and application;

- configure logging;

- understand and, ideally, use SRE.

For attentive readers: the promo code habrapost gives a 15% discount.

For every item we are preparing hands-on practice and tools, so each participant, upon returning from Slurm, will be able to take their company to the next level of DevOps.

For the business, this means cheaper administration and development, less downtime, higher reliability, and faster delivery of features and bug fixes.
