ok.tech at HighLoad++ 2019





HighLoad++ is almost here! On November 7-8, more than 3,000 developers of highly loaded systems will gather in Skolkovo for the thirteenth time. The event is dedicated to exchanging knowledge about technologies that can serve many thousands and millions of users simultaneously.

The program covers such aspects of web development as the architecture of large projects, databases and storage systems, system administration, load testing, operating large projects, and other areas related to highly loaded systems.

We are taking an active part in HighLoad++ 2019, and today we will tell you which talks our employees have prepared for conference attendees.



November 7th



The New Odnoklassniki Graph. Anton Ivanov, lead platform developer












Time: 12:00

Location: Moscow Hall



The friend graph is one of the most important and most heavily loaded services in Odnoklassniki. Almost every function of the site needs it: building the feed, finding new friends, checking permissions when viewing photos, and much more. All this adds up to 700,000 requests per second against 300 billion connections between users.



Such a load imposes strict requirements not only on performance but also on fault tolerance, because any problem with the graph can paralyze the entire site. For a long time we ran the classic scheme of sharded databases plus caches, but it suffered from many problems with both data consistency and fault tolerance.
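To make the consistency problem concrete, here is a minimal, purely illustrative sketch of the classic sharded-database-plus-cache scheme (in Python, with made-up names and shard counts; this is not OK's actual code). A friendship edge lives on two shards, one per endpoint, and the cache entries for both users must be invalidated separately, so a failure between steps leaves the graph inconsistent:

```python
# Illustrative sketch of a sharded friend store with a cache tier.
# SHARDS and all names are invented for the example.

SHARDS = 256

def shard_for(user_id: int) -> int:
    """Route a user's friend list to one of the database shards."""
    return user_id % SHARDS

class FriendService:
    def __init__(self, shards, cache):
        self.shards = shards   # shard id -> {user_id: set(friend_ids)}
        self.cache = cache     # plain dict standing in for the cache tier

    def friends(self, user_id):
        if user_id in self.cache:   # fast path: may serve stale data
            return self.cache[user_id]
        data = self.shards[shard_for(user_id)].get(user_id, set())
        self.cache[user_id] = data
        return data

    def add_friend(self, a, b):
        # The edge is stored on two shards, and two cache entries are
        # invalidated -- each step can fail independently, which is the
        # consistency problem the talk describes.
        for u, v in ((a, b), (b, a)):
            self.shards[shard_for(u)].setdefault(u, set()).add(v)
            self.cache.pop(u, None)
```

If the process dies between the two iterations of the loop, user `a` sees `b` as a friend while `b` does not, and stale cache entries can hide even successfully written edges.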



In the talk, we will walk through the transition to the new graph architecture in detail: we will start with the old version and the problems that arose while operating it, then dive into the new architecture and the surprises that awaited us during the migration.



Efficient, reliable microservices. Oleg Anastasiev, chief engineer












Time: 17:00

Location: Singapore Hall



In Odnoklassniki, user requests are served by more than 200 unique types of services. Many of these services use the technique of combining business logic and the distributed, fault-tolerant Cassandra database in a single JVM. This allows us to build highly loaded services that manage hundreds of billions of records with millions of operations per second on them.



In this talk we will discuss the advantages of combining business logic with the database, how state affects the reliability and availability of services, and how this technique has significantly improved the performance of our services.



But not every database is suitable for this. We will examine in detail which databases can be embedded in your next microservice, and which cannot.
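As a rough illustration of the idea (hypothetical names only; the talk's actual setup is business logic and Cassandra inside one JVM), here is what co-locating logic with an in-process store looks like: reads become function calls instead of network round trips, with no serialization and no extra hop.

```python
# Illustrative sketch, not OK's actual stack: a microservice that embeds
# its storage engine in the same process as its business logic.

class EmbeddedStore:
    """Stand-in for an embeddable storage engine that runs as a library
    inside the service process rather than as a separate server."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

class ProfileService:
    """Business logic co-located with its data in one process."""
    def __init__(self, store: EmbeddedStore):
        self.store = store

    def display_name(self, user_id):
        profile = self.store.get(user_id) or {}
        # Logic and data share a process: this read is a function call,
        # not a network request to a remote database.
        return profile.get("name", "<unknown>")
```

The rough criterion this sketch hints at: an engine is a candidate for embedding when it can run as a library inside your process; a database that only runs as a standalone server forces the network hop back in.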



November 8th



Rise of the Machines at OK. Leonid Talalaev, lead developer in the platform team












Time: 10:00

Location: Cape Town Hall



Odnoklassniki runs on more than 6,000 servers located in several data centers. Almost half of them are part of our cloud, one-cloud, which we already talked about at HighLoad++ two years ago.



Managing more than 10,000 containers gives rise to routine tasks whose manual execution would take too much time and inevitably lead to human error. We therefore strive to automate all processes in the cloud so as to minimize human involvement. We call this complete automation “Rise of the Machines”.



In the report, we will consider topics such as:

- rolling out security patches to all containers; along the way, we will show how to replace Docker image layers in 1 second;

- ensuring the availability of distributed stateful services during cloud operations;

- the problem of fragmentation in the cloud; we will tell you how changing the placement algorithm saved a million dollars.
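The fragmentation point can be made concrete with a toy bin-packing sketch: given the same containers and the same server capacity, a smarter placement order can need fewer servers. The capacities, sizes, and algorithm choice (plain first-fit versus first-fit decreasing) are all illustrative; this is not one-cloud's scheduler.

```python
# Toy illustration of why the placement algorithm matters for
# fragmentation: identical workloads, different server counts.

def place(containers, capacity):
    """First-fit: put each container on the first server with room;
    open a new server when none fits. Returns the server count."""
    servers = []
    for size in containers:
        for srv in servers:
            if srv["free"] >= size:
                srv["free"] -= size
                break
        else:
            servers.append({"free": capacity - size})
    return len(servers)

def place_ffd(containers, capacity):
    """First-fit decreasing: sort large-to-small, then first-fit.
    Big containers claim servers first, small ones fill the gaps."""
    return place(sorted(containers, reverse=True), capacity)

# With capacity 10 and these container sizes (total = 30), plain
# first-fit fragments the free space across 4 servers, while
# first-fit decreasing packs them into the optimal 3.
sizes = [2, 5, 4, 7, 1, 3, 8]
print(place(sizes, 10))      # 4
print(place_ffd(sizes, 10))  # 3
```

At a fleet of thousands of servers, shaving even a few percent of machines off through denser packing is exactly how a placement-algorithm change turns into real money.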



Getting off the TCP needle and onto UDP with millions of users. Alexander Tobol, head of video and feed platform development












Time: 14:00

Location: Main Hall (Congress Hall)



Alexander will tell how OK is moving millions of users from TCP to UDP.







In addition, OK will share not only the results of TCP and QUIC tests on different networks, but also the source code of the network emulator those tests are run on.
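The emulator itself is what the talk promises to release; as a purely illustrative toy version (my sketch, not their code), the core idea is a link that delivers packets with configurable loss rate and latency, so transport behavior can be compared under controlled network conditions:

```python
# Toy network emulator: a link with configurable packet loss and latency.
import random

class LossyLink:
    def __init__(self, loss_rate, latency_ms, seed=42):
        self.loss_rate = loss_rate
        self.latency_ms = latency_ms
        self.rng = random.Random(seed)   # seeded for reproducible runs
        self.delivered = []

    def send(self, packet):
        """Return (delivered?, arrival delay in ms) for one packet."""
        if self.rng.random() < self.loss_rate:
            return False, None           # packet dropped by the "network"
        self.delivered.append(packet)
        return True, self.latency_ms

def deliver_stream(link, packets):
    """Count how many packets of a stream survive the emulated link."""
    return sum(1 for p in packets if link.send(p)[0])
```

A real emulator would also model jitter, reordering, and bandwidth limits per network profile (3G, LTE, Wi-Fi), which is what makes apples-to-apples TCP vs. QUIC comparisons possible.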



A 200+ TB Elasticsearch Cluster. Petr Zaitsev, system administrator and Elasticsearch specialist












Time: 16:00

Location: Main Hall (Congress Hall)



The goal of the talk: to discuss the pitfalls and the architecture of an Elasticsearch cluster that stores logs at a particularly large volume.



In the talk, I will describe how we organized log storage and developer access to logs within the Odnoklassniki project.



High demands were placed on the service from the start. Everyone understood that the volume of processed data would be large, fault tolerance was required, and peak load could climb to 2 million rows per second. For these reasons the task turned out to be far from trivial, with plenty of pitfalls and surprising quirks.



I will recount the winding path we took to solve this problem, describe the cluster architecture we ultimately arrived at, and point out which decisions that seemed right at first glance “shot us in the foot” at the most unexpected moments.



We had 4 data centers, 500 Elasticsearch instances, 200+ TB of data, up to 2 million lines per second at peak, and a requirement of 100% service uptime at all costs.
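A quick back-of-envelope check of those figures shows why the task was non-trivial (the 500-byte average log line size is my assumption for illustration, not a number from the talk):

```python
# Back-of-envelope sizing from the figures above.
ROWS_PER_SEC = 2_000_000   # peak ingest rate from the talk
AVG_LINE_BYTES = 500       # assumed average log line size (illustrative)
INSTANCES = 500            # Elasticsearch instances
TOTAL_TB = 200             # total stored data

peak_ingest_mb_s = ROWS_PER_SEC * AVG_LINE_BYTES / 1e6
per_instance_tb = TOTAL_TB / INSTANCES

print(f"peak ingest: {peak_ingest_mb_s:.0f} MB/s")     # 1000 MB/s
print(f"data per instance: {per_instance_tb:.1f} TB")  # 0.4 TB
```

Under that assumption, the cluster has to absorb on the order of a gigabyte of raw log data per second at peak, before accounting for replication and indexing overhead.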



You will find out how we managed to pull this off in the talk!


