Interview with Ivan Kruglov, Principal Developer: Service Mesh and “non-standard” Booking.com tools

Ivan Kruglov, Principal Developer at Booking.com, spoke at Slurm DevOps on the topic of SRE, and after his talk he agreed to discuss Kubernetes, Service Mesh, open source, and “non-standard” solutions at Booking.com over a cup of coffee.

Since the topic of SRE turned out to be much broader than one talk, Ivan and his colleague Ben Tyler, Principal Developer at Booking.com, agreed to speak at Slurm SRE, which will be held on February 3-5, 2020. It will cover the theory and practice of applying SLIs / SLOs / error budgets, post-mortem analysis, effective resolution of IT incidents, and building reliable systems (monitoring and alerting, graceful degradation, failure injection, capacity planning, and prevention of cascading failures).

And now, a word from Ivan.

What has been an interesting professional challenge for you lately?

Over the last two years I have been doing two things. The first was building Booking.com's internal cloud, which runs on Kubernetes. I gave a long and comprehensive talk about it at Highload.

We went a long and interesting way building that cloud. It was my previous project, which I have since handed over to a colleague.

Now I am working on a topic called Service Mesh. It is a hot topic right now, just as Big Data and Kubernetes once were.

The idea is simple on the one hand and complex on the other. In a microservice architecture, all interaction goes over the network; the network is effectively an integral part of the microservices. And that interaction is a complicated operation where a lot can go wrong, so the whole thing needs to be controlled. It also imposes restrictions: in a monolith, two functions can trust each other because they are part of the same process, whereas microservices, in theory, cannot trust each other.

How do you know where a request comes from? Here is a microservice, and an HTTP request arrives. Did it really come from the service I think it came from? The same applies to the services it calls in turn: I need to be sure that the service I am calling is the one I need, and not some kind of fake. In small organizations this is probably not much of an issue, but in large ones, when you have thousands and tens of thousands of machines, you cannot keep track of each one, and it becomes a rather serious problem. So everything is built on zero trust: whenever you communicate, you must perform verification. At the level of network interaction, you need to authenticate and authorize every operation. These are quite difficult processes to implement, and the Service Mesh takes on these tasks to ensure secure interaction. That is just one of the things a Service Mesh does; there is much more: reliability, monitoring, tracing, and others.
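
A minimal sketch of what "verify every call" means in practice, using mutual TLS from plain Python (the URL and certificate paths are hypothetical; in a real mesh, a sidecar proxy such as Envoy handles this transparently, with no application changes):

```python
import requests  # third-party HTTP client

# Hypothetical paths: each service carries its own certificate (its identity)
# and trusts only the internal certificate authority.
response = requests.get(
    "https://cart-service.internal/items",  # the callee we want to verify
    cert=("/etc/certs/homepage.crt", "/etc/certs/homepage.key"),  # our identity
    verify="/etc/certs/internal-ca.pem",  # authenticate the callee's certificate
    timeout=2.0,
)
response.raise_for_status()
```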

And do you think this technology is the future?

Service Mesh is a growing trend; that is my personal opinion. It is already quite widespread. For example, there is Istio. Then Service Mesh offerings appeared in the clouds, in Amazon among others. I think all major providers will soon have, or already have, a full-fledged Service Mesh.

So it is the same kind of breakthrough technology that Kubernetes once was?

I think so. Although it is interesting to note that, in my opinion, neither Kubernetes nor Service Mesh invents anything new. Kubernetes took an existing technology stack and automated it. Service Mesh does much the same: it provides new tools on top of an existing base. Envoy appeared in the Service Mesh world, and I will demonstrate it today. (Note: Ivan addressed this topic in his talk at the Slurm DevOps intensive.) The idea is that Service Mesh is a higher-level instrument that lets you orchestrate the communications of a large fleet of microservices. Let me explain. To run a microservice architecture you first need a so-called runtime, the place where the applications will be launched; that is what Kubernetes provides. Service Mesh complements it by providing the interaction between the microservices running in that runtime.

Will you be working on this technology in the near future?

I can speak for myself. I work on infrastructure, and in infrastructure these will be among the main topics for the next few years: Kubernetes and Service Mesh.

Will they develop in parallel?

Certainly, because they complement each other: Kubernetes provides the runtime, and Service Mesh provides the interaction.

More precisely, Kubernetes has some components that seem to cover aspects of a Service Mesh, but they are too basic. In terms of networking, Kubernetes gives you only low-level connectivity: IP packets can go from point A to point B, and that's it. Okay, there are Ingress controllers and some higher-level routing. Nevertheless, Kubernetes has no built-in mechanisms for ensuring the reliability of request execution. A very simple example: in Kubernetes, if a Pod dies, Kubernetes restarts it by default; that is a retry mechanism. But at the level of network interaction this does not happen: if the home page service sends a request to the cart service and it fails for some reason, the request will not be retried.
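
A minimal sketch of the retry behavior a mesh adds, written here as application code (the function and URL are hypothetical; with a Service Mesh, the sidecar proxy retries transparently and the application stays unchanged):

```python
import time
import requests

def get_with_retries(url: str, attempts: int = 3, backoff_s: float = 0.1):
    """Retry a failed GET a few times, backing off between attempts."""
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=1.0)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the failure
            time.sleep(backoff_s * 2 ** attempt)

# e.g. the home page calling the cart service (hypothetical URL):
# get_with_retries("http://cart-service.internal/items")
```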

A Service Mesh adds functionality in this regard: if a request fails, it can be retried. There are other mechanisms too, such as outlier detection. Say we have a fleet of Pods serving the home page and a fleet of Pods serving the cart. If they are geographically separated, you can get situations where one part of the home-page Pods sees one part of the cart Pods, and another part sees a different part. Accordingly, the Service Mesh has mechanisms that dynamically build a picture of who is reachable by whom and switch between them, all in real time. And if one of the Pods shows too much latency, it gets ejected: each Pod can decide, "my conversation with this Pod is slow while all the others are normal, so I will throw it out of my pool." This is how the anomaly detection mechanism works. Say we have ten Pods: nine work without errors and the tenth constantly returns errors. Or nine Pods respond with a latency of 15 ms and one responds with 400 ms. And the Service Mesh decides to throw it out.
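
A toy sketch of that ejection decision (the threshold and Pod names are invented for illustration; in Envoy, outlier detection is a built-in, configurable feature rather than hand-written code):

```python
from statistics import median

# Hypothetical recent latencies per upstream Pod, in milliseconds.
latencies_ms = {
    "cart-pod-1": 15, "cart-pod-2": 14, "cart-pod-3": 16,
    "cart-pod-4": 15, "cart-pod-5": 400,  # the anomaly
}

def healthy_pool(latencies: dict, factor: float = 5.0) -> set:
    """Eject any Pod whose latency exceeds `factor` times the median."""
    baseline = median(latencies.values())
    return {pod for pod, ms in latencies.items() if ms <= factor * baseline}

print(healthy_pool(latencies_ms))  # cart-pod-5 is thrown out of the pool
```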

A Service Mesh is also good at letting you collect statistics on the client side. We have a client and we have a server. Usually statistics are collected on the server side, simply because that is easiest. But we want metrics that show how well users actually experience our service, so in theory you need to measure on the client side, not the server side, because between the two lies a big gap filled with network interaction.

Each component of this whole variety can fail.

And the Service Mesh is good in that it puts an agent on both ends and sends statistics from both ends. Situations can arise where latency is 20 ms on the service side and 2 seconds on the client side. For example, on the server side we collect statistics from a web server, but 5% of our packets get lost for some reason. As a result, due to retransmits in the TCP stack, the client sees a latency of 2 seconds, while on the server side we still see excellent latency: everything was handed off to the buffer, done! "I'm fine, my latency is 20 ms." And what about the client?!
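
A minimal sketch of measuring the same request from the client's side (the URL is hypothetical; a mesh sidecar exports such timings automatically from both ends):

```python
import time
import requests

url = "http://cart-service.internal/items"  # hypothetical service

start = time.monotonic()
requests.get(url, timeout=5.0)
client_latency_ms = (time.monotonic() - start) * 1000

# The server may have logged 20 ms for this same request; if packets were
# lost and TCP had to retransmit, the client can still observe ~2000 ms.
print(f"client-observed latency: {client_latency_ms:.0f} ms")
```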

And how do you solve this?

Fundamentally, this is solved by instrumenting the client. Ideally, statistics should be collected as close to the client as possible. But client-side instrumentation is not always possible and not always convenient.

What are your company’s reliability and availability metrics?

Everything is by and large standard; I will talk about it today. (Note: Ivan speaks on the third day of Slurm DevOps.) There are five or six key indicators, the so-called Service Level Indicators: latency, durability, freshness, correctness... When I was preparing my Slurm presentation, I tried to find non-standard cases at Booking.com, interesting SLI examples that did not fit this model. In theory, the key idea of SRE rests on one high-level statement: we need to pick a metric, or metrics, that reflect the user experience. For some services latency and durability are a good fit; for others they are not. Finding the equilibrium point that truly reflects the user experience is an interesting task.
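
A back-of-the-envelope sketch of how such an SLI feeds an error budget (the numbers are illustrative, not Booking.com's):

```python
# Availability SLI over a 30-day window: good requests / total requests.
total_requests = 10_000_000
good_requests = 9_995_500

sli = good_requests / total_requests        # 0.99955
slo = 0.999                                 # target: 99.9% of requests succeed

error_budget = 1 - slo                      # 0.1% of requests may fail
budget_consumed = (1 - sli) / error_budget  # fraction of the budget spent

print(f"SLI: {sli:.4%}, error budget consumed: {budget_consumed:.0%}")
```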

What unique solutions did you see at Booking.com when you came to work there? Or is everything standard?

No, we have a lot of things that are “non-standard”. Let me explain why “non-standard” is in quotation marks. Where does non-standard come from? It often comes from a company running into a problem earlier than the rest of the market, so no “standard” solution exists yet. In this regard Booking.com, a company that has been operating since 1997 and has grown to its current size, at one point faced a number of problems that the market had not yet solved.

Take Google, for example. From my observations it looks something like this: they build major things internally and release them publicly five or ten years later. For instance, I once talked with people from Google who had patched the Linux kernel. At the time I was having certain problems in the TCP stack. I told them: “This is clearly a kernel problem. How do you deal with it?” They said: “Ah, we have a patched kernel, so we can tweak a setting. We patched it in 2013, and we are only rolling it out in open source in 2018.”

It is about the same with Service Mesh. It is built in the image of the technology that Google uses internally, but they do not upload that directly to open source; Istio is essentially an open-source reimplementation of their internal system. Same with Kubernetes. In my opinion, this is because when a company is a pioneer, it creates solutions for itself, because that is faster and cheaper. Open source is expensive. Whatever anyone says about simply publishing the code, you really need to build a community, and building a community takes a huge amount of effort; only then does the return come back to you. It seems to me that behind any serious open-source project there is even more serious marketing, with real money invested in it.

Why am I saying this? As I said, when solving a problem, a company does a lot just for itself, and releasing that as open source later is quite difficult: you need to cut out the business logic and a pile of small details. We have our own Service Mesh and our own monitoring system, and to release them as open source they would have to be reworked. Still, publishing open source has its advantages...

Such as?

A technology brand, loyalty, easier onboarding. These are important.

You can’t count them directly.

Yes, that's right, you cannot count them directly. It is a long-term, strategic investment. I am neither for nor against open source; you need to be balanced, evaluating what publishing a particular technology gives the company, and balancing the long-term and short-term strategy.

Returning to the question of how much at Booking.com is standard and non-standard, I will say this: not the majority, but a lot, is non-standard. That is because we were solving problems that were still unknown in the market, or that other companies were only beginning to encounter. It is simply easier, faster, and cheaper for a company to solve such problems for itself.

P.S.: It is not possible to cover the whole topic of SRE in one presentation. It is not only about tools, but also about the philosophy of the approach itself. That is why we dedicated a whole intensive, Slurm SRE, to this topic; it will be held on February 3-5, 2020. The speakers will be Ivan Kruglov, Principal Developer at Booking.com; his colleague Ben Tyler, Principal Developer at Booking.com; Eduard Medvedev, CTO at Tungsten Labs; and Eugene Varavva, a generalist developer at Google.