“Kubernetes increased latency 10x”: who is to blame?

Translator's note: This article by Galo Navarro, Principal Software Engineer at Adevinta in Europe, is a fascinating and instructive piece of infrastructure detective work. Its original title was slightly expanded in translation, for a reason the author explains at the very beginning.







Note from the author: This publication attracted far more attention than I expected. I still get angry comments that the title is misleading and that some readers feel let down. I understand why, so, at the risk of spoiling the whole intrigue, I will tell you right away what this article is about. As teams migrate to Kubernetes, I keep observing a curious pattern: every time a problem arises (for example, increased latency after a migration), Kubernetes gets blamed first, and then it turns out the orchestrator has nothing to do with it. This article describes one such case. Its title repeats the exclamation of one of our developers (you will see later that Kubernetes is not at fault). You will not find any surprising revelations about Kubernetes in it, but you can count on a couple of good lessons about complex systems.



A couple of weeks ago, my team was migrating a single microservice to our main platform, including CI/CD, a Kubernetes-based runtime, metrics and other utilities. The move was a pilot: we planned to use it as a template and migrate about 150 more services over the coming months, all of which power some of the largest online sites in Spain (Infojobs, Fotocasa, etc.).



After we deployed the application to Kubernetes and routed part of the traffic to it, an alarming surprise awaited us: request latency in Kubernetes was 10 times higher than in EC2. We either had to find a solution or abandon the migration of this microservice (and, possibly, the entire project).



Why is latency so much higher in Kubernetes than in EC2?



To find the bottleneck, we collected metrics along the entire request path. Our architecture is simple: an API gateway (Zuul) proxies requests to microservice instances in EC2 or Kubernetes. In Kubernetes we use the NGINX Ingress Controller, and the backends are ordinary Deployment objects running a JVM application on the Spring platform.



                          EC2
                     +---------------+
                     |  +---------+  |
                  +---->| BACKEND |  |
                  |  |  +---------+  |
                  |  +---------------+
 Public  +------+ |
 ------->| ZUUL |-+
 traffic +------+ |      Kubernetes
                  |  +-----------------------------+
                  |  |  +-------+  xx  +---------+ |
                  +---->| NGINX |----->| BACKEND | |
                     |  +-------+  xx  +---------+ |
                     +-----------------------------+





The problem appeared to be related to latency in the initial stage of handling the request on the backend side (I marked the problem area on the diagram as “xx”). In EC2, the application responded in about 20 ms. In Kubernetes, latency grew to 100-200 ms.



We quickly ruled out the likely suspects related to the change of runtime environment. The JVM version was the same. Containerization was not the issue either: the application was already running successfully in containers on EC2. Load? We observed high latencies even at 1 request per second. Garbage collection pauses could also be ruled out.



One of our Kubernetes administrators asked whether the application had any external dependencies, because DNS resolution had caused similar problems in the past.



Hypothesis 1: DNS Name Resolution



With each request, our application makes one to three calls to an AWS Elasticsearch instance at a domain like elastic.spain.adevinta.com. Since we have a shell inside the containers, we could check whether DNS lookups were really taking that long.



DNS queries from the container:



 [root@be-851c76f696-alf8z /]# while true; do dig "elastic.spain.adevinta.com" | grep time; sleep 2; done
 ;; Query time: 22 msec
 ;; Query time: 22 msec
 ;; Query time: 29 msec
 ;; Query time: 21 msec
 ;; Query time: 28 msec
 ;; Query time: 43 msec
 ;; Query time: 39 msec





The same queries from one of the EC2 instances running the application:



 bash-4.4# while true; do dig "elastic.spain.adevinta.com" | grep time; sleep 2; done
 ;; Query time: 77 msec
 ;; Query time: 0 msec
 ;; Query time: 0 msec
 ;; Query time: 0 msec
 ;; Query time: 0 msec





Given that lookups were taking around 30 ms, it was clear that DNS resolution for Elasticsearch calls was indeed contributing to the latency increase.
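To double-check that this was not an artifact of dig itself, the same measurement can be reproduced from inside the JVM. Below is a minimal sketch (a hypothetical standalone helper, not part of our service) that times repeated lookups of the Elasticsearch domain; note that after the first successful lookup, the JVM's own DNS cache (discussed below) will make subsequent calls appear nearly instant.

 import java.net.InetAddress;

 public class DnsLookupTimer {
     public static void main(String[] args) throws Exception {
         // Domain from the article; pass another host as the first argument if needed.
         String host = args.length > 0 ? args[0] : "elastic.spain.adevinta.com";
         for (int i = 0; i < 5; i++) {
             long start = System.nanoTime();
             InetAddress address = InetAddress.getByName(host); // resolves (or hits the JVM cache)
             long elapsedMs = (System.nanoTime() - start) / 1_000_000;
             System.out.println(address.getHostAddress() + " resolved in " + elapsedMs + " ms");
             Thread.sleep(2000); // mirror the "sleep 2" from the dig loop
         }
     }
 }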



However, this was strange for two reasons:



  1. We already run tons of applications in Kubernetes that interact with AWS resources and do not suffer from such latencies; whatever the cause, it was specific to this case.
  2. We know that the JVM does in-memory DNS caching. In our images, the TTL is configured in $JAVA_HOME/jre/lib/security/java.security and set to 10 seconds: networkaddress.cache.ttl = 10. In other words, the JVM should cache all DNS lookups for 10 seconds (see the sketch right after this list).
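For illustration, here is a minimal sketch (a hypothetical standalone class, not our application code) showing how this TTL can also be inspected or overridden programmatically via java.security.Security; in our images it is simply set in the java.security file.

 import java.security.Security;

 public class DnsCacheTtlDemo {
     public static void main(String[] args) {
         // Must be set before the first lookup; afterwards the value is effectively frozen.
         Security.setProperty("networkaddress.cache.ttl", "10");
         System.out.println("networkaddress.cache.ttl = "
                 + Security.getProperty("networkaddress.cache.ttl"));
     }
 }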


To test the first hypothesis, we decided to temporarily eliminate DNS lookups and see whether the problem went away. Our first idea was to reconfigure the application to talk to Elasticsearch directly by IP address rather than by domain name. That would have required code changes and a new deployment, so we simply mapped the domain to its IP address in /etc/hosts:



 34.55.5.111 elastic.spain.adevinta.com
      
      





Now the container resolved the IP almost instantly. This brought some improvement, but we were still far from the expected latency level. Even though DNS resolution was taking too long, the real cause still eluded us.



Network Diagnostics



We decided to analyze the traffic from the container with tcpdump to track what exactly was happening on the network:



 [root@be-851c76f696-alf8z /]# tcpdump -leni any -w capture.pcap
      
      





Then we sent a few requests and copied the capture (kubectl cp my-service:/capture.pcap capture.pcap) for further analysis in Wireshark.



There was nothing suspicious in the DNS queries (except for one small detail I will come back to later). But there was something odd in how our service handled each request. Below is a screenshot from the capture, covering the span from when the request is received to when the response is sent:







Packet numbers are shown in the first column. For clarity, I highlighted the different TCP streams in color.



The green stream, starting at packet 328, shows how the client (172.17.22.150) established a TCP connection with the container (172.17.36.147). After the initial handshake (328-330), packet 331 carried HTTP GET /v1/.., the incoming request to our service. The whole exchange took 1 ms.



The gray stream (from packet 339) shows our service sending an HTTP request to the Elasticsearch instance (there is no TCP handshake because it reuses an existing connection). This took 18 ms.



So far so good: the times roughly match the expected latencies (20-30 ms measured from the client).



However, the blue section takes 86 ms. What is going on there? With packet 333, our service sent an HTTP GET request to /latest/meta-data/iam/security-credentials, and immediately afterwards, over the same TCP connection, another GET request to /latest/meta-data/iam/security-credentials/arn:..







We found that this pattern repeated for every request in the trace. DNS resolution is indeed a bit slower in our containers (the explanation is quite interesting, but I will save it for a separate article), but the real cause of the large delays was the calls to the AWS Instance Metadata service on every request.



Hypothesis 2: Extra AWS Calls



Both endpoints belong to the AWS Instance Metadata API. Our microservice uses this service while working with Elasticsearch; the two calls are part of its basic authorization flow. The endpoint hit by the first request returns the IAM role associated with the instance.



 / # curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
 arn:aws:iam::<account_id>:role/some_role





The second request asks the second endpoint for temporary credentials for that instance:



 / # curl http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::<account_id>:role/some_role
 {
     "Code" : "Success",
     "LastUpdated" : "2012-04-26T16:39:16Z",
     "Type" : "AWS-HMAC",
     "AccessKeyId" : "ASIAIOSFODNN7EXAMPLE",
     "SecretAccessKey" : "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
     "Token" : "token",
     "Expiration" : "2017-05-17T15:09:54Z"
 }





The client can use these credentials for a short period and must periodically obtain new ones (before their Expiration). The model is simple: AWS rotates temporary keys frequently for security reasons, while clients can cache them for several minutes to offset the performance cost of fetching new credentials.



The AWS Java SDK is supposed to take care of this process, but for some reason that was not happening.
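For context, here is a minimal sketch of how such credentials are normally obtained through the SDK, assuming the v1 AWS SDK for Java (aws-java-sdk-core); the provider below is the component that talks to the Instance Metadata endpoints shown above and is expected to cache the result between calls.

 import com.amazonaws.auth.AWSCredentials;
 import com.amazonaws.auth.InstanceProfileCredentialsProvider;

 public class MetadataCredentialsDemo {
     public static void main(String[] args) {
         // Shared provider instance backed by the Instance Metadata service.
         InstanceProfileCredentialsProvider provider =
                 InstanceProfileCredentialsProvider.getInstance();
         // Should return cached credentials until the SDK decides they are
         // close enough to expiration to be refreshed.
         AWSCredentials credentials = provider.getCredentials();
         System.out.println("Access key id: " + credentials.getAWSAccessKeyId());
     }
 }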



Searching GitHub issues, we came across issue #1921, which helped us figure out in which direction to "dig" next.



The AWS SDK renews credentials when either of the following conditions is met:

  1. the credentials have already expired;
  2. less than 15 minutes remain until their Expiration.

To see the actual expiration time of the credentials we receive, we ran the cURL commands above both from the container and from the EC2 instance. The lifetime of the credentials obtained from the container turned out to be much shorter: exactly 15 minutes.



Now everything fell into place: for the first request, our service obtained temporary credentials. Since they were valid for no more than 15 minutes, on the next request the AWS SDK decided to renew them. And this happened on every request.



Why is the credentials' lifetime shorter?



The AWS Instance Metadata service is designed for EC2 instances, not Kubernetes, but we did not want to change the application's interface. So we use KIAM, a tool that, via agents running on each Kubernetes node, lets users (engineers deploying applications to the cluster) assign IAM roles to containers in pods as if they were EC2 instances. KIAM intercepts calls to the AWS Instance Metadata service and serves them from its own cache, which it pre-populates from AWS. From the application's point of view, nothing changes.



KIAM hands out short-lived credentials to pods, which is reasonable given that the average lifetime of a pod is shorter than that of an EC2 instance. The default credential lifetime is the same 15 minutes.



As a result, when you stack both default values on top of each other, a problem appears: each credential issued to the application expires in 15 minutes, while the AWS Java SDK forcibly renews any credential that has less than 15 minutes left before expiration.



Consequently, the temporary credentials are forcibly renewed on every request, which adds a couple of calls to the AWS API and significantly increases latency. In the AWS Java SDK we found a feature request that mentions a similar problem.
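Here is a toy model of the interaction (not SDK or KIAM code, just the two assumed defaults placed side by side): credentials issued with a 15-minute lifetime meet a client that refreshes anything with less than 15 minutes left, so every single request ends up refreshing.

 import java.time.Duration;
 import java.time.Instant;

 public class RefreshInteractionDemo {
     static final Duration SESSION_DURATION = Duration.ofMinutes(15);     // KIAM default (assumed)
     static final Duration EXPIRATION_THRESHOLD = Duration.ofMinutes(15); // SDK refresh threshold (assumed)

     static Instant expiration = Instant.now().plus(SESSION_DURATION);

     static boolean needsRefresh(Instant now) {
         // Refresh whenever less time than the threshold remains until expiration.
         return Duration.between(now, expiration).compareTo(EXPIRATION_THRESHOLD) < 0;
     }

     public static void main(String[] args) {
         for (int request = 1; request <= 3; request++) {
             Instant now = Instant.now();
             if (needsRefresh(now)) {
                 // Freshly issued credentials again live only 15 minutes,
                 // so the next request will refresh them yet again.
                 expiration = now.plus(SESSION_DURATION);
                 System.out.println("Request " + request + ": refreshed credentials (extra AWS calls)");
             } else {
                 System.out.println("Request " + request + ": used cached credentials");
             }
         }
     }
 }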



The fix was simple: we reconfigured KIAM to request credentials with a longer lifetime. Once that was done, requests started flowing without involving the AWS Metadata service, and latency dropped even below the EC2 level.
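To verify the effect of such a change, one can ask the metadata endpoint (served by KIAM) how long the freshly issued credentials will live. The sketch below is a hypothetical checker, not production code: it fetches the credentials document for a role (the role name here is a placeholder) and prints the minutes remaining until Expiration, which should now be comfortably above the SDK's 15-minute threshold.

 import java.net.URI;
 import java.net.http.HttpClient;
 import java.net.http.HttpRequest;
 import java.net.http.HttpResponse;
 import java.time.Duration;
 import java.time.Instant;

 public class CredentialLifetimeCheck {
     public static void main(String[] args) throws Exception {
         // Placeholder role name; substitute the value returned by the first metadata endpoint.
         String role = args.length > 0 ? args[0] : "some_role";
         String url = "http://169.254.169.254/latest/meta-data/iam/security-credentials/" + role;

         HttpResponse<String> response = HttpClient.newHttpClient()
                 .send(HttpRequest.newBuilder(URI.create(url)).build(),
                       HttpResponse.BodyHandlers.ofString());
         String body = response.body();

         // Crude extraction of the "Expiration" field to avoid a JSON library dependency.
         int keyIndex = body.indexOf("\"Expiration\"");
         int openQuote = body.indexOf('"', body.indexOf(':', keyIndex) + 1);
         int closeQuote = body.indexOf('"', openQuote + 1);
         Instant expiration = Instant.parse(body.substring(openQuote + 1, closeQuote));

         long minutesLeft = Duration.between(Instant.now(), expiration).toMinutes();
         System.out.println("Credentials expire at " + expiration + " (" + minutesLeft + " minutes left)");
     }
 }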



Conclusions



Our experience with migrations shows that one of the most common sources of problems is not bugs in Kubernetes or other elements of the platform, nor fundamental flaws in the microservices we migrate. Problems often arise simply because we connect different elements together.



We mix complex systems that have never interacted with each other before, expecting them to form a single, larger system. Alas, the more elements, the more room for error and the higher the entropy.



In our case, the high latency was not the result of bugs or bad decisions in Kubernetes, KIAM, the AWS Java SDK, or our microservice. It was the result of combining two independent defaults: one in KIAM, the other in the AWS Java SDK. Taken separately, both parameters make sense: the eager credential renewal policy in the AWS Java SDK and the short credential lifetime in KIAM. Put together, the results become unpredictable. Two independent and reasonable decisions do not have to make sense when combined.



PS from the translator



You can learn more about the architecture of KIAM, a utility for integrating AWS IAM with Kubernetes, in this article from its creators.


