3 Kubernetes crash stories in production: anti-affinity, graceful shutdown, webhook







Translator's note: Here is a small selection of post-mortems about serious problems that engineers at various companies ran into while operating Kubernetes-based infrastructure. Each note describes the problem itself, its causes and consequences, and, of course, the solution that helps avoid similar situations in the future.



As we all know, learning from someone else's experience is cheaper, so may these stories help you prepare for possible surprises. By the way, a large and regularly updated collection of links to such “failure stories” is published on this site (based on data from this Git repository).



No. 1. How a kernel panic brought down a site



Original: Moonlight.



Between January 18 and 22, the Moonlight website and API experienced intermittent outages. It started with sporadic API errors and ended in complete downtime. The problems have since been resolved, and the application is back to normal operation.



General information



Moonlight uses software called Kubernetes. Kubernetes runs applications on groups of servers; these servers are called nodes. The copies of an application running on a node are called pods. Kubernetes has a scheduler that dynamically decides which pods should run on which nodes.



Chronology



The first errors on Friday were related to problems connecting to the Redis database. The Moonlight API uses Redis to verify the session of each authenticated request. Our Kubernetes monitoring tool notified us that some nodes and pods were not responding. At the same time, Google Cloud reported a network service disruption, and we assumed that was the cause of our problems.



As weekend traffic dropped, the errors mostly seemed to subside. However, on Tuesday morning the Moonlight site went down, and external traffic stopped reaching the cluster altogether. We found someone on Twitter with similar symptoms and concluded that Google's hosting was suffering a network failure. We contacted Google Cloud support, which promptly escalated the issue to its technical team.



Google's technical team identified a pattern in the behavior of the nodes in our Kubernetes cluster: the CPU utilization of individual nodes would reach 100%, after which the virtual machine's kernel panicked and the node crashed.



Causes



The cycle that caused the failure was as follows:


  1. The Kubernetes scheduler places several CPU-hungry pods on the same node.
  2. The pods consume all of the node's CPU, and its utilization reaches 100%.
  3. The node crashes with a kernel panic and goes offline.
  4. Kubernetes reschedules the failed pods onto the remaining nodes, increasing the load on them, and the cycle repeats.


The error initially occurred in the Redis pod, but eventually all the pods serving traffic went down, which led to a complete outage. Exponential back-off delays during rescheduling led to longer and longer periods of downtime.



Solution



We were able to restore the site by adding anti-affinity rules to all of the major Deployments. These rules automatically spread the pods across the nodes, increasing fault tolerance and performance.
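
As an illustration, a rule of roughly the following form tells the scheduler never to place two replicas of the same Deployment on one node. This is a minimal sketch; the app label, names, and image are placeholders, not Moonlight's actual configuration:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          affinity:
            podAntiAffinity:
              # Hard requirement: no two pods with the label app=web
              # may be scheduled onto the same node (hostname).
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values: ["web"]
                topologyKey: kubernetes.io/hostname
          containers:
          - name: web
            image: example/web:1.0

With preferredDuringSchedulingIgnoredDuringExecution instead of the required form, the scheduler would treat the spreading as a preference rather than a hard constraint.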



Kubernetes itself is designed as a fault-tolerant system for running applications. Moonlight uses three nodes on different servers for stability, and we run three replicas of every application that serves traffic. The idea is to have one replica on each node; in that case, even the failure of two nodes will not cause downtime. However, Kubernetes sometimes placed all three pods of the site on the same node, creating a bottleneck in the system. On top of that, other CPU-hungry applications (namely, server-side rendering) ended up on that same node rather than on a separate one.



A properly configured and properly functioning Kubernetes cluster should be able to cope with long periods of high CPU load and place pods so as to make the most of the available resources. We continue to work with Google Cloud support to identify and fix the root cause of the kernel panics on the servers.



Conclusion



Anti-affinity rules make applications that serve external traffic more fault-tolerant. If you run a similar service on Kubernetes, consider adding them.



We continue to work with the folks at Google to find and eliminate the cause of the kernel failures on the nodes.



No. 2. The “dirty” secret of Kubernetes Ingress and endpoints



Original: Phil Pearl of Ravelin.



Gracefulness is overrated



We at Ravelin have migrated to Kubernetes (on GKE). The process went very well. Our pod disruption budgets are as full as ever, our statefulsets are truly stately, and rolling node replacement goes like clockwork.



The final piece of the puzzle is moving the API layer from old virtual machines to the Kubernetes cluster. To do this, we need to configure Ingress so that the API is accessible from the outside world.



At first, the task seemed simple: we just define the Ingress controller, tweak Terraform a little to obtain a handful of IP addresses, and Google takes care of almost everything else. And it will all just work, as if by magic. Great!



Over time, however, we began to notice that our integration tests were periodically getting 502 errors. That is where our journey began. I will save you the time, though, and skip straight to the conclusions.



Graceful shutdown



Everyone talks about graceful shutdown. But you really should not rely on it in Kubernetes. Or, at least, it should not be the kind of graceful shutdown you absorbed with your mother's milk. In the Kubernetes world, that level of “gracefulness” is unnecessary and threatens serious problems.



Perfect world



Here is how most people picture the removal of a pod from a service or load balancer in Kubernetes:



  1. The replication controller decides to remove the pod.
  2. The pod's endpoint is removed from the service or load balancer. New traffic no longer reaches the pod.
  3. The pre-stop hook is called, or the pod receives a SIGTERM signal.
  4. The pod shuts down “gracefully”: it stops accepting incoming connections.
  5. Once all of its existing connections are closed or time out, the graceful shutdown completes and the pod is destroyed.


Unfortunately, the reality is completely different.



Real world



Most of the documentation hints that things actually happen a little differently, but nowhere does it say so explicitly. The main problem is that step 3 does not follow step 2: they happen at the same time. With ordinary services, endpoints are removed so quickly that you are very unlikely to hit the problem. With Ingress, however, things are different: it usually reacts much more slowly, so the problem becomes obvious. The pod can receive SIGTERM long before the endpoint change reaches the Ingress.



As a result, a graceful shutdown is not at all what is required of the pod. It will keep receiving new connections and must continue to serve them; otherwise clients will start getting 500 errors, and the whole wonderful story of painless deployments and scaling will start to fall apart.



Here's what actually happens:



  1. The replication controller decides to remove the pod.
  2. The pod's endpoint is removed from the service or load balancer. With Ingress, this can take quite a while, and new traffic keeps flowing to the pod.
  3. The pre-stop hook is called, or the pod receives a SIGTERM signal.
  4. The pod should largely ignore this, keep running, and keep serving new connections. If possible, it should hint to clients that they ought to go elsewhere; in the case of HTTP, for example, it can send Connection: close in the response headers.
  5. The pod exits only when the graceful termination period expires and it is killed with SIGKILL.
  6. Make sure this period is longer than the time it takes to reprogram the load balancer.


If this is third-party code and you cannot change its behavior, the best you can do is add a pre-stop hook that simply sleeps for the graceful termination period, so that the pod keeps serving traffic as if nothing had happened.
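
A minimal sketch of that workaround is shown below. The durations and image are illustrative; the key constraint is that terminationGracePeriodSeconds must exceed both the sleep and the time the load balancer needs to stop sending traffic to the pod:

    apiVersion: v1
    kind: Pod
    metadata:
      name: api
    spec:
      # Total time Kubernetes waits (pre-stop hook + SIGTERM handling) before SIGKILL.
      terminationGracePeriodSeconds: 60
      containers:
      - name: api
        image: example/api:1.0
        lifecycle:
          preStop:
            exec:
              # SIGTERM is sent only after this hook finishes, so the container
              # keeps accepting and serving new connections during the sleep.
              command: ["/bin/sh", "-c", "sleep 30"]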



No. 3. How a simple webhook caused a cluster failure



Original: Jetstack.



Jetstack offers its customers multi-tenant Kubernetes platforms. Sometimes there are special requirements that the standard Kubernetes configuration cannot satisfy, so to implement them we recently started using the Open Policy Agent (we covered the project in more detail in this review - translator's note) as an admission controller for enforcing custom policies.



This article describes an outage caused by a misconfiguration of that integration.



Incident



We were upgrading the master of a dev cluster that various teams used to test their applications during the working day. It was a regional cluster in europe-west1 on Google Kubernetes Engine (GKE).



The teams had been warned that an upgrade was in progress; no downtime was expected. Earlier that day we had already performed a similar upgrade on another pre-production environment.



We started the upgrade through our GKE Terraform pipeline. The master upgrade had not finished by the time the Terraform timeout expired (we had set it to 20 minutes). This was the first sign that something had gone wrong, although the GKE console still showed the cluster as “upgrading”.



Restarting the pipeline produced the following error:



    google_container_cluster.cluster: Error waiting for updating GKE master version: All cluster resources were brought up, but the cluster API is reporting that: component "kube-apiserver" from endpoint "gke-..." is unhealthy





At this point, the connection to the API server started dropping intermittently, and the teams could not deploy their applications.



While we were trying to work out what was going on, all the nodes started being destroyed and recreated in an endless loop. This led to an indiscriminate denial of service for all of our clients.



Establishing the root cause of the failure



With the help of Google support, we were able to determine the sequence of events that led to the failure:



  1. GKE completed the upgrade on one master instance and began directing all API server traffic to it while the remaining masters were being upgraded.
  2. While the second master instance was being upgraded, the API server was unable to run the PostStartHook for CA registration.
  3. While running this hook, the API server tried to update the ConfigMap called extension-apiserver-authentication in kube-system. It could not do so because the backend for the Open Policy Agent (OPA) validating webhook we had configured was not responding.
  4. For the master to pass its health check, this operation must complete successfully. Since it did not, the second master entered a crash loop and halted the upgrade.


The result was intermittent API server outages, during which the kubelets could not report node health. That, in turn, caused GKE's node auto-repair mechanism to start recreating the nodes. This feature is described in detail in the documentation:



An unhealthy status can mean: a node reports no status at all over the given time threshold (approximately 10 minutes).


Solution



Once we determined that the ValidatingAdmissionWebhook resource was the cause of the intermittent access to the API server, we deleted it and the cluster recovered.



Since then, the ValidatingAdmissionWebhook for OPA has been configured to watch only those namespaces where the policy applies and to which the development teams have access. We also limited the webhook to Ingress and Service resources, the only kinds our policy works with.
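
Schematically, that scoping looks something like the sketch below. This is not our production manifest: the opt-in label used in namespaceSelector, the service reference, and the API groups shown are assumptions and depend on your Kubernetes version and OPA deployment:

    apiVersion: admissionregistration.k8s.io/v1
    kind: ValidatingWebhookConfiguration
    metadata:
      name: opa-validating-webhook
    webhooks:
    - name: validating-webhook.openpolicyagent.org
      # Only validate the resource kinds our policies actually cover.
      rules:
      - operations: ["CREATE", "UPDATE"]
        apiGroups: ["networking.k8s.io"]
        apiVersions: ["*"]
        resources: ["ingresses"]
      - operations: ["CREATE", "UPDATE"]
        apiGroups: [""]
        apiVersions: ["*"]
        resources: ["services"]
      # Only validate namespaces that carry an opt-in label,
      # so system namespaces such as kube-system are never touched.
      namespaceSelector:
        matchLabels:
          opa-webhook: enabled
      clientConfig:
        service:
          name: opa
          namespace: opa
      admissionReviewVersions: ["v1"]
      sideEffects: None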



Since we first deployed OPA, the documentation has been updated to reflect this change.



We also added a liveness probe to make sure OPA is restarted when it becomes unresponsive (and amended the documentation accordingly).
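
For reference, such a probe might look roughly like this. It is a sketch under assumptions: OPA serving its API over plain HTTP on port 8181 and exposing its standard /health endpoint; a real admission-controller deployment also needs TLS and policy loading, which are omitted here:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: opa
      namespace: opa
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: opa
      template:
        metadata:
          labels:
            app: opa
        spec:
          containers:
          - name: opa
            image: openpolicyagent/opa:latest
            args: ["run", "--server", "--addr=0.0.0.0:8181"]
            ports:
            - containerPort: 8181
            livenessProbe:
              # Restart the container if the OPA API stops answering health checks.
              httpGet:
                path: /health
                port: 8181
              initialDelaySeconds: 5
              periodSeconds: 10
              failureThreshold: 3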



We also considered disabling GKE's node auto-repair mechanism, but in the end decided against it.



Summary



If we had had alerts on API server response times enabled, we would have noticed the global increase in latency for all CREATE and UPDATE requests right after deploying the webhook for OPA.
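
For example, a Prometheus alerting rule along these lines would have caught it. This is a sketch, assuming the API server's apiserver_request_duration_seconds histogram is being scraped; the metric name, verb label values, and thresholds vary between Kubernetes and monitoring-stack versions:

    groups:
    - name: kube-apiserver-latency
      rules:
      - alert: ApiServerWriteLatencyHigh
        # 99th-percentile latency of write requests (POST/PUT roughly correspond
        # to CREATE/UPDATE) over the last 5 minutes, alerting above 1 second.
        expr: |
          histogram_quantile(0.99,
            sum(rate(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT"}[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "kube-apiserver write request latency (p99) is above 1s"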



This underlines the importance of configuring health checks for all workloads. Looking back, deploying OPA was so deceptively simple that we did not even bother with the Helm chart (although we should have). The chart makes a number of adjustments beyond the basic setup described in the tutorial, including configuring a livenessProbe for the admission-controller containers.



We were not the first to run into this problem: the upstream issue remains open. The behavior here clearly leaves room for improvement (and we will be following up on it).


