
Why Did My Kubernetes Pod Stop Abruptly?



We’ve all been there. Your Pod is running along, doing its job, and then suddenly - it stops. No graceful shutdown, no clear reason. It’s frustrating.


A quick run through the Pod lifecycle flashes through our minds, for sure.




In fact, most of us are more or less ready for the frequent, obvious ones (a quick way to check for each is shown right after this list):


📌 Pod stops with 'Evicted' when disk pressure hits the node.


📌 Pod stops with 'OOMKilled' when a container exceeds its memory limit or the node runs out of memory.


📌 Pod stops with 'CrashLoopBackOff' when its container keeps crashing and restarting.


📌 Pod stops with 'ImagePullBackOff' when it can’t fetch the container image.
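Before digging deeper, a quick status check usually tells you which of these you're facing. A minimal sketch - pod and namespace names are placeholders:

kubectl get pods -n <namespace>                      # STATUS column shows Evicted, CrashLoopBackOff, etc.
kubectl describe pod <pod-name> -n <namespace>       # Events and Last State carry the reason and exit code
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}'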


A client reached out a while back for a consultation on exactly this kind of recurring issue.


In their cluster, a critical Pod running a multi-threaded app intermittently failed without clear logs. It ended up 'Failed' with a blank reason, while other Pods on the node seemed fine - until they weren’t.


So what was actually going on behind this mess?

The application spawned subprocesses without ever reaping them, leaving zombie processes behind.


These zombies accumulated, exhausting all available PIDs on the node.


Kubernetes couldn’t allocate PIDs for new Pods, causing abrupt failures.


Even essential processes like each Pod’s pause container couldn’t start, resulting in Pod terminations with unclear logs.
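If you want to see the mechanism for yourself, here is a minimal local demo of how a zombie appears when a parent never calls wait() on an exited child (the sleep durations are arbitrary, and everything cleans itself up once the parent exits):

# The parent ('sleep 60' after exec) never reaps its child ('sleep 2'),
# so the child lingers in state Z after exiting:
sh -c 'sleep 2 & exec sleep 60' &
sleep 5
ps -e -o pid,ppid,stat,cmd | awk '$3 ~ /^Z/'   # shows the defunct 'sleep 2'

Multiply that by hundreds of forked workers that are never reaped, and the node’s PID table fills up.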


The Kubernetes documentation's 'Process ID Limits and Reservations' page is a fantastic guide to understanding PID exhaustion in Kubernetes.
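That guide covers the knobs that bound PID usage per Pod and reserve PIDs for the system. Roughly, the relevant KubeletConfiguration fields look like this - the values are illustrative and the config file path varies by distribution:

# Typical kubelet config file location (varies by setup): /var/lib/kubelet/config.yaml
#
#   podPidsLimit: 4096          # cap the number of PIDs any one Pod can use
#   evictionHard:
#     pid.available: "10%"      # evict Pods before the node itself runs out of PIDs
#   systemReserved:
#     pid: "1000"               # keep PIDs back for system daemons
#
# After editing, restart the kubelet to apply:
sudo systemctl restart kubelet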


This wasn’t a straightforward problem, but here’s how we cracked it:


1. Analyzing the Node State

We SSH’ed into the node hosting the failing Pods and checked its PID ceiling:


cat /proc/sys/kernel/pid_max

This showed a max limit of 32,768 PIDs.


Counting the running processes (ps aux | wc -l) revealed that nearly all PIDs were in use.
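A quick way to reproduce that check on the node itself, counting threads as well since every thread consumes a PID:

cat /proc/sys/kernel/pid_max      # the ceiling (32,768 here)
ls -d /proc/[0-9]* | wc -l        # processes currently holding a PID
ps -eLf | wc -l                   # processes plus threads - each thread uses a PID too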


2. Inspecting Zombie Processes

We looked for zombie processes (state Z in the STAT column):


ps -e -o pid,ppid,stat,cmd | grep 'Z'

Hundreds of zombie processes were tied to the legacy application.
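To find which parent is failing to reap them, it helps to group the zombies by parent PID - a small sketch, where <ppid> is whatever the first command reports most often:

# Count zombies per parent PID; the top entry is the process that never calls wait():
ps -e -o ppid= -o stat= | awk '$2 ~ /^Z/ {print $1}' | sort | uniq -c | sort -rn | head
# Inspect the worst offender:
ps -p <ppid> -o pid,ppid,user,cmd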


3. Identifying the Offending Pod

Cross-referenced the zombie process PIDs with Pod logs to identify the application responsible for spawning these processes.
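One way to make that cross-reference concrete is to map a PID back to its Pod through the cgroup hierarchy, since the Pod UID appears in the cgroup path - a sketch, with <ppid> being the non-reaping parent found above:

cat /proc/<ppid>/cgroup           # the kubepods path embeds the Pod's UID
# (with the systemd cgroup driver, dashes in the UID appear as underscores)
kubectl get pods -A -o custom-columns=NS:.metadata.namespace,POD:.metadata.name,UID:.metadata.uid | grep <pod-uid>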


4. Correlating with Kubernetes Events

Ran kubectl describe node <node-name> to confirm PIDPressure. Kubernetes marked the node as unhealthy due to PID exhaustion.
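The same condition can also be pulled out directly instead of scanning the describe output - for example:

kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="PIDPressure")].status}{"\n"}'
kubectl get events --field-selector involvedObject.name=<node-name>,involvedObject.kind=Node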


The Fix:

Increased node PID limit temporarily (sysctl -w kernel.pid_max=4194304).


Fixed the application to reap its child processes, running it under s6-overlay so leftover zombies get cleaned up.


Isolated the legacy app to a dedicated node pool to protect other workloads (sketched below).
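For reference, the first and last steps looked roughly like this - node names, labels, and taint keys are illustrative, and the Pod spec needs a matching nodeSelector and toleration:

# Persist the raised PID ceiling across reboots (sysctl -w alone is temporary):
echo 'kernel.pid_max = 4194304' | sudo tee /etc/sysctl.d/99-pid-max.conf
sudo sysctl --system

# Fence the legacy app onto its own node pool:
kubectl label nodes <node-name> workload=legacy-app
kubectl taint nodes <node-name> dedicated=legacy-app:NoSchedule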


Of course, this could have been completely avoided.


📌 Use a process supervisor like s6-overlay as PID 1 in containers so child processes are managed and orphaned children get reaped.


📌 Low-density nodes can also hit PID exhaustion. Monitor the PIDPressure node condition with

kubectl describe node <node-name>

or the cluster-wide one-liner below.
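A sketch of that one-liner, printing the PIDPressure condition for every node (any node reporting True needs attention before Pods start failing):

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="PIDPressure")].status}{"\n"}{end}'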


Hope this use case was interesting and equally informative.

 
 
 
