Debugging 'Too many open files' in Kubernetes: nofile vs inotify/fsnotify

March 12, 2026 · 5 min read


When you see “too many open files” in a containerized app, it’s tempting to jump straight to ulimit -n. Sometimes that’s correct. But on Linux (especially with Go apps using fsnotify), the same error can come from inotify limits, even if your process has a huge file-descriptor limit.

This post is a practical, copy/paste-friendly checklist to debug the problem on a real Kubernetes cluster.

Step 0: Decide which “limit” you’re hitting

There are (at least) three common failure modes that can all look like “too many open files”:

Per-process file descriptor limit (classic EMFILE)
- Think: ulimit -n, /proc/<pid>/limits, or systemd LimitNOFILE.
System-wide file table exhaustion (ENFILE)
- Think: /proc/sys/fs/file-nr approaching /proc/sys/fs/file-max.
inotify instance/watch limits (common with fsnotify)
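- Think: fs.inotify.max_user_instances and fs.inotify.max_user_watches.
- Hitting max_user_instances makes inotify_init fail with EMFILE, so it is reported as “too many open files”. Hitting max_user_watches makes inotify_add_watch fail with ENOSPC, which usually surfaces as “no space left on device” instead.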

The rest of the tutorial helps you quickly identify which one applies.
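As a rough first pass, the error text itself often narrows things down. The sketch below maps a log line to the limit worth checking first. classify_fd_error is a hypothetical helper name, and the matching assumes the standard Linux error wording (which Go prints in lowercase); treat the result as a hint, not a diagnosis.

```shell
# Hypothetical helper: guess which limit to check first from an error message.
# Assumes the usual strerror wording for EMFILE, ENFILE, and ENOSPC.
classify_fd_error() (
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    # Must be tested before the shorter EMFILE text, which it contains.
    *"too many open files in system"*)
      echo "ENFILE: check node file-nr/file-max" ;;
    *"too many open files"*)
      echo "EMFILE: check nofile and inotify max_user_instances" ;;
    *"no space left on device"*)
      echo "ENOSPC: check inotify max_user_watches (and disk space)" ;;
    *)
      echo "unknown" ;;
  esac
)

# Example:
#   classify_fd_error "fsnotify: too many open files"
```

The ENFILE pattern is checked first because its message contains the EMFILE message as a prefix.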

Step 1: Check the real limits of the failing process (not your shell)

Inside the pod:

# Find your process (adjust the pattern for your app)
ps -eo pid,comm,args | grep -E 'myapp|server|gateway' | grep -v grep

# Replace <pid> with the real PID
grep 'Max open files' /proc/<pid>/limits

# Count currently open FDs
ls /proc/<pid>/fd | wc -l

Why this matters:

ulimit -n shows the limit for your current shell, which may be totally different from the limit of a process started by an init system or supervisor (systemd, supervisord, the container runtime, etc.).
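To turn those two numbers into a single "how close am I" figure, something like the following can help. It is a minimal sketch: soft_nofile and fd_usage are hypothetical helper names, and it assumes the usual column layout of /proc/<pid>/limits (soft limit in the fourth whitespace-separated field).

```shell
# Print the soft "Max open files" limit from a limits file
# (for example /proc/1234/limits). Assumes the standard column layout.
soft_nofile() {
  awk '/^Max open files/ { print $4 }' "$1"
}

# Print the open FD count, the soft limit, and the usage percentage for a PID.
fd_usage() (
  pid=$1
  soft=$(soft_nofile "/proc/$pid/limits")
  open=$(ls "/proc/$pid/fd" | wc -l)
  case "$soft" in
    # "unlimited", unreadable, or zero: skip the percentage.
    ''|0|*[!0-9]*) echo "pid=$pid open=$open soft=$soft" ;;
    *) echo "pid=$pid open=$open soft=$soft used=$((open * 100 / soft))%" ;;
  esac
)

# Example (inside the pod):
#   fd_usage 1
```

A process sitting well below its soft limit while still logging the error is a strong hint to look at Steps 2 and 3 instead.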

Step 2: Check node-wide file table pressure (ENFILE)

On the Kubernetes node:

cat /proc/sys/fs/file-nr
cat /proc/sys/fs/file-max

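file-nr prints three numbers: allocated file handles, allocated-but-unused handles (always 0 on modern kernels), and the maximum (the same value as file-max). If the allocated count is near the maximum, you have a node-level exhaustion problem that can break unrelated workloads.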
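If you want that as a percentage for a quick check or an alert, a small helper is enough. This is a sketch; file_table_pct is a hypothetical name, and it expects the three file-nr fields as one string.

```shell
# Usage: file_table_pct "<allocated> <unused> <max>"
# Prints allocated handles as a whole-number percentage of the maximum.
file_table_pct() {
  printf '%s\n' "$1" | awk '{ printf "%d\n", $1 * 100 / $3 }'
}

# Example (on the node):
#   file_table_pct "$(cat /proc/sys/fs/file-nr)"
```

On most nodes file-max is very large, so this will normally print 0; anything in the high double digits deserves attention.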

Step 3: Check inotify limits (the usual fsnotify culprit)

On the node (or inside the container—these are node kernel settings):

cat /proc/sys/fs/inotify/max_user_instances
cat /proc/sys/fs/inotify/max_user_watches
cat /proc/sys/fs/inotify/max_queued_events

If your app uses Go file watching, fsnotify’s Linux notes are worth skimming: fsnotify README (Linux).

Key concept:

max_user_instances is per-UID. In Kubernetes, multiple containers/processes can share the same numeric UID (e.g., a non-root “app user”), which means they share the same inotify instance budget on that node.

Step 4: Count inotify “instances” currently in use (who’s consuming them?)

On Linux, inotify instances show up as file descriptors named anon_inode:inotify.

Run this on the node to see which processes (and UIDs) are holding inotify instances:

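# Count inotify FDs per process (run as root to see other users' processes)
for pid in /proc/[0-9]*; do
  p=${pid#/proc/}
  [ -r "$pid/fd" ] || continue
  c=0
  for fd in "$pid"/fd/*; do
    # Each inotify instance is an FD whose link target is anon_inode:inotify
    [ "$(readlink "$fd" 2>/dev/null)" = "anon_inode:inotify" ] && c=$((c+1))
  done
  [ "$c" -gt 0 ] || continue
  uid=$(awk '/^Uid:/{print $2}' "$pid/status" 2>/dev/null)   # real UID
  cmd=$(tr '\0' ' ' < "$pid/cmdline" 2>/dev/null)
  echo "uid=$uid pid=$p inotify_instances=$c $cmd"
# Split on "=" so the 4th field starts with the instance count, then sort descending
done | sort -t= -k4 -nr | head -n 50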

What to look for:

A single UID with inotify_instances totals near max_user_instances
Your app process holding many anon_inode:inotify FDs
A node “agent” (log collector, metrics sidecar, etc.) consuming a lot
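Because the limit is per UID, per-process counts only tell half the story. The sketch below totals the counts by UID; sum_instances_by_uid is a hypothetical helper that reads lines in the uid=… pid=… inotify_instances=… format printed by the loop above (drop the head -n 50 there if you want complete totals).

```shell
# Read "uid=<n> pid=<n> inotify_instances=<n> <cmd>" lines on stdin and
# print one total per UID, largest first.
sum_instances_by_uid() {
  awk '{
    split($1, u, "=")      # uid=<n>
    split($3, c, "=")      # inotify_instances=<n>
    total[u[2]] += c[2]
  }
  END { for (k in total) print "uid=" k " total_instances=" total[k] }' |
  sort -t= -k3 -nr
}

# Example, assuming the loop output was saved to a file named instances.txt:
#   sum_instances_by_uid < instances.txt
#   cat /proc/sys/fs/inotify/max_user_instances
```

A UID whose total sits at or just under max_user_instances is the one hitting the limit, even if no single process looks excessive.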

Step 5: (Optional) Count how many watches each inotify FD holds

If you suspect “too many watches” (not instances), you can inspect /proc/<pid>/fdinfo/*:

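# Replace <pid> with the real PID
pid=<pid>
for fd in /proc/$pid/fd/*; do
  [ "$(readlink "$fd" 2>/dev/null)" = "anon_inode:inotify" ] || continue
  # Each watch is one "inotify wd:..." line in the FD's fdinfo entry
  n=$(grep -c 'inotify wd' /proc/$pid/fdinfo/${fd##*/} 2>/dev/null || true)
  echo "pid=$pid fd=${fd##*/} watches=$n"
# Split on "=" so the 4th field is the watch count, then sort descending
done | sort -t= -k4 -nr | head -n 20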
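Once you know an FD holds many watches, the next question is usually what it is watching. Each fdinfo watch line carries the target's inode number in hexadecimal (the ino: field), so a rough sketch like the one below can list them in decimal. watch_inodes is a hypothetical helper name, and it assumes the fdinfo line layout documented in proc(5).

```shell
# Print the inode number (decimal) of every watch in an inotify fdinfo file.
# Assumes lines of the form: "inotify wd:<hex> ino:<hex> sdev:<hex> ..."
watch_inodes() {
  sed -n 's/^inotify wd:[0-9a-f]* ino:\([0-9a-f]*\) .*/\1/p' "$1" |
  while read -r hex; do
    echo "$((0x$hex))"     # fdinfo reports the inode in hex
  done
}

# Example:
#   watch_inodes /proc/1234/fdinfo/7
```

An inode number can then be resolved to a path with find <mountpoint> -xdev -inum <number>, run on the filesystem the watch lives on; that search can be slow on large volumes.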

Step 6: “How do I kill the file watcher?”

You generally can’t “kill a watch” from outside the process. The kernel releases a watch when the owning process removes it, closes the inotify FD, or exits.

Practical actions:

Restart the pod / process that owns the inotify instances.
If a node is saturated, cordon/drain and reschedule to another node as a temporary workaround.
Fix the root cause by raising limits and/or reducing watcher usage.
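To restart the right pod you first need to map a node-level PID back to it. One way, sketched below, is to pull the pod UID out of the process's cgroup path. pod_uid_from_cgroup is a hypothetical helper, and the pattern assumes the common kubelet layouts (…/pod<uid>/… with the cgroupfs driver, …-pod<uid_with_underscores>.slice with the systemd driver); other runtimes may differ.

```shell
# Print the Kubernetes pod UID found in a cgroup file such as /proc/<pid>/cgroup.
# Assumption: the cgroup path contains "pod<uid>" with "-" or "_" separators.
pod_uid_from_cgroup() {
  grep -oE 'pod[0-9a-f]{8}([-_][0-9a-f]{4}){3}[-_][0-9a-f]{12}' "$1" |
  head -n 1 |
  sed 's/^pod//; s/_/-/g'
}

# Example (on the node), then look the UID up with kubectl:
#   pod_uid_from_cgroup /proc/1234/cgroup
#   kubectl get pods -A \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,UID:.metadata.uid
```

If the helper prints nothing, the process is probably not running in a pod (a node agent started by systemd, for example).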

Step 7: Mitigation—raise inotify instance limits (and make it persistent)

If you’ve confirmed max_user_instances is the bottleneck, increasing it is often the quickest fix.

Temporary (until reboot):

sysctl -w fs.inotify.max_user_instances=1024

Persistent:

Add a sysctl config file on the node (exact location varies by distro), for example:
- /etc/sysctl.d/99-inotify.conf
Then reload sysctls (varies by environment), commonly:
- sysctl --system
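As a sketch of the persistent variant, the helper below writes the drop-in file. write_inotify_conf is a hypothetical name, the directory is passed in so you can try it somewhere harmless first, and 1024 is only the example value used above, not a recommendation.

```shell
# Write a sysctl drop-in that sets fs.inotify.max_user_instances.
#   $1 = target directory (e.g. /etc/sysctl.d), $2 = new limit
write_inotify_conf() {
  printf 'fs.inotify.max_user_instances = %s\n' "$2" > "$1/99-inotify.conf"
}

# Example (on the node, as root):
#   write_inotify_conf /etc/sysctl.d 1024 && sysctl --system
#   cat /proc/sys/fs/inotify/max_user_instances   # confirm the new value
```

Reading the value back from /proc afterwards confirms that no later file in the sysctl load order overrides it.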

In managed Kubernetes, you may prefer:

baking sysctl settings into your node image/bootstrap
setting them via a privileged DaemonSet (policy-dependent)

Step 8: Don’t forget nofile (it’s still real)

Even if inotify was the cause this time, it’s worth capturing nofile facts for your runbook:

Systemd defaults:

systemctl show --property DefaultLimitNOFILE
systemctl show --property DefaultLimitNOFILESoft

Unit-specific limits (examples):

systemctl show kubelet --property LimitNOFILE
systemctl show containerd --property LimitNOFILE

And always verify the actual process limit via /proc/<pid>/limits (Step 1).

Summary: the fastest “root cause” loop

Check the failing process’s actual Max open files and FD count.
Check node file-nr/file-max for system-wide FD exhaustion.
Check inotify sysctls and enumerate anon_inode:inotify to find the UID/process consuming instances.
Apply the smallest safe mitigation (restart offender, reschedule, raise sysctl) and confirm the error disappears.
