When running Linux in embedded or high-performance environments, you may encounter a rare but disruptive issue: all processes freeze for about one second. This guide explains the root cause, how to detect it, and what configuration changes can prevent it.
Symptoms
- System-wide stalls (~1s): All processes pause simultaneously.
- Dropped frames or data gaps: If running real-time workloads (e.g., video capture, LiDAR streaming), data loss may appear.
- Occurs intermittently: Often after hours of continuous operation, sometimes once every ~20 hours.
- Hard to reproduce: Typically appears only under long-running, memory-intensive, I/O-heavy workloads.
Start by ruling out the usual suspects:
- Check CPU load and storage latency.
- Review system logs (dmesg, journalctl).
- Verify if it’s tied to specific hardware, uptime, or processes.
In this case, none of those showed clear patterns.
To identify what’s happening inside the kernel at the freeze point:
- Collect events:
sudo trace-cmd record -e sched_switch -e irq_handler_entry -e irq_handler_exit
- Visualize with KernelShark.
Observation: Just before every freeze, the kernel thread kswapd was active.
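To narrow a long recording down to the moments kswapd was scheduled in, the text output of trace-cmd report can be filtered. The following is a minimal sketch that assumes report lines carry a timestamp, the literal sched_switch: and a next_comm= field. Some trace-cmd versions print a compact comm:pid form instead, so the pattern may need adjusting. The function name and the sample lines are illustrative, not taken from a real capture.

```python
import re
from typing import List, Tuple

# Assumed line shape from `trace-cmd report`; adjust if your version prints differently.
SWITCH_RE = re.compile(
    r"\s(?P<ts>\d+\.\d+):\s+sched_switch:.*?next_comm=(?P<next>\S+)"
)

def kswapd_switch_ins(report_text: str) -> List[Tuple[float, str]]:
    """Return (timestamp, comm) for each sched_switch that schedules a kswapd thread in."""
    hits = []
    for line in report_text.splitlines():
        m = SWITCH_RE.search(line)
        if m and re.fullmatch(r"kswapd\d+", m.group("next")):
            hits.append((float(m.group("ts")), m.group("next")))
    return hits

# Illustrative sample lines (hypothetical task names and PIDs).
SAMPLE = """\
  <idle>-0     [002]  5012.100200: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kswapd0 next_pid=85 next_prio=120
 kswapd0-85    [002]  5013.100900: sched_switch: prev_comm=kswapd0 prev_pid=85 prev_prio=120 prev_state=S ==> next_comm=camera_node next_pid=2210 next_prio=120
"""

if __name__ == "__main__":
    for ts, comm in kswapd_switch_ins(SAMPLE):
        print(f"{ts:.6f} {comm} scheduled in")
```

To use it on a real capture, pass the text produced by trace-cmd report for your .dat file instead of SAMPLE and compare the timestamps with the start of each freeze.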
Understand what kswapd does:
- kswapd is the kernel memory reclamation daemon.
- It frees pages when memory runs low by moving data to swap or freeing cache.
- While reclaiming, it can hold critical memory locks. If the reclaim takes too long, other processes block — causing a stall.
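One way to see whether kswapd is scanning far more pages than it frees is to compare two samples of the pgscan_kswapd and pgsteal_kswapd counters in /proc/vmstat. This is a rough sketch: the helper names and the sample numbers are made up for illustration, and a low ratio is only a hint that reclaim is struggling, not proof of this particular problem.

```python
from typing import Dict, Optional

def parse_vmstat_text(text: str) -> Dict[str, int]:
    """Parse 'name value' lines as found in /proc/vmstat."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters

def reclaim_efficiency(before: Dict[str, int], after: Dict[str, int]) -> Optional[float]:
    """Pages reclaimed per page scanned by kswapd between two samples.

    Returns None when kswapd did not scan anything in the interval.
    """
    scanned = after.get("pgscan_kswapd", 0) - before.get("pgscan_kswapd", 0)
    stolen = after.get("pgsteal_kswapd", 0) - before.get("pgsteal_kswapd", 0)
    if scanned <= 0:
        return None
    return stolen / scanned

# Illustrative samples, not measurements from a real system.
BEFORE = parse_vmstat_text("pgscan_kswapd 1000000\npgsteal_kswapd 900000\n")
AFTER = parse_vmstat_text("pgscan_kswapd 1400000\npgsteal_kswapd 904000\n")

if __name__ == "__main__":
    eff = reclaim_efficiency(BEFORE, AFTER)
    print(f"kswapd reclaimed {eff:.1%} of the pages it scanned")
```

On a live system, read /proc/vmstat twice a few seconds apart and feed both texts through parse_vmstat_text.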
Root cause:
- The system had two memory zones:
  - Zone Normal (~60GB): Regular allocations.
  - Zone DMA (~2GB): Reserved for device drivers.
- When Zone Normal was low, allocations spilled into Zone DMA.
- kswapd then tried to free Zone DMA memory. But:
  - Very few DMA pages were reclaimable.
  - The reclaim scan looped through huge amounts of Zone Normal pages.
  - This extended the lock-holding time, blocking all user processes for ~1s.
The fix:
To prevent general processes from consuming Zone DMA, adjust lowmem_reserve_ratio:
echo "30 30 30 0" | sudo tee /proc/sys/vm/lowmem_reserve_ratio
Why it works
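- A smaller ratio means a larger reserve: the kernel protects roughly (pages in the higher zone) / ratio pages of the lower zone. Setting 30, rather than a larger value such as the common default of 256, therefore raises the Zone DMA reserve, effectively telling Linux: “Don’t allocate from Zone DMA unless explicitly required.”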
- Since Zone Normal was sufficiently large, workloads no longer touched Zone DMA.
- Result: kswapd never got stuck scanning endlessly → no more freezes.
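The effect of the ratio can be checked with back-of-envelope arithmetic. Per the kernel's sysctl documentation, the pages a lower zone holds back from allocations that a higher zone could have served are roughly the higher zone's managed pages divided by the ratio. The sketch below assumes 4 KiB pages, the approximate zone sizes quoted above, and 256 as the earlier ratio for Zone DMA (a common default, so check your own system). The function names are illustrative.

```python
PAGE_SIZE = 4096  # bytes; an assumption, check with `getconf PAGESIZE`

def protected_pages(higher_zone_bytes: int, ratio: int) -> int:
    """Pages of the lower zone held back from allocations aimed at the higher zone."""
    return (higher_zone_bytes // PAGE_SIZE) // ratio

def to_mib(pages: int) -> float:
    """Convert a page count to MiB."""
    return pages * PAGE_SIZE / 2**20

NORMAL_BYTES = 60 * 2**30  # ~60 GB Zone Normal, the approximate figure from this article

if __name__ == "__main__":
    for ratio in (256, 30):
        pages = protected_pages(NORMAL_BYTES, ratio)
        print(f"ratio {ratio:>3}: {pages} pages (~{to_mib(pages):.0f} MiB) of Zone DMA protected")
```

Under these assumptions a ratio of 256 protects about 240 MiB, while a ratio of 30 protects about 2 GiB, which is roughly the whole of Zone DMA on this system. That is consistent with general allocations no longer landing there.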
Remaining considerations:
- This is a workaround, not a kernel-level fix.
- If your hardware or drivers truly require Zone DMA, test carefully before applying.
- Ideally, the Linux memory management code should better handle this corner case — consider reporting it upstream if reproducible.
Also, if you encounter unexplained system-wide freezes in Linux:
- Trace with KernelShark → look for kswapd activity before stalls.
- Check memory zones (cat /proc/zoneinfo) to see if DMA is in use.
- Adjust lowmem_reserve_ratio to restrict general use of Zone DMA.
- Verify stability under your real workload (especially if devices rely on DMA).
The issue isn’t a “broken kernel” but a pathological interaction between kswapd and Zone DMA memory scanning. With careful tuning, you can eliminate 1-second freezes and ensure stable real-time performance for embedded and data-intensive systems.
kswapd stall monitor + auto trace capture
What the script below does:
- Samples every 100 ms.
- Detects event-loop stalls (e.g., when everything pauses) by measuring loop delay.
- Tracks kswapd* CPU usage and reclaim counters (pgscan_kswapd, pgsteal_kswapd, etc.).
- When it sees a likely kswapd-related stall, it:
  - Logs a structured alert (timestamped JSON).
  - Snapshots /proc/zoneinfo (to see DMA/Normal pressure).
  - Optionally runs trace-cmd record for ~3 s (sched + mm reclaim events) to a timestamped file.
No kernel changes required. Works on stock distros; trace-cmd is optional but recommended.
kswapd_stall_monitor.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# by Alex for DropletDrift.com
# article: https://dropletdrift.com/fixing-1-second-freezes-in-linux-due-to-kswapd-and-zone-dma/
"""
kswapd_stall_monitor.py
Detects ~1s system stalls likely tied to kswapd reclaim behavior.
Optionally auto-captures a short trace-cmd recording for later analysis in KernelShark.

Requirements:
  - Python 3.7+
  - Optional: trace-cmd (for auto-capture)

Usage:
  sudo ./kswapd_stall_monitor.py
  # or tweak thresholds:
  sudo ./kswapd_stall_monitor.py --stall-ms 800 --kswapd-cpu 15 --scan-delta 5000

Output:
  - JSON log lines to stdout (easy to ship to journald/files)
  - If trace-cmd present: trace-YYYYmmdd-HHMMSS.dat in cwd
"""
import argparse
import json
import re
import shutil
import signal
import subprocess
import sys
import time
from pathlib import Path
from typing import Dict, List, Tuple, Union


def now_iso() -> str:
    # Take a single clock reading so the seconds and milliseconds parts agree.
    t = time.time()
    return time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(t)) + f".{int((t % 1) * 1000):03d}"


def read_proc(filepath: Union[str, Path]) -> str:
    with open(filepath, "r", encoding="utf-8", errors="ignore") as f:
        return f.read()


def parse_vmstat() -> Dict[str, int]:
    vm = {}
    for line in read_proc("/proc/vmstat").splitlines():
        parts = line.strip().split()
        if len(parts) == 2 and parts[1].isdigit():
            vm[parts[0]] = int(parts[1])
    return vm


def list_kswapd_pids() -> List[int]:
    pids = []
    for d in Path("/proc").iterdir():
        if not d.is_dir() or not d.name.isdigit():
            continue
        try:
            comm = read_proc(d / "comm").strip()
        except Exception:
            # The process may have exited between listing and reading.
            continue
        if re.match(r"^kswapd\d+$", comm):
            pids.append(int(d.name))
    return pids


def read_pid_jiffies(pid: int) -> int:
    # /proc/<pid>/stat: utime is field 14 and stime is field 15 (1-based).
    # Split after the closing ')' of the comm field so names with spaces cannot shift the columns.
    try:
        stat = read_proc(f"/proc/{pid}/stat")
        rest = stat[stat.rindex(")") + 2:].split()  # rest[0] is field 3 (state)
        utime = int(rest[11])
        stime = int(rest[12])
        return utime + stime
    except Exception:
        return 0


def read_total_jiffies() -> int:
    # From the aggregate 'cpu ' line of /proc/stat: sum of all columns, across all CPUs.
    for line in read_proc("/proc/stat").splitlines():
        if line.startswith("cpu "):
            parts = line.split()[1:]
            return sum(int(x) for x in parts)
    return 0


def read_zoneinfo() -> str:
    return read_proc("/proc/zoneinfo")


def write_snapshot(prefix: str, content: str) -> str:
    ts = time.strftime("%Y%m%d-%H%M%S", time.localtime())
    fn = f"{prefix}-{ts}.txt"
    Path(fn).write_text(content, encoding="utf-8")
    return fn


def run_trace_cmd(duration_s: int = 3) -> Tuple[int, str]:
    """
    Records a short trace with events useful for kswapd stalls.
    Returns (rc, filename) on success or (rc, error text) on failure.
    """
    if not shutil.which("trace-cmd"):
        return (127, "trace-cmd not found")
    ts = time.strftime("%Y%m%d-%H%M%S", time.localtime())
    out = f"trace-{ts}.dat"
    events = [
        "sched:sched_switch",
        "sched:sched_wakeup",
        "vmscan:*",  # reclaim activity; the mm_vmscan_* events live in the 'vmscan' system
        "irq:*",
        "workqueue:*",
        "timer:*",
    ]
    cmd = ["trace-cmd", "record", "-o", out]
    for e in events:
        cmd += ["-e", e]
    # trace-cmd records for as long as the given command runs, so `sleep` bounds
    # the capture and lets trace-cmd exit cleanly with status 0.
    cmd += ["sleep", str(duration_s)]
    try:
        proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                              text=True, timeout=duration_s + 60)
    except Exception as e:
        return (1, str(e))
    if proc.returncode == 0 and Path(out).exists():
        return (0, out)
    return (proc.returncode or 1, proc.stderr.strip() or "trace-cmd failed")


def log_event(kind: str, data: Dict) -> None:
    payload = {"ts": now_iso(), "event": kind, **data}
    print(json.dumps(payload, sort_keys=False), flush=True)


def main() -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("--interval-ms", type=int, default=100, help="Sampling interval in ms")
    ap.add_argument("--stall-ms", type=int, default=800, help="Loop delay that counts as a stall")
    ap.add_argument("--kswapd-cpu", type=float, default=10.0, help="Sum CPU%% across kswapd* to consider 'high'")
    ap.add_argument("--scan-delta", type=int, default=1000, help="Min delta of pgscan_kswapd over window")
    ap.add_argument("--trace-seconds", type=int, default=3, help="If trace-cmd exists, record this many seconds")
    ap.add_argument("--no-trace", action="store_true", help="Disable trace capture even if trace-cmd exists")
    args = ap.parse_args()
    interval = args.interval_ms / 1000.0

    prev_vm = parse_vmstat()
    prev_total = read_total_jiffies()
    prev_jiffies = {pid: read_pid_jiffies(pid) for pid in list_kswapd_pids()}
    last_tick = time.perf_counter()

    # graceful shutdown
    stop = False

    def _sig(_s, _f):
        nonlocal stop
        stop = True

    for s in (signal.SIGINT, signal.SIGTERM):
        signal.signal(s, _sig)

    log_event("monitor_started", {
        "interval_ms": args.interval_ms,
        "stall_ms": args.stall_ms,
        "kswapd_cpu_threshold": args.kswapd_cpu,
        "scan_delta_threshold": args.scan_delta
    })

    while not stop:
        time.sleep(interval)
        now = time.perf_counter()
        loop_delay_ms = (now - last_tick) * 1000.0 - args.interval_ms
        last_tick = now

        # detect stall (event-loop delay beyond threshold)
        stall = loop_delay_ms >= args.stall_ms

        # vmstat deltas
        cur_vm = parse_vmstat()
        vm_delta = {}
        for k in ("pgscan_kswapd", "pgsteal_kswapd", "pgscan_direct", "pgsteal_direct"):
            if k in cur_vm and k in prev_vm:
                vm_delta[k] = cur_vm[k] - prev_vm[k]
        prev_vm = cur_vm

        # kswapd CPU% (share of total CPU time across all cores)
        cur_total = read_total_jiffies()
        total_diff = max(1, cur_total - prev_total)
        prev_total = cur_total

        # refresh kswapd pids and make sure each has a baseline
        pids = list_kswapd_pids()
        for pid in pids:
            if pid not in prev_jiffies:
                prev_jiffies[pid] = read_pid_jiffies(pid)

        kswapd_cpu_pct = 0.0
        per_pid = {}
        for pid in list(prev_jiffies.keys()):
            if pid not in pids:
                prev_jiffies.pop(pid, None)
                continue
            curp = read_pid_jiffies(pid)
            diff = max(0, curp - prev_jiffies[pid])
            prev_jiffies[pid] = curp
            # process CPU% over interval: (diff / total_diff) * 100
            pct = (diff / total_diff) * 100.0
            per_pid[str(pid)] = round(pct, 2)
            kswapd_cpu_pct += pct
        kswapd_cpu_pct = round(kswapd_cpu_pct, 2)

        # emit periodic heartbeat
        log_event("tick", {
            "loop_delay_ms": round(loop_delay_ms, 1),
            "kswapd_cpu_pct_sum": kswapd_cpu_pct,
            "kswapd_cpu_pct_per_pid": per_pid,
            "vm_delta": vm_delta
        })

        # trigger condition: stall + kswapd CPU spike OR big kswapd scans
        likely_kswapd_issue = (
            stall and (kswapd_cpu_pct >= args.kswapd_cpu or vm_delta.get("pgscan_kswapd", 0) >= args.scan_delta)
        ) or (kswapd_cpu_pct >= args.kswapd_cpu and vm_delta.get("pgscan_kswapd", 0) >= args.scan_delta)

        if likely_kswapd_issue:
            zpath = write_snapshot("zoneinfo", read_zoneinfo())
            evt = {
                "loop_delay_ms": round(loop_delay_ms, 1),
                "kswapd_cpu_pct_sum": kswapd_cpu_pct,
                "vm_delta": vm_delta,
                "zoneinfo_snapshot": zpath
            }
            if not args.no_trace:
                rc, out = run_trace_cmd(args.trace_seconds)
                if rc == 0 and out.endswith(".dat"):
                    evt["trace_file"] = out
                else:
                    evt["trace_error"] = out
            log_event("kswapd_stall_alert", evt)
            # The capture blocks this loop for several seconds. Re-baseline so that
            # time is not mistaken for a stall or a scan burst on the next tick.
            prev_vm = parse_vmstat()
            prev_total = read_total_jiffies()
            prev_jiffies = {pid: read_pid_jiffies(pid) for pid in list_kswapd_pids()}
            last_tick = time.perf_counter()

    log_event("monitor_stopped", {})
    return 0


if __name__ == "__main__":
    try:
        sys.exit(main())
    except KeyboardInterrupt:
        pass
Recommended thresholds
- --stall-ms 800 (flags loop gaps of roughly 0.8 s or more)
- --kswapd-cpu 10 (sum across kswapd* threads, measured against total CPU time on all cores)
- --scan-delta 1000 (per 100 ms tick, tune to workload; bigger boxes may need larger)
Tip: start conservative, then tighten after you’ve seen a few alerts.
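When picking values, it helps to translate the thresholds into more familiar units. The script divides kswapd's jiffies by the jiffies of all CPUs combined, so a single fully busy kswapd thread can never show more than 100 / N percent on an N-core machine. The sketch below illustrates that ceiling and converts a per-tick scan delta into a scan rate. The helper names are hypothetical and the 4 KiB page size is an assumption.

```python
import os

def max_single_thread_pct(n_cpus: int) -> float:
    """Highest CPU% one fully busy thread can reach when usage is measured
    against the combined jiffies of all CPUs, as the monitor script does."""
    return 100.0 / n_cpus

def scan_rate_mib_per_s(scan_delta_pages: int, interval_ms: int, page_size: int = 4096) -> float:
    """Convert a per-tick pgscan_kswapd delta into MiB scanned per second."""
    ticks_per_second = 1000 / interval_ms
    return scan_delta_pages * ticks_per_second * page_size / 2**20

if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    print(f"{cpus} CPUs: one busy kswapd thread shows as at most "
          f"{max_single_thread_pct(cpus):.2f}%")
    print(f"--scan-delta 1000 at 100 ms is about "
          f"{scan_rate_mib_per_s(1000, 100):.1f} MiB/s scanned")
```

On a machine with more than ten cores, for example, one saturated kswapd thread stays below the default --kswapd-cpu 10, so a lower threshold may be needed there.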
Quick start
sudo apt-get install -y trace-cmd # optional but useful
sudo python3 kswapd_stall_monitor.py
You’ll see JSON lines. On alert, the script writes:
- zoneinfo-YYYYmmdd-HHMMSS.txt
- trace-YYYYmmdd-HHMMSS.dat (if trace-cmd exists)
Open the trace in KernelShark:
kernelshark trace-YYYYmmdd-HHMMSS.dat
Systemd service (optional)
/etc/systemd/system/kswapd-stall-monitor.service
[Unit]
Description=kswapd stall monitor and trace capture
After=network.target
[Service]
Type=simple
ExecStart=/usr/bin/python3 /opt/kswapd_stall_monitor.py --stall-ms 800 --kswapd-cpu 10 --scan-delta 1000 --trace-seconds 3
WorkingDirectory=/var/log/kswapd-monitor
Restart=always
RestartSec=2
StandardOutput=append:/var/log/kswapd-monitor/monitor.jsonl
StandardError=append:/var/log/kswapd-monitor/monitor.err
User=root
[Install]
WantedBy=multi-user.target
sudo mkdir -p /var/log/kswapd-monitor
sudo install -m 0755 kswapd_stall_monitor.py /opt/kswapd_stall_monitor.py
sudo systemctl daemon-reload
sudo systemctl enable --now kswapd-stall-monitor
How to confirm it’s the same issue
When an alert fires:
- Check the JSON line:
  - Large loop_delay_ms (≈800–1200 ms).
  - kswapd_cpu_pct_sum elevated.
  - vm_delta.pgscan_kswapd spiking.
- Open the trace (trace-*.dat) in KernelShark:
  - Look for kswapd* running immediately prior to the gap.
  - On the timeline, you’ll often see a long period with few/no runnable user tasks; sched_switch shows kswapd holding CPU.
  - Expand mm_vmscan events to see reclaim bursts.
- Inspect zoneinfo-*.txt:
  - Check DMA zone free/reclaim pressure relative to Normal.
  - If DMA has low free pages while Normal is huge, you’re likely in the same pathological path.
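Reading a zoneinfo snapshot by eye is tedious, so the zone comparison from the checklist above can be sketched in a few lines. This assumes the usual /proc/zoneinfo layout, with a "Node N, zone NAME" header followed by "pages free", "min", "low", "high" and "managed" lines. The function names and the sample numbers are illustrative, and the layout can differ slightly between kernel versions.

```python
import re
from typing import Dict, List

ZONE_RE = re.compile(r"^Node\s+(\d+),\s+zone\s+(\S+)")
FIELD_RE = re.compile(r"^\s*(?:pages\s+)?(free|min|low|high|managed)\s+(\d+)\s*$")

def parse_zoneinfo(text: str) -> Dict[str, Dict[str, int]]:
    """Map 'nodeN/ZONE' to its free, min, low, high and managed page counts."""
    zones: Dict[str, Dict[str, int]] = {}
    current = None
    for line in text.splitlines():
        z = ZONE_RE.match(line)
        if z:
            current = zones.setdefault(f"node{z.group(1)}/{z.group(2)}", {})
            continue
        f = FIELD_RE.match(line)
        # Keep only the first occurrence of each field within a zone.
        if f and current is not None and f.group(1) not in current:
            current[f.group(1)] = int(f.group(2))
    return zones

def zones_at_or_below_low(zones: Dict[str, Dict[str, int]]) -> List[str]:
    """Zones whose free pages have dropped to the 'low' watermark or under it."""
    return [name for name, z in zones.items()
            if "free" in z and "low" in z and z["free"] <= z["low"]]

# Illustrative snapshot, not taken from a real machine.
SAMPLE = """\
Node 0, zone      DMA
  pages free     2100
        min      1900
        low      2375
        high     2850
        managed  524288
        protection: (0, 61440)
Node 0, zone   Normal
  pages free     410000
        min      60000
        low      75000
        high     90000
        managed  15728640
        protection: (0, 0)
"""

if __name__ == "__main__":
    zones = parse_zoneinfo(SAMPLE)
    for name, z in zones.items():
        print(f"{name}: free={z['free']} low={z['low']} managed={z['managed']}")
    print("at or below low watermark:", zones_at_or_below_low(zones))
```

Pass the contents of a zoneinfo-*.txt snapshot to parse_zoneinfo in place of SAMPLE to run the same check on a captured alert.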
If confirmed—apply the mitigation
Make Zone DMA off-limits to general allocations (tune to your env):
echo "30 30 30 0" | sudo tee /proc/sys/vm/lowmem_reserve_ratio
# persist:
printf "vm.lowmem_reserve_ratio = 30 30 30 0\n" | sudo tee /etc/sysctl.d/99-lowmem-reserve.conf
sudo sysctl --system
Re-test under real workload. If devices truly require DMA, test carefully.
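After a reboot it is worth confirming that the persisted value is actually in effect. A small sketch, assuming the four values used above; the helper names are hypothetical, and the comparison only looks at the leading entries because the number of entries the kernel exposes can differ between systems.

```python
from pathlib import Path
from typing import List

RATIO_PATH = Path("/proc/sys/vm/lowmem_reserve_ratio")
WANTED = [30, 30, 30, 0]  # the values from the mitigation above

def parse_ratio(text: str) -> List[int]:
    """Turn the whitespace-separated sysctl value into a list of ints."""
    return [int(tok) for tok in text.split()]

def ratio_applied(current: List[int], wanted: List[int]) -> bool:
    """True if the leading entries of the live setting match the wanted ones."""
    return current[:len(wanted)] == wanted

if __name__ == "__main__":
    # Fall back to an illustrative value when the file is unavailable (e.g. not on Linux).
    text = RATIO_PATH.read_text() if RATIO_PATH.exists() else "256\t256\t32\t0\n"
    current = parse_ratio(text)
    state = "applied" if ratio_applied(current, WANTED) else "NOT applied"
    print(f"lowmem_reserve_ratio = {current} -> mitigation {state}")
```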
Nice extras (optional)
- Prometheus: wrap the JSON in a sidecar that emits gauges for kswapd_cpu_pct_sum, pgscan_kswapd, loop_delay_ms.
- Alerting: grep for kswapd_stall_alert and forward to Slack/Email.
- Trace size: adjust --trace-seconds if you want a longer capture.
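As a starting point for the alerting idea, the JSON lines can be filtered in Python rather than with grep, which makes it easier to build a readable message. The sketch below relies only on the field names the monitor script emits. The function names and the sample log lines are illustrative, and the forwarding step to Slack or email is left out.

```python
import json
from typing import Dict, Iterable, Iterator

def iter_alerts(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield parsed kswapd_stall_alert events, skipping ticks and malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            evt = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(evt, dict) and evt.get("event") == "kswapd_stall_alert":
            yield evt

def summarize(evt: Dict) -> str:
    """Build a one-line, human-readable summary of an alert event."""
    scans = evt.get("vm_delta", {}).get("pgscan_kswapd", 0)
    return (f"{evt.get('ts', '?')} stall={evt.get('loop_delay_ms', 0)} ms "
            f"kswapd_cpu={evt.get('kswapd_cpu_pct_sum', 0)}% pgscan_kswapd+={scans}")

# Illustrative log lines in the shape the monitor prints.
SAMPLE_LOG = [
    '{"ts": "2025-01-01T10:00:00.100", "event": "tick", "loop_delay_ms": 0.3, '
    '"kswapd_cpu_pct_sum": 0.0, "kswapd_cpu_pct_per_pid": {}, "vm_delta": {}}',
    '{"ts": "2025-01-01T10:00:01.300", "event": "kswapd_stall_alert", "loop_delay_ms": 1012.4, '
    '"kswapd_cpu_pct_sum": 12.5, "vm_delta": {"pgscan_kswapd": 48000}, '
    '"zoneinfo_snapshot": "zoneinfo-20250101-100001.txt"}',
    'not json at all',
]

if __name__ == "__main__":
    for alert in iter_alerts(SAMPLE_LOG):
        print(summarize(alert))
```

To follow a live log, pass sys.stdin to iter_alerts and pipe the monitor's output (or a tail of monitor.jsonl) into the script.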