Harperbot/metal-guard

MetalGuard

English | 繁體中文 | 日本語

Stop MLX kernel panics from rebooting your Mac.

MetalGuard is a GPU safety layer for MLX on Apple Silicon. Running MLX models can trip a bug in Apple's IOGPUFamily GPU driver that kernel-panics and reboots your entire Mac instead of just failing the process. MetalGuard catches the conditions that trigger that bug — before they reach the kernel.

pip install metal-guard · zero dependencies · macOS / Apple Silicon · MIT

Current version: v1.1.0 — see CHANGELOG.md for release history.


Your Mac just kernel-panicked running MLX?

You ran an MLX model and your Mac suddenly restarted. That is not a hardware fault and not your mistake — it is a known bug in Apple's GPU driver. Here is the fix, start to finish:

1. Open Terminal. Press ⌘ + Space, type Terminal, press Enter.

2. Install metal-guard — copy this line, paste it into Terminal, press Enter:

pip install metal-guard

It has zero dependencies, so it installs in seconds and cannot fail with a missing-package error.

3. Run it — type this, press Enter:

metal-guard

metal-guard reads the panic report your Mac just wrote, explains in plain language what happened, and offers to install a one-line protection so it does not happen again. Answer y when it asks.

That's it. The next time an MLX model would have panicked your Mac, metal-guard pauses it with an explanation instead.

No pip? pip comes with Python. If pip install metal-guard says command not found, install Python from python.org first, then try again. You can also use pipx: pipx install metal-guard.


What it does

  • Diagnoses the panic. Reads the macOS panic report, identifies which Apple driver bug it was, and explains it in plain words — no kernel-log decoding required.
  • Prevents the next one. A reversible shell guard routes risky MLX runs through a cooldown check; models known to panic are flagged before they load.
  • Contains the damage. Runs MLX in an isolated subprocess, narrows the race windows that trigger the bug, and refuses to restart straight into a panic loop after a reboot.
  • Stays out of the way. Zero dependencies, advisory by default, and every gate has an off switch.

MetalGuard is a workaround, not a cure — the root bug is inside Apple's driver and only Apple can fix it. What MetalGuard does is take your Mac from "reboots without warning" to "pauses with an explanation."


The problem

Apple's Metal GPU driver on Apple Silicon has a bug: when GPU memory management fails, the kernel panics the entire machine instead of gracefully killing the process.

panic(cpu 4 caller 0xfffffe0032a550f8):
  "completeMemory() prepare count underflow" @IOGPUMemory.cpp:492

Any workflow that loads and unloads MLX models in sequence can trip it — the driver's internal reference count underflows and the machine reboots. This is not your code's fault. It is a driver-level bug with no fix timeline. See ml-explore/mlx-lm#883.

| Workload | Risk | Why |
| --- | --- | --- |
| Single-model server (LM Studio) | Low | One model, no switching |
| Multi-model pipeline | High | Every load/unload transition can panic |
| Long-running server (mlx_lm.server) | High | KV cache grows unbounded, Metal buffers accumulate |
| Agent framework + tool calling | High | 50–100 short generate() calls per conversation |
| 24/7 daemon | Critical | Memory drift over days, no natural cleanup point |

Searched for one of these error strings? You're in the right place.

If your Mac is panicking / rebooting while running MLX and you searched for any of these, MetalGuard is built for you:

IOGPUMemory.cpp:492 completeMemory() prepare count underflow · IOGPUMemory.cpp:550 kernel panic · kIOGPUCommandBufferCallbackErrorOutOfMemory · mlx::core::gpu::check_error → std::terminate → abort (SIGABRT) · mlx::core::metal::GPUMemoryAllocator / fPendingMemorySet · IOGPUGroupMemory.cpp:219 pending memory set panic · IOGPUGroupMemory::remove_memory_object memory object not found · mlx_lm.generate crashes mid-inference · mlx_lm.server OOM kernel panic / Mac reboot · com.apple.iokit.IOGPUFamily in a panic report · AGX_RELAX_CDM_CTXSTORE_TIMEOUT · GPU watchdog killing MLX on MacBook · M1 / M2 / M3 / M4 (Max / Ultra / Pro) kernel panic · long-context (≥ 65k) prefill triggers reboot · back-to-back MLX model loads cause IOGPU underflow panic.


Install

Just want the metal-guard command

pip install metal-guard

This gives you the metal-guard and mlx-safe-python command-line tools. To keep it isolated from your other Python packages, use pipx instead: pipx install metal-guard.

Use it as a library in your own code

pip install metal-guard also installs the metal_guard Python package:

import metal_guard as mg

verdict = mg.evaluate_panic_cooldown()
print(verdict.exit_code, verdict.reason)

Develop / run the tests

git clone https://github.com/Harperbot/metal-guard.git
cd metal-guard
pip install -e ".[test]"
pytest -q

Verify the install

$ metal-guard panic-gate
🟢 PROCEED  no recent IOGPU panics
  24h=0 72h=0

$ metal-guard status
metal-guard 1.1.0  🟢 OK
  mode    defensive — defensive mode (default)
  panics  0 in last 72h

If metal-guard is not found after install, your pip --user bin directory is probably not on PATH. As a fallback, python3 -m metal_guard_cli panic-gate works.


Using metal-guard

Command line

| Command | What it does |
| --- | --- |
| metal-guard | First-run wizard: scan for the recent panic, explain it, offer protection |
| metal-guard diagnose | Scan for recent kernel panics and explain them (no changes made) |
| metal-guard guard install | Install the reversible shell guard (see below) |
| metal-guard guard uninstall / status | Remove the shell guard / report its state |
| metal-guard panic-gate | Cooldown verdict, for use in launchd / CI scripts |
| metal-guard status | Full status snapshot |
| metal-guard postmortem <dir> | Collect a diagnostic bundle after a panic |
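A launcher script can branch on the gate's verdict before starting MLX work. The sketch below is a minimal illustration, not part of metal-guard: it assumes the panic-gate process exit status mirrors the documented verdict codes (0 = proceed, 2 = cooldown, ≥3 = gate broken), and the helper names interpret_gate_exit_code and run_panic_gate are hypothetical.

```python
import shutil
import subprocess
from typing import Sequence


def interpret_gate_exit_code(code: int) -> str:
    """Map a gate exit code to an action name."""
    if code == 0:
        return "proceed"
    if code == 2:
        return "cooldown"
    # The README documents >=3 as "gate broken". Treating every other
    # code (1, negative signal codes) the same way is this sketch's assumption.
    return "gate-broken"


def run_panic_gate(command: Sequence[str] = ("metal-guard", "panic-gate")) -> str:
    """Run the gate CLI and return the action name for its exit code."""
    if shutil.which(command[0]) is None:
        # CLI not on PATH: report the gate as broken rather than guessing.
        return "gate-broken"
    completed = subprocess.run(list(command), check=False)
    return interpret_gate_exit_code(completed.returncode)
```

Whether a broken gate should block the workload or let it through is a policy choice for the caller; the sketch only reports the state.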

The shell guard

metal-guard guard install adds a single delimited block to your shell rc (~/.zshrc or ~/.bashrc) that routes interactive-shell python / python3 through mlx-safe-python. While a panic cooldown is active, MLX runs are paused automatically; otherwise they pass straight through. It is fully reversible — metal-guard guard uninstall removes the block cleanly — and covers interactive terminals only (Terminal, iTerm, VS Code), never launchd jobs or scripts. Disable without uninstalling: export METALGUARD_SHELL_GUARD_DISABLED=1.

In Python

from metal_guard import metal_guard, require_cadence_clear, CircuitBreaker

# Refuse back-to-back loads, and refuse new workers after a panic cluster
require_cadence_clear("mlx-community/gemma-4-26b-a4b-it-4bit")
CircuitBreaker().check()

# Register GPU-bound threads so cleanup waits for them
metal_guard.register_thread(thread)
metal_guard.wait_for_threads()

# Safe unload, OOM-protected inference, pre-load headroom check
metal_guard.safe_cleanup()                                  # gc + flush GPU + cooldown
result = metal_guard.oom_protected(generate, model, tokenizer, prompt=p)
metal_guard.ensure_headroom(model_name="my-model-8bit")

Hardware-aware defaults in one line:

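# Assumes the MetalGuard class is exported at the package top level.
from metal_guard import MetalGuard, metal_guard

config = MetalGuard.recommended_config()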
metal_guard.start_watchdog(warn_pct=config["watchdog_warn_pct"],
                           critical_pct=config["watchdog_critical_pct"])

Every API is listed under Reference below.


Embedding metal-guard in your app

If you ship an MLX-based app, server, or backend, embedding metal-guard means your users are protected from kernel panics without installing or configuring anything themselves — the most reliable way to reach users who would never find a safety tool on their own.

1. Add it as a dependency. metal-guard has zero third-party runtime dependencies, so adding it cannot pull in a conflicting package or break your build:

# pyproject.toml
dependencies = ["metal-guard>=1.1,<2"]

2. Guard the panic-prone transitions. Wrap model load, unload, and back-to-back inference with the API above — at minimum require_cadence_clear() before a load and metal_guard.safe_cleanup() after an unload.

3. Fail safe, not loud. metal-guard's gates raise typed exceptions (e.g. SpawnRefused, MLXLockConflict) instead of letting a panic reboot the machine — catch them and degrade gracefully, such as falling back to an API model.

4. (Optional) Explain panics to your users. After a reboot, call metal_guard.parse_panic_reports() and show users the same plain-language explanation the CLI gives — turning a mysterious crash into a handled event.

metal-guard follows semantic versioning; pin to a compatible range.
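The "fail safe, not loud" step can be sketched as a small wrapper. This is an illustration under stated assumptions: LocalRunRefused is a hypothetical stand-in for metal-guard's typed exceptions (the import path of SpawnRefused and MLXLockConflict is not shown here), and local_generate / remote_generate are placeholder callables supplied by your app.

```python
from typing import Callable, Tuple, Type


class LocalRunRefused(RuntimeError):
    """Hypothetical stand-in for a typed gate exception such as SpawnRefused."""


def generate_with_fallback(
    prompt: str,
    local_generate: Callable[[str], str],
    remote_generate: Callable[[str], str],
    refusal_types: Tuple[Type[BaseException], ...] = (LocalRunRefused,),
) -> Tuple[str, str]:
    """Try the local MLX path; on a typed refusal, use the fallback path.

    Returns (text, source) where source is "local" or "fallback".
    Exceptions outside refusal_types propagate unchanged.
    """
    try:
        return local_generate(prompt), "local"
    except refusal_types:
        return remote_generate(prompt), "fallback"
```

In a real integration you would pass the actual exception classes as refusal_types, so that only gate refusals trigger the fallback and genuine bugs still surface.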


📋 Community Panic Registry — KNOWN_PANIC_MODELS

A community-curated list of MLX models that kernel-panic Apple Silicon Macs in production — with hardware contexts, root-cause hypotheses, and verified workarounds.

Apple's driver bug has no fix timeline. But which models trigger it under which workloads is community-knowable — it is just scattered across GitHub issues, LM Studio bug reports, Discord screenshots, and panic-full-*.panic files nobody publishes. MetalGuard gives that knowledge a structured home:

from metal_guard import check_known_panic_model, warn_if_known_panic_model

advisory = check_known_panic_model("mlx-community/gemma-4-31b-it-8bit")
if advisory is not None:
    print(advisory["recommendation"])
    # → "metal-guard narrows the race window but does NOT eliminate panic on
    #    this model. Switch backend (Ollama / llama.cpp) or pivot to an MoE variant."

warn_if_known_panic_model(model_id)   # fire-and-forget, per-process dedup

Each entry carries the panic_signature (the exact IOGPUMemory.cpp:NNN line to match), reproductions (hardware / RAM / time-to-panic / workload), community cross-references, an actionable recommendation, and upstream issue links.

Hit a panic on a specific model with metal-guard fully engaged? Your data point is valuable — open a Known Panic Model report. The registry is intentionally conservative: entries require a confirmed reproduction or a clear upstream issue, so working models are not falsely blacklisted.

A snapshot of a moving target

The registry records models known to panic — it cannot record models nobody has reported yet, and every entry reflects what was observed up to a point in time. A model's absence from the registry is not a safety certificate — it just means no one has reported it here. If you want to run a local model that isn't listed, test it yourself first on your own hardware and workload; if it panics, report it so the next person is warned.

The panic landscape also moves in the other direction. The root bug is upstream, and upstream is not standing still — recent MLX releases have already merged mitigations (e.g. mlx#3348, a thread-local CommandEncoder), and a future MLX or macOS release could narrow or close the bug entirely. When that happens, a registry entry's "switch backend" advice becomes unnecessary — and metal-guard's check_version_advisories() and observer mode (METALGUARD_MODE=observer, which relaxes the defensive layers once a fixed MLX runtime is installed) are how you track it. Treat the registry and these advisories as a point-in-time snapshot, not a permanent verdict — re-check against the MLX and macOS versions you actually run.


Reference

MetalGuard is organised as defence layers (L1–L13) — a defence-in-depth onion: L1–L8 narrow race windows during a run, L9 + L11 short-circuit just before a kernel-level abort, L10 + L12 handle recovery after a panic + reboot, and L13 surfaces it all as a JSON snapshot. See CHANGELOG.md for when each layer landed and the incident that motivated it.

L1 — Thread tracking

Register any thread that touches Metal so cleanup waits for GPU work to finish before mx.clear_cache().

| API | What it does |
| --- | --- |
| metal_guard.register_thread(thread) | Add a GPU-bound thread to the registry |
| metal_guard.wait_for_threads(timeout=None) -> int | Block until registered threads finish; returns count still alive |
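To make the contract concrete, here is a minimal stand-alone sketch of a thread registry with the same shape: register threads, then wait with an overall timeout and report how many are still alive. It is an illustration, not metal-guard's implementation; the class name ThreadRegistrySketch is hypothetical.

```python
import threading
import time
from typing import List, Optional


class ThreadRegistrySketch:
    """Track started threads and wait for them before freeing shared resources."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._threads: List[threading.Thread] = []

    def register_thread(self, thread: threading.Thread) -> None:
        # Threads must already be started; join() on an unstarted thread raises.
        with self._lock:
            self._threads.append(thread)

    def wait_for_threads(self, timeout: Optional[float] = None) -> int:
        """Join all registered threads; return how many are still alive."""
        deadline = None if timeout is None else time.monotonic() + timeout
        with self._lock:
            threads = list(self._threads)
        for thread in threads:
            # Share one overall deadline across all joins.
            remaining = None if deadline is None else max(0.0, deadline - time.monotonic())
            thread.join(remaining)
        with self._lock:
            # Forget finished threads so the registry does not grow forever.
            self._threads = [t for t in self._threads if t.is_alive()]
        return sum(1 for t in threads if t.is_alive())
```

A non-zero return value means cleanup is not yet safe: some worker may still be using the GPU.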

L2 — Safe cleanup

Ordered cleanup that avoids the "main thread freed while worker thread still generating" race — the original panic root cause.

| API | What it does |
| --- | --- |
| metal_guard.flush_gpu() | mx.eval(sync) + mx.clear_cache(); only safe after wait_for_threads() |
| metal_guard.safe_cleanup() | Full sequence: wait → gc.collect → flush → cooldown |
| metal_guard.guarded_cleanup() | Context manager that runs safe_cleanup() on exit |
| kv_cache_clear_on_pressure(available_gb, growth_rate_gb_per_min) | Ready-made on_pressure callback for the KV monitor |

L3 — OOM recovery

Turn the raw C++ Metal OOM into a catchable Python exception with automatic cleanup and optional retry.

| API | What it does |
| --- | --- |
| metal_guard.oom_protected(fn, *args, max_retries=1, **kwargs) | Run with OOM catch → cleanup → retry |
| metal_guard.oom_protected_context() | Context-manager variant |
| metal_guard.is_metal_oom(exc) -> bool | Classify an arbitrary exception |
| MetalOOMError | Catchable exception, carries MemoryStats |
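The catch → cleanup → retry pattern can be sketched generically. This is not metal-guard's code: SimulatedOOM and run_with_oom_retry are hypothetical names, and the caller supplies both the OOM classifier and the cleanup step.

```python
from typing import Any, Callable


class SimulatedOOM(RuntimeError):
    """Hypothetical stand-in for an out-of-memory failure."""


def run_with_oom_retry(
    fn: Callable[..., Any],
    *args: Any,
    cleanup: Callable[[], None],
    is_oom: Callable[[BaseException], bool],
    max_retries: int = 1,
    **kwargs: Any,
) -> Any:
    """Call fn; on an OOM-classified error, run cleanup and retry.

    Non-OOM errors, and OOM errors after max_retries retries, propagate.
    """
    attempt = 0
    while True:
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            if not is_oom(exc) or attempt >= max_retries:
                raise
            attempt += 1
            cleanup()  # free memory before the next attempt
```

Keeping max_retries small matters: if cleanup does not free enough memory, repeated retries only repeat the pressure that caused the failure.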

L4 — Pre-load memory check

Refuse loads that will not fit, with model-size estimation from the HF model ID.

| API | What it does |
| --- | --- |
| metal_guard.can_fit(model_size_gb, overhead_gb=2.0) -> bool | Non-raising check |
| metal_guard.require_fit(model_size_gb, model_name, overhead_gb=2.0) | Clean up, then raise MemoryError if it still won't fit |
| MetalGuard.estimate_model_size_from_name(name) (static) | Parse param count + quantisation → GB estimate |
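As a rough illustration of "param count + quantisation → GB", the sketch below parses a parameter count such as 31b and a bit width such as 8bit from a model name and multiplies them out. The regexes, the 16-bit default, and the helper names are assumptions of this sketch; metal-guard's own estimator may use different rules, and real weights carry extra overhead.

```python
import re
from typing import Optional

# "31b" or "7.5b" as a standalone token; skips "a4b" and the "4b" in "4bit".
_PARAMS_RE = re.compile(r"(?<![a-z0-9.])(\d+(?:\.\d+)?)b(?![a-z0-9])", re.IGNORECASE)
_BITS_RE = re.compile(r"(\d+)[-_]?bit", re.IGNORECASE)


def estimate_size_gb(model_name: str, default_bits: int = 16) -> Optional[float]:
    """Estimate weight size in GB as params (billions) * bits / 8."""
    params = _PARAMS_RE.search(model_name)
    if params is None:
        return None  # no parameter count in the name; refuse to guess
    bits_match = _BITS_RE.search(model_name)
    bits = int(bits_match.group(1)) if bits_match else default_bits
    return float(params.group(1)) * bits / 8.0


def fits_in_memory(model_size_gb: float, available_gb: float, overhead_gb: float = 2.0) -> bool:
    """Non-raising fit check: model plus fixed overhead must fit."""
    return model_size_gb + overhead_gb <= available_gb
```

For example, a 26b model at 4 bits comes out at 13 GB of weights, which fits a 16 GB budget with the 2 GB overhead but leaves little room for KV cache.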

L5 — Long-running process safety

For mlx_lm.server, agent frameworks, and 24/7 daemons.

| API | What it does |
| --- | --- |
| metal_guard.memory_stats() -> MemoryStats | Snapshot (active / peak / limit / available / pct) |
| metal_guard.is_pressure_high(threshold_pct=67.0) -> bool | Quick pressure check |
| metal_guard.ensure_headroom(model_name, threshold_pct=67.0) | Clean up if pressure high, no-op otherwise |
| metal_guard.start_watchdog(interval_secs, warn_pct, critical_pct, on_critical) | Drift watchdog with escalating response |
| metal_guard.start_kv_cache_monitor(interval_secs, headroom_gb, growth_rate_warn, on_pressure) | KV growth monitor, fires before OOM |
| bench_scoped_load(model_id, ...) | Context manager for sequential benchmark runs; guarantees unload before next load |
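An escalating watchdog boils down to classifying the current memory percentage against two thresholds. The sketch below shows only that classification step, under assumptions: pressure_level is a hypothetical helper, and the thresholds are passed in explicitly because the library's per-hardware defaults are not reproduced here.

```python
def pressure_level(active_gb: float, limit_gb: float,
                   warn_pct: float, critical_pct: float) -> str:
    """Classify memory use as "ok", "warn", or "critical"."""
    if limit_gb <= 0:
        raise ValueError("limit_gb must be positive")
    if not warn_pct < critical_pct:
        raise ValueError("warn_pct must be below critical_pct")
    pct = 100.0 * active_gb / limit_gb
    if pct >= critical_pct:
        return "critical"
    if pct >= warn_pct:
        return "warn"
    return "ok"
```

A polling loop would call this every interval, log on "warn", and invoke the on_critical callback on "critical".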

L6 — Dual-mode switcher

Runtime-selectable defensive vs observer posture, so you can A/B upstream mitigations without changing code.

| API | What it does |
| --- | --- |
| current_mode() -> str | "defensive" (default) or "observer" |
| is_defensive() / is_observer() -> bool | Convenience predicates |
| describe_mode() -> dict | Mode name, description, env var |

L7 — Subprocess isolation

Run MLX in a fresh multiprocessing child so a kernel-level abort cannot kill the parent.

| API | What it does |
| --- | --- |
| MLXSubprocessRunner(model_id, ...) | Persistent worker subprocess, respawns on crash |
| call_model_isolated(model_id, prompt, ...) | One-shot helper: spawn → generate → shut down |
| shutdown_all_workers() | Force-terminate any runners tracked at exit |
| SubprocessCrashError / SubprocessTimeoutError | Typed failures for callers |
| SpawnRefused | Raised at runner construction when the model's advisory tier is panic (override: METALGUARD_LOCAL_PANIC_MODEL_BLOCK_DISABLED=1) |

L8 — Cross-process mutual exclusion

File lock under MLX_LOCK_PATH so bench / server / pipeline never initialise Metal on the same box simultaneously.

| API | What it does |
| --- | --- |
| acquire_mlx_lock(label, force=False) | Raise MLXLockConflict if held; force=True SIGTERMs the holder with timeout + cooldown |
| release_mlx_lock() -> bool | Release if this process holds it |
| read_mlx_lock() -> dict \| None | Non-blocking inspect; self-heals stale + zombie holders |
| mlx_exclusive_lock(label) | Context manager: acquire on enter, release on exit |
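The general idea of a cross-process lock file can be sketched with an exclusive create. This is a simplified illustration, not metal-guard's implementation: it has no stale-holder healing or force mode, and LockConflictSketch, acquire_lock, read_lock, and release_lock are hypothetical names.

```python
import json
import os
from typing import Optional


class LockConflictSketch(RuntimeError):
    """Raised when another holder already owns the lock file."""


def read_lock(path: str) -> Optional[dict]:
    """Return the holder record, or None if the lock is absent or unreadable."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def acquire_lock(path: str, label: str) -> None:
    """Create the lock file exclusively; fail if it already exists."""
    try:
        # O_EXCL makes creation atomic: only one process can succeed.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        raise LockConflictSketch(f"lock held by {read_lock(path)}") from None
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"pid": os.getpid(), "label": label}, fh)


def release_lock(path: str) -> bool:
    """Remove the lock only if this process holds it."""
    holder = read_lock(path)
    if holder is None or holder.get("pid") != os.getpid():
        return False
    os.remove(path)
    return True
```

Recording the holder's PID is what makes stale-lock detection possible later: a reader can check whether that process still exists before deciding the lock is live.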

L9 — Cadence, panic ingest, and circuit breaker

The last line of defence after the first eight layers — written in response to a kernel panic that lived below the SIGABRT layer: by the time Python saw anything, the machine had already rebooted. The only fix is to avoid the trigger.

| API | What it does |
| --- | --- |
| CadenceGuard(path=None, *, min_interval_sec=180) | Persisted per-model load-timestamp store |
| require_cadence_clear(model_id, *, min_interval_sec=180) | Atomic check + mark; raises CadenceViolation if a load happened too recently |
| parse_panic_reports(directory=None, *, since_ts=None) | Scan macOS panic reports (/Library/Logs/DiagnosticReports, /var/db/PanicReporter, ~/Library/...; .panic + .ips) and classify |
| ingest_panics_jsonl(*, report_dir=None, jsonl_path=None) -> int | Dedupe-append to ~/.cache/metal-guard/panics.jsonl |
| CircuitBreaker(*, window_sec=3600, panic_threshold=2, cooldown_sec=3600) | Refuse new workers after a panic cluster |
| detect_panic_signature(text) -> (name, explanation) | Classify a panic log: prepare_count_underflow / pending_memory_set / remove_memory_object / ctxstore_timeout / metal_oom |
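The cadence check amounts to "look up the last load time for this model, refuse if it is too recent, otherwise record now". A minimal sketch, assuming a JSON file keyed by model ID; unlike the real check it is not atomic across processes, and CadenceViolationSketch and mark_load_or_raise are hypothetical names.

```python
import json
import time
from pathlib import Path
from typing import Dict, Optional


class CadenceViolationSketch(RuntimeError):
    """Raised when the same model was loaded too recently."""


def mark_load_or_raise(store_path: Path, model_id: str,
                       min_interval_sec: float = 180.0,
                       now: Optional[float] = None) -> None:
    """Check the last recorded load for model_id, then record this one."""
    now = time.time() if now is None else now
    try:
        stamps: Dict[str, float] = json.loads(store_path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        stamps = {}  # a missing or corrupt store counts as "no history"
    last = stamps.get(model_id)
    if last is not None and now - last < min_interval_sec:
        wait = min_interval_sec - (now - last)
        raise CadenceViolationSketch(f"{model_id}: wait {wait:.0f}s before reloading")
    stamps[model_id] = now
    # Write to a temp file and rename so readers never see a partial store.
    tmp = store_path.with_suffix(".tmp")
    tmp.write_text(json.dumps(stamps))
    tmp.replace(store_path)
```

Persisting the timestamps on disk is the point: the interval still holds after the process restarts, which is exactly when back-to-back loads tend to happen.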

L10 — Panic cooldown gate

After a kernel panic + reboot, launchd auto-respawns plists ~14 minutes later — and the next MLX workload can immediately re-trigger the bug. L10 reads the macOS panic reports and applies a staircase cooldown (1 panic → 2h; ≥2 in 24h or ≥3 in 72h → lockout requiring an explicit ack).

| API | What it does |
| --- | --- |
| evaluate_panic_cooldown() -> CooldownVerdict | Stdlib-only; verdict.exit_code ∈ {0=proceed, 2=cooldown, ≥3=gate broken} |
| scan_recent_panics(hours=72.0) -> list[PanicRecord] | AND-pattern IOGPU-panic scan |
| ack_panic_lockout() | Clear an active lockout |
| metal-guard panic-gate / metal-guard ack | CLI wrappers for launchd scripts |

Env: METALGUARD_PANIC_COOLDOWN_STAGE1_H / _LOCKOUT_24H_N / _LOCKOUT_72H_N / _LOCKOUT_MAX_H / _GATE_DISABLED=1.
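The staircase rule described above can be written as a small pure function. This sketch takes the thresholds from the prose (2 h after one panic; lockout at ≥2 panics in 24 h or ≥3 in 72 h) and ignores acknowledgement state and the maximum lockout length; VerdictSketch and staircase_verdict are hypothetical names, not the library's types.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class VerdictSketch:
    state: str   # "proceed", "cooldown", or "lockout"
    reason: str


def staircase_verdict(panic_ages_h: Sequence[float],
                      stage1_h: float = 2.0,
                      lockout_24h_n: int = 2,
                      lockout_72h_n: int = 3) -> VerdictSketch:
    """Decide from panic ages (hours ago) whether MLX work may start."""
    in_24h = sum(1 for age in panic_ages_h if 0 <= age <= 24)
    in_72h = sum(1 for age in panic_ages_h if 0 <= age <= 72)
    # Clusters escalate to a lockout that needs an explicit acknowledgement.
    if in_24h >= lockout_24h_n or in_72h >= lockout_72h_n:
        return VerdictSketch("lockout", f"24h={in_24h} 72h={in_72h}")
    # A single recent panic only imposes the short first-stage cooldown.
    if any(0 <= age < stage1_h for age in panic_ages_h):
        return VerdictSketch("cooldown", f"panic within last {stage1_h:g}h")
    return VerdictSketch("proceed", f"24h={in_24h} 72h={in_72h}")
```

Checking the lockout condition first matters: two panics an hour apart must produce a lockout, not merely the 2 h cooldown.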

L11 — Subprocess orphan monitor

Pre-panic signal: a SUBPROC_PRE breadcrumb without a matching SUBPROC_POST after 90 s strongly suggests Metal is stuck — kill the worker before the kernel does.

| API | What it does |
| --- | --- |
| scan_orphan_subproc_pre(threshold_sec=90.0) -> list[OrphanPre] | FIFO-paired PRE↔POST scan over the breadcrumb tail |
| metal-guard orphan-scan [--threshold-sec N] | CLI wrapper |
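FIFO pairing means each SUBPROC_POST cancels the oldest outstanding SUBPROC_PRE; whatever remains unpaired and older than the threshold is an orphan. The sketch below works on already-parsed (timestamp, kind) tuples, because the on-disk breadcrumb format is not specified here; find_orphan_pre is a hypothetical helper.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple


def find_orphan_pre(events: Iterable[Tuple[float, str]],
                    now: float,
                    threshold_sec: float = 90.0) -> List[float]:
    """Return timestamps of PRE events with no matching POST after threshold_sec.

    events must be in chronological order; kind is "SUBPROC_PRE" or
    "SUBPROC_POST". Other kinds are ignored.
    """
    pending: Deque[float] = deque()
    for timestamp, kind in events:
        if kind == "SUBPROC_PRE":
            pending.append(timestamp)
        elif kind == "SUBPROC_POST" and pending:
            pending.popleft()  # pair with the oldest outstanding PRE
    return [ts for ts in pending if now - ts >= threshold_sec]
```

A PRE that is still within the threshold is not reported, since a worker that started a few seconds ago is simply still running.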

L12 — Postmortem auto-collect

After a panic + reboot, collects the diagnostic bundle into one directory: panic files (capped), the breadcrumb-log tail, panics.jsonl history, mx.metal stats, and an index.md summary — and writes a sentinel cooldown so L10 defers further runs even if the panic reports rotate out.

| API | What it does |
| --- | --- |
| run_postmortem(output_dir) -> dict | Full orchestration; returns paths + panic count |
| metal-guard postmortem <output_dir> | CLI wrapper (kill-switch: METALGUARD_POSTMORTEM_DISABLED=1) |

L13 — Status snapshot

Versioned JSON snapshot for cross-process consumers (menu-bar apps, dashboards, ssh inspection) that should not import metal_guard directly.

| API | What it does |
| --- | --- |
| get_status_snapshot(*, include_panics=True, breadcrumb_lines=20) -> dict | Aggregate memory / KV monitor / panics / lock holder / mode / L10 verdict |
| write_status_snapshot(out_path=None) | Atomic write to ~/.cache/metal-guard/status.json |
| metal-guard status-write [--once \| --interval 30] | CLI / daemon wrapper |
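An atomic snapshot write is what lets other processes read status.json without ever seeing a half-written file. A common way to get that, shown here as a sketch rather than metal-guard's actual code, is to write a temp file in the same directory and rename it into place; write_json_atomic and the payload keys are hypothetical.

```python
import json
import os
import tempfile
from typing import Any, Dict


def write_json_atomic(payload: Dict[str, Any], out_path: str) -> None:
    """Write payload as JSON so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(out_path))
    os.makedirs(directory, exist_ok=True)
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push data to disk before publishing it
        os.replace(tmp_path, out_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Consumers such as a menu-bar app can then poll the file with a plain json.load and need no locking of their own.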

Hardware awareness, advisories, audits

| API | What it does |
| --- | --- |
| MetalGuard.detect_hardware() -> dict (static) | Chip, GPU memory, recommended working set, tier, IOGPUFamily kext version |
| MetalGuard.recommended_config() -> dict (classmethod) | Safe defaults for every layer on the detected hardware |
| check_version_advisories(packages=None) -> list[dict] | Warn if installed (mlx, mlx-lm, mlx-vlm, transformers) versions trip a known advisory |
| install_upstream_defensive_patches(force=False) -> dict[str, bool] | Idempotent, version-gated monkey-patches for known upstream regressions |
| audit_wired_limit() -> dict | Flag dangerous iogpu.wired_limit_mb overrides (mlx-lm#1047) |
| read_gpu_driver_version() -> str \| None | IOGPUFamily kext version (mlx#3186) |

R-series preventive helpers & forensics

| API | What it does |
| --- | --- |
| lookup_dims(model_id) / estimate_prefill_peak_alloc_gb(...) / require_prefill_fit(...) | GQA-aware prefill ceiling: refuse a prefill before a 30 GB single-alloc panic |
| recommend_chunk_size(...) / describe_prefill_plan(...) | Advisory prefill chunking |
| KVGrowthTracker(...) | Per-request cumulative KV guard; catches a runaway request the global monitor misses |
| detect_process_mode() -> ProcessMode | "server" / "embedded" / "notebook" / "cli" / "subprocess_worker" |
| format_panic_for_apple_feedback(forensics, ...) | Ready-to-paste Apple Feedback Assistant report |
| metal_guard.breadcrumb(msg) | Write an fsync'd line to the breadcrumb log |

Path defaults

The L9 artifacts and the L13 status snapshot live under ~/.cache/metal-guard/: cadence.json (CadenceGuard), panics.jsonl (panic archive), breaker.json (CircuitBreaker), status.json (L13 snapshot). The breadcrumb log defaults to logs/metal_breadcrumb.log; override via MetalGuard(breadcrumb_path=...).

Architecture

┌─────────────────────────────────────────────────┐
│            Your Application Code                │
│  Agent loop / Server / Pipeline / Daemon        │
└──────────────────┬──────────────────────────────┘
┌──────────────────▼──────────────────────────────┐
│              MetalGuard                         │
│  L9  Cadence + CircuitBreaker  refuse bad loads │
│  L8  Process lock              cross-process    │
│  L7  Subprocess isolation      panic-isolated   │
│  L5  Watchdogs                 drift alerts     │
│  L3  OOM recovery              catch + retry    │
│  L2  Safe cleanup              gc + flush       │
│  L1  Thread registry           wait before free │
│  L10–L13  cooldown / postmortem / status        │
└──────────────────┬──────────────────────────────┘
┌──────────────────▼──────────────────────────────┐
│           MLX + Metal Driver                    │
│  ⚠️  Driver bug: panics instead of OOM          │
└─────────────────────────────────────────────────┘

When MetalGuard is not enough

If you engage every defence and still see repeat panics on the same model, the race window is wider than a userspace layer can narrow. Two escape hatches, by ROI:

  1. Switch backend. Ollama and llama.cpp use Metal under the hood but run a persistent-worker architecture that sidesteps the subprocess teardown race entirely. You lose some raw throughput; you gain "doesn't panic the machine."
  2. Pivot to an MoE model. Mixture-of-Experts variants (e.g. mlx-community/gemma-4-26b-a4b-it-4bit) have a smaller active-parameter footprint per forward pass and a narrower KV trajectory. Community reports converge on MoE as the most reliable same-ecosystem workaround.

MetalGuard is complementary to both — CadenceGuard still helps whenever you hot-swap models.

One hard-learned SOP note. Anything that imports torch, mlx, mlx_lm, mlx_vlm, sentence_transformers, transformers, diffusers, or accelerate initialises the Metal backend and can hit the same kernel bug — even an interactive version-check command. During an active cooldown, use pip show <pkg> or python -c "import importlib.metadata as m; print(m.version('<pkg>'))"; never python -c "import <ml-package>; print(<ml-package>.__version__)".
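The import-free version check from the note above can be wrapped in a helper that also handles packages that are not installed. The function name installed_version is hypothetical; importlib.metadata is part of the standard library and reads package metadata without importing the package itself.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution without importing it."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None  # not installed in this environment


if __name__ == "__main__":
    for name in ("mlx", "mlx-lm", "mlx-vlm", "transformers"):
        print(name, installed_version(name))
```

Because nothing here imports mlx or torch, the Metal backend is never initialised, which keeps the check safe to run during a cooldown.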


Limitations — this is a workaround, not a fix

The root bug lives inside Apple's IOGPUFamily kext (mlx#3186) and cannot be patched from Python. MetalGuard lowers the trigger rate (avoids the known trigger paths), contains the blast radius (subprocess isolation), and prevents post-reboot cascades (CircuitBreaker). It does not eliminate panics — especially the uncatchable completion-handler abort (mlx#3390) that fires before any Python signal handler. One production box went from ~1.4 panics/day to zero over a 24 h window after L9 landed — but that is risk-reduction, not elimination. Until Apple ships a fixed kext, this is the upper bound of what a Python-side layer can do.

Related upstream issues

| Issue | Problem | Layer |
| --- | --- | --- |
| mlx#3186 | IOGPUFamily kernel panic (canonical) | L1/L2/L8/L9 |
| mlx#3346 | fPendingMemorySet second signature | detect_panic_signature + L9 |
| mlx#3348 | CommandEncoder thread-local (merged) | Advisory-gated observer mode |
| mlx#3390 | Uncatchable completion-handler abort | L7 subprocess isolation |
| mlx-lm#883 | Kernel panic from KV cache growth | L1 thread + L2 safe cleanup |
| mlx-lm#854 | Server OOM crash | L3 oom_protected + L5 |
| mlx-lm#1047 | wired_limit correlation with panics | audit_wired_limit |

License

MIT
