GitHub - santhsecurity/keyhog: Open-source secret scanner in Rust. Service-specific detectors, SIMD on the CPU and an optional GPU path, live verification of which leaked keys are still active, and SARIF output.

Part of Santh · blog · @SanthProject

keyhog scans source trees, git history, Docker images, GitHub/GitLab/Bitbucket repository collections, S3/GCS/Azure Blob buckets, and running systems for leaked credentials. 902 service-specific detectors, decode-through (base64/hex/url/protobuf), confidence scoring, SARIF output, zero runtime configuration. Default keyhog scan . works out of the box.

Add it to your CI (one workflow file)

# .github/workflows/keyhog.yml
name: keyhog
on: [push, pull_request]
permissions: { contents: read, security-events: write }
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: santhsecurity/keyhog/.github/actions/keyhog@v0.5.40
        with: { path: ., severity: high, format: sarif }

Cost to your CI: ~20 MB binary download (cacheable), ~400 ms cold-start on hosted runners (GPU auto-disabled, SIMD path), ~10 s wall-clock for a 5,000-file repo. Single libhyperscan5 apt package, no Python, no JVM, no Docker daemon. Findings auto-upload to GitHub code-scanning as SARIF; adopt without breaking an existing tree by committing a baseline (keyhog scan --create-baseline .keyhog-baseline.json) so the action fails only on NEW secrets.

For ultra-lean CI installs there's now cargo install keyhog --no-default-features --features ci: 13 MB binary (vs 22 MB full), ~140 ms cold-start, no Hyperscan dependency, no wgpu/Vulkan probe, no libstdc++ link. Same 902 detectors, same ML/entropy/decode/multiline data paths. Use this profile in self-built CI images where binary size or container cold-start matters; the prebuilt installer above stays the default for a turnkey single-binary download.

GitLab CI, CircleCI, Drone, BuildKite, Jenkins, Bazel, pre-commit, Husky, lefthook recipes: docs/DROP_IN_USAGE.md.

How it works

keyhog compiles its 902 detectors into a shared trigger/extraction plan, uses Hyperscan when that feature is present, decodes nested encodings before matching, and can apply explicit per-detector Bayesian Beta(α,β) confidence calibration. Hardware acceleration is an explicit backend selection layer; every selected backend must preserve the same detector ids and findings contract:

Layer / Backend	When	How
`simdsieve` prefilter	AVX-512 / AVX2 / NEON	Layer 1: skims every file for the 8 highest-value secret prefixes (AWS `AKIA`/`ASIA`, GitHub `ghp_`, OpenAI `sk-proj-`, Slack `xoxb-`/`xoxp-`, SendGrid `SG.`, Square `sq0csp-`) in a single SIMD pass, before the regex backend runs
`gpu-region-presence`	discrete GPU + persisted calibration proof	vyre literal-set region-presence pass on GPU via WGPU (cross-platform) or optional CUDA backend, followed by the shared CPU validation tail
`simd-regex`	AVX-512 / AVX2 / NEON; Hyperscan when compiled	parallel trigger scan plus full-regex extraction; portable builds keep the same backend label without linking Hyperscan
`cpu-fallback`	no SIMD, no GPU	Aho-Corasick prefix + Rust `regex` extraction

Autoroute Contract

The goal of autoroute is simple and strict: for every scan, on every supported OS, architecture, CPU, GPU, driver stack, detector set, config, and workload shape, keyhog must pick the fastest backend that returns the same findings.

That means autoroute is not a fixed threshold table, not a hardware-name heuristic, and not a fallback hierarchy. There is no "GPU primary with CPU fallback", no "CPU safe default", and no preferred backend that runs when the decision table is missing. GPU, Hyperscan/SIMD, scalar CPU, and any new engine are peer candidates. A backend is eligible only after calibration proves two things for the current binary, detector digest, host profile, and workload class:

Correctness parity: the candidate backend returns the same detector ids, locations, hashes, and finding counts as the reference scanner path for the sampled workload.
Measured speed: the candidate is faster than the alternatives on this host and workload class, including batching, detector digest, file-size distribution, accelerator state, and platform overhead. Calibration records store repeated parity-checked trials, not a single lucky timing sample.
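The two eligibility conditions above can be sketched as a filter followed by a minimum. This is a minimal illustration, not keyhog's calibration code: the `Trial` struct, its field names, and `pick_backend` are hypothetical, and a single `u64` digest stands in for the detector ids, locations, hashes, and counts that parity actually compares.

```rust
/// One parity-checked calibration result for a candidate backend.
/// (Hypothetical shape, for illustration only.)
#[derive(Debug, Clone, PartialEq)]
pub struct Trial {
    pub backend: &'static str,
    /// Stand-in digest over detector ids, locations, hashes, and counts.
    pub findings_digest: u64,
    /// Median wall-clock over repeated runs, in microseconds.
    pub median_micros: u64,
}

/// Return the fastest backend whose findings match the reference digest.
/// `None` means no candidate is eligible, so calibration is still required.
pub fn pick_backend(reference_digest: u64, trials: &[Trial]) -> Option<&'static str> {
    trials
        .iter()
        .filter(|t| t.findings_digest == reference_digest) // correctness parity
        .min_by_key(|t| t.median_micros) // measured speed
        .map(|t| t.backend)
}
```

Note that a faster backend with a different digest is never chosen, and an empty result is surfaced rather than replaced by a default.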

The selected decision must be explainable and reproducible. Any cached routing decision is keyed by binary version, OS/arch, CPU features, GPU identity, detector digest, resolved scan-config digest including batch-pipeline route, explicit calibration controls, calibration schema, and workload-shape buckets; changing any of those invalidates the decision and requires a fresh calibration probe during install or explicit recalibration. Invalid existing cache records are rejected instead of being silently trusted.

The installer runs a visible autoroute calibration phase and persists those measured decisions on disk. Normal scans do not benchmark candidates or rewrite routing records; they either find a valid persisted fastest-correct decision for the scan class or report an invalid autoroute state. A missing, stale, invalid, or incomplete decision is not permission to run SIMD/CPU/GPU as a substitute.

Run keyhog calibrate-autoroute to re-prime every preset and workload bucket for the installed binary in place, or rerun install.sh --calibrate / install.ps1 -Calibrate to replace the persisted calibration at install time. Explicit --backend overrides are for diagnostics and benchmarking, not evidence that autoroute is correct.
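One way to picture the cache-key rule is a map keyed by a struct of those inputs, where any changed field simply misses. The sketch below is an assumption about shape only: `RouteKey`, its field types, and `lookup` are hypothetical, it covers a subset of the listed inputs, and it does not reflect keyhog's on-disk format.

```rust
use std::collections::HashMap;

/// A subset of the inputs the text says a routing decision is keyed by.
/// (Field names and types are illustrative.)
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct RouteKey {
    pub binary_version: String,
    pub os_arch: String,
    pub cpu_features: String,
    pub gpu_identity: String,
    pub detector_digest: u64,
    pub scan_config_digest: u64,
    pub calibration_schema: u32,
    pub workload_bucket: String,
}

/// Look up a persisted decision. A miss means "calibration required";
/// the caller must not substitute some other backend.
pub fn lookup<'a>(cache: &'a HashMap<RouteKey, String>, key: &RouteKey) -> Option<&'a str> {
    cache.get(key).map(|backend| backend.as_str())
}
```

Because the whole struct is the key, there is no separate invalidation step: a new detector digest or binary version produces a different key, and the old record is never consulted.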

A single-backend build — one compiled without Hyperscan (simd) or the GPU stack, such as the portable/static release — has no backend choice to route, so it resolves its lone CPU backend directly and never requires calibration (and never fails closed). Autoroute engages only when a build compiled more than one backend.

The visible calibration phase measures every real workload class on your hardware — stdin, small/large files, many-file trees, decode-heavy input, git history/blobs/diff, a loopback web URL, and a live container image — timing each backend per class and persisting only a route it can prove fastest (or the sound lowest-overhead tie-break when two routes are statistically tied). The install refuses to finish unless every class calibrates.

Because a scan-policy preset (--fast, --deep, --precision) changes the scanner fields hashed into the routing digest, each preset resolves a different decision than the default policy. The installer therefore calibrates the default policy and every preset the binary exposes, so keyhog scan . --fast (or --deep/--precision) resolves a persisted fastest-correct decision instead of failing closed. The decisions for the default policy and every preset coexist in one cache file (each keyed by its own resolved-config digest).

keyhog backend prints the live decision for this host: the hardware probe and the size-keyed routing matrix where small inputs stay on simd-regex and large chunks cross into gpu-region-presence once the per-tier byte thresholds are met — a measured, explainable function of host and input size, never a guess.

keyhog backend --autoroute is the companion view: it reads the persisted calibration cache and lists which resolved scan configs and workload buckets already have a fastest-correct decision (and the backend each resolved to), plus whether the cache is stale for this build. When a scan exits with autoroute calibration required: no decision for workload bucket …, this is how you see what is calibrated and recalibrate the gap. Add --json for a stable, scriptable shape.

The simdsieve prefilter is a performance layer, not a separate detector: a hit surfaces under its canonical detector id (aws-access-key, github-classic-pat, slack-bot-token, …) - identical on every platform and build, whether the fast path or the full regex engine made the find.
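As a rough picture of what the prefilter decides, here is a scalar stand-in that checks a buffer for the eight prefixes named in the backend table. It is not the simdsieve implementation (which the text says runs as a single SIMD pass); `HOT_PREFIXES` and `may_contain_secret` are hypothetical names, and a hit here only means the regex backend should look closer.

```rust
/// The eight prefixes listed in the backend table.
const HOT_PREFIXES: [&[u8]; 8] = [
    b"AKIA", b"ASIA", b"ghp_", b"sk-proj-", b"xoxb-", b"xoxp-", b"SG.", b"sq0csp-",
];

/// Scalar stand-in for the SIMD pass: true if any hot prefix occurs in `data`.
/// A `true` result is only a hint; the detector that reports the finding is
/// still the canonical one (for example `github-classic-pat`).
pub fn may_contain_secret(data: &[u8]) -> bool {
    HOT_PREFIXES
        .iter()
        .any(|prefix| data.windows(prefix.len()).any(|window| window == *prefix))
}
```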

Backend selection is reported on startup (the host line also names the GPU and io_uring when present):

v0.5.40 · secret scanner · 902 detectors
⚡ 16 cores | SIMD: AVX-512 | Hyperscan | 902 detectors (6054 patterns) | backend=simd-regex

Full documentation: santhsecurity.github.io/keyhog - install, first scan, output formats, detection internals, suppressions, verification, pre-commit + CI integration, CLI reference, exit codes, env vars, contributing. Source under docs/.

Install

# Linux / macOS
curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh | sh

# Windows (PowerShell)
iwr https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.ps1 -useb | iex

# From source — Linux (default = Hyperscan SIMD; needs libhyperscan-dev + pkg-config)
git clone https://github.com/santhsecurity/keyhog.git
cd keyhog && cargo build --release -p keyhog

# From source / crates.io — macOS, Windows, or any host without Hyperscan
# (the system-lib-free vyre CPU build — no pkg-config, no GPU stack)
cargo install keyhog --no-default-features --features portable

install.sh / install.ps1 (signed prebuilt) is the recommended path: it auto-selects the right per-host variant and is a ~20 MB download in ~1 s, versus a ~3-minute source build. For a source build, note that the default features link Hyperscan (a system lib available on Linux x86_64); on macOS (incl. Apple Silicon) and any host without the Hyperscan dev libraries, build with --no-default-features --features portable — the vyre CPU path, every detection feature, no system-lib or pkg-config dependency.

Works on Linux, macOS (Intel + Apple Silicon), Windows. Zero configuration. keyhog scan . works out of the box.

The installer auto-detects host state and picks a sensible default. On Linux x86_64, the default asset is the WGPU + Hyperscan/SIMD build: WGPU runs the same vyre AC / RulePipeline dispatch on the GPU via the Vulkan backend, with a smaller binary and no libcuda.so runtime dependency. The dedicated keyhog-linux-x86_64-cuda variant is only auto-picked when the full CUDA toolkit is present (nvcc on PATH, $CUDA_HOME set, or /usr/local/cuda exists) - the signal that you actively run a CUDA development setup, not just an NVIDIA driver. macOS and Windows release assets are portable no-system-library builds: they include the scanner data/source surface without Hyperscan, WGPU, CUDA, or a native Metal asset in the current release.

Each download is verified before it can replace your binary: the installer checks the release's minisign signature against keyhog's pinned public key and fails closed (refuses to install, touching nothing) if the signature is missing, wrong, or minisign itself is not installed - in which case it prints the one-line install command for your OS (sudo apt-get install minisign, brew install minisign, winget install -e --id jedisct1.minisign). It then SHA256-verifies the binary against the release-side checksum file. For an offline/air-gapped install without signature verification, pass --insecure (the SHA256 is still checked).

Override the variant with --variant=cuda (force the native CUDA build, requires libcuda.so at runtime) or --variant=cpu (force the default non-CUDA release asset and skip CUDA-asset auto-selection). Pin a version with KEYHOG_VERSION=v0.5.40. Change the install dir with --install-dir=/usr/local/bin. An explicit CUDA variant request requires the keyhog-linux-x86_64-cuda release asset and fails closed if that asset is missing; only auto-selected CUDA hosts may fall back to the default Linux asset.

Three diagnostic modes ship with the same script:

sh install.sh --diagnose    # print host + binary state, change nothing
sh install.sh --repair      # re-download the right variant for this host
sh install.sh --uninstall   # remove the binary + installer-owned shell wiring

For an interactive install (variant prompts + post-install wizard for PATH, shell completions, Claude Code / Cursor hook, git pre-commit hook), download the script first instead of piping into sh:

curl -fsSL https://raw.githubusercontent.com/santhsecurity/keyhog/main/install.sh \
    -o keyhog-install.sh
sh keyhog-install.sh

Daemon mode (sub-100 ms pre-commit scans) is Unix only. Everything else works identically on Windows.

Keep keyhog healthy and up to date

Once installed, keyhog maintains itself - the install script is only needed for the first install:

keyhog doctor                # health check: host probe + end-to-end scan self-test
keyhog backend --self-test --json # CI-readable GPU path health proof
keyhog update                # self-update to the latest release (verified download + atomic swap)
keyhog update --check        # is a newer release available? (exits 10 if yes, 0 if current)
keyhog update --variant cuda # update to the CUDA build instead of the portable one
keyhog repair                # reinstall a known-good binary if the self-test fails (--force to force)
keyhog uninstall             # remove the binary (dry run; pass --yes to actually delete)

keyhog doctor: host probe, install/PATH resolution, and a four-way self-test (scan engine end-to-end, GPU scan path, GPU literal set, GPU MoE shader vs CPU reference). It never reports healthy unless the GPU path proves itself on this host.

keyhog doctor reuses the scanner's own hardware probe and runs a real end-to-end self-test - it plants a synthetic secret and confirms the binary detects it - so it is the authoritative "will keyhog work here?" check (the installer runs it automatically after install). update and repair download the release binary over HTTPS, verify its minisign signature against keyhog's embedded public key, and atomically swap the running binary in place; a tampered or unsigned-mismatched binary is refused. On a healthy host keyhog update is the one-command upgrade path.

keyhog backend --self-test --json is the machine-readable GPU health gate for self-hosted runners. It exits 4 when the production GPU scan path degrades at runtime and emits stable ok, status, exit_code, recommended_backend, and per-probe fields for CI routing.

Quickstart

keyhog scan .                                          # scan a directory
keyhog scan --git-staged                               # pre-commit: only staged blobs
keyhog scan --git-diff main                            # files changed since base ref
keyhog scan --git-history .                            # every commit, every branch
keyhog scan --docker-image registry/app:v1             # Docker image layers
keyhog scan --s3-bucket logs-prod --s3-prefix /        # S3 objects (--s3-endpoint for non-AWS)
keyhog scan --gcs-bucket logs-prod --gcs-prefix config/ # GCS objects (--gcs-endpoint for compatible APIs)
keyhog scan --azure-container-url "$AZURE_CONTAINER_URL" --azure-prefix config/
keyhog scan --github-org acme --github-token "$GH_PAT" # every repo in a GitHub org (PAT required)
keyhog scan --gitlab-group acme --gitlab-token "$GL_PAT" # every project in a GitLab group
keyhog scan --bitbucket-workspace acme --bitbucket-username "$BB_USER" --bitbucket-token "$BB_APP_PASSWORD"
keyhog scan-system --space 50G                         # walk every drive, every git history

Filter, format, gate:

keyhog scan . --severity high                  # info | low | medium | high | critical
keyhog scan . --min-confidence 0.5             # raise the ML floor
keyhog scan . --format sarif -o keyhog.sarif   # GitHub code scanning
keyhog scan . --verify                         # live-verify against vendor APIs
keyhog scan . --create-baseline .keyhog-baseline.json
keyhog scan . --baseline .keyhog-baseline.json # only NEW findings vs snapshot
keyhog scan . --fast                           # pre-commit speed (skip ML + decode)
keyhog scan . --deep                           # max detection depth
keyhog scan . --incremental                    # BLAKE3 Merkle skip → 10–100× faster CI loop
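Conceptually, the --create-baseline / --baseline pair above is a set difference: findings already in the snapshot are dropped and only the rest can fail the build. The sketch below assumes findings are identified by a stable hash string; the baseline file format itself is not described here, so `new_findings` and the in-memory `HashSet` are hypothetical.

```rust
use std::collections::HashSet;

/// Keep only findings whose hash is absent from the baseline snapshot.
/// (Hashes are plain strings here; the real baseline format is not assumed.)
pub fn new_findings<'a>(current: &[&'a str], baseline: &HashSet<String>) -> Vec<&'a str> {
    current
        .iter()
        .copied()
        .filter(|hash| !baseline.contains(*hash))
        .collect()
}
```

With this model, an empty result corresponds to a passing gate even when the tree still contains the previously recorded secrets.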

One scan, every CI/SIEM dialect: text · json · jsonl · sarif · csv · html · junit · github · gitlab, all from the same engine.

Exit codes: 0 clean, 1 findings above the severity floor, 2 user error (bad path, bad config, unsupported flag), 3 system error or detector-corpus audit failure, 4 backend --self-test failed, 10 live credentials found (requires --verify), 11 scanner panic (thread panicked mid-scan), 12 required GPU unavailable, 13 requested source failed or input coverage was incomplete. Matches keyhog --help.
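For wrappers that need to branch on the result, the exit codes above map naturally onto a lookup. This is a sketch for calling code, not part of keyhog: `describe_exit` and `is_findings_exit` are hypothetical helpers, and the strings simply restate the list above.

```rust
/// Meaning of each documented exit code; `None` for codes not listed above.
pub fn describe_exit(code: i32) -> Option<&'static str> {
    match code {
        0 => Some("clean"),
        1 => Some("findings above the severity floor"),
        2 => Some("user error"),
        3 => Some("system error or detector-corpus audit failure"),
        4 => Some("backend --self-test failed"),
        10 => Some("live credentials found (requires --verify)"),
        11 => Some("scanner panic"),
        12 => Some("required GPU unavailable"),
        13 => Some("requested source failed or input coverage was incomplete"),
        _ => None,
    }
}

/// True when the scan itself succeeded and reported secrets (codes 1 and 10),
/// as opposed to a usage, system, or coverage failure.
pub fn is_findings_exit(code: i32) -> bool {
    matches!(code, 1 | 10)
}
```

Separating "secrets found" from "scan did not complete" lets a pipeline page the security team for the former and the build owners for the latter.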

What it catches

902 service-specific detectors with checksum / companion validation:

Cloud providers: AWS (access key + secret + STS verification), Azure (subscription key, storage account key, SAS), GCP (service account, API key), Cloudflare, Heroku, Vercel, Supabase.
Payment processors: Stripe, Braintree, Razorpay, Paddle, Plaid, Square, PayPal, all with companion-required validation (a Braintree private key without its public counterpart never fires).
Source forges: GitHub PATs (with CRC32 checksum), GitLab tokens, Bitbucket app passwords, npm tokens (with checksum), Gitea / Forgejo / Codeberg.
Auth / SSO: Okta, Auth0, Clerk, JumpCloud, Kinde.
Comms: Slack, Discord, Twilio, SendGrid, Postmark, Mailgun, Resend, Loops.
AI / ML: OpenAI (sk-/sk-proj-), Anthropic, Google AI Studio, Cohere, Mistral, HuggingFace, Replicate.
Databases: Postgres connection strings, MongoDB Atlas, Supabase service-role, PlanetScale, Neon, Turso, MySQL, Redis URLs.
Generic + entropy fallback: API_KEY=<high-entropy-blob> catches credentials with no named detector, gated by per-context entropy thresholds + ML scoring.
Cryptographic material: RSA / EC / SSH private keys, PGP private blocks, JWT signing secrets.

Each detector ships as a TOML file (data, not code): service metadata, regex patterns, keywords, companion fields, verification handler. Adding a new detector is 5–10 lines of TOML; the contributor guide walks through it.

keyhog explain <id> dumps any detector's full spec (patterns, keywords, verification endpoint) plus a service-keyed rotation and step-by-step remediation guide, so a finding is never a black box:

[Screenshot: keyhog explain github-classic-pat. Detector spec dump (pattern ghp_[A-Za-z0-9]{36}, keyword, verification URL) followed by the github rotation guide and step-by-step remediation.]

Browse the full catalog at /site/detectors.html - loads all 902 with severity + service + keyword filter.

Why higher recall, fewer false positives

Decode-through scanning. For Kubernetes Secret manifests, JWT payloads, base64-wrapped envs, helm values, and docker-config auth: blobs, the structured preprocessor decodes them in place and feeds every downstream detector the plaintext, so detectors don't each need to re-implement decoding.
Multiline reassembly. "sk-proj-" + \ continuation in JavaScript, YAML multi-line strings, Makefile backslash-continuation, and Helm / Jinja templated outputs are all reassembled before regex matching.
Companion-required validation. AWS access key without its 40-char secret? Skipped. Twilio API key without its auth token? Skipped. Two-out-of-two signals are required for the high-noise detectors, cutting the canonical git log -G ghp_ false-positive cluster.
Confidence scoring. Every finding carries a [0.0, 1.0] score derived from Shannon entropy, surrounding context, companion match, checksum (GitHub CRC32, npm, Slack), and a small ML classifier (~30k params). Default threshold 0.40 (the canonical ScanConfig::default() floor; same as the --min-confidence default and the [scan] min_confidence example below) filters low-quality matches without hiding real secrets.
Bayesian per-detector calibration. keyhog calibrate --fp generic-api-key writes a Beta(α,β) posterior. Scans use it only when --calibration-cache or [system].calibration_cache points at that file, so confidence tuning is explicit and reproducible instead of depending on stray host cache state.
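The Beta(α,β) posterior mentioned above follows the standard conjugate update: each confirmed true positive adds one to α, each confirmed false positive adds one to β, and the posterior mean is α / (α + β). The sketch below shows only that arithmetic; the `Beta` type is hypothetical, and how keyhog folds the posterior into a finding's confidence score is not specified here.

```rust
/// Beta(alpha, beta) belief about how often a detector's matches are real.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Beta {
    pub alpha: f64,
    pub beta: f64,
}

impl Beta {
    /// Conjugate update from labeled outcomes:
    /// true positives increase alpha, false positives increase beta.
    pub fn observe(self, true_positives: u32, false_positives: u32) -> Beta {
        Beta {
            alpha: self.alpha + f64::from(true_positives),
            beta: self.beta + f64::from(false_positives),
        }
    }

    /// Posterior mean: alpha / (alpha + beta).
    pub fn mean(self) -> f64 {
        self.alpha / (self.alpha + self.beta)
    }
}
```

Starting from a uniform Beta(1, 1) prior, three true positives and one false positive give Beta(4, 2), whose mean is 4 / 6, roughly 0.667.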

Performance

Measured head-to-head against BetterLeaks, Kingfisher, Nosey Parker, TruffleHog, and Titus, scored identically by the reproducible harness in benchmarks/: the SecretBench containment rule, with the ground-truth manifest excluded from every scanner's scan tree so no tool is ever shown the answer key. The tables below are generated by make -C benchmarks report; do not edit them by hand.

Detection leaderboard

Corpus: mirror - 15000 fixtures, 3000 labeled positives. Every scanner scored identically (SecretBench overlap rule); the answer-key manifest is excluded from the scan tree.

Rank	Scanner	F1	Precision	Recall	Findings	Wall	Peak RSS
1	KeyHog	0.9258	0.9954	0.8653	2612	1.58s	1543 MB
2	TruffleHog	0.5265	1.0000	0.3573	1072	1.45s	322 MB
3	Kingfisher	0.4720	0.3912	0.5947	5241	3.81s	502 MB
4	Titus	0.4127	0.3318	0.5457	5159	4.13s	114 MB
5	Nosey Parker	0.4078	0.3414	0.5063	4532	0.82s	534 MB
6	BetterLeaks	0.3585	0.2313	0.7967	10828	1.04s	210 MB

Speed & memory

Scanner	Config	Corpus	Wall	Throughput	Peak RSS
Nosey Parker	`default-nocache-nodaemon-no-git-history`	mirror	0.75s	3.1 MB/s	285 MB
BetterLeaks	`default-nocache-nodaemon-no-validate`	mirror	0.77s	3.0 MB/s	192 MB
Nosey Parker	`default-nocache-nodaemon-no-git-history`	mirror	0.82s	2.8 MB/s	534 MB
Nosey Parker	`default-nocache-nodaemon-no-git-history`	creddata	0.92s	1056.3 MB/s	1743 MB
BetterLeaks	`default-nocache-nodaemon-no-validate`	mirror	1.04s	2.2 MB/s	210 MB
KeyHog	`simd-nocache-nodaemon-full`	mirror	1.27s	1.8 MB/s	1137 MB
KeyHog	`simd-nocache-nodaemon-full`	mirror	1.32s	1.8 MB/s	1153 MB
KeyHog	`simd-nocache-nodaemon-full`	mirror	1.40s	1.7 MB/s	1745 MB
TruffleHog	`default-nocache-nodaemon-no-verify`	mirror	1.45s	1.6 MB/s	322 MB
KeyHog	`simd-nocache-nodaemon-full`	mirror	1.58s	1.5 MB/s	1543 MB
TruffleHog	`default-nocache-nodaemon-no-verify`	mirror	1.73s	1.3 MB/s	308 MB
Titus	`default-nocache-nodaemon-no-validate`	mirror	2.53s	0.9 MB/s	117 MB
BetterLeaks	`default-nocache-nodaemon-no-validate`	creddata	2.83s	342.8 MB/s	252 MB
BetterLeaks	`default-nocache-nodaemon-no-validate`	creddata	3.07s	316.5 MB/s	261 MB
Titus	`default-nocache-nodaemon-no-validate`	creddata	3.16s	307.6 MB/s	2024 MB
KeyHog	`simd-nocache-nodaemon-full`	creddata	3.31s	293.8 MB/s	1887 MB
KeyHog	`cpu-nocache-nodaemon-full`	creddata	3.45s	281.7 MB/s	1821 MB
KeyHog	`auto-nocache-nodaemon-full`	creddata	3.52s	275.9 MB/s	1850 MB
KeyHog	`megascan-nocache-nodaemon-full`	creddata	3.70s	262.7 MB/s	1952 MB
Kingfisher	`default-nocache-nodaemon-low-no-validate`	mirror	3.81s	0.6 MB/s	502 MB
KeyHog	`simd-nocache-nodaemon-full`	creddata	3.91s	248.5 MB/s	1741 MB
KeyHog	`simd-nocache-nodaemon-full`	creddata	3.99s	243.7 MB/s	1720 MB
KeyHog	`simd-nocache-nodaemon-full`	creddata	4.02s	241.7 MB/s	1962 MB
KeyHog	`simd-nocache-nodaemon-full`	creddata	4.05s	240.0 MB/s	1677 MB
Titus	`default-nocache-nodaemon-no-validate`	mirror	4.13s	0.6 MB/s	114 MB
Kingfisher	`default-nocache-nodaemon-low-no-validate`	mirror	4.88s	0.5 MB/s	421 MB
KeyHog	`gpu-nocache-nodaemon-full`	creddata	5.12s	189.7 MB/s	3562 MB
KeyHog	`simd-nocache-nodaemon-full`	creddata	5.44s	178.6 MB/s	1641 MB
Kingfisher	`default-nocache-nodaemon-low-no-validate`	creddata	7.36s	131.9 MB/s	728 MB
Kingfisher	`default-nocache-nodaemon-low-no-validate`	creddata	8.13s	119.4 MB/s	657 MB
TruffleHog	`default-nocache-nodaemon-no-verify`	creddata	19.98s	48.6 MB/s	644 MB

Per-category recall gaps (where a competitor still wins recall)

Category	KeyHog P/R/F1	KeyHog TP/FN	Best competitor P/R/F1	Recall gap
`authentication-key`	1.000 / 0.973 / 0.986	498/14	BetterLeaks 0.893 / 0.977 / 0.933	+0.004
`generic-high-entropy-string`	1.000 / 0.348 / 0.516	63/118	BetterLeaks 1.000 / 0.807 / 0.893	+0.459

Reproduce: make -C benchmarks bench runs every scanner on the 15k SecretBench-mirror corpus and writes benchmarks/results/<host>/; make -C benchmarks report regenerates the tables above and benchmarks/reports/. See benchmarks/README.md for the corpora (mirror, competitor home-turf, Samsung/CredData) and the backend/cache/daemon/OS/GPU matrix.

CI integration

GitHub Actions

- uses: santhsecurity/keyhog/.github/actions/keyhog@v0.5.40
  with:
    path: .
    severity: high       # info | low | medium | high | critical
    format: sarif        # SARIF auto-uploads to GitHub code scanning
    baseline: .keyhog-baseline.json   # block only NEW findings

Release tags and explicit version: inputs require a matching prebuilt binary plus checksum and fail closed if the asset is missing or unverifiable. Branch/SHA action refs may build from source with Cargo. SARIF carries CWE-798 + OWASP A07:2021 taxa on every finding.

CI never needs a GPU

Hosted CI should run pure CPU/SIMD unless it has a real GPU. Use keyhog scan --no-gpu or .keyhog.toml [system] gpu = "off" on hosted runners. Use --require-gpu or [system] gpu = "required" on self-hosted GPU runners where a driver regression must fail closed. Detection results are identical on CPU and GPU - the GPU only changes throughput, never which secrets are found.

Building keyhog from source in CI (rather than the prebuilt binary)? Use the portable feature - every detection feature, no system-library build deps (skips the Hyperscan/Ghidra build step):

- run: cargo install keyhog --no-default-features --features portable
- run: keyhog scan . --format sarif --severity high > keyhog.sarif

Other CIs (GitLab, CircleCI, Drone, BuildKite, Jenkins), pre-commit recipes, Husky / lefthook, and the full SARIF schema: site/ci.html and docs/DROP_IN_USAGE.md.

Pre-commit hook

keyhog hook install                    # writes .git/hooks/pre-commit
keyhog hook uninstall                  # removes the keyhog-generated hook

The installed hook calls keyhog scan --fast --git-staged --backend simd on every commit. If keyhog is missing from PATH, the hook blocks the commit because the security scan did not run; install KeyHog, fix PATH, or remove .git/hooks/pre-commit if the repository should not be protected. Staged/diff scans use the in-process orchestrator because they need git-aware source expansion and policy handling. The daemon fast path is for editor-save and hook glue that scans stdin or one regular file.

Or via the pre-commit framework:

repos:
  - repo: https://github.com/santhsecurity/keyhog
    rev: v0.5.40
    hooks:
      - id: keyhog

Daemon mode (105× faster re-scan)

Every keyhog invocation pays a ~2 s cold start in the default desktop build (Hyperscan compile + GPU adapter probe). The lean ci profile above drops that to ~140 ms by skipping both. For pre-commit and IDE save handlers where even 140 ms is too much, run keyhog as a daemon: the cost is paid once per host, every subsequent scan is ~7 ms:

keyhog daemon start                    # Unix socket on $XDG_RUNTIME_DIR
keyhog scan --stdin --daemon < .env    # 7 ms instead of 740 ms
keyhog daemon status
keyhog daemon stop

Daemon scans are scanner-only and apply to eligible stdin or single regular-file inputs. They return findings before baseline filtering, Merkle skip-cache, and live verification; directory, git, remote, baseline, --verify, backend/GPU/autoroute, and policy-changing scans run in-process. --daemon=on fails loudly when the daemon cannot honor the requested scan exactly.

Use it in IDE save handlers, stdin/single-file hook glue, or per-commit CI loops that feed one file at a time. systemd / launchd unit examples in site/daemon.html.

Watch-mode for IDEs:

keyhog watch ./src                     # inotify/FSEvents/RDCW; sub-100 ms per save

System-wide credential triage

sudo keyhog scan-system --space 50G                  # default 50 GiB ceiling
sudo keyhog scan-system --space 1T --include-network # also scan NFS / SMB
sudo keyhog scan-system --space 10G --no-git-history # skip historical blobs

Enumerates every mounted drive (skipping pseudo-FS like /proc, /sys, tmpfs, nsfs, fuse.snapfuse), auto-discovers every .git (worktrees + bare repos + submodules), and runs the full scan + git-history pipeline. Honors a hard --space <bytes> ceiling and exits 1 on findings. Built for incident-response triage, M&A inheritance audits, and quarterly developer-laptop sweeps.

Lockdown mode (security-critical embeddings)

For deployments where keyhog runs on the same machine that holds the secrets (e.g. paired with EnvSeal) and there is no trusted boundary between the scanner and the credentials it inspects:

keyhog scan . --lockdown

Enforces:

mlockall(MCL_CURRENT|MCL_FUTURE) on Linux: credentials never page to swap.
PR_SET_DUMPABLE = 0 (always on, even outside lockdown): disables core dumps, ptrace, /proc/<pid>/mem reads. macOS gets PT_DENY_ATTACH.
setrlimit(RLIMIT_CORE, 0) on Linux: kernel refuses to write any core file regardless of the system coredump_filter, so anonymous pages can never reach disk via the dump path.
Refuses to run if ~/.cache/keyhog/* exists, refuses --incremental writes, refuses --verify, refuses --show-secrets, refuses --fast / --no-decode / --no-entropy / --no-ml / --no-unicode-norm / --no-default-excludes (each trades off detection completeness for speed; lockdown is for the highest-stakes runs where you want every gate engaged).

The always-on hardening (everything except mlock + cache refusal) is applied to every keyhog invocation; even without --lockdown, a keyhog binary can't be coredumped or ptraced.

Library API

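use keyhog_core::{Chunk, ChunkMetadata};
use keyhog_scanner::CompiledScanner;

// Wrapped in main so the `?` operators compile; this assumes the keyhog
// error types convert into Box<dyn std::error::Error>.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Built-in embedded detectors, parsed through the fail-closed loader.
    let detectors = keyhog_core::load_embedded_detectors_or_fail()?;
    let scanner = CompiledScanner::compile(detectors)?;

    // Scan one in-memory chunk; `_findings` holds whatever the scanner reports.
    let _findings = scanner.scan(&Chunk {
        data: "TOKEN=sk_live_EXAMPLE…".into(),
        metadata: ChunkMetadata::default(),
    });
    Ok(())
}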

Mix shipped + custom detectors by concatenating before compile. The scanner is Send + Sync; share one across rayon workers. Streaming source helpers in keyhog-sources (file-system, git, stdin, Docker, S3, GCS, Azure Blob, GitHub org, GitLab group, Bitbucket workspace). Live verification in keyhog-verifier.

Full API surface + stability policy: site/api.html.

Configuration

Per-repo defaults via .keyhog.toml:

[scan]
severity = "high"
min_confidence = 0.40          # canonical default; raise toward 0.85 for fewer FPs
exclude = ["**/test/fixtures/**", "vendor/"]

[limits]
stdin_bytes = "10MB"
web_response_bytes = "10MB"
cloud_max_objects = 100000
git_total_bytes = "256MB"
hosted_git_pages = 1000
docker_tar_total_bytes = "8GB"

[detector.generic-api-key]
enabled = false                # noisy detector? turn it off (hot-* fast-path
                               # ids like `hot-aws_key` are disabled the same way)

[detector.twilio-api-key]
min_confidence = 0.6           # per-detector floor; overrides the global one

[lockdown]
require = true                 # refuse to run unless --lockdown is passed

[system]
autoroute_cache = "/home/alice/.cache/keyhog/autoroute.json"  # or "off"
calibration_cache = "/home/alice/.cache/keyhog/calibration.json"
batch_pipeline = false                                       # true only for diagnostics/calibration
gpu = "auto"                                                 # auto | off | required
autoroute_gpu = false                                        # true only for calibration candidates

[aws]
canary_accounts = []           # extra 12-digit canary issuer accounts
knockoff_accounts = []         # treated the same way: do not live-verify

[tuning]
fallback_hs = true             # scanner recall-route defaults; printed by config --effective
hs_prefilter_max_len = 4096
hs_shard_target = 320
decode_focus = true
confirmed_suffix_gate = true
no_candidate_gate = true
gpu_recall_floor = false
gpu_moe_timeout_ms = 30000

Precedence (rightmost wins): compiled defaults → .keyhog.toml (walked up from the scan path) → CLI flags. The canonical defaults live in ScanConfig::default() (crates/core/src/config.rs). Full reference: docs/src/reference/configuration.md.
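The "rightmost wins" rule can be written as an `Option` chain over the three layers. This is a minimal sketch for one setting, assuming each layer either sets min_confidence or leaves it unset; `resolve_min_confidence` is a hypothetical helper, not keyhog's config loader.

```rust
/// Compiled default from the text (the ScanConfig::default() floor).
pub const DEFAULT_MIN_CONFIDENCE: f64 = 0.40;

/// Resolve one setting across the layers: compiled default, then
/// .keyhog.toml, then CLI flag. A later layer overrides an earlier one.
pub fn resolve_min_confidence(file: Option<f64>, cli: Option<f64>) -> f64 {
    cli.or(file).unwrap_or(DEFAULT_MIN_CONFIDENCE)
}
```

For example, with min_confidence = 0.5 in .keyhog.toml and --min-confidence 0.6 on the command line, the resolved value is 0.6; with neither set it stays at 0.40.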

keyhog config --effective <path> prints the exact resolved configuration that would reach the scanner, without scanning, so the precedence chain is provable (for example, a CLI --min-confidence 0.6 overrides the compiled 0.40 default).

Suppress specific findings (not whole detectors) with a .keyhogignore file, matching by hash, path glob, or detector id; see suppressions.

Allowlist a known leak by hash, path glob, or detector id, plus optional reason / expires / approved_by governance metadata:

# .keyhogignore: gitignore-style shorthand
*.log
node_modules/
9d6060e21ef8d5daec9cfe4a44b1b1bc9792246bfad28210edaaa1782a8a676a

# Explicit form with governance
hash:9f86d081…    ; reason="rotated 2026-04-25" ; expires=2026-07-01 ; approved_by="security@acme"
detector:demo-token
path:**/fixtures/*.env

Entries past expires fail allowlist load with an actionable error, forcing the approval to be renewed or removed before the scan can proceed.
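
A minimal sketch of how the explicit form and the expiry rule could be handled, assuming entries are split on ";" and that "past expires" means the expiry date is strictly before today. AllowlistEntry, parse_entry, and load_allowlist are hypothetical names, the hash value is a placeholder, and keyhog's real parser may differ (for instance in how it treats a ";" inside a quoted reason, or the gitignore-style shorthand, which this sketch does not cover).

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class AllowlistEntry:
    kind: str                                   # "hash", "detector", or "path"
    value: str
    metadata: dict[str, str] = field(default_factory=dict)


def parse_entry(line: str) -> AllowlistEntry:
    """Parse one explicit-form line: `kind:value ; key=value ; ...`."""
    head, *fields = [part.strip() for part in line.split(";")]
    kind, _, value = head.partition(":")
    if kind not in {"hash", "detector", "path"} or not value:
        raise ValueError(f"unrecognized allowlist entry: {line!r}")
    metadata: dict[str, str] = {}
    for item in fields:
        key, _, raw = item.partition("=")
        metadata[key.strip()] = raw.strip().strip('"')
    return AllowlistEntry(kind, value, metadata)


def load_allowlist(lines: list[str], today: date) -> list[AllowlistEntry]:
    """Load explicit-form entries, failing loudly on any expired approval."""
    entries = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        entry = parse_entry(line)
        expires = entry.metadata.get("expires")
        if expires and date.fromisoformat(expires) < today:
            raise ValueError(
                f"line {number}: {entry.kind}:{entry.value} expired on "
                f"{expires}; renew or remove the entry"
            )
        entries.append(entry)
    return entries


# Placeholder entries for illustration.
example_lines = [
    "# Explicit form with governance",
    'hash:deadbeef ; reason="rotated" ; expires=2026-07-01 ; approved_by="security@acme"',
    "detector:demo-token",
    "path:**/fixtures/*.env",
]
entries = load_allowlist(example_lines, today=date(2026, 6, 1))
```

Raising at load time, not skipping the stale entry silently, is what forces the approval to be revisited before any scan runs.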

Architecture

Contributor map: docs/ARCHITECTURE.md is the one-page guide to the whole repo — every top-level directory, the crate layering, and the bytes→finding pipeline with each stage pointing at the module that owns it. Start there to navigate the code.

crates/
  core/       Detector loading, finding types, reporting (text/JSON/SARIF), allowlists
  scanner/    Hardware routing, Hyperscan, GPU, decode-through, entropy, ML, multiline
  sources/    File system, git (staged/diff/history), stdin, Docker, S3, GCS, Azure Blob, GitHub/GitLab/Bitbucket, web
  verifier/   Live credential verification (344 detectors carry an active `[detector.verify]` endpoint)
  cli/        CLI binary, daemon, watch, baselines, calibrate, hook installer
detectors/    902 TOML files (data, not code)
site/         Documentation site (17 pages, GitHub-Pages-ready)
benchmarks/   Reproducible eval harness: corpus generators, scanner adapters, scorer, gate, README report generator
tools/        Contract generators (gen_contracts.py, gen_companion_contracts.py)

Two-phase coalesced scan:

Phase 1: shared trigger scan on raw bytes, parallel across all files via rayon. Hyperscan accelerates this phase when compiled in; portable builds use the pure-Rust trigger path. More than 95% of files have no hits and never reach Phase 2.
Phase 2: full extraction on hits only: regex capture groups, companion matching, checksum validation, entropy gating, ML confidence, and explicit Bayesian damping when configured.
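
The two phases can be sketched in a few lines. This is a simplified illustration and not keyhog's scanner: the Detector class, the literal triggers, and the demo-token pattern are hypothetical, matching is sequential instead of parallel, and Phase 2 here is only a regex pass (no companions, checksums, entropy, or ML).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Detector:
    id: str
    trigger: bytes        # cheap literal that must be present before the regex runs
    pattern: re.Pattern   # full (more expensive) extraction regex


# Hypothetical detector set for illustration.
DETECTORS = [
    Detector("aws-access-key", b"AKIA", re.compile(rb"\bAKIA[0-9A-Z]{16}\b")),
    Detector("demo-token", b"demo_", re.compile(rb"\bdemo_[a-f0-9]{8}\b")),
]


def phase1_triggers(data: bytes) -> list[Detector]:
    """Phase 1: substring checks on raw bytes; most files return an empty list."""
    return [d for d in DETECTORS if d.trigger in data]


def phase2_extract(path: str, data: bytes, candidates: list[Detector]):
    """Phase 2: run the full regex only for detectors whose trigger fired."""
    findings = []
    for det in candidates:
        for match in det.pattern.finditer(data):
            findings.append((path, det.id, match.start(), match.group().decode("ascii")))
    return findings


def scan(files: dict[str, bytes]):
    """Scan files in sorted path order so the output is reproducible."""
    findings = []
    for path in sorted(files):
        candidates = phase1_triggers(files[path])
        if candidates:  # files with no trigger hit skip Phase 2 entirely
            findings.extend(phase2_extract(path, files[path], candidates))
    return findings


sample_files = {
    "a.py": b'key = "AKIAIOSFODNN7EXAMPLE"\n',   # AWS documentation example key
    "b.txt": b"nothing here\n",                  # no trigger: never reaches Phase 2
    "c.env": b"AKIA but not a key\n",            # trigger fires, regex rejects
}
findings = scan(sample_files)
```

Iterating in sorted order is one simple way to keep the result independent of filesystem enumeration order; a parallel scanner would need to sort or otherwise order its merged results to get the same property.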

Result: a multi-GB monorepo scans in seconds. Determinism is part of the contract: same input → same output, byte-exact, every time.

Full architecture writeup, hardware routing matrix, profiling tips: site/architecture.html and site/performance.html.

Other useful subcommands

keyhog detectors --search aws --verbose      # list / inspect detectors
keyhog explain aws-access-key                # spec, regex, severity, rotation guide
keyhog diff before.json after.json           # NEW / RESOLVED / UNCHANGED for CI gates
keyhog calibrate --tp aws-access-key         # record a true positive
keyhog calibrate --fp generic-api-key        # record a false positive
keyhog calibrate --show                      # posterior-mean bar chart per detector
keyhog scan . --calibration-cache ~/.cache/keyhog/calibration.json
keyhog backend                               # detected hardware + routing matrix
keyhog completion zsh                        # shell completions (bash/zsh/fish/powershell/elvish)
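
The NEW / RESOLVED / UNCHANGED split that keyhog diff reports can be understood as set arithmetic over finding identities. The sketch below assumes each finding is a dict with detector_id, path, and secret_hash fields; those field names, and the diff_findings and gate_exit_code helpers, are hypothetical and do not describe keyhog's actual JSON schema.

```python
def finding_key(finding: dict) -> tuple[str, str, str]:
    """Identity used to compare findings across two scans (assumed fields)."""
    return (finding["detector_id"], finding["path"], finding["secret_hash"])


def diff_findings(before: list[dict], after: list[dict]) -> dict[str, list[tuple]]:
    """Classify findings as new, resolved, or unchanged between two scans."""
    before_keys = {finding_key(f) for f in before}
    after_keys = {finding_key(f) for f in after}
    return {
        "new": sorted(after_keys - before_keys),
        "resolved": sorted(before_keys - after_keys),
        "unchanged": sorted(before_keys & after_keys),
    }


def gate_exit_code(result: dict[str, list[tuple]]) -> int:
    """CI gate: fail only when the second scan introduced new findings."""
    return 1 if result["new"] else 0


# Made-up scan results for illustration.
before = [
    {"detector_id": "aws-access-key", "path": "a.py", "secret_hash": "h1"},
    {"detector_id": "demo-token", "path": "b.py", "secret_hash": "h2"},
]
after = [
    {"detector_id": "demo-token", "path": "b.py", "secret_hash": "h2"},
    {"detector_id": "demo-token", "path": "c.py", "secret_hash": "h3"},
]
result = diff_findings(before, after)
```

Gating on the new bucket alone is what lets a CI job tolerate a known backlog while still blocking freshly introduced secrets.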

Contributing

New detector? Drop a TOML in detectors/, open a PR. The contributor guide (CONTRIBUTING.md) has the schema and a worked example.
Bug / missed secret / false positive? File an issue with the redacted credential shape and detector id; each report becomes a permanent test fixture under tests/contracts/.
Security issue in keyhog itself? Don't open a public issue - email security@santh.dev (PGP key on the org page).

Changelog. Open issues.

Credits

keyhog stands on prior secret-scanning work. Ideas borrowed from:

trufflehog: detector breadth + verification semantics
betterleaks: entropy/keyword fusion and false-positive suppression
titus: scanning ergonomics and severity calibration

Thanks to these projects and their contributors.

License

MIT. Use commercially, embed, fork, sell a hosted version. The detector TOMLs are also MIT, so adding one is a 5-line PR with zero legal friction.

Star history

If keyhog has saved you from leaking a credential, a star is the cheapest way to tell the next person it exists.

