
feat: initial shape-scan implementation#1

Merged
DimaMenetro merged 1 commit into main from devin/1777137563-initial-implementation
Apr 25, 2026

@devin-ai-integration
Contributor

Summary

Initial implementation of shape-scan, a Rust CLI that measures the entropy and topological "shape" of files to flag suspicious binaries (packed/encrypted payloads).

What it computes per file:

  • Shannon entropy — whole-file (bits/byte, max 8.0) plus a sliding-window summary (mean, std-dev, min/max, fraction of high-entropy windows). Default window: 4 KiB.
  • Topological shape — the byte stream is treated as a Markov chain and the 256×256 byte-bigram transition graph is summarised by:
    • distinct vertices |V| and edges |E|
    • edge density |E| / 65 536
    • joint bigram entropy (bits/pair)
    • conditional entropy H(b_{i+1} | b_i) (bits/byte)
    • mean per-row entropy ± std-dev
    • a stable 64-bit structural fingerprint (FNV-1a over the quantised matrix)
  • Section-aware analysis — per-section entropy for ELF, PE, and Mach-O via goblin; falls back to a single <file> pseudo-section for unknown formats.
  • Heuristic risk score in [0.0, 1.0] with a transparent, additive weighting (documented in the README) and a low/medium/high bucket.
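
As a rough illustration of the entropy bullet above, here is a minimal sketch of whole-file Shannon entropy plus a per-window summary. It is not the code from this PR: `WindowSummary`, `window_summary`, and the `high_threshold` parameter are hypothetical names, and the windows here are consecutive non-overlapping chunks, which is only one possible reading of "sliding window".

```rust
/// Shannon entropy of a byte slice in bits per byte (0.0 for empty input, at most 8.0).
fn shannon_entropy(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    // Histogram of byte values.
    let mut counts = [0u64; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Per-window entropy summary (hypothetical struct, not the crate's own type).
#[derive(Debug)]
struct WindowSummary {
    mean: f64,
    std_dev: f64,
    min: f64,
    max: f64,
    /// Fraction of windows whose entropy is at or above the threshold.
    high_fraction: f64,
}

/// Summarises entropy over consecutive `window`-byte chunks.
/// `high_threshold` is an assumed cut-off for "high-entropy" windows.
fn window_summary(data: &[u8], window: usize, high_threshold: f64) -> Option<WindowSummary> {
    if window == 0 || data.is_empty() {
        return None;
    }
    let values: Vec<f64> = data.chunks(window).map(shannon_entropy).collect();
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    // Population variance across windows.
    let var = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let min = values.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = values.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let high = values.iter().filter(|&&v| v >= high_threshold).count() as f64;
    Some(WindowSummary {
        mean,
        std_dev: var.sqrt(),
        min,
        max,
        high_fraction: high / n,
    })
}
```

Calling `window_summary(&bytes, 4096, 7.0)` would mirror the 4 KiB default window; the `7.0` threshold is a placeholder, not the value used in the crate.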

CLI:

  • shape-scan scan <paths...> — full scan; supports -r recursion, --min-risk, --format text|json|markdown, -j parallel workers (rayon), --max-size-mib.
  • shape-scan shape <path> — only the topology report.
  • shape-scan entropy <path> [--window N] — only the entropy profile.
  • Exit codes: 0 clean, 1 at least one high-risk file, 2 error.
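
The exit-code contract above can be sketched as a small pure function. This is an illustration only: `Risk` and `exit_status` are hypothetical names, and letting an error take precedence over a high-risk finding is an assumption, since the description does not say which wins when both occur.

```rust
/// Risk bucket for one scanned file (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Risk {
    Low,
    Medium,
    High,
}

/// Maps per-file outcomes to a process exit status:
/// 0 = clean, 1 = at least one high-risk file, 2 = error.
/// Assumption: any error outranks a high-risk result.
fn exit_status(results: &[Result<Risk, String>]) -> u8 {
    if results.iter().any(|r| r.is_err()) {
        return 2;
    }
    if results.iter().any(|r| matches!(r, Ok(Risk::High))) {
        return 1;
    }
    0
}
```

A `main` returning `std::process::ExitCode` could finish with `ExitCode::from(exit_status(&results))`, which keeps the mapping testable without spawning a process.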

Honest framing: the README is explicit that this is a triage signal, not a malware verdict. Entropy and shape heuristics are well known, and sophisticated malware can be tuned to evade them.

Local verification:

$ shape-scan scan README.md /bin/ls /tmp/rand.bin
[ low  ] score=0.00 entropy=4.94  README.md
[ low  ] score=0.05 entropy=5.79  /bin/ls
[ high ] score=0.80 entropy=8.00  /tmp/rand.bin   <-- /dev/urandom
   - high overall entropy (8.00 bits/byte)
   - 100% of 4096-byte windows are high-entropy
   - near-complete byte-bigram graph (edge density 0.95)
   - uniform conditional entropy (7.74 bits/byte) — bytes are nearly memoryless
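
For readers wondering where the edge-density and conditional-entropy figures in that output come from, here is a minimal sketch of the bigram-graph metrics. It is not the PR's implementation: `Shape` and `bigram_shape` are hypothetical names, and the conditional entropy is obtained through the chain rule H(b_{i+1} | b_i) = H(b_i, b_{i+1}) - H(b_i), which may differ from how the crate computes it.

```rust
/// Summary of the byte-bigram transition graph (hypothetical struct for this sketch).
struct Shape {
    vertices: usize,
    edges: usize,
    edge_density: f64,
    joint_entropy: f64,
    conditional_entropy: f64,
}

/// Builds the 256x256 transition counts and derives the graph metrics.
/// Returns None when the input has fewer than two bytes (no bigrams).
fn bigram_shape(data: &[u8]) -> Option<Shape> {
    if data.len() < 2 {
        return None;
    }
    // counts[a * 256 + b] = number of times byte b directly follows byte a.
    let mut counts = vec![0u64; 256 * 256];
    for pair in data.windows(2) {
        counts[pair[0] as usize * 256 + pair[1] as usize] += 1;
    }
    let total = (data.len() - 1) as f64;

    // |V|: distinct byte values seen; |E|: distinct ordered pairs seen.
    let mut seen = [false; 256];
    for &b in data {
        seen[b as usize] = true;
    }
    let vertices = seen.iter().filter(|&&s| s).count();
    let edges = counts.iter().filter(|&&c| c > 0).count();

    // Joint entropy over observed pairs, in bits per pair.
    let joint: f64 = counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total;
            -p * p.log2()
        })
        .sum();

    // Marginal entropy of the first byte of each pair (row sums of the matrix).
    let marginal: f64 = counts
        .chunks(256)
        .map(|row| row.iter().sum::<u64>())
        .filter(|&r| r > 0)
        .map(|r| {
            let p = r as f64 / total;
            -p * p.log2()
        })
        .sum();

    Some(Shape {
        vertices,
        edges,
        edge_density: edges as f64 / 65_536.0,
        joint_entropy: joint,
        // Chain rule: H(Y | X) = H(X, Y) - H(X).
        conditional_entropy: joint - marginal,
    })
}
```

On uniformly random input nearly every pair eventually appears and the next byte is almost independent of the current one, so both the edge density and the conditional entropy approach their maxima, which is the pattern flagged for /tmp/rand.bin.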

CI runs cargo fmt --check, cargo clippy -- -D warnings, cargo test, and cargo build --release on Linux, macOS, and Windows. 12 unit tests cover entropy, shape, scoring, and I/O.

Review & Testing Checklist for Human

  • Confirm the risk-scoring weights in src/scan.rs::score (and the matching table in the README) match the trade-offs you want — they're tunable knobs and worth a glance.
  • Spot-check the per-section parsing on a real PE or Mach-O sample if you have one handy: shape-scan scan /path/to/binary.exe --format json | jq '.[0].sections'.
  • Verify the GitHub Actions CI run on this PR passes on all three OSes; the matrix uses dtolnay/rust-toolchain@stable, so a stable-Rust regression in a transitive dep would surface here.
  • Decide whether you want this published to crates.io — if so, the package metadata in Cargo.toml is ready (license, description, keywords) but you'll want to claim the name and bump the version before publishing.
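
To make the first checklist item concrete, this is what an additive score with buckets and small-file dampening might look like. Every number below is a placeholder: the real weights and thresholds live in src/scan.rs and the README, and this sketch is not meant to reproduce the scores shown in the local verification output. `Metrics`, `Bucket`, and this `score` signature are hypothetical.

```rust
/// Inputs to the heuristic (hypothetical subset of the reported metrics).
struct Metrics {
    entropy: f64,              // whole-file entropy, bits/byte
    high_window_fraction: f64, // fraction of high-entropy windows, 0.0..=1.0
    edge_density: f64,         // |E| / 65_536
    conditional_entropy: f64,  // bits/byte
    size_bytes: u64,
}

#[derive(Debug, PartialEq)]
enum Bucket {
    Low,
    Medium,
    High,
}

/// Additive score in [0.0, 1.0]. All weights and cut-offs are illustrative
/// assumptions, not the values used by shape-scan.
fn score(m: &Metrics) -> (f64, Bucket) {
    let mut s = 0.0;
    if m.entropy >= 7.2 {
        s += 0.35;
    }
    if m.high_window_fraction >= 0.5 {
        s += 0.25;
    }
    if m.edge_density >= 0.8 {
        s += 0.20;
    }
    if m.conditional_entropy >= 7.0 {
        s += 0.20;
    }
    // Small-file dampening: scale the score down linearly below an assumed 4 KiB,
    // since entropy estimates on tiny inputs are noisy.
    if m.size_bytes < 4096 {
        s *= m.size_bytes as f64 / 4096.0;
    }
    let s = s.clamp(0.0, 1.0);
    let bucket = if s >= 0.7 {
        Bucket::High
    } else if s >= 0.3 {
        Bucket::Medium
    } else {
        Bucket::Low
    };
    (s, bucket)
}
```

Because each signal contributes an independent term, changing one weight shifts only the files that trip that signal, which is what makes the knobs easy to review.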

Notes

  • Cargo.lock is committed because this is a binary crate; remove it later if you decide to ship shape-scan only as a library.
  • The structural fingerprint is intentionally a 64-bit FNV-1a hash: fast, deterministic, and good enough to cluster files, but not collision-resistant. If you ever need a cryptographically strong fingerprint, swap in BLAKE3 over the same quantised matrix.
  • The gh integration token couldn't set the default branch to main, but the main ref was created via the GitHub API, so the PR base is correct. You may want to confirm main is the default branch in repo settings.
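
For reference, a minimal sketch of a 64-bit FNV-1a fingerprint over a quantised count matrix, as mentioned in the fingerprint note. The FNV-1a constants are the standard ones; the quantisation step (a log2 bucket per cell) and the names `quantise` and `fingerprint` are assumptions, since the description does not say how the matrix is quantised.

```rust
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325; // standard 64-bit FNV offset basis
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3; // standard 64-bit FNV prime

/// 64-bit FNV-1a: XOR each byte into the state, then multiply by the prime.
fn fnv1a_64(bytes: impl IntoIterator<Item = u8>) -> u64 {
    bytes
        .into_iter()
        .fold(FNV_OFFSET, |h, b| (h ^ u64::from(b)).wrapping_mul(FNV_PRIME))
}

/// Assumed quantisation: 0 for an absent edge, otherwise 1 + floor(log2(count)).
/// Coarse buckets keep the fingerprint stable under small count changes.
fn quantise(count: u64) -> u8 {
    if count == 0 {
        0
    } else {
        (64 - count.leading_zeros()) as u8
    }
}

/// Fingerprint of a transition-count matrix stored in row-major order.
fn fingerprint(counts: &[u64]) -> u64 {
    fnv1a_64(counts.iter().map(|&c| quantise(c)))
}
```

Two files whose transition counts fall into the same buckets get the same fingerprint, which is what allows clustering; a BLAKE3 swap would replace `fnv1a_64` while keeping `quantise` unchanged.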

Link to Devin session: https://app.devin.ai/sessions/65678b8d19d74d5b97392a05c0f7d416
Requested by: @DimaMenetro

Adds entropy and topological-shape file scanner:
- Shannon entropy (whole-file + sliding window) in bits/byte
- Byte-bigram transition graph with density, joint/conditional entropy,
  per-row entropy stats, and stable structural fingerprint
- Section-aware analysis for ELF, PE, and Mach-O via goblin
- CLI with scan/shape/entropy subcommands (text/json/markdown output)
- Heuristic risk score with documented weights and small-file dampening
- 12 unit tests covering entropy, shape, scoring, and I/O
- GitHub Actions CI (fmt + clippy + test on linux/macos/windows)
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

DimaMenetro merged commit c04e5ae into main on Apr 25, 2026
3 checks passed