feat(models): introduce typed digest newtypes for cryptographic hashes#590

Merged
abraemer merged 6 commits into main from feat/digest-newtypes on Apr 9, 2026

abraemer (Collaborator) commented Apr 9, 2026

Summary

  • Replace Option<String> hash fields with fixed-size [u8; N] newtypes (Sha1Digest, Md5Digest, Sha256Digest, Sha512Digest, GitSha1) across all model structs, parsers, output formats, and the incremental cache
  • Store digests as stack-allocated raw byte arrays with serde hex-string serialization for full JSON compatibility
  • Compile-time guarantee that a SHA-1 digest cannot be silently assigned to a SHA-256 field

Performance Impact

Self-scan benchmark (excluding target/reference/.git):

| Metric     | Baseline | After    | Change |
|------------|----------|----------|--------|
| Total time | 212.44s  | 180.88s  | -14.9% |
| Scan time  | 7.76s    | 6.55s    | -15.6% |
| Throughput | 583 MB/s | 691 MB/s | +18.5% |

The improvement comes from eliminating heap allocations for hash strings (e.g., 20 bytes for SHA-1 vs 65+ bytes for a hex String), enabling Copy semantics throughout the pipeline, and reducing memory pressure.
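
To make the allocation argument concrete, here is a minimal sketch of the size difference. The struct below is a hypothetical stand-in for the PR's `Sha1Digest`, not its actual definition, and the helper function names are invented for illustration.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for the PR's `Sha1Digest`: 20 raw bytes stored inline.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Sha1Digest([u8; 20]);

/// Bytes occupied by the typed digest; nothing lives on the heap.
pub fn inline_size() -> usize {
    size_of::<Sha1Digest>()
}

/// Bytes the hex form needs on the heap, in addition to the inline
/// `String` header (pointer, length, capacity).
pub fn hex_heap_bytes(digest: &Sha1Digest) -> usize {
    let hex: String = digest.0.iter().map(|b| format!("{:02x}", b)).collect();
    hex.len()
}

/// Because the type is `Copy`, passing it by value is a plain 20-byte copy
/// and the caller keeps its own value; no `clone()` is needed.
pub fn pass_by_value(digest: Sha1Digest) -> Sha1Digest {
    digest
}
```

Under these assumptions the digest occupies 20 bytes inline, while the hex form needs a 40-byte heap buffer before the `String` header is even counted.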

Changes

  • New: src/models/digest.rs — Macro-defined digest types with from_hex, as_hex, from_bytes, as_bytes, Display, Ord, serde hex-string support
  • Models: FileInfo, PackageData, ResolvedPackage, Package, FileReference — all hash fields now typed
  • Utils: hash.rs returns typed digests directly from crypto crates
  • Output: EMPTY_SHA1 string literal → EMPTY_SHA1_DIGEST const; CycloneDX/SPDX/HTML use .as_hex()
  • Cache: content_sha256: Sha256Digest
  • 20+ parsers: Use DigestType::from_hex().ok() when parsing hash strings from manifests
  • 20+ test files: Updated assertions to use typed digest constructors
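
For readers who have not opened src/models/digest.rs, the following is a rough sketch of what a macro-defined digest type with `from_hex`, `as_hex`, `from_bytes`, `as_bytes`, and `Display` could look like. It is an illustration only: the macro body, the `DigestError` type, and its variants are assumptions, and the serde support mentioned above is left out so the sketch needs nothing beyond std.

```rust
use std::fmt;

/// Hypothetical error type: why a string was rejected as a digest.
#[derive(Debug, PartialEq, Eq)]
pub enum DigestError {
    WrongLength { expected: usize, actual: usize },
    InvalidHex,
}

/// Hypothetical macro: generates one newtype per algorithm over `[u8; N]`.
macro_rules! define_digest {
    ($name:ident, $len:expr) => {
        #[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
        pub struct $name([u8; $len]);

        impl $name {
            pub fn from_bytes(bytes: [u8; $len]) -> Self {
                Self(bytes)
            }

            pub fn as_bytes(&self) -> &[u8; $len] {
                &self.0
            }

            /// Accepts exactly `2 * N` hex characters, upper or lower case.
            pub fn from_hex(s: &str) -> Result<Self, DigestError> {
                let raw = s.as_bytes();
                if raw.len() != 2 * $len {
                    return Err(DigestError::WrongLength {
                        expected: 2 * $len,
                        actual: raw.len(),
                    });
                }
                let mut out = [0u8; $len];
                for (i, pair) in raw.chunks_exact(2).enumerate() {
                    let hi = (pair[0] as char).to_digit(16).ok_or(DigestError::InvalidHex)?;
                    let lo = (pair[1] as char).to_digit(16).ok_or(DigestError::InvalidHex)?;
                    out[i] = (hi * 16 + lo) as u8;
                }
                Ok(Self(out))
            }

            /// Lowercase hex, produced through the `Display` impl below.
            pub fn as_hex(&self) -> String {
                self.to_string()
            }
        }

        impl fmt::Display for $name {
            fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
                for byte in &self.0 {
                    write!(f, "{:02x}", byte)?;
                }
                Ok(())
            }
        }
    };
}

define_digest!(Sha1Digest, 20);
define_digest!(Sha256Digest, 32);
```

With types like these, `Sha256Digest::from_hex` rejects a 40-character SHA-1 string at runtime, and a `Sha1Digest` value cannot be stored in a `Sha256Digest` field at all because the two are distinct types.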

Golden Test Changes

The newtype migration enforces two invariants that the previous Option<String> did not: valid hexadecimal characters and exact byte-length matching the algorithm. This caused several categories of golden test changes:

1. Placeholder digests replaced with valid values (deno, python, yarn)

Test fixtures contained non-hex placeholder strings like "asserthash", "sdisthash", "test", and "sha512-chalkhash". These were valid as Option<String> but are rejected by the typed newtypes. The fixtures were updated with realistic hex/SRI values and the expected files updated to match the parser output.

Files: deno.lock, pypi.json, yarn-v2-protocol.lock (fixtures + expected)

2. Base64→hex conversion for SRI integrity strings (pnpm)

The pnpm lock parser's parse_integrity function was stripping the sha512- prefix but returning the raw base64 string instead of decoding it. This was invisible when sha512 was Option<String> but is now caught by Sha512Digest::from_hex(). The parser was fixed to use parse_sri (which properly base64-decodes and converts to hex), and the pnpm golden expected files were updated from base64 strings to 128-character hex strings.

Files: pnpm-v5/v6/v9.yaml.expected.json
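
The base64-to-hex step can be sketched as follows. The project's real `parse_sri` is not shown in this PR, so the signature, the accepted algorithm names, and the hand-rolled decoder below are all assumptions made to keep the example std-only.

```rust
/// Map one character of the standard base64 alphabet to its 6-bit value.
fn sextet(c: u8) -> Option<u32> {
    match c {
        b'A'..=b'Z' => Some((c - b'A') as u32),
        b'a'..=b'z' => Some((c - b'a') as u32 + 26),
        b'0'..=b'9' => Some((c - b'0') as u32 + 52),
        b'+' => Some(62),
        b'/' => Some(63),
        _ => None,
    }
}

/// Decode standard, padded base64; returns `None` on malformed input.
pub fn base64_decode(input: &str) -> Option<Vec<u8>> {
    let bytes = input.as_bytes();
    if bytes.is_empty() || bytes.len() % 4 != 0 {
        return None;
    }
    let pad = bytes.iter().rev().take_while(|&&b| b == b'=').count();
    if pad > 2 {
        return None;
    }
    let data = &bytes[..bytes.len() - pad];
    let mut out = Vec::with_capacity(bytes.len() / 4 * 3);
    for chunk in data.chunks(4) {
        let mut acc = 0u32;
        for &c in chunk {
            acc = (acc << 6) | sextet(c)?;
        }
        // Left-align a short final chunk so the byte extraction lines up.
        acc <<= 6 * (4 - chunk.len() as u32);
        // 4 characters yield 3 bytes, 3 yield 2, 2 yield 1.
        for i in 0..chunk.len() - 1 {
            out.push((acc >> (16 - 8 * i)) as u8);
        }
    }
    Some(out)
}

/// Hypothetical `parse_sri`: split `<algorithm>-<base64>`, decode, check the
/// digest length for the algorithm, and return lowercase hex.
pub fn parse_sri(integrity: &str) -> Option<(&str, String)> {
    let (algorithm, encoded) = integrity.split_once('-')?;
    let expected_len = match algorithm {
        "sha1" => 20,
        "sha256" => 32,
        "sha512" => 64,
        _ => return None,
    };
    let raw = base64_decode(encoded)?;
    if raw.len() != expected_len {
        return None;
    }
    let hex: String = raw.iter().map(|b| format!("{:02x}", b)).collect();
    Some((algorithm, hex))
}
```

The old behaviour corresponds to returning `encoded` unchanged, which a hex parser will always reject because base64 output is neither the right length nor restricted to hex characters.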

3. Misassigned algorithm labels (gradle)

The Gradle module metadata file material-1.9.0.module has a sha512 field containing a 64-hex-character (32-byte) hash — the correct length for SHA-256, not SHA-512 (which requires 128 hex / 64 bytes). The parser now detects this mismatch and routes 64-hex-char values from the sha512 field to the sha256 slot. Since the fixture already has an explicit sha256 value, the misassigned hash is discarded and sha512 becomes null. This is the correct behavior: the upstream data labeled a SHA-256 hash as SHA-512.

Files: material-1.9.0.module-expected.json
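
The length-based routing described above might look roughly like this. The function and struct names are hypothetical, and hashes are kept as validated hex strings rather than typed digests to keep the sketch short.

```rust
/// Hypothetical result type: the two hash slots after routing.
#[derive(Debug, PartialEq, Eq)]
pub struct RoutedHashes {
    pub sha256: Option<String>,
    pub sha512: Option<String>,
}

/// True if `s` is exactly `len` ASCII hex characters.
fn is_hex_of_len(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| b.is_ascii_hexdigit())
}

/// Route the `sha256` and `sha512` fields of a module-metadata entry by length.
pub fn route_gradle_hashes(sha256: Option<&str>, sha512: Option<&str>) -> RoutedHashes {
    // An explicit, well-formed sha256 always takes the sha256 slot.
    let explicit = sha256.filter(|s| is_hex_of_len(s, 64)).map(str::to_owned);
    let mut routed = RoutedHashes { sha256: explicit, sha512: None };
    match sha512 {
        // A genuine SHA-512: 128 hex characters.
        Some(s) if is_hex_of_len(s, 128) => routed.sha512 = Some(s.to_owned()),
        // 64 hex characters under the sha512 label is really a SHA-256;
        // keep it only when the sha256 slot is still empty.
        Some(s) if is_hex_of_len(s, 64) && routed.sha256.is_none() => {
            routed.sha256 = Some(s.to_owned())
        }
        // Anything else is discarded, leaving sha512 as None.
        _ => {}
    }
    routed
}
```

For the material-1.9.0.module case, both inputs are 64 hex characters, so the explicit sha256 is kept and sha512 ends up as `None`, matching the updated golden file.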

4. Short checksums preserved in extra_data instead of sha256 (cargo)

The Cargo.lock fixture had a 12-character checksum ("abc123def456") that is too short for SHA-256 (requires 64 hex chars). The parser now routes valid 64-char checksums to sha256 and stores other-length valid hex checksums in extra_data["checksum"] as a fallback. The golden expected file reflects sha256: null with the short checksum preserved in extra_data.
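
A minimal sketch of that fallback, with hypothetical names. Dropping non-hex input entirely is an assumption here; the description above only covers valid hex of other lengths.

```rust
use std::collections::BTreeMap;

/// Hypothetical result type: where a Cargo.lock checksum ends up.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct ChecksumSlots {
    pub sha256: Option<String>,
    pub extra_data: BTreeMap<String, String>,
}

/// Route an algorithm-agnostic `checksum` value by its hex length.
pub fn route_cargo_checksum(checksum: &str) -> ChecksumSlots {
    let mut slots = ChecksumSlots::default();
    let is_hex = !checksum.is_empty() && checksum.bytes().all(|b| b.is_ascii_hexdigit());
    if !is_hex {
        // Assumption: values that are not hex at all are not stored anywhere.
        return slots;
    }
    if checksum.len() == 64 {
        // 64 hex characters is the length of a SHA-256 digest.
        slots.sha256 = Some(checksum.to_ascii_lowercase());
    } else {
        // Valid hex of any other length is preserved rather than dropped.
        slots
            .extra_data
            .insert("checksum".to_owned(), checksum.to_owned());
    }
    slots
}
```

Calling `route_cargo_checksum("abc123def456")` leaves `sha256` empty and stores the value under `extra_data["checksum"]`, which is the shape the golden expected file now has.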

5. SPDX verification code fix

The SPDX package verification code is defined as the SHA-1 hash of concatenated file SHA-1 hex string representations. After the newtype migration, as_bytes() returned raw 20-byte digests instead of 40-byte hex strings, producing incorrect verification codes. Fixed by using as_hex().as_bytes() to hash the hex representation as the SPDX spec requires.
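
The difference between the two inputs can be shown without a SHA-1 implementation by building only the buffer that gets hashed. The `Sha1Digest` below is a hypothetical stand-in, the function names are invented, and the sort step reflects my reading of the SPDX 2.x algorithm rather than anything stated in this PR; the final SHA-1 over the buffer is omitted because std provides none.

```rust
/// Hypothetical stand-in for the PR's `Sha1Digest`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Sha1Digest([u8; 20]);

impl Sha1Digest {
    pub fn as_bytes(&self) -> &[u8; 20] {
        &self.0
    }

    pub fn as_hex(&self) -> String {
        self.0.iter().map(|b| format!("{:02x}", b)).collect()
    }
}

/// Correct input: each file's 40-character hex digest, sorted, then
/// concatenated. The verification code is the SHA-1 of this buffer.
pub fn verification_code_input(files: &[Sha1Digest]) -> Vec<u8> {
    let mut hexes: Vec<String> = files.iter().map(Sha1Digest::as_hex).collect();
    hexes.sort();
    hexes.concat().into_bytes()
}

/// The regression, for contrast: raw digest bytes, only 20 per file.
pub fn raw_bytes_input(files: &[Sha1Digest]) -> Vec<u8> {
    files
        .iter()
        .flat_map(|d| d.as_bytes().iter().copied())
        .collect()
}
```

For two files the correct buffer is 80 bytes of ASCII hex, while the raw-bytes variant is 40 bytes of binary, so the resulting SHA-1 values cannot match.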

How to Test

cargo build --release
cargo clippy --all-targets --all-features
cargo test

abraemer added 2 commits April 9, 2026 12:10
Replace Option<String> hash fields with fixed-size [u8; N] newtypes
(Sha1Digest, Md5Digest, Sha256Digest, Sha512Digest, GitSha1) across
FileInfo, PackageData, ResolvedPackage, Package, FileReference, and
the incremental cache.

Benefits:
- Compile-time guarantee: SHA-1 cannot be assigned to SHA-256 field
- Stack-allocated: no heap allocation for hash values
- Copy semantics: no cloning overhead
- Serde hex string serialization: JSON-compatible
- ~15% scan performance improvement (180s vs 212s on self-scan)
- Enables EMPTY_SHA1_DIGEST const replacing string literal

Benchmark (self-scan excluding target/reference/.git):
- Total: 180.88s vs 212.44s baseline (-14.9%)
- Scan: 6.55s vs 7.76s (-15.6%)
- Throughput: 691 MB/s vs 583 MB/s (+18.5%)
…wtype migration

- Add #[allow(dead_code)] on EMPTY, from_bytes, and as_bytes in
  define_digest macro: these are public API items used across type
  instantiations and in tests
- Fix deno_lock parser to use parse_sri for SRI-format integrity
  strings (sha256-<base64>, sha512-<base64>) instead of passing them
  raw to from_hex
- Replace placeholder hex strings in test fixtures with valid hex:
  - conan_data_test: xyz789 -> abc789 (valid hex chars)
  - python_test: ...ww / ...ss suffixes -> ...bb / ...cc
  - cargo_lock fixture: 12-char checksum -> proper 64-char sha256
  - deno_lock + assembly golden: use proper SRI format and valid hex
"md5": "3287103cfb083fb998a35ef8a1983c58",
"sha256": "6cc2359979269e4d9eddce7d84682d2bb06a35a14edce806bf0da6e8d4d31806",
"sha512": "7630aacb9e3073b2064397ed080b8d5bf7db06ba2022d6c927e05b7d53c5787d",
"sha512": null,
Owner

Is it correct that some of these hashes in the golden fixtures became null?

Collaborator Author

investigating that

Collaborator Author

This changed because that value is not actually a SHA-512; the value in the lock file is simply wrong. We therefore don't know the SHA-512 and should output null rather than the 256-bit value from the lockfile, which is definitely not a SHA-512.

@abraemer abraemer force-pushed the feat/digest-newtypes branch from ec871c8 to d635146 Compare April 9, 2026 16:18
abraemer added 3 commits April 9, 2026 18:34
The parse_integrity function stripped the algorithm prefix (e.g. sha512-)
but returned the raw base64 string instead of decoding it to hex. This
caused Sha512Digest::from_hex() to reject all pnpm integrity values,
silently dropping valid hash data.

Now uses parse_sri which correctly base64-decodes and converts to hex,
matching the behavior of the yarn_lock, npm_lock, deno_lock, and
bun_lock parsers.
…nd cargo

Gradle module metadata sometimes puts a 32-byte hash in the sha512
field. Detect this by length and assign to sha256 instead, preserving
the data that was previously silently dropped.

Cargo.lock checksum fields are algorithm-agnostic. Short checksums
that don't match SHA-256 length are now stored in extra_data instead
of being silently dropped.
…n test fixtures

Replace invalid placeholder digest strings in test fixtures and
expected files with properly formatted hex/SRI values that pass
typed newtype validation:
- deno.lock: asserthash/internalhash/oakmodhash -> valid SRI/hex
- yarn-v2-protocol.lock: checksum 'test' -> valid 128-char sha512 hex
- pypi.json: wheelhash/sdisthash -> valid 64-char sha256 hex
- pnpm_lock.rs: fix import ordering from cargo fmt
@abraemer abraemer force-pushed the feat/digest-newtypes branch from d635146 to b30a5e1 Compare April 9, 2026 17:45
…e and fix test sha1 values

The SPDX spec requires the package verification code to be the SHA-1
of the concatenated hex string representations of file SHA-1 values.
After the digest newtype migration, as_bytes() returns raw 20-byte
digests instead of 40-byte hex strings, producing incorrect verification
codes. Use as_hex().as_bytes() to hash the hex string as the spec
requires.

Also fix test helper calls that passed 'abc' as sha1 to
sample_plain_text_file, which now requires a valid 40-char hex string.
@abraemer abraemer merged commit 07ac1d1 into main Apr 9, 2026
14 checks passed
@abraemer abraemer deleted the feat/digest-newtypes branch April 9, 2026 20:44