feat(models): introduce typed digest newtypes for cryptographic hashes#590
Merged
Replace Option<String> hash fields with fixed-size [u8; N] newtypes (Sha1Digest, Md5Digest, Sha256Digest, Sha512Digest, GitSha1) across FileInfo, PackageData, ResolvedPackage, Package, FileReference, and the incremental cache.

Benefits:
- Compile-time guarantee: a SHA-1 cannot be assigned to a SHA-256 field
- Stack-allocated: no heap allocation for hash values
- Copy semantics: no cloning overhead
- Serde hex string serialization: JSON-compatible
- ~15% scan performance improvement (180s vs 212s on self-scan)
- Enables EMPTY_SHA1_DIGEST const replacing string literal

Benchmark (self-scan excluding target/reference/.git):
- Total: 180.88s vs 212.44s baseline (-14.9%)
- Scan: 6.55s vs 7.76s (-15.6%)
- Throughput: 691 MB/s vs 583 MB/s (+18.5%)
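To make the commit message above concrete, here is a minimal sketch of what a macro-generated digest newtype could look like. It is an illustration under assumptions, not the code in src/models/digest.rs: the macro body, the error type (a plain String), and the derive list are hypothetical, and serde support is omitted to keep the example dependency-free.

```rust
// Hypothetical sketch of a macro-generated digest newtype.
// The real define_digest macro in this PR may differ in every detail.
macro_rules! define_digest {
    ($name:ident, $len:expr) => {
        // Fixed-size and Copy: values live inline, with no heap allocation.
        #[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
        pub struct $name([u8; $len]);

        impl $name {
            // Accept only strings with exactly two hex characters per byte.
            pub fn from_hex(s: &str) -> Result<Self, String> {
                let bytes = s.as_bytes();
                if bytes.len() != 2 * $len {
                    return Err(format!(
                        "expected {} hex chars, got {}",
                        2 * $len,
                        bytes.len()
                    ));
                }
                let mut out = [0u8; $len];
                for (i, pair) in bytes.chunks(2).enumerate() {
                    let hi = (pair[0] as char).to_digit(16).ok_or("invalid hex character")?;
                    let lo = (pair[1] as char).to_digit(16).ok_or("invalid hex character")?;
                    out[i] = (hi * 16 + lo) as u8;
                }
                Ok(Self(out))
            }

            // Raw digest bytes (20 for SHA-1, 32 for SHA-256, and so on).
            pub fn as_bytes(&self) -> &[u8; $len] {
                &self.0
            }

            // Lowercase hex rendering, twice as long as the raw bytes.
            pub fn as_hex(&self) -> String {
                self.0.iter().map(|b| format!("{:02x}", b)).collect()
            }
        }

        impl std::fmt::Display for $name {
            fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
                f.write_str(&self.as_hex())
            }
        }
    };
}

define_digest!(Sha1Digest, 20);
define_digest!(Sha256Digest, 32);
```

Because Sha1Digest and Sha256Digest are distinct types, assigning one to a field of the other is a compile error, and a 40-character string passed to Sha256Digest::from_hex is rejected at parse time rather than stored.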
…wtype migration

- Add #[allow(dead_code)] on EMPTY, from_bytes, and as_bytes in define_digest macro: these are public API items used across type instantiations and in tests
- Fix deno_lock parser to use parse_sri for SRI-format integrity strings (sha256-<base64>, sha512-<base64>) instead of passing them raw to from_hex
- Replace placeholder hex strings in test fixtures with valid hex:
  - conan_data_test: xyz789 -> abc789 (valid hex chars)
  - python_test: ...ww / ...ss suffixes -> ...bb / ...cc
  - cargo_lock fixture: 12-char checksum -> proper 64-char sha256
  - deno_lock + assembly golden: use proper SRI format and valid hex
mstykow
reviewed
Apr 9, 2026
  "md5": "3287103cfb083fb998a35ef8a1983c58",
  "sha256": "6cc2359979269e4d9eddce7d84682d2bb06a35a14edce806bf0da6e8d4d31806",
- "sha512": "7630aacb9e3073b2064397ed080b8d5bf7db06ba2022d6c927e05b7d53c5787d",
+ "sha512": null,
Owner
Is it correct that some of these hashes in the golden fixtures became null?
Collaborator
Author
investigating that
Collaborator
Author
This changed because the value is not actually a sha512; it is simply wrong in the lock file. We don't know the real sha512 value, so we should output null instead of the 256-bit value from the lock file, which is definitely not a sha512.
ec871c8 to
d635146
Compare
The parse_integrity function stripped the algorithm prefix (e.g. sha512-) but returned the raw base64 string instead of decoding it to hex. This caused Sha512Digest::from_hex() to reject all pnpm integrity values, silently dropping valid hash data. Now uses parse_sri which correctly base64-decodes and converts to hex, matching the behavior of the yarn_lock, npm_lock, deno_lock, and bun_lock parsers.
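The difference between stripping the prefix and actually decoding the value can be sketched as follows. This is not the project's parse_sri: the function names sri_to_hex and base64_decode, the supported algorithm list, and the lenient padding handling are all assumptions made for a dependency-free illustration.

```rust
// Decode standard-alphabet base64. Padding is simply trimmed, so this
// hypothetical helper is more lenient than a strict decoder would be.
fn base64_decode(input: &str) -> Option<Vec<u8>> {
    let trimmed = input.trim_end_matches('=');
    let mut out = Vec::with_capacity(trimmed.len() * 3 / 4);
    let mut buf: u32 = 0;
    let mut bits: u32 = 0;
    for c in trimmed.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'+' => 62,
            b'/' => 63,
            _ => return None,
        };
        // Accumulate 6 bits per character and emit a byte once 8 are available.
        buf = (buf << 6) | v as u32;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
            buf &= (1u32 << bits) - 1;
        }
    }
    Some(out)
}

// Split "<algo>-<base64>", decode the payload, check that its length fits
// the algorithm, and return the algorithm name with the lowercase hex digest.
fn sri_to_hex(integrity: &str) -> Option<(&str, String)> {
    let (algo, b64) = integrity.split_once('-')?;
    let expected_len = match algo {
        "sha1" => 20,
        "sha256" => 32,
        "sha512" => 64,
        _ => return None,
    };
    let bytes = base64_decode(b64)?;
    if bytes.len() != expected_len {
        return None;
    }
    let hex: String = bytes.iter().map(|b| format!("{:02x}", b)).collect();
    Some((algo, hex))
}
```

With this shape, a placeholder such as "sha512-chalkhash" yields None because the decoded payload is far shorter than 64 bytes, while a well-formed SRI string yields hex that a typed from_hex would accept.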
…nd cargo

Gradle module metadata sometimes puts a 32-byte hash in the sha512 field. Detect this by length and assign to sha256 instead, preserving the data that was previously silently dropped.

Cargo.lock checksum fields are algorithm-agnostic. Short checksums that don't match SHA-256 length are now stored in extra_data instead of being silently dropped.
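The length-based routing described in this commit might look roughly like the sketch below. Everything here is hypothetical: the Hashes struct, the function names, and the use of plain strings instead of the typed digests are simplifications for illustration, and the rule of keeping an already-present sha256 is taken from the PR description rather than from the parser source.

```rust
use std::collections::HashMap;

// Simplified stand-in for the hash fields of a package record.
#[derive(Debug, Default, PartialEq)]
struct Hashes {
    sha256: Option<String>,
    sha512: Option<String>,
}

// True when `s` is exactly `n` ASCII hex characters.
fn is_hex_of_len(s: &str, n: usize) -> bool {
    s.len() == n && s.bytes().all(|b| b.is_ascii_hexdigit())
}

// Gradle-style routing: a value labelled sha512 is stored as sha512 only if
// it has 128 hex chars. A 64-char value is treated as a SHA-256 and fills the
// sha256 slot unless one is already present. Anything else is not stored.
fn route_declared_sha512(hashes: &mut Hashes, value: &str) {
    if is_hex_of_len(value, 128) {
        hashes.sha512 = Some(value.to_ascii_lowercase());
    } else if is_hex_of_len(value, 64) && hashes.sha256.is_none() {
        hashes.sha256 = Some(value.to_ascii_lowercase());
    }
}

// Cargo-style routing: a 64-char hex checksum becomes sha256; any other
// non-empty hex checksum is preserved under extra_data["checksum"].
fn route_cargo_checksum(
    checksum: &str,
    sha256: &mut Option<String>,
    extra_data: &mut HashMap<String, String>,
) {
    if is_hex_of_len(checksum, 64) {
        *sha256 = Some(checksum.to_ascii_lowercase());
    } else if !checksum.is_empty() && checksum.bytes().all(|b| b.is_ascii_hexdigit()) {
        extra_data.insert("checksum".to_string(), checksum.to_string());
    }
}
```

In this sketch the Gradle outcome depends on whether sha256 was filled before the sha512 field is processed; a real parser would need to fix that order or resolve conflicts explicitly.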
…n test fixtures

Replace invalid placeholder digest strings in test fixtures and expected files with properly formatted hex/SRI values that pass typed newtype validation:
- deno.lock: asserthash/internalhash/oakmodhash -> valid SRI/hex
- yarn-v2-protocol.lock: checksum 'test' -> valid 128-char sha512 hex
- pypi.json: wheelhash/sdisthash -> valid 64-char sha256 hex
- pnpm_lock.rs: fix import ordering from cargo fmt
d635146 to
b30a5e1
Compare
…e and fix test sha1 values

The SPDX spec requires the package verification code to be the SHA-1 of the concatenated hex string representations of file SHA-1 values. After the digest newtype migration, as_bytes() returns raw 20-byte digests instead of 40-byte hex strings, producing incorrect verification codes. Use as_hex().as_bytes() to hash the hex string as the spec requires.

Also fix test helper calls that passed 'abc' as sha1 to sample_plain_text_file, which now requires a valid 40-char hex string.
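The bug is easiest to see by comparing the two byte sequences that could be fed to the final SHA-1. The sketch below only builds those inputs and leaves the hashing to whichever SHA-1 crate the project uses. The function names are hypothetical, and the sorting step reflects my reading of the SPDX algorithm, not anything stated in the commit message.

```rust
// Lowercase hex rendering of raw bytes.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

// What the code hashed right after the migration: raw digest bytes,
// 20 bytes per file. This does not match the SPDX definition.
fn raw_input(digests: &[[u8; 20]]) -> Vec<u8> {
    digests.iter().flatten().copied().collect()
}

// What the fix hashes: the hex strings, 40 ASCII bytes per file.
// Sorting the hex strings first is an assumption based on the SPDX spec.
fn hex_input(digests: &[[u8; 20]]) -> Vec<u8> {
    let mut hexes: Vec<String> = digests.iter().map(|d| to_hex(d)).collect();
    hexes.sort();
    hexes.concat().into_bytes()
}
```

The verification code would then be the SHA-1 of hex_input(...), which is twice as long as raw_input(...) and consists only of ASCII hex characters.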
Summary
Replace Option<String> hash fields with fixed-size [u8; N] newtypes (Sha1Digest, Md5Digest, Sha256Digest, Sha512Digest, GitSha1) across all model structs, parsers, output formats, and the incremental cache.

Performance Impact
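Self-scan benchmark (excluding target/reference/.git):
- Total: 180.88s vs 212.44s baseline (-14.9%)
- Scan: 6.55s vs 7.76s (-15.6%)
- Throughput: 691 MB/s vs 583 MB/s (+18.5%)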
The improvement comes from eliminating heap allocations for hash strings (e.g., 20 bytes for SHA-1 vs 65+ bytes for a hex String), enabling Copy semantics throughout the pipeline, and reducing memory pressure.

Changes
- src/models/digest.rs: macro-defined digest types with from_hex, as_hex, from_bytes, as_bytes, Display, Ord, and serde hex-string support
- FileInfo, PackageData, ResolvedPackage, Package, FileReference: all hash fields now typed
- hash.rs returns typed digests directly from crypto crates
- EMPTY_SHA1 string literal → EMPTY_SHA1_DIGEST const; CycloneDX/SPDX/HTML use .as_hex()
- content_sha256: Sha256Digest
- DigestType::from_hex().ok() when parsing hash strings from manifests

Golden Test Changes
The newtype migration enforces two invariants that the previous Option<String> did not: valid hexadecimal characters and an exact byte length matching the algorithm. This caused several categories of golden test changes:

1. Placeholder digests replaced with valid values (deno, python, yarn)
Test fixtures contained non-hex placeholder strings like "asserthash", "sdisthash", "test", and "sha512-chalkhash". These were valid as Option<String> but are rejected by the typed newtypes. The fixtures were updated with realistic hex/SRI values, and the expected files were updated to match the parser output.
Files: deno.lock, pypi.json, yarn-v2-protocol.lock (fixtures + expected)

2. Base64→hex conversion for SRI integrity strings (pnpm)
The pnpm lock parser's parse_integrity function was stripping the sha512- prefix but returning the raw base64 string instead of decoding it. This was invisible when sha512 was Option<String> but is now caught by Sha512Digest::from_hex(). The parser was fixed to use parse_sri (which properly base64-decodes and converts to hex), and the pnpm golden expected files were updated from base64 strings to 128-character hex strings.
Files: pnpm-v5/v6/v9.yaml.expected.json

3. Misassigned algorithm labels (gradle)
The Gradle module metadata file material-1.9.0.module has a sha512 field containing a 64-hex-character (32-byte) hash, which is the correct length for SHA-256, not SHA-512 (which requires 128 hex chars / 64 bytes). The parser now detects this mismatch and routes 64-hex-char values from the sha512 field to the sha256 slot. Since the fixture already has an explicit sha256 value, the misassigned hash is discarded and sha512 becomes null. This is the correct behavior: the upstream data labeled a SHA-256 hash as SHA-512.
Files: material-1.9.0.module-expected.json

4. Short/invalid checksums silently dropped (cargo)
The Cargo.lock fixture had a 12-character checksum ("abc123def456") that is too short for SHA-256 (requires 64 hex chars). The parser now routes valid 64-char checksums to sha256 and stores other-length valid hex checksums in extra_data["checksum"] as a fallback. The golden expected file reflects sha256: null with the short checksum preserved in extra_data.

5. SPDX verification code fix
The SPDX package verification code is defined as the SHA-1 hash of concatenated file SHA-1 hex string representations. After the newtype migration, as_bytes() returned raw 20-byte digests instead of 40-byte hex strings, producing incorrect verification codes. Fixed by using as_hex().as_bytes() to hash the hex representation as the SPDX spec requires.

How to Test
cargo build --release
cargo clippy --all-targets --all-features
cargo test