-
Notifications
You must be signed in to change notification settings - Fork 31
perf: Reduce prove peak memory and switch to jemalloc #277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@Paradox Can you please rebase it with main, pr makes quite a few changes which breaks compatibility, such as -
|
- Destructure WhirR1CSCommitment to drop masked/random polynomials before WHIR prove_batch/prove, saving ~256 MB in dual-witness path - Defer public input weight vector allocation until after alphas are consumed - Drop program and witness_generator before prove call (~60 MB) - Add feature-gated jemalloc as default allocator (RSS: 2.39 GB -> 1.90 GB) - Add release-fast build profile (30s vs 2.5min) Profiling peak: 2.24 GB -> 1.92 GB RSS with jemalloc: 1.90 GB (complete_age_check, 1.1M constraints)
…commits Move drop(self.program) and drop(self.witness_generator) immediately after extracting public input indices, before the NTT-heavy commit phase. Also drop acir_witness_idx_to_value_map right after its last use in each branch rather than after both branches.
b914aad to
65b12b0
Compare
65b12b0 to
099ac78
Compare
| } | ||
| } | ||
|
|
||
| impl R1CSSolver for LazyR1CS { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The R1CSSolver for LazyR1CS is same as that of the R1CS implementation. Instead of the common code, consider extracting into common func, using macros etc.
| } | ||
| } | ||
|
|
||
| fn ensure_decompressed(&self) -> &(Interner, SparseMatrix, SparseMatrix, SparseMatrix) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using Result<&(...)> for better logging
| postcard::to_allocvec(&matrices).expect("Failed to serialize R1CS matrices"); | ||
| let mut compressed = Vec::new(); | ||
| { | ||
| let mut encoder = XzEncoder::new(&mut compressed, 6); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In file/bin.rs, the encoding used was xz level 9. it's better to have a global const which is 9 and used here as well
| zeroize = "1.8.1" | ||
| xz2 = "0.1.7" | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra space
| /// After the first access the decompressed matrices live in `cached`, | ||
| /// so the compressed blob is dead weight. Call this after the first | ||
| /// access to reclaim ~10 MB for a typical circuit. | ||
| pub fn free_compressed(&mut self) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding an assertion in free_compressed() to verify cache is populated: assert!(self.cached.get().is_some(), "Must access matrices before freeing");
Reduce Peak Memory During Prove Step
Problem
After adding public inputs, the prove step for
complete_age_checkregressed from 1.84 GB to 2.24 GB peak memory.Changes
Memory optimization (2.24 GB → 1.92 GB, −320 MB)
WhirR1CSCommitmentin both single and dual witness paths to take ownership of masked/random polynomials, enabling explicitdrop()before entering WHIR'sprove_batch/proveadd_public_inputs_to_transcript+build_public_weights) to defer the 64 MB allocation until after alphas are consumedprogramandwitness_generatorbefore the prove call since they are only needed during witness generationSwitch default allocator to jemalloc (RSS: 2.39 GB → 1.90 GB, −490 MB)
ProfilingAllocator, enabled by defaultAdd
release-fastbuild profilecargo build --profile release-fastcodegen-units = 16lto = "thin"Benchmark (
complete_age_check, 1.1M constraints)Allocator Comparison
jemalloc was chosen as default for best RSS.
mimalloc was evaluated and rejected (worst RSS despite best wall-clock time).
Root Cause Analysis
The remaining ~80 MB gap from the original 1.84 GB is fully accounted for by the public inputs weight vector:
prove_batch(line 279, read-only external crate)Before public inputs, there were 6 weights; after, 7.
This overhead is inherent to the protocol and cannot be reduced without modifying the WHIR crate or changing the proof transcript structure.