perf: Reduce prove peak memory and switch to jemalloc #277

Bisht13 · 2026-02-07T10:48:45Z

Reduce Peak Memory During Prove Step

Problem

After adding public inputs, the prove step for complete_age_check regressed from 1.84 GB to 2.24 GB peak memory.

Changes

Memory optimization (2.24 GB → 1.92 GB, −320 MB)

- Destructured WhirR1CSCommitment in both single and dual witness paths to take ownership of masked/random polynomials, enabling explicit drop() before entering WHIR's prove_batch / prove
- Split public input transcript interaction from weight vector allocation (add_public_inputs_to_transcript + build_public_weights) to defer the 64 MB allocation until after alphas are consumed
- Dropped program and witness_generator before the prove call since they are only needed during witness generation
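
For readers unfamiliar with the pattern, here is a minimal sketch of destructuring a struct to release large fields before a memory-heavy call. The `Commitment` type, its fields, and `prove` are hypothetical stand-ins, not the actual WhirR1CSCommitment or WHIR API.

```rust
/// Hypothetical stand-in for a commitment that owns large buffers.
struct Commitment {
    masked_polynomial: Vec<u64>,
    random_polynomial: Vec<u64>,
    commitment_root: u64,
}

/// Hypothetical stand-in for the memory-heavy prove call.
fn prove(seed: u64) -> u64 {
    seed.wrapping_mul(31)
}

fn prove_with_early_drop(commitment: Commitment) -> u64 {
    // Destructuring moves every field out of the struct, so each buffer
    // can be released on its own instead of living as long as `commitment`.
    let Commitment {
        masked_polynomial,
        random_polynomial,
        commitment_root,
    } = commitment;

    // Last use of the polynomials: fold them into a small checksum.
    let checksum = masked_polynomial
        .iter()
        .chain(random_polynomial.iter())
        .fold(0u64, |acc, &x| acc.wrapping_add(x));

    // Free both buffers before the expensive call allocates its own memory.
    drop(masked_polynomial);
    drop(random_polynomial);

    prove(commitment_root ^ checksum)
}
```

Without the destructuring, the buffers would stay alive until `commitment` goes out of scope at the end of the function, which is after the prove call has reached its own peak.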

Switch default allocator to jemalloc (RSS: 2.39 GB → 1.90 GB, −490 MB)

- Added feature-gated jemalloc support to ProfilingAllocator, enabled by default
- System allocator remains available via:

--no-default-features --features profiling-allocator
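
A rough sketch of what a feature-gated, peak-tracking allocator can look like. This is not the PR's ProfilingAllocator: the `jemalloc` feature name, the `PeakTrackingAllocator` type, and the use of the `tikv-jemallocator` crate are assumptions for illustration. With the feature off, the sketch compiles against the standard library alone.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Assumption: a Cargo feature named `jemalloc` selects the backing allocator
// and pulls in the `tikv-jemallocator` crate. Otherwise use the system one.
#[cfg(feature = "jemalloc")]
static INNER: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
#[cfg(not(feature = "jemalloc"))]
static INNER: System = System;

/// Hypothetical wrapper that records the peak number of live heap bytes.
pub struct PeakTrackingAllocator {
    current: AtomicUsize,
    peak: AtomicUsize,
}

impl PeakTrackingAllocator {
    pub const fn new() -> Self {
        Self {
            current: AtomicUsize::new(0),
            peak: AtomicUsize::new(0),
        }
    }

    /// Highest number of simultaneously live bytes seen so far.
    pub fn peak_bytes(&self) -> usize {
        self.peak.load(Ordering::Relaxed)
    }
}

unsafe impl GlobalAlloc for PeakTrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { INNER.alloc(layout) };
        if !ptr.is_null() {
            // Update the live byte count, then raise the peak if needed.
            let now = self.current.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            self.peak.fetch_max(now, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { INNER.dealloc(ptr, layout) };
        self.current.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static ALLOCATOR: PeakTrackingAllocator = PeakTrackingAllocator::new();
```

Note that a counter like this measures requested bytes, which is why the profiling peak and RSS figures in this PR differ: RSS also reflects how the underlying allocator returns or retains pages.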

Add `release-fast` build profile

cargo build --profile release-fast

- 30s build time vs 2.5min for full release
- Uses codegen-units = 16
- Uses lto = "thin"

Benchmark (`complete_age_check`, 1.1M constraints)

| Metric | Before | After |
|---|---|---|
| Profiling peak | 2.24 GB | 1.92 GB |
| RSS (system alloc) | — | 2.39 GB |
| RSS (jemalloc, default) | — | 1.90 GB |

Allocator Comparison

| Allocator | RSS | Duration |
|---|---|---|
| System | 2.39 GB | 3.30s |
| jemalloc ✅ | 1.90 GB | 3.81s |
| mimalloc | 3.12 GB | 3.37s |

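jemalloc was chosen as default for best RSS, at the cost of a slower run (3.81s vs 3.30s with the system allocator).
mimalloc was evaluated and rejected (worst RSS, and slightly slower than the system allocator at 3.37s vs 3.30s).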

Root Cause Analysis

The remaining ~80 MB gap from the original 1.84 GB is fully accounted for by the public inputs weight vector:

- 64 MB in the statement
- 64 MB cloned inside WHIR's prove_batch (line 279, read-only external crate)

Before public inputs, there were 6 weights; after, 7.

This overhead is inherent to the protocol and cannot be reduced without modifying the WHIR crate or changing the proof transcript structure.

ashpect · 2026-02-09T16:10:39Z

@Paradox Can you please rebase this onto main? The PR makes quite a few changes that break compatibility, such as:

1. Change in proof format: transcript to narg_string, which breaks existing proofs.
2. Lazy r1cs uses the old zstd method.
3. DomainSeparator.instance(&empty) causes a panic in debug mode (I've pushed a commit which fixes this).
4. Gnark verifier: this still expects the old proof format ('transcript' etc.), see item 1.

- Destructure WhirR1CSCommitment to drop masked/random polynomials before WHIR prove_batch/prove, saving ~256 MB in dual-witness path
- Defer public input weight vector allocation until after alphas are consumed
- Drop program and witness_generator before prove call (~60 MB)
- Add feature-gated jemalloc as default allocator (RSS: 2.39 GB -> 1.90 GB)
- Add release-fast build profile (30s vs 2.5min)

Profiling peak: 2.24 GB -> 1.92 GB
RSS with jemalloc: 1.90 GB (complete_age_check, 1.1M constraints)

…oding

…commits Move drop(self.program) and drop(self.witness_generator) immediately after extracting public input indices, before the NTT-heavy commit phase. Also drop acir_witness_idx_to_value_map right after its last use in each branch rather than after both branches.

ashpect · 2026-02-10T18:29:48Z

provekit/prover/src/r1cs.rs

    }
 }
+
+impl R1CSSolver for LazyR1CS {


The R1CSSolver impl for LazyR1CS is the same as the one for R1CS. Instead of duplicating the code, consider extracting it into a common function, a macro, etc.
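
One possible shape for this, as a sketch only: put the shared logic in a free function and generate the thin trait impls with a macro. The `Rows` and `Solver` traits and both matrix types below are hypothetical placeholders, not the real R1CSSolver, R1CS, or LazyR1CS definitions.

```rust
/// Hypothetical accessor that both matrix representations can provide.
trait Rows {
    fn rows(&self) -> &[Vec<u64>];
}

/// Hypothetical stand-in for the solver trait.
trait Solver {
    fn apply(&self, witness: &[u64]) -> Vec<u64>;
}

struct EagerMatrix {
    rows: Vec<Vec<u64>>,
}

struct LazyMatrix {
    decoded: Vec<Vec<u64>>,
}

impl Rows for EagerMatrix {
    fn rows(&self) -> &[Vec<u64>] {
        &self.rows
    }
}

impl Rows for LazyMatrix {
    fn rows(&self) -> &[Vec<u64>] {
        &self.decoded
    }
}

/// The shared logic lives in one place: a matrix-vector product.
fn apply_rows(rows: &[Vec<u64>], witness: &[u64]) -> Vec<u64> {
    rows.iter()
        .map(|row| {
            row.iter()
                .zip(witness)
                .map(|(a, w)| a.wrapping_mul(*w))
                .fold(0u64, u64::wrapping_add)
        })
        .collect()
}

/// Generates an identical `Solver` impl for every listed type.
macro_rules! impl_solver {
    ($($ty:ty),+ $(,)?) => {
        $(
            impl Solver for $ty {
                fn apply(&self, witness: &[u64]) -> Vec<u64> {
                    apply_rows(self.rows(), witness)
                }
            }
        )+
    };
}

impl_solver!(EagerMatrix, LazyMatrix);
```

A blanket `impl<T: Rows> Solver for T` would avoid the macro entirely, at the cost of preventing any type from providing its own specialised impl later.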

ashpect · 2026-02-10T18:31:35Z

provekit/common/src/lazy_r1cs.rs

+        }
+    }
+
+    fn ensure_decompressed(&self) -> &(Interner, SparseMatrix, SparseMatrix, SparseMatrix) {


Consider using Result<&(...)> for better logging
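
A sketch of what the Result-returning variant could look like, assuming the cache is a `std::sync::OnceLock`. The `LazyBlob` type, the `DecodeError` enum, and the toy `decode` function are hypothetical and stand in for the real decompression and deserialization steps.

```rust
use std::sync::OnceLock;

/// Hypothetical error type. A real one would wrap the XZ and postcard errors.
#[derive(Debug, PartialEq)]
enum DecodeError {
    Truncated { len: usize },
}

/// Hypothetical stand-in for a lazily decoded structure.
struct LazyBlob {
    compressed: Vec<u8>,
    cached: OnceLock<Vec<u32>>,
}

/// Toy "decompression": read little-endian u32 words.
fn decode(bytes: &[u8]) -> Result<Vec<u32>, DecodeError> {
    if bytes.len() % 4 != 0 {
        return Err(DecodeError::Truncated { len: bytes.len() });
    }
    Ok(bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}

impl LazyBlob {
    fn new(compressed: Vec<u8>) -> Self {
        Self {
            compressed,
            cached: OnceLock::new(),
        }
    }

    /// Returns the cached value, decoding on first access.
    /// Failures are reported to the caller instead of panicking.
    fn ensure_decoded(&self) -> Result<&[u32], DecodeError> {
        if let Some(values) = self.cached.get() {
            return Ok(values.as_slice());
        }
        // Decode outside the cell so that `?` can propagate the error.
        let decoded = decode(&self.compressed)?;
        // If another thread initialised the cell first, its value wins.
        Ok(self.cached.get_or_init(|| decoded).as_slice())
    }
}
```

The caller can then log the error with whatever context it has (file name, circuit name) rather than getting a bare panic message from inside the accessor.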

ashpect · 2026-02-10T18:33:47Z

provekit/common/src/lazy_r1cs.rs

+            postcard::to_allocvec(&matrices).expect("Failed to serialize R1CS matrices");
+        let mut compressed = Vec::new();
+        {
+            let mut encoder = XzEncoder::new(&mut compressed, 6);


In file/bin.rs, the encoding used was xz level 9. It would be better to have a global const set to 9 and use it here as well.

ashpect · 2026-02-10T18:35:55Z

Cargo.toml

 zeroize = "1.8.1"
 xz2 = "0.1.7"

+


extra space

ashpect · 2026-02-10T19:40:02Z

provekit/common/src/lazy_r1cs.rs

+    /// After the first access the decompressed matrices live in `cached`,
+    /// so the compressed blob is dead weight. Call this after the first
+    /// access to reclaim ~10 MB for a typical circuit.
+    pub fn free_compressed(&mut self) {


Consider adding an assertion in free_compressed() to verify cache is populated: assert!(self.cached.get().is_some(), "Must access matrices before freeing");
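
To illustrate the suggestion, a minimal sketch with a hypothetical `LazyBlob` type (not the actual LazyR1CS), where the "decompression" step is just a copy:

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a structure holding a compressed blob
/// plus a lazily populated cache of the decoded data.
struct LazyBlob {
    compressed: Vec<u8>,
    cached: OnceLock<Vec<u8>>,
}

impl LazyBlob {
    fn new(compressed: Vec<u8>) -> Self {
        Self {
            compressed,
            cached: OnceLock::new(),
        }
    }

    /// Decodes on first access. Here "decoding" is only a clone.
    fn decoded(&self) -> &[u8] {
        self.cached.get_or_init(|| self.compressed.clone())
    }

    /// Releases the compressed bytes once the cache holds the decoded data.
    fn free_compressed(&mut self) {
        // Guard against freeing the only copy of the data.
        assert!(
            self.cached.get().is_some(),
            "Must access matrices before freeing"
        );
        // Replacing the Vec drops the old heap buffer.
        self.compressed = Vec::new();
    }

    fn compressed_capacity(&self) -> usize {
        self.compressed.capacity()
    }
}
```

Without the assertion, calling `free_compressed` before the first access would leave nothing to decode from, and the failure would only surface later at the first read.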

Bisht13 requested a review from ashpect February 7, 2026 10:48

Bisht13 added 3 commits February 10, 2026 19:42

fix: truncate long tracing KV values to prevent WHIR debug output flo…

c3b05bb

…oding

Bisht13 force-pushed the px/reduce-prove-memory-jemalloc branch from b914aad to 65b12b0 Compare February 10, 2026 17:03

perf: add LazyR1CS with XZ-compressed matrices

099ac78

Bisht13 force-pushed the px/reduce-prove-memory-jemalloc branch from 65b12b0 to 099ac78 Compare February 10, 2026 17:23

ashpect requested changes Feb 10, 2026


2 participants
