perf: cache has_jump flag and pass buffer in _pack_location #203
Merged
MatthieuDartiailh merged 1 commit on May 28, 2026
Two independent micro-optimisations on the roundtrip hot path, benchmarked together:

1. Cache `_is_jump` on BaseInstr at construction time

`has_jump()` is called on every instruction in ControlFlowGraph.from_bytecode, BasicBlock.append, and _StackSizeComputer.run. The profiler showed it at ~2.9% own time, all of it spent on `opcode in HAS_JUMP` (a set lookup) on every call.

Added `_is_jump: bool` to `BaseInstr.__slots__` and compute it once in `_set()` (the canonical setter used by `__init__` and the public `set()` method). All fast-path constructors that bypass `_set()` now copy or compute the flag directly: `copy()`, `_from_trusted()` on both BaseInstr and ConcreteInstr, and `_from_opcode()` on ConcreteInstr. `has_jump()` becomes a single slot read.

2. Eliminate per-call bytearray allocation in _pack_location

`_assemble_locations` previously collected one `bytearray` per location group via `_push_locations -> _pack_location -> bytearray()`, then joined them with `b"".join(locations)` at the end. Each location entry is only 2-6 bytes, so the list of small bytearrays and the final join were measurable overhead (`_pack_location` at ~3.9% own time in the profiler).

Changed the signatures of `_pack_location` and `_push_locations` to accept a shared `bytearray` named `buf` and extend it in place. `_assemble_locations` creates one `bytearray()` up front and converts it to `bytes` at the end, with zero intermediate allocations.

Benchmark (perf.py, Bytecode.from_code(dis).to_code(), 30 runs, p95 r/s):

| | p95 (r/s) | 95% CI |
|---|---|---|
| Baseline | 188 | [187, 188] |
| This PR | 196 | [195, 196] |

Delta: +8 r/s (+4.3%), Mann-Whitney p~0 (significant, threshold: p<0.01 and |delta|>=2%)
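As a rough illustration of the two changes described above, here is a minimal sketch, not the code from this PR. The `Instr` class, `pack_location`, and `assemble_locations` are hypothetical stand-ins for `BaseInstr`, `_pack_location`, and `_assemble_locations`. `HAS_JUMP` is rebuilt from the standard `dis` module as an assumption, and the location encoding is a simplified toy, not the library's actual packing logic.

```python
import dis

# Assumption: stand-in for the library's HAS_JUMP set, built from dis.
HAS_JUMP = frozenset(dis.hasjrel) | frozenset(dis.hasjabs)


class Instr:
    """Toy instruction that caches its jump flag in a slot."""

    __slots__ = ("_opcode", "_arg", "_is_jump")

    def __init__(self, opcode: int, arg: int = 0) -> None:
        self._set(opcode, arg)

    def _set(self, opcode: int, arg: int) -> None:
        # Canonical setter: the set lookup happens once, here.
        self._opcode = opcode
        self._arg = arg
        self._is_jump = opcode in HAS_JUMP

    def copy(self) -> "Instr":
        # Fast path that bypasses _set(), so the flag is copied by hand.
        new = object.__new__(Instr)
        new._opcode = self._opcode
        new._arg = self._arg
        new._is_jump = self._is_jump
        return new

    def has_jump(self) -> bool:
        # A single slot read instead of a set lookup per call.
        return self._is_jump


def pack_location(buf: bytearray, code: int, length: int, payload: bytes) -> None:
    """Append one toy location entry to the shared buffer in place."""
    buf.append(0x80 | ((code & 0x0F) << 3) | ((length - 1) & 0x07))
    buf.extend(payload)


def assemble_locations(entries) -> bytes:
    """Pack all entries into one buffer and convert to bytes once."""
    buf = bytearray()  # the only allocation made for the whole table
    for code, length, payload in entries:
        pack_location(buf, code, length, payload)
    return bytes(buf)
```

The point of the sketch is the shape of the change: any constructor that skips the canonical setter has to carry the cached flag itself, and the packing helper returns nothing because its output lives in the caller's buffer.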
Codecov Report

✅ All modified and coverable lines are covered by tests.

Coverage Diff

| | main | #203 | +/- |
|---|---|---|---|
| Coverage | 95.45% | 95.45% | |
| Files | 7 | 7 | |
| Lines | 2132 | 2135 | +3 |
| Branches | 459 | 459 | |
| Hits | 2035 | 2038 | +3 |
| Misses | 54 | 54 | |
| Partials | 43 | 43 | |
MatthieuDartiailh approved these changes on May 28, 2026