feat(profiling): incremental delta export for heap tracker#18125
feat(profiling): incremental delta export for heap tracker#18125vlad-scherbich wants to merge 4 commits into
Conversation
Codeowners resolved as |
89c6829 to
1ee4530
Compare
2e4a3b9 to
2ab8114
Compare
|
@codex please reivew |
|
✨ Fix all issues with BitsAI or with Cursor
|
There was a problem hiding this comment.
Pull request overview
This PR changes the heap memory profiler export model from cumulative heap snapshots to incremental delta exports, including negative tombstones for freed allocations and periodic full-state resyncs.
Changes:
- Adds pending heap change tracking with ADD/REMOVE events in the native heap tracker.
- Adds negative heap sample export support to the profiling sample wrapper.
- Updates and expands memalloc tests for delta semantics, tombstones, steady state, churn, and resync behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
ddtrace/profiling/collector/_memalloc_heap.cpp |
Implements heap delta tracking, tombstone emission, and periodic resync logic. |
ddtrace/internal/datadog/profiling/dd_wrapper/src/sample.cpp |
Adds negative heap sample export behavior. |
ddtrace/internal/datadog/profiling/dd_wrapper/include/sample.hpp |
Declares the new negative sample export API. |
tests/profiling/collector/test_memalloc.py |
Updates existing heap assertions and adds delta export regression tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2ab8114437
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| flush_pending_to_pool_no_cpython(); | ||
| export_full_snapshot_no_cpython(); | ||
| snapshots_since_resync = 0; |
There was a problem hiding this comment.
Emit explicit reset marker before full resync snapshot
The resync branch flushes pending deltas and then emits allocs_m via plain export_sample() calls, but this payload is indistinguishable from normal positive ADD deltas. In a delta-integrating backend, that means the full live set is added on top of already-integrated state, so every resync interval can inflate heap totals instead of correcting drift. If resync is intended to recover from dropped uploads, this path needs an explicit reset/generation signal (or equivalent protocol change) before sending the full snapshot.
Useful? React with 👍 / 👎.
| pending_changes.clear(); | ||
| ++snapshots_since_resync; |
There was a problem hiding this comment.
Preserve pending deltas when sample export fails
In delta mode, export_sample() / export_sample_negative() return a success flag, but failures are ignored and pending_changes is always cleared afterward. If ddog_prof_Profile_add2 rejects any sample in this snapshot, those ADD/REMOVE deltas are dropped permanently (until a later resync), causing silent state drift and missing frees/allocations in the exported heap stream. The delta buffer should only be cleared after successful emission (or failed events should be retried).
Useful? React with 👍 / 👎.
176731d to
5a701ef
Compare
5a701ef to
e0c9c76
Compare
Each export now emits only what changed since the previous snapshot (positive samples for new allocations, negative-value tombstones for freed ones), with a periodic full-snapshot resync every 10 snapshots to bound backend drift if a delta upload is dropped. Adds Sample::export_sample_negative to the dd_wrapper for emitting tombstones; libdatadog's ddog_prof_Profile_add2 accepts negative i64 values end-to-end. Live tracebacks under untrack are moved into a REMOVE event so they stay alive until the next export, then returned to the pool. ADD events hold raw pointers into allocs_m; lifetime is guaranteed by either allocs_m or a subsequent REMOVE in the same pending buffer.
e0c9c76 to
6ba017f
Compare
|
@codex please review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6ba017f576
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (evt.kind == change_event::REMOVE) { | ||
| pool_put_no_cpython(std::move(evt.owner)); | ||
| } |
There was a problem hiding this comment.
Keep REMOVE-owned traceback alive while ADD retries remain
When an ADD event fails export and is retained for retry, it only keeps a raw tb pointer. If the matching REMOVE event later succeeds in the same pass, this branch immediately returns evt.owner to the pool, which clears/reuses the same traceback_t. On the next snapshot, the retained ADD retry dereferences a stale/recycled pointer, causing corrupted samples or a use-after-free style crash. This occurs whenever Profile_add2 rejects the ADD but accepts the later REMOVE for the same allocation sequence.
Useful? React with 👍 / 👎.
…a export For allocations that are tracked AND freed within the same snapshot interval, the previous delta path emitted both an ADD (positive) and a REMOVE (negative tombstone) — libdatadog aggregated them to a net-zero bucket but still paid two Profile_add2 calls per pair. On alloc-then-free hot loops (per the rapid_python_http_smoke_test handle_alloc_pressure workload) this dominated CPU and surfaced as a ~7% CPU regression with +111% Locked Time vs the full-snapshot baseline. Side map void* -> pending ADD index makes the untrack hot path O(1) for finding the matching ADD; when found, the ADD is flagged `collapsed` and skipped at export time, no REMOVE event is queued, and the traceback is returned to the pool immediately. Microbench Scenario D (balanced churn, 500 pairs / snapshot) drops from ~13.5 ms to ~39 us — ~340x faster than the no-collapse delta path and within ~4x of the full-snapshot baseline. Side map is cleared at the end of each export; retained-on-failure ADDs lose their collapse capability on the next snapshot (acceptable since retain is rare and the existing aggregation-to-zero fallback still works correctly without collapse). Cleared on fork alongside the rest of the delta state. Correctness is covered by the existing delta tests (churn-net-heap-is-zero, heavy-churn-stays-consistent, large-heap-overhead); they pass on this build because the wire-format contract is unchanged — collapse just skips emission of pairs the backend would aggregate to zero anyway.
Previously, a same-snapshot ADD/REMOVE collapse marked the ADD as 'collapsed' in place and left it in pending_changes; the export loop iterated and skipped these entries. Under sustained alloc-then-free churn that buffer fills with skipped entries — at 500 churn pairs per snapshot, ~12.5K wasted branch checks on each export pass. Swap-and-pop removes the canceled ADD from pending_changes at collapse time: the last element moves into the freed slot, and if it's another indexed ADD, its pending_add_idx entry is rewritten to the new position. pending_changes stays compact across churn and the collapsed flag on change_event is no longer needed. Microbench Scenario D (balanced churn, 500 pairs per snapshot): before: 39 us median after: 13 us median At 50 pairs the cost is now the same as 500 pairs (~14 us median), confirming snapshot work is decoupled from churn volume.
…dary When export retains events for retry (libdatadog Profile_add2 rejection), the previous code cleared pending_add_idx entirely, which meant retained ADDs lost their ability to collapse with a later REMOVE — the next snapshot would queue a separate tombstone for that ptr instead of canceling the still-pending ADD, paying an extra Profile_add2 call. Rebuild pending_add_idx after the std::move(retained) so ADD events that survive the retry boundary stay indexed by ptr. Cost is O(retained.size()) per export, which is bounded: retained events only appear on libdatadog rejection and are capped at MAX_EXPORT_RETRIES per event before being dropped.
Description
Switches the heap tracker's snapshot export from emitting the full live set every time to emitting only the deltas since the last snapshot: positive samples for newly-tracked allocations (ADD) and negative-value tombstones for freed ones (REMOVE). The backend (libdatadog → agent → UI) integrates deltas to compute the running live-heap view.
Why this matters:
allocs_mon every snapshot — work that's mostly redundant when live state is stable. Delta path is empty in steady state.n(seescripts/heap_export_microbench.py).Mechanics:
untrack_no_cpythonmoves the freed traceback'sunique_ptrinto a REMOVE event inpending_changes(kept alive until the next export emits the tombstone, then returned to the pool).add_sample_no_cpythonqueues an ADD event with a raw pointer to the live traceback inallocs_m.export_heap_no_cpythondrainspending_changes; for REMOVEs it negatesheap_spacelazily at emit time (guarded bytombstone_appliedso retries don't re-flip the sign) and emits.Sample::negate_heap_space()API splits the negate step from the emit step so a failed emit can be retried without toggling sign on every attempt.Testing
Six new unit tests in
tests/profiling/collector/test_memalloc.py:test_delta_export_emits_negative_tombstones_after_free— second snapshot after a free emits negative-value tombstones for the allocator stack.test_delta_export_steady_state_skips_unchanged_samples— unchanged live set => second snapshot has zero positive allocator samples.test_delta_export_churn_net_heap_is_zero— balanced alloc/free churn nets to zero (positives + negatives cancel via pprof aggregation).test_delta_export_no_resync_for_steady_state— 15 unchanging snapshots; none re-emit live state (no sneak resync; see 'Risks').test_delta_export_net_heap_is_zero_across_snapshots— drain workload across multiple snapshots converges to net zero, exercising the retain-on-failure retry path.test_delta_export_heavy_churn_stays_consistent— sustained churn that growspending_changespast its initial reserve doesn't crash, hang, or leak.Full memalloc suite: 31 passed, 5 skipped, no failures. Existing tests
test_memory_collector_python_interface_with_allocation_trackingand..._no_deletionupdated to reflect that snapshots now carry deltas (not cumulative live state).Risks
i64end-to-end (libdd-profiling-protobuf/src/sample.rs:29 — expliciti64, no sign validation). The agent and UI need verification before this can ship.allocs_mandpending_changes). Profiler restarts are common in production Python (process recycling, fork), so drift is bounded in practice — but it is not bounded by snapshot cadence. Adding a true resync requires either a backend-side reset marker or emitting compensating negative+positive pairs for every live entry; both are out of scope here.Profile_add2returns false), the event is retained inpending_changesfor retry on the next snapshot, up toMAX_EXPORT_RETRIES = 2. After that the event is dropped to bound buffer growth on persistent rejection. ADD and REMOVE drops still pair up (the partner event for the same ptr lives in the same buffer and follows the same retry cap), so the backend's running state stays consistent even under drops.Additional Notes
scripts/heap_export_microbench.pyfor measuring delta vs baseline snapshot CPU on this branch.