feat(profiling): incremental delta export for heap tracker #18125

Draft
vlad-scherbich wants to merge 4 commits into main from vlad/memalloc-incremental-export

Conversation

Contributor

vlad-scherbich commented May 17, 2026

Description

Switches the heap tracker's snapshot export from emitting the full live set every time to emitting only the deltas since the last snapshot: positive samples for newly-tracked allocations (ADD) and negative-value tombstones for freed ones (REMOVE). The backend (libdatadog → agent → UI) integrates deltas to compute the running live-heap view.
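
To make the integration step concrete, here is a minimal Python sketch of how a consumer might fold delta uploads into a running live-heap view. The `integrate` function, the `(stack, heap_space)` tuple shape, and the stack names are assumptions for illustration; they do not describe the actual libdatadog, agent, or UI code.

```python
from collections import defaultdict


def integrate(uploads):
    """Fold a sequence of delta uploads into a running live-heap view.

    Each upload is a list of (stack, heap_space) pairs. heap_space is
    positive for ADD samples and negative for REMOVE tombstones.
    """
    live = defaultdict(int)
    for upload in uploads:
        for stack, heap_space in upload:
            live[stack] += heap_space
    # Drop stacks whose allocations have all been freed.
    return {stack: size for stack, size in live.items() if size != 0}


uploads = [
    [("alloc_a", 640), ("alloc_b", 128)],  # snapshot 1: two new allocations
    [],                                    # snapshot 2: steady state, nothing sent
    [("alloc_a", -640)],                   # snapshot 3: alloc_a was freed
]
live_view = integrate(uploads)
print(live_view)  # {'alloc_b': 128}
```

Note that the steady-state upload is empty, which is where the CPU savings described below come from.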

Why this matters:

  • At the default sample_size, today's full export iterates over everything in allocs_m on every snapshot, work that is mostly redundant when the live state is stable. The delta path emits nothing in steady state.
  • Microbench shows ~5× CPU reduction in representative steady-state workloads and ~500× p90 snapshot-latency reduction at large n (see scripts/heap_export_microbench.py).

Mechanics:

  • untrack_no_cpython moves the freed traceback's unique_ptr into a REMOVE event in pending_changes (kept alive until the next export emits the tombstone, then returned to the pool).
  • add_sample_no_cpython queues an ADD event with a raw pointer to the live traceback in allocs_m.
  • export_heap_no_cpython drains pending_changes; for REMOVEs it negates heap_space lazily at emit time (guarded by tombstone_applied so retries don't re-flip the sign) and emits.
  • New Sample::negate_heap_space() API splits the negate step from the emit step so a failed emit can be retried without toggling sign on every attempt.
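
The mechanics above can be modeled in a few lines of Python. This is a simplified sketch, not the C++ implementation: `HeapTracker`, `ChangeEvent`, and the `emit` callback are hypothetical stand-ins, ownership of tracebacks is not modeled, and events carry values rather than pointers. It shows only the queue-then-drain flow and why the `tombstone_applied` guard keeps a retried tombstone from flipping sign twice.

```python
from dataclasses import dataclass

ADD, REMOVE = "ADD", "REMOVE"


@dataclass
class ChangeEvent:
    kind: str
    stack: str
    heap_space: int
    tombstone_applied: bool = False


class HeapTracker:
    def __init__(self, emit):
        self.allocs = {}           # ptr -> (stack, size); stands in for allocs_m
        self.pending_changes = []  # events queued since the last export
        self.emit = emit           # callable(stack, heap_space) -> bool

    def add_sample(self, ptr, stack, size):
        self.allocs[ptr] = (stack, size)
        self.pending_changes.append(ChangeEvent(ADD, stack, size))

    def untrack(self, ptr):
        entry = self.allocs.pop(ptr, None)
        if entry is not None:
            self.pending_changes.append(ChangeEvent(REMOVE, *entry))

    def export_heap(self):
        retained = []
        for evt in self.pending_changes:
            if evt.kind == REMOVE and not evt.tombstone_applied:
                evt.heap_space = -evt.heap_space  # negate once, lazily
                evt.tombstone_applied = True
            if not self.emit(evt.stack, evt.heap_space):
                retained.append(evt)              # keep for the next snapshot
        self.pending_changes = retained


emitted = []
fail_once = {"armed": True}


def flaky_emit(stack, heap_space):
    # Reject the first tombstone once to exercise the retry path.
    if heap_space < 0 and fail_once["armed"]:
        fail_once["armed"] = False
        return False
    emitted.append((stack, heap_space))
    return True


tracker = HeapTracker(flaky_emit)
tracker.add_sample(0x1000, "alloc_a", 640)
tracker.export_heap()   # emits +640
tracker.untrack(0x1000)
tracker.export_heap()   # tombstone rejected and retained
tracker.export_heap()   # retry emits -640, not +640
print(emitted)  # [('alloc_a', 640), ('alloc_a', -640)]
```

Without the guard, the second attempt would negate the value again and emit a spurious positive sample.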

Testing

Six new unit tests in tests/profiling/collector/test_memalloc.py:

  • test_delta_export_emits_negative_tombstones_after_free — second snapshot after a free emits negative-value tombstones for the allocator stack.
  • test_delta_export_steady_state_skips_unchanged_samples — unchanged live set => second snapshot has zero positive allocator samples.
  • test_delta_export_churn_net_heap_is_zero — balanced alloc/free churn nets to zero (positives + negatives cancel via pprof aggregation).
  • test_delta_export_no_resync_for_steady_state — 15 unchanging snapshots; none re-emit live state (no sneak resync; see 'Risks').
  • test_delta_export_net_heap_is_zero_across_snapshots — drain workload across multiple snapshots converges to net zero, exercising the retain-on-failure retry path.
  • test_delta_export_heavy_churn_stays_consistent — sustained churn that grows pending_changes past its initial reserve doesn't crash, hang, or leak.

Full memalloc suite: 31 passed, 5 skipped, no failures. Existing tests test_memory_collector_python_interface_with_allocation_tracking and ..._no_deletion updated to reflect that snapshots now carry deltas (not cumulative live state).

Risks

  1. Wire-format contract change (highest risk). The backend now sees deltas instead of full snapshots. Agent intake, pprof aggregation, and the heap-profile UI must accept negative-value heap-space samples and integrate across uploads. libdatadog already accepts negative i64 end-to-end (libdd-profiling-protobuf/src/sample.rs:29 — explicit i64, no sign validation). The agent and UI need verification before this can ship.
  2. No periodic resync. v1 does not emit a periodic full-snapshot — earlier drafts did, but emitting plain positives in a delta-integrating backend would double-count live state without a reset marker. Implication: if any delta upload is lost (network failure, agent crash mid-upload), the backend's running state diverges from live until the profiler restarts (which clears allocs_m and pending_changes). Profiler restarts are common in production Python (process recycling, fork), so drift is bounded in practice — but it is not bounded by snapshot cadence. Adding a true resync requires either a backend-side reset marker or emitting compensating negative+positive pairs for every live entry; both are out of scope here.
  3. Retain-on-failure with bounded retry. If libdatadog rejects a sample (Profile_add2 returns false), the event is retained in pending_changes for retry on the next snapshot, up to MAX_EXPORT_RETRIES = 2. After that the event is dropped to bound buffer growth on persistent rejection. ADD and REMOVE drops still pair up (the partner event for the same ptr lives in the same buffer and follows the same retry cap), so the backend's running state stays consistent even under drops.
  4. Cuckoo fast-reject filter not included. This PR is independent of the cuckoo-filter free-path optimization (PR perf(profiling): incremental data export in memalloc #18116). They compose cleanly and can land in either order.
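
As an illustration of risk 3, the bounded retry can be sketched as follows. The `drain` helper and the dict-based event shape are hypothetical, and the sketch assumes the cap means one initial attempt plus up to `MAX_EXPORT_RETRIES` retries; the PR text does not spell out the exact counting.

```python
MAX_EXPORT_RETRIES = 2  # value taken from the PR description


def drain(pending, emit):
    """Try to emit every pending event; return the events kept for the next snapshot.

    Each event is a dict with 'stack', 'heap_space' and a 'retries' counter.
    """
    retained = []
    for evt in pending:
        if emit(evt["stack"], evt["heap_space"]):
            continue                  # emitted, nothing to keep
        if evt["retries"] < MAX_EXPORT_RETRIES:
            evt["retries"] += 1
            retained.append(evt)      # retry on the next snapshot
        # Otherwise the retry budget is exhausted: drop to bound buffer growth.
    return retained


def always_reject(stack, heap_space):
    return False


pending = [{"stack": "alloc_a", "heap_space": 640, "retries": 0}]
sizes = []
for _ in range(4):
    pending = drain(pending, always_reject)
    sizes.append(len(pending))
print(sizes)  # [1, 1, 0, 0]
```

Under persistent rejection the buffer empties after the third snapshot instead of growing without bound.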

Additional Notes

  • Microbench at scripts/heap_export_microbench.py for measuring delta vs baseline snapshot CPU on this branch.
  • Addresses Copilot review feedback (vector realloc safety in hook path) and chatgpt-codex-connector review feedback (preserve pending deltas on export failure, no silent drop on rejection).

cit-pr-commenter-54b7da Bot commented May 17, 2026

Codeowners resolved as

ddtrace/profiling/collector/_memalloc_heap.cpp                          @DataDog/profiling-python

@vlad-scherbich vlad-scherbich force-pushed the vlad/memalloc-support-mem-domain branch from 89c6829 to 1ee4530 Compare May 18, 2026 02:07
@vlad-scherbich vlad-scherbich changed the base branch from vlad/memalloc-support-mem-domain to main May 18, 2026 13:32
@vlad-scherbich vlad-scherbich changed the base branch from main to vlad/memalloc-support-mem-domain May 18, 2026 13:32
@vlad-scherbich vlad-scherbich force-pushed the vlad/memalloc-incremental-export branch from 2e4a3b9 to 2ab8114 Compare May 18, 2026 13:53
@vlad-scherbich
Contributor Author

@codex please review

Contributor

datadog-datadog-prod-us1 Bot commented May 18, 2026

Tests

⚠️ Warnings

❄️ 2 New flaky tests detected

test_delta_export_churn_net_heap_is_zero[py3.10] from test_memalloc.py
net heap-space for allocator stack should be 0 after balanced churn, got 640 (delta path may have skipped REMOVE tombstones)
assert 640 == 0

New test introduced in this PR is flaky.

test_delta_export_churn_net_heap_is_zero[py3.9] from test_memalloc.py
net heap-space for allocator stack should be 0 after balanced churn, got 648 (delta path may have skipped REMOVE tombstones)
assert 648 == 0

New test introduced in this PR is flaky.

ℹ️ Info

No other issues found

🧪 All tests passed

Commit SHA: deef68f

Contributor

Copilot AI left a comment

Pull request overview

This PR changes the heap memory profiler export model from cumulative heap snapshots to incremental delta exports, including negative tombstones for freed allocations and periodic full-state resyncs.

Changes:

  • Adds pending heap change tracking with ADD/REMOVE events in the native heap tracker.
  • Adds negative heap sample export support to the profiling sample wrapper.
  • Updates and expands memalloc tests for delta semantics, tombstones, steady state, churn, and resync behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| ddtrace/profiling/collector/_memalloc_heap.cpp | Implements heap delta tracking, tombstone emission, and periodic resync logic. |
| ddtrace/internal/datadog/profiling/dd_wrapper/src/sample.cpp | Adds negative heap sample export behavior. |
| ddtrace/internal/datadog/profiling/dd_wrapper/include/sample.hpp | Declares the new negative sample export API. |
| tests/profiling/collector/test_memalloc.py | Updates existing heap assertions and adds delta export regression tests. |

Comment thread ddtrace/profiling/collector/_memalloc_heap.cpp Outdated
Comment thread ddtrace/profiling/collector/_memalloc_heap.cpp Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ab8114437

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +350 to +352
flush_pending_to_pool_no_cpython();
export_full_snapshot_no_cpython();
snapshots_since_resync = 0;

P1: Emit explicit reset marker before full resync snapshot

The resync branch flushes pending deltas and then emits allocs_m via plain export_sample() calls, but this payload is indistinguishable from normal positive ADD deltas. In a delta-integrating backend, that means the full live set is added on top of already-integrated state, so every resync interval can inflate heap totals instead of correcting drift. If resync is intended to recover from dropped uploads, this path needs an explicit reset/generation signal (or equivalent protocol change) before sending the full snapshot.
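
The double-counting concern can be reproduced with a small Python model of a delta-integrating backend. The `integrate` function and the idea of clearing state on a reset marker are assumptions for illustration; no such marker exists in the wire format discussed here.

```python
def integrate(state, upload):
    """Apply one delta upload to the backend's running per-stack totals."""
    for stack, heap_space in upload:
        state[stack] = state.get(stack, 0) + heap_space
    return state


live_heap = {"alloc_a": 640, "alloc_b": 128}  # what the process really holds

backend = {}
integrate(backend, [("alloc_a", 640), ("alloc_b", 128)])  # initial ADD deltas
assert backend == live_heap

# Resync without a reset marker: the full live set arrives as plain positives
# and is added on top of the state that was already integrated.
integrate(backend, list(live_heap.items()))
print(backend)  # {'alloc_a': 1280, 'alloc_b': 256}

# With a hypothetical reset marker, the backend would clear its state first.
backend_with_reset = dict(backend)
backend_with_reset.clear()
integrate(backend_with_reset, list(live_heap.items()))
print(backend_with_reset)  # {'alloc_a': 640, 'alloc_b': 128}
```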

Comment on lines +365 to +366
pending_changes.clear();
++snapshots_since_resync;

P2: Preserve pending deltas when sample export fails

In delta mode, export_sample() / export_sample_negative() return a success flag, but failures are ignored and pending_changes is always cleared afterward. If ddog_prof_Profile_add2 rejects any sample in this snapshot, those ADD/REMOVE deltas are dropped permanently (until a later resync), causing silent state drift and missing frees/allocations in the exported heap stream. The delta buffer should only be cleared after successful emission (or failed events should be retried).

Base automatically changed from vlad/memalloc-support-mem-domain to main May 18, 2026 16:33
@vlad-scherbich vlad-scherbich added the changelog/no-changelog (a changelog entry is not required for this PR) and Profiling (Continuous Profiling) labels May 19, 2026
@vlad-scherbich vlad-scherbich force-pushed the vlad/memalloc-incremental-export branch 2 times, most recently from 176731d to 5a701ef Compare May 19, 2026 18:45
@vlad-scherbich vlad-scherbich requested a review from Copilot May 19, 2026 19:02
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread ddtrace/profiling/collector/_memalloc_heap.cpp
@vlad-scherbich vlad-scherbich force-pushed the vlad/memalloc-incremental-export branch from 5a701ef to e0c9c76 Compare May 19, 2026 20:02
Each export now emits only what changed since the previous snapshot
(positive samples for new allocations, negative-value tombstones for
freed ones), with a periodic full-snapshot resync every 10 snapshots
to bound backend drift if a delta upload is dropped.

Adds Sample::export_sample_negative to the dd_wrapper for emitting
tombstones; libdatadog's ddog_prof_Profile_add2 accepts negative i64
values end-to-end.

Live tracebacks under untrack are moved into a REMOVE event so they
stay alive until the next export, then returned to the pool. ADD
events hold raw pointers into allocs_m; lifetime is guaranteed by
either allocs_m or a subsequent REMOVE in the same pending buffer.
@vlad-scherbich vlad-scherbich force-pushed the vlad/memalloc-incremental-export branch from e0c9c76 to 6ba017f Compare May 19, 2026 20:03
@vlad-scherbich vlad-scherbich requested a review from Copilot May 19, 2026 20:19
@vlad-scherbich
Copy link
Copy Markdown
Contributor Author

@codex please review

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ba017f576

Comment on lines +371 to +373
if (evt.kind == change_event::REMOVE) {
    pool_put_no_cpython(std::move(evt.owner));
}

P1: Keep REMOVE-owned traceback alive while ADD retries remain

When an ADD event fails export and is retained for retry, it only keeps a raw tb pointer. If the matching REMOVE event later succeeds in the same pass, this branch immediately returns evt.owner to the pool, which clears/reuses the same traceback_t. On the next snapshot, the retained ADD retry dereferences a stale/recycled pointer, causing corrupted samples or a use-after-free style crash. This occurs whenever Profile_add2 rejects the ADD but accepts the later REMOVE for the same allocation sequence.
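
One possible way to address this, sketched in Python under stated assumptions: before returning an emitted REMOVE's owner to the pool, check whether a retained ADD still refers to the same allocation and defer the release if so. The `release_emitted_removes` helper and the dict-based event shape are hypothetical; this is not necessarily how the PR resolved the issue.

```python
def release_emitted_removes(emitted_removes, retained_adds, pool):
    """Recycle tracebacks owned by emitted REMOVE events.

    A traceback is kept out of the pool while a retained ADD for the same
    ptr still needs it. Returns the REMOVE events whose release was deferred.
    """
    retained_ptrs = {evt["ptr"] for evt in retained_adds}
    deferred = []
    for evt in emitted_removes:
        if evt["ptr"] in retained_ptrs:
            deferred.append(evt)       # keep the owner alive for the ADD retry
        else:
            pool.append(evt["owner"])  # safe to recycle
    return deferred


pool = []
tb_a = {"frames": ["alloc_a"]}
tb_b = {"frames": ["alloc_b"]}
retained_adds = [{"kind": "ADD", "ptr": 0x1000, "tb": tb_a}]
emitted_removes = [
    {"kind": "REMOVE", "ptr": 0x1000, "owner": tb_a},
    {"kind": "REMOVE", "ptr": 0x2000, "owner": tb_b},
]
deferred = release_emitted_removes(emitted_removes, retained_adds, pool)
print(len(pool), len(deferred))  # 1 1
```

Only `tb_b` is recycled; `tb_a` stays alive until the retained ADD for `0x1000` is emitted or dropped.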

…a export

For allocations that are tracked AND freed within the same snapshot
interval, the previous delta path emitted both an ADD (positive) and a
REMOVE (negative tombstone) — libdatadog aggregated them to a net-zero
bucket but still paid two Profile_add2 calls per pair. On alloc-then-free
hot loops (per the rapid_python_http_smoke_test handle_alloc_pressure
workload) this dominated CPU and surfaced as a ~7% CPU regression with
+111% Locked Time vs the full-snapshot baseline.

A side map from void* to pending-ADD index makes the untrack hot path O(1)
for finding the matching ADD; when found, the ADD is flagged `collapsed` and
skipped at export time, no REMOVE event is queued, and the traceback is
returned to the pool immediately. Microbench Scenario D (balanced churn,
500 pairs / snapshot) drops from ~13.5 ms to ~39 us, which is ~340x faster
than the no-collapse delta path and within ~4x of the full-snapshot baseline.

Side map is cleared at the end of each export; retained-on-failure ADDs
lose their collapse capability on the next snapshot (acceptable since
retain is rare and the existing aggregation-to-zero fallback still works
correctly without collapse). Cleared on fork alongside the rest of the
delta state.

Correctness is covered by the existing delta tests (churn-net-heap-is-zero,
heavy-churn-stays-consistent, large-heap-overhead); they pass on this
build because the wire-format contract is unchanged — collapse just skips
emission of pairs the backend would aggregate to zero anyway.

Previously, a same-snapshot ADD/REMOVE collapse marked the ADD as
'collapsed' in place and left it in pending_changes; the export loop
iterated and skipped these entries. Under sustained alloc-then-free
churn that buffer fills with skipped entries — at 500 churn pairs per
snapshot, ~12.5K wasted branch checks on each export pass.

Swap-and-pop removes the canceled ADD from pending_changes at collapse
time: the last element moves into the freed slot, and if it's another
indexed ADD, its pending_add_idx entry is rewritten to the new position.
pending_changes stays compact across churn and the collapsed flag on
change_event is no longer needed.

Microbench Scenario D (balanced churn, 500 pairs per snapshot):
  before: 39 us median
  after:  13 us median
At 50 pairs the cost is now the same as 500 pairs (~14 us median),
confirming snapshot work is decoupled from churn volume.
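
The swap-and-pop collapse described in this commit message can be sketched in Python as follows. `PendingBuffer` and the `(kind, ptr)` tuple shape are hypothetical simplifications; the names `pending_changes` and `pending_add_idx` mirror the commit message, but the real C++ structures also carry tracebacks and retry state.

```python
class PendingBuffer:
    def __init__(self):
        self.pending_changes = []  # list of (kind, ptr) tuples
        self.pending_add_idx = {}  # ptr -> index of its ADD in pending_changes

    def add(self, ptr):
        self.pending_add_idx[ptr] = len(self.pending_changes)
        self.pending_changes.append(("ADD", ptr))

    def remove(self, ptr):
        idx = self.pending_add_idx.pop(ptr, None)
        if idx is None:
            # No pending ADD in this interval: queue a tombstone.
            self.pending_changes.append(("REMOVE", ptr))
            return
        # Collapse: move the last element into the cancelled ADD's slot.
        last = self.pending_changes.pop()
        old_last_pos = len(self.pending_changes)
        if idx < old_last_pos:
            self.pending_changes[idx] = last
            kind, moved_ptr = last
            # Rewrite the index only if the moved element is an indexed ADD.
            if kind == "ADD" and self.pending_add_idx.get(moved_ptr) == old_last_pos:
                self.pending_add_idx[moved_ptr] = idx


buf = PendingBuffer()
buf.add(1)
buf.add(2)
buf.add(3)
buf.remove(1)  # cancels ADD 1; ADD 3 moves into slot 0
buf.remove(9)  # ptr 9 was allocated in an earlier interval: tombstone
buf.remove(2)  # cancels ADD 2; the tombstone moves into slot 1
print(buf.pending_changes)  # [('ADD', 3), ('REMOVE', 9)]
print(buf.pending_add_idx)  # {3: 0}
```

The buffer length after the three removals depends only on the net change, which matches the claim that snapshot work is decoupled from churn volume.
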
…dary

When export retains events for retry (libdatadog Profile_add2 rejection),
the previous code cleared pending_add_idx entirely, which meant retained
ADDs lost their ability to collapse with a later REMOVE — the next
snapshot would queue a separate tombstone for that ptr instead of
canceling the still-pending ADD, paying an extra Profile_add2 call.

Rebuild pending_add_idx after the std::move(retained) so ADD events that
survive the retry boundary stay indexed by ptr. Cost is O(retained.size())
per export, which is bounded: retained events only appear on libdatadog
rejection and are capped at MAX_EXPORT_RETRIES per event before being
dropped.
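
A minimal Python sketch of the rebuild step, assuming events are `(kind, ptr)` tuples and that `rebuild_pending_add_idx` is a hypothetical helper name:

```python
def rebuild_pending_add_idx(retained):
    """Re-index ADD events that survived the export so a later REMOVE can collapse them."""
    return {ptr: i for i, (kind, ptr) in enumerate(retained) if kind == "ADD"}


retained = [("REMOVE", 9), ("ADD", 3), ("ADD", 7)]
pending_add_idx = rebuild_pending_add_idx(retained)
print(pending_add_idx)  # {3: 1, 7: 2}
```

The single pass over `retained` is where the O(retained.size()) cost per export comes from.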

Labels

changelog/no-changelog (a changelog entry is not required for this PR), Profiling (Continuous Profiling)

2 participants