feat: MergeboxCollector — attribute Meteor mergebox RAM residency#19

Merged
mvogttech merged 4 commits into main from feat/mergebox-collector on Jun 19, 2026

@mvogttech

Why

The mergebox — Meteor's per-session, server-side cache of every published document — is the single largest and least-visible consumer of server RAM in pub/sub-heavy apps, and the usual cause of OOMs. Nothing measured it before. This collector attributes mergebox RAM residency to the publications/collections holding it, so SkySignal can recommend switching hot publications to NO_MERGE.

The server side (ingest endpoint, MergeboxService, System → Mergebox UI) already exists in SkySignalAPM/core #55; this PR ships the agent half that produces the data.

Where the bytes come from (verified against ddp-server 3.x)

Meteor.server.sessions → session.collectionViews → SessionCollectionView.documents → SessionDocumentView.dataByKey (the resident field values, precedenceList[0].value) + existsIn (the subs referencing each doc). Subscription._documents is refcount-only (doc-id sets, no bytes) and is not used.
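The walk described above can be sketched against mock data. This is an illustration only, assuming the internal shapes are Maps and Sets as named in this description; `estimateValueBytes` and `walkSessionResidency` are hypothetical names, and sizing via JSON length is a stand-in for whatever estimator the collector actually uses.

```javascript
// Hypothetical sizing helper: UTF-8 length of the JSON encoding.
// Values JSON cannot encode (e.g. undefined) count as 0 bytes.
function estimateValueBytes(value) {
  const json = JSON.stringify(value);
  return json === undefined ? 0 : Buffer.byteLength(json, "utf8");
}

// Walk one session: collectionViews -> documents -> dataByKey / existsIn.
// Every access is guarded so a shape mismatch yields no rows, not an error.
function walkSessionResidency(session) {
  const rows = [];
  const views = session && session.collectionViews;
  if (!(views instanceof Map)) return rows;

  for (const [collectionName, view] of views) {
    if (!view || !(view.documents instanceof Map)) continue;
    for (const [docId, docView] of view.documents) {
      if (!docView) continue;
      let bytes = 0;
      if (docView.dataByKey instanceof Map) {
        for (const precedenceList of docView.dataByKey.values()) {
          // The resident value is the first precedence entry.
          if (Array.isArray(precedenceList) && precedenceList.length > 0) {
            bytes += estimateValueBytes(precedenceList[0].value);
          }
        }
      }
      const handles =
        docView.existsIn instanceof Set ? [...docView.existsIn] : [];
      rows.push({ collectionName, docId, bytes, handles });
    }
  }
  return rows;
}

// Mock session standing in for one entry of Meteor.server.sessions.
const mockSession = {
  collectionViews: new Map([
    ["tasks", {
      documents: new Map([
        ["t1", {
          existsIn: new Set(["sub1", "sub2"]),
          dataByKey: new Map([
            ["title", [{ subscriptionHandle: "sub1", value: "abc" }]],
            ["done", [{ subscriptionHandle: "sub1", value: true }]],
          ]),
        }],
      ]),
    }],
  ]),
};

const residency = walkSessionResidency(mockSession);
```

With the mock above, the single `tasks` document is sized from its two field values and carries both subscription handles forward for attribution.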

Attribution — pure even-split (sum-preserving)

Each resident doc's bytes/docCount are divided across its existsIn handles and bucketed by (publicationName, collectionName). Because the split divides by existsIn.size, the rows for a collection sum back to its true residency, so the server's existing $sum aggregations reconstruct exact per-collection / per-strategy / site totals with no special "truth" row (a truth row would have double-counted). Per-publication bytes are therefore an attribution of shared docs; per-collection is exact.
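A minimal sketch of the even-split bucketing described above, not the collector's code. It assumes each doc has already been sized and that its handles have been resolved to publication names; `attributeEvenSplit` and the sample publication names are hypothetical.

```javascript
// Divide each doc's bytes and doc count evenly across the handles that
// reference it, bucketing by (publicationName, collectionName).
function attributeEvenSplit(docs) {
  const buckets = new Map();
  for (const doc of docs) {
    const n = doc.handles.length;
    if (n === 0) continue; // nothing references the doc; nothing to attribute
    for (const publicationName of doc.handles) {
      const key = `${publicationName}\u0000${doc.collectionName}`;
      let bucket = buckets.get(key);
      if (!bucket) {
        bucket = {
          publicationName,
          collectionName: doc.collectionName,
          bytes: 0,
          docCount: 0,
        };
        buckets.set(key, bucket);
      }
      bucket.bytes += doc.bytes / n;
      bucket.docCount += 1 / n;
    }
  }
  return [...buckets.values()];
}

const splitRows = attributeEvenSplit([
  { collectionName: "tasks", bytes: 120, handles: ["tasks.mine", "tasks.team"] },
  { collectionName: "tasks", bytes: 80, handles: ["tasks.mine"] },
]);

// Summing the rows for the collection recovers its true residency (200).
const tasksTotal = splitRows.reduce((sum, row) => sum + row.bytes, 0);
```

Here `tasks.mine` is attributed 140 bytes and `tasks.team` 60 bytes: the per-publication figures are an attribution of the shared doc, while the per-collection sum is the real total.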

Strategy awareness

Read via getPublicationStrategy() and reverse-mapped by identity against DDPServer.publicationStrategies → SERVER_MERGE / NO_MERGE / NO_MERGE_NO_HISTORY (NO_MERGE_MULTI and anything unrecognized → unknown, with a structural fallback). NO_MERGE / NO_MERGE_NO_HISTORY keep no collectionView at all, so they correctly read ~0 residency; that absence is the optimization signal and is never synthesized.

Safety

  • Read-only snapshot: never wraps session.send/processMessage, which avoids the double-wrap stack overflow from issue #7 (v1.0.16 still showing "DDPQueueCollector: Error in wrapped unblock: RangeError: Maximum call stack size exceeded" error).
  • Default OFF (collectMergebox), low 60s cadence with staggered start.
  • Per-session sampling (mergeboxSampleRate, stamped on each row for server-side extrapolation), maxSessions / maxDocsPerSession caps, and a top-N row cap aligned to the 500-rows/POST limit.
  • Per-session try/catch and feature-detection that degrades to zero rows on any Meteor shape mismatch — these are undocumented internal structures, so every access is guarded.
  • connectionCount is a count of distinct sessions, never a list of connection ids.
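The sampling, cap, and per-session guard from the list above could fit together roughly as follows. This is a sketch under assumptions: `collectSampled`, its option names, and the injected `measure` and `random` functions are hypothetical, chosen so the behaviour can be checked deterministically.

```javascript
// Visit sessions with a sampling rate and a hard cap, isolating failures
// per session so one malformed session cannot fail the whole tick.
function collectSampled(sessions, options) {
  const {
    sampleRate = 1,
    maxSessions = Infinity,
    random = Math.random,
    measure,
  } = options;
  const rows = [];
  let sampled = 0;

  for (const session of sessions.values()) {
    if (sampled >= maxSessions) break;
    // Skip this session with probability (1 - sampleRate).
    if (sampleRate < 1 && random() >= sampleRate) continue;
    sampled++;
    try {
      // Stamp the rate on the row so a consumer can extrapolate later.
      rows.push({ ...measure(session), sampleRate });
    } catch (err) {
      // Shape mismatch or other failure: this session contributes no row.
    }
  }
  return rows;
}

// Mock sessions; the second one simulates an unexpected internal shape.
const mockSessions = new Map([
  ["a", { bytes: 10 }],
  ["b", { bad: true }],
  ["c", { bytes: 30 }],
]);

const measureSession = (session) => {
  if (session.bad) throw new TypeError("unexpected session shape");
  return { bytesHeld: session.bytes };
};

const allRows = collectSampled(mockSessions, { measure: measureSession });
```

Injecting `random` keeps the sampling decision testable; in production the default `Math.random` would be used.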

Changes

  • New lib/collectors/MergeboxCollector.js + tests/unit/collectors/MergeboxCollector.test.js.
  • SkySignalClient.addMergeboxMetric + /api/v1/metrics/mergebox batch wiring.
  • collectMergebox / mergeboxInterval / mergeboxSampleRate / cap options in config.js + env.js, gated startup in skysignal-agent.js.
  • AGENT_VERSION + package.js → 1.1.0.

Tests

Unit tests cover the load-bearing invariants: a SERVER_MERGE doc shared by two subs even-splits and the rows sum back to full bytes; auto-publish (U) handles omit publicationName; NO_MERGE collections emit no row; connectionCount counts distinct sessions; sampleRate<1 stamps the rate; feature-detection skips malformed sessions; strategy reverse-map + DummyDocumentView (docCount>0/bytes~0). Not run here — npm test (mocha) executes them.

Follow-up (core, separate release): bump LATEST_AGENT_VERSION to 1.1.0 in VersionService.js once this is published.

Adds a new collector that measures Meteor's mergebox (the per-session,
server-side cache of published documents) and reports per-(publication,
collection) RAM residency to POST /api/v1/metrics/mergebox. The mergebox is the
single largest and least-visible consumer of server RAM in pub/sub-heavy apps,
and nothing measured it before.

Where the bytes come from (ddp-server internals, verified against 3.x):
Meteor.server.sessions -> session.collectionViews -> SessionCollectionView.documents
-> SessionDocumentView.dataByKey (the resident field values) + existsIn (the subs
referencing each doc). Subscription._documents is refcount-only and is NOT used.

Attribution is pure even-split: each resident doc's bytes/docCount are divided
across its existsIn handles and bucketed by (publicationName, collectionName).
Because the split divides by existsIn.size, the rows for a collection sum back to
its true residency (sum-preserving) — so the server's existing $sum aggregations
reconstruct exact per-collection/strategy/site totals with no special "truth"
row. Per-publication bytes are therefore an attribution (shared docs), per
collection is exact.

publicationStrategy is read via getPublicationStrategy() and reverse-mapped by
identity (SERVER_MERGE / NO_MERGE / NO_MERGE_NO_HISTORY; NO_MERGE_MULTI and
anything else -> unknown). NO_MERGE / NO_MERGE_NO_HISTORY keep no collectionView,
so they correctly read ~0 residency — that absence is the optimization signal.

Safety: read-only snapshot (never wraps session.send/processMessage, avoiding the
bug #7 double-wrap), default OFF (collectMergebox), low 60s cadence with staggered
start, per-session sampling (mergeboxSampleRate, extrapolated server-side),
maxSessions / maxDocsPerSession caps, top-N row cap aligned to the 500/POST limit,
per-session try/catch, and feature-detection that degrades to zero rows on any
Meteor shape mismatch.

Wires collectMergebox / mergeboxInterval / mergeboxSampleRate / cap options
through config.js + env.js, adds SkySignalClient.addMergeboxMetric +
/api/v1/metrics/mergebox batching, gates startup in skysignal-agent.js, and bumps
AGENT_VERSION + package version to 1.1.0. The core ingest endpoint, service, and
System > Mergebox UI already exist (SkySignalAPM/core PR #55).

The collector was mapping NO_MERGE_MULTI to "unknown", which conflated a known
Meteor strategy with "couldn't read the strategy". Resolve all four publication
strategies: add the NO_MERGE_MULTI identity match and the structural shape
(useCollectionView + doAccountingForCollection + useDummyDocumentView). "unknown"
is now reserved for a genuinely unreadable/unrecognized strategy. Tests updated.
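The identity match with a structural fallback might look like the sketch below. The `publicationStrategies` object here is a local stand-in for DDPServer.publicationStrategies, and its flag values are assumptions for illustration; `resolveStrategyName` is a hypothetical name.

```javascript
// Local stand-in for DDPServer.publicationStrategies (assumed flag values).
const publicationStrategies = {
  SERVER_MERGE: {
    useDummyDocumentView: false,
    useCollectionView: true,
    doAccountingForCollection: true,
  },
  NO_MERGE_NO_HISTORY: {
    useDummyDocumentView: false,
    useCollectionView: false,
    doAccountingForCollection: false,
  },
  NO_MERGE: {
    useDummyDocumentView: false,
    useCollectionView: false,
    doAccountingForCollection: true,
  },
  NO_MERGE_MULTI: {
    useDummyDocumentView: true,
    useCollectionView: true,
    doAccountingForCollection: true,
  },
};

const STRATEGY_FLAGS = [
  "useCollectionView",
  "doAccountingForCollection",
  "useDummyDocumentView",
];

// Map a strategy object back to its name: identity first, then shape.
function resolveStrategyName(strategy, known = publicationStrategies) {
  if (!strategy || typeof strategy !== "object") return "unknown";

  // 1. Identity: the object is one of the known strategy constants.
  for (const [name, ref] of Object.entries(known)) {
    if (ref === strategy) return name;
  }
  // 2. Structure: a different object carrying the same three flags.
  for (const [name, ref] of Object.entries(known)) {
    if (STRATEGY_FLAGS.every((flag) => ref[flag] === strategy[flag])) {
      return name;
    }
  }
  // 3. Unreadable or unrecognized.
  return "unknown";
}

const byIdentity = resolveStrategyName(publicationStrategies.NO_MERGE_MULTI);
const byShape = resolveStrategyName({ ...publicationStrategies.NO_MERGE });
```

An object missing any of the three flags matches no known strategy, so "unknown" stays reserved for strategies that cannot be read or recognized.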

Copilot AI left a comment

Pull request overview

Adds an agent-side MergeboxCollector to measure and attribute Meteor mergebox RAM residency (per-session server-side published-doc cache) and ship rollups to the SkySignal ingest endpoint, enabling server/UI recommendations like switching hot pubs to NO_MERGE.

Changes:

  • Introduces MergeboxCollector with per-session sampling, caps, feature-detection, and strategy reverse-mapping.
  • Wires a new mergebox metrics batch type and endpoint (/api/v1/metrics/mergebox) into SkySignalClient.
  • Adds config/env support, startup gating, unit tests, and bumps agent version to 1.1.0.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| lib/collectors/MergeboxCollector.js | New collector that snapshots mergebox residency and emits per-(publication, collection) rollups. |
| tests/unit/collectors/MergeboxCollector.test.js | Unit tests covering attribution, strategy mapping, sampling, caps, and row shape. |
| lib/SkySignalClient.js | Adds mergebox batching + addMergeboxMetric() helper. |
| tests/unit/client/SkySignalClient.test.js | Extends client tests for new batch type and endpoint. |
| lib/config.js | Adds mergebox config defaults + validation. |
| lib/env.js | Adds mergebox env var mappings. |
| skysignal-agent.js | Gates MergeboxCollector startup behind collectMergebox and passes config through. |
| lib/collectors/SystemMetricsCollector.js | Updates AGENT_VERSION to 1.1.0. |
| package.js | Bumps Meteor package version to 1.1.0. |
| CHANGELOG.md | Documents the 1.1.0 release and mergebox collector feature. |

Comment thread lib/collectors/MergeboxCollector.js Outdated
Comment on lines +251 to +266
    const bytesShare = docBytes / n;
    const docShare = 1 / n;
    const fieldShare = fieldCount / n;

    for (const handle of handles) {
      const publicationName = this._resolvePublicationName(session, handle);
      this._addToBucket(buckets, {
        publicationName,
        collectionName,
        strategy,
        bytesShare,
        docShare,
        fieldShare,
        session
      });
    }

Fixed in 28987d6. Byte attribution now uses an integer largest-remainder split per doc (the first docBytes % n handles get +1 byte), accumulated as integers with no per-bucket rounding — so per-handle shares sum back to docBytes exactly and per-collection residency is exactly sum-preserving for any byte total, not just divisible ones. Added a 3-way non-divisible-bytes test asserting the rows sum to fullBytes and differ by at most 1.
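The integer largest-remainder split described in this reply can be illustrated with a short sketch. `splitBytesExact` is a hypothetical name, not the function in 28987d6; it only shows why the shares always sum back to docBytes.

```javascript
// Split docBytes across n handles as integers: everyone gets the floor,
// and the first (docBytes % n) handles get one extra byte.
function splitBytesExact(docBytes, n) {
  if (!Number.isInteger(docBytes) || docBytes < 0) {
    throw new RangeError("docBytes must be a non-negative integer");
  }
  if (!Number.isInteger(n) || n <= 0) {
    throw new RangeError("n must be a positive integer");
  }
  const base = Math.floor(docBytes / n);
  const remainder = docBytes % n;
  return Array.from({ length: n }, (_, i) => base + (i < remainder ? 1 : 0));
}

const shares = splitBytesExact(100, 3); // [34, 33, 33]
const sharesTotal = shares.reduce((sum, share) => sum + share, 0); // 100
```

Because `base * n + remainder === docBytes` by definition of integer division, no rounding step is needed and any two shares differ by at most 1.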

Comment on lines +325 to +338
    const row = {
      host: this.host,
      appVersion: this.appVersion,
      timestamp,
      windowStart: windowStartDate,
      windowEnd: windowEndDate,
      collectionName: b.collectionName,
      strategy: b.strategy,
      bytesHeld,
      docCount: Math.round(b.docCount),
      fieldCount: Math.round(b.fieldCount),
      connectionCount,
      sampleRate: this.sampleRate
    };

Fixed in 28987d6. Rows now include buildHash when the agent resolved one (omitted otherwise, like publicationName). Added the optional buildHash field to the core mergebox_metrics schema (SkySignalAPM/core ffeb8d5) so it is first-class alongside appVersion, and added present/omitted tests.

Comment thread lib/SkySignalClient.js Outdated
Comment thread CHANGELOG.md Outdated

### v1.1.0 (Mergebox RAM Residency Collector)

- **New `MergeboxCollector`** - Measures Meteor's MERGEBOX RAM residency (the per-session, server-side cache of published documents) and posts per-(publication, collection) rollups to `POST /api/v1/metrics/mergebox`. The collector walks `Meteor.server.sessions` read-only, estimates the resident bytes each session's mergebox holds per published collection (sizing each `SessionDocumentView.dataByKey` field value directly), reads the publication strategy via `Meteor.server.getPublicationStrategy()` (reverse-mapped to `SERVER_MERGE` / `NO_MERGE` / `NO_MERGE_NO_HISTORY` / `unknown`), and attributes residency to subscriptions via a pure even-split across `existsIn`. The even-split is sum-preserving: the rows for a collection sum back to that collection's true residency. `connectionCount` is a count of distinct DDP sessions (never a list of connection ids).

Already addressed in 84f8eb1 (pushed before this review round was applied) — the v1.1.0 entry now reads "reverse-mapped to all four Meteor strategies — SERVER_MERGE / NO_MERGE / NO_MERGE_NO_HISTORY / NO_MERGE_MULTI; unknown only when the strategy genuinely can't be read".

Comment thread lib/collectors/MergeboxCollector.js Outdated
Comment on lines +209 to +214
    documents.forEach((docView) => {
      if (docsWalked >= this.maxDocsPerSession) {
        // Bound the tick; remaining docs in this session are skipped.
        return;
      }
      docsWalked++;

Fixed in 28987d6. The walk is now for...of over collectionViews and documents.values() with break (plus a top-of-collection-loop check), so maxDocsPerSession actually stops iteration instead of doing a cheap check on every remaining entry. Added a test with a 5-doc session and cap=2 asserting only 2 docs are walked.
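The difference between the two loop forms can be reproduced in isolation. The sketch below uses a 5-entry Map and a cap of 2, mirroring the test described in this reply; the variable names are illustrative only.

```javascript
const documents = new Map([
  ["d1", {}], ["d2", {}], ["d3", {}], ["d4", {}], ["d5", {}],
]);
const maxDocsPerSession = 2;

// Map#forEach: `return` only exits the current callback, so the callback
// still runs once for every remaining entry.
let forEachCalls = 0;
let forEachWalked = 0;
documents.forEach(() => {
  forEachCalls++;
  if (forEachWalked >= maxDocsPerSession) return;
  forEachWalked++;
});

// for...of: `break` ends the iteration itself. The third pass only
// evaluates the cap check and then stops.
let forOfPasses = 0;
let forOfWalked = 0;
for (const _docView of documents.values()) {
  forOfPasses++;
  if (forOfWalked >= maxDocsPerSession) break;
  forOfWalked++;
}
```

Both forms walk 2 docs, but `forEachCalls` ends at 5 while `forOfPasses` ends at 3, so only the `for...of` version bounds the work on a large session.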

The README config reference omitted the new collector. Add collectMergebox to
the environment-variable and feature-flag tables (flagged opt-in / default
false), and a dedicated "Mergebox Residency (opt-in)" section with enable
snippets and a tuning table for mergeboxInterval / mergeboxSampleRate /
mergeboxMaxSessions / mergeboxMaxDocsPerSession (and their SKYSIGNAL_* env vars).
Also corrects the v1.1.0 CHANGELOG strategy list to include NO_MERGE_MULTI.

- Byte attribution now uses an integer largest-remainder split per doc (the
  first docBytes % n handles get +1 byte) instead of float docBytes/n with
  per-bucket rounding. Per-handle shares sum back to docBytes exactly, so
  per-collection residency is now EXACTLY sum-preserving for any byte total
  (previously off by rounding when not divisible).
- maxDocsPerSession is enforced with for...of + break over collectionViews /
  documents.values() instead of Map#forEach (where return doesn't stop
  iteration), so the cap actually bounds the walk on large sessions.
- Emit buildHash on rows when configured (omitted otherwise), mirroring other
  collectors so residency correlates to a deployed build; documented in the
  addMergeboxMetric JSDoc (which also now lists NO_MERGE_MULTI).

Tests: exact 3-way non-divisible split, buildHash present/omitted, and the
maxDocsPerSession cap stopping the walk.
@mvogttech mvogttech merged commit dd0b077 into main Jun 19, 2026
2 checks passed
@mvogttech mvogttech deleted the feat/mergebox-collector branch June 19, 2026 21:09