feat: MergeboxCollector - attribute Meteor mergebox RAM residency #19
Adds a new collector that measures Meteor's mergebox (the per-session, server-side cache of published documents) and reports per-(publication, collection) RAM residency to POST /api/v1/metrics/mergebox. The mergebox is the single largest and least-visible consumer of server RAM in pub/sub-heavy apps, and nothing measured it before.

Where the bytes come from (ddp-server internals, verified against 3.x): Meteor.server.sessions -> session.collectionViews -> SessionCollectionView.documents -> SessionDocumentView.dataByKey (the resident field values) + existsIn (the subs referencing each doc). Subscription._documents is refcount-only and is NOT used.

Attribution is pure even-split: each resident doc's bytes/docCount are divided across its existsIn handles and bucketed by (publicationName, collectionName). Because the split divides by existsIn.size, the rows for a collection sum back to its true residency (sum-preserving), so the server's existing $sum aggregations reconstruct exact per-collection/strategy/site totals with no special "truth" row. Per-publication bytes are therefore an attribution (shared docs); per-collection is exact.

publicationStrategy is read via getPublicationStrategy() and reverse-mapped by identity (SERVER_MERGE / NO_MERGE / NO_MERGE_NO_HISTORY; NO_MERGE_MULTI and anything else -> unknown). NO_MERGE / NO_MERGE_NO_HISTORY keep no collectionView, so they correctly read ~0 residency; that absence is the optimization signal.

Safety: read-only snapshot (never wraps session.send/processMessage, avoiding the bug #7 double-wrap), default OFF (collectMergebox), low 60s cadence with staggered start, per-session sampling (mergeboxSampleRate, extrapolated server-side), maxSessions / maxDocsPerSession caps, top-N row cap aligned to the 500/POST limit, per-session try/catch, and feature-detection that degrades to zero rows on any Meteor shape mismatch.

Wires collectMergebox / mergeboxInterval / mergeboxSampleRate / cap options through config.js + env.js, adds SkySignalClient.addMergeboxMetric + /api/v1/metrics/mergebox batching, gates startup in skysignal-agent.js, and bumps AGENT_VERSION + package version to 1.1.0. The core ingest endpoint, service, and System > Mergebox UI already exist (SkySignalAPM/core PR #55).
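The walk and even-split described in this commit message can be sketched roughly as follows. This is a minimal illustration over mock data, not the collector's code: `estimateBytes`, `attributeResidency`, and the `"handle|collection"` bucket key are hypothetical names, sizing by JSON length is an assumption, and the sketch buckets by raw handle rather than resolving a publication name. It uses plain division for brevity; the review thread later replaces that with an integer split.

```javascript
// Rough size estimate for one field value.
// Assumption: UTF-8 length of the JSON encoding; the real sizing is not shown here.
function estimateBytes(value) {
  const json = JSON.stringify(value);
  return json === undefined ? 0 : Buffer.byteLength(json, "utf8");
}

// Walk sessions -> collectionViews -> documents and even-split each doc
// across the handles in existsIn. Returns a Map keyed by "handle|collection".
function attributeResidency(sessions) {
  const buckets = new Map();
  for (const session of sessions.values()) {
    if (!(session.collectionViews instanceof Map)) continue; // shape guard
    for (const [collectionName, view] of session.collectionViews) {
      if (!(view.documents instanceof Map)) continue;
      for (const docView of view.documents.values()) {
        const handles = docView.existsIn;
        if (!(handles instanceof Set) || handles.size === 0) continue;
        if (!(docView.dataByKey instanceof Map)) continue;

        // Resident value of each field is the head of its precedence list.
        let docBytes = 0;
        for (const precedenceList of docView.dataByKey.values()) {
          docBytes += estimateBytes(precedenceList?.[0]?.value);
        }

        // Dividing by handles.size keeps per-collection totals sum-preserving.
        for (const handle of handles) {
          const key = `${handle}|${collectionName}`;
          const bucket = buckets.get(key) ?? { bytes: 0, docs: 0 };
          bucket.bytes += docBytes / handles.size;
          bucket.docs += 1 / handles.size;
          buckets.set(key, bucket);
        }
      }
    }
  }
  return buckets;
}

// Mock session: one "tasks" doc shared by two subscriptions.
const mockSessions = new Map([
  ["s1", {
    collectionViews: new Map([
      ["tasks", {
        documents: new Map([
          ["doc1", {
            existsIn: new Set(["subA", "subB"]),
            dataByKey: new Map([["title", [{ value: "abcd" }]]]),
          }],
        ]),
      }],
    ]),
  }],
]);

const residency = attributeResidency(mockSessions);
console.log(residency.get("subA|tasks")); // { bytes: 3, docs: 0.5 }
```

The shared doc is 6 estimated bytes (`"abcd"` with its JSON quotes), so each of the two handles is attributed 3 bytes and half a document, and the two rows sum back to the collection's full residency.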
The collector was mapping NO_MERGE_MULTI to "unknown", which conflated a known Meteor strategy with "couldn't read the strategy". Resolve all four publication strategies: add the NO_MERGE_MULTI identity match and the structural shape (useCollectionView + doAccountingForCollection + useDummyDocumentView). "unknown" is now reserved for a genuinely unreadable/unrecognized strategy. Tests updated.
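A rough sketch of the two-step resolution this commit describes (identity match first, then the structural shape). The table below is a local stand-in for `DDPServer.publicationStrategies`, and its flag values are an assumption about ddp-server rather than something confirmed in this thread; `resolveStrategyName` is a hypothetical helper name.

```javascript
// Stand-in for DDPServer.publicationStrategies.
// Assumption: these flag values match ddp-server; verify against the Meteor source.
const publicationStrategies = {
  SERVER_MERGE: { useCollectionView: true, doAccountingForCollection: true, useDummyDocumentView: false },
  NO_MERGE_NO_HISTORY: { useCollectionView: false, doAccountingForCollection: false, useDummyDocumentView: false },
  NO_MERGE: { useCollectionView: false, doAccountingForCollection: true, useDummyDocumentView: false },
  NO_MERGE_MULTI: { useCollectionView: true, doAccountingForCollection: true, useDummyDocumentView: true },
};

const FLAGS = ["useCollectionView", "doAccountingForCollection", "useDummyDocumentView"];

// Hypothetical helper: map a strategy object back to its name.
function resolveStrategyName(strategy, table = publicationStrategies) {
  if (!strategy || typeof strategy !== "object") return "unknown";

  // 1) Identity match against the known strategy objects.
  for (const [name, known] of Object.entries(table)) {
    if (known === strategy) return name;
  }

  // 2) Structural fallback: all three flags must be booleans and match exactly.
  if (!FLAGS.every((flag) => typeof strategy[flag] === "boolean")) return "unknown";
  for (const [name, known] of Object.entries(table)) {
    if (FLAGS.every((flag) => known[flag] === strategy[flag])) return name;
  }
  return "unknown";
}

console.log(resolveStrategyName(publicationStrategies.NO_MERGE_MULTI)); // NO_MERGE_MULTI
console.log(resolveStrategyName({
  useCollectionView: false,
  doAccountingForCollection: true,
  useDummyDocumentView: false,
})); // NO_MERGE
console.log(resolveStrategyName(undefined)); // unknown
```

Keeping `unknown` as the result only when neither step matches preserves the distinction this commit draws between a known strategy and one that could not be read.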
Pull request overview
Adds an agent-side MergeboxCollector to measure and attribute Meteor mergebox RAM residency (per-session server-side published-doc cache) and ship rollups to the SkySignal ingest endpoint, enabling server/UI recommendations like switching hot pubs to NO_MERGE.
Changes:
- Introduces `MergeboxCollector` with per-session sampling, caps, feature-detection, and strategy reverse-mapping.
- Wires a new `mergebox` metrics batch type and endpoint (`/api/v1/metrics/mergebox`) into `SkySignalClient`.
- Adds config/env support, startup gating, unit tests, and bumps agent version to 1.1.0.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| lib/collectors/MergeboxCollector.js | New collector that snapshots mergebox residency and emits per-(publication, collection) rollups. |
| tests/unit/collectors/MergeboxCollector.test.js | Unit tests covering attribution, strategy mapping, sampling, caps, and row shape. |
| lib/SkySignalClient.js | Adds mergebox batching + addMergeboxMetric() helper. |
| tests/unit/client/SkySignalClient.test.js | Extends client tests for new batch type and endpoint. |
| lib/config.js | Adds mergebox config defaults + validation. |
| lib/env.js | Adds mergebox env var mappings. |
| skysignal-agent.js | Gates MergeboxCollector startup behind collectMergebox and passes config through. |
| lib/collectors/SystemMetricsCollector.js | Updates AGENT_VERSION to 1.1.0. |
| package.js | Bumps Meteor package version to 1.1.0. |
| CHANGELOG.md | Documents the 1.1.0 release and mergebox collector feature. |

    const bytesShare = docBytes / n;
    const docShare = 1 / n;
    const fieldShare = fieldCount / n;

    for (const handle of handles) {
      const publicationName = this._resolvePublicationName(session, handle);
      this._addToBucket(buckets, {
        publicationName,
        collectionName,
        strategy,
        bytesShare,
        docShare,
        fieldShare,
        session
      });
    }

Fixed in 28987d6. Byte attribution now uses an integer largest-remainder split per doc (the first docBytes % n handles get +1 byte), accumulated as integers with no per-bucket rounding — so per-handle shares sum back to docBytes exactly and per-collection residency is exactly sum-preserving for any byte total, not just divisible ones. Added a 3-way non-divisible-bytes test asserting the rows sum to fullBytes and differ by at most 1.
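The integer largest-remainder split described in this reply can be illustrated with a short sketch. `splitBytes` is a hypothetical name; the collector's actual function is not shown in this thread.

```javascript
// Split an integer byte total across n handles so the shares sum back exactly.
// Hypothetical helper; illustrates the technique, not the collector's code.
function splitBytes(docBytes, n) {
  if (!Number.isInteger(docBytes) || docBytes < 0) {
    throw new RangeError("docBytes must be a non-negative integer");
  }
  if (!Number.isInteger(n) || n <= 0) {
    throw new RangeError("n must be a positive integer");
  }
  const base = Math.floor(docBytes / n);
  const remainder = docBytes % n;
  // The first `remainder` handles each get one extra byte.
  return Array.from({ length: n }, (_, i) => base + (i < remainder ? 1 : 0));
}

const shares = splitBytes(100, 3);
console.log(shares); // [ 34, 33, 33 ]
console.log(shares.reduce((a, b) => a + b, 0)); // 100
```

Because every share is an integer and the extras total exactly `docBytes % n`, no rounding step is needed afterwards and shares differ by at most 1.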

    const row = {
      host: this.host,
      appVersion: this.appVersion,
      timestamp,
      windowStart: windowStartDate,
      windowEnd: windowEndDate,
      collectionName: b.collectionName,
      strategy: b.strategy,
      bytesHeld,
      docCount: Math.round(b.docCount),
      fieldCount: Math.round(b.fieldCount),
      connectionCount,
      sampleRate: this.sampleRate
    };

Fixed in 28987d6. Rows now include buildHash when the agent resolved one (omitted otherwise, like publicationName). Added the optional buildHash field to the core mergebox_metrics schema (SkySignalAPM/core ffeb8d5) so it is first-class alongside appVersion, and added present/omitted tests.
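One common way to get the "present when resolved, omitted otherwise" behaviour described here is a conditional spread. The sketch below is illustrative only; `buildRow` and the sample values are hypothetical and do not come from the collector.

```javascript
// Hypothetical row builder: optional fields are attached only when resolved,
// so an unresolved buildHash leaves no key on the row (not null or undefined).
function buildRow(base, { buildHash, publicationName } = {}) {
  return {
    ...base,
    ...(publicationName ? { publicationName } : {}),
    ...(buildHash ? { buildHash } : {}),
  };
}

const withHash = buildRow({ collectionName: "tasks", bytesHeld: 1024 }, { buildHash: "abc123" });
const withoutHash = buildRow({ collectionName: "tasks", bytesHeld: 1024 });
console.log("buildHash" in withHash, "buildHash" in withoutHash); // true false
```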

    ### v1.1.0 (Mergebox RAM Residency Collector)

    - **New `MergeboxCollector`** - Measures Meteor's MERGEBOX RAM residency (the per-session, server-side cache of published documents) and posts per-(publication, collection) rollups to `POST /api/v1/metrics/mergebox`. The collector walks `Meteor.server.sessions` read-only, estimates the resident bytes each session's mergebox holds per published collection (sizing each `SessionDocumentView.dataByKey` field value directly), reads the publication strategy via `Meteor.server.getPublicationStrategy()` (reverse-mapped to `SERVER_MERGE` / `NO_MERGE` / `NO_MERGE_NO_HISTORY` / `unknown`), and attributes residency to subscriptions via a pure even-split across `existsIn`. The even-split is sum-preserving: the rows for a collection sum back to that collection's true residency. `connectionCount` is a count of distinct DDP sessions (never a list of connection ids).

Already addressed in 84f8eb1 (pushed before this review round was applied) — the v1.1.0 entry now reads "reverse-mapped to all four Meteor strategies — SERVER_MERGE / NO_MERGE / NO_MERGE_NO_HISTORY / NO_MERGE_MULTI; unknown only when the strategy genuinely can't be read".

    documents.forEach((docView) => {
      if (docsWalked >= this.maxDocsPerSession) {
        // Bound the tick; remaining docs in this session are skipped.
        return;
      }
      docsWalked++;

Fixed in 28987d6. The walk is now for...of over collectionViews and documents.values() with break (plus a top-of-collection-loop check), so maxDocsPerSession actually stops iteration instead of doing a cheap check on every remaining entry. Added a test with a 5-doc session and cap=2 asserting only 2 docs are walked.
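The difference between the two loop forms can be shown on a mock `Map`. The data and variable names below are illustrative and mirror the 5-doc, cap=2 case mentioned in this reply.

```javascript
// Mock of a session's documents Map (doc id -> doc view); shape is illustrative.
const documents = new Map(["a", "b", "c", "d", "e"].map((id) => [id, { id }]));
const maxDocsPerSession = 2;

// Map#forEach: `return` only exits the current callback,
// so the callback still runs once for every remaining entry.
let callbacksRun = 0;
let countedByForEach = 0;
documents.forEach(() => {
  callbacksRun++;
  if (countedByForEach >= maxDocsPerSession) return;
  countedByForEach++;
});

// for...of + break: iteration actually stops once the cap is reached.
const walkedByForOf = [];
for (const docView of documents.values()) {
  if (walkedByForOf.length >= maxDocsPerSession) break;
  walkedByForOf.push(docView.id);
}

console.log(callbacksRun, countedByForEach, walkedByForOf); // 5 2 [ 'a', 'b' ]
```

Both forms count only two docs, but the `forEach` version still invokes its callback for all five entries, which is the cost the cap was meant to bound on large sessions.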
The README config reference omitted the new collector. Add collectMergebox to the environment-variable and feature-flag tables (flagged opt-in / default false), and a dedicated "Mergebox Residency (opt-in)" section with enable snippets and a tuning table for mergeboxInterval / mergeboxSampleRate / mergeboxMaxSessions / mergeboxMaxDocsPerSession (and their SKYSIGNAL_* env vars). Also corrects the v1.1.0 CHANGELOG strategy list to include NO_MERGE_MULTI.
- Byte attribution now uses an integer largest-remainder split per doc (the first docBytes % n handles get +1 byte) instead of float docBytes/n with per-bucket rounding. Per-handle shares sum back to docBytes exactly, so per-collection residency is now EXACTLY sum-preserving for any byte total (previously off by rounding when not divisible).
- maxDocsPerSession is enforced with for...of + break over collectionViews / documents.values() instead of Map#forEach (where return doesn't stop iteration), so the cap actually bounds the walk on large sessions.
- Emit buildHash on rows when configured (omitted otherwise), mirroring other collectors so residency correlates to a deployed build; documented in the addMergeboxMetric JSDoc (which also now lists NO_MERGE_MULTI).

Tests: exact 3-way non-divisible split, buildHash present/omitted, and the maxDocsPerSession cap stopping the walk.
Why

The mergebox, Meteor's per-session, server-side cache of every published document, is the single largest and least-visible consumer of server RAM in pub/sub-heavy apps, and the usual cause of OOMs. Nothing measured it before. This collector attributes mergebox RAM residency to the publications/collections holding it, so SkySignal can recommend switching hot publications to `NO_MERGE`.

The server side (ingest endpoint, `MergeboxService`, System → Mergebox UI) already exists in SkySignalAPM/core #55; this PR ships the agent half that produces the data.

Where the bytes come from (verified against ddp-server 3.x)

`Meteor.server.sessions` → `session.collectionViews` → `SessionCollectionView.documents` → `SessionDocumentView.dataByKey` (the resident field values, `precedenceList[0].value`) + `existsIn` (the subs referencing each doc). `Subscription._documents` is refcount-only (doc-id sets, no bytes) and is not used.

Attribution: pure even-split (sum-preserving)

Each resident doc's bytes/docCount are divided across its `existsIn` handles and bucketed by `(publicationName, collectionName)`. Because the split divides by `existsIn.size`, the rows for a collection sum back to its true residency, so the server's existing `$sum` aggregations reconstruct exact per-collection / per-strategy / site totals with no special "truth" row (a truth row would have double-counted). Per-publication bytes are therefore an attribution of shared docs; per-collection is exact.

Strategy awareness

Read via `getPublicationStrategy()` and reverse-mapped by identity against `DDPServer.publicationStrategies` → `SERVER_MERGE` / `NO_MERGE` / `NO_MERGE_NO_HISTORY` (`NO_MERGE_MULTI` and anything unrecognized → `unknown`, with a structural fallback). `NO_MERGE` / `NO_MERGE_NO_HISTORY` keep no collectionView at all, so they correctly read ~0 residency; that absence is the optimization signal, never synthesized.

Safety

- Read-only snapshot: never wraps `session.send` / `processMessage` (avoids the double-wrap stack overflow from bug #7, 'v1.0.16 still showing "DDPQueueCollector: Error in wrapped unblock: RangeError: Maximum call stack size exceeded" error').
- Default OFF (`collectMergebox`), low 60s cadence with staggered start.
- Per-session sampling (`mergeboxSampleRate`, stamped on each row for server-side extrapolation), `maxSessions` / `maxDocsPerSession` caps, and a top-N row cap aligned to the 500-rows/POST limit.
- Per-session `try/catch` and feature-detection that degrades to zero rows on any Meteor shape mismatch; these are undocumented internal structures, so every access is guarded.
- `connectionCount` is a count of distinct sessions, never a list of connection ids.

Changes

- `lib/collectors/MergeboxCollector.js` + `tests/unit/collectors/MergeboxCollector.test.js`.
- `SkySignalClient.addMergeboxMetric` + `/api/v1/metrics/mergebox` batch wiring.
- `collectMergebox` / `mergeboxInterval` / `mergeboxSampleRate` / cap options in `config.js` + `env.js`, gated startup in `skysignal-agent.js`.
- `AGENT_VERSION` + `package.js` → 1.1.0.

Tests

Unit tests cover the load-bearing invariants: a `SERVER_MERGE` doc shared by two subs even-splits and the rows sum back to full bytes; auto-publish (`U`) handles omit `publicationName`; `NO_MERGE` collections emit no row; `connectionCount` counts distinct sessions; `sampleRate<1` stamps the rate; feature-detection skips malformed sessions; strategy reverse-map + `DummyDocumentView` (`docCount>0` / `bytes~0`). Not run here; `npm test` (mocha) executes them.
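For the per-session sampling noted under Safety, the row carries `sampleRate` so the server can scale sampled figures back up. How the server actually does this is not shown in this PR; the sketch below assumes a simple division by the rate, and `extrapolateRow` and the `estimated*` field names are hypothetical.

```javascript
// Assumed server-side extrapolation: scale a sampled row by 1 / sampleRate.
// Hypothetical helper; the real MergeboxService logic is not part of this PR.
function extrapolateRow(row) {
  const rate = row.sampleRate;
  if (!(rate > 0 && rate <= 1)) {
    throw new RangeError("sampleRate must be in (0, 1]");
  }
  return {
    ...row,
    estimatedBytesHeld: Math.round(row.bytesHeld / rate),
    estimatedDocCount: Math.round(row.docCount / rate),
  };
}

const estimate = extrapolateRow({
  collectionName: "tasks",
  bytesHeld: 2500,
  docCount: 10,
  sampleRate: 0.25,
});
console.log(estimate.estimatedBytesHeld, estimate.estimatedDocCount); // 10000 40
```

With `sampleRate` of 1 the row passes through unchanged, so unsampled and sampled rows can share one code path.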