Overview
For mixed-batch uploads (payloads containing both new and duplicate messages), known duplicates are still included in the bulk-upsert path and may still trigger bucket recalculation in /api/upload-stats. This is a follow-up optimization to PR #42, which fixed the all-duplicate timeout regression.
Problem
When a batch contains some new and some duplicate messages:
messagesForDb (which includes known duplicates) is passed to the bulk-upsert loop unchanged.
affectedBuckets is populated from all messages, including known duplicates, so bucket aggregation runs even for buckets that received no new data.
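The current behavior can be sketched as follows. This is a hypothetical reconstruction of the flow, not the actual handler code; the UploadMessage shape and collectAffectedBuckets name are assumptions for illustration:

```typescript
// Hypothetical sketch of the current flow in /api/upload-stats:
// affectedBuckets is derived from every message in messagesForDb,
// so buckets containing only known duplicates are still recalculated.
interface UploadMessage {
  globalHash: string;
  bucket: string; // e.g. an hourly time-bucket key
}

function collectAffectedBuckets(messagesForDb: UploadMessage[]): Set<string> {
  const affectedBuckets = new Set<string>();
  for (const m of messagesForDb) {
    // No duplicate check here: a bucket whose payload is entirely
    // duplicates still lands in the recalculation set.
    affectedBuckets.add(m.bucket);
  }
  return affectedBuckets;
}
```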
Proposed Fix
After the duplicate detection query, filter messagesForDb to only include genuinely new messages:
const existingHashSet = new Set(existingMessages.map((m) => m.globalHash));
// Keep only messages whose globalHash was not found by the duplicate query.
const newMessagesForDb = messagesForDb.filter(
  (m) => !existingHashSet.has(m.globalHash)
);
const duplicateCount = messagesForDb.length - newMessagesForDb.length;
Then use newMessagesForDb for all subsequent upsert and recalculation steps (and update timing/metrics accordingly), so duplicates are never sent to the DB and never trigger unnecessary bucket recalculations.
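Downstream, the filtered flow could look like the sketch below. The planUpload function and the UploadMessage shape are hypothetical names for illustration; the filtering logic matches the snippet above:

```typescript
// Hypothetical sketch of the proposed flow: only genuinely new messages
// drive upserts and bucket recalculation, and an all-duplicate batch
// yields empty work (the short-circuit PR #42 introduced still applies).
interface UploadMessage {
  globalHash: string;
  bucket: string;
}

function planUpload(
  messagesForDb: UploadMessage[],
  existingHashes: string[]
): {
  toUpsert: UploadMessage[];
  bucketsToRecalc: Set<string>;
  duplicateCount: number;
} {
  const existingHashSet = new Set(existingHashes);
  // Drop messages the duplicate-detection query already found.
  const toUpsert = messagesForDb.filter(
    (m) => !existingHashSet.has(m.globalHash)
  );
  const duplicateCount = messagesForDb.length - toUpsert.length;
  // Buckets are collected from new messages only, so buckets whose
  // payload was entirely duplicates are never recalculated.
  const bucketsToRecalc = new Set(toUpsert.map((m) => m.bucket));
  return { toUpsert, bucketsToRecalc, duplicateCount };
}
```

For a mixed batch this yields the expected split: duplicates are counted for metrics, while only new messages reach the upsert loop and the bucket set.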