Optimize mixed-batch uploads: filter known duplicates from bulk-upsert and bucket recalculation #43

@coderabbitai

Description

Overview

For mixed-batch uploads (payloads containing both new and duplicate messages), known duplicates are still included in the bulk-upsert path and may still trigger bucket recalculation in /api/upload-stats. This is a follow-up optimization to PR #42, which fixed the all-duplicate timeout regression.

Problem

When a batch contains some new and some duplicate messages:

  • messagesForDb (which still includes the known duplicates) is passed to the bulk-upsert loop unchanged.
  • affectedBuckets is populated from all messages, duplicates included, so bucket aggregation runs even for buckets that received no new data (see the sketch after this list).
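
In rough terms, the current flow looks like the sketch below; bulkUpsertMessages, recalculateBucket, and the bucketKey field are illustrative placeholders, not the handler's actual identifiers:

// Sketch of the current mixed-batch flow (illustrative names, not the real handler code).
const affectedBuckets = new Set(messagesForDb.map((m) => m.bucketKey));

// Every message, including known duplicates, is sent to the bulk upsert.
await bulkUpsertMessages(messagesForDb);

// Recalculation runs even for buckets whose only messages in this batch were duplicates.
for (const bucket of affectedBuckets) {
  await recalculateBucket(bucket);
}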

Proposed Fix

After the duplicate-detection query, filter messagesForDb down to only the genuinely new messages:

// Hashes of messages already present in the DB, from the duplicate-detection query.
const existingHashSet = new Set(existingMessages.map((m) => m.globalHash));

// Keep only messages whose hash was not found, i.e. genuinely new ones.
const newMessagesForDb = messagesForDb.filter(
  (m) => !existingHashSet.has(m.globalHash)
);
const duplicateCount = messagesForDb.length - newMessagesForDb.length;

Then use newMessagesForDb for all subsequent upsert and recalculation steps (and update timing/metrics accordingly), so duplicates are never sent to the DB or cause unnecessary bucket recalculations.
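
A minimal sketch of how the filtered list could be threaded through the remaining steps (again, bulkUpsertMessages, recalculateBucket, and the bucketKey field are illustrative assumptions, not the actual implementation):

// Only genuinely new messages reach the DB and drive bucket recalculation.
if (newMessagesForDb.length > 0) {
  await bulkUpsertMessages(newMessagesForDb);

  const affectedBuckets = new Set(newMessagesForDb.map((m) => m.bucketKey));
  for (const bucket of affectedBuckets) {
    await recalculateBucket(bucket);
  }
}

// Duplicates are reported separately so timing/metrics reflect the reduced work.
console.log(`upload-stats: inserted=${newMessagesForDb.length} duplicates=${duplicateCount}`);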
