Import Document with Buffer #2692

Draft
xingfan-git wants to merge 27 commits into main from dev/xingfan/insert-with-buffer

Conversation


@xingfan-git xingfan-git commented May 16, 2025

June 10th Update: #2692 (comment)


This pull request updates the insert document feature for DocumentDB. To improve import performance, it uses a buffer to reduce the number of calls to the server, and it updates error message handling accordingly.

  • Implement the core architecture
  • Simplify the ClusterDocumentBufferManager: we don’t really need a manager here—a buffer that handles buffering and size measurement should be sufficient.
  • Create the buffer when needed during the import—this would behave the same as the current ClusterDocumentBufferManager management.
  • Once we have that buffer, can you make it independent of the Document class? Or make it generic?
  • The buffer should accept some configuration options when being created—provide default configurations for Mongo and for CosmosDB.
  • Use this improved import implementation for CosmosDB Core as well, so that Azure Databases benefits from the improvement too.
  • Ship it 🚀

Fixes #2582

@xingfan-git xingfan-git requested a review from Copilot May 16, 2025 09:20
Contributor

Copilot AI left a comment

Pull Request Overview

This PR enhances the DocumentDB insert workflow by batching documents into a configurable in-memory buffer to reduce server calls and updates error handling to surface partial failures.

  • Introduces ClusterBufferManager to accumulate and flush documents in bulk.
  • Updates ClustersClient.insertDocuments to use unordered bulk inserts with error logging (see the sketch after this list).
  • Modifies the import command to drive inserts through the new buffer.
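
For illustration, here is a minimal sketch of the unordered bulk insert with write-error logging described above, assuming the MongoDB Node.js driver. The result shape mirrors InsertDocumentsResult; the console logging stands in for the extension's actual ext.outputChannel wiring.

import { MongoBulkWriteError, type Collection, type Document } from 'mongodb';

export type InsertDocumentsResult = {
    /** The number of inserted documents for this operation */
    insertedCount: number;
};

export async function insertDocuments(
    collection: Collection<Document>,
    documents: Document[],
): Promise<InsertDocumentsResult> {
    try {
        // ordered: false lets the server continue inserting after individual failures
        const result = await collection.insertMany(documents, { ordered: false });
        return { insertedCount: result.insertedCount };
    } catch (error) {
        if (error instanceof MongoBulkWriteError) {
            // Log every write error, then report how many documents still made it in
            const writeErrors = Array.isArray(error.writeErrors) ? error.writeErrors : [error.writeErrors];
            for (const writeError of writeErrors) {
                console.error(`Write error: Failed with code "${writeError.code}".`);
            }
            return { insertedCount: error.insertedCount };
        }
        throw error;
    }
}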

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

  • src/documentdb/ClustersClient.ts: Switched to unordered insertMany, added bulk-error logging via ext.outputChannel.
  • src/documentdb/ClusterDocumentBufferManager.ts: Added buffering logic and config for chunked bulk imports.
  • src/commands/importDocuments/importDocuments.ts: Wired up ClusterBufferManager in the import flow and adapted insertDocument to handle buffered vs. single inserts.
Comments suppressed due to low confidence (2)

src/documentdb/ClusterDocumentBufferManager.ts:9

  • [nitpick] The fileCount field actually represents the number of buffered documents. Consider renaming it to documentCount for clarity.
fileCount: number;

src/documentdb/ClusterDocumentBufferManager.ts:1

  • Consider adding unit tests for BufferList and ClusterBufferManager to verify boundary conditions (max file count, max total size, oversized single documents); a sketch of such a test follows below.
export interface BufferStats {
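
As a rough illustration of the kind of boundary-condition test suggested here, the following Mocha-style sketch assumes a simplified buffer API (a boolean insert that returns false when a limit would be exceeded, and a flush that returns and clears the batch) and an assumed module path; the actual BufferList/ClusterBufferManager signatures may differ.

import * as assert from 'assert';
// Assumed module path and simplified API for illustration only.
import { DocumentBuffer } from '../utils/documentBuffer';

describe('DocumentBuffer boundary conditions', () => {
    it('rejects inserts once maxDocumentCount is reached', () => {
        const buffer = new DocumentBuffer<{ value: number }>({ maxDocumentCount: 2, maxTotalSizeBytes: 1024 });
        assert.strictEqual(buffer.insert({ value: 1 }), true);
        assert.strictEqual(buffer.insert({ value: 2 }), true);
        assert.strictEqual(buffer.insert({ value: 3 }), false); // full: the caller must flush first
    });

    it('flush returns the buffered batch and empties the buffer', () => {
        const buffer = new DocumentBuffer<{ value: number }>({ maxDocumentCount: 2, maxTotalSizeBytes: 1024 });
        buffer.insert({ value: 1 });
        assert.strictEqual(buffer.flush().length, 1);
        assert.strictEqual(buffer.flush().length, 0);
    });
});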

@xingfan-git xingfan-git requested a review from tnaum-ms May 16, 2025 09:27
@tnaum-ms tnaum-ms requested a review from Copilot May 21, 2025 11:58
Contributor

Copilot AI left a comment

Pull Request Overview

This PR enhances the DocumentDB import feature by batching inserts with an in-memory buffer to reduce server calls and updates error handling for bulk insert operations.

  • Introduced ClusterBufferManager to accumulate documents per collection and flush when thresholds are reached
  • Updated ClustersClient.insertDocuments to use unordered bulk inserts (ordered: false) and log write errors
  • Removed the acknowledged flag from InsertDocumentsResult and streamlined the result to insertedCount

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

  • src/documentdb/ClustersClient.ts: Bulk insert updated to ordered: false, added try/catch and error logging, slimmed result type.
  • src/documentdb/ClusterDocumentBufferManager.ts: New buffer manager to batch and flush documents by size/count.
  • src/commands/importDocuments/importDocuments.ts: Integrated buffer manager into import flow and refactored insert logic.
  • l10n/bundle.l10n.json: Removed obsolete error message from localization bundle.
Comments suppressed due to low confidence (3)

src/documentdb/ClustersClient.ts:61

  • Fix grammatical error in JSDoc: change "operations" to "operation".
/** The number of inserted documents for this operations */

src/documentdb/ClustersClient.ts:60

  • Dropping the acknowledged field is a breaking change. Consider deprecating it or bumping the API version and updating all consumers.
export type InsertDocumentsResult = {

src/documentdb/ClusterDocumentBufferManager.ts:89

  • Add unit tests for ClusterBufferManager (e.g., insert, flush, shouldFlush) to validate buffering logic and edge cases.
export class ClusterBufferManager {

tnaum-ms and others added 3 commits May 21, 2025 14:02
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Collaborator

@tnaum-ms tnaum-ms left a comment

Thank you @xingfan-git !

It looks good. It doesn't compile due to a minor error; please make sure to review the feedback in the action log
(note to self: modify actions to write errors back to the PR as a comment somehow).

I'd like to finalize with a simplification of the code base and a release. I'll share a dedicated comment in the PR in a couple of minutes.

Collaborator

tnaum-ms commented May 21, 2025

Summary and Closing Steps

Good approach, and great catch 🕵️ with the correct configuration of the insertMany command! I'm very happy to see the improvement and can’t wait to ship it 🚀

Now, in order to get there, I’d like to finalize this ticket with:

  • Simplify the ClusterDocumentBufferManager: I think that we don’t really need a manager here; a buffer that handles buffering and size measurement should be sufficient.
  • Create the buffer when needed during the import—this would behave the same as the current ClusterDocumentBufferManager management.
  • Once we have that buffer, can you make it independent of the Document class? Or make it generic?
  • The buffer should accept some configuration options when being created—provide default configurations for Mongo and for CosmosDB (a sketch follows after this list).
  • Use this improved import implementation for CosmosDB Core as well, so that Azure Databases benefits from the improvement too.
  • Ship it 🚀
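
To make the ask concrete, here is a minimal sketch of such a generic buffer. The DocumentBuffer, DocumentBufferOptions, and createMongoDbBuffer names match ones that appear later in this PR, but the thresholds, the size estimation, and the createCosmosDbBuffer factory are illustrative assumptions rather than the final implementation.

export interface DocumentBufferOptions {
    /** Flush once this many documents are buffered */
    maxDocumentCount: number;
    /** Flush once the estimated payload size (in bytes) reaches this limit */
    maxTotalSizeBytes: number;
}

// Illustrative defaults; the real limits would be tuned per backend.
const MONGO_DEFAULTS: DocumentBufferOptions = { maxDocumentCount: 1000, maxTotalSizeBytes: 16 * 1024 * 1024 };
const COSMOS_DEFAULTS: DocumentBufferOptions = { maxDocumentCount: 100, maxTotalSizeBytes: 2 * 1024 * 1024 };

export class DocumentBuffer<T> {
    private documents: T[] = [];
    private totalSizeBytes = 0;

    constructor(private readonly options: DocumentBufferOptions) {}

    /** Returns false when adding the document would exceed a limit; the caller should flush first. */
    public insert(document: T): boolean {
        const size = Buffer.byteLength(JSON.stringify(document));
        if (
            this.documents.length + 1 > this.options.maxDocumentCount ||
            this.totalSizeBytes + size > this.options.maxTotalSizeBytes
        ) {
            return false;
        }
        this.documents.push(document);
        this.totalSizeBytes += size;
        return true;
    }

    /** Returns the buffered documents and resets the buffer. */
    public flush(): T[] {
        const batch = this.documents;
        this.documents = [];
        this.totalSizeBytes = 0;
        return batch;
    }
}

export function createMongoDbBuffer<T>(customConfig?: Partial<DocumentBufferOptions>): DocumentBuffer<T> {
    return new DocumentBuffer<T>({ ...MONGO_DEFAULTS, ...customConfig });
}

export function createCosmosDbBuffer<T>(customConfig?: Partial<DocumentBufferOptions>): DocumentBuffer<T> {
    return new DocumentBuffer<T>({ ...COSMOS_DEFAULTS, ...customConfig });
}

An import loop would then call buffer.insert(parsedDocument) per document and, whenever insert returns false, flush the batch through the bulk insert before buffering the current document; a final flush after the loop sends the remainder.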

I'll add the tasks to the PR as well for better tracking.

@xingfan-git
Author

The error message for Cosmos Core was slightly changed:
we won't track the position of the failed document in a bulk insertion, so the error message will only say 'insertion failed with error code xxx' and will not specify which file failed to be inserted.

@tnaum-ms tnaum-ms requested a review from Copilot May 30, 2025 08:06

This comment was marked as outdated.

Collaborator

@tnaum-ms tnaum-ms left a comment

First of all, congrats ⭐

We're at this stage, imports are so much faster!

import-super-fast.mp4

I added comments in the PR, essentially a few minor things and an ask to simplify the Buffer API even further. I get the idea behind the auto-flush you've built, and I know it helps with high-throughput systems. We're in a much simpler environment, with no multithreading in the end, so I'd put the emphasis on keeping the code simpler for future maintainers and external contributors.

Let's just check whether a buffer is "full" (i.e. a flush is required), and then do it. If you want to keep the auto-flush behavior, please make sure function names are more verbose so that the intent is clear, and ensure that we don't rely only on success === false to decide the next step but add some sort of status flag.

A: Simplify the API (no auto-flush),
B: Improve the API (if you want to keep auto-flush).

I'd suggest moving forward with A, but it's up to you :)


These changes are API tweaks. I added a few changes to ensure progress reporting is correct (URI/file loading for only one file wasn't being reported correctly - this was an error in the original code).

Once these changes are in, feel free to move forward with a similar PR for our DocumentDB extension without waiting for my feedback here.


I just noticed that my replies to Copilot's comments are not linked; please scroll up, read the other 'unresolved' discussions, and close them.

@xingfan-git xingfan-git requested review from Copilot and tnaum-ms June 4, 2025 02:52

This comment was marked as outdated.

tnaum-ms
tnaum-ms previously approved these changes Jun 6, 2025
Collaborator

@tnaum-ms tnaum-ms left a comment

@xingfan-git 🥳 Congratulations on the first PR to be shipped!

  • Everything looks great! I added more details to the write-error logging for better UX.

Collaborator

tnaum-ms commented Jun 6, 2025

@xingfan-git I bumped into a blocking issue :-/

During extensive testing, I encountered a blocking issue when attempting to import data into Azure Cosmos DB for MongoDB (RU-based). The import fails due to request unit (RU) throttling, and the driver in use does not appear to handle RU-based flow control automatically.

In contrast, testing against Azure Cosmos DB NoSQL did not show this issue — likely because the driver handles throttling internally or in a more robust way.

Details

The following error is observed during import:

Write error: Failed with code "16500". - Error=16500, Details='Insert error.'

This indicates that the server is rejecting operations due to exceeding available RUs.

Root Cause

The MongoDB API driver we use does not account for RU throttling, as it's unaware of the Cosmos DB-specific RU model. Unlike Cosmos DB’s native NoSQL SDKs, it doesn’t manage retry or backoff logic by default.


Required Fix Before Shipping

To avoid failed imports and improve resilience, we should implement RU throttling handling explicitly:

  • Add delay + retry logic when encountering 16500 errors
    (RUs are replenished every second, so short delays may suffice; a sketch follows after this list)

  • Optionally, investigate whether Cosmos DB for MongoDB (RU) exposes a server status command or metric we can query to assess RU availability in advance
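
A minimal sketch of the delay-and-retry idea, assuming the MongoDB Node.js driver; the retry count, the one-second delay, and the retry-only-the-throttled-documents strategy are illustrative choices, not a confirmed fix.

import { MongoBulkWriteError, type Collection, type Document } from 'mongodb';

const COSMOS_RU_THROTTLE_CODE = 16500;
const MAX_RETRIES = 5; // illustrative limit
const RETRY_DELAY_MS = 1000; // RUs are replenished roughly every second

export async function insertWithRetry(collection: Collection<Document>, documents: Document[]): Promise<number> {
    let remaining = documents;
    let inserted = 0;
    for (let attempt = 0; attempt <= MAX_RETRIES && remaining.length > 0; attempt++) {
        try {
            const result = await collection.insertMany(remaining, { ordered: false });
            return inserted + result.insertedCount;
        } catch (error) {
            if (!(error instanceof MongoBulkWriteError)) {
                throw error;
            }
            const writeErrors = Array.isArray(error.writeErrors) ? error.writeErrors : [error.writeErrors];
            const throttled = writeErrors.filter((writeError) => writeError.code === COSMOS_RU_THROTTLE_CODE);
            if (throttled.length === 0) {
                throw error; // not a throttling failure, don't retry blindly
            }
            inserted += error.insertedCount;
            // Retry only the documents the server rejected, to avoid inserting duplicates
            remaining = throttled.map((writeError) => remaining[writeError.index]);
            await new Promise((resolve) => setTimeout(resolve, RETRY_DELAY_MS));
        }
    }
    return inserted;
}

Retrying only the rejected documents would sidestep the duplicate-insert concern raised further down in this thread; whether that is sufficient for RU accounts is exactly what still needs validation.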

@xingfan-git
Author

The issue occurred because the bulk insert operation exceeded the collection RU limit.
We cannot directly retry the insert operation:

  • First, RU has a retry mechanism, but it is not robust enough to resolve the issue we encountered;
  • Secondly, the bulk insert operation partially succeeded, so if we simply retry, we risk 1) inserting duplicate items or 2) facing insertion failure if the _id field was specified in the inserted items.

Due to the complexity of this issue, we decided to fix it in two steps:

  • For the current iteration, we noticed that inserting documents one by one could resolve most situations where throttling occurs, so we decided to use a small buffer (single document buffer) for RU resources.
  • In the future, there is a feature called server-side retry for Mongo RU that could resolve the throttling issue. We can investigate if we can perform a server-side retry via driver parameters. If not, @tnaum-ms mentioned we can retry only on the failed documents in a later iteration.

@xingfan-git xingfan-git requested a review from Copilot June 10, 2025 02:12
Contributor

Copilot AI left a comment

Pull Request Overview

This PR improves the document import feature by introducing a generic document buffer (with separate defaults for MongoDB and CosmosDB), simplifying buffer management, and enhancing error handling for bulk insert operations. Key changes include:

  • Implementation of a generic document buffer with configurable options.
  • Updated insertion routines in ClustersClient and importDocuments to leverage buffering and bulk operations.
  • Enhanced logging and localized error message updates for better diagnostics.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • src/utils/documentBuffer.ts: Introduces a generic, configurable document buffer for batching.
  • src/documentdb/ClustersClient.ts: Adds new error handling and a helper to check for Azure Cosmos DB RU connections.
  • src/commands/importDocuments/importDocuments.ts: Updates the import flow to use buffering and bulk insert operations.
  • l10n/bundle.l10n.json: Updates localized strings for insertion error messages and logging.
Comments suppressed due to low confidence (1)

src/commands/importDocuments/importDocuments.ts:349

  • In the BufferFull case, the current document is reinserted into the buffer after flush, which might be confusing at first glance. Consider adding an inline comment or refactoring the logic to make the flow explicit (a sketch of the explicit flow follows below).
if (insertOrFlushToBufferResult.errorCode === BufferFull) {
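
One way to make that flow explicit, sketched against the simplified DocumentBuffer shown earlier in this thread; the insertViaBuffer and flushToCluster names are hypothetical, and the real code works with BufferInsertResult/errorCode rather than a boolean.

// Assumed import path; DocumentBuffer is the simplified buffer sketched earlier in this thread.
import { DocumentBuffer } from '../utils/documentBuffer';

// Hypothetical helper; flushToCluster stands in for the actual bulk-insert call.
export async function insertViaBuffer<T>(
    buffer: DocumentBuffer<T>,
    document: T,
    flushToCluster: (batch: T[]) => Promise<number>,
): Promise<number> {
    if (buffer.insert(document)) {
        return 0; // buffered, nothing sent to the server yet
    }
    // The buffer is full: flush the pending batch first, then buffer the current document.
    const insertedCount = await flushToCluster(buffer.flush());
    buffer.insert(document); // re-insert into the now-empty buffer
    return insertedCount;
}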

@tnaum-ms
Collaborator

@sevoku This looks good from my point of view. It improves performance for vCore and Cosmos DB, but there is no improvement for RU.

With RU, we encountered issues with throughput and will need to address this in a separate ticket. As a workaround, we reduced the buffer size to "1", essentially reverting to the insertion of individual documents. This is slow enough for most cases, so throughput limits won't be hit.

This is a temporary solution for RU (or rather, no solution), but we'll address it properly during the copy-and-paste work. There is already a dedicated issue for it. We also have a ticket for better reporting of the id of failing documents, but I would prefer to integrate this with the copy-and-paste ticket.


I tested this with vCore and RU for various configurations and error scenarios. It's being merged into DocumentDB. For Cosmos DB, I conducted tests, including partition key conflicts, but I might have missed something.

🎯 Please try it out in your Cosmos DB setups.

@tnaum-ms tnaum-ms changed the title from "Import Document with Buffer for DocumentDB" to "Import Document with Buffer" Jun 10, 2025
@tnaum-ms tnaum-ms marked this pull request as ready for review June 10, 2025 15:47
@tnaum-ms tnaum-ms requested a review from a team as a code owner June 10, 2025 15:47
@tnaum-ms tnaum-ms requested a review from sevoku June 10, 2025 15:48
@sevoku sevoku requested a review from bk201- June 18, 2025 10:41

import { type ItemDefinition, type JSONObject, type JSONValue, type PartitionKeyDefinition } from '@azure/cosmos';
import { parseError, type IActionContext } from '@microsoft/vscode-azext-utils';
import { nonNullProp, parseError, type IActionContext } from '@microsoft/vscode-azext-utils';
Contributor

Please use nonNullProp from our utils; our function provides information about properties in the message


const countUri = uris.length;
const incrementUri = 50 / (countUri || 1);
const incrementUri = 25 / (countUri || 1);
Contributor

Please either add a comment or create constant variables for 25 and 75 (25% and 25% * 3 of the 100% progress)
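
For example, the magic numbers could be captured roughly like this; the constant names are hypothetical:

// Hypothetical constant names: the 100% progress bar is split between two phases.
export const URI_LOADING_PROGRESS_SHARE = 25; // reading and parsing the selected files
export const DOCUMENT_INSERT_PROGRESS_SHARE = 75; // inserting the parsed documents (25% * 3)

export function uriLoadingIncrement(countUri: number): number {
    return URI_LOADING_PROGRESS_SHARE / (countUri || 1);
}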


for (let i = 0, percent = 0; i < countUri; i++, percent += incrementUri) {
for (let i = 0; i < countUri; i++) {
const increment = (i + 1) * incrementUri;
Contributor

Taking into consideration the code below, where there is also an increment, I can't figure out which approach is right. Here you increase the increment every time, and it looks like you set the percentage. But in the code below the increment never increases, and it looks like you add its value to the progress.

Updated:
I figured out why it does not increase: you really don't know how many documents you have already inserted and how many are still in the buffer.

In this case the progress bar message will not be transparent for the user.

  1. The progress bar doesn't move, but the message shows how many documents were inserted.
  2. Even if the message says that 20 documents were inserted, it does not mean that they were actually inserted. Again, this is unclear for the user.
  3. You moved the buffer logic into the insert function, but the logic has to be reversed. You have to insert into a buffer, and when it returns the error that it is full, you have to flush and insert one batch.
  4. The progress bar can be computed easily. One step: 75 / the number of documents. When you insert a batch, you take the number of inserted documents, multiply it by the value of one step, and add it to the progress bar (see the sketch after this list).
  5. In this case you also remove the odd check for whether the buffer has documents, since you will know this. When the for loop ends, you just flush and insert, always.
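
A minimal sketch of the per-batch calculation from point 4, assuming VS Code's Progress API; the function and parameter names are illustrative.

import type * as vscode from 'vscode';

// 75% of the progress bar is reserved for inserts; 25% was already spent on loading files.
const DOCUMENT_INSERT_PROGRESS_SHARE = 75;

export function reportBatchProgress(
    progress: vscode.Progress<{ message?: string; increment?: number }>,
    insertedInBatch: number,
    insertedSoFar: number,
    totalDocuments: number,
): void {
    const stepPerDocument = DOCUMENT_INSERT_PROGRESS_SHARE / (totalDocuments || 1);
    progress.report({
        increment: insertedInBatch * stepPerDocument,
        message: `Inserted ${insertedSoFar + insertedInBatch} of ${totalDocuments} documents`,
    });
}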

// await needs to catch the error here, otherwise it will be thrown to the caller
return await insertDocumentIntoCluster(node, document as Document);
// Check for valid buffer
if (!buffer) {
Contributor

If you return the fixed object when there is no buffer, why not set the type as DocumentBuffer only?

async function insertDocument(
    node: CosmosDBContainerResourceItem | CollectionItem,
    document: unknown,
    buffer: DocumentBuffer<unknown>, // <-- it is more strict type
): Promise<{ count: number; errorOccurred: boolean }> {

/**
* Create a document buffer configured for MongoDB
*/
export function createMongoDbBuffer<T>(customConfig?: Partial<DocumentBufferOptions>): DocumentBuffer<T> {
Contributor

The function name already contains Mongo, so you can narrow the type:

// This buffer can keep only Document and all inherited classes
export function createMongoDbBuffer<T extends Document>(customConfig?: Partial<DocumentBufferOptions>): DocumentBuffer<T>


public insert(document: T): BufferInsertResult {
// Check if the document is valid
if (!document) {
Contributor

But if I create a new buffer like new DocumentBuffer<undefined>, this condition will be wrong. Please see the comment above about T.

* Document buffer for a specific database/collection pair.
* Used for batching document inserts to improve performance.
*/
export class DocumentBuffer<T> {
Contributor

This generic definition is wrong.

  1. The class name contains Document, so the type has to be narrowed to T extends Document | ItemDefinition | <any document type>.
  2. A bare T might lead to wrong behavior and redundant checks. See the comment below.

/**
* Error codes for document buffer operations
*/
export enum BufferErrorCode {
Contributor

NIT: Please try to avoid string enums. You can find more information on the internet; just for example:
https://dev.to/ivanzm123/dont-use-enums-in-typescript-they-are-very-dangerous-57bh
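
For reference, a minimal sketch of the const-object pattern that article recommends, applied to a buffer error code; the None and DocumentTooLarge members are illustrative, only BufferFull appears in this PR.

// A const object plus a derived union type instead of a string enum.
export const BufferErrorCode = {
    None: 'None',
    BufferFull: 'BufferFull',
    DocumentTooLarge: 'DocumentTooLarge',
} as const;

export type BufferErrorCode = (typeof BufferErrorCode)[keyof typeof BufferErrorCode];

// Call sites look the same as with an enum:
export function describeBufferError(code: BufferErrorCode): string {
    return code === BufferErrorCode.BufferFull ? 'flush required' : code;
}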

Collaborator

@bk201- never disappoints, thank you for sharing! I can always learn something new (to me) thanks to you.


if (isRuResource) {
// For Azure MongoDB RU, we use a buffer with maxDocumentCount = 1
buffer = createMongoDbBuffer<unknown>({
Contributor

But now you know the type; moreover, the function name already says what type it must be. See my comment below.

buffer = createMongoDbBuffer<Document>

node: CollectionItem,
buffer: DocumentBuffer<unknown>,
document?: Document,
// If document is undefined, it means that we are flushing the buffer
Contributor

This looks like a hack and unexpected behavior. In order to use the function, everyone HAS TO read this comment. This code smells.

Collaborator

Good point. It's something I discussed "offline" with @xingfan-git
Let's leave it as is in this iteration - there is another thing we're working on here, and I wanted to finally get it started:
microsoft/vscode-documentdb#63

Once it finalizes, we'll be moving import and export to the new task service and, while working on it, we'll improve the overall code and comment quality around import and export.

This work will be packaged as a module for sharing.

@tnaum-ms
Collaborator

@bk201- Thank you for your detailed review.
@xingfan-git Please address Dmitry's comments.

@tnaum-ms
Collaborator

@xingfan-git We'd like to ship this feature soon. Please address Dmitry's comments.

@bk201- bk201- marked this pull request as draft January 22, 2026 09:43

Development

Successfully merging this pull request may close these issues.

Importing multiple documents is very slow, need bulk-import support
