Import Document with Buffer #2692

Draft
xingfan-git wants to merge 27 commits into main from dev/xingfan/insert-with-buffer

Conversation


@xingfan-git xingfan-git commented May 16, 2025

June 10th Update: #2692 (comment)


This pull request updates the insert document feature for DocumentDB. To improve import performance, it uses a buffer to reduce the number of calls to the server, and it updates error message handling accordingly.

  • Implement the core architecture
  • Simplify the ClusterDocumentBufferManager: we don’t really need a manager here—a buffer that handles buffering and size measurement should be sufficient.
  • Create the buffer when needed during the import—this would behave the same as the current ClusterDocumentBufferManager management.
  • Once we have that buffer, can you make it independent of the Document class? Or make it generic?
  • The buffer should accept some configuration options when being created—provide default configurations for Mongo and for CosmosDB.
  • Use this improved import implementation for CosmosDB Core as well, so that Azure Databases benefits from the improvement too.
  • Ship it 🚀

Fixes #2582

@xingfan-git xingfan-git requested a review from Copilot May 16, 2025 09:20
Contributor

Copilot AI left a comment

Pull Request Overview

This PR enhances the DocumentDB insert workflow by batching documents into a configurable in-memory buffer to reduce server calls and updates error handling to surface partial failures.

  • Introduces ClusterBufferManager to accumulate and flush documents in bulk.
  • Updates ClustersClient.insertDocuments to use unordered bulk inserts with error logging (see the sketch after this list).
  • Modifies the import command to drive inserts through the new buffer.
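
For illustration, here is a minimal sketch of the unordered bulk insert with write-error logging described above, assuming the MongoDB Node.js driver. The result shape mirrors InsertDocumentsResult; the console logging stands in for the extension's actual ext.outputChannel wiring.

import { MongoBulkWriteError, type Collection, type Document } from 'mongodb';

export type InsertDocumentsResult = {
    /** The number of inserted documents for this operation */
    insertedCount: number;
};

export async function insertDocuments(
    collection: Collection<Document>,
    documents: Document[],
): Promise<InsertDocumentsResult> {
    try {
        // ordered: false lets the server continue inserting after individual failures
        const result = await collection.insertMany(documents, { ordered: false });
        return { insertedCount: result.insertedCount };
    } catch (error) {
        if (error instanceof MongoBulkWriteError) {
            // Log every write error, then report how many documents still made it in
            const writeErrors = Array.isArray(error.writeErrors) ? error.writeErrors : [error.writeErrors];
            for (const writeError of writeErrors) {
                console.error(`Write error: Failed with code "${writeError.code}".`);
            }
            return { insertedCount: error.insertedCount };
        }
        throw error;
    }
}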

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

  • src/documentdb/ClustersClient.ts: Switched to unordered insertMany, added bulk-error logging via ext.outputChannel.
  • src/documentdb/ClusterDocumentBufferManager.ts: Added buffering logic and config for chunked bulk imports.
  • src/commands/importDocuments/importDocuments.ts: Wired up ClusterBufferManager in the import flow and adapted insertDocument to handle buffered vs. single inserts.
Comments suppressed due to low confidence (2)

src/documentdb/ClusterDocumentBufferManager.ts:9

  • [nitpick] The fileCount field actually represents the number of buffered documents. Consider renaming it to documentCount for clarity.
fileCount: number;

src/documentdb/ClusterDocumentBufferManager.ts:1

  • Consider adding unit tests for BufferList and ClusterBufferManager to verify boundary conditions (max file count, max total size, oversized single documents); a sketch of such a test follows below.
export interface BufferStats {
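
As a rough illustration of the kind of boundary-condition test suggested here, the following Mocha-style sketch assumes a simplified buffer API (a boolean insert that returns false when a limit would be exceeded, and a flush that returns and clears the batch) and an assumed module path; the actual BufferList/ClusterBufferManager signatures may differ.

import * as assert from 'assert';
// Assumed module path and simplified API for illustration only.
import { DocumentBuffer } from '../utils/documentBuffer';

describe('DocumentBuffer boundary conditions', () => {
    it('rejects inserts once maxDocumentCount is reached', () => {
        const buffer = new DocumentBuffer<{ value: number }>({ maxDocumentCount: 2, maxTotalSizeBytes: 1024 });
        assert.strictEqual(buffer.insert({ value: 1 }), true);
        assert.strictEqual(buffer.insert({ value: 2 }), true);
        assert.strictEqual(buffer.insert({ value: 3 }), false); // full: the caller must flush first
    });

    it('flush returns the buffered batch and empties the buffer', () => {
        const buffer = new DocumentBuffer<{ value: number }>({ maxDocumentCount: 2, maxTotalSizeBytes: 1024 });
        buffer.insert({ value: 1 });
        assert.strictEqual(buffer.flush().length, 1);
        assert.strictEqual(buffer.flush().length, 0);
    });
});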

@xingfan-git xingfan-git requested a review from tnaum-ms May 16, 2025 09:27
@tnaum-ms tnaum-ms requested a review from Copilot May 21, 2025 11:58
Contributor

Copilot AI left a comment

Pull Request Overview

This PR enhances the DocumentDB import feature by batching inserts with an in-memory buffer to reduce server calls and updates error handling for bulk insert operations.

  • Introduced ClusterBufferManager to accumulate documents per collection and flush when thresholds are reached
  • Updated ClustersClient.insertDocuments to use unordered bulk inserts (ordered: false) and log write errors
  • Removed the acknowledged flag from InsertDocumentsResult and streamlined the result to insertedCount

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

  • src/documentdb/ClustersClient.ts: Bulk insert updated to ordered: false, added try/catch and error logging, slimmed result type.
  • src/documentdb/ClusterDocumentBufferManager.ts: New buffer manager to batch and flush documents by size/count.
  • src/commands/importDocuments/importDocuments.ts: Integrated buffer manager into import flow and refactored insert logic.
  • l10n/bundle.l10n.json: Removed obsolete error message from localization bundle.
Comments suppressed due to low confidence (3)

src/documentdb/ClustersClient.ts:61

  • Fix grammatical error in JSDoc: change "operations" to "operation".
/** The number of inserted documents for this operations */

src/documentdb/ClustersClient.ts:60

  • Dropping the acknowledged field is a breaking change. Consider deprecating it or bumping the API version and updating all consumers.
export type InsertDocumentsResult = {

src/documentdb/ClusterDocumentBufferManager.ts:89

  • Add unit tests for ClusterBufferManager (e.g., insert, flush, shouldFlush) to validate buffering logic and edge cases.
export class ClusterBufferManager {

tnaum-ms and others added 3 commits May 21, 2025 14:02
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Collaborator

@tnaum-ms tnaum-ms left a comment

Thank you @xingfan-git !

It looks good. It doesn't compile due to a minor error; please make sure to review the feedback in the action log
(note to self: modify actions to write errors back to the PR as a comment somehow).

I'd like to finalize with a simplification of the code base and a release. I'll share a dedicated comment in the PR in a couple of minutes.

Collaborator

tnaum-ms commented May 21, 2025

Summary and Closing Steps

Good approach, and great catch 🕵️ with the correct configuration of the insertMany command! I'm very happy to see the improvement and can’t wait to ship it 🚀

Now, in order to get there, I’d like to finalize this ticket with:

  • Simplify the ClusterDocumentBufferManager: I think that we don’t really need a manager here; a buffer that handles buffering and size measurement should be sufficient.
  • Create the buffer when needed during the import—this would behave the same as the current ClusterDocumentBufferManager management.
  • Once we have that buffer, can you make it independent of the Document class? Or make it generic?
  • The buffer should accept some configuration options when being created—provide default configurations for Mongo and for CosmosDB (a sketch follows after this list).
  • Use this improved import implementation for CosmosDB Core as well, so that Azure Databases benefits from the improvement too.
  • Ship it 🚀
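
To make the ask concrete, here is a minimal sketch of such a generic buffer. The DocumentBuffer, DocumentBufferOptions, and createMongoDbBuffer names match ones that appear later in this PR, but the thresholds, the size estimation, and the createCosmosDbBuffer factory are illustrative assumptions rather than the final implementation.

export interface DocumentBufferOptions {
    /** Flush once this many documents are buffered */
    maxDocumentCount: number;
    /** Flush once the estimated payload size (in bytes) reaches this limit */
    maxTotalSizeBytes: number;
}

// Illustrative defaults; the real limits would be tuned per backend.
const MONGO_DEFAULTS: DocumentBufferOptions = { maxDocumentCount: 1000, maxTotalSizeBytes: 16 * 1024 * 1024 };
const COSMOS_DEFAULTS: DocumentBufferOptions = { maxDocumentCount: 100, maxTotalSizeBytes: 2 * 1024 * 1024 };

export class DocumentBuffer<T> {
    private documents: T[] = [];
    private totalSizeBytes = 0;

    constructor(private readonly options: DocumentBufferOptions) {}

    /** Returns false when adding the document would exceed a limit; the caller should flush first. */
    public insert(document: T): boolean {
        const size = Buffer.byteLength(JSON.stringify(document));
        if (
            this.documents.length + 1 > this.options.maxDocumentCount ||
            this.totalSizeBytes + size > this.options.maxTotalSizeBytes
        ) {
            return false;
        }
        this.documents.push(document);
        this.totalSizeBytes += size;
        return true;
    }

    /** Returns the buffered documents and resets the buffer. */
    public flush(): T[] {
        const batch = this.documents;
        this.documents = [];
        this.totalSizeBytes = 0;
        return batch;
    }
}

export function createMongoDbBuffer<T>(customConfig?: Partial<DocumentBufferOptions>): DocumentBuffer<T> {
    return new DocumentBuffer<T>({ ...MONGO_DEFAULTS, ...customConfig });
}

export function createCosmosDbBuffer<T>(customConfig?: Partial<DocumentBufferOptions>): DocumentBuffer<T> {
    return new DocumentBuffer<T>({ ...COSMOS_DEFAULTS, ...customConfig });
}

An import loop would then call buffer.insert(parsedDocument) per document and, whenever insert returns false, flush the batch through the bulk insert before buffering the current document; a final flush after the loop sends the remainder.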

I'll add the tasks to the PR as well for better tracking.

@xingfan-git
Author

The error message for Cosmos Core was slightly changed:
we won't track the position of the failed document in a bulk insertion, so the error message will only say 'insertion failed with error code xxx' and will not specify which file failed to be inserted.

@tnaum-ms tnaum-ms requested a review from Copilot May 30, 2025 08:06

This comment was marked as outdated.

Collaborator

@tnaum-ms tnaum-ms left a comment

First of all, congrats ⭐

We're at this stage, imports are so much faster!

import-super-fast.mp4

I added comments in the PR, essentially a few minor things and an ask to simplify the Buffer API even further. I get the idea behind the auto-flush you've built, and I know it helps with high-throughput systems. We're in a much simpler environment, with no multithreading in the end, so I'd put the emphasis on keeping the code simpler for future maintainers and external contributors.

Let's just check whether a buffer is "full" (i.e. a flush is required), and then do it. If you want to keep the auto-flush behavior, please make sure function names are more verbose so that the intent is clear, and ensure that we don't rely only on success === false to decide the next step but add some sort of status flag.

A: Simplify the API (no auto-flush),
B: Improve the API (if you want to keep auto-flush).

I'd suggest moving forward with A, but it's up to you :)


These changes are API tweaks. I added a few changes to ensure progress reporting is correct (URI/file loading for only one file wasn't being reported correctly - this was an error in the original code).

Once these changes are in, feel free to move forward with a similar PR for our DocumentDB extension without waiting for my feedback here.


I just noticed that my replies to Copilot's comments are not linked; please scroll up, read the other 'unresolved' discussions, and close them.

@xingfan-git xingfan-git requested review from Copilot and tnaum-ms June 4, 2025 02:52

This comment was marked as outdated.

tnaum-ms
tnaum-ms previously approved these changes Jun 6, 2025
Collaborator

@tnaum-ms tnaum-ms left a comment

@xingfan-git 🥳 Congratulations on the first PR to be shipped!

  • Everything looks great! I added more details to the write-error logging for better UX.

Collaborator

tnaum-ms commented Jun 6, 2025

@xingfan-git I bumped into a blocking issue :-/

During extensive testing, I encountered a blocking issue when attempting to import data into Azure Cosmos DB for MongoDB (RU-based). The import fails due to request unit (RU) throttling, and the driver in use does not appear to handle RU-based flow control automatically.

In contrast, testing against Azure Cosmos DB NoSQL did not show this issue — likely because the driver handles throttling internally or in a more robust way.

Details

The following error is observed during import:

Write error: Failed with code "16500". - Error=16500, Details='Insert error.'

This indicates that the server is rejecting operations due to exceeding available RUs.

Root Cause

The MongoDB API driver we use does not account for RU throttling, as it's unaware of the Cosmos DB-specific RU model. Unlike Cosmos DB’s native NoSQL SDKs, it doesn’t manage retry or backoff logic by default.


Required Fix Before Shipping

To avoid failed imports and improve resilience, we should implement RU throttling handling explicitly:

  • Add delay + retry logic when encountering 16500 errors
    (RUs are replenished every second, so short delays may suffice; a sketch follows after this list)

  • Optionally, investigate whether Cosmos DB for MongoDB (RU) exposes a server status command or metric we can query to assess RU availability in advance
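
A minimal sketch of the delay-and-retry idea, assuming the MongoDB Node.js driver; the retry count, the one-second delay, and the retry-only-the-throttled-documents strategy are illustrative choices, not a confirmed fix.

import { MongoBulkWriteError, type Collection, type Document } from 'mongodb';

const COSMOS_RU_THROTTLE_CODE = 16500;
const MAX_RETRIES = 5; // illustrative limit
const RETRY_DELAY_MS = 1000; // RUs are replenished roughly every second

export async function insertWithRetry(collection: Collection<Document>, documents: Document[]): Promise<number> {
    let remaining = documents;
    let inserted = 0;
    for (let attempt = 0; attempt <= MAX_RETRIES && remaining.length > 0; attempt++) {
        try {
            const result = await collection.insertMany(remaining, { ordered: false });
            return inserted + result.insertedCount;
        } catch (error) {
            if (!(error instanceof MongoBulkWriteError)) {
                throw error;
            }
            const writeErrors = Array.isArray(error.writeErrors) ? error.writeErrors : [error.writeErrors];
            const throttled = writeErrors.filter((writeError) => writeError.code === COSMOS_RU_THROTTLE_CODE);
            if (throttled.length === 0) {
                throw error; // not a throttling failure, don't retry blindly
            }
            inserted += error.insertedCount;
            // Retry only the documents the server rejected, to avoid inserting duplicates
            remaining = throttled.map((writeError) => remaining[writeError.index]);
            await new Promise((resolve) => setTimeout(resolve, RETRY_DELAY_MS));
        }
    }
    return inserted;
}

Retrying only the rejected documents would sidestep the duplicate-insert concern raised further down in this thread; whether that is sufficient for RU accounts is exactly what still needs validation.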

@xingfan-git
Author

The issue occurred because the bulk insert operation exceeded the collection RU limit.
We cannot directly retry the insert operation:

  • First, RU has a retry mechanism, but it is not robust enough to resolve the issue we encountered;
  • Secondly, the bulk insert operation partially succeeded, so if we simply retry, we risk 1) inserting duplicate items or 2) facing insertion failure if the _id field was specified in the inserted items.

Due to the complexity of this issue, we decided to fix it in two steps:

  • For the current iteration, we noticed that inserting documents one by one could resolve most situations where throttling occurs, so we decided to use a small buffer (single document buffer) for RU resources.
  • In the future, there is a feature called server-side retry for Mongo RU that could resolve the throttling issue. We can investigate if we can perform a server-side retry via driver parameters. If not, @tnaum-ms mentioned we can retry only on the failed documents in a later iteration.

@xingfan-git xingfan-git requested a review from Copilot June 10, 2025 02:12
Contributor

Copilot AI left a comment

Pull Request Overview

This PR improves the document import feature by introducing a generic document buffer (with separate defaults for MongoDB and CosmosDB), simplifying buffer management, and enhancing error handling for bulk insert operations. Key changes include:

  • Implementation of a generic document buffer with configurable options.
  • Updated insertion routines in ClustersClient and importDocuments to leverage buffering and bulk operations.
  • Enhanced logging and localized error message updates for better diagnostics.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • src/utils/documentBuffer.ts: Introduces a generic, configurable document buffer for batching.
  • src/documentdb/ClustersClient.ts: Adds new error handling and a helper to check for Azure Cosmos DB RU connections.
  • src/commands/importDocuments/importDocuments.ts: Updates the import flow to use buffering and bulk insert operations.
  • l10n/bundle.l10n.json: Updates localized strings for insertion error messages and logging.
Comments suppressed due to low confidence (1)

src/commands/importDocuments/importDocuments.ts:349

  • In the BufferFull case, the current document is reinserted into the buffer after flush, which might be confusing at first glance. Consider adding an inline comment or refactoring the logic to make the flow explicit (a sketch of the explicit flow follows below).
if (insertOrFlushToBufferResult.errorCode === BufferFull) {
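
One way to make that flow explicit, sketched against the simplified DocumentBuffer shown earlier in this thread; the insertViaBuffer and flushToCluster names are hypothetical, and the real code works with BufferInsertResult/errorCode rather than a boolean.

// Assumed import path; DocumentBuffer is the simplified buffer sketched earlier in this thread.
import { DocumentBuffer } from '../utils/documentBuffer';

// Hypothetical helper; flushToCluster stands in for the actual bulk-insert call.
export async function insertViaBuffer<T>(
    buffer: DocumentBuffer<T>,
    document: T,
    flushToCluster: (batch: T[]) => Promise<number>,
): Promise<number> {
    if (buffer.insert(document)) {
        return 0; // buffered, nothing sent to the server yet
    }
    // The buffer is full: flush the pending batch first, then buffer the current document.
    const insertedCount = await flushToCluster(buffer.flush());
    buffer.insert(document); // re-insert into the now-empty buffer
    return insertedCount;
}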

@tnaum-ms
Collaborator

@sevoku This looks good from my point of view. It improves performance for vCore and Cosmos DB, but there is no improvement for RU.

With RU, we encountered issues with throughput and will need to address this in a separate ticket. As a workaround, we reduced the buffer size to "1", essentially reverting to the insertion of individual documents. This is slow enough for most cases, so throughput limits won't be hit.

This is a temporary solution for RU (or rather, no solution), but we'll address it properly during the copy-and-paste work. There is already a dedicated issue for it. We also have a ticket for better reporting of the id of failing documents, but I would prefer to integrate this with the copy-and-paste ticket.


I tested this with vCore and RU for various configurations and error scenarios. It's being merged into DocumentDB. For Cosmos DB, I conducted tests, including partition key conflicts, but I might have missed something.

🎯 Please try it out in your Cosmos DB setups.

@tnaum-ms tnaum-ms changed the title from "Import Document with Buffer for DocumentDB" to "Import Document with Buffer" Jun 10, 2025
@tnaum-ms tnaum-ms marked this pull request as ready for review June 10, 2025 15:47
@tnaum-ms tnaum-ms requested a review from a team as a code owner June 10, 2025 15:47
@tnaum-ms tnaum-ms requested a review from sevoku June 10, 2025 15:48
@sevoku sevoku requested a review from bk201- June 18, 2025 10:41

import { type ItemDefinition, type JSONObject, type JSONValue, type PartitionKeyDefinition } from '@azure/cosmos';
import { parseError, type IActionContext } from '@microsoft/vscode-azext-utils';
import { nonNullProp, parseError, type IActionContext } from '@microsoft/vscode-azext-utils';
Contributor

Please use nonNullProp from our utils; our function provides information about properties in the message


const countUri = uris.length;
const incrementUri = 50 / (countUri || 1);
const incrementUri = 25 / (countUri || 1);
Contributor

Please either add a comment or create constant variables for 25 and 75 (25% and 25% * 3 of the 100% progress)
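
For example, the magic numbers could be captured roughly like this; the constant names are hypothetical:

// Hypothetical constant names: the 100% progress bar is split between two phases.
export const URI_LOADING_PROGRESS_SHARE = 25; // reading and parsing the selected files
export const DOCUMENT_INSERT_PROGRESS_SHARE = 75; // inserting the parsed documents (25% * 3)

export function uriLoadingIncrement(countUri: number): number {
    return URI_LOADING_PROGRESS_SHARE / (countUri || 1);
}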


for (let i = 0, percent = 0; i < countUri; i++, percent += incrementUri) {
for (let i = 0; i < countUri; i++) {
const increment = (i + 1) * incrementUri;
Contributor

Taking into consideration the code below, where there is also an increment, I can't figure out which approach is right. Here you increase the increment every time, and it looks like you set the percentage. But in the code below the increment never increases, and it looks like you add its value to the progress.

Updated:
I figured out why it does not increase: you really don't know how many documents you have already inserted and how many are still in the buffer.

In this case the progress bar message will not be transparent for the user.

  1. The progress bar doesn't move, but the message shows how many documents were inserted.
  2. Even if the message says that 20 documents were inserted, it does not mean that they were actually inserted. Again, this is unclear for the user.
  3. You moved the buffer logic into the insert function, but the logic has to be reversed. You have to insert into a buffer, and when it returns the error that it is full, you have to flush and insert one batch.
  4. The progress bar can be computed easily. One step: 75 / the number of documents. When you insert a batch, you take the number of inserted documents, multiply it by the value of one step, and add it to the progress bar (see the sketch after this list).
  5. In this case you also remove the odd check for whether the buffer has documents, since you will know this. When the for loop ends, you just flush and insert, always.
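
A minimal sketch of the per-batch calculation from point 4, assuming VS Code's Progress API; the function and parameter names are illustrative.

import type * as vscode from 'vscode';

// 75% of the progress bar is reserved for inserts; 25% was already spent on loading files.
const DOCUMENT_INSERT_PROGRESS_SHARE = 75;

export function reportBatchProgress(
    progress: vscode.Progress<{ message?: string; increment?: number }>,
    insertedInBatch: number,
    insertedSoFar: number,
    totalDocuments: number,
): void {
    const stepPerDocument = DOCUMENT_INSERT_PROGRESS_SHARE / (totalDocuments || 1);
    progress.report({
        increment: insertedInBatch * stepPerDocument,
        message: `Inserted ${insertedSoFar + insertedInBatch} of ${totalDocuments} documents`,
    });
}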

// await needs to catch the error here, otherwise it will be thrown to the caller
return await insertDocumentIntoCluster(node, document as Document);
// Check for valid buffer
if (!buffer) {
Contributor

If you return the fixed object when there is no buffer, why not set the type as DocumentBuffer only?

async function insertDocument(
    node: CosmosDBContainerResourceItem | CollectionItem,
    document: unknown,
    buffer: DocumentBuffer<unknown>, // <-- it is more strict type
): Promise<{ count: number; errorOccurred: boolean }> {

/**
* Create a document buffer configured for MongoDB
*/
export function createMongoDbBuffer<T>(customConfig?: Partial<DocumentBufferOptions>): DocumentBuffer<T> {
Contributor

The function name already contains Mongo, so you can narrow the type:

// This buffer can keep only Document and all inherited classes
export function createMongoDbBuffer<T extends Document>(customConfig?: Partial<DocumentBufferOptions>): DocumentBuffer<T>


public insert(document: T): BufferInsertResult {
// Check if the document is valid
if (!document) {
Contributor

But if I create a new buffer like new DocumentBuffer<undefined>, this condition will be wrong. Please see the comment above about T.

* Document buffer for a specific database/collection pair.
* Used for batching document inserts to improve performance.
*/
export class DocumentBuffer<T> {
Contributor

This generic definition is wrong.

  1. The class name contains Document, so the type has to be narrowed to T extends Document | ItemDefinition | <any document type>.
  2. A bare T might lead to wrong behavior and redundant checks. See the comment below.

/**
* Error codes for document buffer operations
*/
export enum BufferErrorCode {
Contributor

NIT: Please try to avoid string enums. You can find more information on the internet; just for example:
https://dev.to/ivanzm123/dont-use-enums-in-typescript-they-are-very-dangerous-57bh
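
For reference, a minimal sketch of the const-object pattern that article recommends, applied to a buffer error code; the None and DocumentTooLarge members are illustrative, only BufferFull appears in this PR.

// A const object plus a derived union type instead of a string enum.
export const BufferErrorCode = {
    None: 'None',
    BufferFull: 'BufferFull',
    DocumentTooLarge: 'DocumentTooLarge',
} as const;

export type BufferErrorCode = (typeof BufferErrorCode)[keyof typeof BufferErrorCode];

// Call sites look the same as with an enum:
export function describeBufferError(code: BufferErrorCode): string {
    return code === BufferErrorCode.BufferFull ? 'flush required' : code;
}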

Collaborator

@bk201- never disappoints, thank you for sharing! I can always learn something new (to me) thanks to you.


if (isRuResource) {
// For Azure MongoDB RU, we use a buffer with maxDocumentCount = 1
buffer = createMongoDbBuffer<unknown>({
Contributor

But now you know the type; moreover, the function name already says what type it must be. See my comment below.

buffer = createMongoDbBuffer<Document>

node: CollectionItem,
buffer: DocumentBuffer<unknown>,
document?: Document,
// If document is undefined, it means that we are flushing the buffer
Contributor

This looks like a hack and unexpected behavior. In order to use the function, everyone HAS TO read this comment. This code smells.

Collaborator

Good point. It's something I discussed "offline" with @xingfan-git
Let's leave it as is in this iteration - there is another thing we're working on here, and I wanted to finally get it started:
microsoft/vscode-documentdb#63

Once it finalizes, we'll be moving import and export to the new task service and, while working on it, we'll improve the overall code and comment quality around import and export.

This work will be packaged as a module for sharing.

@tnaum-ms
Collaborator

@bk201- Thank you for your detailed review.
@xingfan-git Please address Dmitry's comments.

@tnaum-ms
Collaborator

@xingfan-git We'd like to ship this feature soon. Please address Dmitry's comments.

@bk201- bk201- marked this pull request as draft January 22, 2026 09:43

Development

Successfully merging this pull request may close these issues.

Importing multiple documents is very slow, need bulk-import support
