Conversation
Pull Request Overview
This PR enhances the DocumentDB insert workflow by batching documents into a configurable in-memory buffer to reduce server calls and updates error handling to surface partial failures.
- Introduces `ClusterBufferManager` to accumulate and flush documents in bulk.
- Updates `ClustersClient.insertDocuments` to use unordered bulk inserts with error logging (a sketch of this pattern follows the list).
- Modifies the import command to drive inserts through the new buffer.
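A minimal sketch of the unordered bulk-insert pattern from the second bullet, using the MongoDB Node.js driver; the `ext` import path and the helper name are assumptions (the `ext.outputChannel` logging target comes from the table below):

```ts
import { MongoBulkWriteError, type Collection, type Document } from 'mongodb';
import { ext } from '../extensionVariables'; // assumed location of the extension's shared state

// Sketch: attempt every document (ordered: false), log failures, keep successful writes.
async function insertDocumentsUnordered(collection: Collection<Document>, documents: Document[]): Promise<number> {
    try {
        const result = await collection.insertMany(documents, { ordered: false });
        return result.insertedCount;
    } catch (error) {
        if (error instanceof MongoBulkWriteError) {
            const writeErrors = Array.isArray(error.writeErrors) ? error.writeErrors : [error.writeErrors];
            for (const writeError of writeErrors) {
                ext.outputChannel.appendLine(`Bulk insert error: ${writeError.errmsg ?? 'unknown error'}`);
            }
            return error.result.insertedCount; // documents that did make it in
        }
        throw error; // not a bulk-write error, let the caller handle it
    }
}
```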
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/documentdb/ClustersClient.ts | Switched to unordered insertMany, added bulk-error logging via `ext.outputChannel`. |
| src/documentdb/ClusterDocumentBufferManager.ts | Added buffering logic and config for chunked bulk imports. |
| src/commands/importDocuments/importDocuments.ts | Wired up ClusterBufferManager in the import flow and adapted insertDocument to handle buffered vs. single inserts. |
Comments suppressed due to low confidence (2)
src/documentdb/ClusterDocumentBufferManager.ts:9
- [nitpick] The `fileCount` field actually represents the number of buffered documents. Consider renaming it to `documentCount` for clarity.

```ts
fileCount: number;
```
src/documentdb/ClusterDocumentBufferManager.ts:1
- Consider adding unit tests for `BufferList` and `ClusterBufferManager` to verify boundary conditions (max file count, max total size, oversized single documents).

```ts
export interface BufferStats {
```
Pull Request Overview
This PR enhances the DocumentDB import feature by batching inserts with an in-memory buffer to reduce server calls and updates error handling for bulk insert operations.
- Introduced `ClusterBufferManager` to accumulate documents per collection and flush when thresholds are reached
- Updated `ClustersClient.insertDocuments` to use unordered bulk inserts (`ordered: false`) and log write errors
- Removed the `acknowledged` flag from `InsertDocumentsResult` and streamlined the result to `insertedCount`
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/documentdb/ClustersClient.ts | Bulk insert updated to ordered: false, added try/catch and error logging, slimmed result type |
| src/documentdb/ClusterDocumentBufferManager.ts | New buffer manager to batch and flush documents by size/count |
| src/commands/importDocuments/importDocuments.ts | Integrated buffer manager into import flow and refactored insert logic |
| l10n/bundle.l10n.json | Removed obsolete error message from localization bundle |
Comments suppressed due to low confidence (3)
src/documentdb/ClustersClient.ts:61
- Fix grammatical error in JSDoc: change "operations" to "operation".
```ts
/** The number of inserted documents for this operations */
```
src/documentdb/ClustersClient.ts:60
- Dropping the `acknowledged` field is a breaking change. Consider deprecating it or bumping the API version and updating all consumers.

```ts
export type InsertDocumentsResult = {
```
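A minimal sketch of the deprecation route this comment suggests; the field set is assumed from the bullets above, not taken from the actual diff:

```ts
export type InsertDocumentsResult = {
    /** @deprecated Kept for backward compatibility; prefer insertedCount. */
    acknowledged?: boolean;
    /** The number of inserted documents for this operation */
    insertedCount: number;
};
```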
src/documentdb/ClusterDocumentBufferManager.ts:89
- Add unit tests for `ClusterBufferManager` (e.g., `insert`, `flush`, `shouldFlush`) to validate buffering logic and edge cases.

```ts
export class ClusterBufferManager {
```
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
tnaum-ms
left a comment
Thank you @xingfan-git !
It looks good. It doesn't compile due to a minor error; please make sure to review the feedback in the action log
(note to self: modify actions to write errors back to the PR as a comment somehow).
I'd like to finalize with a simplification of the code base and a release. I'll share a dedicated comment in the PR in a couple of minutes.
Summary and Closing Steps

Good approach, and great catch 🕵️ with the correct configuration. Now, in order to get there, I'd like to finalize this ticket with:
I'll add the tasks to the PR as well for better tracking.
…nto dev/xingfan/insert-with-buffer
The error message for Cosmos Core was slightly changed:
…cases with 0 valid documents to import.
First of all, congrats ⭐
We're at this stage: imports are so much faster!
import-super-fast.mp4
I added comments in the PR: essentially a few minor things and an ask to simplify the Buffer API even further. I get the idea behind the auto-flush you've built, and I know it helps with high-throughput systems. We're in a much simpler environment, with no multithreading in the end, so I'd put the emphasis on keeping the code simpler for future maintainers and external contributors.
Let's just check whether a buffer is "full" (i.e. a flush is required), and then do it. If you want to keep the auto-flush behavior, please make sure function names are more verbose so that the intent is clear, and ensure that we don't rely only on `success === false` to decide about the next step, but add some sort of status flag.
A: Simplify the API (no auto-flush),
B: Improve the API (if you want to keep auto-flush).
I'd suggest moving forward with A, but it's up to you :)
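If option B is kept, the explicit status flag mentioned above could look like this sketch (all names here are illustrative, not taken from the PR):

```ts
// Explicit outcome of a buffered insert, instead of a bare success boolean.
type BufferInsertStatus = 'inserted' | 'bufferFull' | 'autoFlushed';

interface BufferInsertResult {
    status: BufferInsertStatus; // callers branch on this, not on success === false
    bufferedCount: number; // documents currently held in the buffer
}
```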
These changes are API tweaks. I added a few changes to ensure progress reporting is correct (URI/file loading for only one file wasn't being reported correctly; this was an error in the original code).
Once these changes are in, feel free to move forward with a similar PR for our DocumentDB extension without waiting for my feedback here.
I just noticed that my replies to Copilot's comments are not linked; please scroll up, read the other 'unresolved' discussions, and close them.
…nto dev/xingfan/insert-with-buffer
tnaum-ms
left a comment
@xingfan-git 🥳 Congratulations on the first PR to be shipped!
- Everything looks great! I added more details to the write-error logging for better UX.
@xingfan-git I bumped into a blocking issue :-/

During extensive testing, I encountered a blocking issue when attempting to import data into Azure Cosmos DB for MongoDB (RU-based). The import fails due to request unit (RU) throttling, and the driver in use does not appear to handle RU-based flow control automatically. In contrast, testing against Azure Cosmos DB NoSQL did not show this issue, likely because the driver handles throttling internally or in a more robust way.

Details

The following error is observed during import:

This indicates that the server is rejecting operations due to exceeding available RUs.

Root Cause

The MongoDB API driver we use does not account for RU throttling, as it's unaware of the Cosmos DB-specific RU model. Unlike Cosmos DB's native NoSQL SDKs, it doesn't manage retry or backoff logic by default.

Required Fix Before Shipping

To avoid failed imports and improve resilience, we should implement RU throttling handling explicitly:
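The comment's list of fixes did not survive extraction, but the kind of handling it calls for is sketched below: retrying throttled documents with exponential backoff on error code 16500, the documented Cosmos DB for MongoDB throttling code. The helper name and retry policy are assumptions, not code from this PR.

```ts
import { MongoBulkWriteError, type Collection, type Document } from 'mongodb';

// Error code Cosmos DB for MongoDB returns when a request exceeds provisioned RUs.
const COSMOS_RU_THROTTLE_CODE = 16500;

// Hypothetical helper: unordered bulk insert with exponential backoff on RU throttling.
async function insertManyWithRuRetry(
    collection: Collection<Document>,
    docs: Document[],
    maxRetries = 5,
): Promise<number> {
    let insertedCount = 0;
    let pending = docs;

    for (let attempt = 0; pending.length > 0; attempt++) {
        try {
            const result = await collection.insertMany(pending, { ordered: false });
            return insertedCount + result.insertedCount;
        } catch (error) {
            if (!(error instanceof MongoBulkWriteError) || attempt >= maxRetries) {
                throw error;
            }
            const writeErrors = Array.isArray(error.writeErrors) ? error.writeErrors : [error.writeErrors];
            if (!writeErrors.every((we) => we.code === COSMOS_RU_THROTTLE_CODE)) {
                throw error; // only retry pure throttling failures
            }
            insertedCount += error.result.insertedCount; // keep what the server accepted
            // Retry only the rejected documents, after an exponential backoff.
            const failedIndexes = new Set(writeErrors.map((we) => we.index));
            pending = pending.filter((_, index) => failedIndexes.has(index));
            await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** attempt));
        }
    }
    return insertedCount;
}
```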
The issue occurred because the bulk insert operation exceeded the collection RU limit.
Due to the complexity of this issue, we decided to fix it in two steps:
Pull Request Overview
This PR improves the document import feature by introducing a generic document buffer (with separate defaults for MongoDB and CosmosDB), simplifying buffer management, and enhancing error handling for bulk insert operations. Key changes include:
- Implementation of a generic document buffer with configurable options (see the sketch after this list).
- Updated insertion routines in ClustersClient and importDocuments to leverage buffering and bulk operations.
- Enhanced logging and localized error message updates for better diagnostics.
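A sketch of the configurable options the first bullet describes; `DocumentBufferOptions` and `maxDocumentCount` appear in the diffs later in this thread, while the remaining names and all values are assumptions:

```ts
export interface DocumentBufferOptions {
    /** Flush once this many documents are buffered (set to 1 for the RU workaround). */
    maxDocumentCount: number;
    /** Flush once the serialized batch reaches this size (assumed field name). */
    maxBufferSizeBytes: number;
}

// Separate defaults per backend, as described above (values are illustrative):
const MONGODB_DEFAULTS: DocumentBufferOptions = { maxDocumentCount: 100, maxBufferSizeBytes: 4 * 1024 * 1024 };
const COSMOSDB_DEFAULTS: DocumentBufferOptions = { maxDocumentCount: 50, maxBufferSizeBytes: 1024 * 1024 };
```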
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/utils/documentBuffer.ts | Introduces a generic, configurable document buffer for batching. |
| src/documentdb/ClustersClient.ts | Adds new error handling and a helper to check for Azure Cosmos DB RU connections. |
| src/commands/importDocuments/importDocuments.ts | Updates the import flow to use buffering and bulk insert operations. |
| l10n/bundle.l10n.json | Updates localized strings for insertion error messages and logging. |
Comments suppressed due to low confidence (1)
src/commands/importDocuments/importDocuments.ts:349
- In the BufferFull case, the current document is reinserted into the buffer after flush, which might be confusing at first glance. Consider adding an inline comment or refactoring the logic to make the flow explicit.
```ts
if (insertOrFlushToBufferResult.errorCode === BufferFull) {
```
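The inline comment Copilot asks for might read like this sketch (the flush helper and surrounding flow are assumptions based on the excerpt):

```ts
if (insertOrFlushToBufferResult.errorCode === BufferFull) {
    // The buffer had no room for the current document: flush the accumulated
    // batch first, then re-insert the current document into the now-empty buffer.
    await flushBufferToCluster(node, buffer); // hypothetical flush helper
    buffer.insert(document); // fits now that the buffer is empty
}
```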
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@sevoku This looks good from my point of view. It improves performance for vCore and Cosmos DB, but there is no improvement for RU. With RU, we encountered issues with throughput and will need to address this in a separate ticket. As a workaround, we reduced the buffer size to "1", essentially reverting to the insertion of individual documents. This is slow enough for most cases, so throughput limits won't be hit. This is a temporary solution for RU (or rather, no solution), but we'll address it properly during the copy-and-paste work. There is already a dedicated issue for it. We also have a ticket for better reporting.

I tested this with vCore and RU for various configurations and error scenarios. It's being merged into DocumentDB. For Cosmos DB, I conducted tests, including partition key conflicts, but I might have missed something. 🎯 Please try it out in your Cosmos DB setups.
```diff
  import { type ItemDefinition, type JSONObject, type JSONValue, type PartitionKeyDefinition } from '@azure/cosmos';
- import { parseError, type IActionContext } from '@microsoft/vscode-azext-utils';
+ import { nonNullProp, parseError, type IActionContext } from '@microsoft/vscode-azext-utils';
```
Please use `nonNullProp` from our utils; our function provides information about properties in the message.
```diff
  const countUri = uris.length;
- const incrementUri = 50 / (countUri || 1);
+ const incrementUri = 25 / (countUri || 1);
```
Please either add a comment or create constant variables for 25 and 75 (25% and 25% * 3 of the 100% progress).
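A sketch of the named constants this comment asks for (constant names are illustrative):

```ts
// The progress bar splits into 25% for loading URIs/files and 75% for inserting documents.
const URI_LOADING_PROGRESS_SHARE = 25;
const DOCUMENT_INSERT_PROGRESS_SHARE = 75; // = URI_LOADING_PROGRESS_SHARE * 3

const incrementUri = URI_LOADING_PROGRESS_SHARE / (countUri || 1);
```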
```diff
- for (let i = 0, percent = 0; i < countUri; i++, percent += incrementUri) {
+ for (let i = 0; i < countUri; i++) {
+     const increment = (i + 1) * incrementUri;
```
Taking into consideration the code below, where there is also an increment, I can't figure out which approach is right. Here you increase the increment every time, and it looks like you set the percentage directly. But in the code below the increment never increases, and it looks like you add its value to the progress.

Updated: I figured out why it doesn't increase: you really don't know how many documents have already been inserted and how many are still in the buffer. In this case the progress bar message will be opaque to the user:

- The progress bar doesn't move, but the message shows how many documents were inserted.
- Even if the message says that 20 documents were inserted, that does not mean they actually were. Again, this is unclear for the user.
- You moved the buffer logic into the insert function, but the logic has to be reversed: you have to insert into the buffer, and when it returns an error saying it is full, you have to flush and insert one batch.
- The progress can easily be computed. One step is 75 / the number of documents; when you insert a batch, multiply the number of inserted documents by the value of one step and add it to the progress bar (see the sketch after this list).
- This way you also remove the odd check for whether the buffer has documents, since you will know. When the for loop ends, you just flush and insert, always.
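A sketch of the reversed flow and progress computation suggested above (`flushBufferToCluster` and the result shapes are illustrative assumptions; `progress.report` is the VS Code progress API):

```ts
const step = 75 / documents.length; // one document's share of the insert phase

for (const document of documents) {
    let result = buffer.insert(document);
    if (result.errorCode === BufferFull) {
        // The buffer is full: flush one batch, report real progress, then retry the insert.
        const flushed = await flushBufferToCluster(node, buffer);
        progress.report({ increment: flushed.insertedCount * step });
        result = buffer.insert(document); // the buffer is empty now, so this succeeds
    }
}

// After the loop, always flush the remainder; no "does the buffer have documents?" check needed.
const remaining = await flushBufferToCluster(node, buffer);
progress.report({ increment: remaining.insertedCount * step });
```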
```diff
- // await needs to catch the error here, otherwise it will be thrown to the caller
- return await insertDocumentIntoCluster(node, document as Document);
+ // Check for valid buffer
+ if (!buffer) {
```
If you return a fixed object when there is no buffer, why not set the type to `DocumentBuffer` only?

```ts
async function insertDocument(
    node: CosmosDBContainerResourceItem | CollectionItem,
    document: unknown,
    buffer: DocumentBuffer<unknown>, // <-- it is a stricter type
): Promise<{ count: number; errorOccurred: boolean }> {
```

```ts
/**
 * Create a document buffer configured for MongoDB
 */
export function createMongoDbBuffer<T>(customConfig?: Partial<DocumentBufferOptions>): DocumentBuffer<T> {
```
The function name already contains Mongo, so you can narrow the type:

```ts
// This buffer can keep only Document and all inherited classes
export function createMongoDbBuffer<T extends Document>(customConfig?: Partial<DocumentBufferOptions>): DocumentBuffer<T>
```
```ts
public insert(document: T): BufferInsertResult {
    // Check if the document is valid
    if (!document) {
```
But if I create a new buffer like `new DocumentBuffer<undefined>`, this condition will be wrong. Please see the comment above about `T`.
```ts
/**
 * Document buffer for a specific database/collection pair.
 * Used for batching document inserts to improve performance.
 */
export class DocumentBuffer<T> {
```
This generic definition is wrong:

- The class name contains `Document`, so the type has to be narrowed to `T extends Document | ItemDefinition | <any document type>`.
- A bare `T` might lead to wrong behavior and redundant checks. See the comment below.
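The narrowing asked for above, as a sketch (the import sources are assumptions):

```ts
import { type ItemDefinition } from '@azure/cosmos';
import { type Document } from 'mongodb';

// Narrowed generic: the buffer only accepts document-like types, so insert()
// no longer needs runtime checks for impossible values such as undefined.
export class DocumentBuffer<T extends Document | ItemDefinition> {
    private readonly documents: T[] = [];
}
```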
```ts
/**
 * Error codes for document buffer operations
 */
export enum BufferErrorCode {
```
NIT: Please try to avoid string enums. You can find more information on the internet; for example:
https://dev.to/ivanzm123/dont-use-enums-in-typescript-they-are-very-dangerous-57bh
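A common enum-free alternative, sketched with `BufferFull` from elsewhere in this thread plus a hypothetical second member:

```ts
// A const object plus a union type gives string-enum ergonomics without the pitfalls.
export const BufferErrorCode = {
    BufferFull: 'BufferFull', // used in the import flow shown earlier in this thread
    InvalidDocument: 'InvalidDocument', // hypothetical second member
} as const;

export type BufferErrorCode = (typeof BufferErrorCode)[keyof typeof BufferErrorCode];
```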
@bk201- never disappoints, thank you for sharing! I can always learn something new (to me) thanks to you.
```ts
if (isRuResource) {
    // For Azure MongoDB RU, we use a buffer with maxDocumentCount = 1
    buffer = createMongoDbBuffer<unknown>({
```
But now you know the type; moreover, the function name already says what type it must be. See my comment below:

```ts
buffer = createMongoDbBuffer<Document>
```

```ts
    node: CollectionItem,
    buffer: DocumentBuffer<unknown>,
    document?: Document,
    // If document is undefined, it means that we are flushing the buffer
```
This looks like a hack and unexpected behavior. In order to use this function, everyone HAS TO read this comment. This code smells.
Good point. It's something I discussed "offline" with @xingfan-git
Let's leave it as is in this iteration; there is another thing we're working on here and I wanted to finally get it started:
microsoft/vscode-documentdb#63
Once it finalizes, we'll be moving import and export to the new task service and, while working on it, we'll improve the overall code and comment quality around import and export.
This work will be packaged as a module for sharing.
@bk201- Thank you for your detailed review.
@xingfan-git We'd like to ship this feature soon. Please address Dmitry's comments.
June 10th Update: #2692 (comment)
This pull request updates the insert-document feature for DocumentDB. To improve import performance, it uses a buffer to reduce the number of calls to the server and updates error-message handling accordingly.
- `ClusterDocumentBufferManager`: we don't really need a manager here; a buffer that handles buffering and size measurement should be sufficient.
- `ClusterDocumentBufferManager` management.
- `Document` class? Or make it generic?
- `Mongo` and for `CosmosDB`.
- `CosmosDB Core` as well, so that `Azure Databases` benefits from the improvement too.

Fixes #2582