perf: Optimize metadata records processing in `SqlStorageClient` #1551

Mantisus · 2025-11-11T14:09:29Z

Description

This PR adds buffer tables to improve the handling of metadata records. The key change is that metadata updates are now accumulated in a buffer table and applied when `get_metadata` is called. Previously, metadata records were updated immediately within each transaction, which caused waits for lock release under high concurrency.
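
A minimal sketch of the accumulate-then-apply idea, using the standard-library `sqlite3` module rather than the project's SQLAlchemy models. The `dataset_metadata` table, its columns, and the helper names `add_item_delta` and `read_item_count` are assumptions made for illustration; only the `dataset_metadata_buffer` table name comes from this PR.

```python
import sqlite3


def make_buffer_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical metadata table and its buffer."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dataset_metadata (
            id TEXT PRIMARY KEY,
            item_count INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE dataset_metadata_buffer (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            dataset_id TEXT NOT NULL,
            delta_item_count INTEGER NOT NULL
        );
        INSERT INTO dataset_metadata (id) VALUES ('ds1');
        """
    )
    return conn


def add_item_delta(conn: sqlite3.Connection, dataset_id: str, delta: int) -> None:
    """Record a change as a new buffer row; the shared metadata row is not touched."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO dataset_metadata_buffer (dataset_id, delta_item_count) VALUES (?, ?)",
            (dataset_id, delta),
        )


def read_item_count(conn: sqlite3.Connection, dataset_id: str) -> int:
    """Fold pending buffer rows into the metadata row, then return the item count."""
    with conn:
        last_id, total = conn.execute(
            "SELECT MAX(id), COALESCE(SUM(delta_item_count), 0) "
            "FROM dataset_metadata_buffer WHERE dataset_id = ?",
            (dataset_id,),
        ).fetchone()
        if last_id is not None:
            conn.execute(
                "UPDATE dataset_metadata SET item_count = item_count + ? WHERE id = ?",
                (total, dataset_id),
            )
            # Delete only the rows that were aggregated above.
            conn.execute(
                "DELETE FROM dataset_metadata_buffer WHERE dataset_id = ? AND id <= ?",
                (dataset_id, last_id),
            )
    row = conn.execute(
        "SELECT item_count FROM dataset_metadata WHERE id = ?", (dataset_id,)
    ).fetchone()
    return row[0]
```

In this sketch writers only append rows, so they never compete for the single metadata row; the cost of aggregation moves to the reader.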

Issues

Closes: Updating request queue metada performs full table scan in SQL storage #1533

vdusek

Please use the PR type without an exclamation mark.

janbuchar · 2025-11-18T09:39:13Z

Interesting! I'd imagine that transactions consisting of e.g., an insertion to the dataset_items table and an update to dataset metadata wouldn't lock the metadata table for that long - you can commit right after the update to metadata.

Also, the buffering approach is faster because the buffer table gets a row for each increment and those get compacted later on, correct?

Mantisus · 2025-11-18T12:26:14Z

> update to dataset metadata wouldn't lock the metadata table for that long

They will create many short-lived locks, and with a large number of clients inserting new records at high concurrency, the effect accumulates.
This is exactly what @ericvg97 pointed out in #1533 (comment).

The impact is strongest on `RequestQueue`, of course.

Yes, insert operations into the buffer table are quite fast. We can then simply apply the result of the aggregations to update the metadata record.

janbuchar · 2025-11-19T14:02:43Z

> > update to dataset metadata wouldn't lock the metadata table for that long
>
> They will create many short-lived locks. And with a large number of clients with high concurrency inserting new records, this effect will accumulate. This is exactly what @ericvg97 pointed out - #1533 (comment)
>
> Although, of course, the strongest impact on RequestQueue

I see, thanks. And is there any chance that the lock is held for too long because of how we work with sqlalchemy? In other words, would it be better if we just executed sql such as insert ...; update ...; commit in one go? If yes, it might be worth trying before adding three new tables to the whole thing.
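
For illustration, the "insert, update, commit in one go" alternative could look roughly like the sketch below. It uses the standard-library `sqlite3` module instead of SQLAlchemy, and the `dataset_metadata` table, the column names, and the `push_item` helper are hypothetical; only the `dataset_items` table name is taken from this thread.

```python
import sqlite3


def make_items_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical item and metadata tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dataset_metadata (
            id TEXT PRIMARY KEY,
            item_count INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE dataset_items (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            dataset_id TEXT NOT NULL,
            data TEXT NOT NULL
        );
        INSERT INTO dataset_metadata (id) VALUES ('ds1');
        """
    )
    return conn


def push_item(conn: sqlite3.Connection, dataset_id: str, data: str) -> None:
    """Insert one item and bump the counter inside a single short transaction."""
    with conn:  # both statements are committed together when the block exits
        conn.execute(
            "INSERT INTO dataset_items (dataset_id, data) VALUES (?, ?)",
            (dataset_id, data),
        )
        conn.execute(
            "UPDATE dataset_metadata SET item_count = item_count + 1 WHERE id = ?",
            (dataset_id,),
        )
```

Note that even with a short transaction, every writer in this sketch still updates the same metadata row.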

Mantisus · 2025-11-19T15:44:34Z

> it might be worth trying before adding three new tables to the whole thing.

I will test this approach.

Mantisus · 2026-01-26T10:21:49Z

> I will test this approach.

Unfortunately, switching to queries such as insert ...; update ...; commit does not improve the situation. Concurrent access to metadata records remains a bottleneck.

So far, switching to additional tables with on-demand metadata processing has shown the best results.
This also opens up the possibility of integrating MySQL, provided that the DBMS is running with transaction_isolation=READ-COMMITTED.

Copilot

Pull request overview

This PR optimizes metadata record processing in SqlStorageClient by introducing buffer tables to defer metadata updates and reduce database lock contention in high-concurrency scenarios. Instead of updating metadata instantly within transactions, updates are now accumulated in buffer tables and applied when get_metadata is called.

Changes:

- Added buffer tables (dataset_metadata_buffer, key_value_store_metadata_buffer, request_queue_metadata_buffer) to accumulate metadata updates
- Implemented buffer processing mechanism with locking to prevent concurrent buffer processing
- Increased database connection pool settings to handle higher concurrency
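
One way such a lock could work is a time-limited lease stored in a `buffer_locked_until` column, claimed with a conditional `UPDATE`. The sketch below is an assumption about the general technique, not the PR's actual implementation: the table layout and the helpers `try_acquire_buffer_lock` and `release_buffer_lock` are hypothetical, and it uses the standard-library `sqlite3` module.

```python
import sqlite3


def make_lock_db() -> sqlite3.Connection:
    """Create an in-memory metadata table with a hypothetical lease column."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dataset_metadata (
            id TEXT PRIMARY KEY,
            buffer_locked_until REAL
        );
        INSERT INTO dataset_metadata (id) VALUES ('ds1');
        """
    )
    return conn


def try_acquire_buffer_lock(
    conn: sqlite3.Connection, dataset_id: str, now: float, lease_seconds: float = 5.0
) -> bool:
    """Claim the lease only if it is free or expired; return True on success."""
    with conn:
        cursor = conn.execute(
            "UPDATE dataset_metadata SET buffer_locked_until = ? "
            "WHERE id = ? AND (buffer_locked_until IS NULL OR buffer_locked_until < ?)",
            (now + lease_seconds, dataset_id, now),
        )
    # Exactly one row is modified when this caller won the lease.
    return cursor.rowcount == 1


def release_buffer_lock(conn: sqlite3.Connection, dataset_id: str) -> None:
    """Clear the lease so the next caller can process the buffer immediately."""
    with conn:
        conn.execute(
            "UPDATE dataset_metadata SET buffer_locked_until = NULL WHERE id = ?",
            (dataset_id,),
        )
```

Because the lease expires on its own, a client that crashes while processing the buffer does not block others indefinitely; the trade-off is that the lease must outlast a normal processing run.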

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| src/crawlee/storage_clients/_sql/_storage_client.py | Removed `_accessed_modified_update_interval` mechanism; increased connection pool sizes |
| src/crawlee/storage_clients/_sql/_db_models.py | Added buffer table models and `buffer_locked_until` field to metadata tables; added indexes |
| src/crawlee/storage_clients/_sql/_client_mixin.py | Implemented core buffer processing logic including `_add_buffer_record`, `_process_buffers`, and lock management |
| src/crawlee/storage_clients/_sql/_request_queue_client.py | Integrated buffer system; replaced direct metadata updates with buffer records |
| src/crawlee/storage_clients/_sql/_key_value_store_client.py | Integrated buffer system for KVS operations |
| src/crawlee/storage_clients/_sql/_dataset_client.py | Integrated buffer system for dataset operations |
| tests/unit/storage_clients/_sql/test_sql_rq_client.py | Removed test fixtures that modified `_accessed_modified_update_interval` |
| tests/unit/storage_clients/_sql/test_sql_kvs_client.py | Removed test fixtures that modified `_accessed_modified_update_interval` |
| tests/unit/storage_clients/_sql/test_sql_dataset_client.py | Removed test fixtures that modified `_accessed_modified_update_interval` |
| docs/guides/storage_clients.mdx | Updated ER diagrams to include new buffer tables |

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 11 comments.

Comments suppressed due to low confidence (1)

src/crawlee/storage_clients/_sql/_dataset_client.py:290

The docstring mentions a session parameter that doesn't exist in the method signature. This appears to be leftover documentation from before the refactoring. The parameter should be removed from the docstring to match the actual method signature.

            session: The SQLAlchemy AsyncSession to use for the update.
            new_item_count: If provided, set item count to this value.
            delta_item_count: If provided, add this value to the current item count.

janbuchar · 2026-01-27T09:13:07Z

> > I will test this approach.
>
> Unfortunately, switching to queries such as insert ...; update ...; commit does not improve the situation. Concurrent access to metadata records remains a bottleneck.
>
> So far, switching to additional tables with on-demand metadata processing has shown the best results.

So be it 😐

> This also opens up the possibility of integrating MySQL, provided that the DBMS is running with transaction_isolation=READ-COMMITTED.

You should be able to set that in connection options if I'm not missing something.

Mantisus added 4 commits November 10, 2025 18:17

add buffer tables

5d0802b

some optimization

4b9386f

index optimization

7848a21

fix

3665a4c

Mantisus self-assigned this Nov 11, 2025

vdusek reviewed Nov 11, 2025

Mantisus changed the title ~~perf!: Optimize metadata records processing in 'SqlStorageClient`~~ perf: Optimize metadata records processing in 'SqlStorageClient` Nov 11, 2025

Mantisus added 9 commits November 11, 2025 15:41

Merge branch 'master' into sql-metadata-buffer

4236b20

recalculate only for is_empty

4e74b70

add lenght for all String fields

251253c

Merge branch 'master' into sql-metadata-buffer

d7171db

consistent use _update_metadata

c8054b9

up block time interval

83209ce

fix

1ee46da

up lengrh for data

d4cd262

up docs

eae231d

Mantisus marked this pull request as ready for review November 17, 2025 14:03

Mantisus requested review from janbuchar and vdusek November 17, 2025 14:03

Mantisus changed the title ~~perf: Optimize metadata records processing in 'SqlStorageClient`~~ perf: Optimize metadata records processing in SqlStorageClient Nov 18, 2025

Mantisus marked this pull request as draft November 26, 2025 16:36

Mantisus added 2 commits January 26, 2026 00:38

Merge branch 'master' into sql-metadata-buffer

c3fd6e8

remove foreign key constraint for buffer tables

e0d07b4

Mantisus marked this pull request as ready for review January 26, 2026 10:21

vdusek requested a review from Copilot January 26, 2026 13:06

Copilot started reviewing on behalf of vdusek January 26, 2026 13:06

Copilot AI reviewed Jan 26, 2026

Mantisus and others added 3 commits January 26, 2026 15:16

Update src/crawlee/storage_clients/_sql/_request_queue_client.py

4ed54a9

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fixes

aea99e7

merge

7bf9e1e

Mantisus requested a review from Copilot January 26, 2026 14:17

Copilot started reviewing on behalf of Mantisus January 26, 2026 14:17

Copilot AI reviewed Jan 26, 2026

Mantisus and others added 5 commits January 26, 2026 16:37

Update src/crawlee/storage_clients/_sql/_db_models.py

c48f179

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update src/crawlee/storage_clients/_sql/_client_mixin.py

24fe92d

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update src/crawlee/storage_clients/_sql/_request_queue_client.py

83225de

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

update types

bedcd8d

change level message

9d143a7
