perf: batch dataset ingestion commits and fix spurious update logging #581
lewisjared wants to merge 8 commits into main from ingest-batch
Conversation
Batch ingestion commits in groups of 50 to reduce SQLite fsync overhead from N commits to N/50. Memory is managed via `expire_all()` after each batch. Fix incorrect "updated" logging caused by numpy scalar types (e.g. `numpy.int64`) comparing unequal to Python native types from the database. Added `_to_python_native()` to normalize values before comparison in `_values_differ()`.
* origin/main: Bump version: 0.12.1 → 0.12.2 chore: add second EOF Bump version: 0.12.0 → 0.12.1 chore: support mip_id
fix: resolve false positive file updates from cftime vs string comparison
File metadata (start_time/end_time) is stored as strings in the DB, but incoming values from parsers are `cftime.datetime` objects. The raw `!=` comparison always returned True for these mismatched types, causing every file to be flagged as "updated" on every re-ingest. Add `_file_meta_differs()`, which coerces non-string values to `str` before comparing, matching the `DatasetFile._coerce_time_to_str` storage behavior.
Merge branch 'ingest-batch' of github.com:Climate-REF/climate-ref into ingest-batch:
* fix: resolve false positive file updates from cftime vs string comparison
* fix: filtering datasets
Pull request overview
This PR improves dataset ingestion performance and correctness:
- Batched commits: Datasets are now ingested in transactions of 50 (configurable via `INGEST_BATCH_SIZE`), reducing SQLite fsync overhead, with `expire_all()` called after each batch to control memory usage.
- Fix spurious "updated" logging: Adds `_to_python_native()` to normalize numpy scalar types before comparison in `_values_differ()`, fixing false mismatches between DataFrame numpy values and DB Python native values.
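The batched-commit flow described above can be sketched as follows. This is a hand-written illustration against a generic session object, not the PR's actual code; `INGEST_BATCH_SIZE` is the only name taken from the PR:

```python
INGEST_BATCH_SIZE = 50

def ingest_batched(session, datasets):
    """Commit in batches to amortize per-commit fsync cost (sketch)."""
    count = 0
    for dataset in datasets:
        session.add(dataset)
        count += 1
        if count % INGEST_BATCH_SIZE == 0:
            session.commit()
            # Drop cached ORM state so memory stays bounded on large ingests
            session.expire_all()
    # Flush any final partial batch (an empty commit is harmless)
    session.commit()
    session.expire_all()
```

With N datasets, this issues roughly N / 50 commits plus one trailing commit, instead of N.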
Changes:
- Batched transaction commits in dataset ingestion (`__init__.py`), reducing N commits to N/50
- Added `_to_python_native()` helper in `database.py` and updated `_values_differ()` to call it; improved logging in `update_or_create()` to show before/after values
- Refactored `datasets/base.py` to add a `filters` parameter to `_get_dataset_files`, `_get_datasets`, and `load_catalog`; added `_file_meta_differs()` for cftime-aware file metadata comparison
- CLI `datasets list` now pushes filtering down to the DB query instead of post-loading via `apply_dataset_filters`
- Updated `scripts/fetch-esgf.py` to use a `max_members` integer parameter instead of a `remove_ensembles` bool
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| packages/climate-ref/src/climate_ref/database.py | Adds `_to_python_native()` to normalize numpy scalars; fixes `_values_differ()` and improves `update_or_create()` logging |
| packages/climate-ref/src/climate_ref/datasets/__init__.py | Adds batched commit logic (`INGEST_BATCH_SIZE=50`) and `_accumulate_stats()` helper; improves docstring |
| packages/climate-ref/src/climate_ref/datasets/base.py | Adds `_file_meta_differs()` for cftime comparison; adds `filters` parameter to query methods; uses `_file_meta_differs` in file change detection |
| packages/climate-ref/src/climate_ref/cli/datasets.py | Pushes dataset filters to the DB query via `load_catalog(filters=...)` instead of in-memory post-filtering |
| scripts/fetch-esgf.py | Replaces `remove_ensembles: bool` with `max_members: int` for fine-grained ensemble member control |
| packages/climate-ref/tests/unit/test_database.py | Adds tests for `_to_python_native()` and numpy scalar handling in `_values_differ()` |
| packages/climate-ref/tests/unit/datasets/test_datasets.py | Adds tests for `_file_meta_differs()` including cftime comparison cases |
| changelog/581.improvement.md | New changelog entry for this improvement |
> By default, up to 10 ensemble members per source_id are fetched to reduce the total data volume.
> This can be changed with the --max-ensembles flag (0 = no limit).
The module-level docstring says "up to 10 ensemble members per source_id" (line 5) and "--max-ensembles flag" (line 6), but the actual default for `max_members` is 1 (line 353 in the CLI option and lines 27/72 in the fetch methods), and the CLI flag name is `--max-members`, not `--max-ensembles`. These three documentation errors are inconsistent with the actual code behavior.
> max_members : int, default 10
> Maximum number of ensemble members to fetch per source_id.
The docstring says `max_members : int, default 10`, but the actual default parameter value in the function signature is 1. This is a discrepancy between the documentation and the implementation: the docstring default should be 1 to match the signature.
> By default, up to 1 ensemble members per source_id are fetched, but this can be
> changed with --max-ensembles (use 0 for no limit).
The main function docstring says "up to 1 ensemble members" (grammatically incorrect; it should be "member") and references `--max-ensembles` as the flag name, but the actual Typer option will generate `--max-members` from the `max_members` parameter name. Both the grammar and the flag name are wrong.
Suggested change:
```diff
- By default, up to 1 ensemble members per source_id are fetched, but this can be
- changed with --max-ensembles (use 0 for no limit).
+ By default, up to 1 ensemble member per source_id is fetched, but this can be
+ changed with --max-members (use 0 for no limit).
```
```python
return pd.DataFrame(
    [{k: getattr(dataset, k) for k in self.dataset_specific_metadata} for dataset in result_datasets],
```
In the `_get_datasets` method, the index of the returned DataFrame is built using `[file.id for file in result_datasets]` (line 482, part of the refactored function). However, `result_datasets` contains `Dataset` objects, not `DatasetFile` objects. The loop variable `file` is a misleading name and should be `dataset`, consistent with the comprehension on the line above it (`for dataset in result_datasets`). This naming inconsistency makes the code harder to read and maintain.
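A self-contained sketch of the renamed comprehension; `FakeDataset`, `build_frame`, and `metadata_fields` are hypothetical stand-ins for the real ORM types and method, used only to make the example runnable:

```python
import pandas as pd

class FakeDataset:
    """Stand-in for the ORM Dataset object (hypothetical fields)."""
    def __init__(self, id, slug):
        self.id = id
        self.slug = slug

def build_frame(result_datasets, metadata_fields):
    # Use `dataset` consistently in both comprehensions, as the review suggests
    return pd.DataFrame(
        [{k: getattr(dataset, k) for k in metadata_fields} for dataset in result_datasets],
        index=[dataset.id for dataset in result_datasets],
    )
```

The behavior is unchanged; only the loop variable name is made consistent between the row-building and index-building comprehensions.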
| ) | ||
|
|
||
|
|
||
```python
def _accumulate_stats(stats: IngestionStats, results: "DatasetRegistrationResult") -> None:
```
The type annotation for the `results` parameter in `_accumulate_stats` uses a forward reference string `"DatasetRegistrationResult"`, but `DatasetRegistrationResult` is already imported at line 13 (`from climate_ref.datasets.base import DatasetAdapter, DatasetRegistrationResult`). The forward reference string is unnecessary and can be replaced with the direct type `DatasetRegistrationResult`.
Suggested change:
```diff
- def _accumulate_stats(stats: IngestionStats, results: "DatasetRegistrationResult") -> None:
+ def _accumulate_stats(stats: IngestionStats, results: DatasetRegistrationResult) -> None:
```
Description
Two performance/correctness improvements to dataset ingestion:
Batched commits: Ingestion now commits datasets in batches of 50 instead of one transaction per dataset. This reduces SQLite fsync overhead from N commits to N/50. Memory is managed via `expire_all()` after each batch to prevent ORM object accumulation.

Fix spurious "updated" logging: numpy scalar types (e.g. `numpy.int64`) from DataFrame values were comparing unequal to Python native types from the database, causing `_values_differ()` to report false mismatches. Added `_to_python_native()` to normalize values before comparison.

Checklist
Please confirm that this pull request has done the following:
changelog/