perf: batch dataset ingestion commits and fix spurious update logging #581
lewisjared wants to merge 8 commits into main from ingest-batch
Conversation
Batch ingestion commits in groups of 50 to reduce SQLite fsync overhead from N commits to N/50. Memory is managed via `expire_all()` after each batch. Fix incorrect "updated" logging caused by numpy scalar types (e.g. `numpy.int64`) comparing unequal to Python native types from the database. Added `_to_python_native()` to normalize values before comparison in `_values_differ()`.
* origin/main: Bump version: 0.12.1 → 0.12.2 chore: add second EOF Bump version: 0.12.0 → 0.12.1 chore: support mip_id
fix: resolve false positive file updates from cftime vs string comparison
File metadata (start_time/end_time) is stored as strings in the DB, but incoming values from parsers are `cftime.datetime` objects. The raw `!=` comparison always returned True for these mismatched types, causing every file to be flagged as "updated" on every re-ingest. Add `_file_meta_differs()`, which coerces non-string values to `str` before comparing, matching the `DatasetFile._coerce_time_to_str` storage behavior.
Merge branch 'ingest-batch' of github.com:Climate-REF/climate-ref into ingest-batch:
* fix: resolve false positive file updates from cftime vs string comparison
* fix: filtering datasets
Pull request overview
This PR improves dataset ingestion performance and correctness:
- Batched commits: Datasets are now ingested in transactions of 50 (configurable via `INGEST_BATCH_SIZE`), reducing SQLite fsync overhead, with `expire_all()` called after each batch to control memory usage.
- Fix spurious "updated" logging: Adds `_to_python_native()` to normalize numpy scalar types before comparison in `_values_differ()`, fixing false mismatches between DataFrame numpy values and DB Python native values.
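The batched-commit flow described above can be sketched as follows. This is a hand-written illustration against a generic session object, not the PR's actual code; `INGEST_BATCH_SIZE` is the only name taken from the PR:

```python
INGEST_BATCH_SIZE = 50

def ingest_batched(session, datasets):
    """Commit in batches to amortize per-commit fsync cost (sketch)."""
    count = 0
    for dataset in datasets:
        session.add(dataset)
        count += 1
        if count % INGEST_BATCH_SIZE == 0:
            session.commit()
            # Drop cached ORM state so memory stays bounded on large ingests
            session.expire_all()
    # Flush any final partial batch (an empty commit is harmless)
    session.commit()
    session.expire_all()
```

With N datasets, this issues roughly N / 50 commits plus one trailing commit, instead of N.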
Changes:
- Batched transaction commits in dataset ingestion (`__init__.py`), reducing N commits to N/50
- Added `_to_python_native()` helper in `database.py` and updated `_values_differ()` to call it; improved logging in `update_or_create()` to show before/after values
- Refactored `datasets/base.py` to add a `filters` parameter to `_get_dataset_files`, `_get_datasets`, and `load_catalog`; added `_file_meta_differs()` for cftime-aware file metadata comparison
- CLI `datasets list` now pushes filtering down to the DB query instead of post-loading via `apply_dataset_filters`
- Updated `scripts/fetch-esgf.py` to use a `max_members` integer parameter instead of a `remove_ensembles` bool
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| packages/climate-ref/src/climate_ref/database.py | Adds `_to_python_native()` to normalize numpy scalars; fixes `_values_differ()` and improves `update_or_create()` logging |
| packages/climate-ref/src/climate_ref/datasets/__init__.py | Adds batched commit logic (`INGEST_BATCH_SIZE=50`) and `_accumulate_stats()` helper; improves docstring |
| packages/climate-ref/src/climate_ref/datasets/base.py | Adds `_file_meta_differs()` for cftime comparison; adds `filters` parameter to query methods; uses `_file_meta_differs` in file change detection |
| packages/climate-ref/src/climate_ref/cli/datasets.py | Pushes dataset filters to the DB query via `load_catalog(filters=...)` instead of in-memory post-filtering |
| scripts/fetch-esgf.py | Replaces `remove_ensembles: bool` with `max_members: int` for fine-grained ensemble member control |
| packages/climate-ref/tests/unit/test_database.py | Adds tests for `_to_python_native()` and numpy scalar handling in `_values_differ()` |
| packages/climate-ref/tests/unit/datasets/test_datasets.py | Adds tests for `_file_meta_differs()` including cftime comparison cases |
| changelog/581.improvement.md | New changelog entry for this improvement |
> By default, up to 10 ensemble members per source_id are fetched to reduce the total data volume.
> This can be changed with the --max-ensembles flag (0 = no limit).
The module-level docstring says "up to 10 ensemble members per source_id" (line 5) and "--max-ensembles flag" (line 6), but the actual default for `max_members` is 1 (line 353 in the CLI option and lines 27/72 in the fetch methods), and the CLI flag name is `--max-members`, not `--max-ensembles`. These three documentation errors are inconsistent with the actual code behavior.
> max_members : int, default 10
> Maximum number of ensemble members to fetch per source_id.
The docstring says `max_members : int, default 10`, but the actual default parameter value in the function signature is 1. This is a discrepancy between the documentation and the implementation: the docstring default should be 1 to match the signature.
> By default, up to 1 ensemble members per source_id are fetched, but this can be
> changed with --max-ensembles (use 0 for no limit).
The main function docstring says "up to 1 ensemble members" (grammatically incorrect; it should be "member") and references `--max-ensembles` as the flag name, but the actual Typer option will generate `--max-members` from the `max_members` parameter name. Both the grammar and the flag name are wrong.
Suggested change:
```diff
- By default, up to 1 ensemble members per source_id are fetched, but this can be
- changed with --max-ensembles (use 0 for no limit).
+ By default, up to 1 ensemble member per source_id is fetched, but this can be
+ changed with --max-members (use 0 for no limit).
```
```python
return pd.DataFrame(
    [{k: getattr(dataset, k) for k in self.dataset_specific_metadata} for dataset in result_datasets],
```
In the `_get_datasets` method, the index of the returned DataFrame is built using `[file.id for file in result_datasets]` (line 482, part of the refactored function). However, `result_datasets` contains `Dataset` objects, not `DatasetFile` objects. The loop variable `file` is a misleading name and should be `dataset`, consistent with the comprehension on the line above it (`for dataset in result_datasets`). This naming inconsistency makes the code harder to read and maintain.
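A self-contained sketch of the renamed comprehension; `FakeDataset`, `build_frame`, and `metadata_fields` are hypothetical stand-ins for the real ORM types and method, used only to make the example runnable:

```python
import pandas as pd

class FakeDataset:
    """Stand-in for the ORM Dataset object (hypothetical fields)."""
    def __init__(self, id, slug):
        self.id = id
        self.slug = slug

def build_frame(result_datasets, metadata_fields):
    # Use `dataset` consistently in both comprehensions, as the review suggests
    return pd.DataFrame(
        [{k: getattr(dataset, k) for k in metadata_fields} for dataset in result_datasets],
        index=[dataset.id for dataset in result_datasets],
    )
```

The behavior is unchanged; only the loop variable name is made consistent between the row-building and index-building comprehensions.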
| ) | ||
|
|
||
|
|
||
```python
def _accumulate_stats(stats: IngestionStats, results: "DatasetRegistrationResult") -> None:
```
The type annotation for the `results` parameter in `_accumulate_stats` uses a forward reference string `"DatasetRegistrationResult"`, but `DatasetRegistrationResult` is already imported at line 13 (`from climate_ref.datasets.base import DatasetAdapter, DatasetRegistrationResult`). The forward reference string is unnecessary and can be replaced with the direct type `DatasetRegistrationResult`.
Suggested change:
```diff
- def _accumulate_stats(stats: IngestionStats, results: "DatasetRegistrationResult") -> None:
+ def _accumulate_stats(stats: IngestionStats, results: DatasetRegistrationResult) -> None:
```
Description
Two performance/correctness improvements to dataset ingestion:
Batched commits: Ingestion now commits datasets in batches of 50 instead of one transaction per dataset. This reduces SQLite fsync overhead from N commits to N/50. Memory is managed via `expire_all()` after each batch to prevent ORM object accumulation.

Fix spurious "updated" logging: numpy scalar types (e.g. `numpy.int64`) from DataFrame values were comparing unequal to Python native types from the database, causing `_values_differ()` to report false mismatches. Added `_to_python_native()` to normalize values before comparison.

Checklist
Please confirm that this pull request has done the following:
changelog/