
perf: batch dataset ingestion commits and fix spurious update logging#581

Open
lewisjared wants to merge 8 commits into main from ingest-batch

Conversation


lewisjared (Contributor) commented Mar 6, 2026

Description

Two performance/correctness improvements to dataset ingestion:

  1. Batched commits: Ingestion now commits datasets in batches of 50 instead of one transaction per dataset. This reduces SQLite fsync overhead from N commits to N/50. Memory is managed via expire_all() after each batch to prevent ORM object accumulation.

  2. Fix spurious "updated" logging: numpy scalar types (e.g. numpy.int64) from DataFrame values were comparing unequal to Python native types from the database, causing _values_differ() to report false mismatches. Added _to_python_native() to normalize values before comparison.

Checklist

Please confirm that this pull request has done the following:

  • Tests added
  • Documentation added (where applicable)
  • Changelog item added to changelog/

Batch ingestion commits in groups of 50 to reduce SQLite fsync
overhead from N commits to N/50. Memory is managed via expire_all()
after each batch.

Fix incorrect "updated" logging caused by numpy scalar types
(e.g. numpy.int64) comparing unequal to Python native types from
the database. Added _to_python_native() to normalize values before
comparison in _values_differ().
* origin/main:
  Bump version: 0.12.1 → 0.12.2
  chore: add second EOF
  Bump version: 0.12.0 → 0.12.1
  chore: support mip_id

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 89.65517% with 6 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
...kages/climate-ref/src/climate_ref/datasets/base.py | 73.91% | 4 Missing and 2 partials ⚠️

Flag | Coverage Δ
core | 93.04% <89.65%> (-0.07%) ⬇️
providers | 91.79% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
...ckages/climate-ref/src/climate_ref/cli/datasets.py | 88.99% <100.00%> (ø)
packages/climate-ref/src/climate_ref/database.py | 95.74% <100.00%> (+0.22%) ⬆️
...s/climate-ref/src/climate_ref/datasets/__init__.py | 97.33% <100.00%> (+0.23%) ⬆️
...kages/climate-ref/src/climate_ref/datasets/base.py | 94.91% <73.91%> (-3.23%) ⬇️

fix: resolve false positive file updates from cftime vs string comparison

File metadata (start_time/end_time) is stored as strings in the DB
but incoming values from parsers are cftime.datetime objects. The raw
!= comparison always returned True for these mismatched types, causing
every file to be flagged as "updated" on every re-ingest.

Add _file_meta_differs() that coerces non-string values to str before
comparing, matching the DatasetFile._coerce_time_to_str storage behavior.
…o ingest-batch

* 'ingest-batch' of github.com:Climate-REF/climate-ref:
  fix: resolve false positive file updates from cftime vs string comparison
  fix: filtering datasets

Copilot AI left a comment


Pull request overview

This PR improves dataset ingestion performance and correctness:

  • Batched commits: Datasets are now ingested in transactions of 50 (configurable via INGEST_BATCH_SIZE), reducing SQLite fsync overhead, with expire_all() called after each batch to control memory usage.
  • Fix spurious "updated" logging: Adds _to_python_native() to normalize numpy scalar types before comparison in _values_differ(), fixing false mismatches between DataFrame numpy values and DB Python native values.

Changes:

  • Batched transaction commits in dataset ingestion (__init__.py), reducing N commits to N/50
  • Added _to_python_native() helper in database.py and updated _values_differ() to call it; improved logging in update_or_create() to show before/after values
  • Refactored datasets/base.py to add filters parameter to _get_dataset_files, _get_datasets, and load_catalog; added _file_meta_differs() for cftime-aware file metadata comparison
  • CLI datasets list now pushes filtering down to the DB query instead of post-loading via apply_dataset_filters
  • Updated scripts/fetch-esgf.py to use max_members integer parameter instead of a remove_ensembles bool

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file

File | Description
packages/climate-ref/src/climate_ref/database.py | Adds _to_python_native() to normalize numpy scalars; fixes _values_differ() and improves update_or_create() logging
packages/climate-ref/src/climate_ref/datasets/__init__.py | Adds batched commit logic (INGEST_BATCH_SIZE=50) and _accumulate_stats() helper; improves docstring
packages/climate-ref/src/climate_ref/datasets/base.py | Adds _file_meta_differs() for cftime comparison; adds filters parameter to query methods; uses _file_meta_differs in file change detection
packages/climate-ref/src/climate_ref/cli/datasets.py | Pushes dataset filters to the DB query via load_catalog(filters=...) instead of in-memory post-filtering
scripts/fetch-esgf.py | Replaces remove_ensembles: bool with max_members: int for fine-grained ensemble member control
packages/climate-ref/tests/unit/test_database.py | Adds tests for _to_python_native() and numpy scalar handling in _values_differ()
packages/climate-ref/tests/unit/datasets/test_datasets.py | Adds tests for _file_meta_differs() including cftime comparison cases
changelog/581.improvement.md | New changelog entry for this improvement


Comment on lines +5 to +6
By default, up to 10 ensemble members per source_id are fetched to reduce the total data volume.
This can be changed with the --max-ensembles flag (0 = no limit).

Copilot AI Mar 7, 2026


The module-level docstring says "up to 10 ensemble members per source_id" (line 5) and "--max-ensembles flag" (line 6), but the actual default for max_members is 1 (line 353 in the CLI option and line 27/72 in the fetch methods), and the CLI flag name is --max-members, not --max-ensembles. These three documentation errors are inconsistent with the actual code behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +33 to +34
max_members : int, default 10
Maximum number of ensemble members to fetch per source_id.

Copilot AI Mar 7, 2026


The docstring says max_members : int, default 10 but the actual default parameter value in the function signature is 1. This is a discrepancy between the documentation and the implementation. The docstring default should be 1 to match the signature.

Comment on lines +365 to +366
By default, up to 1 ensemble members per source_id are fetched, but this can be
changed with --max-ensembles (use 0 for no limit).

Copilot AI Mar 7, 2026


The main function docstring says "up to 1 ensemble members" (grammatically incorrect — should be "member") and references --max-ensembles as the flag name, but the actual Typer option will generate --max-members from the max_members parameter name. Both the grammar and the flag name are wrong.

Suggested change
By default, up to 1 ensemble members per source_id are fetched, but this can be
changed with --max-ensembles (use 0 for no limit).
By default, up to 1 ensemble member per source_id is fetched, but this can be
changed with --max-members (use 0 for no limit).

Comment on lines 480 to 481
return pd.DataFrame(
[{k: getattr(dataset, k) for k in self.dataset_specific_metadata} for dataset in result_datasets],

Copilot AI Mar 7, 2026


In the _get_datasets method, the index of the returned DataFrame is built using [file.id for file in result_datasets] (line 482, which would be part of the refactored function). However, result_datasets contains Dataset objects, not DatasetFile objects. The loop variable file is a misleading name and should be dataset to be consistent with the comprehension on the line above it (for dataset in result_datasets). This naming inconsistency makes the code harder to read and maintain.

)


def _accumulate_stats(stats: IngestionStats, results: "DatasetRegistrationResult") -> None:

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type annotation for the results parameter in _accumulate_stats uses a forward reference string "DatasetRegistrationResult", but DatasetRegistrationResult is already directly imported at line 13 (from climate_ref.datasets.base import DatasetAdapter, DatasetRegistrationResult). The forward reference string is unnecessary and can be replaced with the direct type DatasetRegistrationResult.

Suggested change
def _accumulate_stats(stats: IngestionStats, results: "DatasetRegistrationResult") -> None:
def _accumulate_stats(stats: IngestionStats, results: DatasetRegistrationResult) -> None:
