Skip to content

Add NClimGrid data source with lexicon and test coverage#806

Open
iype-eldho wants to merge 1 commit intoNVIDIA:mainfrom
iype-eldho:feature/nclimgrid-tests
Open

Add NClimGrid data source with lexicon and test coverage#806
iype-eldho wants to merge 1 commit intoNVIDIA:mainfrom
iype-eldho:feature/nclimgrid-tests

Conversation

@iype-eldho
Copy link
Copy Markdown

Earth2Studio Pull Request

Description

This PR adds support for the NClimGrid dataset as a new Earth2Studio data source.

Key features

  • Implements NClimGrid data loader using NOAA NODD S3 bucket
  • Adds canonical variable mapping via NClimGridLexicon
  • Supports flexible time selection (single timestamp, list, slice)
  • Implements per-timestep parquet caching for efficient reuse
  • Ensures schema consistency with existing Earth2Studio data interfaces

Testing

  • Added tests under test/data/test_nclimgrid.py
  • Covers:
    • schema validation
    • variable mapping correctness
    • time selection behavior
    • basic scientific sanity checks (temperature ranges, precipitation non-negativity)
    • caching behavior

Notes

  • Data is streamed per timestep to avoid loading entire monthly datasets into memory
  • Uses public NOAA NODD S3 bucket (no authentication required)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.
  • Assess and address Greptile feedback (AI code review bot for guidance; use discretion, addressing all feedback is not required).

Dependencies

None

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 7, 2026

Greptile Summary

This PR adds NClimGrid as a new DataFrame-returning data source backed by the public NOAA NODD S3 bucket, with per-timestep parquet caching and a companion NClimGridLexicon for unit conversion (°C→K, mm→m). The implementation logic is sound and the test suite is thorough, but both new files are missing the mandatory SPDX Apache-2.0 license header and neither class is exported from its respective __init__.py, making the new data source unreachable via the standard from earth2studio.data import NClimGrid import path.

Confidence Score: 3/5

Not ready to merge — two P1 compliance issues must be fixed first

Two P1 issues remain: missing SPDX license headers (will fail the CI /license check) and missing init.py exports (makes the new data source unreachable through the standard API). The underlying implementation logic is correct.

earth2studio/data/nclimgrid.py and earth2studio/lexicon/nclimgrid.py both need license headers; earth2studio/data/init.py and earth2studio/lexicon/init.py both need export entries added

Important Files Changed

Filename Overview
earth2studio/data/nclimgrid.py New DataSource implementation; missing SPDX license header, uses stdlib logging instead of loguru, not exported from init.py, stale-cache risk on modifier changes
earth2studio/lexicon/nclimgrid.py New lexicon with unit-conversion modifiers; missing SPDX license header and not exported from lexicon init.py
test/data/test_nclimgrid.py Thorough offline and network (marked @pytest.mark.network) test coverage for schema, unit conversions, caching, and scientific sanity checks

Reviews (1): Last reviewed commit: "Add NClimGrid data source and lexicon wi..." | Re-trigger Greptile

@@ -0,0 +1,371 @@
import logging
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Missing required SPDX license header

All Python source files in earth2studio/ must start with the Apache-2.0 SPDX header (project rule). This file starts directly with import logging, so the /license check will fail and the CI compliance gate will block merging.

Suggested change
import logging
# SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging

@@ -0,0 +1,111 @@
from collections.abc import Callable
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Missing required SPDX license header

earth2studio/lexicon/nclimgrid.py also starts directly with imports, missing the mandatory Apache-2.0 header required by the project style guide. Every other lexicon file in the directory carries this header.

Suggested change
from collections.abc import Callable
# SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from collections.abc import Callable

Comment on lines +1 to +17
import logging
import os
from datetime import datetime
from pathlib import Path
from typing import Any

import numpy as np
import pandas as pd
import pyarrow as pa
import s3fs
import xarray as xr
from tqdm.auto import tqdm

from earth2studio.lexicon.nclimgrid import NClimGridLexicon

logger = logging.getLogger("earth2studio.nclimgrid")
logger.setLevel(logging.INFO)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 stdlib logging instead of loguru

The project mandates from loguru import logger for all logging inside earth2studio/ (CLAUDE.md, style guide). import logging / logging.getLogger(...) / logger.setLevel(logging.INFO) are all non-standard here — loguru does not use setLevel. Replace with the loguru pattern used by every other data source (e.g. mrms.py).

Suggested change
import logging
import os
from datetime import datetime
from pathlib import Path
from typing import Any
import numpy as np
import pandas as pd
import pyarrow as pa
import s3fs
import xarray as xr
from tqdm.auto import tqdm
from earth2studio.lexicon.nclimgrid import NClimGridLexicon
logger = logging.getLogger("earth2studio.nclimgrid")
logger.setLevel(logging.INFO)
from loguru import logger

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@@ -0,0 +1,371 @@
import logging
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 NClimGrid not exported from __init__.py

NClimGrid is not added to earth2studio/data/__init__.py and NClimGridLexicon is not added to earth2studio/lexicon/__init__.py. Every existing data source and lexicon is exported from these files, making them available via from earth2studio.data import NClimGrid. Without these exports the new data source is effectively invisible to users following the standard import path.

Comment on lines +293 to +303
cache_file = self._cache_file_for_timestep(native, ts)

if self.cache and cache_file.exists():
df_t = pd.read_parquet(cache_file)
else:
df_t = self._dataarray_to_dataframe(da_t, var, modifier)

if self.cache:
df_t.to_parquet(cache_file)

frames.append(df_t)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Stale disk cache after modifier changes

The parquet cache is keyed solely on native variable name + date. The files contain unit-converted values (e.g. °C→K). If the conversion logic is ever corrected (e.g. a bug fix to modifier), existing cache files will silently serve wrong values until users manually delete ~/.earth2studio-cache/nclimgrid. Consider including a hash of the modifier or a schema version in the cache filename.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant