Add ERA5 training data statistical analysis scripts and findings#1004

Draft
mcgibbon wants to merge 10 commits into main from scripts/era5_investigation

Contributor

@mcgibbon mcgibbon commented Mar 24, 2026

Investigate statistical properties of the ERA5 1-degree 8-layer dataset used for ACE2 pretraining, focusing on features that could explain why upper atmospheric variables are difficult to learn. Key findings include pathological distributions in stratospheric humidity, near-unpredictable stratospheric meridional wind, and normalization scale mismatches for upper atmosphere variables.

mcgibbon and others added 3 commits March 24, 2026 22:23
Investigate statistical properties of the ERA5 1-degree 8-layer dataset
used for ACE2 pretraining, focusing on features that could explain why
upper atmospheric variables are difficult to learn. Key findings include
pathological distributions in stratospheric humidity, near-unpredictable
stratospheric meridional wind, and normalization scale mismatches for
upper atmosphere variables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ERA5 has known discontinuities where parallel data streams were merged (documented in Hersbach 2020). The training config already excludes some problematic periods (1996-2010, 2020) but **includes 1979-1995**, which spans two significant discontinuities:

| Year | Variable most affected | Jump magnitude |
Contributor Author

@mcgibbon mcgibbon Mar 25, 2026

@Arcomano1234 these may indicate more periods we need to skip. However, it sounds like these are year-over-year discontinuities, so they might just be years with strong jumps due to volcanic events.

Should we include volcanic activity or aerosols as part of our forcings? That said, Claude's analysis below says they're reanalysis artifacts. We should investigate further.

Contributor Author

Got Claude to investigate; these are real discontinuities. It looks like the global map basically gets scaled across the jump.

Contributor

That's really interesting. I wonder if ECMWF is aware of this. I want to dive a little more into these dates.
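
For anyone following up on these dates, a first pass at locating year-over-year jumps could look like the sketch below. It is not the analysis script from this PR: the function name `flag_jumps`, the `factor` threshold, and the synthetic series are all hypothetical, and the input is assumed to be a mapping from year to a global annual mean of one variable. It only flags large jumps; it cannot tell a volcanic signal from a stream-merge artifact.

```python
from statistics import median


def flag_jumps(annual_means, factor=5.0):
    """Return years whose change from the previous year is unusually large.

    annual_means: dict mapping year -> global annual mean (assumed input).
    factor: hypothetical threshold, in multiples of the median absolute
            year-over-year change.
    """
    years = sorted(annual_means)
    # Change of each year relative to the year before it.
    diffs = {y: annual_means[y] - annual_means[p] for p, y in zip(years, years[1:])}
    typical = median(abs(d) for d in diffs.values())
    return [y for y, d in diffs.items() if abs(d) > factor * typical]


# Synthetic series (year index -> value), not ERA5 data: a step between 3 and 4.
series = {1: 1.00, 2: 1.01, 3: 1.00, 4: 1.20, 5: 1.21, 6: 1.20}
flagged = flag_jumps(series)
```

Comparing the spatial pattern before and after each flagged year (for example, the ratio of the two annual-mean maps) would be a natural next step for checking the "map gets scaled" observation.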

| -------- | ------------ | ---------- | ----------------------- |
| DSWRFsfc | 29.1% | 0% | 1.28 |
| USWRFtoa | 29.0% | 0% | 1.13 |
| USWRFsfc | 0.8% | 23.1% | 0.82 |
Contributor Author

@spencerkclark any idea why upwelling shortwave surface radiation has so many small negative values, while these other similar variables do not?

Member

I think it is related to the fact that we compute the upward components as the difference between the downward and net fluxes, since they are not stored directly in ERA5. I cannot immediately explain why this does not show up in the upward shortwave flux at TOA in this older version of the dataset, but it is present in that variable too in the new dataset.

If we are concerned about it, we could consider clipping negative values at zero.
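
A minimal sketch of that idea, assuming the upward flux is derived as downward minus net with the net flux taken as positive downward (the sign convention is an assumption, as are the helper names `upward_flux` and `negative_fraction` and the sample values, none of which come from the dataset pipeline):

```python
import numpy as np


def upward_flux(downward, net, clip=True):
    """Derive an upward flux as downward minus net.

    Assumes net is positive downward. With clip=True, small negative
    residuals from the subtraction are set to zero.
    """
    upward = np.asarray(downward, dtype=np.float64) - np.asarray(net, dtype=np.float64)
    if clip:
        upward = np.clip(upward, 0.0, None)
    return upward


def negative_fraction(values):
    """Fraction of elements that are strictly negative."""
    return float(np.mean(np.asarray(values) < 0.0))


# Hypothetical values in W/m^2; the first entry mimics a nighttime point
# where net is very slightly larger than downward.
downward = np.array([0.0, 0.0, 250.0, 800.0], dtype=np.float32)
net = np.array([1e-4, 0.0, 200.0, 640.0], dtype=np.float32)

raw = upward_flux(downward, net, clip=False)
clipped = upward_flux(downward, net)
```

Running `negative_fraction` on `raw` and on `clipped` gives a quick before-and-after check of how much of the field the clipping touches.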

mcgibbon and others added 3 commits March 25, 2026 15:47
Investigate whether year-over-year jumps in Finding 4 are volcanic
signals or reanalysis artifacts (1986 STW0 = artifact at Apr 1 stream
merge, 1993 temperature = Pinatubo, 2000 STW0 = artifact). Investigate
USWRFsfc negative values (floating-point noise at nighttime, not
physical). Rename original findings/transcript to 01_ prefix and add
02_ follow-up files. Include discontinuity visualization plots.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Transcripts get t prefix, findings markdown get f prefix, and
analysis scripts get s prefix for easier identification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Compare spherical power spectra for four periods (pre-1979, 1979-2000,
2001-2009, 2010-2022) to quantify how much spectral change is due to the
pre-1979 regime shift vs post-1979 drift and the 2010 stream boundary.
Key finding: specific_total_water_0 shows the largest spectral shift,
dominated by the pre-1979 boundary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
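
To illustrate the kind of period-to-period comparison described in this commit, the sketch below computes a ratio of small-scale power between two fields. It deliberately swaps in a simpler technique: a zonal-wavenumber spectrum from an FFT along longitude, not the spherical power spectrum used in the PR, and with no latitude weighting. The function names, the wavenumber cutoff, and the synthetic fields are all hypothetical.

```python
import numpy as np


def zonal_power_spectrum(field):
    """Mean power per zonal wavenumber for a (lat, lon) field.

    Simplified stand-in for a spherical power spectrum: FFT along
    longitude only, averaged over latitude without area weighting.
    """
    coeffs = np.fft.rfft(field, axis=-1) / field.shape[-1]
    return np.mean(np.abs(coeffs) ** 2, axis=0)


def small_scale_power(field, min_wavenumber):
    """Total power at zonal wavenumbers >= min_wavenumber."""
    return float(zonal_power_spectrum(field)[min_wavenumber:].sum())


# Synthetic 1-degree fields, not ERA5 data: the same large-scale wave,
# with the small-scale wave amplitude doubled in the "late" period.
lon = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
rows = np.ones((180, 1))
early = rows * (np.sin(2 * lon) + 0.1 * np.sin(40 * lon))
late = rows * (np.sin(2 * lon) + 0.2 * np.sin(40 * lon))

ratio = small_scale_power(late, 20) / small_scale_power(early, 20)
```

Because power scales with amplitude squared, doubling the small-scale amplitude in the synthetic field yields a ratio of about 4, which is the sort of number the period comparisons in this PR report.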
os.makedirs(CACHE_DIR, exist_ok=True)

# Variables to analyze — a mix of upper-atmosphere and surface
VARIABLES = [
Contributor

Can you add PRATEsfc?

mcgibbon and others added 4 commits March 25, 2026 18:13
PRATEsfc shows the largest spectral evolution of any variable analyzed:
nearly 2x increase in small-scale power from pre-1979 to 2010-2022,
accumulating progressively across all period boundaries rather than
concentrated at a single regime shift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tropopause-level moisture (level 1, ~98 hPa) shows the most extreme
spectral evolution: 4.3x more small-scale power in 2010-2022 vs pre-1979,
progressive across all period boundaries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summarizes spectral evolution across four ERA5 time periods for 9
variables. Key findings: STW1 has 4.3x small-scale power shift,
PRATEsfc ~2x progressive shift, STW0 shift concentrated at 1979.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes test_main_guard_requires_distributed_context by wrapping the
main() call with the required distributed context manager.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>