Add ERA5 training data statistical analysis scripts and findings#1004
Investigate statistical properties of the ERA5 1-degree 8-layer dataset used for ACE2 pretraining, focusing on features that could explain why upper atmospheric variables are difficult to learn. Key findings include pathological distributions in stratospheric humidity, near-unpredictable stratospheric meridional wind, and normalization scale mismatches for upper atmosphere variables. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| ERA5 has known discontinuities where parallel data streams were merged (documented in Hersbach 2020). The training config already excludes some problematic periods (1996-2010, 2020) but **includes 1979-1995**, which spans two significant discontinuities: | ||
| | Year | Variable most affected | Jump magnitude | |
@Arcomano1234 these may indicate more periods we need to skip. However, since these are year-over-year discontinuities, they might just be years with strong jumps due to volcanic events.
Should we have volcanic activity / aerosol as part of our forcings? Though Claude said they’re reanalysis artifacts below. Should investigate further.
Got Claude to investigate; these are real discontinuities. It looks like the global map basically gets scaled.
That's really interesting. I wonder if ECMWF is aware of this. I want to dive a little more into these dates.
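One way to screen for candidate years before looking at maps is sketched below. It flags years whose change in the annual global mean is much larger than the typical year-over-year change. The function name `flag_jumps`, the threshold of 5, and the example values are all hypothetical and synthetic; this is not the method used by the scripts in this PR, and a flagged year still needs a separate check to tell a volcanic signal from a stream-merge artifact.

```python
from statistics import median


def flag_jumps(annual_means, threshold=5.0):
    """Return years whose year-over-year change exceeds threshold * median |change|.

    annual_means: dict mapping year -> annual global mean of one variable.
    threshold: hypothetical multiplier, chosen here only for illustration.
    """
    years = sorted(annual_means)
    # Difference between each year and the previous one, keyed by the later year.
    diffs = {
        later: annual_means[later] - annual_means[earlier]
        for earlier, later in zip(years, years[1:])
    }
    scale = median(abs(d) for d in diffs.values())
    if scale == 0:
        return []
    return [year for year, d in diffs.items() if abs(d) > threshold * scale]


# Synthetic series with a step change between 1985 and 1986 (not real ERA5 values).
synthetic = {1983: 1.00, 1984: 1.01, 1985: 1.00, 1986: 1.20, 1987: 1.21, 1988: 1.20}
flagged = flag_jumps(synthetic)
```

A step change that persists in later years, as in the synthetic series, is more suggestive of a data-stream change than a transient event, but that interpretation is an assumption to verify against the spatial maps.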
| | -------- | ------------ | ---------- | ----------------------- | | ||
| | DSWRFsfc | 29.1% | 0% | 1.28 | | ||
| | USWRFtoa | 29.0% | 0% | 1.13 | | ||
| | USWRFsfc | 0.8% | 23.1% | 0.82 | |
@spencerkclark any idea why upwelling shortwave surface radiation has so many small negative values, while these other similar variables do not?
I think it is related to the fact that we compute the upward components as the difference between the downward and net fluxes, since they are not stored directly in ERA5. I cannot immediately explain why this does not show up in the upward shortwave flux at TOA in this older version of the dataset, but it is present in that variable too in the new dataset.
If we are concerned about it, we could consider clipping negative values at zero.
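A minimal sketch of that derivation and the optional clipping, assuming the net flux is defined as downward minus upward and that the fields are stored as float32. The function name `upward_flux` and the example values are hypothetical and are not taken from the dataset pipeline.

```python
import numpy as np


def upward_flux(downward, net, clip_negative=True):
    """Derive an upward flux as downward minus net.

    Assumes net = downward - upward (sign convention is an assumption).
    Small negative results can appear where both inputs are near zero,
    for example at night; clip_negative sets those to zero.
    """
    upward = np.asarray(downward, dtype=np.float32) - np.asarray(net, dtype=np.float32)
    if clip_negative:
        upward = np.maximum(upward, 0.0)
    return upward


# Synthetic values in W/m^2: two near-zero nighttime points and one daytime point.
down = np.array([0.0, 1.0e-3, 350.0], dtype=np.float32)
net = np.array([2.0e-4, 1.2e-3, 280.0], dtype=np.float32)

raw = upward_flux(down, net, clip_negative=False)
clipped = upward_flux(down, net)
```

Clipping changes only the values that were negative, so the daytime point is left untouched.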
Investigate whether year-over-year jumps in Finding 4 are volcanic signals or reanalysis artifacts (1986 STW0 = artifact at Apr 1 stream merge, 1993 temperature = Pinatubo, 2000 STW0 = artifact). Investigate USWRFsfc negative values (floating-point noise at nighttime, not physical). Rename original findings/transcript to 01_ prefix and add 02_ follow-up files. Include discontinuity visualization plots. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Transcripts get t prefix, findings markdown get f prefix, and analysis scripts get s prefix for easier identification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Compare spherical power spectra for four periods (pre-1979, 1979-2000, 2001-2009, 2010-2022) to quantify how much spectral change is due to the pre-1979 regime shift vs post-1979 drift and the 2010 stream boundary. Key finding: specific_total_water_0 shows the largest spectral shift, dominated by the pre-1979 boundary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| os.makedirs(CACHE_DIR, exist_ok=True) | ||
| # Variables to analyze — a mix of upper-atmosphere and surface | ||
| VARIABLES = [ |
Can you add PRATEsfc?
PRATEsfc shows the largest spectral evolution of any variable analyzed: nearly 2x increase in small-scale power from pre-1979 to 2010-2022, accumulating progressively across all period boundaries rather than concentrated at a single regime shift. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tropopause-level moisture (level 1, ~98 hPa) shows the most extreme spectral evolution: 4.3x more small-scale power in 2010-2022 vs pre-1979, progressive across all period boundaries. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summarizes spectral evolution across four ERA5 time periods for 9 variables. Key findings: STW1 has 4.3x small-scale power shift, PRATEsfc ~2x progressive shift, STW0 shift concentrated at 1979. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
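The "small-scale power shift" figures can be read as a ratio of summed power above some wavenumber cutoff between two periods. The sketch below shows that arithmetic under stated assumptions: the spectra are already computed and indexed by total wavenumber, and the cutoff `l_min` is a hypothetical choice. The function name and the numbers are illustrative only and do not reproduce the analysis scripts or their results.

```python
def small_scale_power_ratio(spectrum_ref, spectrum_new, l_min):
    """Ratio of summed power at wavenumbers >= l_min (new period over reference).

    spectrum_ref, spectrum_new: sequences of power indexed by total wavenumber.
    l_min: hypothetical cutoff separating "small-scale" from larger scales.
    """
    ref_power = sum(spectrum_ref[l_min:])
    new_power = sum(spectrum_new[l_min:])
    if ref_power == 0:
        raise ValueError("reference spectrum has no power above the cutoff")
    return new_power / ref_power


# Synthetic spectra: identical at large scales, doubled at small scales.
reference_period = [8.0, 4.0, 2.0, 1.0]
later_period = [8.0, 4.0, 4.0, 2.0]
ratio = small_scale_power_ratio(reference_period, later_period, l_min=2)
```

The result depends on the cutoff, so ratios are only comparable across variables when the same `l_min` is used.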
Fixes test_main_guard_requires_distributed_context by wrapping the main() call with the required distributed context manager. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>