
Dump configuration next to the output #522

Open
gtrevisan wants to merge 9 commits into dev from glt/config

Conversation

@gtrevisan
Member

this should help with reproducibility:

  • dump configuration next to the outputs as JSON,
  • drop tests structure (if not testing),
  • drop attributes structure (as nan and inf are supported by TOML but not by JSON, and they should be stored as metadata anyway),
  • add a debug statement.

in the future we might want to add this back in as xarray metadata, rather than a configuration dump.

Copilot AI (Contributor) left a comment

Pull request overview

This PR enhances reproducibility by dumping the configuration as JSON next to the output files. The changes filter out sensitive data (passwords), remove test-specific configuration when not running tests, and exclude the attributes structure that contains TOML-specific values (nan/inf) not compatible with JSON.

Changes:

  • Added filter_dict utility function to recursively filter dictionary keys containing a specified substring
  • Integrated configuration dumping in the main workflow to save sanitized config as JSON alongside outputs
  • Added debug logging for the config dump location
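Based on the description above, a minimal sketch of what such a `filter_dict` helper could look like (illustrative only; the actual implementation lives in `disruption_py/core/utils/misc.py` and may differ):

```python
def filter_dict(data: dict, substring: str) -> dict:
    """Recursively drop any key whose name contains the given substring."""
    filtered = {}
    for key, value in data.items():
        if substring in key:
            continue  # skip sensitive keys, e.g. anything containing "password"
        filtered[key] = filter_dict(value, substring) if isinstance(value, dict) else value
    return filtered
```

With a helper like this, dumping a sanitized configuration next to the outputs reduces to something like `json.dump(filter_dict(config, "password"), fh)`.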

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

  • disruption_py/workflow.py: Added imports and logic to dump sanitized configuration as JSON to the temporary folder, filtering passwords and conditionally removing test data
  • disruption_py/core/utils/misc.py: Added filter_dict utility function to recursively remove dictionary keys containing a specified substring


@gtrevisan gtrevisan requested a review from zapatace February 12, 2026 21:52
@yumouwei (Contributor) left a comment

I'm able to generate the config.json file with output_setting='dataset' or 'dataframe'. However, it currently does not work if I specify a file path (e.g. output_setting="<path_to_file>/temp.h5" or another format); it only generates the dataset file, not the config file in the specified folder.

Additionally, it would also be ideal to set the name of the config file to something like <dataset_file_name>.json or <dataset_file_name>_config.json, so that a user can distinguish them in case multiple dataset files are stored in the same folder.

@zapatace (Contributor) commented Feb 16, 2026

I tested a few different things...

First, I tried saving files using output_setting with a path, and I found that the JSON and log files are always saved in the /tmp folder. Is that the expected behavior?
For example:

from disruption_py.workflow import get_shots_data

shot_data = get_shots_data(
    tokamak="cmod",
    shotlist_setting=[1050420019],
    output_setting="./test.nc",
)

saves a file "test.nc" in the path specified in output_setting, but the JSON and log files are saved in the user's /tmp/. That's the problem @yumouwei had here #522 (review).

Second, I also noticed that if you rerun the last code, you get the expected error

output_setting.py", line 282, in to_disk
    raise FileExistsError(f"File already exists! {self.path}")

and, of course, no .nc file. However, this type of error is not registered in the log file.
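The error quoted above suggests a simple check-before-write guard; a hedged sketch of the pattern (not the actual output_setting.py code, which may differ):

```python
from pathlib import Path


def to_disk(path: str, data: bytes) -> None:
    """Write data to path, refusing to clobber an existing output file."""
    target = Path(path)
    if target.exists():
        raise FileExistsError(f"File already exists! {target}")
    target.write_bytes(data)
```

A second call with the same path then raises FileExistsError instead of silently overwriting the previous run's output.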

Finally, I have a suggestion. I tried different time settings, shot lists, and output settings... It would be nice if all of these options in RetrievalSettings() or get_shots_data() were saved in the output JSON, so that even if the user loses the code, another user could reproduce the exact data retrieval in the future.

@gtrevisan (Member, Author) commented Feb 17, 2026

at present we automatically generate a unique temporary work folder where everything gets dumped by default:

  • output.log (old)
  • output.nc (old)
  • config.json (new)
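The default layout described above can be sketched as follows (the folder prefix and config contents are illustrative assumptions, not the framework's actual values):

```python
import json
import tempfile
from pathlib import Path

# unique temporary work folder, generated per run (prefix is illustrative)
workdir = Path(tempfile.mkdtemp(prefix="disruption_py_"))

# sanitized configuration dumped next to the other outputs
config = {"tokamak": "cmod", "shotlist_setting": [1050420019]}
json_file_path = workdir / "config.json"
json_file_path.write_text(json.dumps(config, indent=2))
```

The log file and netcdf output land in the same workdir unless the user redirects them explicitly.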

if one chooses to save the netcdf elsewhere, the result is obviously less consistent, e.g. the log file and config JSON still get generated in the temporary folder.

we should first design how we would like the framework to behave in all these edge cases, and then implement it with a (not crazy, but consistent) overhaul, which I'd say at the moment is not a priority.

for context: back in the day I implemented the unique temporary folder choice in order to have a place for all test executions to drop files and log files and to be able to debug them better, because previously nothing was saved.

it does not work if I specify the file path [...]; it will only generate the dataset file but not the config file in the specified folder.

@yumouwei you should enable debug statements to see where the config json file is stored:

logger.debug("Dumped configuration into: {path}", path=json_file_path)

EDIT: I've upgraded it to verbose, now.

Additionally, [...] set the name of the config file [...] so that a user can distinguish them in case there are multiple dataset files stored in the same folder.

@yumouwei what if the nc file does not exist, but the config file does? would you rename it at will, overwrite, or crash with an error?
let's tackle this in a separate PR if we deem it high enough in our priority list.

I found that the json and log file are always saved in /tmp folder. Is that the expected behavior?

@zapatace for now, yes, as I doubt people want to specify a folder for each of their runs.
you can already specify your log file to be created wherever you want, but I doubt people do that.
let's tackle this in a separate PR if we deem it high enough in our priority list.

if you rerun last code, you have the expected error [...] FileExistsError(f"File already exists! {self.path}")

@zapatace yes, I will not overwrite any data.
I'd suggest people leave the defaults and then move/copy/rename/hardlink their desired nc file.

However, this type of error is not registered in the log file.

@zapatace correct, I believe only workflow-specific errors get logged, but stuff like this isn't.
let's tackle this in a separate PR if we deem it high enough in our priority list.

Finally, I have a suggestion. I tried different time settings, shot_lists, and output settings… It would be nice that any of these options in RetrievalSettings() or get_shots_data() are saved in the output.json, so even if the code is lost by the user, another user in the future can reproduce the exact data retrieval.

@zapatace yes, that's the plan:

  1. save all the configs (ie: this PR),
  2. move all settings to configs (cf: Evaluate proper design for configurations and settings #445),
  3. rejoice in absolute reproducibility.

arguably the most important settings are currently already configurations, that is:

  • efit nickname setting for C-MOD,
  • code rundb tag for DIII-D,

final thoughts?

@gtrevisan gtrevisan requested a review from yumouwei February 17, 2026 15:32