Conversation
There was a problem hiding this comment.
Pull request overview
This PR enhances reproducibility by dumping the configuration as JSON next to the output files. The changes filter out sensitive data (passwords), remove test-specific configuration when not running tests, and exclude the attributes structure that contains TOML-specific values (nan/inf) not compatible with JSON.
Changes:
- Added
filter_dictutility function to recursively filter dictionary keys containing a specified substring - Integrated configuration dumping in the main workflow to save sanitized config as JSON alongside outputs
- Added debug logging for the config dump location
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| disruption_py/workflow.py | Added imports and logic to dump sanitized configuration as JSON to temporary folder, filtering passwords and conditionally removing test data |
| disruption_py/core/utils/misc.py | Added filter_dict utility function to recursively remove dictionary keys containing a specified substring |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
yumouwei
left a comment
There was a problem hiding this comment.
I'm able to generate the config.json file with output_setting='dataset' or 'dataframe'. However, currently it does not work if I specify the file path (e.g. output_setting="<path_to_file>/temp.h5" or other format); it will only generate the dataset file but not the config file in the specified folder.
Additionally, it would also be idea to set the name of the config file to something like <dataset_file_name>.json or <dataset_file_name>_config.json so that a user can distinguish them in case there are multiple dataset files stored in the same folder.
|
I tested different things... First, I used to save files using output_setting and path. And I found that the json and log file are always saved in /tmp folder. Is that the expected behavior? saves a file "test.nc” on the path expecified in output_setting but the json and log files are saved in the user's /tmp/. That's the problem @yumouwei had here #522 (review). Second, I also notice that if you rerun last code, you have the expected error and no .nc file of course. However, this type of error is not registered in the log file. Finally, I have a suggestion. I tried different time settings, shot_lists, and output settings… It would be nice that any of these options in RetrievalSettings() or get_shots_data() are saved in the output.json, so even if the code is lost by the user, another user in the future can reproduce the exact data retrieval. |
|
at present we automatically generate a unique temporary work folder where everything gets dumped by default:
if one chooses to save the netcdf elsewhere, obviously the result is less consistent, eg log file and config json still get generated in the temporary folder. we should first design how we would like the framework to behave in all these edge cases, and then implement it with a (not crazy, but consistent) overhaul, which I'd say at the moment is not a priority. for context: back in the day I implemented the unique temporary folder choice in order to have a place for all test executions to drop files and log files and to be able to debug them better, because previously nothing was saved.
@yumouwei you should enable debug statements to see where the config json file is stored: disruption-py/disruption_py/workflow.py Line 132 in a3bb455 EDIT: I've upgraded it to verbose, now.
@yumouwei what if the nc file does not exist, but the config file does? would you rename it at will, overwrite, or crash with an error?
@zapatace for now, yes, as I doubt people want to specify a folder for each of their runs.
@zapatace yes, I will not overwrite any data.
@zapatace correct, I believe only workflow-specific errors get logged, but stuff like this isn't.
@zapatace yes, that's the plan:
arguably the most important settings are currently already configurations, that is:
final thoughts? |
this should help for reproducibility:
testsstructure (if not testing),attributesstructure (asnanandinfare supported by TOML but not by JSON, and they should be stored as metadata anyways),in the future we might want to add this back in as xarray metadata, rather than a configuration dump.