Heterogeneous graph transformers #451
Open
allaffa wants to merge 102 commits into
Force-pushed from 4b4ed75 to 5291773.
…writer
- HDF5Writer: add streaming mode (begin/put/end_label) for memory-efficient writes without holding all samples in RAM; fix the h5py vlen broadcast error by writing elements individually instead of using batch slice assignment.
- train_opf_solution_heterogeneous: add _raw_json_to_heterodata() and _iter_raw_split_for_rank() to bypass InMemoryDataset entirely during the preonly pipeline; each rank reads only its assigned raw JSON files, one at a time, eliminating 8x full-dataset copies per node.
- Preonly+HDF5 for pglib_opf_case14_ieee (20 groups, 300K samples) completes in ~2 min on 2 nodes (previously OOM).
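The per-rank raw-file split mentioned above can be sketched roughly as follows. This is a hypothetical minimal version (the real `_iter_raw_split_for_rank` signature and assignment policy in the PR may differ); a round-robin slice over a sorted list gives every rank the same global view with no communication:

```python
import json

def split_for_rank(items, rank, world_size):
    """Round-robin split of work items across ranks: rank r gets
    items r, r + world_size, r + 2*world_size, ... after sorting,
    so every rank derives the same assignment independently."""
    return sorted(items)[rank::world_size]

def iter_raw_split_for_rank(raw_files, rank, world_size):
    """Stream only this rank's files, one JSON at a time, so no rank
    ever holds the full dataset in memory (sketch; the real helper's
    name and signature are assumptions from the commit message)."""
    for path in split_for_rank(raw_files, rank, world_size):
        with open(path) as f:
            yield json.load(f)  # converted to HeteroData downstream
```

Because each rank touches only its own slice, the 8x per-node duplication of the full dataset disappears.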
…multi-case preonly; update preonly job to 16 nodes with all cases
… groups
When --preonly is used with multiple cases, ranks are split into groups, and each group independently downloads, stages, and processes a different case concurrently. This eliminates the sequential bottleneck where all ranks had to finish one case before moving to the next.
- HDF5Writer.begin() no longer uses a barrier (it uses makedirs with exist_ok instead), so ranks in different groups can call it at different times without deadlock.
- Sub-communicators are used for intra-group coordination (barriers, broadcasts, work splitting).
- COMM_WORLD is restored for the collective metadata save at the end.
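The rank-to-group mapping can be sketched as a pure function of the rank (a sketch under assumptions: the PR's actual grouping policy is not shown in the message). With mpi4py, the returned color would feed `sub_comm = MPI.COMM_WORLD.Split(color, key=rank)` to build the intra-group communicator:

```python
def case_group(rank, world_size, num_cases):
    """Assign each world rank to a case group: contiguous blocks of
    ranks form one group per case, and any remainder ranks join the
    last group. Hypothetical sketch of the grouping scheme."""
    group_size = world_size // num_cases
    return min(rank // group_size, num_cases - 1)
```

Each group then runs download/stage/process for its own case concurrently, and COMM_WORLD is only needed again for the final collective metadata save.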
- Fix the _initialized flag missing in HeteroRGATStack and HeteroGATStack _init_conv(), which caused a DDP CPU buffer-sync crash (modules recreated on CPU after .to(device))
- Fix HeteroRGATStack passing the edge_dim dict directly to GATConv; it is now resolved per edge type
- Add HeteroHEAT config (opf_solution_heterogeneous_heat.json) and job script
- Add HeteroHEAT model creation support in create.py
- Update OPF_Heterogeneous_Graph_Explanation.docx with algorithmic novelties (Sections 9-13)
- HeteroHEATStack: move the edge_lin_dict (ModuleDict) assignment after super().__init__() to fix "cannot assign module before Module.__init__() call"
- HeteroHEATStack: replace in_channels=-1 (lazy init) with in_channels=self.hidden_dim to fix DDP's rejection of uninitialized parameters
- train_opf_solution_heterogeneous.py: add --mpnn_type, --hidden_dim, --num_conv_layers, --learning_rate, --log CLI args with config overrides for HPO
- Add DeepHyper HPO script (opf_deephyper_hpo.py) and SLURM job script
Same fix as HeteroRGAT (bug 4): use _resolve_edge_dim_for_type(edge_type) instead of passing the raw edge_dim dict to GATv2Conv.
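The shape of that fix can be sketched as below. PyG's GATConv/GATv2Conv expect `edge_dim` to be an int (or None), not a dict keyed by edge type, so a per-type lookup is needed when building each conv; the helper name matches the commit but its body here is an assumption:

```python
def resolve_edge_dim_for_type(edge_dim, edge_type):
    """Resolve a per-edge-type scalar edge_dim for the conv being built.
    edge_type is a ('src', 'rel', 'dst') tuple in PyG hetero models.
    Hypothetical sketch of _resolve_edge_dim_for_type."""
    if isinstance(edge_dim, dict):
        # Missing types get None, i.e. no edge features for that conv
        return edge_dim.get(edge_type)
    return edge_dim  # already a scalar (or None): pass through
```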
…est_model_loadpred args
- opf_solution_utils.py: add OPFDomainLoss and OPFEnhancedModelWrapper
- Bus-level penalties: ac_line/transformer smoothness, voltage magnitude
bound (Vm), angle difference limit, DC thermal limit (branch flow)
- Per-term EMA normalization to keep penalty scales commensurate with
task loss regardless of raw physical magnitudes
- Wrapper pattern analogous to interatomic-potential enhancement
- train_opf_solution_heterogeneous.py: wire domain loss via CLI
- --enable/disable_domain_loss, per-weight and per-index overrides
- --domain_loss_ema_momentum override
- Rank-0 logging when domain loss is active
- opf_solution_heterogeneous.json: baseline config update
- compare_heat_runs.py: A/B log comparison tool
- Reports first/best/last epoch for each run, total epoch count
- --json flag for machine-readable output
- job-frontier-train-heat-compare.sh: sequential A/B Frontier job
- job-frontier-train-heat-physics.sh: physics-only Frontier job
- job-frontier-deephyper-hpo-1024.sh: large-scale HPO job script
- job-frontier-deephyper-hpo-pna.sh: PNA-constrained HPO job script
- plot_hpo_results.py, plot_hpo_combined.py: HPO curve plotting
- hpo_validation_loss_curves.png: HPO trial summary figure
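The per-term EMA normalization listed above can be sketched as a small stateful helper (a sketch under assumptions: the momentum default is configurable via --domain_loss_ema_momentum, and the 1e-8 floor matches the guard mentioned later in this PR, but OPFDomainLoss internals are not shown here):

```python
class EMANormalizer:
    """Per-term EMA normalization: divide each raw penalty by an
    exponential moving average of its own magnitude so every term
    stays O(1) regardless of raw physical units."""

    def __init__(self, momentum=0.99, floor=1e-8):
        self.momentum = momentum
        self.floor = floor
        self.ema = {}  # one running scale per penalty term

    def __call__(self, name, raw):
        prev = self.ema.get(name, raw)  # bootstrap EMA at first value
        ema = self.momentum * prev + (1.0 - self.momentum) * raw
        self.ema[name] = ema
        return raw / max(ema, self.floor)  # floor guards divide-by-zero
```

This keeps, e.g., an angle-diff term measured in radians commensurate with a flow term measured in per-unit MW, so no single physical unit dominates the total.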
- Add per-term EMA normalization so each penalty stays ~1.0 regardless of raw physical magnitude (prevents angle-diff/flow terms from dominating)
- Add curriculum warmup+ramp scheduling: the domain loss is zero for the first warmup_epochs so the model converges on the task loss first, then linearly ramps to full weight over ramp_epochs
- Fix voltage_output_index default 0 -> 1 (Vm, not Va)
- Add --domain_loss_warmup_epochs / --domain_loss_ramp_epochs CLI args
- Update batch script: DOMAIN_WARMUP_EPOCHS=3, DOMAIN_RAMP_EPOCHS=3
- Job 4450208 submitted with curriculum schedule
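The warmup+ramp schedule reduces to a scalar multiplier on the domain loss; a minimal sketch (the exact off-by-one convention used in the PR is an assumption):

```python
def curriculum_scale(epoch, warmup_epochs, ramp_epochs):
    """Domain-loss multiplier: 0 during warmup so the model first fits
    the task loss, then a linear ramp to 1.0 over ramp_epochs."""
    if epoch < warmup_epochs:
        return 0.0
    if ramp_epochs <= 0:
        return 1.0
    return min(1.0, (epoch - warmup_epochs + 1) / ramp_epochs)
```

With WARMUP=3 and RAMP=3 as in the batch script, epochs 0-2 train on task loss alone, epochs 3-4 apply a partial domain weight, and epoch 5 onward applies the full weight.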
Smoothness (L2 difference of adjacent bus values) is non-zero on the correct OPF solution: adjacent buses naturally have different voltages and angles. It was forcing the model toward a flat profile, causing sharp loss degradation when the curriculum ramped it in at epoch 3. The remaining 3 penalties are all feasibility constraints that are identically zero on any strictly feasible OPF solution:
- voltage_bound : vmin <= Vm <= vmax
- angle_diff : theta_min <= delta_Va <= theta_max per line
- line_flow : |P_ij| = |delta_Va / x_ij| <= rate_a
Job 4451003 resubmitted with the corrected domain loss.
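The "identically zero on feasible solutions" property comes from the relu-squared form of the bound penalties (mentioned later in this PR): zero, with zero gradient, inside the feasible region, and smooth quadratic growth outside it. A sketch of two of the terms, assuming OPFDomainLoss internals not shown in the message:

```python
import numpy as np

def relu_sq(x):
    # Zero inside the feasible region, quadratic outside it
    return np.maximum(x, 0.0) ** 2

def voltage_bound_penalty(vm, vmin, vmax):
    """Zero whenever vmin <= Vm <= vmax at every bus."""
    return np.mean(relu_sq(vmin - vm) + relu_sq(vm - vmax))

def angle_diff_penalty(dva, theta_min, theta_max):
    """Zero whenever theta_min <= delta_Va <= theta_max on every line."""
    return np.mean(relu_sq(theta_min - dva) + relu_sq(dva - theta_max))
```

A smoothness term has no such feasible region: it penalizes any difference between adjacent buses, which is why it degraded a correct solution while these bound penalties do not.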
…comments
- Per-epoch LossBreakdown line in run.log (rank 0 only) showing:
data_driven_mse, physics_penalty_total, curriculum_scale,
and raw_* values for each feasibility constraint term.
Enables direct comparison of task loss vs domain loss contribution
to diagnose whether physics penalties interfere with data-driven learning.
Final epoch flushed explicitly after train_validate_test returns.
- line_flow_slack (default 1e-4): tolerance in DC thermal-limit penalty
absorbing the ~1.3e-5 residual from DC linearisation error on
AC-feasible OPF solutions (verified empirically on pglib_opf_case10000_goc).
- Strategic inline comments explaining purpose of all tolerances and
numerical guards (clamp floors, relu-squared form, EMA floor at 1e-8).
- check_domain_loss_on_gt.py: diagnostic script confirming all penalty
terms are zero on ground-truth solutions.
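The slack-tolerant DC thermal-limit penalty described above can be sketched as follows (names and internals are assumptions; only the slack default, the clamp-floor guard, and the relu-squared form come from the commit):

```python
import numpy as np

def line_flow_penalty(dva, x, rate_a, slack=1e-4):
    """DC thermal-limit penalty: |P_ij| = |delta_Va / x_ij| <= rate_a,
    with a small slack absorbing the ~1.3e-5 residual that DC
    linearisation leaves on AC-feasible solutions."""
    flow = np.abs(dva / np.clip(x, 1e-8, None))  # clamp floor guards x ~ 0
    # relu-squared: zero when |flow| <= rate_a + slack, quadratic beyond
    return np.mean(np.maximum(flow - rate_a - slack, 0.0) ** 2)
```

Without the slack, the linearisation residual alone would make this term non-zero on ground-truth solutions, defeating the "zero on feasible" property that check_domain_loss_on_gt.py verifies.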
…ltage_bound_weight: 0.01 -> 0.001
- angle_diff_weight: 0.001 -> 0.0001
- line_flow_weight: 0.001 -> 0.0001
Total domain contribution is now ~0.001 vs task loss ~0.009 (~11%) rather than ~100% as before.
- Switch LossBreakdown from logging.info to print() so it always appears in run.log regardless of the HydraGNN logging level.
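The per-epoch LossBreakdown line printed to run.log might look like the sketch below; the field names come from the earlier commit message, but the exact layout is an assumption:

```python
def format_loss_breakdown(epoch, task_mse, penalty_total, scale, raw_terms):
    """One rank-0, grep-friendly line per epoch summarizing task vs
    domain loss (hypothetical format; the PR's actual layout may differ)."""
    raws = " ".join(f"raw_{k}={v:.3e}" for k, v in raw_terms.items())
    return (f"LossBreakdown epoch={epoch} data_driven_mse={task_mse:.6f} "
            f"physics_penalty_total={penalty_total:.6f} "
            f"curriculum_scale={scale:.3f} {raws}")
```

Emitting it via plain print() sidesteps the logging-level filtering, which is exactly the switch this commit makes.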
Force-pushed from 7a12cb7 to 5ccd072.
- Add binary_cross_entropy loss to loss_function_selection()
- FT1 feasibility classification: configs, training script, job scripts, infeasible-sample generator (graph-level BCE, case118 → case14 agnostic)
- FT3 N-1 contingency regression: configs, training script, job scripts
- Shared ft_utils.py: EpochCSVWriter, evaluate_ft1/ft3, save_run_results
- --no_pretrained flag on both train scripts for baseline comparison
- collect_results.py: aggregate results.json + training_curve.csv files
- plot_ft_results.py: bar charts, ROC curves, learning curves, scatter plots
- submit_ft1_ft3_jobs.sh: Frontier Slurm orchestration with job dependencies
- Pretrained HeteroSAGE_best and HeteroHEAT_best checkpoints
…rain_samples
- All 30 FT configs (FT1-FT4, both archs, all regimes): use model-agnostic dataset names (e.g. FT3_contingency_data instead of FT3_contingency_HeteroSAGE_data)
- train_opf_ft1_classify.py, train_opf_finetune.py: add --max_train_samples to truncate the training split for data-efficiency sweeps
- job-frontier-finetune.sh, job-frontier-FT1-*.sh: pass MAX_TRAIN_SAMPLES and NO_PRETRAINED through to training; DATA_MODELNAME is now model-agnostic; LOG_NAME encodes arch/regime/scratch/n-samples (no SLURM_JOB_ID)
- generate_infeasible_samples.py: replace full-load with reservoir sampling to avoid OOM when generating the FT1 feasibility dataset
- New scripts: preprocess_ft1_ft3_loginnode.sh, submit_ft1_ft3_debug.sh, submit_ft1_data_efficiency_sweep.sh, submit_ft3_data_efficiency_sweep.sh (72-job FT1 sweep and 80-job FT3 sweep: 4 methods x 2 archs x sample sizes)
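The reservoir-sampling replacement for the full-load path can be sketched with classic Algorithm R, which keeps a uniform k-sample of an arbitrarily long stream in O(k) memory (a sketch of the technique; the actual code in generate_infeasible_samples.py is not shown in the message):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: uniform sample of size k from a stream of unknown
    length, without ever materializing the whole stream in memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)   # item survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

This is what removes the OOM: memory is bounded by k samples rather than by the full generated dataset.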
…et generation
- generate_infeasible_samples.py: move the torch.distributed import to module level; fix dist.init_process_group call formatting; add gloo process-group init before HDF5Writer.save() to satisfy iterate_tqdm's dist.get_rank() call
- train_opf_ft1_classify.py, train_opf_finetune.py: fix line length and blank-line formatting around the --max_train_samples Subset truncation block
- train_opf_ft1_classify.py / train_opf_finetune.py: call setup_log()
before save_config() so ./logs/{LOG_NAME}/ exists.
- job-frontier-FT1-single-method.sh / -FT3-: 1-node debug-QOS jobs;
unset AWS-OFI NCCL plugin and force NCCL_P2P_LEVEL=SYS for N<=1
(single-node intra-host xGMI/SHM only).
- FT1 LOG_NAME now keyed on requested N (not TOTAL=2N) to avoid
cross-N directory collisions.
- watch_and_chain_per_n.sh: serial submitter (1 in-flight per debug
QOS limit); skips cells whose results.json already exists; persists
log on Lustre.
- plot_sample_efficiency.py: aggregates results.json across runs into
per-(task, arch, metric) line plots and summary.csv.