
Heterogeneous graph transformers #451

Open

allaffa wants to merge 102 commits into ORNL:main from allaffa:heterogeneous_graph_transformers

Heterogeneous graph transformers#451
allaffa wants to merge 102 commits into
ORNL:mainfrom
allaffa:heterogeneous_graph_transformers

Conversation

@allaffa
Collaborator

@allaffa allaffa commented Jan 31, 2026

No description provided.

@allaffa allaffa self-assigned this Jan 31, 2026
@allaffa allaffa added the enhancement New feature or request label Jan 31, 2026
@allaffa allaffa marked this pull request as draft January 31, 2026 15:44
@allaffa allaffa force-pushed the heterogeneous_graph_transformers branch 4 times, most recently from 4b4ed75 to 5291773 Compare March 29, 2026 01:31
@allaffa allaffa marked this pull request as ready for review March 29, 2026 01:32
Massimiliano Lupo Pasini and others added 23 commits May 4, 2026 20:33
…writer

- HDF5Writer: add a streaming mode (begin/put/end_label) for memory-efficient
  writes without holding all samples in RAM; fix the h5py vlen broadcast error
  by writing elements individually instead of assigning a batch slice
  (see the sketch after this list).
- train_opf_solution_heterogeneous: add _raw_json_to_heterodata() and
  _iter_raw_split_for_rank() to bypass InMemoryDataset entirely during
  preonly pipeline; each rank reads only its assigned raw JSON files one
  at a time, eliminating 8x full-dataset copies per node.
- Preonly+HDF5 for pglib_opf_case14_ieee (20 groups, 300K samples)
  completes in ~2min on 2 nodes (previously OOM).
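
A minimal sketch of the streaming-write pattern described above, assuming h5py variable-length float arrays; the class and method names here are illustrative, not the actual HDF5Writer API:

```python
import h5py
import numpy as np

class StreamingWriterSketch:
    """Illustrative streaming writer: one sample at a time, nothing held in RAM."""

    def begin(self, path, label):
        self.f = h5py.File(path, "w")
        vlen = h5py.vlen_dtype(np.float32)
        # Resizable 1-D dataset of variable-length float arrays.
        self.dset = self.f.create_dataset(label, shape=(0,), maxshape=(None,), dtype=vlen)
        self.n = 0

    def put(self, sample):
        # Grow by one row and assign the element individually; assigning a batch
        # slice of ragged arrays is what triggers the h5py vlen broadcast error.
        self.dset.resize((self.n + 1,))
        self.dset[self.n] = np.asarray(sample, dtype=np.float32)
        self.n += 1

    def end_label(self):
        self.f.close()
```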
…multi-case preonly; update preonly job to 16 nodes with all cases
… groups

When --preonly is run with multiple cases, ranks are now split into groups, and
each group independently downloads, stages, and processes a different case
concurrently. This eliminates the sequential bottleneck where all ranks had to
finish one case before moving on to the next (a minimal sketch of the group
split follows the list below).

- HDF5Writer.begin() no longer uses a barrier (uses makedirs exist_ok
  instead), enabling ranks in different groups to call it at different
  times without deadlock.
- Sub-communicators are used for intra-group coordination (barriers,
  broadcasts, work splitting).
- COMM_WORLD is restored for the collective metadata save at the end.
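
A minimal mpi4py sketch of the per-case group split, assuming one group per case; the case list and staging step are placeholders:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

cases = ["pglib_opf_case14_ieee", "pglib_opf_case118_ieee"]  # illustrative
color = rank % len(cases)                 # which case this rank's group handles
group_comm = comm.Split(color=color, key=rank)

case = cases[color]
# Intra-group coordination only: barriers, broadcasts, and work splitting use
# group_comm, so groups working on different cases never block each other.
group_comm.Barrier()
# ... download, stage, and preprocess `case` with group_comm ...

# COMM_WORLD is used again only for the final collective metadata save.
comm.Barrier()
```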
- Fix _initialized flag missing in HeteroRGATStack and HeteroGATStack _init_conv(),
  which caused DDP CPU buffer sync crash (modules recreated on CPU after .to(device))
- Fix HeteroRGATStack edge_dim dict passed directly to GATConv, now resolves per-type
- Add HeteroHEAT config (opf_solution_heterogeneous_heat.json) and job script
- Add HeteroHEAT model creation support in create.py
- Update OPF_Heterogeneous_Graph_Explanation.docx with algorithmic novelties (Sections 9-13)
- HeteroHEATStack: move edge_lin_dict (ModuleDict) assignment after super().__init__()
  to fix 'cannot assign module before Module.__init__() call'
- HeteroHEATStack: replace in_channels=-1 (lazy init) with in_channels=self.hidden_dim
  to fix DDP rejection of uninitialized parameters
- train_opf_solution_heterogeneous.py: add --mpnn_type, --hidden_dim, --num_conv_layers,
  --learning_rate, --log CLI args with config overrides for HPO
- Add DeepHyper HPO script (opf_deephyper_hpo.py) and SLURM job script
Same fix as HeteroRGAT (bug 4) - use _resolve_edge_dim_for_type(edge_type)
instead of passing raw edge_dim dict to GATv2Conv.
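
A sketch of the per-type resolution under these assumptions (the helper name mirrors the one referenced above, but the body and dimensions here are illustrative):

```python
from torch_geometric.nn import GATv2Conv

def _resolve_edge_dim_for_type(edge_dim, edge_type):
    # edge_dim may be a single int or a dict keyed by edge type; GATv2Conv
    # expects an int (or None) for the one edge type it handles.
    if isinstance(edge_dim, dict):
        return edge_dim.get(edge_type)
    return edge_dim

hidden_dim = 64                                  # illustrative
edge_dim = {("bus", "ac_line", "bus"): 4}        # illustrative per-type dims
edge_type = ("bus", "ac_line", "bus")

conv = GATv2Conv(
    in_channels=hidden_dim,
    out_channels=hidden_dim,
    heads=2,
    edge_dim=_resolve_edge_dim_for_type(edge_dim, edge_type),
)
```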
- opf_solution_utils.py: add OPFDomainLoss and OPFEnhancedModelWrapper
  - Bus-level penalties: ac_line/transformer smoothness, voltage magnitude
    bound (Vm), angle difference limit, DC thermal limit (branch flow)
  - Per-term EMA normalization to keep penalty scales commensurate with the
    task loss regardless of raw physical magnitudes (sketched after this list)
  - Wrapper pattern analogous to interatomic-potential enhancement

- train_opf_solution_heterogeneous.py: wire domain loss via CLI
  - --enable/disable_domain_loss, per-weight and per-index overrides
  - --domain_loss_ema_momentum override
  - Rank-0 logging when domain loss is active

- opf_solution_heterogeneous.json: baseline config update

- compare_heat_runs.py: A/B log comparison tool
  - Reports first/best/last epoch for each run, total epoch count
  - --json flag for machine-readable output

- job-frontier-train-heat-compare.sh: sequential A/B Frontier job
- job-frontier-train-heat-physics.sh: physics-only Frontier job
- job-frontier-deephyper-hpo-1024.sh: large-scale HPO job script
- job-frontier-deephyper-hpo-pna.sh: PNA-constrained HPO job script
- plot_hpo_results.py, plot_hpo_combined.py: HPO curve plotting
- hpo_validation_loss_curves.png: HPO trial summary figure
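
A minimal sketch of the per-term EMA normalization mentioned above, assuming each raw penalty is a scalar tensor; the class name, momentum default, and floor are illustrative:

```python
import torch

class EMANormalizerSketch:
    """Divide each raw penalty by a running EMA of its own magnitude so every
    term sits near ~1 regardless of its raw physical scale."""

    def __init__(self, momentum=0.99, floor=1e-8):
        self.momentum = momentum
        self.floor = floor      # keeps the divisor away from zero
        self.ema = {}

    def __call__(self, name: str, raw: torch.Tensor) -> torch.Tensor:
        value = raw.detach().abs().item()
        prev = self.ema.get(name, value)
        self.ema[name] = self.momentum * prev + (1.0 - self.momentum) * value
        return raw / max(self.ema[name], self.floor)
```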
- Add per-term EMA normalization so each penalty stays ~1.0 regardless
  of raw physical magnitude (prevents angle-diff/flow terms dominating)
- Add curriculum warmup+ramp scheduling: the domain loss is zero for the
  first warmup_epochs so the model converges on the task loss first, then is
  linearly ramped to full weight over ramp_epochs (see the sketch after this list)
- Fix voltage_output_index default 0->1 (Vm, not Va)
- Add --domain_loss_warmup_epochs / --domain_loss_ramp_epochs CLI args
- Update batch script: DOMAIN_WARMUP_EPOCHS=3, DOMAIN_RAMP_EPOCHS=3
- Job 4450208 submitted with curriculum schedule
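
A sketch of the curriculum schedule, assuming the defaults from the batch script (warmup 3, ramp 3); the returned scale multiplies the domain loss before it is added to the task loss:

```python
def curriculum_scale(epoch: int, warmup_epochs: int = 3, ramp_epochs: int = 3) -> float:
    """0 during warmup, then a linear ramp to 1.0 over ramp_epochs."""
    if epoch < warmup_epochs:
        return 0.0
    return min(1.0, (epoch - warmup_epochs + 1) / ramp_epochs)

# total_loss = task_loss + curriculum_scale(epoch) * domain_loss
```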
Drop the smoothness penalty: smoothness (the L2 difference of adjacent bus
values) is non-zero on the correct OPF solution, since adjacent buses naturally
have different voltages and angles. It was forcing the model toward a flat
profile, causing a sharp loss degradation when the curriculum ramped it in at
epoch 3.

The remaining three penalties are all feasibility constraints that are
identically zero on any strictly feasible OPF solution (see the sketch after
the job note below):
  - voltage_bound  : vmin <= Vm <= vmax
  - angle_diff     : theta_min <= delta_Va <= theta_max per line
  - line_flow      : |P_ij| = |delta_Va / x_ij| <= rate_a

Job 4451003 resubmitted with corrected domain loss.
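
A sketch of the three remaining penalties in relu-squared form, assuming per-bus voltage magnitudes and per-branch angle differences as tensors; tensor names and the mean reductions are illustrative:

```python
import torch
import torch.nn.functional as F

def feasibility_penalties(vm, va_diff, vmin, vmax, theta_min, theta_max, x, rate_a):
    """Each term is exactly zero on a strictly feasible OPF solution."""
    voltage_bound = (F.relu(vmin - vm) ** 2 + F.relu(vm - vmax) ** 2).mean()
    angle_diff = (F.relu(theta_min - va_diff) ** 2 + F.relu(va_diff - theta_max) ** 2).mean()
    p_flow = va_diff / x.clamp_min(1e-8)   # DC approximation: P_ij = delta_Va / x_ij
    line_flow = (F.relu(p_flow.abs() - rate_a) ** 2).mean()
    return voltage_bound, angle_diff, line_flow
```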
…comments

- Per-epoch LossBreakdown line in run.log (rank 0 only) showing:
    data_driven_mse, physics_penalty_total, curriculum_scale,
    and raw_* values for each feasibility constraint term.
  Enables direct comparison of task loss vs domain loss contribution
  to diagnose whether physics penalties interfere with data-driven learning.
  Final epoch flushed explicitly after train_validate_test returns.

- line_flow_slack (default 1e-4): tolerance in DC thermal-limit penalty
  absorbing the ~1.3e-5 residual from DC linearisation error on
  AC-feasible OPF solutions (verified empirically on pglib_opf_case10000_goc).

- Strategic inline comments explaining purpose of all tolerances and
  numerical guards (clamp floors, relu-squared form, EMA floor at 1e-8).

- check_domain_loss_on_gt.py: diagnostic script confirming all penalty
  terms are zero on ground-truth solutions.
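
A small sketch in the spirit of check_domain_loss_on_gt.py, showing how the line_flow_slack tolerance keeps the DC thermal-limit penalty at zero on an AC-feasible solution; all numbers here are illustrative:

```python
import torch

def line_flow_penalty(va_diff, x, rate_a, slack=1e-4):
    # The slack absorbs the small residual from DC linearisation error.
    p_flow = va_diff / x.clamp_min(1e-8)
    return torch.relu(p_flow.abs() - rate_a - slack).pow(2).mean()

va_diff = torch.tensor([0.05, -0.02])   # ground-truth angle differences (rad)
x = torch.tensor([0.20, 0.30])          # branch reactances (p.u.)
rate_a = torch.tensor([1.0, 1.0])       # thermal limits (p.u.)
assert line_flow_penalty(va_diff, x, rate_a).item() == 0.0
```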
…ltage_bound_weight: 0.01 -> 0.001
- angle_diff_weight: 0.001 -> 0.0001
- line_flow_weight: 0.001 -> 0.0001

Total domain contribution is now ~0.001 vs task loss ~0.009 (~11%), rather than ~100% as before.

- Switch LossBreakdown from logging.info to print() so it always appears in run.log regardless of the HydraGNN logging level.
@allaffa allaffa force-pushed the heterogeneous_graph_transformers branch from 7a12cb7 to 5ccd072 Compare May 5, 2026 00:33
allaffa and others added 6 commits May 6, 2026 10:32
- Add binary_cross_entropy loss to loss_function_selection()
- FT1 feasibility classification: configs, training script, job scripts,
  infeasible-sample generator (graph-level BCE, case118 → case14 agnostic;
  a minimal BCE sketch follows this list)
- FT3 N-1 contingency regression: configs, training script, job scripts
- Shared ft_utils.py: EpochCSVWriter, evaluate_ft1/ft3, save_run_results
- --no_pretrained flag on both train scripts for baseline comparison
- collect_results.py: aggregate results.json + training_curve.csv files
- plot_ft_results.py: bar charts, ROC curves, learning curves, scatter plots
- submit_ft1_ft3_jobs.sh: Frontier Slurm orchestration with job dependencies
- Pretrained HeteroSAGE_best and HeteroHEAT_best checkpoints
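
A minimal sketch of the graph-level BCE objective used for FT1 feasibility classification, assuming node embeddings from the pretrained encoder and a simple pooled head; the pooling and head shown here are illustrative, not the actual model code:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import global_mean_pool

def ft1_feasibility_loss(node_emb, batch, head, labels):
    # One embedding per graph, one feasible/infeasible probability per graph.
    graph_emb = global_mean_pool(node_emb, batch)
    probs = torch.sigmoid(head(graph_emb)).squeeze(-1)
    return F.binary_cross_entropy(probs, labels.float())

head = torch.nn.Linear(64, 1)              # illustrative classification head
node_emb = torch.randn(10, 64)             # 10 nodes, hidden dim 64
batch = torch.tensor([0] * 6 + [1] * 4)    # two graphs in the batch
labels = torch.tensor([1.0, 0.0])          # feasible / infeasible
loss = ft1_feasibility_loss(node_emb, batch, head, labels)
```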
…rain_samples

- All 30 FT configs (FT1-FT4, both archs, all regimes): use model-agnostic
  dataset names (e.g. FT3_contingency_data instead of FT3_contingency_HeteroSAGE_data)
- train_opf_ft1_classify.py, train_opf_finetune.py: add --max_train_samples
  to truncate training split for data-efficiency sweeps
- job-frontier-finetune.sh, job-frontier-FT1-*.sh: pass MAX_TRAIN_SAMPLES and
  NO_PRETRAINED through to training; DATA_MODELNAME now model-agnostic;
  LOG_NAME encodes arch/regime/scratch/n-samples (no SLURM_JOB_ID)
- generate_infeasible_samples.py: replace loading the full dataset with
  reservoir sampling to avoid OOM when generating the FT1 feasibility dataset
  (sketched after this list)
- New scripts: preprocess_ft1_ft3_loginnode.sh, submit_ft1_ft3_debug.sh,
  submit_ft1_data_efficiency_sweep.sh, submit_ft3_data_efficiency_sweep.sh
  (72-job FT1 sweep and 80-job FT3 sweep: 4 methods x 2 archs x sample sizes)
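
A minimal sketch of the reservoir sampling that replaces the full load (classic Algorithm R); the sample iterator and k are placeholders:

```python
import random

def reservoir_sample(sample_iter, k, seed=0):
    """Keep a uniform random subset of size k without materializing the stream."""
    rng = random.Random(seed)
    reservoir = []
    for i, sample in enumerate(sample_iter):
        if i < k:
            reservoir.append(sample)
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = sample
    return reservoir

subset = reservoir_sample(range(1_000_000), k=10_000)
```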
…et generation

- generate_infeasible_samples.py: move the torch.distributed import to module
  level; fix dist.init_process_group call formatting; add a gloo process-group
  init before HDF5Writer.save() to satisfy iterate_tqdm's dist.get_rank() call
  (see the sketch after this list)
- train_opf_ft1_classify.py, train_opf_finetune.py: fix line length and
  blank-line formatting around --max_train_samples Subset truncation block
- train_opf_ft1_classify.py / train_opf_finetune.py: call setup_log()
  before save_config() so ./logs/{LOG_NAME}/ exists.
- job-frontier-FT1-single-method.sh / -FT3-: 1-node debug-QOS jobs;
  unset AWS-OFI NCCL plugin and force NCCL_P2P_LEVEL=SYS for N<=1
  (single-node intra-host xGMI/SHM only).
- FT1 LOG_NAME now keyed on requested N (not TOTAL=2N) to avoid
  cross-N directory collisions.
- watch_and_chain_per_n.sh: serial submitter (1 in-flight per debug
  QOS limit); skips cells whose results.json already exists; persists
  log on Lustre.
- plot_sample_efficiency.py: aggregates results.json across runs into
  per-(task, arch, metric) line plots and summary.csv.
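
A sketch of the gloo process-group init referenced in the first item of this list, assuming a single-process fallback; the environment-variable defaults are illustrative:

```python
import os
import torch.distributed as dist

# iterate_tqdm calls dist.get_rank(), which requires an initialized process
# group even in a serial run, so fall back to a one-process gloo group.
if not dist.is_initialized():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    dist.init_process_group(backend="gloo")
```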