
Heterogeneous graph transformers #451

Open

allaffa wants to merge 102 commits into ORNL:main from allaffa:heterogeneous_graph_transformers

Heterogeneous graph transformers#451
allaffa wants to merge 102 commits into
ORNL:mainfrom
allaffa:heterogeneous_graph_transformers

Conversation

@allaffa
Collaborator

@allaffa allaffa commented Jan 31, 2026

No description provided.

@allaffa allaffa self-assigned this Jan 31, 2026
@allaffa allaffa added the enhancement New feature or request label Jan 31, 2026
@allaffa allaffa marked this pull request as draft January 31, 2026 15:44
@allaffa allaffa force-pushed the heterogeneous_graph_transformers branch 4 times, most recently from 4b4ed75 to 5291773 Compare March 29, 2026 01:31
@allaffa allaffa marked this pull request as ready for review March 29, 2026 01:32
Massimiliano Lupo Pasini and others added 23 commits May 4, 2026 20:33
…writer

- HDF5Writer: add a streaming mode (begin/put/end_label) for memory-efficient
  writes without holding all samples in RAM; fix the h5py vlen broadcast error
  by writing elements individually instead of assigning a batch slice
  (see the sketch after this list).
- train_opf_solution_heterogeneous: add _raw_json_to_heterodata() and
  _iter_raw_split_for_rank() to bypass InMemoryDataset entirely during
  preonly pipeline; each rank reads only its assigned raw JSON files one
  at a time, eliminating 8x full-dataset copies per node.
- Preonly+HDF5 for pglib_opf_case14_ieee (20 groups, 300K samples)
  completes in ~2min on 2 nodes (previously OOM).
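
A minimal sketch of the streaming-write pattern described above, assuming h5py variable-length float arrays; the class and method names here are illustrative, not the actual HDF5Writer API:

```python
import h5py
import numpy as np

class StreamingWriterSketch:
    """Illustrative streaming writer: one sample at a time, nothing held in RAM."""

    def begin(self, path, label):
        self.f = h5py.File(path, "w")
        vlen = h5py.vlen_dtype(np.float32)
        # Resizable 1-D dataset of variable-length float arrays.
        self.dset = self.f.create_dataset(label, shape=(0,), maxshape=(None,), dtype=vlen)
        self.n = 0

    def put(self, sample):
        # Grow by one row and assign the element individually; assigning a batch
        # slice of ragged arrays is what triggers the h5py vlen broadcast error.
        self.dset.resize((self.n + 1,))
        self.dset[self.n] = np.asarray(sample, dtype=np.float32)
        self.n += 1

    def end_label(self):
        self.f.close()
```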
…multi-case preonly; update preonly job to 16 nodes with all cases
… groups

When --preonly is run with multiple cases, ranks are now split into groups, and
each group independently downloads, stages, and processes a different case
concurrently. This eliminates the sequential bottleneck where all ranks had to
finish one case before moving on to the next (a minimal sketch of the group
split follows the list below).

- HDF5Writer.begin() no longer uses a barrier (uses makedirs exist_ok
  instead), enabling ranks in different groups to call it at different
  times without deadlock.
- Sub-communicators are used for intra-group coordination (barriers,
  broadcasts, work splitting).
- COMM_WORLD is restored for the collective metadata save at the end.
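
A minimal mpi4py sketch of the per-case group split, assuming one group per case; the case list and staging step are placeholders:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

cases = ["pglib_opf_case14_ieee", "pglib_opf_case118_ieee"]  # illustrative
color = rank % len(cases)                 # which case this rank's group handles
group_comm = comm.Split(color=color, key=rank)

case = cases[color]
# Intra-group coordination only: barriers, broadcasts, and work splitting use
# group_comm, so groups working on different cases never block each other.
group_comm.Barrier()
# ... download, stage, and preprocess `case` with group_comm ...

# COMM_WORLD is used again only for the final collective metadata save.
comm.Barrier()
```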
- Fix _initialized flag missing in HeteroRGATStack and HeteroGATStack _init_conv(),
  which caused DDP CPU buffer sync crash (modules recreated on CPU after .to(device))
- Fix HeteroRGATStack edge_dim dict passed directly to GATConv, now resolves per-type
- Add HeteroHEAT config (opf_solution_heterogeneous_heat.json) and job script
- Add HeteroHEAT model creation support in create.py
- Update OPF_Heterogeneous_Graph_Explanation.docx with algorithmic novelties (Sections 9-13)
- HeteroHEATStack: move edge_lin_dict (ModuleDict) assignment after super().__init__()
  to fix 'cannot assign module before Module.__init__() call'
- HeteroHEATStack: replace in_channels=-1 (lazy init) with in_channels=self.hidden_dim
  to fix DDP rejection of uninitialized parameters
- train_opf_solution_heterogeneous.py: add --mpnn_type, --hidden_dim, --num_conv_layers,
  --learning_rate, --log CLI args with config overrides for HPO
- Add DeepHyper HPO script (opf_deephyper_hpo.py) and SLURM job script
Same fix as HeteroRGAT (bug 4) - use _resolve_edge_dim_for_type(edge_type)
instead of passing raw edge_dim dict to GATv2Conv.
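
A sketch of the per-type resolution under these assumptions (the helper name mirrors the one referenced above, but the body and dimensions here are illustrative):

```python
from torch_geometric.nn import GATv2Conv

def _resolve_edge_dim_for_type(edge_dim, edge_type):
    # edge_dim may be a single int or a dict keyed by edge type; GATv2Conv
    # expects an int (or None) for the one edge type it handles.
    if isinstance(edge_dim, dict):
        return edge_dim.get(edge_type)
    return edge_dim

hidden_dim = 64                                  # illustrative
edge_dim = {("bus", "ac_line", "bus"): 4}        # illustrative per-type dims
edge_type = ("bus", "ac_line", "bus")

conv = GATv2Conv(
    in_channels=hidden_dim,
    out_channels=hidden_dim,
    heads=2,
    edge_dim=_resolve_edge_dim_for_type(edge_dim, edge_type),
)
```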
- opf_solution_utils.py: add OPFDomainLoss and OPFEnhancedModelWrapper
  - Bus-level penalties: ac_line/transformer smoothness, voltage magnitude
    bound (Vm), angle difference limit, DC thermal limit (branch flow)
  - Per-term EMA normalization to keep penalty scales commensurate with the
    task loss regardless of raw physical magnitudes (sketched after this list)
  - Wrapper pattern analogous to interatomic-potential enhancement

- train_opf_solution_heterogeneous.py: wire domain loss via CLI
  - --enable/disable_domain_loss, per-weight and per-index overrides
  - --domain_loss_ema_momentum override
  - Rank-0 logging when domain loss is active

- opf_solution_heterogeneous.json: baseline config update

- compare_heat_runs.py: A/B log comparison tool
  - Reports first/best/last epoch for each run, total epoch count
  - --json flag for machine-readable output

- job-frontier-train-heat-compare.sh: sequential A/B Frontier job
- job-frontier-train-heat-physics.sh: physics-only Frontier job
- job-frontier-deephyper-hpo-1024.sh: large-scale HPO job script
- job-frontier-deephyper-hpo-pna.sh: PNA-constrained HPO job script
- plot_hpo_results.py, plot_hpo_combined.py: HPO curve plotting
- hpo_validation_loss_curves.png: HPO trial summary figure
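
A minimal sketch of the per-term EMA normalization mentioned above, assuming each raw penalty is a scalar tensor; the class name, momentum default, and floor are illustrative:

```python
import torch

class EMANormalizerSketch:
    """Divide each raw penalty by a running EMA of its own magnitude so every
    term sits near ~1 regardless of its raw physical scale."""

    def __init__(self, momentum=0.99, floor=1e-8):
        self.momentum = momentum
        self.floor = floor      # keeps the divisor away from zero
        self.ema = {}

    def __call__(self, name: str, raw: torch.Tensor) -> torch.Tensor:
        value = raw.detach().abs().item()
        prev = self.ema.get(name, value)
        self.ema[name] = self.momentum * prev + (1.0 - self.momentum) * value
        return raw / max(self.ema[name], self.floor)
```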
- Add per-term EMA normalization so each penalty stays ~1.0 regardless
  of raw physical magnitude (prevents angle-diff/flow terms dominating)
- Add curriculum warmup+ramp scheduling: the domain loss is zero for the
  first warmup_epochs so the model converges on the task loss first, then is
  linearly ramped to full weight over ramp_epochs (see the sketch after this list)
- Fix voltage_output_index default 0->1 (Vm, not Va)
- Add --domain_loss_warmup_epochs / --domain_loss_ramp_epochs CLI args
- Update batch script: DOMAIN_WARMUP_EPOCHS=3, DOMAIN_RAMP_EPOCHS=3
- Job 4450208 submitted with curriculum schedule
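
A sketch of the curriculum schedule, assuming the defaults from the batch script (warmup 3, ramp 3); the returned scale multiplies the domain loss before it is added to the task loss:

```python
def curriculum_scale(epoch: int, warmup_epochs: int = 3, ramp_epochs: int = 3) -> float:
    """0 during warmup, then a linear ramp to 1.0 over ramp_epochs."""
    if epoch < warmup_epochs:
        return 0.0
    return min(1.0, (epoch - warmup_epochs + 1) / ramp_epochs)

# total_loss = task_loss + curriculum_scale(epoch) * domain_loss
```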
Drop the smoothness penalty: smoothness (the L2 difference of adjacent bus
values) is non-zero on the correct OPF solution, since adjacent buses naturally
have different voltages and angles. It was forcing the model toward a flat
profile, causing a sharp loss degradation when the curriculum ramped it in at
epoch 3.

The remaining three penalties are all feasibility constraints that are
identically zero on any strictly feasible OPF solution (see the sketch after
the job note below):
  - voltage_bound  : vmin <= Vm <= vmax
  - angle_diff     : theta_min <= delta_Va <= theta_max per line
  - line_flow      : |P_ij| = |delta_Va / x_ij| <= rate_a

Job 4451003 resubmitted with corrected domain loss.
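
A sketch of the three remaining penalties in relu-squared form, assuming per-bus voltage magnitudes and per-branch angle differences as tensors; tensor names and the mean reductions are illustrative:

```python
import torch
import torch.nn.functional as F

def feasibility_penalties(vm, va_diff, vmin, vmax, theta_min, theta_max, x, rate_a):
    """Each term is exactly zero on a strictly feasible OPF solution."""
    voltage_bound = (F.relu(vmin - vm) ** 2 + F.relu(vm - vmax) ** 2).mean()
    angle_diff = (F.relu(theta_min - va_diff) ** 2 + F.relu(va_diff - theta_max) ** 2).mean()
    p_flow = va_diff / x.clamp_min(1e-8)   # DC approximation: P_ij = delta_Va / x_ij
    line_flow = (F.relu(p_flow.abs() - rate_a) ** 2).mean()
    return voltage_bound, angle_diff, line_flow
```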
…comments

- Per-epoch LossBreakdown line in run.log (rank 0 only) showing:
    data_driven_mse, physics_penalty_total, curriculum_scale,
    and raw_* values for each feasibility constraint term.
  Enables direct comparison of task loss vs domain loss contribution
  to diagnose whether physics penalties interfere with data-driven learning.
  Final epoch flushed explicitly after train_validate_test returns.

- line_flow_slack (default 1e-4): tolerance in DC thermal-limit penalty
  absorbing the ~1.3e-5 residual from DC linearisation error on
  AC-feasible OPF solutions (verified empirically on pglib_opf_case10000_goc).

- Strategic inline comments explaining purpose of all tolerances and
  numerical guards (clamp floors, relu-squared form, EMA floor at 1e-8).

- check_domain_loss_on_gt.py: diagnostic script confirming all penalty
  terms are zero on ground-truth solutions.
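
A small sketch in the spirit of check_domain_loss_on_gt.py, showing how the line_flow_slack tolerance keeps the DC thermal-limit penalty at zero on an AC-feasible solution; all numbers here are illustrative:

```python
import torch

def line_flow_penalty(va_diff, x, rate_a, slack=1e-4):
    # The slack absorbs the small residual from DC linearisation error.
    p_flow = va_diff / x.clamp_min(1e-8)
    return torch.relu(p_flow.abs() - rate_a - slack).pow(2).mean()

va_diff = torch.tensor([0.05, -0.02])   # ground-truth angle differences (rad)
x = torch.tensor([0.20, 0.30])          # branch reactances (p.u.)
rate_a = torch.tensor([1.0, 1.0])       # thermal limits (p.u.)
assert line_flow_penalty(va_diff, x, rate_a).item() == 0.0
```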
…ltage_bound_weight: 0.01 -> 0.001
- angle_diff_weight: 0.001 -> 0.0001
- line_flow_weight: 0.001 -> 0.0001

Total domain contribution is now ~0.001 vs task loss ~0.009 (~11%), rather than ~100% as before.

- Switch LossBreakdown from logging.info to print() so it always appears in run.log regardless of the HydraGNN logging level.
@allaffa allaffa force-pushed the heterogeneous_graph_transformers branch from 7a12cb7 to 5ccd072 Compare May 5, 2026 00:33
allaffa and others added 6 commits May 6, 2026 10:32
- Add binary_cross_entropy loss to loss_function_selection()
- FT1 feasibility classification: configs, training script, job scripts,
  infeasible-sample generator (graph-level BCE, case118 → case14 agnostic;
  a minimal BCE sketch follows this list)
- FT3 N-1 contingency regression: configs, training script, job scripts
- Shared ft_utils.py: EpochCSVWriter, evaluate_ft1/ft3, save_run_results
- --no_pretrained flag on both train scripts for baseline comparison
- collect_results.py: aggregate results.json + training_curve.csv files
- plot_ft_results.py: bar charts, ROC curves, learning curves, scatter plots
- submit_ft1_ft3_jobs.sh: Frontier Slurm orchestration with job dependencies
- Pretrained HeteroSAGE_best and HeteroHEAT_best checkpoints
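
A minimal sketch of the graph-level BCE objective used for FT1 feasibility classification, assuming node embeddings from the pretrained encoder and a simple pooled head; the pooling and head shown here are illustrative, not the actual model code:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import global_mean_pool

def ft1_feasibility_loss(node_emb, batch, head, labels):
    # One embedding per graph, one feasible/infeasible probability per graph.
    graph_emb = global_mean_pool(node_emb, batch)
    probs = torch.sigmoid(head(graph_emb)).squeeze(-1)
    return F.binary_cross_entropy(probs, labels.float())

head = torch.nn.Linear(64, 1)              # illustrative classification head
node_emb = torch.randn(10, 64)             # 10 nodes, hidden dim 64
batch = torch.tensor([0] * 6 + [1] * 4)    # two graphs in the batch
labels = torch.tensor([1.0, 0.0])          # feasible / infeasible
loss = ft1_feasibility_loss(node_emb, batch, head, labels)
```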
…rain_samples

- All 30 FT configs (FT1-FT4, both archs, all regimes): use model-agnostic
  dataset names (e.g. FT3_contingency_data instead of FT3_contingency_HeteroSAGE_data)
- train_opf_ft1_classify.py, train_opf_finetune.py: add --max_train_samples
  to truncate training split for data-efficiency sweeps
- job-frontier-finetune.sh, job-frontier-FT1-*.sh: pass MAX_TRAIN_SAMPLES and
  NO_PRETRAINED through to training; DATA_MODELNAME now model-agnostic;
  LOG_NAME encodes arch/regime/scratch/n-samples (no SLURM_JOB_ID)
- generate_infeasible_samples.py: replace loading the full dataset with
  reservoir sampling to avoid OOM when generating the FT1 feasibility dataset
  (sketched after this list)
- New scripts: preprocess_ft1_ft3_loginnode.sh, submit_ft1_ft3_debug.sh,
  submit_ft1_data_efficiency_sweep.sh, submit_ft3_data_efficiency_sweep.sh
  (72-job FT1 sweep and 80-job FT3 sweep: 4 methods x 2 archs x sample sizes)
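
A minimal sketch of the reservoir sampling that replaces the full load (classic Algorithm R); the sample iterator and k are placeholders:

```python
import random

def reservoir_sample(sample_iter, k, seed=0):
    """Keep a uniform random subset of size k without materializing the stream."""
    rng = random.Random(seed)
    reservoir = []
    for i, sample in enumerate(sample_iter):
        if i < k:
            reservoir.append(sample)
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = sample
    return reservoir

subset = reservoir_sample(range(1_000_000), k=10_000)
```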
…et generation

- generate_infeasible_samples.py: move the torch.distributed import to module
  level; fix dist.init_process_group call formatting; add a gloo process-group
  init before HDF5Writer.save() to satisfy iterate_tqdm's dist.get_rank() call
  (see the sketch after this list)
- train_opf_ft1_classify.py, train_opf_finetune.py: fix line length and
  blank-line formatting around --max_train_samples Subset truncation block
- train_opf_ft1_classify.py / train_opf_finetune.py: call setup_log()
  before save_config() so ./logs/{LOG_NAME}/ exists.
- job-frontier-FT1-single-method.sh / -FT3-: 1-node debug-QOS jobs;
  unset AWS-OFI NCCL plugin and force NCCL_P2P_LEVEL=SYS for N<=1
  (single-node intra-host xGMI/SHM only).
- FT1 LOG_NAME now keyed on requested N (not TOTAL=2N) to avoid
  cross-N directory collisions.
- watch_and_chain_per_n.sh: serial submitter (1 in-flight per debug
  QOS limit); skips cells whose results.json already exists; persists
  log on Lustre.
- plot_sample_efficiency.py: aggregates results.json across runs into
  per-(task, arch, metric) line plots and summary.csv.
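
A sketch of the gloo process-group init referenced in the first item of this list, assuming a single-process fallback; the environment-variable defaults are illustrative:

```python
import os
import torch.distributed as dist

# iterate_tqdm calls dist.get_rank(), which requires an initialized process
# group even in a serial run, so fall back to a one-process gloo group.
if not dist.is_initialized():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    dist.init_process_group(backend="gloo")
```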