Add section on recovering disconnected SLURM interactive sessions #322

Open
d-morrison wants to merge 2 commits into main from claude/intelligent-thompson-gdkdcb

@d-morrison

Member

Summary

Adds a new "Recovering a disconnected interactive session" section to the HPC/SLURM chapter (slurm.qmd), documenting what to do when an interactive SLURM session drops.

This is a corrected version of advice that circulates online (e.g. AI-generated answers suggesting you can always "reenter" a dropped allocation). The key correction: for a plain interactive srun --pty / salloc session, losing the terminal usually tears down the whole allocation, not just the shell — so there is often nothing left to reattach to. The section therefore leads with prevention and is explicit about when recovery is actually possible.

What's covered

  • Why sessions drop — the launching process receives a hang-up signal and SLURM releases the allocation.
  • Prevention (recommended) — run tmux/screen on the login node, then launch the allocation from inside it so it survives an SSH drop; reattach with tmux attach.
  • Reconnecting to a surviving allocation — find the job ID with squeue --me, then open a new job step with srun --jobid=<JOBID> --overlap --pty bash.
  • The --overlap caveat — flagged in a callout-important: required since SLURM 20.11, otherwise srun hangs waiting for resources the running step holds.
  • sattach as the alternative for reattaching to an existing step's I/O, plus a callout-note that exact behavior is cluster-specific.
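The reconnect step above can be sketched as a small shell helper. This is a minimal illustration, not text from the new section: `reconnect_cmd` is a hypothetical name, the tmux session name `hpc` and job ID `12345` are placeholders, and the helper only prints the command so it can be tried on a machine without SLURM.

```shell
# Sketch only: prints the reconnect command instead of running it.
# `reconnect_cmd` is a hypothetical helper name, not part of SLURM.
reconnect_cmd() {
  jobid="$1"
  if [ -z "$jobid" ]; then
    echo "usage: reconnect_cmd <JOBID>" >&2
    return 1
  fi
  # --overlap lets the new step share resources held by the running step
  # (described in this PR as required since SLURM 20.11).
  printf 'srun --jobid=%s --overlap --pty bash\n' "$jobid"
}

# Typical sequence on a real cluster (comments only; nothing here is executed):
#   tmux new -s hpc        # on the login node; "hpc" is an arbitrary session name
#   salloc <options>       # inside tmux; resource options are site-specific
#   tmux attach -t hpc     # after an SSH drop, from a fresh login
#   squeue --me            # otherwise, look up the JOBID of a surviving allocation
#   reconnect_cmd 12345    # 12345 is a placeholder JOBID
```

Calling `reconnect_cmd 12345` prints `srun --jobid=12345 --overlap --pty bash`, which can then be run by hand once `squeue --me` confirms the allocation is still alive. Exact behavior remains cluster-specific.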

References

Added @online bib entries to book.bib for the srun, salloc, squeue, and sattach SchedMD documentation pages, cited inline in the new section (matching the existing @slurm sbatch reference).

🤖 Generated with Claude Code



Document why interactive sessions drop, how to prevent it with a
terminal multiplexer, and how to reconnect to a surviving allocation
(including the --overlap flag required on SLURM 20.11+). Add references
to the srun, salloc, squeue, and sattach documentation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0177QtQQ7Ebz4Yg5GAVH7QJp
Copilot AI review requested due to automatic review settings June 18, 2026 23:28
@github-actions github-actions Bot removed the request for review from Copilot June 18, 2026 23:28
Replace literal em-dashes with pandoc `---` to satisfy the check-chars
workflow, and add "multiplexer" and "reattach" to the spellcheck wordlist.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0177QtQQ7Ebz4Yg5GAVH7QJp
Copilot AI review requested due to automatic review settings June 18, 2026 23:30
@github-actions github-actions Bot removed the request for review from Copilot June 18, 2026 23:31
@claude

claude Bot commented Jun 18, 2026

Contributor

Claude encountered an error.


I'll analyze this and get back to you.

@github-actions

github-actions Bot commented Jun 18, 2026

Contributor
PR Preview Action v1.8.1-2-g6ad689f

🚀 View preview at
https://UCD-SERG.github.io/lab-manual/pr-preview/pr-322/

Built to branch gh-pages at 2026-06-18 23:35 UTC.
Preview will be ready when the GitHub Pages deployment is complete.
