Eurekashen/Slurm2Slack

Slurm2Slack

SlackSlurm posts Slurm job lifecycle notifications to Slack via Incoming Webhooks. It is designed to run as a watcher on a login node, so compute nodes do not need outbound internet access.

Requirements

  • Python 3.8+
  • Slurm commands available in PATH: squeue, sacct, scontrol
  • Login node can reach https://hooks.slack.com

How to get Slack Incoming Webhooks

  1. Go to your Slack workspace and navigate to "Apps": https://api.slack.com/apps?new_app=1
  2. Click "Create New App" and choose "From scratch".
  3. Name your app and select the workspace.
  4. In the left sidebar, go to "Incoming Webhooks" and activate them.
  5. Click "Add New Webhook to Workspace", choose a channel (your own channel, not a public one), and click "Allow".

Install

pip install --user .

Quick Start

  1. Initialize config (writes the default config file):
slackslurm init --webhook-url "https://hooks.slack.com/services/..."
  2. Add a marker to the sbatch script header:
#SLACKSLURM project=demo mention_on_fail=@here
  3. Run the watcher on a login node:
slackslurm watch

You can run it in a tmux session or via nohup for long-term monitoring.

  4. Optional: send a test message:
slackslurm test

How It Works

  • The watcher polls squeue for active jobs owned by the current user.

  • It only tracks jobs whose sbatch script header contains #SLACKSLURM.

  • When a job disappears from squeue, the watcher queries sacct for the final state and exit code. If sacct is unavailable, it falls back to scontrol show job (best effort).

  • Notifications are sent via Slack Incoming Webhooks using Block Kit payloads.

  • Job submitted: [screenshot: Submitted]

  • Job finished successfully: [screenshot: Finished]
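
The polling steps above can be sketched roughly as follows. This is not the project's actual code; the function names and the chosen output formats are assumptions made for illustration. It only shows the idea of diffing two squeue snapshots and then reading a pipe-delimited sacct line for the job that disappeared.

```python
from typing import Dict, List, Set


def squeue_command(user: str) -> List[str]:
    # -h drops the header; -o "%i|%T" prints "jobid|state" per job.
    return ["squeue", "-u", user, "-h", "-o", "%i|%T"]


def sacct_command(job_id: str) -> List[str]:
    # -X: allocation line only, -n: no header, -P: pipe-delimited fields.
    return ["sacct", "-j", job_id, "-X", "-n", "-P",
            "-o", "JobID,State,ExitCode"]


def parse_squeue(output: str) -> Dict[str, str]:
    """Map job id -> state from 'jobid|state' lines."""
    jobs: Dict[str, str] = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, _, state = line.partition("|")
        jobs[job_id] = state
    return jobs


def disappeared(previous: Dict[str, str], current: Dict[str, str]) -> Set[str]:
    """Job ids present in the previous poll but missing from this one."""
    return set(previous) - set(current)


def parse_sacct(output: str) -> Dict[str, str]:
    """Read the first 'JobID|State|ExitCode' line; empty dict if none."""
    for line in output.splitlines():
        parts = line.strip().split("|")
        if len(parts) == 3:
            return {"job_id": parts[0], "state": parts[1], "exit_code": parts[2]}
    return {}
```

In a real watcher the command lists would be passed to subprocess.run, and an empty result from parse_sacct would be the point to fall back to scontrol show job.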

Commands

  • slackslurm init --webhook-url URL
    • Creates ~/.config/slackslurm/config.json.
    • You can also set SLACKSLURM_WEBHOOK_URL instead of passing the flag.
  • slackslurm test
    • Sends a sample message to verify webhook connectivity.
  • slackslurm watch [--once] [--poll-interval SECONDS]
    • Runs the watcher loop. --once performs a single poll cycle.
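
For readers curious what a connectivity check like slackslurm test might involve, here is a minimal sketch of posting a Block Kit message to an Incoming Webhook using only the standard library. The helper names build_payload and post_to_webhook are hypothetical and the message layout is an assumption, not the tool's real format.

```python
import json
import urllib.request
from typing import Dict


def build_payload(title: str, details: Dict[str, str]) -> dict:
    """Build a small Block Kit message: a bold title plus a context row."""
    blocks = [
        {"type": "section", "text": {"type": "mrkdwn", "text": f"*{title}*"}}
    ]
    context = "  ".join(f"`{key}={value}`" for key, value in details.items())
    if context:
        blocks.append(
            {"type": "context",
             "elements": [{"type": "mrkdwn", "text": context}]}
        )
    # "text" serves as the plain-text fallback for notifications.
    return {"text": title, "blocks": blocks}


def post_to_webhook(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling post_to_webhook with the URL from your config and build_payload("Test message", {"project": "demo"}) should produce a message in the chosen channel if the webhook is set up correctly.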

Configuration

Default config path:

~/.config/slackslurm/config.json

You can override paths with:

  • SLACKSLURM_CONFIG (config file path)
  • SLACKSLURM_STATE (state file path)
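
The override behavior could be implemented along these lines. This is a sketch under the assumption that a set environment variable simply replaces the default path; resolve_path is a hypothetical helper, and the default state path is the one given under State Files.

```python
import os
from pathlib import Path
from typing import Mapping

DEFAULT_CONFIG = "~/.config/slackslurm/config.json"
DEFAULT_STATE = "~/.local/state/slackslurm/state.json"


def resolve_path(env_var: str, default: str,
                 env: Mapping[str, str] = os.environ) -> Path:
    """Return the path from env_var if it is set and non-empty, else the default."""
    raw = env.get(env_var) or default
    return Path(raw).expanduser()


config_path = resolve_path("SLACKSLURM_CONFIG", DEFAULT_CONFIG)
state_path = resolve_path("SLACKSLURM_STATE", DEFAULT_STATE)
```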

Example config:

{
  "webhook_url": "https://hooks.slack.com/services/...",
  "poll_interval_seconds": 45,
  "notify_on": ["submit", "start", "end"],
  "script_marker": "#SLACKSLURM",
  "tail_log_lines_on_fail": 40,
  "max_log_chars": 1800,
  "mention_on_fail": "@here",
  "include_log_paths": true
}

Key fields:

  • webhook_url: Slack Incoming Webhook URL.
  • notify_on: choose any of submit, start, end.
  • script_marker: change the marker if you want a different tag.
  • tail_log_lines_on_fail: number of lines to include from stderr/stdout.
  • max_log_chars: hard cap for the log snippet size.
  • mention_on_fail: mention string added to failed jobs.
  • include_log_paths: include stdout/stderr file paths in the message.
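
To illustrate how tail_log_lines_on_fail and max_log_chars might interact, the sketch below takes the last N lines of a log and then applies the character cap. Keeping the end of the snippet when the cap is hit is an assumption, as is the helper name tail_snippet; the tool may truncate differently.

```python
def tail_snippet(text: str, max_lines: int, max_chars: int) -> str:
    """Return the last max_lines lines of text, capped at max_chars characters."""
    if max_lines <= 0 or max_chars <= 0:
        return ""
    snippet = "\n".join(text.splitlines()[-max_lines:])
    # Keep the end of the snippet, where the error usually appears.
    return snippet[-max_chars:]
```

With the example config, a failed job's log would be reduced via tail_snippet(log_text, 40, 1800) before being attached to the message.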

Marking Jobs

Add the marker line in the header (before the first non-comment line):

#!/bin/bash
#SBATCH --job-name=train_x
#SBATCH --gres=gpu:4
#SBATCH --time=12:00:00
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err

#SLACKSLURM project=robot exp=run42 mention_on_fail=@here

python train.py

Supported tag behavior:

  • key=value pairs are shown in the Slack message context as tags.
  • mention_on_fail overrides the config value for this job only.
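
A header scan of this kind could look like the following sketch. It assumes the header ends at the first line that is neither blank nor a comment, and that tags are whitespace-separated key=value tokens; parse_marker is a hypothetical name, not the project's function.

```python
from typing import Dict, Optional


def parse_marker(script_text: str,
                 marker: str = "#SLACKSLURM") -> Optional[Dict[str, str]]:
    """Return the tags from the marker line, or None if the header has no marker."""
    for line in script_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if not stripped.startswith("#"):
            break  # first non-comment line: the header is over
        if stripped.startswith(marker):
            rest = stripped[len(marker):]
            if rest and not rest[0].isspace():
                continue  # e.g. "#SLACKSLURMX", not our marker
            tags: Dict[str, str] = {}
            for token in rest.split():
                key, sep, value = token.partition("=")
                if sep and key:
                    tags[key] = value
            return tags
    return None
```

For the example script above, this returns the three tags project, exp, and mention_on_fail; a marker placed after python train.py would be ignored.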

Running as a Daemon

For long-running sessions, use either tmux or nohup.

With tmux:

tmux new -s slackslurm
slackslurm watch

With nohup (the log directory must already exist):

nohup slackslurm watch > ~/.local/state/slackslurm/watch.log 2>&1 &

To stop, send SIGINT or kill the process.

State Files

State is stored at:

~/.local/state/slackslurm/state.json

If you need to reset notification history, stop the watcher and delete the state file.

Troubleshooting

  • No messages: ensure the sbatch script has the #SLACKSLURM marker and that the watcher runs on a login node with internet access.
  • Missing end notifications: confirm sacct works on your cluster.
  • Webhook errors: run slackslurm test and check Slack app configuration.
  • Log tails missing: ensure StdOut/StdErr paths exist and are readable.

Security Notes

  • Treat the webhook URL as a secret and do not commit it to the repo.
  • The config file is written with 0600 permissions when possible.
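
One common way to write a file with 0600 permissions in Python is to create it with os.open and an explicit mode. The sketch below shows that approach; it is an assumption about how such a write could be done, not a description of the tool's code, and write_private_json is a hypothetical name.

```python
import json
import os
from pathlib import Path


def write_private_json(path: Path, data: dict) -> None:
    """Write data as JSON to a file readable and writable by the owner only."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The mode argument applies only when the file is created.
    fd = os.open(str(path), os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=2)
    try:
        # Tighten permissions on a file that already existed.
        os.chmod(str(path), 0o600)
    except OSError:
        pass  # best effort, e.g. on filesystems without POSIX modes
```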

About

Posts Slurm job notifications to a Slack channel.
