Fix pod delete path to reliably cancel HTCondor jobs (local and remote schedd) #26

Draft: Copilot wants to merge 3 commits into main from copilot/jobs-keep-running-although-pod-deleted

Copilot AI commented May 28, 2026

Contributor

Pods deleted through /delete were sometimes left running in HTCondor because cancellation invoked condor_rm through os.popen without checking its exit status and did not consistently target remote schedds. This change tightens delete-time job removal so that the result of a pod deletion reflects the actual outcome of the HTCondor cancellation.

  • Delete cancellation path hardened

    • Replaced os.popen-based condor_rm invocation with subprocess.run(...).
    • Added explicit non-zero exit handling so failed cancellations surface as errors instead of looking successful.
    • Added timeout protection to avoid hanging delete requests.
  • JID parsing/validation corrected

    • Accepts .jid values in cluster and cluster.proc form.
    • Validates JID format before cancellation and extracts the cluster ID deterministically.
    • Returns a deletion error for malformed JID content instead of issuing ambiguous removal commands.
  • Remote HTCondor targeting fixed

    • When configured, /delete now cancels against the intended remote scheduler using:
      • condor_rm -pool <collector> -name <schedd> <cluster_id>
    • Falls back to local condor_rm <cluster_id> when remote endpoints are not configured.
  • Focused API-level coverage added

    • Added /delete tests for:
      • remote -pool/-name command construction
      • cluster.proc JID parsing
      • condor_rm failure propagation
      • malformed JID rejection
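The JID handling described in the list above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `parse_cluster_id`, the regular expression, and the choice of `ValueError` are all assumptions made for the example.

```python
import re

# Hypothetical pattern: "cluster" or "cluster.proc", ASCII digits only.
_JID_RE = re.compile(r"([0-9]+)(?:\.([0-9]+))?")


def parse_cluster_id(jid_text: str) -> str:
    """Return the cluster ID from the contents of a .jid file.

    Accepts "1234" and "1234.0"; anything else raises ValueError so the
    caller can report a deletion error instead of running condor_rm.
    """
    jid = jid_text.strip()  # tolerate a trailing newline from the file
    match = _JID_RE.fullmatch(jid)
    if match is None:
        raise ValueError(f"malformed JID: {jid!r}")
    return match.group(1)


print(parse_cluster_id("1234.0\n"))
```

Using `fullmatch` rather than `match` rejects inputs such as `12.3.4` or `1234; rm`, so only a purely numeric cluster ID ever reaches the command line.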
if collector and schedd:
    cmd = ["condor_rm", "-pool", collector, "-name", schedd, cluster_id]
else:
    cmd = ["condor_rm", cluster_id]

result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
if result.returncode != 0:
    raise RuntimeError(f"condor_rm failed (exit {result.returncode}): {result.stderr.strip()}")
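One way to exercise this logic without a real HTCondor pool is to wrap it in a function and stub `subprocess.run` with `unittest.mock`. The sketch below is an assumption about how such a test might look, not the test code added in this PR: the wrapper name `cancel_job`, the example host names, and the decision to convert `subprocess.TimeoutExpired` into a `RuntimeError` are all hypothetical.

```python
import subprocess
from unittest import mock


def cancel_job(cluster_id, collector=None, schedd=None, timeout=60):
    """Hypothetical wrapper around the condor_rm call shown above."""
    if collector and schedd:
        cmd = ["condor_rm", "-pool", collector, "-name", schedd, cluster_id]
    else:
        cmd = ["condor_rm", cluster_id]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        # Assumption: a timeout is reported the same way as a failed removal.
        raise RuntimeError(f"condor_rm timed out after {timeout}s") from exc
    if result.returncode != 0:
        raise RuntimeError(
            f"condor_rm failed (exit {result.returncode}): {result.stderr.strip()}"
        )
    return cmd


# Stub subprocess.run so no condor_rm binary is needed.
ok = subprocess.CompletedProcess(args=[], returncode=0, stdout="", stderr="")
with mock.patch("subprocess.run", return_value=ok) as fake_run:
    cancel_job("1234", collector="cm.example.org", schedd="schedd.example.org")

# The first positional argument of the stubbed call is the command list.
assert fake_run.call_args.args[0] == [
    "condor_rm", "-pool", "cm.example.org", "-name", "schedd.example.org", "1234",
]
```

Passing the command as a list (rather than a shell string) means the cluster ID and host names are never interpreted by a shell, which is part of what makes this path more robust than `os.popen`.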

Copilot AI linked an issue May 28, 2026 that may be closed by this pull request
Copilot AI changed the title from "[WIP] Fix htcondor job cancellation issue during tests at KIT" to "Fix pod delete path to reliably cancel HTCondor jobs (local and remote schedd)" May 28, 2026
Copilot AI requested a review from dciangot May 28, 2026 02:51

dciangot left a comment

Member

Lgtm

Development

Successfully merging this pull request may close these issues.

Jobs keep running although the pod is deleted
