Artemis guest request not cancelled on SIGTERM, cleanup crashes with MetadataError #4834

@thrix-bot

Summary

When tmt receives SIGTERM during Artemis provisioning (while polling get_new_state), the Artemis guest request is never cancelled via the API, resulting in an orphaned cloud resource. Additionally, the cleanup step crashes with a MetadataError instead of handling the situation gracefully.

Observed in: https://artifacts.osci.redhat.com/testing-farm/9821a7f6-f206-462a-a512-ab0efea3f12c/work-whqlfskl_d68/log.txt
tmt version: 1.71.0

What happened

  1. Artemis guest c7bbbc9f-0b8a-4d3b-ba11-234d7bcda709 was requested at 12:32:52
  2. get_new_state polling started (86400s timeout, 3s tick) — guest never reached ready state
  3. SIGTERM received at 14:45:10 (~2h12m into polling)
  4. tmt caught the interrupt, suspended steps, ran report, then attempted cleanup
  5. Cleanup crashed with: No guests queued for phase "default-0". A typo in "where" key?
  6. Artemis guest request was never cancelled — resource leak

Root cause

There are two issues:

1. GuestArtemis._create() does not cancel the request on interrupt

In tmt/steps/provision/artemis.py, the _create() method calls self.remove() only when WaitingTimedOutError is raised (lines 636-639). When Interrupted is raised (due to SIGTERM), the exception propagates without cancelling the Artemis request:

try:
    guest_info = Waiting(
        Deadline.from_seconds(self.provision_timeout), tick=self.provision_tick
    ).wait(get_new_state, self._logger)

except tmt.utils.wait.WaitingTimedOutError as error:
    self.remove()  # ← only on timeout
    raise ArtemisProvisionError(...) from error

# No handler for Interrupted → guest request leaks
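
A minimal sketch of the missing handler, written against stand-in classes rather than tmt itself. PendingGuest, ProvisionError, and the local WaitingTimedOutError and Interrupted classes are hypothetical placeholders for GuestArtemis, ArtemisProvisionError, and the real tmt exceptions; remove() only records that it was called, where the real method would issue DELETE /guests/{guestname}.

```python
class WaitingTimedOutError(Exception):
    """Stand-in for tmt.utils.wait.WaitingTimedOutError."""


class Interrupted(Exception):
    """Stand-in for the exception raised when tmt is interrupted."""


class ProvisionError(Exception):
    """Stand-in for ArtemisProvisionError."""


class PendingGuest:
    """Hypothetical, simplified model of a guest that is still being provisioned."""

    def __init__(self, wait):
        # `wait` stands in for Waiting(...).wait(get_new_state, logger).
        self._wait = wait
        self.removed = False

    def remove(self):
        # The real plugin would cancel the guest request via the Artemis API here.
        self.removed = True

    def create(self):
        try:
            return self._wait()
        except WaitingTimedOutError as error:
            # Existing behaviour: cancel the request on timeout.
            self.remove()
            raise ProvisionError("provisioning timed out") from error
        except Interrupted:
            # Assumed fix: cancel the request, then let the interrupt propagate.
            self.remove()
            raise


def raise_interrupted():
    raise Interrupted("SIGTERM")


def raise_timeout():
    raise WaitingTimedOutError("deadline reached")


def run(wait):
    """Run create() with the given wait function and report whether remove() ran."""
    guest = PendingGuest(wait)
    try:
        guest.create()
    except (Interrupted, ProvisionError):
        pass
    return guest.removed
```

With this shape, run(raise_interrupted) and run(raise_timeout) both report that remove() was called. A real fix would also need to decide what happens if remove() itself fails while the interrupt is being handled.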

2. Cleanup step crashes when guests exist but none are ready

In tmt/steps/cleanup/__init__.py, Cleanup.go() has an inconsistency:

  • Line 135: checks self.plan.provision.guests (includes all guests, even not-ready ones) — this is non-empty (the GuestArtemis object exists because self._guest is set in ProvisionArtemis.go() before start() is called)
  • Line 150: uses self._steppified_guests, which filters through ready_guests → guest.is_ready → checks self.primary_address is not None; this is empty because provisioning never completed
  • Line 147: enqueue_plugin(guests=[]) raises MetadataError

This means CleanupInternal.go() — which calls guest.stop() and guest.remove() (the DELETE /guests/{guestname} API call) — is never executed.
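
One possible shape for the guest selection, sketched with stand-in types rather than the real Cleanup step. CleanupGuest and plan_cleanup are hypothetical names; only the is_ready check (primary_address is not None) mirrors the description above. The sketch assumes ready guests go through the normal plugin queue while not-ready guests have remove() called directly, so an empty ready list no longer blocks cancellation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CleanupGuest:
    """Hypothetical, simplified guest: ready once it has a primary address."""

    name: str
    primary_address: Optional[str] = None
    removed: bool = False

    @property
    def is_ready(self) -> bool:
        return self.primary_address is not None

    def remove(self) -> None:
        # The real guest would cancel or release its backing resource here.
        self.removed = True


def plan_cleanup(guests: List[CleanupGuest]) -> Tuple[List[str], List[str]]:
    """Split guests into those to queue for cleanup and those cancelled directly."""
    ready = [guest for guest in guests if guest.is_ready]
    pending = [guest for guest in guests if not guest.is_ready]

    # Not-ready guests never reach the plugin queue, so remove them here.
    for guest in pending:
        guest.remove()

    # Only ready guests would be passed on to the cleanup plugin queue,
    # and the caller would skip the enqueue entirely when this list is empty.
    return [guest.name for guest in ready], [guest.name for guest in pending]
```

For a plan whose only guest never became ready, plan_cleanup returns an empty queue list and still calls remove() on that guest, which is the case reported in this issue.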

Expected behavior

  1. When interrupted during _create() polling, tmt should cancel the Artemis guest request via DELETE /guests/{guestname} before propagating the exception
  2. The cleanup step should handle the case where guests exist but are not ready, either by skipping the enqueue or by including not-ready guests for cleanup purposes (they still need remove() called)

Impact

Orphaned Artemis guest requests leak cloud resources (e.g., AWS bare-metal instances) when tmt is interrupted during provisioning.


Assisted-by: Claude Code
