Summary
When tmt receives SIGTERM during Artemis provisioning (while polling get_new_state), the Artemis guest request is never cancelled via the API, resulting in an orphaned cloud resource. Additionally, the cleanup step crashes with a MetadataError instead of handling the situation gracefully.
Observed in: https://artifacts.osci.redhat.com/testing-farm/9821a7f6-f206-462a-a512-ab0efea3f12c/work-whqlfskl_d68/log.txt
tmt version: 1.71.0
What happened
- Artemis guest c7bbbc9f-0b8a-4d3b-ba11-234d7bcda709 was requested at 12:32:52
- get_new_state polling started (86400s timeout, 3s tick), but the guest never reached the ready state
- SIGTERM received at 14:45:10 (~2h12m into polling)
- tmt caught the interrupt, suspended steps, ran report, then attempted cleanup
- Cleanup crashed with:
No guests queued for phase "default-0". A typo in "where" key?
- Artemis guest request was never cancelled — resource leak
Root cause
There are two issues:
1. GuestArtemis._create() does not cancel the request on interrupt
In tmt/steps/provision/artemis.py, the _create() method only calls self.remove() when WaitingTimedOutError is raised (lines 636-639). When Interrupted is raised (due to SIGTERM), the exception propagates without cancelling the Artemis request:
try:
    guest_info = Waiting(
        Deadline.from_seconds(self.provision_timeout), tick=self.provision_tick
    ).wait(get_new_state, self._logger)
except tmt.utils.wait.WaitingTimedOutError as error:
    self.remove()  # ← only on timeout
    raise ArtemisProvisionError(...) from error
# No handler for Interrupted → guest request leaks
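A minimal sketch of one way the missing handler could look. Everything here is a stand-in, not tmt code: SketchGuest, its wait callable, and the three exception classes are hypothetical names that only mirror the ones mentioned above, and remove() just records that it was called instead of talking to Artemis.

```python
class WaitingTimedOutError(Exception):
    """Stand-in for tmt.utils.wait.WaitingTimedOutError."""


class Interrupted(Exception):
    """Stand-in for the exception this report calls Interrupted."""


class ProvisionError(Exception):
    """Stand-in for ArtemisProvisionError."""


class SketchGuest:
    """Hypothetical guest; `wait` replaces the Waiting(...).wait(...) call."""

    def __init__(self, wait):
        self._wait = wait
        self.removed = False

    def remove(self):
        # The real method would cancel the Artemis guest request.
        self.removed = True

    def create(self):
        try:
            return self._wait()
        except WaitingTimedOutError as error:
            self.remove()  # existing behaviour: cancel on timeout
            raise ProvisionError("guest did not become ready in time") from error
        except Interrupted:
            self.remove()  # assumed fix: also cancel on interrupt
            raise  # keep propagating so the caller still sees the interrupt
```

Re-raising the original exception keeps the interrupt visible to the rest of the run. A real fix would probably also need to decide what happens if remove() itself fails while an interrupt is being handled.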
2. Cleanup step crashes when guests exist but none are ready
In tmt/steps/cleanup/__init__.py, Cleanup.go() has an inconsistency:
- Line 135: checks self.plan.provision.guests (includes all guests, even not-ready ones). This is non-empty: the GuestArtemis object exists because self._guest is set in ProvisionArtemis.go() before start() is called.
- Line 150: uses self._steppified_guests, which filters through ready_guests → guest.is_ready → checks self.primary_address is not None. This is empty because provisioning never completed.
- Line 147: enqueue_plugin(guests=[]) raises MetadataError
This means CleanupInternal.go() — which calls guest.stop() and guest.remove() (the DELETE /guests/{guestname} API call) — is never executed.
Expected behavior
- When interrupted during _create() polling, tmt should cancel the Artemis guest request via DELETE /guests/{guestname} before propagating the exception
- The cleanup step should handle the case where guests exist but are not ready, either by skipping the enqueue or by including not-ready guests for cleanup purposes (they still need remove() called)
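The second point could be sketched roughly as follows. This is an illustration under assumptions, not the tmt implementation: SketchGuest and cleanup() are hypothetical names, and only the readiness check (primary_address is not None) is taken from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchGuest:
    """Hypothetical stand-in for a tmt guest; not the real class."""

    name: str
    primary_address: Optional[str] = None
    removed: bool = False

    @property
    def is_ready(self) -> bool:
        # Mirrors the readiness check described in this report.
        return self.primary_address is not None

    def remove(self) -> None:
        # The real method would issue DELETE /guests/{guestname}.
        self.removed = True


def cleanup(guests: List[SketchGuest]) -> List[SketchGuest]:
    """Remove not-ready guests directly; return the ready ones.

    The returned list is what would be handed to the normal phase
    queue, and it may be empty without that being an error.
    """
    ready = [guest for guest in guests if guest.is_ready]
    for guest in guests:
        if not guest.is_ready:
            guest.remove()
    return ready
```

With this shape, a guest that never became ready is still removed, and an empty list of ready guests means "nothing to enqueue" rather than a metadata problem.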
Impact
Orphaned Artemis guest requests leak cloud resources (e.g., AWS bare-metal instances) when tmt is interrupted during provisioning.
Assisted-by: Claude Code