@gajeshbhat gajeshbhat commented Aug 26, 2025

Description

Fixes issue #438 where Testflinger agents show EXITED status in supervisorctl and cannot recover themselves.

Problem

When the Testflinger server experiences issues (like repeated 503 errors), the agent's HTTP client retry mechanism can return None instead of a proper Response object after exhausting all retries. The existing error handling code checked if not job_request: but then tried to access job_request.status_code, causing an AttributeError: 'NoneType' object has no attribute 'status_code'.

This unhandled exception caused the agent process to crash, leading to:

  • Agents showing EXITED status in supervisorctl
  • Manual intervention required to restart agents
  • Loss of job processing capability

Root Cause

The issue stems from the ambiguity in Boolean evaluation of requests.Response objects:

  • None evaluates to False (retry exhaustion)
  • Response with status >= 400 also evaluates to False (HTTP error)

Both cases triggered the same error handling path, but only the Response case has a status_code attribute.

Boolean Evaluation Issue

The requests.Response object implements a custom __bool__ method that returns False for HTTP status codes >= 400 (requests documentation). This creates an ambiguous condition where both cases evaluate as falsy:

# Both values are falsy, so `if not job_request:` is true for each (pseudocode)
job_request = None                    # Retry exhaustion → None
job_request = Response(status=503)    # HTTP error → falsy due to __bool__
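
To make the ambiguity concrete, here is a minimal sketch of the pre-fix pattern. `FakeResponse` is a hypothetical stand-in that only approximates the truthiness of `requests.Response` (falsy for status codes >= 400); it is not the real class, and `naive_error_message` is an illustrative name rather than code from this PR.

```python
class FakeResponse:
    """Hypothetical stand-in approximating requests.Response truthiness."""

    def __init__(self, status_code):
        self.status_code = status_code

    def __bool__(self):
        # Assumption: mirrors Response.ok, i.e. falsy for status codes >= 400
        return self.status_code < 400


def naive_error_message(job_request):
    """Pre-fix pattern: assumes any falsy value still has a status_code."""
    if not job_request:
        return f"HTTP {job_request.status_code}"
    return "ok"


# An HTTP error response is falsy but still carries a status code
print(naive_error_message(FakeResponse(503)))

# None is also falsy, so it takes the same branch and the attribute access fails
try:
    naive_error_message(None)
except AttributeError as exc:
    print(f"crash: {exc}")
```

The second call is the situation described in the Problem section: the falsy check passes for `None`, and the subsequent attribute access raises.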

Requests Retry Behaviour

When using urllib3.util.retry.Retry with requests.Session, the retry mechanism can return None in edge cases after exhausting all retry attempts, particularly with network-level failures (urllib3 retry documentation). The original code assumed all HTTP operations would return a Response object, but this assumption breaks during severe connectivity issues.

The fix described in the section below explicitly distinguishes between these two falsy states:

if not job_request:
    if job_request is None:  # Retry exhaustion
        logger.error("No response received")
    else:  # HTTP error response
        logger.error("HTTP %d", job_request.status_code)

Solution

Updated error handling in 5 client methods to explicitly check for None before accessing status_code:

  • post_status_update() - The main method causing reported crashes
  • post_result()
  • get_result()
  • repost_job()
  • post_artifacts()
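
Because the same check recurs in each method, one way to picture it is as a shared helper. This is only a sketch under stated assumptions: `check_response` and its `action` parameter are hypothetical names, `TFServerError` is redefined locally as a plain exception rather than imported from the agent, and the PR itself may repeat the check inline and may log without raising in some methods.

```python
import logging

logger = logging.getLogger(__name__)


class TFServerError(Exception):
    """Local stand-in for the agent's server error type (assumed shape)."""


def check_response(job_request, action):
    """Return job_request if it is truthy, otherwise log and raise.

    `check_response` and `action` are hypothetical names for illustration.
    """
    if job_request:
        return job_request
    if job_request is None:
        # Retry exhaustion: there is no status code to report
        logger.error("Unable to %s: no response received", action)
        raise TFServerError("No response received")
    # Falsy but not None: an HTTP error response that carries a status code
    logger.error("Unable to %s: HTTP %d", action, job_request.status_code)
    raise TFServerError(f"HTTP {job_request.status_code}")
```

The `is None` test comes before any attribute access, so `status_code` is only read when an object is actually present.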

Resolved issues

Fixes #438 (Resolves #CERTTF-474 Internally)

Documentation

Bug fix. No documentation update required

Web service API changes

None. Added better error handling in a bugfix

Tests

  • Added 4 new unit tests to verify None response handling
  • All existing tests continue to pass (82/82 agent tests)
  • Verified with other components (CLI: 67/67, Server: 116/116, Common: 3/3)
  • All linting checks pass

@gajeshbhat gajeshbhat force-pushed the fix/agents-show-exited-supervisor-ctl branch from c04cb3b to 0e8cab0 Compare August 26, 2025 00:41
@gajeshbhat
Contributor Author

CI failed because of the policy check. Need maintainers' approval to rerun.

@gajeshbhat
Contributor Author

@pedro-avalos @rene-oromtz Can you kindly look at this PR and let me know if my submission will be accepted?

@gajeshbhat
Contributor Author

@boukeas Thank you for reviewing my other PR. I have just one more PR up, which is this one. I understand you folks might have other higher-priority items and totally understand if this has to wait, but if you have some comments/questions about this one, I can address them in the meantime.

@codecov

codecov bot commented Nov 27, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.59%. Comparing base (8b674a4) to head (2976adc).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #787      +/-   ##
==========================================
+ Coverage   73.54%   73.59%   +0.05%     
==========================================
  Files         109      109              
  Lines       10217    10226       +9     
  Branches      882      885       +3     
==========================================
+ Hits         7514     7526      +12     
+ Misses       2515     2513       -2     
+ Partials      188      187       -1     
Flag Coverage Δ *Carryforward flag
agent 73.54% <100.00%> (+0.43%) ⬆️
cli 89.52% <ø> (ø) Carriedforward from 60ad9aa
device 59.25% <ø> (ø) Carriedforward from 60ad9aa
server 87.99% <ø> (ø) Carriedforward from 60ad9aa

*This pull request uses carry forward flags.

Components Coverage Δ
Agent 73.54% <100.00%> (+0.43%) ⬆️
CLI 89.52% <ø> (ø)
Common ∅ <ø> (∅)
Device Connectors 59.25% <ø> (ø)
Server 87.99% <ø> (ø)

@gajeshbhat
Contributor Author

gajeshbhat commented Nov 27, 2025

The "Run charm test" CI check hung and did not complete successfully. The tests I added only cover the None cases (when job_request is None); the else branches (when job_request is a Response object with status >= 400) are not covered yet. I will add tests to enhance coverage.
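
For reference, a test pair covering both branches could look roughly like the sketch below. Everything here is a stand-in: `post_result` is a simplified function rather than the real client method, `TFServerError` is redefined locally, and the URL is a placeholder. The falsy error response is simulated by configuring `__bool__` on a `MagicMock`.

```python
import unittest
from unittest.mock import MagicMock


class TFServerError(Exception):
    """Local stand-in for the agent's error type (assumption)."""


def post_result(session, url, data):
    """Simplified stand-in for a client method with the None check applied."""
    job_request = session.post(url, json=data)
    if not job_request:
        if job_request is None:
            raise TFServerError("No response received")
        raise TFServerError(f"HTTP {job_request.status_code}")
    return job_request


class PostResultTests(unittest.TestCase):
    def test_none_response(self):
        session = MagicMock()
        session.post.return_value = None  # simulated retry exhaustion
        with self.assertRaisesRegex(TFServerError, "No response received"):
            post_result(session, "http://127.0.0.1:8000/result", {})

    def test_http_error_response(self):
        response = MagicMock(status_code=503)
        response.__bool__.return_value = False  # mimic a falsy error response
        session = MagicMock()
        session.post.return_value = response
        with self.assertRaisesRegex(TFServerError, "HTTP 503"):
            post_result(session, "http://127.0.0.1:8000/result", {})
```

Saved to a file, this can be run with `python -m unittest`. The second test exercises the else branch that the None-only tests leave uncovered.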

@gajeshbhat gajeshbhat force-pushed the fix/agents-show-exited-supervisor-ctl branch from 565c7be to 7dd9c57 Compare December 2, 2025 01:27
…shes

Fix AttributeError when session.post() returns None after retry exhaustion.
When the server returns repeated 503 errors, the requests retry mechanism
can return None instead of a Response object, causing agents to crash when
accessing status_code attribute.

Fixed methods:
- post_status_update()
- post_result()
- get_result()
- repost_job()
- post_artifacts()

Resolves agents showing EXITED status in supervisorctl.

Fixes canonical#438 (Fixes #CERTTF-474 Internally)

Fix AttributeError when session.post() returns None after retry exhaustion.
When the server returns repeated 503 errors, the requests retry mechanism
can return None instead of a Response object, causing agents to crash when
accessing status_code attribute.

Fixed methods:
- post_status_update()
- post_result()
- get_result()
- save_artifacts()

Resolves agents showing EXITED status in supervisorctl.
Fixes canonical#438 (Fixes #CERTTF-474 Internally)
@gajeshbhat gajeshbhat force-pushed the fix/agents-show-exited-supervisor-ctl branch from 7dd9c57 to 60ad9aa Compare February 7, 2026 18:29
@gajeshbhat
Contributor Author

Rebased and fixed conflicts. CI checks pass. This has been open for a while; can it get a review? CC: @rene-oromtz @boukeas

Contributor

@ajzobro ajzobro left a comment

Thank you for your contribution and your patience; we are sorry it has taken so long to get around to reviewing it.

Please add tests for:

post_result() with a response object that has a status_code (lines 202-207)
get_result() with a response object that has a status_code (lines 233-237)
save_artifacts() with a response object that has a status_code (lines 335-340)

)
raise TFServerError("No response received")
else:
logger.error(
Contributor

Thank you for adding tests for the new None branch, but could you please also add tests for the else branches, where the responses have status codes?

Contributor Author

Updated!

"No response received",
)
else:
logger.error(
Contributor

And here.

Contributor Author

Updated!

)
raise TFServerError("No response received")
else:
logger.error(
Contributor

And also here.

Contributor Author

Updated!

job_id = str(uuid.uuid1())

# Mock the session.post to return None (simulating retry exhaustion)
with patch.object(client.session, "post", return_value=None):
Contributor

We have been moving away from the with patch syntax in favor of @patch decorators. See Rene's feedback on my PR #915 for an example.

Contributor Author

Sounds good. I moved my tests to @patch
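
A rough sketch of the decorator form, for comparison with the `with patch.object(...)` snippet quoted above. `FakeSession` and `FakeClient` are hypothetical stand-ins, the request path is illustrative, and the simplified `post_status_update` here just returns a boolean; none of this is the actual agent code.

```python
import unittest
import uuid
from unittest.mock import patch


class FakeSession:
    """Hypothetical stand-in for the client's HTTP session."""

    def post(self, url, **kwargs):
        raise RuntimeError("tests should never reach the network")


class FakeClient:
    """Hypothetical, heavily simplified stand-in for the agent client."""

    def __init__(self):
        self.session = FakeSession()

    def post_status_update(self, job_id):
        # The path below is illustrative only
        job_request = self.session.post(f"job/{job_id}/events")
        if not job_request:
            # Covers both None and falsy error responses without raising
            return False
        return True


class StatusUpdateTests(unittest.TestCase):
    # Decorator form: the mock is injected as an extra test argument
    @patch.object(FakeSession, "post", return_value=None)
    def test_none_response_does_not_crash(self, mock_post):
        client = FakeClient()
        job_id = str(uuid.uuid1())
        self.assertFalse(client.post_status_update(job_id))
        mock_post.assert_called_once()
```

With the decorator, the patch is active for the whole test method and the mock arrives as a parameter, which removes one level of indentation from the test body.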

@gajeshbhat
Contributor Author

Thank you for your contribution and your patience; we are sorry it has taken so long to get around to reviewing it.

Please add tests for:

post_result() with a response object that has a status_code (lines 202-207)
get_result() with a response object that has a status_code (lines 233-237)
save_artifacts() with a response object that has a status_code (lines 335-340)

Thank you for taking a look. I made some changes and pushed them. Let me know if you have any questions.

Signed-off-by: Gajesh Bhat <gajeshbht@gmail.com>
@gajeshbhat gajeshbhat requested a review from ajzobro February 9, 2026 23:40

Development

Successfully merging this pull request may close these issues.

Agents show EXITED in supervisorctl

2 participants