-
Notifications
You must be signed in to change notification settings - Fork 37
fix(agent): handle None response from retry exhaustion to prevent crashes #787
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix(agent): handle None response from retry exhaustion to prevent crashes #787
Conversation
c04cb3b to
0e8cab0
Compare
|
CI failed because of the policy check. Need maintainers' approval to rerun. |
|
@pedro-avalos @rene-oromtz Can you kindly look at this PR and let me know if my submission will be accepted? |
|
@boukeas Thank you for reviewing my other PR. I have just one more PR up, which is this one. I understand you folks might have other higher-priority items and totally understand if this has to wait, but if you have some comments/questions about this one, I can address them in the meantime. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #787 +/- ##
==========================================
+ Coverage 73.54% 73.59% +0.05%
==========================================
Files 109 109
Lines 10217 10226 +9
Branches 882 885 +3
==========================================
+ Hits 7514 7526 +12
+ Misses 2515 2513 -2
+ Partials 188 187 -1
*This pull request uses carry forward flags. Click here to find out more.
🚀 New features to boost your workflow:
|
|
Run charm test CI check hangs and did not run successfully. The tests I added only cover the None cases (when job_request is None), but the else branches (when job_request is a Response object with status >= 400) are NOT covered yet. I will add tests to enchance coverage. |
565c7be to
7dd9c57
Compare
…shes Fix AttributeError when session.post() returns None after retry exhaustion. When the server returns repeated 503 errors, the requests retry mechanism can return None instead of a Response object, causing agents to crash when accessing status_code attribute. Fixed methods: - post_status_update() - post_result() - get_result() - repost_job() - post_artifacts() Resolves agents showing EXITED status in supervisorctl. Fixes canonical#438 (Fixes #CERTTF-474 Internally) Fix AttributeError when session.post() returns None after retry exhaustion. When the server returns repeated 503 errors, the requests retry mechanism can return None instead of a Response object, causing agents to crash when accessing status_code attribute. Fixed methods: - post_status_update() - post_result() - get_result() - save_artifacts() Resolves agents showing EXITED status in supervisorctl. Fixes canonical#438 (Fixes #CERTTF-474 Internally)
7dd9c57 to
60ad9aa
Compare
|
Rebased and fixed conflicts. CI Checks pass. This has been open for a while, Can this get a review ? CC: @rene-oromtz @boukeas |
ajzobro
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution and your patience, we are sorry it has taken so long to get around to reviewing it.
Please add tests for:
post_result() with a response object that has a status_code (lines 202-207)
get_result() with a response object that has a status_code (lines 233-237)
save_artifacts() with a response object that has a status_code (lines 335-340)
| ) | ||
| raise TFServerError("No response received") | ||
| else: | ||
| logger.error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding tests for the new None branch, but could you please also add tests for these other else branches where the responses have status codes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated!
| "No response received", | ||
| ) | ||
| else: | ||
| logger.error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated!
| ) | ||
| raise TFServerError("No response received") | ||
| else: | ||
| logger.error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And also here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated!
agent/tests/test_client.py
Outdated
| job_id = str(uuid.uuid1()) | ||
|
|
||
| # Mock the session.post to return None (simulating retry exhaustion) | ||
| with patch.object(client.session, "post", return_value=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have been moving away from using the with patch syntax in favor of using @patch. decorators. See Rene's feedback on my PR #915 for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I moved my tests to @patch
Thank you for taking a look. I made some changes and pushed them. Let me know if you have any questions. |
Signed-off-by: Gajesh Bhat <gajeshbht@gmail.com>
Description
Fixes issue #438 where Testflinger agents show EXITED status in supervisorctl and cannot recover themselves.
Problem
When the Testflinger server experiences issues (like repeated 503 errors), the agent's HTTP client retry mechanism can return
Noneinstead of a properResponseobject after exhausting all retries. The existing error handling code checkedif not job_request:but then tried to accessjob_request.status_code, causing anAttributeError: 'NoneType' object has no attribute 'status_code'.This unhandled exception caused the agent process to crash, leading to:
Root Cause
The issue stems from the ambiguity in Boolean evaluation of
requests.Responseobjects:Noneevaluates toFalse(retry exhaustion)Responsewith status >= 400 also evaluates toFalse(HTTP error)Both cases triggered the same error handling path, but only the Response case has a
status_codeattribute.Boolean Evaluation Issue
The
requests.Responseobject implements a custom__bool__method that returnsFalsefor HTTP status codes >= 400 (requests documentation). This creates an ambiguous condition where both cases evaluate as falsy:Requests Retry Behaviour
When using
urllib3.util.retry.Retrywithrequests.Session, the retry mechanism can returnNonein edge cases after exhausting all retry attempts, particularly with network-level failures (urllib3 retry documentation). The original code assumed all HTTP operations would return aResponseobject, but this assumption breaks during severe connectivity issues.The fix described in the section below explicitly distinguishes between these two falsy states:
Solution
Updated error handling in 5 client methods to explicitly check for
Nonebefore accessingstatus_code:post_status_update()- The main method causing reported crashespost_result()get_result()repost_job()post_artifacts()Resolved issues
Fixes #438 (Resolves #CERTTF-474 Internally)
Documentation
Bug fix. No documentation update required
Web service API changes
None. Added better error handling in a bugfix
Tests