Skip to content

Fix health check reliability: exception handling, stale status, and concurrency#673

Merged
bhimrazy merged 8 commits intoLightning-AI:mainfrom
Lidang-Jiang:fix/health-check-reliability
Apr 9, 2026
Merged

Fix health check reliability: exception handling, stale status, and concurrency#673
bhimrazy merged 8 commits intoLightning-AI:mainfrom
Lidang-Jiang:fix/health-check-reliability

Conversation

@Lidang-Jiang
Copy link
Copy Markdown
Contributor

Summary

Fixes #660 — Three reliability issues in the /health endpoint:

  1. Unhandled exceptions in custom health() methods — If a user's health() raises an exception, the endpoint crashes with 500 instead of returning 503. In Kubernetes, this causes incorrect pod restarts.
  2. Stale worker statusworkers_ready was cached via nonlocal and never re-evaluated. If workers die after becoming ready, health check still returns 200.
  3. Shared state between concurrent requests — The nonlocal workers_ready variable was shared across concurrent health check requests.

Changes

  • Wrap health() calls in try/except with logger.exception(), returning 503 on failure
  • Remove nonlocal workers_ready caching — compute workers_ready inline on every request
  • Sub-issue 3 is resolved automatically by removing the shared nonlocal variable
Before (unmodified code)
============================================================
Sub-issue 1: Custom health() raising exception
============================================================
  GET /health → status=500, body='Internal Server Error'
  ❌ BUG: Returns 500 Internal Server Error (should be 503)

============================================================
Sub-issue 2: Stale worker status (cached workers_ready)
============================================================
  Workers READY:  GET /health → status=200, body='ok'
  Workers ERROR:  GET /health → status=200, body='ok'
  ❌ BUG: Still returns 200 (stale cached status)
  Workers READY:  GET /health → status=200, body='ok'
  ✅ Correctly recovers to 200
After (with fix)
============================================================
Sub-issue 1: Custom health() raising exception
============================================================
  GET /health → status=503, body='not ready'
  ✅ FIXED: Returns 503 Service Unavailable as expected

============================================================
Sub-issue 2: Stale worker status (cached workers_ready)
============================================================
  Workers READY:  GET /health → status=200, body='ok'
  Workers ERROR:  GET /health → status=503, body='not ready'
  ✅ FIXED: Returns 503 after workers enter error state
  Workers READY:  GET /health → status=200, body='ok'
  ✅ Correctly recovers to 200
Unit test results (all passing)
tests/unit/test_lit_server.py::test_health_check_returns_503_when_workers_setup_status_is_empty PASSED
tests/unit/test_lit_server.py::test_health_check_returns_503_on_health_exception PASSED
tests/unit/test_lit_server.py::test_health_check_detects_worker_status_change PASSED
tests/unit/test_lit_server.py::test_health_check_no_stale_cache PASSED

Full suite: 76 passed, 3 skipped

Test plan

  • test_health_check_returns_503_on_health_exception — custom health() raising → 503 not 500
  • test_health_check_detects_worker_status_change — workers READY→ERROR → 503
  • test_health_check_no_stale_cache — status not cached across requests, recovers to 200
  • Existing health check tests still pass
  • ruff check and pre-commit hooks pass

…oncurrency

- Wrap custom health() calls in try/except to return 503 instead of 500
- Remove nonlocal workers_ready caching so status is re-evaluated every request
- Add tests for exception handling, stale status detection, and cache invalidation

Fixes Lightning-AI#660

Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85%. Comparing base (31024b8) to head (d1681ff).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #673   +/-   ##
===================================
- Coverage    85%    85%   -0%     
===================================
  Files        39     39           
  Lines      3279   3282    +3     
===================================
+ Hits       2775   2776    +1     
- Misses      504    506    +2     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bhimrazy bhimrazy merged commit 86e079e into Lightning-AI:main Apr 9, 2026
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Health check reliability and exception handling issues

3 participants