feat(retry): add exponential backoff, jitter, and attempt-aware hooks#299
feat(retry): add exponential backoff, jitter, and attempt-aware hooks#299TGPSKI wants to merge 1 commit into
Conversation
Extend sretoolbox.utils.retry with optional exponential backoff, full jitter, and hook signatures that receive attempt context. Default linear backoff (time.sleep(attempt)) is unchanged for existing callers. Adds tests/test_utils_retry.py as the first dedicated retry test suite. Motivated by qontract-reconcile Vault auth resilience work (app-sre#5585).
|
@hemslo — this PR is in direct support of your suggestions and feedback on qontract-reconcile#5585. Rather than keeping a hand-rolled retry loop in
Once this merges and releases as 4.1.0, the plan is to rewrite the vault auth path in #5585 to use the decorator. Tracking in APPSRE-14592. Would appreciate your review when you have a moment. |
| inspect.Parameter.POSITIONAL_OR_KEYWORD, | ||
| } | ||
| ) | ||
|
|
||
|
|
||
| def _invoke_hook( | ||
| hook: Callable[..., None] | None, | ||
| exception: Exception, | ||
| attempt: int, | ||
| max_attempts: int, | ||
| ) -> None: | ||
| if not callable(hook): | ||
| return | ||
| try: | ||
| param_count = _positional_param_count(inspect.signature(hook)) | ||
| except (TypeError, ValueError): | ||
| hook(exception) | ||
| return | ||
| if param_count >= 3: | ||
| hook(exception, attempt, max_attempts) |
There was a problem hiding this comment.
inspect.signature is called on every retry attempt to detect hook arity. This is both a performance issue (introspection per-attempt) and a fragile pattern — it breaks with *args, some lambdas, functools.partial, and builtins.
No major retry library does this. tenacity passes a RetryCallState object, backoff passes a details dict, stamina passes a RetryDetails dataclass. All use a single uniform argument — no signature introspection.
Suggestion: use a context object instead:
from typing import NamedTuple
class RetryInfo(NamedTuple):
exception: Exception
attempt: int
max_attempts: intThen hook dispatch becomes trivial — no inspect, no arity branching:
if hook is not None:
hook(RetryInfo(exception, attempt, max_attempts))The only existing hook= callers in qontract-reconcile are _log_exception(ex) in gitlab_housekeeping.py and capture_and_forget(error) in gql.py — both 1-arg. Migrating them is a one-line change each: (ex) → (info) + info.exception.
This is a new feature release (4.1.0), so it's the right time to make the hook contract clean rather than building introspection machinery to preserve a signature that only 2 callers use.
| *, | ||
| backoff: BackoffStrategy, | ||
| backoff_base: float, | ||
| backoff_max: float | None, | ||
| jitter: bool, | ||
| ) -> float: | ||
| if backoff == "linear": | ||
| delay = float(attempt) | ||
| else: | ||
| delay = backoff_base ** (attempt - 1) |
There was a problem hiding this comment.
jitter=True also applies to linear backoff (random.uniform(0, float(attempt))). This works but isn't documented — the PR description and docstring only discuss jitter with exponential. If intentional, document it. If not, gate it behind backoff="exponential".
| max_attempts=4, | ||
| exceptions=RuntimeError, | ||
| backoff="exponential", | ||
| backoff_base=2.0, |
There was a problem hiding this comment.
Missing test for the 2-arg hook variant (exc, attempt). Tests cover 1-arg and 3-arg but the elif param_count >= 2 path in _invoke_hook has no coverage. (Moot if switching to context object approach.)
| import inspect | ||
| import itertools | ||
| import random | ||
| import time | ||
| from functools import wraps | ||
| from typing import TYPE_CHECKING, ParamSpec, TypeVar | ||
| from typing import TYPE_CHECKING, Literal, ParamSpec, TypeVar |
There was a problem hiding this comment.
Broader question: have you considered adopting tenacity instead of extending this custom decorator? tenacity is the de facto standard (~30-40M monthly downloads), already a transitive dependency via stamina in qontract-reconcile's virtualenv, and provides everything this PR adds out of the box — exponential backoff, full/equal/decorrelated jitter, composable wait strategies, attempt-aware callbacks via RetryCallState, async support, and more.
Maintaining a custom retry decorator means reimplementing (and testing) features that tenacity has battle-tested for years. The current @retry() call sites in qontract-reconcile could migrate incrementally — the decorator API is similar enough that most changes would be mechanical.
Not necessarily blocking for this PR, but worth considering whether extending sretoolbox's retry is the right long-term investment vs. adopting the ecosystem standard.
Summary
sretoolbox.utils.retrywith optional exponential backoff, full jitter, and attempt-aware hook signaturesbackoff="linear"still sleepstime.sleep(attempt)for all existing@retry()call sitestests/test_utils_retry.py— first dedicated test coverage for the retry decoratorMotivation
Follow-up to review discussion on qontract-reconcile#5585. Vault client auth needs exponential backoff with jitter and per-attempt logging for incident visibility. Rather than maintaining a hand-rolled retry loop in qontract-reconcile, extend the shared decorator and migrate vault auth once this releases.
API (all new kwargs are optional)
Test plan
uv run ruff check --no-fixuv run mypyuv run pytest tests/(136 passed, 83% coverage)4.1.0after merge@retry