
feat(retry): add exponential backoff, jitter, and attempt-aware hooks #299

Open
TGPSKI wants to merge 1 commit into app-sre:master from TGPSKI:extend-retry-exponential-jitter

TGPSKI commented Jun 9, 2026


Summary

  • Extend sretoolbox.utils.retry with optional exponential backoff, full jitter, and attempt-aware hook signatures
  • Default behavior unchanged: backoff="linear" is the default and still calls time.sleep(attempt) at all existing @retry() call sites
  • Add tests/test_utils_retry.py — first dedicated test coverage for the retry decorator

Motivation

Follow-up to review discussion on qontract-reconcile#5585. Vault client auth needs exponential backoff with jitter and per-attempt logging for incident visibility. Rather than maintaining a hand-rolled retry loop in qontract-reconcile, extend the shared decorator and migrate vault auth once this releases.

API (all new kwargs are optional)

@retry()  # unchanged: linear 1s, 2s, 3s…

@retry(
    max_attempts=5,
    backoff="exponential",
    backoff_base=2.0,
    backoff_max=30.0,
    jitter=True,
    hook=my_hook,  # (exc) | (exc, attempt) | (exc, attempt, max_attempts)
)
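As a rough illustration of what these settings imply, the nominal (pre-jitter) delays can be worked out by hand. This sketch assumes the delay before retry n is `backoff_base ** (n - 1)` capped at `backoff_max`, and that no sleep follows the final attempt; `nominal_delays` is a hypothetical helper for this description, not part of sretoolbox.

```python
from __future__ import annotations


def nominal_delays(max_attempts: int, base: float, cap: float | None) -> list[float]:
    """Pre-jitter sleep before each retry (hypothetical helper, illustration only)."""
    delays = []
    # Assumption: a sleep follows every failed attempt except the last one.
    for attempt in range(1, max_attempts):
        delay = base ** (attempt - 1)
        delays.append(delay if cap is None else min(delay, cap))
    return delays


# Linear backoff sleeps `attempt` seconds after each failed attempt.
linear = [float(attempt) for attempt in range(1, 5)]
exponential = nominal_delays(max_attempts=5, base=2.0, cap=30.0)
capped = nominal_delays(max_attempts=8, base=2.0, cap=30.0)

print(linear)       # [1.0, 2.0, 3.0, 4.0]
print(exponential)  # [1.0, 2.0, 4.0, 8.0]
print(capped)       # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

With jitter=True each of these values would act as an upper bound on the sleep rather than the exact sleep.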

Test plan

  • uv run ruff check --no-fix
  • uv run mypy
  • uv run pytest tests/ (136 passed, 83% coverage)
  • Release as 4.1.0 after merge
  • Follow-up: rewrite vault auth in qontract-reconcile#5585 to use @retry

Extend sretoolbox.utils.retry with optional exponential backoff, full
jitter, and hook signatures that receive attempt context. Default
linear backoff (time.sleep(attempt)) is unchanged for existing callers.

Adds tests/test_utils_retry.py as the first dedicated retry test suite.

Motivated by qontract-reconcile Vault auth resilience work (app-sre#5585).

TGPSKI commented Jun 9, 2026

Author

@hemslo — this PR is in direct support of your suggestions and feedback on qontract-reconcile#5585.

Rather than keeping a hand-rolled retry loop in VaultClient, we extended sretoolbox.utils.retry with:

  • optional exponential backoff + full jitter (addressing the thundering-herd point)
  • attempt-aware hook signatures (exc, attempt, max_attempts) for per-attempt logging
  • unchanged default linear behavior for all existing @retry() call sites

Once this merges and releases as 4.1.0, the plan is to rewrite the vault auth path in #5585 to use the decorator. Tracking in APPSRE-14592.

Would appreciate your review when you have a moment.

Comment thread sretoolbox/utils/retry.py
Comment on lines +97 to +116
inspect.Parameter.POSITIONAL_OR_KEYWORD,
}
)


def _invoke_hook(
hook: Callable[..., None] | None,
exception: Exception,
attempt: int,
max_attempts: int,
) -> None:
if not callable(hook):
return
try:
param_count = _positional_param_count(inspect.signature(hook))
except (TypeError, ValueError):
hook(exception)
return
if param_count >= 3:
hook(exception, attempt, max_attempts)

Contributor


inspect.signature is called on every retry attempt to detect hook arity. This is both a performance issue (introspection per-attempt) and a fragile pattern — it breaks with *args, some lambdas, functools.partial, and builtins.
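A small sketch of the failure mode, assuming the helper counts only named positional parameters. The body of `_positional_param_count` is not fully visible in this hunk, so `positional_param_count` below is a stand-in, and both hooks are hypothetical:

```python
import inspect

_POSITIONAL_KINDS = frozenset(
    {
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    }
)


def positional_param_count(func) -> int:
    """Stand-in for the PR helper: count named positional parameters."""
    params = inspect.signature(func).parameters.values()
    return sum(1 for p in params if p.kind in _POSITIONAL_KINDS)


def variadic_hook(*args):
    # Could accept (exc, attempt, max_attempts), but has no named positionals.
    return args


def hook_with_default(exc, prefix="vault"):
    # The second parameter is configuration, not an attempt number.
    return f"{prefix}: {exc}"


counts = {
    "variadic_hook": positional_param_count(variadic_hook),
    "hook_with_default": positional_param_count(hook_with_default),
}
print(counts)  # {'variadic_hook': 0, 'hook_with_default': 2}
```

Under arity-based dispatch, the first hook would only ever receive the exception, and the second would have `prefix` overwritten by the attempt number.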

No major retry library does this. tenacity passes a RetryCallState object, backoff passes a details dict, stamina passes a RetryDetails dataclass. All use a single uniform argument — no signature introspection.

Suggestion: use a context object instead:

from typing import NamedTuple

class RetryInfo(NamedTuple):
    exception: Exception
    attempt: int
    max_attempts: int

Then hook dispatch becomes trivial — no inspect, no arity branching:

if hook is not None:
    hook(RetryInfo(exception, attempt, max_attempts))

The only existing hook= callers in qontract-reconcile are _log_exception(ex) in gitlab_housekeeping.py and capture_and_forget(error) in gql.py, both 1-arg. Migrating them is a one-line change each: (ex) → (info), then read info.exception.
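Put together, the migration could look roughly like this. The hook name `log_exception` and its body are illustrative only (the real hook bodies in qontract-reconcile are not shown here), and `invoke_hook` is a hypothetical name for the dispatch helper:

```python
from typing import Callable, List, NamedTuple, Optional


class RetryInfo(NamedTuple):
    exception: Exception
    attempt: int
    max_attempts: int


Hook = Callable[[RetryInfo], None]
messages: List[str] = []


# Before (1-arg contract): def log_exception(ex): messages.append(str(ex))
def log_exception(info: RetryInfo) -> None:
    # Illustrative body: attempt context is now available without extra arguments.
    messages.append(f"{info.attempt}/{info.max_attempts}: {info.exception}")


def invoke_hook(
    hook: Optional[Hook], exception: Exception, attempt: int, max_attempts: int
) -> None:
    # Uniform contract: always one argument, no signature inspection.
    if hook is not None:
        hook(RetryInfo(exception, attempt, max_attempts))


invoke_hook(log_exception, RuntimeError("vault unavailable"), 2, 5)
invoke_hook(None, RuntimeError("ignored"), 3, 5)
print(messages)  # ['2/5: vault unavailable']
```

Because `RetryInfo` is a NamedTuple, fields can be added later without changing the hook signature again.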

This is a new feature release (4.1.0), so it's the right time to make the hook contract clean rather than building introspection machinery to preserve a signature that only 2 callers use.

Comment thread sretoolbox/utils/retry.py
Comment on lines +73 to +82
*,
backoff: BackoffStrategy,
backoff_base: float,
backoff_max: float | None,
jitter: bool,
) -> float:
if backoff == "linear":
delay = float(attempt)
else:
delay = backoff_base ** (attempt - 1)

Contributor


jitter=True also applies to linear backoff (random.uniform(0, float(attempt))). This works but isn't documented — the PR description and docstring only discuss jitter with exponential. If intentional, document it. If not, gate it behind backoff="exponential".
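To make the two options concrete, here is a sketch. Only the linear/exponential branch comes from the hunk above; the cap-then-jitter ordering and the `jitter_linear` flag are assumptions added for illustration, not the PR's actual code:

```python
from __future__ import annotations

import random


def compute_delay(
    attempt: int,
    *,
    backoff: str = "linear",
    backoff_base: float = 2.0,
    backoff_max: float | None = None,
    jitter: bool = False,
    jitter_linear: bool = True,  # hypothetical flag; False models the gated option
) -> float:
    if backoff == "linear":
        delay = float(attempt)
    else:
        delay = backoff_base ** (attempt - 1)
    # Assumption: the cap is applied before jitter.
    if backoff_max is not None:
        delay = min(delay, backoff_max)
    # Full jitter: sleep a uniform random time between 0 and the nominal delay.
    if jitter and (backoff == "exponential" or jitter_linear):
        delay = random.uniform(0.0, delay)
    return delay


print(compute_delay(3))                                    # 3.0
print(compute_delay(4, backoff="exponential"))             # 8.0
print(compute_delay(3, jitter=True, jitter_linear=False))  # 3.0 (jitter gated off)
```

With `jitter_linear=True` (the behavior described in this comment), `compute_delay(3, jitter=True)` returns some value between 0 and 3.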

Comment thread tests/test_utils_retry.py
max_attempts=4,
exceptions=RuntimeError,
backoff="exponential",
backoff_base=2.0,

Contributor


Missing test for the 2-arg hook variant (exc, attempt). Tests cover 1-arg and 3-arg but the elif param_count >= 2 path in _invoke_hook has no coverage. (Moot if switching to context object approach.)
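If the arity-based dispatch stays, the missing case could be covered along these lines. This is a sketch against a stand-in `invoke_hook` whose branch order is reconstructed from this review, not copied from the PR, and it skips the TypeError/ValueError fallback for brevity:

```python
import inspect

_POSITIONAL = (
    inspect.Parameter.POSITIONAL_ONLY,
    inspect.Parameter.POSITIONAL_OR_KEYWORD,
)


def invoke_hook(hook, exception, attempt, max_attempts):
    """Stand-in for _invoke_hook (assumed branch order: 3-arg, 2-arg, 1-arg)."""
    if not callable(hook):
        return
    params = inspect.signature(hook).parameters.values()
    param_count = sum(1 for p in params if p.kind in _POSITIONAL)
    if param_count >= 3:
        hook(exception, attempt, max_attempts)
    elif param_count >= 2:
        hook(exception, attempt)
    else:
        hook(exception)


def test_two_arg_hook_receives_exception_and_attempt():
    calls = []

    def hook(exc, attempt):
        calls.append((str(exc), attempt))

    invoke_hook(hook, RuntimeError("boom"), 2, 4)
    # The 2-arg path passes the attempt number but not max_attempts.
    assert calls == [("boom", 2)]


test_two_arg_hook_receives_exception_and_attempt()
```

In the real suite this would more naturally go through `@retry(hook=...)` with `time.sleep` patched out, rather than calling the dispatch helper directly.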

Comment thread sretoolbox/utils/retry.py
Comment on lines +36 to +41
import inspect
import itertools
import random
import time
from functools import wraps
from typing import TYPE_CHECKING, ParamSpec, TypeVar
from typing import TYPE_CHECKING, Literal, ParamSpec, TypeVar

hemslo commented Jun 10, 2026

Contributor


Broader question: have you considered adopting tenacity instead of extending this custom decorator? tenacity is the de facto standard (~30-40M monthly downloads), already a transitive dependency via stamina in qontract-reconcile's virtualenv, and provides everything this PR adds out of the box — exponential backoff, full/equal/decorrelated jitter, composable wait strategies, attempt-aware callbacks via RetryCallState, async support, and more.

Maintaining a custom retry decorator means reimplementing (and testing) features that tenacity has battle-tested for years. The current @retry() call sites in qontract-reconcile could migrate incrementally — the decorator API is similar enough that most changes would be mechanical.

Not necessarily blocking for this PR, but worth considering whether extending sretoolbox's retry is the right long-term investment vs. adopting the ecosystem standard.
