Retry on daemon failure #11

@rvighne

User Story

As a developer, I would like to retry a different daemon if one fails so that the job can proceed uninterrupted.

Detailed Description

From the client's perspective, failures can be separated into the following categories:

  • "Successfully delivered" failure: the remote task completed, but returned a non-zero exit code. This should not result in a retry, as the entire graph has transitively failed.
  • Failure to execute the task: the daemon reports to the client that the job could not be run, for example because the Docker container could not be started. The client should retry with a different daemon.
  • Timeout: this should be interpreted as network failure, and the client should retry with a different daemon, but not be surprised if a result does come in later from the daemon it gave up on.
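
The three categories above could be modeled roughly as follows. This is only a sketch: `Outcome`, `should_retry`, and `AttemptLog` are hypothetical names that do not come from the project, and the attempt IDs are assumed to be opaque strings chosen by the client.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum, auto


class Outcome(Enum):
    """Hypothetical names for the three failure categories, plus success."""
    SUCCESS = auto()
    TASK_FAILED = auto()       # task completed but returned a non-zero exit code
    EXECUTION_FAILED = auto()  # daemon could not run the task at all
    TIMEOUT = auto()           # no reply in time; treated as a network failure


def should_retry(outcome: Outcome) -> bool:
    """Only failures to start or reach the task justify trying another daemon."""
    return outcome in (Outcome.EXECUTION_FAILED, Outcome.TIMEOUT)


@dataclass
class AttemptLog:
    """Remembers attempts the client gave up on, so late replies are expected."""
    abandoned: set[str] = field(default_factory=set)

    def give_up(self, attempt_id: str) -> None:
        self.abandoned.add(attempt_id)

    def is_late_reply(self, attempt_id: str) -> bool:
        # A reply for an abandoned attempt is not an error; what the caller
        # does with it (ignore or use) is a policy decision left open here.
        return attempt_id in self.abandoned
```

Keeping the retry decision in one small function means the "successfully delivered" failure never reaches the retry path by accident.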

Tasks

  • Detect the first failure above.
  • Detect the second failure above.
  • Detect the third failure above, keeping track of the number of retries.
  • Write unit tests.

Acceptance Criteria

  • Given the first failure happens, then the client emits an error, which is propagated to the user.
  • Given the second failure happens, then the user sees only a diagnostic message and the client tries the next daemon.
  • Given the third failure happens, then the user sees a diagnostic message and the client tries the next daemon.
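
One way the criteria above might translate into a retry loop is sketched below. All names (`run_with_retry`, `submit`, `TaskFailedError`, `NoDaemonAvailableError`) are hypothetical, daemons are assumed to be identified by strings, and raising an error once every daemon has been tried is an assumption, since the issue does not say what should happen in that case.

```python
import logging
from enum import Enum, auto
from typing import Callable, Sequence, Tuple

log = logging.getLogger("client")


class Outcome(Enum):
    SUCCESS = auto()
    TASK_FAILED = auto()       # first failure: non-zero exit code
    EXECUTION_FAILED = auto()  # second failure: daemon could not run the task
    TIMEOUT = auto()           # third failure: assumed network failure


class TaskFailedError(Exception):
    """The remote task itself failed; retrying would not help."""


class NoDaemonAvailableError(Exception):
    """Every daemon was tried without a delivered result (assumed behavior)."""


def run_with_retry(
    daemons: Sequence[str],
    submit: Callable[[str], Outcome],
) -> Tuple[str, int]:
    """Try each daemon in order; return (daemon, retries) on success."""
    retries = 0
    for daemon in daemons:
        outcome = submit(daemon)
        if outcome is Outcome.SUCCESS:
            return daemon, retries
        if outcome is Outcome.TASK_FAILED:
            # Propagated to the user as an error, with no retry.
            raise TaskFailedError(f"task failed on {daemon}")
        # Second and third failures: diagnostic only, then move on.
        log.warning("%s on %s; moving on to another daemon", outcome.name, daemon)
        retries += 1
    raise NoDaemonAvailableError(f"all {len(daemons)} daemons failed")
```

For example, if `submit` reports a timeout for the first daemon, an execution failure for the second, and success for the third, the call returns the third daemon with a retry count of 2.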

Estimated points

3

Metadata

Labels

story: New feature or enhancement
