Add ordered fallback provisioning for GCE capacity errors #554

Open: suchitrak wants to merge 4 commits into jenkinsci:develop from suchitrak:gce-agent-fallback-4.683

suchitrak commented Jun 11, 2026

Summary

When provisioning a GCE agent fails with a retryable capacity error (e.g. ZONE_RESOURCE_POOL_EXHAUSTED), the plugin now iterates through an ordered list of fallback candidates defined per InstanceConfiguration.

  • Each fallback candidate can override zone, machine type, and/or instance template
  • Non-retryable errors (quota, permission, config) fail immediately without attempting fallback
  • Failed instances are cleaned up before trying the next candidate
  • Detailed provisioning logging shows candidate progression (N/M, zone, template, outcome)
  • Fallback list capped at 10 entries to bound provisioner thread time
  • Unknown GCP error codes are treated as non-retryable (conservative policy)

New classes

  • FallbackCandidate — data class for a single fallback option (zone, machineType, region, subnetwork, template) with:
    • Input validation via doCheckZone/doCheckMachineType/doCheckRegion in DescriptorImpl
    • Region auto-derivation from zone name (e.g. us-west1-a → us-west1)
    • MAX_FALLBACK_CANDIDATES = 10 cap
    • Full CasC/UI support with help files
  • ProvisioningErrorClassifier — classifies GCP operation errors as retryable vs non-retryable
    • Documents GCP error code reference URLs for maintainability
    • Covers ZONE_RESOURCE_POOL_EXHAUSTED, STOCKOUT, RESOURCE_NOT_READY
    • Unknown codes → non-retryable (explicitly documented policy)

Changes to existing code

  • InstanceConfiguration.provision() — refactored to loop through primary + fallback candidates, retrying on classified-retryable errors
  • config.jelly — UI section for configuring fallback candidates per instance configuration

Commit structure

  1. FallbackCandidate data class with validation and CasC/UI support
  2. ProvisioningErrorClassifier with documented error policies
  3. InstanceConfiguration fallback provisioning logic
  4. Unit tests for all new functionality

Test plan

  • Unit tests: InstanceConfigurationFallbackTest (fallback logic, ordering, non-retryable short-circuit, all-exhausted, OperationException)
  • Unit tests: ProvisioningErrorClassifierTest (retryable codes, unknown-code policy, quota exclusion, case insensitivity)
  • CasC round-trip test in ConfigAsCodeTest
  • E2E validated on a live Jenkins HA controller with a 3-zone fallback chain (us-west1-b → us-west1-c → us-west1-a) under real GCP capacity pressure, confirmed via GCE metadata server and GCP audit logs
  • CI on this PR

Motivation

GCE zones frequently experience ZONE_RESOURCE_POOL_EXHAUSTED for specific machine types. Without fallback, builds queue indefinitely. This feature allows administrators to define ordered alternatives so provisioning can succeed in a different zone/template when the primary is stocked out — similar to how AWS EC2 plugin handles multiple AZs.

Design decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Unknown error codes | Non-retryable | Conservative: avoids masking real bugs behind retries |
| Fallback cap | 10 candidates max | Bounds worst-case provisioner thread hold time |
| Region handling | Auto-derived from zone when blank | GCE zone names encode region; explicit override available for edge cases |
| Machine type w/o template | Required | Without a template, machine type must be specified for the GCE insert |
| Machine type w/ template | Optional | Template provides it |

suchitrak requested a review from a team as a code owner June 11, 2026 16:26
suchitrak force-pushed the gce-agent-fallback-4.683 branch from 2200676 to e94028a June 11, 2026 17:08
skunnath and others added 4 commits June 11, 2026 10:09
Defines the data structure for a single fallback candidate (zone,
machineType, region, subnetwork, template) with:
- Input validation via doCheck methods in DescriptorImpl
- Region auto-derivation from zone name
- MAX_FALLBACK_CANDIDATES cap (10) to bound provisioner thread time
- Help files for the Jenkins UI

Co-authored-by: Cursor <cursoragent@cursor.com>
Classifies GCE operation errors into capacity-related (retryable via
fallback) and non-retryable (abort immediately) buckets.

- Conservative unknown-error policy: unrecognized codes are non-retryable
- Covers ZONE_RESOURCE_POOL_EXHAUSTED, STOCKOUT, RESOURCE_NOT_READY
- Explicitly excludes QUOTA errors from retry
- Documents GCP error code reference URLs for maintainability

Co-authored-by: Cursor <cursoragent@cursor.com>
Refactors provision() to iterate through primary + fallback candidates:
- Waits for GCE operation completion when fallback is configured
- Retries next candidate on retryable capacity errors
- Aborts immediately on non-retryable errors (quota, permission, config)
- Best-effort cleanup of failed VMs before trying next candidate
- Caps fallback list at MAX_FALLBACK_CANDIDATES; skips blank-zone entries
- Re-zones disk type self-links for cross-zone fallback
- Null-safe shortName() helper for logging template-based configs
- UI section for configuring fallback candidates

Co-authored-by: Cursor <cursoragent@cursor.com>
- InstanceConfigurationFallbackTest: fallback ordering, non-retryable
  abort, all-exhausted, OperationException handling, legacy no-fallback
- ProvisioningErrorClassifierTest: retryable codes, quota exclusion,
  unknown-code policy, case insensitivity, null safety
- ConfigAsCodeTest: CasC round-trip for fallbackCandidates field

Co-authored-by: Cursor <cursoragent@cursor.com>
suchitrak force-pushed the gce-agent-fallback-4.683 branch from e94028a to b3e7881 June 11, 2026 17:11

suchitrak left a comment

Author's walkthrough for reviewers

I've added inline comments on every meaningful change to make this easy to review. Suggested reading order:

  1. FallbackCandidate.java (new) — the per-candidate data model (zone/machineType/region/subnetwork/template), validation, and the 10-candidate cap.
  2. ProvisioningErrorClassifier.java (new) — the policy that decides which errors trigger fallback (capacity = retry, everything else = abort; unknown = abort).
  3. InstanceConfiguration.java (only existing prod file changed) — the provision() refactor that loops through [primary, ...fallbacks]. Start at provision(), then buildProvisioningAttempts(), then provisionAttempt().
  4. Tests: InstanceConfigurationFallbackTest, ProvisioningErrorClassifierTest, ConfigAsCodeTest.

Two things to keep in mind while reviewing

  • Backward compatibility: fallbackCandidates is @Nullable and defaults to null. With no candidates configured, provision() runs the original single-attempt fast path unchanged, so no migration is needed and existing users are not affected.
  • The one design tradeoff (flagged inline at the operation wait): when fallback is configured, the GCE operation wait moves from the launcher's Future into provision(), so it briefly blocks the provisioner thread. Bounded by the 10-candidate cap × per-attempt launchTimeout. Happy to move it into the PlannedNode future instead if preferred.

* @see FallbackCandidate
*/
@Nullable
private List<FallbackCandidate> fallbackCandidates;

New field — the backbone of backward compatibility. @Nullable and defaults to null, so any existing config.xml/CasC saved before this feature deserializes unchanged. When null/empty, provisioning behaves exactly as it did before this PR.


public ComputeEngineInstance provision() throws IOException {
    List<ProvisioningAttempt> attempts = buildProvisioningAttempts();
    boolean fallbackEnabled = attempts.size() > 1;

Provisioning entry point. buildProvisioningAttempts() returns an ordered list [primary, fallback1, fallback2, ...]. fallbackEnabled is true only when at least one fallback exists. The loop below tries each attempt in order: on a retryable (capacity) failure it logs and advances to the next candidate; on success it returns the node; if every candidate fails it throws IOException wrapping the last error.

Key: when fallbackEnabled == false (no fallbacks configured) the loop runs once and provisionAttempt(..., false) takes the original fast path — zero behavior change for existing users.
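To make the loop semantics concrete, here is a minimal sketch of the ordering rules described above, not the code in this PR. The `Attempt` interface and `provisionWithFallback` name are hypothetical stand-ins; only `RetryableProvisioningException` is a name taken from the PR, and its constructor here is assumed.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical sketch of the candidate loop; not the PR's InstanceConfiguration code. */
final class FallbackLoopSketch {
    private FallbackLoopSketch() {}

    /** Signals a capacity failure, so the next candidate should be tried. */
    static final class RetryableProvisioningException extends Exception {
        RetryableProvisioningException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for one provisioning attempt (primary or fallback). */
    interface Attempt<T> {
        T run() throws IOException, RetryableProvisioningException;
    }

    /**
     * Tries each attempt in order. The first success is returned, a retryable
     * failure advances to the next candidate, and an IOException thrown by an
     * attempt (non-retryable) propagates immediately without further attempts.
     */
    static <T> T provisionWithFallback(List<Attempt<T>> attempts) throws IOException {
        RetryableProvisioningException last = null;
        for (Attempt<T> attempt : attempts) {
            try {
                return attempt.run();
            } catch (RetryableProvisioningException e) {
                last = e; // capacity error: remember it and move on
            }
        }
        // Every candidate failed with a retryable error: wrap the last one.
        throw new IOException(
                "All " + attempts.size() + " provisioning candidates failed", last);
    }
}
```

With a single-element list the loop degenerates to one attempt, which mirrors the no-fallback case.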

* configuration). Candidates with a blank zone are skipped with a warning. The list is capped
* at {@link FallbackCandidate#MAX_FALLBACK_CANDIDATES} entries to bound provisioner thread time.
*/
private List<ProvisioningAttempt> buildProvisioningAttempts() {

Builds the ordered attempt list. Primary first (this config's zone/machineType/template), then each FallbackCandidate with any blank field inherited from the primary via firstNonEmpty(...). Two safety rails: candidates with a blank zone are skipped with a warning, and the list is capped at MAX_FALLBACK_CANDIDATES (10) to bound how long the provisioner thread can be held.
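A rough sketch of how such a list could be assembled, for reviewers who want the rules in one place. The `Target` class and `buildAttempts` name are hypothetical, and `firstNonEmpty` is shown with an assumed two-argument shape. Two further assumptions: the cap counts accepted fallbacks only (not the primary), and skipped blank-zone entries do not count toward it. The real method may differ on both points.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of attempt-list construction; not the PR's actual method. */
final class AttemptListSketch {
    static final int MAX_FALLBACK_CANDIDATES = 10;

    private AttemptListSketch() {}

    /** Hypothetical stand-in for the primary config or one fallback candidate. */
    static final class Target {
        final String zone;
        final String machineType;
        final String template;

        Target(String zone, String machineType, String template) {
            this.zone = zone;
            this.machineType = machineType;
            this.template = template;
        }
    }

    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    /** Returns the candidate's value unless it is blank, else the primary's value. */
    static String firstNonEmpty(String preferred, String inherited) {
        return isBlank(preferred) ? inherited : preferred;
    }

    static List<Target> buildAttempts(Target primary, List<Target> fallbacks) {
        List<Target> attempts = new ArrayList<>();
        attempts.add(primary); // the primary is always tried first
        if (fallbacks == null) {
            return attempts; // legacy configs: single attempt
        }
        int accepted = 0;
        for (Target candidate : fallbacks) {
            if (accepted == MAX_FALLBACK_CANDIDATES) {
                break; // cap reached: ignore the rest
            }
            if (isBlank(candidate.zone)) {
                continue; // the PR logs a warning here and skips the entry
            }
            attempts.add(new Target(
                    candidate.zone,
                    firstNonEmpty(candidate.machineType, primary.machineType),
                    firstNonEmpty(candidate.template, primary.template)));
            accepted++;
        }
        return attempts;
    }
}
```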

* and the caller should try the next fallback candidate.
* @throws IOException for non-retryable failures (the whole provision should abort).
*/
private ComputeEngineInstance provisionAttempt(ProvisioningAttempt attempt, boolean fallbackEnabled)

Provisions a single attempt. The fallbackEnabled flag switches behavior:

  • false → original fast path: submit the insert and return immediately; the launcher waits on the operation (as it always has).
  • true → wait inline for the operation so async capacity errors surface here and can drive the fallback decision.

This dual-mode is intentional so non-fallback users get the exact pre-PR code path.

if (fallbackEnabled) {
    Operation.Error opError = null;
    try {
        Operation completed = cloud.getClient()

This is the heart of the feature. GCE's insertInstance returns immediately with a pending Operation — capacity errors like ZONE_RESOURCE_POOL_EXHAUSTED only appear when the zone Operation reaches DONE. So when fallback is enabled we block here on waitForOperationCompletion, then classify:

  • retryable (capacity) → RetryableProvisioningException → caller tries the next candidate
  • non-retryable (quota / permission / bad config) → IOException → abort, no fallback

Reviewer note (the main design tradeoff): this wait runs on the provision() thread (the NodeProvisioner/spare-checker thread). For non-fallback configs it is skipped entirely and the launcher waits as before. The launcher later waits on the same op again — a harmless no-op since it's already DONE.
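The classification step after the wait can be pictured roughly as follows. This is an illustrative sketch only: the error is reduced to a plain code string (a `null` code models an operation that reached DONE without error), the classifier is passed in as a predicate, and `checkCompleted` is a hypothetical name. The real code works on the GCE `Operation` object.

```java
import java.io.IOException;
import java.util.function.Predicate;

/** Hypothetical sketch of mapping a completed operation's error to an exception type. */
final class OperationOutcomeSketch {
    private OperationOutcomeSketch() {}

    /** Capacity failure: the caller should try the next candidate. */
    static final class RetryableProvisioningException extends Exception {
        RetryableProvisioningException(String message) {
            super(message);
        }
    }

    /**
     * @param errorCode   error code of the completed operation, or null if it succeeded
     * @param isRetryable classifier deciding whether the code is a capacity error
     */
    static void checkCompleted(String errorCode, Predicate<String> isRetryable)
            throws IOException, RetryableProvisioningException {
        if (errorCode == null) {
            return; // operation finished cleanly
        }
        if (isRetryable.test(errorCode)) {
            throw new RetryableProvisioningException(errorCode);
        }
        // Quota, permission, bad config, or unknown: abort the whole provision.
        throw new IOException("Non-retryable provisioning error: " + errorCode);
    }
}
```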

* Upper bound on fallback candidates per configuration. Prevents unbounded retry chains that
* could hold a provisioner thread for too long on a shared controller.
*/
public static final int MAX_FALLBACK_CANDIDATES = 10;

Hard cap on candidates per config. Bounds the worst-case provisioner-thread hold time (≤ 10 × launchTimeout).
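For a sense of scale, the bound can be worked out as a simple product. The sketch below assumes each attempt blocks for at most one `launchTimeout` and uses a 5-minute timeout purely as an illustrative value (it is not a default taken from the plugin); `worstCaseHold` is a hypothetical helper. Under those assumptions, 10 attempts give 10 × 5 min = 50 minutes.

```java
import java.time.Duration;

/** Hypothetical helper illustrating the worst-case hold-time arithmetic. */
final class HoldTimeBoundSketch {
    private HoldTimeBoundSketch() {}

    /** Upper bound on blocking time: number of attempts times the per-attempt timeout. */
    static Duration worstCaseHold(int attempts, Duration launchTimeout) {
        if (attempts < 0) {
            throw new IllegalArgumentException("attempts must be >= 0");
        }
        return launchTimeout.multipliedBy(attempts);
    }
}
```

If the primary attempt is counted on top of the 10 fallbacks, the same formula applies with 11 attempts.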

*
* @return the derived region, or empty string if the zone is blank or has no recognizable suffix.
*/
public String getEffectiveRegion() {

Region is optional in the UI. When blank it's derived from the zone name (us-west1-a → us-west1), since GCE zone names encode their region. An explicit override is still available for edge cases.
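One plausible way to implement the derivation is to drop everything from the last hyphen onward. The sketch below assumes exactly that rule; what the PR treats as a "recognizable suffix" is not shown here, so the real check may be stricter. `deriveRegion` and `effectiveRegion` are hypothetical names.

```java
/** Hypothetical sketch of zone-to-region derivation; the PR's rule may be stricter. */
final class RegionSketch {
    private RegionSketch() {}

    /** Derives "us-west1" from "us-west1-a"; returns "" when nothing can be derived. */
    static String deriveRegion(String zone) {
        if (zone == null || zone.trim().isEmpty()) {
            return "";
        }
        String trimmed = zone.trim();
        int lastDash = trimmed.lastIndexOf('-');
        // Require a non-empty region before the hyphen and a non-empty suffix after it.
        if (lastDash <= 0 || lastDash == trimmed.length() - 1) {
            return "";
        }
        return trimmed.substring(0, lastDash);
    }

    /** An explicit region wins; otherwise fall back to the derived one. */
    static String effectiveRegion(String explicitRegion, String zone) {
        if (explicitRegion != null && !explicitRegion.trim().isEmpty()) {
            return explicitRegion.trim();
        }
        return deriveRegion(zone);
    }
}
```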

* @see <a href="https://cloud.google.com/compute/docs/troubleshooting/troubleshooting-vm-creation">
* GCP: Troubleshooting VM creation</a>
*/
public final class ProvisioningErrorClassifier {

Single source of truth for the fallback decision. Capacity-type errors are classified retryable (worth trying another zone); everything else (quota, permission, bad config) is non-retryable.

* <p>Maintenance: if GCP introduces additional capacity-related codes, add them here after
* confirming in the GCP documentation that retrying in another zone is appropriate.
*/
private static final String[] RETRYABLE_MARKERS = {

The retryable allow-list: ZONE_RESOURCE_POOL_EXHAUSTED, STOCKOUT, RESOURCE_NOT_READY. To support a new capacity code, add it here after confirming against the GCP docs linked in the class Javadoc.

* should retry in another zone/machine type; {@code false} for {@code null}, unknown, or
* clearly non-capacity errors (including quota failures).
*/
public static boolean isRetryable(String codeOrMessage) {

Conservative unknown-code policy. An unrecognized/unlisted code returns false (non-retryable). This avoids masking genuine bugs behind endless zone-churning retries — a misconfiguration fails fast instead of silently cycling candidates.
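As an illustration of that policy, a classifier along these lines would satisfy the behaviors listed in the test plan (retryable codes, quota exclusion, unknown-code policy, case insensitivity, null safety). It is a sketch under assumptions, not the class in this PR: the substring matching and the explicit `QUOTA` check in particular are guesses at one reasonable implementation.

```java
import java.util.Locale;

/** Hypothetical sketch; not the PR's actual ProvisioningErrorClassifier. */
final class ErrorClassifierSketch {
    // Allow-list of capacity codes named in the PR description.
    private static final String[] RETRYABLE_MARKERS = {
        "ZONE_RESOURCE_POOL_EXHAUSTED", "STOCKOUT", "RESOURCE_NOT_READY"
    };

    private ErrorClassifierSketch() {}

    static boolean isRetryable(String codeOrMessage) {
        if (codeOrMessage == null) {
            return false; // null safety: nothing to classify
        }
        String upper = codeOrMessage.toUpperCase(Locale.ROOT); // case-insensitive match
        // Assumption: quota failures are rejected before consulting the allow-list.
        if (upper.contains("QUOTA")) {
            return false;
        }
        for (String marker : RETRYABLE_MARKERS) {
            if (upper.contains(marker)) {
                return true;
            }
        }
        return false; // unknown or unlisted codes are non-retryable
    }
}
```

Keeping the allow-list in one array means supporting a new capacity code is a one-line change, while anything not listed fails fast.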
