Add ordered fallback provisioning for GCE capacity errors by suchitrak · Pull Request #554 · jenkinsci/google-compute-engine-plugin

suchitrak · 2026-06-11T16:26:36Z

Summary

When provisioning a GCE agent fails with a retryable capacity error (e.g. ZONE_RESOURCE_POOL_EXHAUSTED), the plugin now iterates through an ordered list of fallback candidates defined per InstanceConfiguration.

- Each fallback candidate can override zone, machine type, and/or instance template
- Non-retryable errors (quota, permission, config) fail immediately without attempting fallback
- Failed instances are cleaned up before trying the next candidate
- Detailed provisioning logging shows candidate progression (N/M, zone, template, outcome)
- Fallback list capped at 10 entries to bound provisioner thread time
- Unknown GCP error codes are treated as non-retryable (conservative policy)

New classes

FallbackCandidate is a data class for a single fallback option (zone, machineType, region, subnetwork, template) with:
- Input validation via doCheckZone/doCheckMachineType/doCheckRegion in DescriptorImpl
- Region auto-derivation from zone name (e.g. us-west1-a → us-west1)
- MAX_FALLBACK_CANDIDATES = 10 cap
- Full CasC/UI support with help files

ProvisioningErrorClassifier classifies GCP operation errors as retryable vs non-retryable:
- Documents GCP error code reference URLs for maintainability
- Covers ZONE_RESOURCE_POOL_EXHAUSTED, STOCKOUT, RESOURCE_NOT_READY
- Unknown codes → non-retryable (explicitly documented policy)

Changes to existing code

- InstanceConfiguration.provision(): refactored to loop through primary + fallback candidates, retrying on classified-retryable errors
- config.jelly: UI section for configuring fallback candidates per instance configuration

Commit structure

- FallbackCandidate data class with validation and CasC/UI support
- ProvisioningErrorClassifier with documented error policies
- InstanceConfiguration fallback provisioning logic
- Unit tests for all new functionality

Test plan

- Unit tests: InstanceConfigurationFallbackTest (fallback logic, ordering, non-retryable short-circuit, all-exhausted, OperationException)
- Unit tests: ProvisioningErrorClassifierTest (retryable codes, unknown-code policy, quota exclusion, case insensitivity)
- CasC round-trip test in ConfigAsCodeTest
- E2E validated on a live Jenkins HA controller with a 3-zone fallback chain (us-west1-b → us-west1-c → us-west1-a) under real GCP capacity pressure, confirmed via GCE metadata server and GCP audit logs
- CI on this PR

Motivation

GCE zones frequently experience ZONE_RESOURCE_POOL_EXHAUSTED for specific machine types. Without fallback, builds queue indefinitely. This feature lets administrators define ordered alternatives so provisioning can succeed in a different zone or with a different template when the primary is stocked out, similar to how the AWS EC2 plugin handles multiple AZs.

Design decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Unknown error codes | Non-retryable | Conservative: avoids masking real bugs behind retries |
| Fallback cap | 10 candidates max | Bounds worst-case provisioner thread hold time |
| Region handling | Auto-derived from zone when blank | GCE zone names encode region; explicit override available for edge cases |
| Machine type without template | Required | Without a template, machine type must be specified for the GCE insert |
| Machine type with template | Optional | Template provides it |

Defines the data structure for a single fallback candidate (zone, machineType, region, subnetwork, template) with:
- Input validation via doCheck methods in DescriptorImpl
- Region auto-derivation from zone name
- MAX_FALLBACK_CANDIDATES cap (10) to bound provisioner thread time
- Help files for the Jenkins UI

Co-authored-by: Cursor <cursoragent@cursor.com>

Classifies GCE operation errors into capacity-related (retryable via fallback) and non-retryable (abort immediately) buckets.
- Conservative unknown-error policy: unrecognized codes are non-retryable
- Covers ZONE_RESOURCE_POOL_EXHAUSTED, STOCKOUT, RESOURCE_NOT_READY
- Explicitly excludes QUOTA errors from retry
- Documents GCP error code reference URLs for maintainability

Co-authored-by: Cursor <cursoragent@cursor.com>

Refactors provision() to iterate through primary + fallback candidates:
- Waits for GCE operation completion when fallback is configured
- Retries next candidate on retryable capacity errors
- Aborts immediately on non-retryable errors (quota, permission, config)
- Best-effort cleanup of failed VMs before trying next candidate
- Caps fallback list at MAX_FALLBACK_CANDIDATES; skips blank-zone entries
- Re-zones disk type self-links for cross-zone fallback
- Null-safe shortName() helper for logging template-based configs
- UI section for configuring fallback candidates

Co-authored-by: Cursor <cursoragent@cursor.com>
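For context on the "re-zones disk type self-links" item in this commit: a disk type self-link embeds its zone in the URL path, so a link built for the primary zone cannot be reused as-is when the insert is retried elsewhere. The sketch below illustrates one way such a rewrite could work. The class and method names are hypothetical and the regex-based approach is an assumption, not the PR's actual code.

```java
import java.util.regex.Matcher;

// Hypothetical helper, for illustration only.
final class DiskTypeRezoneSketch {

    /**
     * Replaces the first "/zones/<zone>/" path segment of a self-link with the
     * target zone. Returns the input unchanged when either argument is blank
     * or when the link has no zone segment.
     */
    static String rezone(String selfLink, String targetZone) {
        if (selfLink == null || targetZone == null || targetZone.trim().isEmpty()) {
            return selfLink;
        }
        // quoteReplacement guards against '$' or '\' in the zone string.
        String replacement = "/zones/" + Matcher.quoteReplacement(targetZone.trim()) + "/";
        return selfLink.replaceFirst("/zones/[^/]+/", replacement);
    }

    public static void main(String[] args) {
        String link = "projects/my-project/zones/us-west1-b/diskTypes/pd-ssd";
        System.out.println(rezone(link, "us-west1-c"));
    }
}
```

The project name in the example is a placeholder; only the `/zones/<zone>/` segment is touched, so regional or global links pass through unchanged.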

- InstanceConfigurationFallbackTest: fallback ordering, non-retryable abort, all-exhausted, OperationException handling, legacy no-fallback
- ProvisioningErrorClassifierTest: retryable codes, quota exclusion, unknown-code policy, case insensitivity, null safety
- ConfigAsCodeTest: CasC round-trip for fallbackCandidates field

Co-authored-by: Cursor <cursoragent@cursor.com>

suchitrak

Author's walkthrough for reviewers

I've added inline comments on every meaningful change to make this easy to review. Suggested reading order:

- FallbackCandidate.java (new): the per-candidate data model (zone/machineType/region/subnetwork/template), validation, and the 10-candidate cap.
- ProvisioningErrorClassifier.java (new): the policy that decides which errors trigger fallback (capacity = retry, everything else = abort; unknown = abort).
- InstanceConfiguration.java (only existing prod file changed): the provision() refactor that loops through [primary, ...fallbacks]. Start at provision(), then buildProvisioningAttempts(), then provisionAttempt().
- Tests: InstanceConfigurationFallbackTest, ProvisioningErrorClassifierTest, ConfigAsCodeTest.

Two things to keep in mind while reviewing

Backward compatibility: fallbackCandidates is @Nullable and defaults to null. With no candidates configured, provision() runs the original single-attempt fast path unchanged, so existing users are unaffected and no migration is needed.
The one design tradeoff (flagged inline at the operation wait): when fallback is configured, the GCE operation wait moves from the launcher's Future into provision(), so it briefly blocks the provisioner thread. Bounded by the 10-candidate cap × per-attempt launchTimeout. Happy to move it into the PlannedNode future instead if preferred.

suchitrak · 2026-06-12T00:09:44Z

+     * @see FallbackCandidate
+     */
+    @Nullable
+    private List<FallbackCandidate> fallbackCandidates;


New field — the backbone of backward compatibility. @Nullable and defaults to null, so any existing config.xml/CasC saved before this feature deserializes unchanged. When null/empty, provisioning behaves exactly as it did before this PR.

suchitrak · 2026-06-12T00:09:44Z


    public ComputeEngineInstance provision() throws IOException {
+        List<ProvisioningAttempt> attempts = buildProvisioningAttempts();
+        boolean fallbackEnabled = attempts.size() > 1;


Provisioning entry point. buildProvisioningAttempts() returns an ordered list [primary, fallback1, fallback2, ...]. fallbackEnabled is true only when at least one fallback exists. The loop below tries each attempt in order: on a retryable (capacity) failure it logs and advances to the next candidate; on success it returns the node; if every candidate fails it throws IOException wrapping the last error.

Key: when fallbackEnabled == false (no fallbacks configured) the loop runs once and provisionAttempt(..., false) takes the original fast path — zero behavior change for existing users.
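A minimal sketch of the loop described in this comment, for readers who want the control flow without the full diff. It is not the PR's code: `AttemptRunner` is a hypothetical stand-in for `provisionAttempt(...)`, attempts are generic, and logging goes to stderr instead of the plugin's logger. `RetryableProvisioningException` is the name used elsewhere in this review, but its shape here is assumed.

```java
import java.io.IOException;
import java.util.List;

final class FallbackLoopSketch {

    /** Signals a capacity-type failure: the caller should try the next candidate. */
    static final class RetryableProvisioningException extends Exception {
        RetryableProvisioningException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for provisionAttempt(attempt, fallbackEnabled). */
    interface AttemptRunner<A, R> {
        R run(A attempt, boolean fallbackEnabled) throws IOException, RetryableProvisioningException;
    }

    static <A, R> R provision(List<A> attempts, AttemptRunner<A, R> runner) throws IOException {
        // True only when at least one fallback follows the primary.
        boolean fallbackEnabled = attempts.size() > 1;
        RetryableProvisioningException last = null;
        for (int i = 0; i < attempts.size(); i++) {
            try {
                // Success returns immediately; a non-retryable IOException
                // propagates from here and aborts the whole provision.
                return runner.run(attempts.get(i), fallbackEnabled);
            } catch (RetryableProvisioningException e) {
                last = e;
                System.err.printf("Candidate %d/%d failed (%s), trying next%n",
                        i + 1, attempts.size(), e.getMessage());
            }
        }
        // Every candidate failed with a retryable error: wrap the last one.
        throw new IOException("All " + attempts.size() + " provisioning candidates failed", last);
    }
}
```

With a single attempt the loop body runs once with `fallbackEnabled == false`, which mirrors the "zero behavior change" path described above.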

suchitrak · 2026-06-12T00:09:44Z

+     * configuration). Candidates with a blank zone are skipped with a warning. The list is capped
+     * at {@link FallbackCandidate#MAX_FALLBACK_CANDIDATES} entries to bound provisioner thread time.
+     */
+    private List<ProvisioningAttempt> buildProvisioningAttempts() {


Builds the ordered attempt list. Primary first (this config's zone/machineType/template), then each FallbackCandidate with any blank field inherited from the primary via firstNonEmpty(...). Two safety rails: candidates with a blank zone are skipped with a warning, and the list is capped at MAX_FALLBACK_CANDIDATES (10) to bound how long the provisioner thread can be held.
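The list-building rules above can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: `Attempt` is a simplified hypothetical type carrying only zone, machine type, and template; the cap is assumed to count accepted fallbacks (not the primary, and not skipped entries); and the two-argument `firstNonEmpty` is a guess at the helper's shape.

```java
import java.util.ArrayList;
import java.util.List;

final class AttemptListSketch {
    static final int MAX_FALLBACK_CANDIDATES = 10;

    /** Simplified stand-in for both the primary settings and a fallback candidate. */
    static final class Attempt {
        final String zone;
        final String machineType;
        final String template;

        Attempt(String zone, String machineType, String template) {
            this.zone = zone;
            this.machineType = machineType;
            this.template = template;
        }
    }

    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    /** Returns the candidate's value when set, otherwise the primary's value. */
    static String firstNonEmpty(String preferred, String inherited) {
        return isBlank(preferred) ? inherited : preferred;
    }

    static List<Attempt> build(Attempt primary, List<Attempt> candidates) {
        List<Attempt> attempts = new ArrayList<>();
        attempts.add(primary); // the primary is always tried first
        if (candidates == null) {
            return attempts; // no fallbacks configured: single-attempt list
        }
        int accepted = 0;
        for (Attempt c : candidates) {
            if (accepted == MAX_FALLBACK_CANDIDATES) {
                System.err.println("Ignoring fallback candidates beyond the cap of "
                        + MAX_FALLBACK_CANDIDATES);
                break;
            }
            if (isBlank(c.zone)) {
                System.err.println("Skipping fallback candidate with blank zone");
                continue;
            }
            attempts.add(new Attempt(
                    c.zone.trim(),
                    firstNonEmpty(c.machineType, primary.machineType),
                    firstNonEmpty(c.template, primary.template)));
            accepted++;
        }
        return attempts;
    }
}
```

A candidate that sets only a zone therefore reuses the primary's machine type and template, which is the common "same VM, different zone" case.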

suchitrak · 2026-06-12T00:09:44Z

+     *     and the caller should try the next fallback candidate.
+     * @throws IOException for non-retryable failures (the whole provision should abort).
+     */
+    private ComputeEngineInstance provisionAttempt(ProvisioningAttempt attempt, boolean fallbackEnabled)


Provisions a single attempt. The fallbackEnabled flag switches behavior:

false → original fast path: submit the insert and return immediately; the launcher waits on the operation (as it always has).

true → wait inline for the operation so async capacity errors surface here and can drive the fallback decision.

This dual-mode is intentional so non-fallback users get the exact pre-PR code path.

suchitrak · 2026-06-12T00:09:44Z

+        if (fallbackEnabled) {
+            Operation.Error opError = null;
+            try {
+                Operation completed = cloud.getClient()


⭐ This is the heart of the feature. GCE's insertInstance returns immediately with a pending Operation — capacity errors like ZONE_RESOURCE_POOL_EXHAUSTED only appear when the zone Operation reaches DONE. So when fallback is enabled we block here on waitForOperationCompletion, then classify:

retryable (capacity) → RetryableProvisioningException → caller tries the next candidate

non-retryable (quota / permission / bad config) → IOException → abort, no fallback

Reviewer note (the main design tradeoff): this wait runs on the provision() thread (the NodeProvisioner/spare-checker thread). For non-fallback configs it is skipped entirely and the launcher waits as before. The launcher later waits on the same op again — a harmless no-op since it's already DONE.
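To make the classify step concrete, here is a small sketch of turning a completed operation's errors into one of the two outcomes listed above. It deliberately avoids the GCE client types: error codes are plain strings and the classifier is passed in as a `Predicate`. How the PR treats an operation carrying several errors is not stated in this thread, so the rule used here (any non-retryable code aborts) is an assumption chosen to match the conservative policy.

```java
import java.io.IOException;
import java.util.List;
import java.util.function.Predicate;

final class OperationOutcomeSketch {

    /** Signals a capacity-type failure: the caller should try the next candidate. */
    static final class RetryableProvisioningException extends Exception {
        RetryableProvisioningException(String message) {
            super(message);
        }
    }

    /**
     * Returns normally when the operation finished without errors.
     * Throws IOException if any error code is non-retryable (assumed rule),
     * otherwise throws RetryableProvisioningException.
     */
    static void check(List<String> errorCodes, Predicate<String> isRetryable)
            throws IOException, RetryableProvisioningException {
        if (errorCodes == null || errorCodes.isEmpty()) {
            return; // operation reached DONE with no errors
        }
        for (String code : errorCodes) {
            if (!isRetryable.test(code)) {
                throw new IOException("Non-retryable provisioning error: " + code);
            }
        }
        throw new RetryableProvisioningException(
                "Capacity error(s): " + String.join(", ", errorCodes));
    }
}
```

Because both exceptions are checked, the caller is forced to handle the "try next candidate" and "abort" cases separately.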

suchitrak · 2026-06-12T00:09:44Z

+     * Upper bound on fallback candidates per configuration. Prevents unbounded retry chains that
+     * could hold a provisioner thread for too long on a shared controller.
+     */
+    public static final int MAX_FALLBACK_CANDIDATES = 10;


Hard cap on candidates per config. Bounds the worst-case provisioner-thread hold time (≤ 10 × launchTimeout).

suchitrak · 2026-06-12T00:09:44Z

+     *
+     * @return the derived region, or empty string if the zone is blank or has no recognizable suffix.
+     */
+    public String getEffectiveRegion() {


Region is optional in the UI. When blank it's derived from the zone name (us-west1-a → us-west1), since GCE zone names encode their region. An explicit override is still available for edge cases.
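As an illustration of the derivation, the sketch below strips the trailing zone letter from a zone name. The regular expression encodes an assumed naming shape (`<area>-<location><number>-<letter>`, as in us-west1-a); the PR's actual parsing is not shown in this hunk and may differ, for example in how it treats full zone URLs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class RegionFromZoneSketch {

    // Assumed zone shape: e.g. "us-west1-a" or "europe-west4-b".
    private static final Pattern ZONE = Pattern.compile("^([a-z]+-[a-z]+[0-9]+)-[a-z]$");

    /** Returns the region encoded in the zone name, or "" when none is recognizable. */
    static String deriveRegion(String zone) {
        if (zone == null) {
            return "";
        }
        Matcher m = ZONE.matcher(zone.trim());
        return m.matches() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        System.out.println(deriveRegion("us-west1-a")); // prints us-west1
    }
}
```

Returning an empty string for unrecognized input matches the Javadoc contract quoted above and leaves room for the explicit region override.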

suchitrak · 2026-06-12T00:09:44Z

+ * @see <a href="https://cloud.google.com/compute/docs/troubleshooting/troubleshooting-vm-creation">
+ *     GCP: Troubleshooting VM creation</a>
+ */
+public final class ProvisioningErrorClassifier {


Single source of truth for the fallback decision. Capacity-type errors are classified retryable (worth trying another zone); everything else (quota, permission, bad config) is non-retryable.

suchitrak · 2026-06-12T00:09:44Z

+     * <p>Maintenance: if GCP introduces additional capacity-related codes, add them here after
+     * confirming in the GCP documentation that retrying in another zone is appropriate.
+     */
+    private static final String[] RETRYABLE_MARKERS = {


The retryable allow-list: ZONE_RESOURCE_POOL_EXHAUSTED, STOCKOUT, RESOURCE_NOT_READY. To support a new capacity code, add it here after confirming against the GCP docs linked in the class Javadoc.

suchitrak · 2026-06-12T00:09:44Z

+     *     should retry in another zone/machine type; {@code false} for {@code null}, unknown, or
+     *     clearly non-capacity errors (including quota failures).
+     */
+    public static boolean isRetryable(String codeOrMessage) {


Conservative unknown-code policy. An unrecognized/unlisted code returns false (non-retryable). This avoids masking genuine bugs behind repeated zone-churning retries: a misconfiguration fails fast instead of silently cycling through candidates.
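A compact sketch of an allow-list classifier with this policy is shown below. The three markers and the `isRetryable(String)` signature come from this PR; the matching strategy (case-insensitive substring match) and the explicit "QUOTA" guard are assumptions about how case insensitivity and quota exclusion might be achieved, not a copy of the real class.

```java
import java.util.Locale;

final class ErrorClassifierSketch {

    // Allow-list of capacity-type markers named in the PR.
    private static final String[] RETRYABLE_MARKERS = {
        "ZONE_RESOURCE_POOL_EXHAUSTED", "STOCKOUT", "RESOURCE_NOT_READY"
    };

    static boolean isRetryable(String codeOrMessage) {
        if (codeOrMessage == null) {
            return false; // null-safe: nothing to classify
        }
        String upper = codeOrMessage.toUpperCase(Locale.ROOT);
        // Assumed guard: quota failures never trigger fallback, even if the
        // message happens to mention a capacity marker as well.
        if (upper.contains("QUOTA")) {
            return false;
        }
        for (String marker : RETRYABLE_MARKERS) {
            if (upper.contains(marker)) {
                return true;
            }
        }
        return false; // unknown codes are non-retryable by policy
    }
}
```

Defaulting to `false` means a newly introduced GCP code has to be added to the allow-list deliberately before it can drive a fallback.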

suchitrak requested a review from a team as a code owner June 11, 2026 16:26

suchitrak force-pushed the gce-agent-fallback-4.683 branch from 2200676 to e94028a Compare June 11, 2026 17:08

skunnath and others added 4 commits June 11, 2026 10:09

suchitrak force-pushed the gce-agent-fallback-4.683 branch from e94028a to b3e7881 Compare June 11, 2026 17:11

suchitrak wants to merge 4 commits into jenkinsci:develop from suchitrak:gce-agent-fallback-4.683
