Add ordered fallback provisioning for GCE capacity errors #554
suchitrak wants to merge 4 commits
Defines the data structure for a single fallback candidate (zone, machineType, region, subnetwork, template) with:
- Input validation via doCheck methods in DescriptorImpl
- Region auto-derivation from zone name
- MAX_FALLBACK_CANDIDATES cap (10) to bound provisioner thread time
- Help files for the Jenkins UI

Co-authored-by: Cursor <cursoragent@cursor.com>

Classifies GCE operation errors into capacity-related (retryable via fallback) and non-retryable (abort immediately) buckets.
- Conservative unknown-error policy: unrecognized codes are non-retryable
- Covers ZONE_RESOURCE_POOL_EXHAUSTED, STOCKOUT, RESOURCE_NOT_READY
- Explicitly excludes QUOTA errors from retry
- Documents GCP error code reference URLs for maintainability

Co-authored-by: Cursor <cursoragent@cursor.com>

Refactors provision() to iterate through primary + fallback candidates:
- Waits for GCE operation completion when fallback is configured
- Retries next candidate on retryable capacity errors
- Aborts immediately on non-retryable errors (quota, permission, config)
- Best-effort cleanup of failed VMs before trying next candidate
- Caps fallback list at MAX_FALLBACK_CANDIDATES; skips blank-zone entries
- Re-zones disk type self-links for cross-zone fallback
- Null-safe shortName() helper for logging template-based configs
- UI section for configuring fallback candidates

Co-authored-by: Cursor <cursoragent@cursor.com>

- InstanceConfigurationFallbackTest: fallback ordering, non-retryable abort, all-exhausted, OperationException handling, legacy no-fallback
- ProvisioningErrorClassifierTest: retryable codes, quota exclusion, unknown-code policy, case insensitivity, null safety
- ConfigAsCodeTest: CasC round-trip for fallbackCandidates field

Co-authored-by: Cursor <cursoragent@cursor.com>
Author's walkthrough for reviewers
I've added inline comments on every meaningful change to make this easy to review. Suggested reading order:
1. FallbackCandidate.java (new): the per-candidate data model (zone/machineType/region/subnetwork/template), validation, and the 10-candidate cap.
2. ProvisioningErrorClassifier.java (new): the policy that decides which errors trigger fallback (capacity = retry, everything else = abort; unknown = abort).
3. InstanceConfiguration.java (only existing prod file changed): the provision() refactor that loops through [primary, ...fallbacks]. Start at provision(), then buildProvisioningAttempts(), then provisionAttempt().
4. Tests: InstanceConfigurationFallbackTest, ProvisioningErrorClassifierTest, ConfigAsCodeTest.
Two things to keep in mind while reviewing
- Backward compatibility: fallbackCandidates is @Nullable and defaults to null. With no candidates configured, provision() runs the original single-attempt fast path unchanged, so no migration is needed and existing users are not affected.
- The one design tradeoff (flagged inline at the operation wait): when fallback is configured, the GCE operation wait moves from the launcher's Future into provision(), so it blocks the provisioner thread. The wait is bounded by the 10-candidate cap × the per-attempt launchTimeout. Happy to move it into the PlannedNode future instead if preferred.
 * @see FallbackCandidate
 */
@Nullable
private List<FallbackCandidate> fallbackCandidates;
New field — the backbone of backward compatibility. @Nullable and defaults to null, so any existing config.xml/CasC saved before this feature deserializes unchanged. When null/empty, provisioning behaves exactly as it did before this PR.
public ComputeEngineInstance provision() throws IOException {
    List<ProvisioningAttempt> attempts = buildProvisioningAttempts();
    boolean fallbackEnabled = attempts.size() > 1;
Provisioning entry point. buildProvisioningAttempts() returns an ordered list [primary, fallback1, fallback2, ...]. fallbackEnabled is true only when at least one fallback exists. The loop below tries each attempt in order: on a retryable (capacity) failure it logs and advances to the next candidate; on success it returns the node; if every candidate fails it throws IOException wrapping the last error.
Key: when fallbackEnabled == false (no fallbacks configured) the loop runs once and provisionAttempt(..., false) takes the original fast path — zero behavior change for existing users.
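The loop described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `Attempt` interface, the `String` return value standing in for `ComputeEngineInstance`, and the class name `FallbackLoopSketch` are all hypothetical; only the control flow (retryable error advances, any other `IOException` aborts, exhaustion wraps the last error) comes from the comment.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical sketch of the ordered fallback loop; names are illustrative only. */
final class FallbackLoopSketch {

    /** Signals a capacity failure: the caller may try the next candidate. */
    static final class RetryableProvisioningException extends Exception {
        RetryableProvisioningException(String message) {
            super(message);
        }
    }

    /** One provisioning attempt; the String result stands in for the created instance. */
    interface Attempt {
        String run(boolean fallbackEnabled) throws IOException, RetryableProvisioningException;
    }

    static String provision(List<Attempt> attempts) throws IOException {
        // Fallback is only active when at least one candidate follows the primary.
        boolean fallbackEnabled = attempts.size() > 1;
        Exception last = null;
        for (Attempt attempt : attempts) {
            try {
                return attempt.run(fallbackEnabled); // success: stop iterating
            } catch (RetryableProvisioningException e) {
                last = e; // capacity error: advance to the next candidate
            }
            // A plain IOException (non-retryable) is not caught here, so it aborts the loop.
        }
        throw new IOException("All provisioning candidates failed", last);
    }
}
```

With a single attempt in the list, `fallbackEnabled` is false and the loop body runs once, which mirrors the no-fallback path described in the comment.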
 * configuration). Candidates with a blank zone are skipped with a warning. The list is capped
 * at {@link FallbackCandidate#MAX_FALLBACK_CANDIDATES} entries to bound provisioner thread time.
 */
private List<ProvisioningAttempt> buildProvisioningAttempts() {
Builds the ordered attempt list. Primary first (this config's zone/machineType/template), then each FallbackCandidate with any blank field inherited from the primary via firstNonEmpty(...). Two safety rails: candidates with a blank zone are skipped with a warning, and the list is capped at MAX_FALLBACK_CANDIDATES (10) to bound how long the provisioner thread can be held.
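A minimal sketch of how such a list could be assembled is shown below. The two-field `Candidate` class is a simplified, hypothetical stand-in for the real candidate model, and the sketch assumes the cap counts accepted fallbacks after blank-zone entries are skipped; the PR may order those two checks differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: primary first, then fallbacks with inherited fields. */
final class AttemptListSketch {
    static final int MAX_FALLBACK_CANDIDATES = 10;

    /** Simplified stand-in for a candidate; the real model has more fields. */
    static final class Candidate {
        final String zone;
        final String machineType;

        Candidate(String zone, String machineType) {
            this.zone = zone;
            this.machineType = machineType;
        }
    }

    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    /** Returns the candidate's own value, or the primary's value when it is blank. */
    static String firstNonEmpty(String preferred, String inherited) {
        return isBlank(preferred) ? inherited : preferred;
    }

    static List<Candidate> build(Candidate primary, List<Candidate> fallbacks) {
        List<Candidate> attempts = new ArrayList<>();
        attempts.add(primary); // the primary configuration is always tried first
        if (fallbacks == null) {
            return attempts; // no fallbacks configured: single-attempt list
        }
        int accepted = 0;
        for (Candidate c : fallbacks) {
            if (accepted >= MAX_FALLBACK_CANDIDATES) {
                break; // cap reached: ignore the remaining entries
            }
            if (isBlank(c.zone)) {
                continue; // blank zone: skipped (the real code also logs a warning)
            }
            attempts.add(new Candidate(c.zone, firstNonEmpty(c.machineType, primary.machineType)));
            accepted++;
        }
        return attempts;
    }
}
```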
 * and the caller should try the next fallback candidate.
 * @throws IOException for non-retryable failures (the whole provision should abort).
 */
private ComputeEngineInstance provisionAttempt(ProvisioningAttempt attempt, boolean fallbackEnabled)
Provisions a single attempt. The fallbackEnabled flag switches behavior:
- false → original fast path: submit the insert and return immediately; the launcher waits on the operation (as it always has).
- true → wait inline for the operation so async capacity errors surface here and can drive the fallback decision.
This dual-mode is intentional so non-fallback users get the exact pre-PR code path.
if (fallbackEnabled) {
    Operation.Error opError = null;
    try {
        Operation completed = cloud.getClient()
⭐ This is the heart of the feature. GCE's insertInstance returns immediately with a pending Operation; capacity errors like ZONE_RESOURCE_POOL_EXHAUSTED only appear when the zone Operation reaches DONE. So when fallback is enabled we block here on waitForOperationCompletion, then classify:
- retryable (capacity) → RetryableProvisioningException → caller tries the next candidate
- non-retryable (quota / permission / bad config) → IOException → abort, no fallback
Reviewer note (the main design tradeoff): this wait runs on the provision() thread (the NodeProvisioner/spare-checker thread). For non-fallback configs it is skipped entirely and the launcher waits as before. The launcher later waits on the same op again, which is a harmless no-op since it's already DONE.
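The classification step after the wait might look like the sketch below. It is illustrative only: the error codes are passed in as plain strings rather than read from a GCE `Operation`, the retryability check is injected as a `Predicate`, and the rule that a single non-retryable code aborts the whole attempt is an assumption not stated in the PR.

```java
import java.io.IOException;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of mapping a finished operation's errors to an outcome. */
final class OperationOutcomeSketch {

    /** Signals a capacity failure: the caller may try the next candidate. */
    static final class RetryableProvisioningException extends Exception {
        RetryableProvisioningException(String message) {
            super(message);
        }
    }

    /**
     * @param errorCodes codes reported by the completed operation; null or empty means success
     * @param isRetryable the classification policy (assumed to be supplied by the caller)
     */
    static void check(List<String> errorCodes, Predicate<String> isRetryable)
            throws IOException, RetryableProvisioningException {
        if (errorCodes == null || errorCodes.isEmpty()) {
            return; // operation finished without errors
        }
        for (String code : errorCodes) {
            if (!isRetryable.test(code)) {
                // Assumption: any non-retryable code aborts the whole provision.
                throw new IOException("Non-retryable provisioning error: " + code);
            }
        }
        // Every reported error is capacity-related, so the next candidate is worth trying.
        throw new RetryableProvisioningException("Capacity error(s): " + errorCodes);
    }
}
```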
 * Upper bound on fallback candidates per configuration. Prevents unbounded retry chains that
 * could hold a provisioner thread for too long on a shared controller.
 */
public static final int MAX_FALLBACK_CANDIDATES = 10;
Hard cap on candidates per config. Bounds the worst-case provisioner-thread hold time (≤ 10 × launchTimeout).
 *
 * @return the derived region, or empty string if the zone is blank or has no recognizable suffix.
 */
public String getEffectiveRegion() {
Region is optional in the UI. When blank it's derived from the zone name (us-west1-a → us-west1), since GCE zone names encode their region. An explicit override is still available for edge cases.
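One way to derive the region is sketched below. The regular expression is an assumption about what counts as a "recognizable suffix" (a region name followed by a dash and a single lowercase letter); the PR's actual parsing rule is not shown, and `RegionSketch` and `deriveRegion` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of deriving a region from a GCE zone name. */
final class RegionSketch {

    // Assumed zone shape: <area>-<location><number>-<letter>, e.g. us-west1-a.
    private static final Pattern ZONE = Pattern.compile("^([a-z]+-[a-z]+\\d+)-[a-z]$");

    /** Returns the region part of the zone, or "" when the zone is blank or unrecognized. */
    static String deriveRegion(String zone) {
        if (zone == null || zone.trim().isEmpty()) {
            return "";
        }
        Matcher m = ZONE.matcher(zone.trim());
        return m.matches() ? m.group(1) : "";
    }
}
```

Returning an empty string for unrecognized input matches the Javadoc contract quoted above and leaves room for the explicit region override.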
 * @see <a href="https://cloud.google.com/compute/docs/troubleshooting/troubleshooting-vm-creation">
 * GCP: Troubleshooting VM creation</a>
 */
public final class ProvisioningErrorClassifier {
Single source of truth for the fallback decision. Capacity-type errors are classified retryable (worth trying another zone); everything else (quota, permission, bad config) is non-retryable.
 * <p>Maintenance: if GCP introduces additional capacity-related codes, add them here after
 * confirming in the GCP documentation that retrying in another zone is appropriate.
 */
private static final String[] RETRYABLE_MARKERS = {
The retryable allow-list: ZONE_RESOURCE_POOL_EXHAUSTED, STOCKOUT, RESOURCE_NOT_READY. To support a new capacity code, add it here after confirming against the GCP docs linked in the class Javadoc.
 * should retry in another zone/machine type; {@code false} for {@code null}, unknown, or
 * clearly non-capacity errors (including quota failures).
 */
public static boolean isRetryable(String codeOrMessage) {
Conservative unknown-code policy. An unrecognized/unlisted code returns false (non-retryable). This avoids masking genuine bugs behind endless zone-churning retries — a misconfiguration fails fast instead of silently cycling candidates.
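The policy can be illustrated with the sketch below. The three markers and the null, unknown-code and quota rules come from this PR's description; the substring matching, the upper-casing for case insensitivity, and the `QUOTA` substring check are assumptions about how the real classifier is written.

```java
import java.util.Locale;

/** Hypothetical sketch of an allow-list classifier with a conservative default. */
final class ErrorClassifierSketch {

    private static final String[] RETRYABLE_MARKERS = {
        "ZONE_RESOURCE_POOL_EXHAUSTED", "STOCKOUT", "RESOURCE_NOT_READY"
    };

    static boolean isRetryable(String codeOrMessage) {
        if (codeOrMessage == null) {
            return false; // null-safe: nothing to classify
        }
        String upper = codeOrMessage.toUpperCase(Locale.ROOT); // case-insensitive match
        if (upper.contains("QUOTA")) {
            return false; // quota failures are excluded from retry
        }
        for (String marker : RETRYABLE_MARKERS) {
            if (upper.contains(marker)) {
                return true; // known capacity error: another zone may succeed
            }
        }
        return false; // unknown codes fail fast instead of cycling candidates
    }
}
```

Because the default is `false`, supporting a new capacity code is an explicit one-line change to the marker list rather than something that happens implicitly.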
Summary
When provisioning a GCE agent fails with a retryable capacity error (e.g. ZONE_RESOURCE_POOL_EXHAUSTED), the plugin now iterates through an ordered list of fallback candidates defined per InstanceConfiguration.

New classes
- FallbackCandidate: data class for a single fallback option (zone, machineType, region, subnetwork, template) with:
  - doCheckZone / doCheckMachineType / doCheckRegion in DescriptorImpl
  - Region auto-derivation (us-west1-a → us-west1)
  - MAX_FALLBACK_CANDIDATES = 10 cap
- ProvisioningErrorClassifier: classifies GCP operation errors as retryable vs non-retryable
  - Retryable: ZONE_RESOURCE_POOL_EXHAUSTED, STOCKOUT, RESOURCE_NOT_READY

Changes to existing code
- InstanceConfiguration.provision(): refactored to loop through primary + fallback candidates, retrying on classified-retryable errors
- config.jelly: UI section for configuring fallback candidates per instance configuration

Commit structure
- FallbackCandidate data class with validation and CasC/UI support
- ProvisioningErrorClassifier with documented error policies
- InstanceConfiguration fallback provisioning logic

Test plan
- InstanceConfigurationFallbackTest (fallback logic, ordering, non-retryable short-circuit, all-exhausted, OperationException)
- ProvisioningErrorClassifierTest (retryable codes, unknown-code policy, quota exclusion, case insensitivity)
- ConfigAsCodeTest
- Fallback chain (us-west1-b → us-west1-c → us-west1-a) under real GCP capacity pressure, confirmed via GCE metadata server and GCP audit logs

Motivation
GCE zones frequently experience ZONE_RESOURCE_POOL_EXHAUSTED for specific machine types. Without fallback, builds queue indefinitely. This feature allows administrators to define ordered alternatives so provisioning can succeed in a different zone/template when the primary is stocked out, similar to how the AWS EC2 plugin handles multiple AZs.

Design decisions