Skip to content

Use only name based check for VM existence during deletion - remove label based filtering#534

Merged
raul-arabaolaza merged 1 commit into
jenkinsci:developfrom
gbhat618:vm-deletion-by-name
Apr 1, 2026
Merged

Use only name based check for VM existence during deletion - remove label based filtering#534
raul-arabaolaza merged 1 commit into
jenkinsci:developfrom
gbhat618:vm-deletion-by-name

Conversation

@gbhat618

@gbhat618 gbhat618 commented Mar 31, 2026

Copy link
Copy Markdown
Contributor

In CasC based controller, if the instanceId is not put in configuration, will get recomputed upon reapplying CasC (such as on restart) -

@DataBoundSetter
public void setInstanceId(String instanceId) {
if (Strings.isNullOrEmpty(instanceId)) {
this.instanceId = UUID.randomUUID().toString();
} else {
this.instanceId = instanceId;
}
}

Post such re-computation of instanceId, the VMs provisioned prior will not be deleted from the _terminate as the filter won't find them -

Map<String, String> filterLabel = ImmutableMap.of(CLOUD_ID_LABEL_KEY, cloud.getInstanceId());
var instanceExistsInCloud =
cloud.getClient().listInstancesWithLabel(cloud.getProjectId(), filterLabel).stream()
.anyMatch(instance -> instance.getName().equals(name));

Those VMs are assumed to be non-existing in GCP and Jenkins node is removed, the VMs will continue to run in the cloud.

I have considered completely removing the VM existence check entirely but noticed this discussion https://github.com/jenkinsci/google-compute-engine-plugin/pull/489/changes#r1871950758

This PR propose to use getInstance(projectId, zone, name) to filter a VM which should also get a unique VM. When multiple Jenkins controllers are pointing to a GCP project, and if one controller is having an agent with a particular name, other controllers would not have had chance to provision that VM, would have failed during provisioning even if they had computed the same agent name.

Investigation

This was able to prove by adding more logs

diff
diff --git a/src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineInstance.java b/src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineInstance.java
index 9a394d0..96e3b38 100644
--- a/src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineInstance.java
+++ b/src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineInstance.java
@@ -138,6 +138,7 @@ public class ComputeEngineInstance extends AbstractCloudSlave {
             }

             Map<String, String> filterLabel = ImmutableMap.of(CLOUD_ID_LABEL_KEY, cloud.getInstanceId());
+            LOGGER.log(Level.FINE, "Filtering using cloud id: " + cloud.getInstanceId());
             var instanceExistsInCloud =
                     cloud.getClient().listInstancesWithLabel(cloud.getProjectId(), filterLabel).stream()
                             .anyMatch(instance -> instance.getName().equals(name));
@@ -146,6 +147,15 @@ public class ComputeEngineInstance extends AbstractCloudSlave {
             // return immediately, hoping for the best.
             if (instanceExistsInCloud) {
                 cloud.getClient().terminateInstanceAsync(cloud.getProjectId(), zone, name);
+                LOGGER.finer(() -> "Termination request was made for " + this.getNodeName());
+            } else {
+                LOGGER.finer(() -> "Instance " + name + " does not exist");
+                try {
+                    var instance = cloud.getClient().getInstance(cloud.getProjectId(), zone, name);
+                    LOGGER.finer(() -> "Got the instance " + instance.getName());
+                } catch (Exception e) {
+                    LOGGER.log(Level.WARNING, "Error getting instance " + name, e);
+                }
             }
         } catch (CloudNotFoundException cnfe) {
             listener.error(cnfe.getMessage());

logs were,

2026-03-31 17:20:58.308+0000 [id=1287]      FINE    c.g.j.p.c.ComputeEngineInstance#_terminate: Filtering using cloud id: 5592ce99-f477-48bc-b0ee-e399e2c6f920
2026-03-31 17:20:58.784+0000 [id=1287]      FINER   c.g.j.p.c.ComputeEngineInstance#_terminate: Instance gbhat-15708-gce-pqvxjx does not exist
2026-03-31 17:20:58.872+0000 [id=1287]      FINER   c.g.j.p.c.ComputeEngineInstance#_terminate: Got the instance gbhat-15708-gce-pqvxjx

Instance had label as,

→ gcloud compute instances describe gbhat-15708-gce-pqvxjx --zone=us-east1-b --format='json(labels)'
{
  "labels": {
    "jenkins_cloud_id": "2ed0b892-fb9c-4ec2-b0c0-bf990fe25043",
    "jenkins_config_name": "gbhat-15708-gce",
    "jenkins_node_last_refresh": "2026_03_31t17_12_47_010z"
  }
}

Jenkins config.xml showed the id as,

bash-5.1$ cat config.xml | grep instanceId
      <instanceId>5592ce99-f477-48bc-b0ee-e399e2c6f920</instanceId>

Testing done

Automated Testing

  • Ran the existing integration test ComputeEngineCloudOneShotInstanceIT - which has some VM deletion related assertions.

Manual Testing

  • verified issue no more happens - in the context of CloudBees CI CasC controller where the problem observed.
  • manually killed off VMs in GCP and observed no log noise

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@gbhat618 gbhat618 marked this pull request as ready for review April 1, 2026 05:58
@gbhat618

gbhat618 commented Apr 1, 2026

Copy link
Copy Markdown
Contributor Author

The only theoretical risk (far from being practical) with project/zone/name based filtering without considering the additional filtering rule jenkins_cloud_id are,

  • what if 2 controllers point to the same GCP project, both using the same name prefix - and end up generating the same VM name? the VM would belong to only 1 controller; and could the other controller delete it wrongly.
    • the name collision happening is a very probability given that name is generated with 6 suffix random chars from the space (a-z: 28 chars) + (0-9: 10 chars) of 36 chars, leading to ~2.18 billion possibilities, to reach even 0.023% probability there needs 1k running instances;
      private String uniqueName() {
      char[][] pairs = {{'a', 'z'}, {'0', '9'}};
      RandomStringGenerator generator =
      new RandomStringGenerator.Builder().withinRange(pairs).build();
      String suffix = generator.generate(6);
      String prefix = namePrefix;
      if (!prefix.endsWith(("-"))) {
      prefix += "-";
      }
      return prefix + suffix;
      }
      . It is safe to say this is not going to have collisions in practical sense in the scale Jenkins controllers are used. And usually there is a different name prefix that people would configure when using multiple configurations pointing to same GCP Project.

@jglick

jglick commented Apr 1, 2026

Copy link
Copy Markdown
Member

Is this related to #504?

@jglick

jglick commented Apr 1, 2026

Copy link
Copy Markdown
Member

BTW use permalinks to the conversations, not files, tab: #489 (comment)

@jglick jglick left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds OK. Do not really understand the implications.

@gbhat618

gbhat618 commented Apr 1, 2026

Copy link
Copy Markdown
Contributor Author

Is this related to #504?

yes, it is related.

@gbhat618

gbhat618 commented Apr 1, 2026

Copy link
Copy Markdown
Contributor Author

Do not really understand the implications.

Nothing concrete as far as I can tell. The only theoretical implication is the one I put above #534 (comment) — which is far from practical imo.

@gbhat618

gbhat618 commented Apr 1, 2026

Copy link
Copy Markdown
Contributor Author

IMPORTANT: /label bug

try {
return cloud.getClient().getInstance(cloud.getProjectId(), zone, name) != null;
} catch (GoogleJsonResponseException gjre) {
if (gjre.getStatusCode() == 404) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

example exception trace

exception
GET https://compute.googleapis.com/compute/v1/projects/tiger-team-testing/zones/us-east1-b/instances/name
{
  "code" : 404,
  "errors" : [ {
    "domain" : "global",
    "message" : "The resource 'projects/tiger-team-testing/zones/us-east1-b/instances/name' was not found",
    "reason" : "notFound"
  } ],
  "message" : "The resource 'projects/tiger-team-testing/zones/us-east1-b/instances/name' was not found"
}
	at PluginClassLoader for google-oauth-plugin//com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
	at PluginClassLoader for google-oauth-plugin//com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:118)
	at PluginClassLoader for google-oauth-plugin//com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:37)
	at PluginClassLoader for google-oauth-plugin//com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:439)
	at PluginClassLoader for google-oauth-plugin//com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1111)
	at PluginClassLoader for google-oauth-plugin//com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:525)
	at PluginClassLoader for google-oauth-plugin//com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:466)
	at PluginClassLoader for google-oauth-plugin//com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:576)
	at PluginClassLoader for google-compute-engine//com.google.cloud.graphite.platforms.plugin.client.ComputeWrapper.getInstance(ComputeWrapper.java:125)
	at PluginClassLoader for google-compute-engine//com.google.cloud.graphite.platforms.plugin.client.ComputeClient.getInstance(ComputeClient.java:371)
	at PluginClassLoader for google-compute-engine//com.google.jenkins.plugins.computeengine.ComputeEngineInstance.instanceExistsInCloud(ComputeEngineInstance.java:174)
	at PluginClassLoader for google-compute-engine//com.google.jenkins.plugins.computeengine.ComputeEngineInstance._terminate(ComputeEngineInstance.java:137)
	at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:88)
	at PluginClassLoader for durable-task//org.jenkinsci.plugins.durabletask.executors.OnceRetentionStrategy.lambda$done$5(OnceRetentionStrategy.java:141)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
	at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)

@raul-arabaolaza raul-arabaolaza added the bug Something isn't working label Apr 1, 2026
@raul-arabaolaza raul-arabaolaza merged commit 0a3fb1b into jenkinsci:develop Apr 1, 2026
17 checks passed
@gbhat618 gbhat618 deleted the vm-deletion-by-name branch April 1, 2026 15:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants