@Mykematt
Contributor

Description

Problem: In Kubernetes environments with ephemeral pods, multiple jobs starting simultaneously would redundantly download the same plugin, wasting time and bandwidth.

Solution: Implemented file locking for shared plugin storage in Kubernetes agents to prevent race conditions during plugin downloads. Added enhanced logging to show:

  • When shared storage is being used
  • When a job is waiting for another job to finish downloading a plugin
  • When a plugin is successfully reused from cache (with commit hash)
[Screenshot, 2025-12-23 1:56:36 PM]

Context

Linear: SUP-5805

Changes

Core Implementation (internal/job/plugin.go):

  • Added openCachedPlugin() helper function to remove duplicated cached-plugin handling code
  • Enhanced acquirePluginLock() with logging for lock acquisition wait states
  • Updated checkoutPlugin() to support shared plugin storage with file locking when BUILDKITE_PLUGINS_PATH_INCLUDES_AGENT_NAME=false

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go tool gofumpt -extra -w .)

Disclosures / Credits

I consulted Claude for potential approaches, then wrote the implementation myself.

@Mykematt
Contributor Author

Mykematt commented Dec 23, 2025

Tested with EKS cluster using EFS-backed PVC for /workspace/plugins: https://buildkite.com/olabuildkite/buildkite-kubernetes/builds/346/steps/table

Job 1: checks that the shared plugin path is empty
Job 2: downloads the plugin binary, since the workspace is empty
Job 3: could not acquire the lock, so it waited for the lock to be released; once released, it saw the binary was already downloaded and skipped the download
Job 4: saw the binary immediately

The fix mainly affects Jobs 2 and 3. There was no race condition between them, and because the download is guarded by the lock, only one of them downloaded the binary.
