Multi-tenant self-hosted GitHub Actions runners on Kubernetes, designed for shared clusters where many teams run runners side by side.
Actions Runner Controller (ARC) scale-set mode is the common starting point. Once many teams share one cluster, three gaps open up that ARC doesn't address together — GitHub Actions Gateway (GAG) is built to close them:
| Gap at multi-tenant scale | How GAG closes it |
|---|---|
| An evicted runner pod leaves its job stuck in GitHub's queue, up to the 24-hour timeout | Cancels the job lock in seconds and reruns, with a per-job retry budget |
| Tenants can't be given isolated GitHub egress IPs | Dedicated per-tenant egress IP pool for allowlisting and contained blast radius |
| Idle runner and listener compute stays provisioned between jobs | Workers scale to zero between jobs; listeners run as ~60 KiB goroutines, not ~256 MiB pods |
Each team self-serves a fully isolated gateway from a single ActionsGateway custom resource (CR), running many runner groups (CPU, GPU, large-memory, …) under one shared ResourceQuota. The sections below cover the problem, how GAG solves it, and how it works.
Running many runner groups for one tenant in a shared Kubernetes namespace creates four compounding problems that ARC scale-set mode does not address together:
Scheduling starvation under a shared ResourceQuota. Each ARC AutoscalingRunnerSet has its own maxRunners cap, but there is no primitive for "GPU runners must always be able to claim at least N slots, regardless of how many CPU runners are active." When cheap CPU pods exhaust namespace quota first, the most expensive hardware reliably loses the race.
Listener overhead at scale. ARC's scale-set listener is one pod per scale set running a full .NET runtime — roughly 256 MiB resident, plus a cluster IP, held alive 24/7 to long-poll GitHub. A tenant with 10 scale sets pays ~2.5 GiB of memory and 10 pod slots at rest, before any job runs. Teams that also pin minRunners > 0 to mask runner-pod cold-start latency multiply this further with idle runner pods on expensive hardware.
No automatic recovery from worker eviction. When a runner pod is preempted, OOM-killed, or lost to a node failure, ARC has no built-in flow to fast-cancel the GitHub job lock and rerun. The runner is left orphaned and the job stays stuck in GitHub's queue — up to its 24-hour timeout — until someone manually clears the runner and reruns the workflow.
Platform team as bottleneck. Onboarding a tenant means provisioning namespace, quotas, controller scope, scale sets, NetworkPolicies, and egress — a platform-team checklist per team. Subsequent changes (new runner type, quota adjustment, scaling tweak) land as tickets.
Scheduling priority tiers per RunnerGroup. The priorityTiers field maps Kubernetes PriorityClass objects to cumulative pod-count thresholds. The first N pods of a GPU runner group get a preempting PriorityClass and will displace lower-priority CPU pods when quota is contended — guaranteeing they schedule. Higher tiers use preemptionPolicy: Never, so burst capacity gains scheduling preference without evicting running jobs. A final threshold caps total concurrency per group. Crucially, GAG gates admission at the broker-claim layer — it decides whether to claim a job from GitHub before acquiring it, so a job it cannot place is left queued for redelivery rather than claimed-then-cancelled. A Kubernetes job-queue manager such as Kueue operates one layer below this, on pod creation after the job is already claimed, and structurally cannot make that call (see Appendix D.5).
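The cumulative-threshold rule above can be sketched in a few lines of Go. This is a minimal illustration, not GAG's implementation: the `Tier` type, the `classForNextPod` function, and the PriorityClass names `gpu-guaranteed` and `gpu-burst` are all hypothetical, and the real `priorityTiers` schema may differ.

```go
package main

import "fmt"

// Tier is a hypothetical stand-in for one priorityTiers entry: pods receive
// PriorityClass until the group's pod count reaches Threshold (cumulative).
type Tier struct {
	PriorityClass string
	Threshold     int
}

// classForNextPod returns the PriorityClass for the next pod of a group that
// already runs `active` pods. Tiers must be sorted by ascending Threshold.
// ok is false once the final threshold (the concurrency cap) is reached, in
// which case the job would be left queued rather than claimed.
func classForNextPod(tiers []Tier, active int) (class string, ok bool) {
	for _, t := range tiers {
		if active < t.Threshold {
			return t.PriorityClass, true
		}
	}
	return "", false
}

func main() {
	gpu := []Tier{
		{PriorityClass: "gpu-guaranteed", Threshold: 2}, // assumed to be a preempting class
		{PriorityClass: "gpu-burst", Threshold: 6},      // assumed preemptionPolicy: Never
	}
	for _, active := range []int{0, 2, 6} {
		class, ok := classForNextPod(gpu, active)
		fmt.Printf("active=%d class=%q admit=%v\n", active, class, ok)
	}
}
```

With these assumed tiers, the first two pods get the preempting class, pods three through six get the burst class, and a seventh job is not claimed.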
Automatic eviction retry with fast lock cancel. When the AGC sees a worker pod in Evicted status, it immediately stops lock renewal so GitHub cancels the job in seconds instead of waiting the full lock expiry, then calls GitHub's rerun API to reschedule. A configurable per-job retry budget prevents loops on persistently failing workloads.
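The per-job retry budget can be pictured as a small counter keyed by job. The sketch below is an assumption-laden illustration: `retryBudget`, `newRetryBudget`, and `onEvicted` are hypothetical names, and it models only the budget decision, not lock renewal or the call to GitHub's rerun API.

```go
package main

import "fmt"

// retryBudget counts reruns per job. The type and its fields are hypothetical.
type retryBudget struct {
	maxRetries int
	used       map[string]int
}

func newRetryBudget(maxRetries int) *retryBudget {
	return &retryBudget{maxRetries: maxRetries, used: make(map[string]int)}
}

// onEvicted is called when a job's worker pod is observed as evicted. It
// reports whether the job should be rerun. The caller is assumed to have
// already stopped renewing the job lock.
func (b *retryBudget) onEvicted(jobID string) bool {
	if b.used[jobID] >= b.maxRetries {
		return false // budget exhausted: surface the failure instead of looping
	}
	b.used[jobID]++
	return true
}

func main() {
	budget := newRetryBudget(2)
	for i := 1; i <= 3; i++ {
		fmt.Printf("eviction %d: rerun=%v\n", i, budget.onEvicted("job-42"))
	}
}
```

With a budget of two, the first two evictions of a job trigger a rerun and the third does not; other jobs are tracked independently.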
Per-tenant dedicated egress IP pool. Each tenant gets a Horizontal Pod Autoscaler (HPA)-managed pool of stateless HTTPS CONNECT proxy pods. All GitHub traffic from the AGC and worker pods routes through this pool, so each tenant's egress IPs are never shared with other tenants. This enables per-team allowlisting on the GitHub side, clean per-tenant audit attribution, and a contained blast radius for rate limits or abuse flags.
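From a client's point of view, routing through a CONNECT proxy is a transport setting. The sketch below uses only Go's standard `net/http` package; the proxy address `http://egress-proxy.tenant-a.svc:3128` and the helper names `newEgressClient` and `proxyFor` are assumptions for illustration, not names taken from GAG. For `https://` targets, `net/http` opens the tunnel with an HTTP CONNECT request to the configured proxy.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// newEgressClient returns an HTTP client that sends every request through
// the proxy at proxyAddr.
func newEgressClient(proxyAddr string) (*http.Client, error) {
	u, err := url.Parse(proxyAddr)
	if err != nil {
		return nil, err
	}
	return &http.Client{Transport: &http.Transport{Proxy: http.ProxyURL(u)}}, nil
}

// proxyFor reports which proxy the client would use for target. It builds a
// request but never sends it, so no network traffic occurs.
func proxyFor(client *http.Client, target string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return "", err
	}
	tr, ok := client.Transport.(*http.Transport)
	if !ok || tr.Proxy == nil {
		return "", fmt.Errorf("client has no proxy configured")
	}
	u, err := tr.Proxy(req)
	if err != nil || u == nil {
		return "", err
	}
	return u.String(), nil
}

func main() {
	// Hypothetical in-cluster address of one tenant's proxy Service.
	client, err := newEgressClient("http://egress-proxy.tenant-a.svc:3128")
	if err != nil {
		panic(err)
	}
	via, err := proxyFor(client, "https://api.github.com/")
	if err != nil {
		panic(err)
	}
	fmt.Println("GitHub traffic would egress via", via)
}
```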
Self-service tenant management via one CR. The Gateway Manager Controller (GMC) watches ActionsGateway CRs in tenant namespaces and provisions everything the tenant needs — RBAC, NetworkPolicies, egress proxy, AGC, and every runner group declared in the CR — all within the platform-owned namespace ResourceQuota (the platform admin owns the quota; the GMC operates inside it). No cluster-admin involvement after initial GMC install. Because tenants control their own configuration, they can diagnose their own runner behavior without escalating to the platform team.
Scale workers to zero with low listener overhead. Worker pods are created only when a job is acquired and deleted immediately on completion — the same scale-to-zero behavior as ARC scale-set mode with minRunners: 0, so GPU nodes return to the cluster scheduler the moment a job finishes. The difference is the listener: GAG runs every RunnerGroup's listener as a goroutine (~60 KiB resident) inside one shared AGC pod, instead of one ~256 MiB .NET listener pod per scale set. Tenants do not need to pin minRunners > 0 to mask cold-start latency, so the silent re-introduction of idle GPU pods that pattern causes does not happen.
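The listener-as-goroutine idea can be sketched with the standard library alone. Everything here is a simplified assumption: `runListeners` and `feed` are hypothetical names, and a buffered channel stands in for GitHub's long-poll, so this shows the concurrency shape rather than the real broker protocol.

```go
package main

import (
	"fmt"
	"sync"
)

// runListeners starts one goroutine per runner group inside a single process
// and returns how many jobs each group's listener received. Each listener
// exits when its job channel is closed.
func runListeners(jobs map[string]<-chan string) map[string]int {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		counts = make(map[string]int)
	)
	for group, ch := range jobs {
		wg.Add(1)
		go func(group string, ch <-chan string) {
			defer wg.Done()
			for range ch {
				// A real listener would create a worker pod for the job here.
				mu.Lock()
				counts[group]++
				mu.Unlock()
			}
		}(group, ch)
	}
	wg.Wait()
	return counts
}

// feed returns a closed, pre-filled channel that simulates queued jobs.
func feed(jobs ...string) <-chan string {
	ch := make(chan string, len(jobs))
	for _, j := range jobs {
		ch <- j
	}
	close(ch)
	return ch
}

func main() {
	counts := runListeners(map[string]<-chan string{
		"cpu": feed("build", "lint"),
		"gpu": feed("train"),
	})
	fmt.Println("cpu jobs:", counts["cpu"], "gpu jobs:", counts["gpu"])
}
```

Adding a runner group in this model adds one goroutine to the existing process rather than one more pod.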
Per-tenant utilization metrics. Both the GMC and AGC expose Prometheus metrics scoped per tenant and runner group. Teams have the data to understand their own GPU utilization and make the case for quota adjustments without relying on cluster-wide visibility.
A four-tier system:
Tenant namespace                       System namespace
════════════════                       ════════════════
┌──────────────────────┐               ┌──────────────────────────────┐
│  ActionsGateway CR   │──── watch ───▶│  Gateway Manager Controller  │
│  (namespace-scoped)  │               │            (GMC)             │
└──────────────────────┘               └───────────────┬──────────────┘
              ┌────────────── provisions ──────────────┘
              ▼
┌──────────────────────────────────────────────────────────────────────┐
│  Tenant namespace                                                    │
│  • Egress Proxy Pool            HPA-managed, per-tenant egress IPs   │
│  • Actions Gateway Controller   AGC, goroutine multiplexer           │
│  • Ephemeral Worker Pods        one per job, GC'd on completion      │
└──────────────────────────────────────────────────────────────────────┘
Tier 1 — Gateway Manager Controller (GMC). A cluster-scoped operator deployed once by the platform team. It watches namespace-scoped ActionsGateway CRs across all namespaces and provisions a fully isolated gateway instance for each tenant: role-based access control (RBAC), network policies, egress proxy, and AGC, all inside the tenant's existing namespace and under its platform-owned ResourceQuota.
Tier 2 — AGC. A Go-based operator deployed once per tenant. Instead of one pod per runner slot, it multiplexes virtual runner sessions as goroutines — designed to scale to thousands per AGC pod. Compute is provisioned only when a job is acquired and released immediately on completion (the finished pod object is deleted after a short configurable TTL). At steady state each goroutine is designed to cost ~60 KiB resident — a projected reduction of over 4,000× compared to a full .NET Runner.Listener process. The thousands-per-AGC ceiling is a design target, not yet validated at scale; the load test that would confirm it is deferred post-1.0 (see Appendix A — Capacity Targets & SLOs).
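The memory figures quoted above can be checked with simple arithmetic. The sketch below uses only the approximate sizes stated in this document (about 256 MiB per .NET listener pod, about 60 KiB per goroutine as a design target); the constant and function names are illustrative.

```go
package main

import "fmt"

const (
	KiB = 1024
	MiB = 1024 * KiB
	GiB = 1024 * MiB

	listenerPodBytes = 256 * MiB // approximate .NET listener pod, per the text
	goroutineBytes   = 60 * KiB  // design target per goroutine, per the text
)

// restingBytes returns the listener memory held at rest for n runner groups.
func restingBytes(n, perListener int64) int64 {
	return n * perListener
}

func main() {
	fmt.Printf("per-listener ratio: %dx\n", listenerPodBytes/goroutineBytes)
	fmt.Printf("10 listener pods:   %.1f GiB\n",
		float64(restingBytes(10, listenerPodBytes))/float64(GiB))
	fmt.Printf("10 goroutines:      %d KiB\n", restingBytes(10, goroutineBytes)/KiB)
}
```

The integer ratio comes out to 4369, consistent with the "over 4,000×" projection, and ten listener pods account for the 2.5 GiB at-rest figure against 600 KiB for ten goroutines.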
Tier 3 — Egress Proxy Pool. A Horizontal Pod Autoscaler (HPA)-managed pool of stateless HTTPS CONNECT proxy pods per tenant. All GitHub traffic from the AGC and worker pods routes through this pool, giving each tenant a dedicated set of egress IPs never shared with other tenants. Supports per-team IP allowlisting, clean audit trails, and contained blast radius.
Tier 4 — Ephemeral Worker Pod. A short-lived pod that executes exactly one workflow job and is immediately deleted on completion. Because worker pods exist only while a job is running, zero compute is idle between jobs — GPU nodes return to the cluster scheduler the moment a job finishes.
For the full design, see docs/design/.
| Section | File |
|---|---|
| Executive Summary & Problem Statement | 01-executive-summary.md |
| Core Architectural Components | 02-architecture.md |
| API & Data Contract Specifications | 03-api-contracts.md |
| Operational Lifecycle Execution Flows | 04-operational-flows.md |
| Security & Threat Risk Assessment | 05-security.md |
| Capacity Targets & SLOs | appendix-a-capacity-slos.md |
| Alternatives Considered | appendix-d-alternatives-considered.md |
| Optional Future Enhancements | appendix-g-future-enhancements.md |
GAG ships as the actions-gateway Helm chart, which installs the Gateway Manager Controller (GMC) and its cluster prerequisites. The GMC then provisions per-tenant gateways at runtime from each ActionsGateway CR.
The chart is published, cosign-signed, to the GHCR OCI registry. The current release is 1.0.0 (GA) — install it straight from the registry:
helm install gag oci://ghcr.io/actions-gateway/charts/actions-gateway \
--version 1.0.0 \
--namespace gmc-system --create-namespace \
--set gmc.image.digest=sha256:<gmc> \
--set agc.image.digest=sha256:<agc> \
--set proxy.image.digest=sha256:<proxy>
Copy the image digests from the release notes and verify the signatures before installing. See the Installation guide for prerequisites, image-digest pinning, the cert-manager toggle, healthy-install verification, and uninstall, and the chart README for the full values reference.
For day-2 operations — helm upgrade / rollback, per-component upgrades, and runbooks — see the operations docs, in particular the upgrade guide.
See docs/getting-started.md for the full walkthrough: GitHub App Secret, ActionsGateway CR, and GMC deployment.
Coming from Actions Runner Controller (ARC)? The Migrating from ARC guide maps ARC scale-set concepts onto GAG and walks one runner group across with zero downtime.
Both the GMC and AGC expose Prometheus metrics at /metrics. See docs/operations/observability.md for the full metrics reference.
See docs/design/appendix-a-capacity-slos.md for per-AGC, per-installation, and per-proxy limits and Service Level Objective (SLO) targets.
Questions, ideas, or running GAG in a real cluster? Open an issue — it's the place for setup help, bug reports, and feature requests. Issues opened by operators are the adoption signal the project cares about most.
See the public roadmap for what's shipped and what's next.
Run make (or make help) for the full list of targets. The most common ones:
# Build all binaries (agc, gmc, probe, proxy) into .build/
make build
# Build tool binaries (controller-gen, setup-envtest, ginkgo, kubebuilder)
make tools
# Bring up a kind cluster + local registry, build+push images, and run the standard e2e suite
make e2e-up
# Tear down the kind cluster when done
make e2e-clean
This repo uses a go.work workspace, so go test ./... from the repo root does not discover all modules. Use the per-module commands:
(cd broker && go test ./...) # broker module
(cd githubapp && go test ./...) # githubapp module
(cd cmd/agc && go test ./...) # AGC module
(cd cmd/gmc && go test ./...) # GMC module
(cd cmd/probe && go test ./...) # probe module
Integration tests require the envtest binaries staged via KUBEBUILDER_ASSETS:
make setup-envtest
export KUBEBUILDER_ASSETS=$(.build/setup-envtest use 1.30.x \
--bin-dir /tmp/envtest-bins -p path)
(cd cmd/agc && go test -v -tags integration -timeout 5m -count=1 \
./internal/controller/integration/...)
(cd cmd/gmc && go test -v -tags integration -timeout 5m -count=1 \
./internal/controller/integration/...)
broker/ GitHub broker client (session management, crypto, metrics)
githubapp/ GitHub App authentication and runner registration
cmd/agc/ Actions Gateway Controller binary
cmd/gmc/ Gateway Manager Controller binary (kubebuilder-generated)
cmd/proxy/ Egress proxy binary
cmd/worker/ Worker pod entrypoint
cmd/probe/ Diagnostic probe for live investigations
docs/ Documentation hub — see docs/README.md
docs/design/ Full system design documentation
docs/development/ Developer workflow guides
docs/operations/ Operator runbooks and references
docs/plan/ Implementation plans and audits
test/ E2E test infrastructure (fakegithub stub, kind configs)
tools/ Vendored build tools (controller-gen, setup-envtest)
vendor/ Workspace-vendored runtime dependencies (`go work vendor`)
GitHub Actions Gateway is licensed under the Apache License 2.0
(SPDX identifier Apache-2.0). Each published container image also carries this
in its org.opencontainers.image.licenses label. Copyright is asserted in the
NOTICE file.
A secure, dedicated gateway to GitHub for each tenant.
Don't let noisy neighbors or secret exfiltrators ruin your sleep.