GitHub Actions Gateway

Multi-tenant self-hosted GitHub Actions runners on Kubernetes, designed for shared clusters where many teams run runners side by side.

Actions Runner Controller (ARC) scale-set mode is the common starting point. Once many teams share one cluster, three gaps open up that ARC doesn't address together — GitHub Actions Gateway (GAG) is built to close them:

Gap at multi-tenant scale	How GAG closes it
An evicted runner pod leaves its job stuck in GitHub's queue, up to the 24-hour timeout	Cancels the job lock in seconds and reruns, with a per-job retry budget
Tenants can't be given isolated GitHub egress IPs	Dedicated per-tenant egress IP pool for allowlisting and contained blast radius
Idle runner and listener compute stays provisioned between jobs	Workers scale to zero between jobs; listeners run as ~60 KiB goroutines, not ~256 MiB pods

Each team self-serves a fully isolated gateway from a single ActionsGateway custom resource (CR), running many runner groups (CPU, GPU, large-memory, …) under one shared ResourceQuota. The sections below cover the problem, how GAG solves it, and how it works.

The Problem

Running many runner groups for one tenant in a shared Kubernetes namespace creates four compounding problems that ARC scale-set mode does not address together:

Scheduling starvation under a shared ResourceQuota. Each ARC AutoscalingRunnerSet has its own maxRunners cap, but there is no primitive for "GPU runners must always be able to claim at least N slots, regardless of how many CPU runners are active." When cheap CPU pods exhaust namespace quota first, the most expensive hardware reliably loses the race.

Listener overhead at scale. ARC's scale-set listener is one pod per scale set running a full .NET runtime — roughly 256 MiB resident, plus a cluster IP, held alive 24/7 to long-poll GitHub. A tenant with 10 scale sets pays ~2.5 GiB of memory and 10 pod slots at rest, before any job runs. Teams that also pin minRunners > 0 to mask runner-pod cold-start latency multiply this further with idle runner pods on expensive hardware.

No automatic recovery from worker eviction. When a runner pod is preempted, OOM-killed, or lost to a node failure, ARC has no built-in flow to fast-cancel the GitHub job lock and rerun. The runner is left orphaned and the job stays stuck in GitHub's queue — up to its 24-hour timeout — until someone manually clears the runner and reruns the workflow.

Platform team as bottleneck. Onboarding a tenant means provisioning namespace, quotas, controller scope, scale sets, NetworkPolicies, and egress — a platform-team checklist per team. Subsequent changes (new runner type, quota adjustment, scaling tweak) land as tickets.

The Solution

Scheduling priority tiers per RunnerGroup. The priorityTiers field maps Kubernetes PriorityClass objects to cumulative pod-count thresholds. The first N pods of a GPU runner group get a preempting PriorityClass and will displace lower-priority CPU pods when quota is contended — guaranteeing they schedule. Higher tiers use preemptionPolicy: Never, so burst capacity gains scheduling preference without evicting running jobs. A final threshold caps total concurrency per group. Crucially, GAG gates admission at the broker-claim layer — it decides whether to claim a job from GitHub before acquiring it, so a job it cannot place is left queued for redelivery rather than claimed-then-cancelled. A Kubernetes job-queue manager such as Kueue operates one layer below this, on pod creation after the job is already claimed, and structurally cannot make that call (see Appendix D.5).
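The cumulative-threshold mapping described above can be sketched in a few lines of Go. This is an illustrative sketch only: the `PriorityTier` type, its field names, and the PriorityClass names `gpu-guaranteed` and `gpu-burst` are assumptions made for the example, not the actual `priorityTiers` schema.

```go
package main

import "fmt"

// PriorityTier is a hypothetical shape for one priorityTiers entry: pods up to
// and including Threshold (a cumulative count) receive PriorityClass.
type PriorityTier struct {
	PriorityClass string
	Threshold     int
}

// tierFor returns the PriorityClass for the n-th concurrent pod (1-based).
// ok is false when n exceeds the final threshold: the group is at its
// concurrency cap, so the job would be left unclaimed in GitHub's queue.
func tierFor(tiers []PriorityTier, n int) (class string, ok bool) {
	for _, t := range tiers {
		if n <= t.Threshold {
			return t.PriorityClass, true
		}
	}
	return "", false
}

func main() {
	gpu := []PriorityTier{
		{PriorityClass: "gpu-guaranteed", Threshold: 4}, // assumed to be a preempting class
		{PriorityClass: "gpu-burst", Threshold: 10},     // assumed preemptionPolicy: Never
	}
	for _, n := range []int{1, 4, 5, 10, 11} {
		class, ok := tierFor(gpu, n)
		fmt.Printf("pod #%d -> class=%q claim=%v\n", n, class, ok)
	}
}
```

In this sketch the last threshold doubles as the per-group cap, which matches the description of a final threshold capping total concurrency.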

Automatic eviction retry with fast lock cancel. When the Actions Gateway Controller (AGC) sees a worker pod in Evicted status, it immediately stops lock renewal so GitHub cancels the job in seconds instead of waiting out the full lock expiry, then calls GitHub's rerun API to reschedule the job. A configurable per-job retry budget prevents loops on persistently failing workloads.
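The per-job retry budget can be pictured as a small counter keyed by job ID. The sketch below is a minimal illustration under that assumption; `RetryBudget`, `OnEvicted`, and the `Rerun`/`GiveUp` actions are hypothetical names, and the calls that stop lock renewal or hit GitHub's rerun API are deliberately left out.

```go
package main

import "fmt"

// Action is what the controller does after observing an evicted worker pod.
type Action int

const (
	Rerun  Action = iota // stop lock renewal, then request a rerun
	GiveUp               // budget exhausted: stop renewing, do not rerun
)

// RetryBudget tracks eviction retries per job (hypothetical type).
type RetryBudget struct {
	Max  int
	used map[int64]int // evictions already retried, keyed by job ID
}

// NewRetryBudget returns a budget allowing up to max reruns per job.
func NewRetryBudget(max int) *RetryBudget {
	return &RetryBudget{Max: max, used: make(map[int64]int)}
}

// OnEvicted records one eviction for jobID and returns the next action.
func (b *RetryBudget) OnEvicted(jobID int64) Action {
	if b.used[jobID] >= b.Max {
		return GiveUp
	}
	b.used[jobID]++
	return Rerun
}

func main() {
	budget := NewRetryBudget(2)
	for i := 1; i <= 3; i++ {
		if budget.OnEvicted(42) == Rerun {
			fmt.Printf("eviction %d: rerun\n", i)
		} else {
			fmt.Printf("eviction %d: give up\n", i)
		}
	}
}
```

Keying the counter by job rather than by pod is the design point: a job that keeps getting evicted stops being rerun, while unrelated jobs keep their full budget.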

Per-tenant dedicated egress IP pool. Each tenant gets a Horizontal Pod Autoscaler (HPA)-managed pool of stateless HTTPS CONNECT proxy pods. All GitHub traffic from the AGC and worker pods routes through this pool, so each tenant gets egress IPs never shared with other tenants. This enables per-team allowlisting on the GitHub side, clean per-tenant audit attribution, and a contained blast radius for rate limits or abuse flags.

Self-service tenant management via one CR. The Gateway Manager Controller (GMC) watches ActionsGateway CRs in tenant namespaces and provisions everything the tenant needs — RBAC, NetworkPolicies, egress proxy, AGC, and every runner group declared in the CR — all within the platform-owned namespace ResourceQuota (the platform admin owns the quota; the GMC operates inside it). No cluster-admin involvement after initial GMC install. Because tenants control their own configuration, they can diagnose their own runner behavior without escalating to the platform team.

Scale workers to zero with low listener overhead. Worker pods are created only when a job is acquired and deleted immediately on completion — the same scale-to-zero behavior as ARC scale-set mode with minRunners: 0, so GPU nodes return to the cluster scheduler the moment a job finishes. The difference is the listener: GAG runs every RunnerGroup's listener as a goroutine (~60 KiB resident) inside one shared AGC pod, instead of one ~256 MiB .NET listener pod per scale set. Tenants do not need to pin minRunners > 0 to mask cold-start latency, so the silent re-introduction of idle GPU pods that pattern causes does not happen.

Per-tenant utilization metrics. Both the GMC and AGC expose Prometheus metrics scoped per tenant and runner group. Teams have the data to understand their own GPU utilization and make the case for quota adjustments without relying on cluster-wide visibility.

Architecture

A four-tier system:

  Tenant namespace                         System namespace
  ════════════════                         ════════════════

  ┌──────────────────────┐               ┌──────────────────────────────┐
  │  ActionsGateway CR   │──── watch ───▶│  Gateway Manager Controller  │
  │  (namespace-scoped)  │               │            (GMC)             │
  └──────────────────────┘               └───────────────┬──────────────┘
                ┌────────────── provisions ──────────────┘
                ▼
  ┌──────────────────────────────────────────────────────────────────────┐
  │  Tenant namespace                                                    │
  │    • Egress Proxy Pool           HPA-managed, per-tenant egress IPs  │
  │    • Actions Gateway Controller  AGC, goroutine multiplexer          │
  │    • Ephemeral Worker Pods       one per job, GC'd on completion     │
  └──────────────────────────────────────────────────────────────────────┘

Tier 1 — Gateway Manager Controller (GMC). A cluster-scoped operator deployed once by the platform team. It watches namespace-scoped ActionsGateway CRs across all namespaces and provisions a fully isolated gateway instance for each tenant — role-based access control (RBAC), network policies, resource quotas, egress proxy, and AGC — entirely within the tenant's existing namespace.

Tier 2 — AGC. A Go-based operator deployed once per tenant. Instead of one pod per runner slot, it multiplexes virtual runner sessions as goroutines — designed to scale to thousands per AGC pod. Compute is provisioned only when a job is acquired and released immediately on completion (the finished pod object is deleted after a short configurable TTL). At steady state each goroutine is designed to cost ~60 KiB resident — a projected reduction of over 4,000× compared to a full .NET Runner.Listener process. The thousands-per-AGC ceiling is a design target, not yet validated at scale; the load test that would confirm it is deferred post-1.0 (see Appendix A — Capacity Targets & SLOs).
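The memory figures quoted for listeners can be checked with simple arithmetic. The sketch below uses only the approximate numbers stated in this README (~256 MiB per .NET listener pod, ~60 KiB per goroutine as a design target); it ignores the fixed footprint of the shared AGC pod itself, so it compares per-listener cost only.

```go
package main

import "fmt"

const (
	KiB = 1024
	MiB = 1024 * KiB
	GiB = 1024 * MiB

	listenerPodBytes = 256 * MiB // approximate .NET listener pod, per the text
	goroutineBytes   = 60 * KiB  // design-target goroutine cost, per the text
)

// atRestBytes is the listener memory held while no job is running.
func atRestBytes(listeners, perListener int64) int64 {
	return listeners * perListener
}

func main() {
	arc := atRestBytes(10, listenerPodBytes)
	gag := atRestBytes(10, goroutineBytes)
	fmt.Printf("10 listener pods:       %.2f GiB\n", float64(arc)/GiB) // 2.50 GiB
	fmt.Printf("10 listener goroutines: %d KiB\n", gag/KiB)            // 600 KiB
	fmt.Printf("per-listener ratio:     ~%dx\n", listenerPodBytes/goroutineBytes) // ~4369x
}
```

The ratio of roughly 4,369 is where the "over 4,000×" figure comes from, and it stays a projection until the goroutine cost is measured under load.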

Tier 3 — Egress Proxy Pool. A Horizontal Pod Autoscaler (HPA)-managed pool of stateless HTTPS CONNECT proxy pods per tenant. All GitHub traffic from the AGC and worker pods routes through this pool, giving each tenant a dedicated set of egress IPs never shared with other tenants. Supports per-team IP allowlisting, clean audit trails, and contained blast radius.

Tier 4 — Ephemeral Worker Pod. A short-lived pod that executes exactly one workflow job and is immediately deleted on completion. Because worker pods exist only while a job is running, zero compute is idle between jobs — GPU nodes return to the cluster scheduler the moment a job finishes.

For the full design, see docs/design/.

Section	Document
Executive Summary & Problem Statement	01-executive-summary.md
Core Architectural Components	02-architecture.md
API & Data Contract Specifications	03-api-contracts.md
Operational Lifecycle Execution Flows	04-operational-flows.md
Security & Threat Risk Assessment	05-security.md
Capacity Targets & SLOs	appendix-a-capacity-slos.md
Alternatives Considered	appendix-d-alternatives-considered.md
Optional Future Enhancements	appendix-g-future-enhancements.md

Installation

GAG ships as the actions-gateway Helm chart, which installs the Gateway Manager Controller (GMC) and its cluster prerequisites. The GMC then provisions per-tenant gateways at runtime from each ActionsGateway CR.

The chart is published, cosign-signed, to the GHCR OCI registry. The current release is 1.0.0 (GA) — install it straight from the registry:

helm install gag oci://ghcr.io/actions-gateway/charts/actions-gateway \
  --version 1.0.0 \
  --namespace gmc-system --create-namespace \
  --set gmc.image.digest=sha256:<gmc> \
  --set agc.image.digest=sha256:<agc> \
  --set proxy.image.digest=sha256:<proxy>

Copy the image digests from the release notes and verify the signatures before installing. See the Installation guide for prerequisites, image-digest pinning, the cert-manager toggle, healthy-install verification, and uninstall — and the chart README for the full values reference.

For day-2 operations — helm upgrade / rollback, per-component upgrades, and runbooks — see the operations docs, in particular the upgrade guide.

Quick Start

See docs/getting-started.md for the full walkthrough: GitHub App Secret, ActionsGateway CR, and GMC deployment.

Coming from Actions Runner Controller (ARC)? The Migrating from ARC guide maps ARC scale-set concepts onto GAG and walks one runner group across with zero downtime.

Observability

Both the GMC and AGC expose Prometheus metrics at /metrics. See docs/operations/observability.md for the full metrics reference.

Capacity Reference

See docs/design/appendix-a-capacity-slos.md for per-AGC, per-installation, and per-proxy limits and Service Level Objective (SLO) targets.

Community

Questions, ideas, or running GAG in a real cluster? Open an issue — it's the place for setup help, bug reports, and feature requests. Issues opened by operators are the adoption signal the project cares about most.

See the public roadmap for what's shipped and what's next.

Development

Run make (or make help) for the full list of targets. The most common ones:

# Build all binaries (agc, gmc, probe, proxy) into .build/
make build

# Build tool binaries (controller-gen, setup-envtest, ginkgo, kubebuilder)
make tools

# Bring up a kind cluster + local registry, build+push images, and run the standard e2e suite
make e2e-up

# Tear down the kind cluster when done
make e2e-clean

Running tests

This repo uses a go.work workspace, so go test ./... from the repo root does not discover all modules. Use the per-module commands:

(cd broker    && go test ./...)   # broker module
(cd githubapp && go test ./...)   # githubapp module
(cd cmd/agc   && go test ./...)   # AGC module
(cd cmd/gmc   && go test ./...)   # GMC module
(cd cmd/probe && go test ./...)   # probe module

Integration tests require the envtest binaries staged via KUBEBUILDER_ASSETS:

make setup-envtest
export KUBEBUILDER_ASSETS="$(.build/setup-envtest use 1.30.x \
    --bin-dir /tmp/envtest-bins -p path)"

(cd cmd/agc && go test -v -tags integration -timeout 5m -count=1 \
    ./internal/controller/integration/...)
(cd cmd/gmc && go test -v -tags integration -timeout 5m -count=1 \
    ./internal/controller/integration/...)

Repository Layout

broker/            GitHub broker client (session management, crypto, metrics)
githubapp/         GitHub App authentication and runner registration
cmd/agc/           Actions Gateway Controller binary
cmd/gmc/           Gateway Manager Controller binary (kubebuilder-generated)
cmd/proxy/         Egress proxy binary
cmd/worker/        Worker pod entrypoint
cmd/probe/         Diagnostic probe for live investigations
docs/              Documentation hub; see docs/README.md
docs/design/       Full system design documentation
docs/development/  Developer workflow guides
docs/operations/   Operator runbooks and references
docs/plan/         Implementation plans and audits
test/              E2E test infrastructure (fakegithub stub, kind configs)
tools/             Vendored build tools (controller-gen, setup-envtest)
vendor/            Workspace-vendored runtime dependencies (`go work vendor`)

License

GitHub Actions Gateway is licensed under the Apache License 2.0 (SPDX identifier Apache-2.0). Each published container image also carries this in its org.opencontainers.image.licenses label. Copyright is asserted in the NOTICE file.

A secure, dedicated gateway to GitHub for each tenant.
Don't let noisy neighbors or secret exfiltrators ruin your sleep.
