Go representation of pleme-io's service-lifecycle model. The Go counterpart
to the Rust service-lifecycle crate: the same model, so every Go service and
tool supervises itself the same way — one signal-aware context, one ordered
graceful shutdown, one health surface, one run-loop.
No ad-hoc `signal.Notify` boilerplate, no hand-rolled shutdown ordering, no
per-service `/healthz` handler. Wire one `App` once; every binary behaves
identically under SIGINT/SIGTERM, a failing dependency, and a Kubernetes probe.
The single external dependency is golang.org/x/sync/errgroup (Go-team owned,
de-facto stdlib) for the ctx-aware goroutine group. Everything else is stdlib
(context, os/signal, log/slog, errors, net/http).
lifecycle.App is the one owner that runs the work, flips readiness, drains, and
tears down in order. Construct it the canonical way and call Run once:

    app, err := lifecycle.New(cfg.Lifecycle, lifecycle.WithLogger(log))
    // or: lifecycle.FromConfig(cfg.Lifecycle) // consumes the shikumi sub-struct
    if err != nil { return err }
    app.Go("reconcile", reconcileLoop). // errgroup (ctx-aware)
        Actor("http", srv.serve, func(error){ srv.Close() }). // oklog/run shape
        Supervise("poller", poll, lifecycle.DefaultBackoff()). // suture-style restart
        Probe("db", lifecycle.ProbeFunc(db.PingContext)). // readiness dependency
        OnShutdown("db", func(context.Context) error { return db.Close() })
    return app.Run(ctx) // encodes the k8s shutdown choreography

`Run` choreography, in order:

- derive a signal-aware run context;
- mount the health planes (+ the deferred `/metrics` seam) and start any periodic probe loops;
- run the work group: the first fatal error or a delivered signal begins teardown;
- flip readiness DOWN first → drain (sleep `DrainInterval` and, concurrently, run every `OnDrain` remote-session `Drainable` inside that one window) → cancel the group → wait → run the LIFO `Shutdown` stack under `ShutdownGrace` (kept below the pod's `terminationGracePeriodSeconds`). This ordering eliminates rolling-deploy 502s.

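
The ordering above can be illustrated with a stdlib-only sketch. This is not lifecycle-go's implementation: `teardown`, `demo`, and the hook names are hypothetical, and the `ShutdownGrace` deadline and `OnDrain` hooks are left out to keep the sequence visible.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// teardown records each step so the order can be inspected.
// Hypothetical sketch; names are not part of lifecycle-go.
func teardown(ready *atomic.Bool, drain time.Duration, cancel context.CancelFunc,
	work *sync.WaitGroup, hooks []func() string) []string {
	var steps []string

	ready.Store(false) // 1. readiness goes DOWN first
	steps = append(steps, "readiness-down")

	time.Sleep(drain) // 2. drain window: load balancers stop routing here
	steps = append(steps, "drained")

	cancel() // 3. cancel the work group
	steps = append(steps, "cancelled")

	work.Wait() // 4. wait for workers to return
	steps = append(steps, "workers-done")

	for i := len(hooks) - 1; i >= 0; i-- { // 5. shutdown stack, last-in-first-out
		steps = append(steps, hooks[i]())
	}
	return steps
}

func demo() []string {
	ctx, cancel := context.WithCancel(context.Background())
	var ready atomic.Bool
	ready.Store(true)

	var work sync.WaitGroup
	work.Add(1)
	go func() {
		defer work.Done()
		<-ctx.Done() // a worker that watches the context
	}()

	hooks := []func() string{
		func() string { return "close-db" },   // registered first, runs last
		func() string { return "close-http" }, // registered last, runs first
	}
	return teardown(&ready, 10*time.Millisecond, cancel, &work, hooks)
}

func main() {
	for _, s := range demo() {
		fmt.Println(s)
	}
}
```

Because readiness drops before the context is cancelled, in-flight requests keep being served during the drain window while new traffic is routed elsewhere.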
The readiness-down sleep only tells external load balancers to stop sending new
traffic — it does nothing about sessions the process is already holding on a
remote peer (SRA SSH/web sessions, a SOCKS tunnel, an event-forwarding channel,
a long-poll subscription). A local LIFO OnShutdown close stack cannot reach
those: the sessions live on the peer, not in this process. OnDrain registers a
Drainable that runs during the drain window to release them:

    app.OnDrain("sra-sessions", lifecycle.DrainFunc(func(ctx context.Context) error {
        sra.StopAcceptingSessions()           // refuse new remote sessions
        return sra.WaitForActiveSessions(ctx) // let live ones finish within ctx
    }))

All registered drainers run concurrently (different peers, independent waits)
under the single DrainInterval budget — a Drainable that ignores its ctx
deadline is abandoned (and reported as an error) when the window closes, never
blocking past the budget. Panics are isolated. Registering after Run starts is
ignored. With no drainers registered the drain is exactly the historical sleep.
Use OnShutdown for local resources, OnDrain for remote sessions.
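
The drain semantics described above (concurrent drainers, one shared budget, abandonment at the deadline, panic isolation) might look roughly like this in plain Go. `drainAll` and its map-of-failures return value are assumptions made for illustration, not the library's API.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// drainAll runs every drainer concurrently under one shared budget and
// returns the failures keyed by drainer name. Hypothetical sketch.
func drainAll(budget time.Duration, drainers map[string]func(context.Context) error) map[string]error {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()

	type result struct {
		name string
		err  error
	}
	// Buffered so a drainer that finishes after the deadline never blocks.
	results := make(chan result, len(drainers))
	pending := make(map[string]bool, len(drainers))

	for name, fn := range drainers {
		pending[name] = true
		go func(name string, fn func(context.Context) error) {
			defer func() {
				if r := recover(); r != nil { // isolate a panic as an error
					results <- result{name, fmt.Errorf("panic: %v", r)}
				}
			}()
			results <- result{name, fn(ctx)}
		}(name, fn)
	}

	failed := make(map[string]error)
	for len(pending) > 0 {
		select {
		case r := <-results:
			delete(pending, r.name)
			if r.err != nil {
				failed[r.name] = r.err
			}
		case <-ctx.Done():
			// Window closed: abandon whatever is still running and report it.
			for name := range pending {
				failed[name] = fmt.Errorf("abandoned at deadline")
			}
			return failed
		}
	}
	return failed
}

func demo() map[string]error {
	return drainAll(50*time.Millisecond, map[string]func(context.Context) error{
		"ok":    func(context.Context) error { return nil },
		"panic": func(context.Context) error { panic("boom") },
		"stuck": func(context.Context) error { time.Sleep(5 * time.Second); return nil }, // ignores ctx
	})
}

func main() {
	for name, err := range demo() {
		fmt.Printf("%s: %v\n", name, err)
	}
}
```

The buffered channel is the detail that lets the caller walk away from a stuck drainer: its goroutine can still send its result later without blocking, and the caller returns as soon as the budget expires.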

| Verb | Shape | Use when |
|---|---|---|
| `app.Go(name, fn)` | x/sync/errgroup (ctx-aware) | the work watches a context |
| `app.Actor(name, execute, interrupt)` | oklog/run pair (in-package) | the work blocks and can't watch ctx (Accept loops) |
| `app.Supervise(name, fn, backoff)` | suture-style restart (in-package) | the work should restart with backoff on crash |

Supervise honours ErrDoNotRestart (stop cleanly) and ErrTerminate (stop and
propagate). Every spawned unit and shutdown hook recovers panics into errors.
App composes four primitives that are also usable directly:

- `SignalContext`: a `context.Context` that cancels when the process is signalled (SIGINT/SIGTERM by default). The root of every run.
- `Shutdown`: named hooks run in LIFO order under a single deadline, with errors aggregated (`errors.Join`). The observable analog of a defer stack.
- `Registry` / `Probe`: liveness/readiness/startup aggregation, tri-state (up/down/unknown), optional per-probe `WithCache`/`WithPeriodic`, transition listeners, plus a stdlib `http.Handler` exposing `/livez`, `/healthz`, `/readyz`, `/startupz`. lifecycle-go is the single fleet owner of the health planes.
- `RunLoop`: a ticking work loop that stops on context cancellation, with optional exponential backoff on error.

    package main

    import (
        "context"
        "log/slog"
        "net/http"
        "time"

        "github.com/pleme-io/lifecycle-go"
    )

    // db and reconcile are assumed to be defined elsewhere in the service.

    func main() {
        // 1. Root context cancels on SIGINT/SIGTERM.
        ctx, stop := lifecycle.SignalContext(context.Background())
        defer stop()

        // 2. Ordered, bounded teardown (LIFO: reverse of registration order).
        //    The DB is registered first so the HTTP server shuts down before it.
        srv := &http.Server{Addr: ":8080"}
        sd := lifecycle.NewShutdown(slog.Default())
        sd.Add("db", func(context.Context) error { return db.Close() })
        sd.Add("http-server", srv.Shutdown)

        // 3. Health surface for Kubernetes probes.
        reg := lifecycle.NewRegistry()
        reg.RegisterLiveness("self", lifecycle.ProbeFunc(func(context.Context) error { return nil }))
        reg.RegisterReadiness("db", lifecycle.ProbeFunc(db.PingContext))
        srv.Handler = reg.Handler() // serves /livez, /healthz, /readyz, /startupz
        go srv.ListenAndServe()

        // 4. Background work loop, with backoff on error.
        go lifecycle.RunLoop(ctx, 30*time.Second, reconcile,
            lifecycle.WithLoopLogger(slog.Default()),
            lifecycle.WithBackoff(5*time.Minute),
        )

        <-ctx.Done() // a signal arrived
        _ = sd.Run(context.Background(), 30*time.Second)
    }

| Path | Plane | Question | Failure action (k8s) |
|---|---|---|---|
| `/healthz`, `/livez` | liveness | "is the process wedged?" | restart the pod |
| `/readyz` | readiness | "can it serve traffic now?" | pull from rotation |
| `/startupz` | startup | "has it finished booting?" | gate liveness during boot |

Each returns 200 when its plane is OK and 503 otherwise, with a small JSON
body — {"status":"ok"|"fail","checks":{<name>:"ok"|<error>}} — for humans and
log scrapers. Keep liveness probes dependency-free so a flaky downstream never
triggers restarts.
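
To make the response contract concrete, here is a stdlib-only sketch of a single plane handler that returns 200 or 503 with the JSON shape described above. `planeHandler`, `report`, and `probe` are hypothetical names; the library's own handler may be structured differently.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// probe is a single health check (hypothetical type for this sketch).
type probe func(context.Context) error

// report mirrors the JSON body described above.
type report struct {
	Status string            `json:"status"`
	Checks map[string]string `json:"checks"`
}

// planeHandler answers 200 when every probe passes and 503 otherwise.
func planeHandler(probes map[string]probe) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		rep := report{Status: "ok", Checks: make(map[string]string, len(probes))}
		code := http.StatusOK
		for name, p := range probes {
			if err := p(r.Context()); err != nil {
				rep.Checks[name] = err.Error()
				rep.Status, code = "fail", http.StatusServiceUnavailable
			} else {
				rep.Checks[name] = "ok"
			}
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(rep)
	}
}

// call issues one in-memory GET against the handler.
func call(probes map[string]probe) (int, string) {
	rec := httptest.NewRecorder()
	planeHandler(probes)(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code, rec.Body.String()
}

func demo() (okCode, failCode int, failBody string) {
	okCode, _ = call(map[string]probe{
		"db": func(context.Context) error { return nil },
	})
	failCode, failBody = call(map[string]probe{
		"db": func(context.Context) error { return errors.New("connection refused") },
	})
	return okCode, failCode, failBody
}

func main() {
	okCode, failCode, failBody := demo()
	fmt.Println(okCode, failCode, failBody)
}
```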
Hooks run last-in-first-out: the resource registered last (typically
acquired last) is released first. The HTTP server stops accepting before the DB
pool closes, the pool closes before the metrics flusher, and so on. Errors are
aggregated with errors.Join, never short-circuited — one failing close
does not skip the rest. Once the per-shutdown deadline passes, remaining hooks
are skipped and reported.
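
The same behaviour can be sketched with the standard library alone. `runLIFO` and `hook` are hypothetical names, and this simplified version only checks the deadline between hooks; it does not interrupt a hook that is already running.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// hook is a named teardown step (hypothetical type for this sketch).
type hook struct {
	name string
	fn   func(context.Context) error
}

// runLIFO runs hooks newest-first under one deadline and joins every error,
// so one failing close never skips the rest.
func runLIFO(hooks []hook, grace time.Duration) (ran []string, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()

	var errs []error
	for i := len(hooks) - 1; i >= 0; i-- {
		h := hooks[i]
		if ctx.Err() != nil { // deadline passed: skip and report
			errs = append(errs, fmt.Errorf("%s: skipped, deadline exceeded", h.name))
			continue
		}
		ran = append(ran, h.name)
		if e := h.fn(ctx); e != nil {
			errs = append(errs, fmt.Errorf("%s: %w", h.name, e))
		}
	}
	return ran, errors.Join(errs...)
}

func demo() ([]string, error) {
	hooks := []hook{
		{"metrics-flusher", func(context.Context) error { return nil }},
		{"db", func(context.Context) error { return errors.New("close failed") }},
		{"http-server", func(context.Context) error { return nil }},
	}
	return runLIFO(hooks, time.Second)
}

func main() {
	ran, err := demo()
	fmt.Println(ran, err)
}
```

In `demo`, the DB close fails, yet the metrics flusher registered before it still runs, and the failure surfaces in the joined error.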

`RunLoop` options:

- `WithImmediateTick()`: fire once on entry before the first interval.
- `WithStopOnError()`: a tick error terminates the loop (becomes the return).
- `WithBackoff(max)`: double the inter-tick delay on consecutive errors up to `max`, resetting on the first success.
- `WithLoopLogger(log)`: log tick errors and backoff decisions.

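
As a worked example of the doubling rule, the helper below computes the delay for a given run of consecutive errors. `nextDelay` is a hypothetical function, and the exact rounding or reset behaviour inside `RunLoop` may differ; the sketch only assumes "double per consecutive error, cap at the maximum".

```go
package main

import (
	"fmt"
	"time"
)

// nextDelay returns the wait before the next tick: the base interval doubled
// once per consecutive error and capped at ceiling. With zero consecutive
// errors (that is, after a success) it is the base interval again.
func nextDelay(interval, ceiling time.Duration, consecutiveErrs int) time.Duration {
	d := interval
	for i := 0; i < consecutiveErrs; i++ {
		d *= 2
		if d >= ceiling {
			return ceiling
		}
	}
	return d
}

// delaysSeconds lists the delay, in seconds, after 0 through 5 consecutive
// errors for a 30s interval and a 5m cap.
func delaysSeconds() []int {
	var out []int
	for errs := 0; errs <= 5; errs++ {
		out = append(out, int(nextDelay(30*time.Second, 5*time.Minute, errs)/time.Second))
	}
	return out
}

func main() {
	fmt.Println(delaysSeconds())
}
```

With a 30-second interval and a 5-minute cap, the delay grows 30s, 1m, 2m, 4m and then stays at 5m until a tick succeeds.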

    go build ./...
    go test ./...