pleme-io/lifecycle-go


lifecycle-go

Go representation of pleme-io's service-lifecycle model. The Go counterpart to the Rust service-lifecycle crate: the same model, so every Go service and tool supervises itself the same way — one signal-aware context, one ordered graceful shutdown, one health surface, one run-loop.

No ad-hoc signal.Notify boilerplate, no hand-rolled shutdown ordering, no per-service /healthz handler. Wire one App once; every binary behaves identically under SIGINT/SIGTERM, a failing dependency, and a Kubernetes probe.

The single external dependency is golang.org/x/sync/errgroup (Go-team owned, de-facto stdlib) for the ctx-aware goroutine group. Everything else is stdlib (context, os/signal, log/slog, errors, net/http).

App — the composed owner (BOREALIS §2.5)

lifecycle.App is the one owner that runs the work, flips readiness, drains, and tears down in order. Construct it the canonical way and call Run once:

app, err := lifecycle.New(cfg.Lifecycle, lifecycle.WithLogger(log))
// or: lifecycle.FromConfig(cfg.Lifecycle)   // consumes the shikumi sub-struct
if err != nil { return err }

app.Go("reconcile", reconcileLoop).                       // errgroup (ctx-aware)
    Actor("http",  srv.serve, func(error){ srv.Close() }). // oklog/run shape
    Supervise("poller", poll, lifecycle.DefaultBackoff()). // suture-style restart
    Probe("db", lifecycle.ProbeFunc(db.PingContext)).      // readiness dependency
    OnShutdown("db", func(context.Context) error { return db.Close() })

return app.Run(ctx)   // encodes the k8s shutdown choreography

Run choreography, in order:

  1. derive a signal-aware run context;
  2. mount the health planes (+ the deferred /metrics seam) and start any periodic probe loops;
  3. run the work group — the first fatal error or a delivered signal begins teardown;
  4. flip readiness DOWN first → drain (sleep DrainInterval and, concurrently, run every OnDrain remote-session Drainable inside that one window) → cancel the group → wait → run the LIFO Shutdown stack under ShutdownGrace (kept below the pod's terminationGracePeriodSeconds). Because the pod leaves rotation before it stops serving, this ordering is what prevents rolling-deploy 502s.
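The teardown order in step 4 can be sketched with the standard library alone. This is an illustration under stated assumptions, not lifecycle-go's implementation: `teardown`, `runDemo`, and the hook slice are hypothetical names, a `sync.WaitGroup` stands in for the work group, and the drain is a plain sleep.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// teardown is a hypothetical helper that mirrors the documented order:
// readiness down, drain, cancel, wait, then LIFO hooks under a grace period.
// It returns the names of the steps in the order they ran.
func teardown(ready *atomic.Bool, drain time.Duration, cancel context.CancelFunc,
	wg *sync.WaitGroup, grace time.Duration, hooks []func(context.Context) error) []string {
	var steps []string

	ready.Store(false) // 1. readiness fails first, so load balancers stop routing here
	steps = append(steps, "readiness-down")

	time.Sleep(drain) // 2. drain window: in-flight requests get time to finish
	steps = append(steps, "drain")

	cancel() // 3. cancel the context the workers watch
	steps = append(steps, "cancel")

	wg.Wait() // 4. wait for every worker to return
	steps = append(steps, "wait")

	ctx, done := context.WithTimeout(context.Background(), grace)
	defer done()
	for i := len(hooks) - 1; i >= 0; i-- { // 5. shutdown hooks, last registered first
		_ = hooks[i](ctx)
	}
	steps = append(steps, "shutdown")
	return steps
}

// runDemo wires one worker and two no-op hooks, then tears everything down.
func runDemo() []string {
	var ready atomic.Bool
	ready.Store(true)

	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-ctx.Done() // the worker exits once the context is cancelled
	}()

	noop := func(context.Context) error { return nil }
	return teardown(&ready, 10*time.Millisecond, cancel, &wg, time.Second,
		[]func(context.Context) error{noop, noop})
}

func main() {
	fmt.Println(runDemo())
}
```

The point of the sketch is the position of the readiness flip: it happens before anything is cancelled, so the process is still able to serve while it waits to be taken out of rotation.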

Draining remote sessions (OnDrain / Drainable)

The readiness-down sleep only tells external load balancers to stop sending new traffic — it does nothing about sessions the process is already holding on a remote peer (SRA SSH/web sessions, a SOCKS tunnel, an event-forwarding channel, a long-poll subscription). A local LIFO OnShutdown close stack cannot reach those: the sessions live on the peer, not in this process. OnDrain registers a Drainable that runs during the drain window to release them:

app.OnDrain("sra-sessions", lifecycle.DrainFunc(func(ctx context.Context) error {
    sra.StopAcceptingSessions()           // refuse new remote sessions
    return sra.WaitForActiveSessions(ctx) // let live ones finish within ctx
}))

All registered drainers run concurrently (different peers, independent waits) under the single DrainInterval budget. A Drainable that ignores its ctx deadline is abandoned (and reported as an error) when the window closes, so it never blocks past the budget. Panics are isolated. A drainer registered after Run starts is ignored. With no drainers registered, the drain is exactly the historical sleep. Use OnShutdown for local resources and OnDrain for remote sessions.
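The "one budget, concurrent drainers, abandon the stragglers" behaviour described above can be approximated with the standard library. The sketch below is an assumption about how such a window could work, not the library's code; `drainAll` and `demoDrain` are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// drainAll runs every drainer concurrently under one shared budget.
// A drainer that has not returned when the budget expires is abandoned and
// reported as an error; a panicking drainer is converted into an error.
func drainAll(budget time.Duration, drainers map[string]func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()

	type result struct {
		name string
		err  error
	}
	// Buffered so an abandoned drainer can still send later without blocking.
	results := make(chan result, len(drainers))
	pending := make(map[string]bool, len(drainers))

	for name, fn := range drainers {
		pending[name] = true
		go func(name string, fn func(context.Context) error) {
			defer func() {
				if r := recover(); r != nil { // isolate panics per drainer
					results <- result{name, fmt.Errorf("panic: %v", r)}
				}
			}()
			results <- result{name, fn(ctx)}
		}(name, fn)
	}

	var errs []error
	for len(pending) > 0 {
		select {
		case r := <-results:
			delete(pending, r.name)
			if r.err != nil {
				errs = append(errs, fmt.Errorf("drain %s: %w", r.name, r.err))
			}
		case <-ctx.Done(): // window closed: stop waiting, report whoever is left
			for name := range pending {
				errs = append(errs, fmt.Errorf("drain %s: abandoned after %v", name, budget))
			}
			return errors.Join(errs...)
		}
	}
	return errors.Join(errs...)
}

// demoDrain runs one well-behaved drainer and one that ignores its context.
func demoDrain() string {
	err := drainAll(50*time.Millisecond, map[string]func(context.Context) error{
		"fast":  func(context.Context) error { return nil },
		"stuck": func(context.Context) error { time.Sleep(time.Second); return nil }, // ignores ctx
	})
	if err == nil {
		return ""
	}
	return err.Error()
}

func main() {
	fmt.Println(demoDrain())
}
```

Note that abandoning a goroutine does not stop it; the sketch only stops waiting for it, which is the most a caller can do for code that ignores its context.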

Three goroutine shapes

| Verb | Shape | Use when |
| --- | --- | --- |
| app.Go(name, fn) | x/sync/errgroup (ctx-aware) | the work watches a context |
| app.Actor(name, execute, interrupt) | oklog/run pair (in-package) | the work blocks and can't watch ctx (Accept loops) |
| app.Supervise(name, fn, backoff) | suture-style restart (in-package) | the work should restart with backoff on crash |

Supervise honours ErrDoNotRestart (stop cleanly) and ErrTerminate (stop and propagate). Every spawned unit and shutdown hook recovers panics into errors.
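A restart loop in the spirit of Supervise might look like the following stdlib-only sketch. The sentinels here are local stand-ins for ErrDoNotRestart and ErrTerminate, `supervise` and `safeCall` are hypothetical names, and treating a nil return as a clean stop is an assumption of this sketch rather than documented behaviour.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Local stand-ins for the sentinels named above; the real ones live in lifecycle-go.
var (
	errDoNotRestart = errors.New("do not restart")
	errTerminate    = errors.New("terminate")
)

// safeCall runs fn and converts a panic into an ordinary error.
func safeCall(ctx context.Context, fn func(context.Context) error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic: %v", r)
		}
	}()
	return fn(ctx)
}

// supervise restarts fn after each failure, doubling the delay up to maxDelay.
func supervise(ctx context.Context, fn func(context.Context) error, initial, maxDelay time.Duration) error {
	delay := initial
	for {
		err := safeCall(ctx, fn)
		switch {
		case err == nil, errors.Is(err, errDoNotRestart):
			return nil // clean stop (nil-as-stop is an assumption of this sketch)
		case errors.Is(err, errTerminate):
			return err // stop and propagate to the caller
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay): // back off before the next attempt
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}

// demoSupervise panics once, fails once, then asks not to be restarted.
func demoSupervise() (attempts int, err error) {
	fn := func(context.Context) error {
		attempts++
		switch attempts {
		case 1:
			panic("boom")
		case 2:
			return errors.New("transient")
		default:
			return errDoNotRestart
		}
	}
	err = supervise(context.Background(), fn, time.Millisecond, 10*time.Millisecond)
	return attempts, err
}

func main() {
	fmt.Println(demoSupervise())
}
```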

The four leaf primitives

App composes four primitives that are also usable directly:

  • SignalContext: a context.Context that cancels when the process is signalled (SIGINT/SIGTERM by default). The root of every run.
  • Shutdown: named hooks run in LIFO order under a single deadline, with errors aggregated (errors.Join). The observable analog of a defer stack.
  • Registry / Probe: liveness/readiness/startup aggregation, tri-state (up/down/unknown), optional per-probe WithCache/WithPeriodic, transition listeners, plus a stdlib http.Handler exposing /livez, /healthz, /readyz, and /startupz. lifecycle-go is the single fleet owner of the health planes.
  • RunLoop: a ticking work loop that stops on context cancellation, with optional exponential backoff on error.
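For readers unfamiliar with the signal-aware context idea, the standard library already offers the underlying mechanism in `signal.NotifyContext`. The sketch below shows that mechanism only; whether SignalContext is built on it is an assumption. `demoSignal` is a hypothetical helper, and sending a signal to the current process with `syscall.Kill` assumes a Unix-like system.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// demoSignal reports whether a context derived from signal.NotifyContext is
// cancelled after the process receives SIGTERM.
func demoSignal() bool {
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop() // restore default signal handling when done

	// Simulate `kill -TERM <pid>` against ourselves (Unix-like systems only).
	if err := syscall.Kill(os.Getpid(), syscall.SIGTERM); err != nil {
		return false
	}

	select {
	case <-ctx.Done(): // the signal cancelled the context
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("cancelled by signal:", demoSignal())
}
```

Everything that takes this context (workers, probes, loops) then stops on the same event without registering its own signal handler.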

Usage

package main

import (
	"context"
	"log/slog"
	"net/http"
	"time"

	// Explicit alias: the package name (lifecycle) differs from the path's last element.
	lifecycle "github.com/pleme-io/lifecycle-go"
)

func main() {
	// 1. Root context cancels on SIGINT/SIGTERM.
	ctx, stop := lifecycle.SignalContext(context.Background())
	defer stop()

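	// 2. Ordered, bounded teardown. Hooks run LIFO, so register in acquisition
	//    order: the DB is added first and closed last, after the HTTP server.
	//    db (for example a *sql.DB) is assumed to be opened elsewhere.
	srv := &http.Server{Addr: ":8080"}
	sd := lifecycle.NewShutdown(slog.Default())
	sd.Add("db", func(context.Context) error { return db.Close() })
	sd.Add("http-server", srv.Shutdown)

	// 3. Health surface for Kubernetes probes.
	reg := lifecycle.NewRegistry()
	reg.RegisterLiveness("self", lifecycle.ProbeFunc(func(context.Context) error { return nil }))
	reg.RegisterReadiness("db", lifecycle.ProbeFunc(db.PingContext))
	srv.Handler = reg.Handler() // serves /livez, /healthz, /readyz and /startupz

	go func() {
		// http.ErrServerClosed is the expected result of a graceful Shutdown.
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			slog.Error("http server failed", "err", err)
		}
	}()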

	// 4. Background work loop, with backoff on error.
	go lifecycle.RunLoop(ctx, 30*time.Second, reconcile,
		lifecycle.WithLoopLogger(slog.Default()),
		lifecycle.WithBackoff(5*time.Minute),
	)

	<-ctx.Done() // a signal arrived
	_ = sd.Run(context.Background(), 30*time.Second)
}

Health endpoints

| Path | Plane | Question | Failure action (k8s) |
| --- | --- | --- | --- |
| /healthz, /livez | liveness | "is the process wedged?" | restart the pod |
| /readyz | readiness | "can it serve traffic now?" | pull from rotation |
| /startupz | startup | "has it finished booting?" | gate liveness during boot |

Each returns 200 when its plane is OK and 503 otherwise, with a small JSON body — {"status":"ok"|"fail","checks":{<name>:"ok"|<error>}} — for humans and log scrapers. Keep liveness probes dependency-free so a flaky downstream never triggers restarts.
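The 200/503 contract and the JSON body shape can be illustrated with a small stdlib handler. This is a sketch of the described response format, not the Registry's handler: `planeHandler`, `probe`, and `demoHealth` are hypothetical names, and the exact error strings the real handler emits are not specified in this document.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type check func(context.Context) error

// planeHandler answers 200 when every check passes and 503 otherwise,
// with a body of the form {"status":...,"checks":{name:"ok"|error}}.
func planeHandler(checks map[string]check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body := struct {
			Status string            `json:"status"`
			Checks map[string]string `json:"checks"`
		}{Status: "ok", Checks: map[string]string{}}

		for name, c := range checks {
			if err := c(r.Context()); err != nil {
				body.Status = "fail"
				body.Checks[name] = err.Error()
			} else {
				body.Checks[name] = "ok"
			}
		}

		w.Header().Set("Content-Type", "application/json")
		if body.Status != "ok" {
			w.WriteHeader(http.StatusServiceUnavailable)
		}
		_ = json.NewEncoder(w).Encode(body)
	}
}

// probe issues an in-memory GET and returns the status code and body.
func probe(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

// demoHealth mounts a passing liveness plane and a failing readiness plane.
func demoHealth() (live, ready int, readyBody string) {
	mux := http.NewServeMux()
	mux.Handle("/livez", planeHandler(map[string]check{
		"self": func(context.Context) error { return nil }, // dependency-free
	}))
	mux.Handle("/readyz", planeHandler(map[string]check{
		"db": func(context.Context) error { return errors.New("connection refused") },
	}))
	live, _ = probe(mux, "/livez")
	ready, readyBody = probe(mux, "/readyz")
	return live, ready, readyBody
}

func main() {
	fmt.Println(demoHealth())
}
```

In the demo the database is down, so readiness fails while liveness stays green: the pod is pulled from rotation but not restarted, which is the behaviour the dependency-free liveness advice is meant to produce.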

Shutdown ordering

Hooks run last-in-first-out: the resource registered last (typically acquired last) is released first. The HTTP server stops accepting before the DB pool closes, the pool closes before the metrics flusher, and so on. Errors are aggregated with errors.Join, never short-circuited — one failing close does not skip the rest. Once the per-shutdown deadline passes, remaining hooks are skipped and reported.
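A compact way to see these rules together (LIFO order, joined errors, skip-and-report after the deadline) is the stdlib-only sketch below. It is not the Shutdown type itself; `hook`, `runLIFO`, and `demoShutdown` are hypothetical names, and the sketch relies on each hook honouring its context.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

type hook struct {
	name string
	fn   func(context.Context) error
}

// runLIFO runs hooks last-in-first-out under one deadline. Errors are joined
// rather than short-circuited; hooks that have not started when the deadline
// passes are skipped and reported.
func runLIFO(ctx context.Context, timeout time.Duration, hooks []hook) (ran []string, err error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	var errs []error
	for i := len(hooks) - 1; i >= 0; i-- {
		h := hooks[i]
		if ctx.Err() != nil { // deadline passed: skip but still report
			errs = append(errs, fmt.Errorf("%s: skipped: %w", h.name, ctx.Err()))
			continue
		}
		ran = append(ran, h.name)
		if e := h.fn(ctx); e != nil {
			errs = append(errs, fmt.Errorf("%s: %w", h.name, e))
		}
	}
	return ran, errors.Join(errs...)
}

// demoShutdown registers hooks in acquisition order: db, cache, http-server.
func demoShutdown() (order string, errMsg string) {
	ok := func(context.Context) error { return nil }
	hooks := []hook{
		{"db", ok},
		{"cache", func(context.Context) error { return errors.New("flush failed") }},
		{"http-server", ok},
	}
	ran, err := runLIFO(context.Background(), time.Second, hooks)
	if err != nil {
		errMsg = err.Error()
	}
	return strings.Join(ran, ","), errMsg
}

func main() {
	fmt.Println(demoShutdown())
}
```

The failing cache hook does not prevent the db hook from running, and the server (registered last) is the first thing closed.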

Run-loop options

  • WithImmediateTick() — fire once on entry before the first interval.
  • WithStopOnError() — a tick error terminates the loop (becomes the return).
  • WithBackoff(max) — double the inter-tick delay on consecutive errors up to max, resetting on the first success.
  • WithLoopLogger(log) — log tick errors and backoff decisions.
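The WithBackoff rule (double on consecutive errors, cap at max, reset on success) comes down to a few lines of arithmetic. The sketch below assumes the doubling starts from the base interval, which this document does not state; `nextDelay` and `demoBackoff` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// nextDelay returns the wait before the next tick: the base interval after a
// success, otherwise the previous delay doubled and capped at maxDelay.
func nextDelay(base, prev, maxDelay time.Duration, tickErr error) time.Duration {
	if tickErr == nil {
		return base // first success resets the backoff
	}
	d := prev * 2
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// demoBackoff replays four failures, one success, and one more failure
// against a 30s interval with a 5m cap.
func demoBackoff() string {
	base, maxDelay := 30*time.Second, 5*time.Minute
	failed := errors.New("tick failed")
	outcomes := []error{failed, failed, failed, failed, nil, failed}

	delay := base
	var delays []time.Duration
	for _, err := range outcomes {
		delay = nextDelay(base, delay, maxDelay, err)
		delays = append(delays, delay)
	}
	return fmt.Sprint(delays)
}

func main() {
	fmt.Println(demoBackoff()) // [1m0s 2m0s 4m0s 5m0s 30s 1m0s]
}
```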

Build & test

go build ./...
go test ./...
