Local SRE Learning Lab

This project is a local playground for learning practical SRE concepts on Windows with Docker Desktop:

Golden signals: latency, traffic, errors, saturation
SLI/SLO design and error budgets
Metrics, logs, and traces
Capacity planning basics
Alerting and resilience patterns
Automated run summaries
Load testing and behavior under stress

The lab uses small Node.js applications so we can practice:

Service-to-service tracing
Error propagation between services
Latency analysis
Load testing and bottleneck discovery

This version now uses three services plus Postgres so the traces are more realistic:

app-a -> app-b -> app-c
app-c -> postgres

What We Are Building

app-a: entry service that receives user traffic
app-b: middle-tier service called by app-a
app-c: deeper dependency called by app-b
OpenTelemetry Collector: central telemetry pipeline
Prometheus: metrics storage and querying
Loki: logs storage
Tempo: traces storage
Grafana: dashboards and correlation across metrics/logs/traces

Architecture

flowchart LR
    U["Load Generator / Browser"] --> A["app-a (Node.js API)"]
    A --> B["app-b (Node.js API)"]
    B --> C["app-c (Node.js API)"]
    C --> D["Postgres"]
    A --> O["OpenTelemetry Collector"]
    B --> O
    C --> O
    O --> P["Prometheus"]
    O --> L["Loki"]
    O --> T["Tempo"]
    G["Grafana"] --> P
    G --> L
    G --> T

Project Layout

docker-compose.yml: full local stack
services/app-a: entry app
services/app-b: downstream app
services/app-c: deeper dependency app
postgres: seeded demo database
grafana/dashboards: prebuilt Grafana dashboards
load-tests: k6 traffic scripts
otel/collector-config.yaml: telemetry pipelines
prometheus/prometheus.yml: Prometheus scrape config
grafana/provisioning/datasources/datasources.yml: preconfigured Grafana datasources
grafana/provisioning/dashboards/dashboards.yml: dashboard provisioning
docs/learning-plan.md: what to learn and how to use this repo
docs/sli-slo-error-budget.md: SLIs, SLOs, and error budget examples
docs/capacity-planning.md: capacity planning approach
docs/15-minute-capacity-exercise.md: guided 15-minute exercise, reproduction steps, and dashboard interpretation
docs/sre-learning-path.md: beginner to advanced roadmap for this lab
docs/roadmap-v2-v4.md: staged roadmap for evolving the lab into a more production-like platform
docs/sre-learning-interview-guide.md: learning notes and interview preparation guide
docs/incident-runbook.md: investigation checklist and mitigation guide
docs/services-overview.md: each service, its role, and its telemetry
docs/dashboard-panel-guide.md: each dashboard and panel explained
docs/deployment-validation.md: deployment, validation, and GitHub handoff steps
docs/manual-load-testing.md: exact terminal-driven load testing, including 100 req/sec examples
docs/load-testing-cheatsheet.md: quick copy-paste load commands for 50, 100, and 200 req/sec

Quick Start

Install Docker Desktop and make sure Linux containers are enabled.
Copy .env.example to .env if you want custom ports, credentials, or report path overrides.
From this folder, run:

docker compose up --build

Cross-platform bootstrap scripts:

Windows PowerShell:

.\scripts\bootstrap.ps1

Linux / macOS:

./scripts/bootstrap.sh

Open these URLs:

Grafana: http://localhost:3000
Prometheus: http://localhost:9090
Jaeger: http://localhost:16686
Dozzle logs UI: http://localhost:8080
app-a: http://localhost:3001/health
app-b: http://localhost:3002/health
app-c: http://localhost:3003/health
Postgres: localhost:5432 (sre / sre, database srelab)
Browser control panel: http://localhost:3001

Grafana default login:

Username: admin
Password: admin

You can override these in .env.

Demo Endpoints

Use app-a as the entry point:

GET /health
GET /ready
GET /api/demo

Example:

curl "http://localhost:3001/api/demo?items=3&latencyMs=100"

To simulate failures:

curl "http://localhost:3001/api/demo?failureRate=0.4"

To simulate extra CPU work:

curl "http://localhost:3001/api/demo?cpuMs=150"

To simulate deeper dependency latency:

curl "http://localhost:3001/api/demo?dependencyLatencyMs=120"

Screenshots

Browser Control Panel

Grafana Golden Signals

Capacity Planning

SLO and Error Budget

Jaeger Trace Waterfall View

Dozzle Live Logs

Load Testing

For exact fixed-rate traffic like 100 req/sec, use the dedicated guide:

Most common exact 100 req/sec runs:

PowerShell:

docker run --rm -i -v "${PWD}\load-tests:/scripts" grafana/k6 run /scripts/sre-demo.js --env BASE_URL=http://host.docker.internal:3001 --env SCENARIO=fixed --env PRESET=baseline --env RATE=100 --env DURATION=1m

docker run --rm -i -v "${PWD}\load-tests:/scripts" grafana/k6 run /scripts/sre-demo.js --env BASE_URL=http://host.docker.internal:3001 --env SCENARIO=fixed --env PRESET=retry-storm --env RATE=100 --env DURATION=1m

docker run --rm -i -v "${PWD}\load-tests:/scripts" grafana/k6 run /scripts/sre-demo.js --env BASE_URL=http://host.docker.internal:3001 --env SCENARIO=fixed --env PRESET=db-saturation --env RATE=100 --env DURATION=1m

Bash:

docker run --rm -i -v "$(pwd)/load-tests:/scripts" grafana/k6 run /scripts/sre-demo.js --env BASE_URL=http://host.docker.internal:3001 --env SCENARIO=fixed --env PRESET=baseline --env RATE=100 --env DURATION=1m

docker run --rm -i -v "$(pwd)/load-tests:/scripts" grafana/k6 run /scripts/sre-demo.js --env BASE_URL=http://host.docker.internal:3001 --env SCENARIO=fixed --env PRESET=retry-storm --env RATE=100 --env DURATION=1m

docker run --rm -i -v "$(pwd)/load-tests:/scripts" grafana/k6 run /scripts/sre-demo.js --env BASE_URL=http://host.docker.internal:3001 --env SCENARIO=fixed --env PRESET=db-saturation --env RATE=100 --env DURATION=1m

Use the browser UI for interactive learning. Use k6 for exact request-rate tests.

k6 script:

PowerShell:

k6 run .\load-tests\sre-demo.js

Bash:

k6 run ./load-tests/sre-demo.js

Docker-based k6 run if you do not want a local install:

PowerShell:

docker run --rm -i -v "${PWD}\load-tests:/scripts" grafana/k6 run /scripts/sre-demo.js --env BASE_URL=http://host.docker.internal:3001 --env SCENARIO=baseline

Bash:

docker run --rm -i -v "$(pwd)/load-tests:/scripts" grafana/k6 run /scripts/sre-demo.js --env BASE_URL=http://host.docker.internal:3001 --env SCENARIO=baseline

Scenario examples:

PowerShell:

k6 run --env SCENARIO=baseline .\load-tests\sre-demo.js
k6 run --env SCENARIO=latency .\load-tests\sre-demo.js
k6 run --env SCENARIO=errors .\load-tests\sre-demo.js
k6 run --env SCENARIO=stress .\load-tests\sre-demo.js

Bash:

k6 run --env SCENARIO=baseline ./load-tests/sre-demo.js
k6 run --env SCENARIO=latency ./load-tests/sre-demo.js
k6 run --env SCENARIO=errors ./load-tests/sre-demo.js
k6 run --env SCENARIO=stress ./load-tests/sre-demo.js

Quick smoke test:

PowerShell:

k6 run --env SCENARIO=baseline --env QUICK=1 .\load-tests\sre-demo.js

Bash:

k6 run --env SCENARIO=baseline --env QUICK=1 ./load-tests/sre-demo.js

Basic PowerShell loop if you do not want to install k6:

1..100 | ForEach-Object {
  Start-Job { curl "http://localhost:3001/api/demo?latencyMs=50" } | Out-Null
}

Or use a dedicated tool like k6 or hey from your machine.

Suggested Learning Flow

Start the stack and generate a small amount of traffic.
Open Grafana and inspect:
- Request rate
- Latency percentiles
- Error rate
- Trace spans between app-a, app-b, and app-c
- Application logs for failed requests
Increase latency and failure rate through query parameters.
Define an SLI/SLO target and observe whether the system meets it.
Load test the stack and estimate practical local capacity.
Translate the observations into a simple capacity plan.

Included Dashboards

Grafana now auto-loads:

SRE Golden Signals: traffic, latency, errors, and runtime saturation indicators
Service Dependency Overview: request path visibility for app-a, app-b, and app-c
SLO and Error Budget: availability SLI, latency SLI, burn-rate views, and remaining budget
Database Health: Postgres exporter metrics and app-c DB query behavior
Capacity Planning: throughput, p95 latency, error ratio, event-loop pressure, and DB saturation
Alerting and Runbook: alert-condition visibility plus investigation steps

Prometheus alerting now includes:

latency SLO breach alerts
fast error-budget burn alerts
DB latency and connection-pressure alerts
resilience activity alerts for fallbacks and circuit-breaker events

Resilience patterns in the app now include:

downstream timeout control
retry control
circuit-breaker behavior
graceful stub fallback mode

Dedicated UIs:

Jaeger: dedicated trace search and span waterfall view
Dozzle: dedicated live container logs view

Deployment and validation steps are documented in:

Session Report Export

After a scheduled run or guided experiment, you can export:

JSON from the browser UI
Markdown from the browser UI
HTML from the browser UI
PDF with the export script:

PowerShell:

.\scripts\export-session-report.ps1 -Pdf

Linux / macOS shell:

python3 -m webbrowser "http://localhost:3001/api/session-report/latest/html"

This creates files in .\reports.

You can also browse persisted reports in the app at:

http://localhost:3001/reports

Config Defaults And Overrides

This lab now supports defaults with optional overrides through .env.

Examples:

host ports
Grafana admin username and password
Postgres database, user, and password
DB pool max
reports host directory

Start from:

.env.example

This makes the setup easier to reuse across:

Windows
Linux
macOS

What Persists After Git Pull Or Restart

If you git pull and start the stack again, the application setup comes from code and should work automatically after:

PowerShell:

docker compose up --build -d

Bash:

docker compose up --build -d

State behavior:

persists in Docker volume:
- Postgres data in postgres_data
- app log files in app_logs
recreated from code on startup:
- Grafana dashboards
- Prometheus rules
- application code and routes
- docs and scripts
does not persist unless you export it:
- in-memory browser/guided session history
- latest generated session report in app-a

So:

your database content stays unless you remove Docker volumes
your dashboards and rules come from the repo automatically
session reports are now persisted automatically in reports

Reset Scripts

Windows PowerShell:

.\scripts\reset-lab.ps1
.\scripts\reset-lab.ps1 -RemoveVolumes

Linux / macOS:

./scripts/reset-lab.sh
./scripts/reset-lab.sh --volumes

Reproduce The 15-Minute Capacity Exercise

Use the dedicated guide here:

docs/15-minute-capacity-exercise.md

That guide explains:

how the 15-minute run was executed
which phases were used
which dashboards to watch during the run
why low RPS can still produce very high latency
how to decide whether to scale app containers or fix the database tier first

Good Extra Topics To Add Later

Alert rules for burn-rate alerts
Synthetic checks
Chaos testing
Retry and timeout tuning
Circuit breaker patterns
Horizontal scaling experiments
Queue-based workloads
Database dependency simulation

Local Machine Note

This project is intended to run on your local machine.

The repository files live in your local workspace after you clone it.
Containers started with Docker Desktop or Docker Engine also run locally.
This is a local lab setup, not a hosted or remote environment.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Local SRE Learning Lab

What We Are Building

Architecture

Project Layout

Quick Start

Demo Endpoints

Screenshots

Browser Control Panel

Grafana Golden Signals

Capacity Planning

SLO and Error Budget

Jaeger Trace Waterfall View

Dozzle Live Logs

Load Testing

Suggested Learning Flow

Included Dashboards

Session Report Export

Config Defaults And Overrides

What Persists After Git Pull Or Restart

Reset Scripts

Reproduce The 15-Minute Capacity Exercise

Good Extra Topics To Add Later

Local Machine Note

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
docs		docs
grafana		grafana
load-tests		load-tests
loki		loki
otel		otel
postgres		postgres
prometheus		prometheus
scripts		scripts
services		services
tempo		tempo
.gitignore		.gitignore
README.md		README.md
docker-compose.yml		docker-compose.yml

Folders and files

Latest commit

History

Repository files navigation

Local SRE Learning Lab

What We Are Building

Architecture

Project Layout

Quick Start

Demo Endpoints

Screenshots

Browser Control Panel

Grafana Golden Signals

Capacity Planning

SLO and Error Budget

Jaeger Trace Waterfall View

Dozzle Live Logs

Load Testing

Suggested Learning Flow

Included Dashboards

Session Report Export

Config Defaults And Overrides

What Persists After Git Pull Or Restart

Reset Scripts

Reproduce The 15-Minute Capacity Exercise

Good Extra Topics To Add Later

Local Machine Note

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages