This project is a local playground for learning practical SRE concepts on Windows with Docker Desktop:
- Golden signals: latency, traffic, errors, saturation
- SLI/SLO design and error budgets
- Metrics, logs, and traces
- Capacity planning basics
- Alerting and resilience patterns
- Automated run summaries
- Load testing and behavior under stress
The lab uses small Node.js applications so we can practice:
- Service-to-service tracing
- Error propagation between services
- Latency analysis
- Load testing and bottleneck discovery
This version now uses three services plus Postgres so the traces are more realistic:
app-a -> app-b -> app-c
app-c -> postgres
- app-a: entry service that receives user traffic
- app-b: middle-tier service called by app-a
- app-c: deeper dependency called by app-b
- OpenTelemetry Collector: central telemetry pipeline
- Prometheus: metrics storage and querying
- Loki: logs storage
- Tempo: traces storage
- Grafana: dashboards and correlation across metrics/logs/traces
flowchart LR
U["Load Generator / Browser"] --> A["app-a (Node.js API)"]
A --> B["app-b (Node.js API)"]
B --> C["app-c (Node.js API)"]
C --> D["Postgres"]
A --> O["OpenTelemetry Collector"]
B --> O
C --> O
O --> P["Prometheus"]
O --> L["Loki"]
O --> T["Tempo"]
G["Grafana"] --> P
G --> L
G --> T
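To make the service-to-service tracing in this diagram more concrete: OpenTelemetry SDKs by default carry trace context between services in the W3C Trace Context `traceparent` header. The sketch below parses and rebuilds such a header by hand, purely for illustration. It assumes the lab's services use the default propagator (this README does not say), and the helper names `parseTraceparent` and `childTraceparent` are hypothetical. In a real service the SDK instrumentation handles this for you.

```javascript
// Illustrative only: hand-rolled handling of the W3C `traceparent` header.
// Format: version-traceid-parentid-flags
//   e.g. 00-<32 lowercase hex>-<16 lowercase hex>-<2 lowercase hex>
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns the parsed fields, or null when the header is malformed.
function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(String(header).trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // All-zero IDs are invalid according to the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Builds the header a service would send downstream: same trace ID,
// new span ID, and only the sampled bit carried over.
function childTraceparent(parent, newSpanId) {
  const flags = parent.sampled ? "01" : "00";
  return `${parent.version}-${parent.traceId}-${newSpanId}-${flags}`;
}
```

Because the trace ID stays the same across app-a, app-b, and app-c, a trace backend can stitch the three services' spans into one waterfall.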
- docker-compose.yml: full local stack
- services/app-a: entry app
- services/app-b: downstream app
- services/app-c: deeper dependency app
- postgres: seeded demo database
- grafana/dashboards: prebuilt Grafana dashboards
- load-tests: k6 traffic scripts
- otel/collector-config.yaml: telemetry pipelines
- prometheus/prometheus.yml: Prometheus scrape config
- grafana/provisioning/datasources/datasources.yml: preconfigured Grafana datasources
- grafana/provisioning/dashboards/dashboards.yml: dashboard provisioning
- docs/learning-plan.md: what to learn and how to use this repo
- docs/sli-slo-error-budget.md: SLIs, SLOs, and error budget examples
- docs/capacity-planning.md: capacity planning approach
- docs/15-minute-capacity-exercise.md: guided 15-minute exercise, reproduction steps, and dashboard interpretation
- docs/sre-learning-path.md: beginner to advanced roadmap for this lab
- docs/roadmap-v2-v4.md: staged roadmap for evolving the lab into a more production-like platform
- docs/sre-learning-interview-guide.md: learning notes and interview preparation guide
- docs/incident-runbook.md: investigation checklist and mitigation guide
- docs/services-overview.md: each service, its role, and its telemetry
- docs/dashboard-panel-guide.md: each dashboard and panel explained
- docs/deployment-validation.md: deployment, validation, and GitHub handoff steps
- docs/manual-load-testing.md: exact terminal-driven load testing, including 100 req/sec examples
- docs/load-testing-cheatsheet.md: quick copy-paste load commands for 50, 100, and 200 req/sec
- Install Docker Desktop and make sure Linux containers are enabled.
- Copy .env.example to .env if you want custom ports, credentials, or report path overrides.
- From this folder, run:
  docker compose up --build
Cross-platform bootstrap scripts:
- Windows PowerShell:
  .\scripts\bootstrap.ps1
- Linux / macOS:
  ./scripts/bootstrap.sh
- Open these URLs:
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
- Jaeger: http://localhost:16686
- Dozzle logs UI: http://localhost:8080
- app-a: http://localhost:3001/health
- app-b: http://localhost:3002/health
- app-c: http://localhost:3003/health
- Postgres: localhost:5432 (sre/sre, database srelab)
- Browser control panel: http://localhost:3001
Grafana default login:
- Username: admin
- Password: admin
You can override these in .env.
Use app-a as the entry point:
- GET /health
- GET /ready
- GET /api/demo
Example:
curl "http://localhost:3001/api/demo?items=3&latencyMs=100"
To simulate failures:
curl "http://localhost:3001/api/demo?failureRate=0.4"
To simulate extra CPU work:
curl "http://localhost:3001/api/demo?cpuMs=150"
To simulate deeper dependency latency:
curl "http://localhost:3001/api/demo?dependencyLatencyMs=120"
For exact fixed-rate traffic like 100 req/sec, use the dedicated guide: docs/manual-load-testing.md
Most common exact 100 req/sec runs:
PowerShell:
docker run --rm -i -v "${PWD}\load-tests:/scripts" grafana/k6 run /scripts/sre-demo.js --env BASE_URL=http://host.docker.internal:3001 --env SCENARIO=fixed --env PRESET=baseline --env RATE=100 --env DURATION=1m
docker run --rm -i -v "${PWD}\load-tests:/scripts" grafana/k6 run /scripts/sre-demo.js --env BASE_URL=http://host.docker.internal:3001 --env SCENARIO=fixed --env PRESET=retry-storm --env RATE=100 --env DURATION=1m
docker run --rm -i -v "${PWD}\load-tests:/scripts" grafana/k6 run /scripts/sre-demo.js --env BASE_URL=http://host.docker.internal:3001 --env SCENARIO=fixed --env PRESET=db-saturation --env RATE=100 --env DURATION=1m
Bash:
docker run --rm -i -v "$(pwd)/load-tests:/scripts" grafana/k6 run /scripts/sre-demo.js --env BASE_URL=http://host.docker.internal:3001 --env SCENARIO=fixed --env PRESET=baseline --env RATE=100 --env DURATION=1m
docker run --rm -i -v "$(pwd)/load-tests:/scripts" grafana/k6 run /scripts/sre-demo.js --env BASE_URL=http://host.docker.internal:3001 --env SCENARIO=fixed --env PRESET=retry-storm --env RATE=100 --env DURATION=1m
docker run --rm -i -v "$(pwd)/load-tests:/scripts" grafana/k6 run /scripts/sre-demo.js --env BASE_URL=http://host.docker.internal:3001 --env SCENARIO=fixed --env PRESET=db-saturation --env RATE=100 --env DURATION=1m
Use the browser UI for interactive learning. Use k6 for exact request-rate tests.
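When reading the results of a fixed-rate run, it helps to know how many requests are in flight at once. A minimal sketch using Little's Law (average concurrency = arrival rate × average latency); the function name `inFlightRequests` is hypothetical and not part of the lab:

```javascript
// Little's Law: average in-flight requests = arrival rate * average latency.
function inFlightRequests(ratePerSec, avgLatencyMs) {
  return (ratePerSec * avgLatencyMs) / 1000;
}

// 100 req/sec at 100 ms average latency keeps about 10 requests in flight.
console.log(inFlightRequests(100, 100)); // 10

// If latency degrades to 2 s, the same rate needs about 200 in flight.
console.log(inFlightRequests(100, 2000)); // 200
```

This is why a fixed request rate becomes much harder on the stack as latency rises: the rate stays at 100 req/sec, but the number of concurrent requests the services must hold grows with latency.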
k6 script:
PowerShell:
k6 run .\load-tests\sre-demo.js
Bash:
k6 run ./load-tests/sre-demo.js
Docker-based k6 run if you do not want a local install:
PowerShell:
docker run --rm -i -v "${PWD}\load-tests:/scripts" grafana/k6 run /scripts/sre-demo.js --env BASE_URL=http://host.docker.internal:3001 --env SCENARIO=baseline
Bash:
docker run --rm -i -v "$(pwd)/load-tests:/scripts" grafana/k6 run /scripts/sre-demo.js --env BASE_URL=http://host.docker.internal:3001 --env SCENARIO=baseline
Scenario examples:
PowerShell:
k6 run --env SCENARIO=baseline .\load-tests\sre-demo.js
k6 run --env SCENARIO=latency .\load-tests\sre-demo.js
k6 run --env SCENARIO=errors .\load-tests\sre-demo.js
k6 run --env SCENARIO=stress .\load-tests\sre-demo.js
Bash:
k6 run --env SCENARIO=baseline ./load-tests/sre-demo.js
k6 run --env SCENARIO=latency ./load-tests/sre-demo.js
k6 run --env SCENARIO=errors ./load-tests/sre-demo.js
k6 run --env SCENARIO=stress ./load-tests/sre-demo.js
Quick smoke test:
PowerShell:
k6 run --env SCENARIO=baseline --env QUICK=1 .\load-tests\sre-demo.js
Bash:
k6 run --env SCENARIO=baseline --env QUICK=1 ./load-tests/sre-demo.js
Basic PowerShell loop if you do not want to install k6:
1..100 | ForEach-Object {
  Start-Job { curl "http://localhost:3001/api/demo?latencyMs=50" } | Out-Null
}
Or use a dedicated tool like k6 or hey from your machine.
- Start the stack and generate a small amount of traffic.
- Open Grafana and inspect:
- Request rate
- Latency percentiles
- Error rate
- Trace spans between app-a, app-b, and app-c
- Application logs for failed requests
- Increase latency and failure rate through query parameters.
- Define an SLI/SLO target and observe whether the system meets it.
- Load test the stack and estimate practical local capacity.
- Translate the observations into a simple capacity plan.
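The SLI/SLO step above comes down to simple arithmetic on request counts. A minimal sketch, assuming an availability SLI defined as successful requests over total requests; the function name, field names, and sample numbers are hypothetical and are not taken from the lab's dashboards:

```javascript
// Availability SLI and error budget from raw request counts.
function errorBudgetReport({ total, failed, sloTarget }) {
  const sli = total === 0 ? 1 : (total - failed) / total;
  // Failures the SLO tolerates over this many requests.
  const allowedFailures = total * (1 - sloTarget);
  // Fraction of the budget still unspent (negative once overspent).
  const budgetRemaining = allowedFailures === 0 ? 0 : 1 - failed / allowedFailures;
  return { sli, allowedFailures, budgetRemaining, meetsSlo: sli >= sloTarget };
}

// Example: 10,000 requests, 25 failures, 99% availability target.
const report = errorBudgetReport({ total: 10000, failed: 25, sloTarget: 0.99 });
console.log((report.sli * 100).toFixed(2));             // "99.75"
console.log(Math.round(report.allowedFailures));        // 100
console.log((report.budgetRemaining * 100).toFixed(0)); // "75"
console.log(report.meetsSlo);                           // true
```

With a 99% target, 25 failures out of 10,000 requests spends about a quarter of the budget, so the service meets the SLO with roughly 75% of the budget left.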
Grafana now auto-loads:
- SRE Golden Signals: traffic, latency, errors, and runtime saturation indicators
- Service Dependency Overview: request path visibility for app-a, app-b, and app-c
- SLO and Error Budget: availability SLI, latency SLI, burn-rate views, and remaining budget
- Database Health: Postgres exporter metrics and app-c DB query behavior
- Capacity Planning: throughput, p95 latency, error ratio, event-loop pressure, and DB saturation
- Alerting and Runbook: alert-condition visibility plus investigation steps
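Several of these dashboards report p95 latency. As a rough illustration of what a percentile means, here is a nearest-rank calculation over raw samples. This is a sketch only: the function name and sample values are hypothetical, and Prometheus normally estimates quantiles from histogram buckets, so dashboard values will not match a raw-sample calculation exactly.

```javascript
// Nearest-rank percentile over raw latency samples (in ms).
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  // 1-based rank of the smallest value that covers p percent of samples.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

const latenciesMs = [120, 80, 95, 300, 110, 90, 105, 100, 85, 2000];
console.log(percentile(latenciesMs, 50)); // 100
console.log(percentile(latenciesMs, 95)); // 2000
```

The median looks healthy at 100 ms while p95 exposes the 2-second outlier, which is why the dashboards track percentiles rather than averages.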
Prometheus alerting now includes:
- latency SLO breach alerts
- fast error-budget burn alerts
- DB latency and connection-pressure alerts
- resilience activity alerts for fallbacks and circuit-breaker events
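A burn-rate alert compares the observed error ratio with the error ratio the SLO allows. The sketch below shows the arithmetic only; the function names, the 99% target, and the 30-day window are assumptions for illustration, and the actual thresholds live in the lab's Prometheus rules.

```javascript
// Burn rate = observed error ratio / error ratio allowed by the SLO.
// A burn rate of 1 spends the budget exactly over the SLO window.
function burnRate(errorRatio, sloTarget) {
  const allowed = 1 - sloTarget;
  if (allowed <= 0) throw new Error("SLO target must be below 1");
  return errorRatio / allowed;
}

// Days until the budget is exhausted if the current burn rate holds.
function daysToExhaustion(rate, windowDays = 30) {
  return rate <= 0 ? Infinity : windowDays / rate;
}

// 5% errors against a 99% SLO burns budget about 5x too fast.
const exampleRate = burnRate(0.05, 0.99);
console.log(exampleRate.toFixed(1));                   // "5.0"
console.log(daysToExhaustion(exampleRate).toFixed(1)); // "6.0"
```

A "fast burn" alert fires when this ratio stays high over a short window, which catches outages long before the monthly budget is actually gone.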
Resilience patterns in the app now include:
- downstream timeout control
- retry control
- circuit-breaker behavior
- graceful stub fallback mode
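To show how a circuit breaker and a stub fallback fit together, here is a minimal synchronous sketch. It is not the lab's implementation: the class, option names, and thresholds are hypothetical, and a real service would wrap asynchronous calls and combine this with timeouts and retries.

```javascript
// Minimal circuit breaker with a fallback. States:
//   closed    -> calls go through
//   open      -> calls are skipped and the fallback is returned
//   half-open -> after a cool-down, one trial call is allowed
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetAfterMs = 5000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now; // injectable clock, handy for tests
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  state() {
    if (this.openedAt === null) return "closed";
    return this.now() - this.openedAt >= this.resetAfterMs ? "half-open" : "open";
  }

  // Runs fn; returns fallback() when the circuit is open or fn throws.
  call(fn, fallback) {
    if (this.state() === "open") return fallback();
    try {
      const result = fn();
      this.failures = 0;
      this.openedAt = null; // a success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state() === "half-open" || this.failures >= this.failureThreshold) {
        this.openedAt = this.now(); // trip (or re-trip) the breaker
      }
      return fallback();
    }
  }
}
```

The key design point is that an open breaker stops calling the failing dependency at all, which protects it from retry storms while callers still get a degraded stub response.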
Dedicated UIs:
- Jaeger: dedicated trace search and span waterfall view
- Dozzle: dedicated live container logs view
Deployment and validation steps are documented in:
- docs/deployment-validation.md
- docs/15-minute-capacity-exercise.md
- docs/sre-learning-path.md
- docs/roadmap-v2-v4.md
- docs/sre-learning-interview-guide.md
- docs/incident-runbook.md
- docs/services-overview.md
- docs/dashboard-panel-guide.md
- docs/manual-load-testing.md
After a scheduled run or guided experiment, you can export:
- JSON from the browser UI
- Markdown from the browser UI
- HTML from the browser UI
- PDF with the export script:
PowerShell:
.\scripts\export-session-report.ps1 -Pdf
Linux / macOS shell:
python3 -m webbrowser "http://localhost:3001/api/session-report/latest/html"
This creates files in .\reports.
You can also browse persisted reports in the app at:
This lab now supports defaults with optional overrides through .env.
Examples:
- host ports
- Grafana admin username and password
- Postgres database, user, and password
- DB pool max
- reports host directory
Start from .env.example.
This makes the setup easier to reuse across:
- Windows
- Linux
- macOS
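The "defaults with optional overrides" pattern can be sketched in Node.js as follows. Every variable name here (`GRAFANA_PORT`, `APP_A_PORT`, `POSTGRES_USER`, `POSTGRES_DB`, `DB_POOL_MAX`) and the pool default of 10 are assumptions for illustration; check .env.example for the names and defaults the lab actually uses.

```javascript
// Resolve settings from environment variables, falling back to defaults.
// All variable names below are hypothetical examples.
function resolveConfig(env = process.env) {
  const toInt = (value, fallback) => {
    const parsed = Number.parseInt(value ?? "", 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };
  return {
    grafanaPort: toInt(env.GRAFANA_PORT, 3000),
    appAPort: toInt(env.APP_A_PORT, 3001),
    postgresUser: env.POSTGRES_USER ?? "sre",
    postgresDb: env.POSTGRES_DB ?? "srelab",
    dbPoolMax: toInt(env.DB_POOL_MAX, 10), // assumed default
  };
}

// With no overrides, the defaults apply.
console.log(resolveConfig({}).grafanaPort); // 3000
// An override replaces only the setting it names.
console.log(resolveConfig({ GRAFANA_PORT: "3300" }).grafanaPort); // 3300
```

Keeping every setting optional means a fresh clone works with no .env file at all, on any of the platforms listed above.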
If you git pull and start the stack again, the application setup comes from code and should work automatically after:
PowerShell:
docker compose up --build -d
Bash:
docker compose up --build -d
State behavior:
- persists in Docker volumes:
  - Postgres data in postgres_data
  - app log files in app_logs
- recreated from code on startup:
  - Grafana dashboards
  - Prometheus rules
  - application code and routes
  - docs and scripts
- does not persist unless you export it:
  - in-memory browser/guided session history
  - latest generated session report in app-a
So:
- your database content stays unless you remove Docker volumes
- your dashboards and rules come from the repo automatically
- session reports are now persisted automatically in reports
- Windows PowerShell:
  .\scripts\reset-lab.ps1
  .\scripts\reset-lab.ps1 -RemoveVolumes
- Linux / macOS:
  ./scripts/reset-lab.sh
  ./scripts/reset-lab.sh --volumes
Use the dedicated guide here: docs/15-minute-capacity-exercise.md
That guide explains:
- how the 15-minute run was executed
- which phases were used
- which dashboards to watch during the run
- why low RPS can still produce very high latency
- how to decide whether to scale app containers or fix the database tier first
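One way to see why low RPS can still produce very high latency is a single-server queueing approximation. The sketch below uses the M/M/1 formula (response time = service time / (1 − utilization)), which is a deliberate simplification and not a model of the lab's actual database pool; the function name and the 50 ms service time are hypothetical.

```javascript
// M/M/1 approximation: one worker, random arrivals and service times.
// utilization = arrival rate * service time
// response time = service time / (1 - utilization)
function approxResponseTimeMs(ratePerSec, serviceTimeMs) {
  const utilization = (ratePerSec * serviceTimeMs) / 1000;
  if (utilization >= 1) return Infinity; // the queue grows without bound
  return serviceTimeMs / (1 - utilization);
}

// A 50 ms query on a single connection:
console.log(approxResponseTimeMs(10, 50));            // 100 (50% busy)
console.log(Math.round(approxResponseTimeMs(19, 50))); // 1000 (95% busy)
console.log(approxResponseTimeMs(20, 50));            // Infinity (saturated)
```

Going from 10 to 19 req/sec multiplies latency by ten because the bottleneck is nearly saturated. In that situation adding app containers does not help; the constrained tier (here, the database) has to be fixed first.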
- Alert rules for burn-rate alerts
- Synthetic checks
- Chaos testing
- Retry and timeout tuning
- Circuit breaker patterns
- Horizontal scaling experiments
- Queue-based workloads
- Database dependency simulation
This project is intended to run on your local machine.
- The repository files live in your local workspace after you clone it.
- Containers started with Docker Desktop or Docker Engine also run locally.
- This is a local lab setup, not a hosted or remote environment.





