38 changes: 38 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,38 @@
name: CI - Build and Push QuickTicket Images

on:
  push:
    branches: [ main ]

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    permissions:
      packages: write
      contents: write

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push gateway
        run: |
          docker build -t ghcr.io/${{ github.actor }}/quickticket-gateway:${{ github.sha }} ./app/gateway
          docker push ghcr.io/${{ github.actor }}/quickticket-gateway:${{ github.sha }}

      - name: Build and push events
        run: |
          docker build -t ghcr.io/${{ github.actor }}/quickticket-events:${{ github.sha }} ./app/events
          docker push ghcr.io/${{ github.actor }}/quickticket-events:${{ github.sha }}

      - name: Build and push payments
        run: |
          docker build -t ghcr.io/${{ github.actor }}/quickticket-payments:${{ github.sha }} ./app/payments
          docker push ghcr.io/${{ github.actor }}/quickticket-payments:${{ github.sha }}
10 changes: 10 additions & 0 deletions app/events/.dockerignore
@@ -0,0 +1,10 @@
__pycache__
*.pyc
*.pyo
.git
.gitignore
.env
README.md
*.md
.vscode
__MACOSX
10 changes: 10 additions & 0 deletions app/gateway/.dockerignore
@@ -0,0 +1,10 @@
__pycache__
*.pyc
*.pyo
.git
.gitignore
.env
README.md
*.md
.vscode
__MACOSX
4 changes: 3 additions & 1 deletion app/gateway/Dockerfile
@@ -3,7 +3,9 @@ FROM python:3.13-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN addgroup --system app && adduser --system --ingroup app app
COPY main.py .

RUN chown -R app:app /app
USER app
EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
29 changes: 16 additions & 13 deletions app/gateway/main.py
@@ -310,14 +310,10 @@ async def _notify_order_confirmed(reservation_id: str):
log.warning(f"notify failed (non-critical) order={reservation_id} err={e}")


@app.post("/reserve/{reservation_id}/pay")
@app.post("/reserve/{reservation_id}/pay")
async def pay_reservation(reservation_id: str):
# 1. Call payments — wrapped in circuit breaker + retry.
#
# Composition order matters: cb.call(retry(_charge)) means each CB-tracked
# invocation includes its retries internally; the CB only sees the FINAL
# outcome. The reverse — retry(cb.call(_charge)) — would retry past the
# CircuitOpenError, defeating the fast-fail. See lab 11 §11.4.
"""Pay for reservation with graceful degradation when payments service is down."""
async def _charge():
resp = await client.post(
f"{PAYMENTS_URL}/charge",
@@ -327,20 +323,27 @@ async def _charge():
return resp

try:
# Try to call payments with circuit breaker + retry
pay_resp = await payments_cb.call(lambda: call_with_retry(_charge, target="payments"))
payment_ref = pay_resp.json().get("payment_ref", "unknown")
except CircuitOpenError:
log.error("circuit open, skipping payments call")
raise HTTPException(503, "Payment service temporarily unavailable (circuit open)")
except httpx.TimeoutException:
raise HTTPException(504, "Payment service timeout")
except (CircuitOpenError, httpx.ConnectError, httpx.TimeoutException, httpx.RequestError) as e:
# === GRACEFUL DEGRADATION ===
log.warning(f"Payments service unavailable for reservation {reservation_id}: {e}")
return JSONResponse(
status_code=503,
content={
"error": "payments_unavailable",
"message": "Payment service is temporarily down. Your reservation is held — try again in a few minutes.",
"reservation_id": reservation_id
}
)
except httpx.HTTPStatusError as e:
raise HTTPException(e.response.status_code, "Payment failed")
except Exception as e:
log.error(f"payment error: {e}")
raise HTTPException(502, "Payment service unavailable")

# 2. Confirm reservation in events.
# 2. Confirm reservation in events (only if payment succeeded)
try:
confirm_resp = await client.post(
f"{EVENTS_URL}/reservations/{reservation_id}/confirm",
@@ -352,7 +355,7 @@ async def _charge():
log.error(f"confirm error after payment: {e}")
raise HTTPException(500, "Payment succeeded but confirmation failed — contact support")

# 3. Fire-and-forget notify (don't await → don't add latency, don't fail user).
# 3. Fire-and-forget notify
asyncio.create_task(_notify_order_confirmed(reservation_id))

return result
10 changes: 10 additions & 0 deletions app/payments/.dockerignore
@@ -0,0 +1,10 @@
__pycache__
*.pyc
*.pyo
.git
.gitignore
.env
README.md
*.md
.vscode
__MACOSX
2 changes: 1 addition & 1 deletion docker-compose.monitoring.yaml
@@ -5,10 +5,10 @@ services:
- "9090:9090"
volumes:
- ../monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ../monitoring/prometheus/rules.yml:/etc/prometheus/rules.yml:ro # ← add this line
command:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.retention.time=7d"

grafana:
image: grafana/grafana:13.0.1
ports:
23 changes: 23 additions & 0 deletions monitoring/prometheus/prometheus.yml
@@ -0,0 +1,23 @@
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules.yml"

scrape_configs:
  - job_name: 'gateway'
    static_configs:
      - targets: ['gateway:8080']

  - job_name: 'events'
    static_configs:
      - targets: ['events:8081']

  - job_name: 'payments'
    static_configs:
      - targets: ['payments:8082']

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
12 changes: 12 additions & 0 deletions monitoring/prometheus/rules.yml
@@ -0,0 +1,12 @@
groups:
  - name: quickticket_slo_rules
    interval: 30s
    rules:
      - record: gateway:sli_availability:ratio_rate5m
        expr: sum(rate(gateway_requests_total{status!~"5.."}[5m])) / sum(rate(gateway_requests_total[5m]))

      - record: gateway:sli_latency_500ms:ratio_rate5m
        expr: sum(rate(gateway_request_duration_seconds_bucket{le="0.5"}[5m])) / sum(rate(gateway_request_duration_seconds_count[5m]))

      - record: gateway:error_budget_burn_rate:ratio_rate5m
        expr: (1 - gateway:sli_availability:ratio_rate5m) / (1 - 0.995)
179 changes: 179 additions & 0 deletions submissions/lab1.md
@@ -0,0 +1,179 @@
# Lab 1 — SRE Philosophy: Deploy, Break, Understand

## Docker Compose Status

All 5 services are running successfully:

```bash
NAME             IMAGE                STATUS         PORTS
app-events-1     app-events           Up             0.0.0.0:8081->8081/tcp
app-gateway-1    app-gateway          Up             0.0.0.0:3080->8080/tcp
app-payments-1   app-payments         Up             0.0.0.0:8082->8082/tcp
app-postgres-1   postgres:17-alpine   Up (healthy)   0.0.0.0:5432->5432/tcp
app-redis-1      redis:7-alpine       Up (healthy)   0.0.0.0:6379->6379/tcp
```

## Critical Path (Everything Working)

### 1. List Events

```json
[
  {
    "id": 1,
    "name": "Go Conference 2026",
    "venue": "Main Hall A",
    "date": "2026-09-15T09:00:00+00:00",
    "total_tickets": 100,
    "price_cents": 5000,
    "available": 99
  },
  {
    "id": 4,
    "name": "Python Workshop",
    "venue": "Lab 301",
    "date": "2026-09-22T14:00:00+00:00",
    "total_tickets": 25,
    "price_cents": 2000,
    "available": 25
  },
  {
    "id": 2,
    "name": "SRE Meetup",
    "venue": "Room 204",
    "date": "2026-10-01T18:00:00+00:00",
    "total_tickets": 30,
    "price_cents": 0,
    "available": 30
  },
  {
    "id": 5,
    "name": "Kubernetes Deep Dive",
    "venue": "Auditorium B",
    "date": "2026-10-10T10:00:00+00:00",
    "total_tickets": 80,
    "price_cents": 8000,
    "available": 80
  },
  {
    "id": 3,
    "name": "Cloud Native Summit",
    "venue": "Expo Center",
    "date": "2026-11-20T10:00:00+00:00",
    "total_tickets": 500,
    "price_cents": 15000,
    "available": 500
  }
]
```

### 2. Reserve a Ticket

```json
{
  "reservation_id": "a3370485-51ea-46bf-a3b1-c6cf7a101df4",
  "event_id": 1,
  "quantity": 1,
  "total_cents": 5000,
  "expires_in_seconds": 300
}
```

### 3. Pay for Reservation

```json
{
  "order_id": "a3370485-51ea-46bf-a3b1-c6cf7a101df4",
  "event_id": 1,
  "quantity": 1,
  "total_cents": 5000,
  "status": "confirmed"
}
```

### 4. Health Check

```json
{
  "status": "healthy",
  "checks": {
    "events": "ok",
    "payments": "ok",
    "circuit_payments": "CLOSED"
  }
}
```

## Dependency Map

```mermaid
graph TD
    Gateway --> Events
    Gateway --> Payments
    Events --> Postgres
    Events --> Redis
```

## Failure Table

| Component Killed | Events List | Reserve | Pay | Health Check | User Impact |
| ---------------- | ----------- | ------- | ----- | ------------ | -------------------------------- |
| payments | Works | Works | Fails | degraded | Can reserve but cannot pay |
| events | Fails | Fails | Fails | degraded | Cannot browse or buy tickets |
| redis | Works | Works | Works | ok | Minor impact |
| postgres | Fails | Fails | Fails | degraded | Events service completely broken |
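
The table can be cross-checked against the dependency map with a small model. The sketch below is illustrative only: the names `SERVICE_HARD_DEPS`, `ENDPOINT_NEEDS`, and `failure_row` are hypothetical, and it assumes Redis is a soft dependency (every endpoint kept working without it) and that the pay endpoint needs both Payments and Events (Pay failed when either was killed).

```python
# Hypothetical model of the dependency map; not part of the QuickTicket code.
# Only HARD dependencies are listed. Redis is omitted from Events' hard
# dependencies because the table shows all endpoints working when Redis is down.
SERVICE_HARD_DEPS = {
    "events": {"postgres"},
    "payments": set(),
    "postgres": set(),
    "redis": set(),
}

# Services each gateway endpoint calls directly (assumed from the table).
ENDPOINT_NEEDS = {
    "list": {"events"},
    "reserve": {"events"},
    "pay": {"payments", "events"},
}


def is_up(service: str, killed: set) -> bool:
    """A service is up if it was not killed and all its hard deps are up."""
    if service in killed:
        return False
    return all(is_up(dep, killed) for dep in SERVICE_HARD_DEPS[service])


def failure_row(killed_component: str) -> dict:
    """Predict 'Works' or 'Fails' per endpoint when one component is killed."""
    killed = {killed_component}
    return {
        endpoint: "Works" if all(is_up(s, killed) for s in needs) else "Fails"
        for endpoint, needs in ENDPOINT_NEEDS.items()
    }


if __name__ == "__main__":
    for component in ("payments", "events", "redis", "postgres"):
        print(component, failure_row(component))
```

Under these assumptions the predicted rows match the observed Events List, Reserve, and Pay columns, which suggests the blast radius follows directly from the hard-dependency edges.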

## Load Generator Test

I ran the load generator:

```bash
../loadgen/run.sh 5 30
```

While it was running, I stopped the payments service. The error rate increased significantly, but the list and reserve endpoints continued working. This shows that the blast radius of a payments outage is limited to the pay endpoint, which is the graceful degradation behavior the gateway is meant to provide.

## Task 2 — Graceful Degradation

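Modified `app/gateway/main.py` so that `POST /reserve/{reservation_id}/pay` returns a structured 503 response when the payments service is unavailable (circuit open, connection error, or timeout).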

Example response:

```json
{
"error": "payments_unavailable",
"message": "Payment service is temporarily down. Your reservation is held — try again in a few minutes.",
"reservation_id": "..."
}
```

Results:

* Reserve endpoint continued working.
* Pay endpoint returned a friendly error message.
* User experience degraded gracefully instead of failing unexpectedly.
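
The pattern behind these results can be shown in isolation. The following is a minimal sketch, not the gateway code itself: `PaymentsUnavailable` and `charge` are hypothetical stand-ins for the real circuit-breaker and HTTP client errors, and the handler returns a `(status, body)` tuple instead of a framework response so that it runs with the standard library alone.

```python
import asyncio


class PaymentsUnavailable(Exception):
    """Hypothetical stand-in for a circuit-open, connection, or timeout error."""


async def charge(reservation_id: str, payments_up: bool) -> dict:
    # Stand-in for the HTTP call to the payments service.
    if not payments_up:
        raise PaymentsUnavailable("connection refused")
    return {"payment_ref": f"pay-{reservation_id}"}


async def pay_reservation(reservation_id: str, payments_up: bool) -> tuple[int, dict]:
    """Return (status_code, body); degrade to a structured 503 on outage."""
    try:
        payment = await charge(reservation_id, payments_up)
    except PaymentsUnavailable:
        # Graceful degradation: tell the client what happened and that the
        # reservation is still held, rather than surfacing a generic error.
        return 503, {
            "error": "payments_unavailable",
            "message": "Payment service is temporarily down. Your reservation is held.",
            "reservation_id": reservation_id,
        }
    return 200, {
        "order_id": reservation_id,
        "status": "confirmed",
        "payment_ref": payment["payment_ref"],
    }


if __name__ == "__main__":
    print(asyncio.run(pay_reservation("demo", payments_up=False)))
    print(asyncio.run(pay_reservation("demo", payments_up=True)))
```

Returning the `reservation_id` in the error body lets a client retry the same payment later without creating a new reservation.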

## Bonus Task — Resource Usage

### Idle

```bash
NAME             CPU %    MEM USAGE
app-gateway-1    0.25%    38.11MiB
app-events-1     0.25%    41MiB
app-payments-1   0.23%    32.96MiB
app-postgres-1   2.59%    23.89MiB
app-redis-1      0.86%    3.66MiB
```

### Observations

* PostgreSQL consumed the highest CPU while idle.
* Redis used the least memory.
* Gateway and Events services increased CPU usage under load because they handled incoming traffic.
* When Payments was unavailable, Gateway held requests open longer and showed increased resource utilization.

## GitHub Community

I starred the course repository and the `simple-container-com/api` project.
I followed the professor (@Cre-eD), TAs (@Naghme98, @pierrepicaud), and several classmates.
Starring repositories supports maintainers and helps useful projects gain visibility. Following developers helps me learn from their work and expand my professional network.