aria-aiops · bayrem · Jun 16, 2026
diff --git a/.dockerignore b/.dockerignore
@@ -0,0 +1,35 @@
+# Version control
+.git/
+.gitignore
+
+# Python artifacts
+.venv/
+__pycache__/
+*.pyc
+*.pyo
+*.pyd
+.pytest_cache/
+*.egg-info/
+dist/
+build/
+.mypy_cache/
+.ruff_cache/
+
+# Tests and fixtures — not needed in the image
+tests/
+
+# Dev and internal docs
+.docs/
+.claude/
+
+# Infrastructure-as-code — not part of the application
+infra/
+
+# Runtime data — always mounted at runtime, never baked in
+logs/
+data/
+conf.yaml
+
+# Secrets — never in the image
+.env
+.env.*
diff --git a/.env.example b/.env.example
@@ -1,45 +1,97 @@
 # ──────────────────────────────────────────────────────────────────────────────
-# ARIA — Secrets template
+# ARIA — Secrets and runtime env vars template
 # Copy this file to .env and fill in your values. Never commit .env to Git.
 # Non-secret configuration (model IDs, connector types, GCP settings, etc.)
-# lives in conf.yaml — see conf_template.yaml.
+# lives in conf.yaml — see conf_template.yaml for all options.
 # ──────────────────────────────────────────────────────────────────────────────
 
+
+# ── Required (all deployments) ────────────────────────────────────────────────
+
 # ServiceNow — password for the service account defined in conf.yaml (servicenow.user)
 SNOW_PASSWORD=<your-servicenow-password>
 
-# Anthropic — API key for LLM calls across all agents
-# Reference implementation uses Anthropic. Swap for your provider if you bring
-# your own LLMClientInterface implementation.
+# Slack — bot token with chat:write scope (channel set in conf.yaml slack.channel_id)
+SLACK_BOT_TOKEN=<your-slack-bot-token>
+
+
+# ── LLM provider — set ONE block depending on llm.provider in conf.yaml ───────
+
+# --- anthropic (llm.provider: anthropic) — recommended for non-GCP deployments
 ANTHROPIC_API_KEY=<your-anthropic-api-key>
 
-# Slack — bot token with chat:write scope, for the channel defined in conf.yaml (slack.channel_id)
-SLACK_BOT_TOKEN=<your-slack-bot-token>
+# --- vertex_ai (llm.provider: vertex_ai) — GCP container deployments (no API key needed)
+# Auth is via ADC — set GOOGLE_APPLICATION_CREDENTIALS if not running on GKE/Cloud Run.
+# GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json  # only if ADC is not auto-resolved
+VERTEX_AI_PROJECT_ID=<your-gcp-project-id>
+VERTEX_AI_LOCATION=europe-west1
 
-# MS Teams (optional — alternative notifier; swap connector in dependencies.py)
-TEAMS_WEBHOOK_URL=<your-teams-incoming-webhook-url>
+# --- claude_code (llm.provider: claude_code) — local dev only, NOT for production (#84)
+# No additional env vars needed; uses the local Claude Code CLI subscription.
+
+
+# ── Vault backend — set ONE block depending on runtime.vault_backend in conf.yaml ─
+
+# --- env (default) — secrets come from this .env file; no additional config needed.
+
+# --- gcp — GCP Secret Manager via ADC
+GCP_PROJECT_ID=<your-gcp-project-id>
+
+# --- hashicorp — HashiCorp Vault
+VAULT_TOKEN=<your-vault-token>
+
+# --- aws — AWS Secrets Manager
+AWS_ACCESS_KEY_ID=<your-aws-access-key-id>
+AWS_SECRET_ACCESS_KEY=<your-aws-secret-access-key>
 
-# Google Chat (optional — alternative notifier; swap connector in dependencies.py)
-GOOGLE_CHAT_WEBHOOK_URL=<your-google-chat-webhook-url>
+
+# ── Pipeline behaviour ────────────────────────────────────────────────────────
+
+# Path to a custom conf.yaml (default: ./conf.yaml relative to working directory)
+# ARIA_CONFIG_PATH=/etc/aria/conf.yaml
+
+# Enable the built-in Alpine.js ops dashboard at /dashboard
+# ARIA_DASHBOARD_ENABLED=true
+
+# Operating mode: inform | hitm | autonomous (only 'inform' is implemented in Phase 1.5)
+# ARIA_OPERATING_MODE=inform
+
+# LLM provider override — overrides llm.provider in conf.yaml
+# ARIA_LLM_PROVIDER=anthropic
+
+# Vault backend override — overrides runtime.vault_backend in conf.yaml
+# ARIA_VAULT_BACKEND=env
+
+# Log format: human (coloured, for terminals) | json (for log aggregators)
+# ARIA_LOG_FORMAT=human
+
+# Log directory for rolling file output
+# ARIA_LOG_DIR=logs/
+
+# SQLite run history database path
+# ARIA_RUN_DB_PATH=data/runs.db
+
+# Dry-run mode — uses in-memory stubs; no real ServiceNow/Slack/SSH calls
+# ARIA_DRY_RUN=false
+
+
+# ── Optional connectors ───────────────────────────────────────────────────────
 
 # CDP — SSH private key PEM content for Agent 2 log extraction from Cloudera CDP nodes
 # Set via: export CDP_SSH_KEY="$(cat /path/to/private_key)"
 CDP_SSH_KEY=<pem-content-of-ssh-private-key>
 
-# CDP — SSH host public key for strict host verification (recommended, prevents MITM attacks)
+# CDP — SSH host public key for strict host verification (prevents MITM attacks)
 # Format: "<key-type> <base64-encoded-public-key>" e.g. "ssh-ed25519 AAAA..."
-# If not set, ARIA falls back to WarningPolicy (logs a warning but still connects)
+# Leave empty to use WarningPolicy (logs a warning but still connects)
 CDP_HOST_KEY=
 
-# GCP — service account JSON key (base64-encoded) for BigQuery and GCS access
-# Only required when connectors.log = gcp in conf.yaml
+# GCP — service account JSON key (base64-encoded) for Cloud Logging / BigQuery access
+# Only needed if NOT using ADC (e.g. running outside GCP with a SA key file)
 GCP_SA_KEY=<base64-encoded-service-account-json>
 
-# AWS — credentials for the AWS Secrets Manager vault implementation
-# Only required if you are using the AWS SM vault backend
-AWS_ACCESS_KEY_ID=<your-aws-access-key-id>
-AWS_SECRET_ACCESS_KEY=<your-aws-secret-access-key>
+# MS Teams (optional alternative notifier — swap connector in dependencies.py)
+TEAMS_WEBHOOK_URL=<your-teams-incoming-webhook-url>
 
-# HashiCorp Vault — token for the Vault vault implementation
-# Only required if you are using the Vault backend
-VAULT_TOKEN=<your-vault-token>
+# Google Chat (optional alternative notifier — swap connector in dependencies.py)
+GOOGLE_CHAT_WEBHOOK_URL=<your-google-chat-webhook-url>
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -52,3 +52,31 @@ jobs:
 
       - name: pytest (unit)
         run: pytest tests/unit/ -v
+
+  docker-smoke:
+    name: Docker build + smoke test
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Build image
+        run: docker build -t aria:ci .
+
+      - name: Smoke test
+        run: |
+          docker run -d --name aria-ci \
+            -e ARIA_DRY_RUN=true \
+            -e ARIA_LLM_PROVIDER=anthropic \
+            -e ANTHROPIC_API_KEY=dummy \
+            -p 8000:8000 aria:ci
+          # Poll the health endpoint for up to ~30 s (15 attempts, 2 s apart) while the API boots.
+          for i in $(seq 1 15); do
+            if curl -sf http://localhost:8000/api/v1/health; then
+              echo "Health check passed"
+              break
+            fi
+            echo "Waiting... ($i/15)"
+            sleep 2
+          done
+          curl -sf http://localhost:8000/api/v1/health
+          docker stop aria-ci
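Review note on the smoke-test step: the polling loop above could be factored into a small retry helper, so that a timeout fails the job explicitly and leaves room to dump container logs before exiting. A minimal sketch, assuming a hypothetical `wait_for` function that is not part of this PR:

```bash
# Hypothetical helper (not in the PR): retry a command until it succeeds
# or the attempt budget is exhausted.
# Usage: wait_for <attempts> <delay-seconds> <command> [args...]
wait_for() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    # Run the remaining arguments as the probe command.
    if "$@"; then
      return 0
    fi
    echo "Waiting... ($i/$attempts)" >&2
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Gave up after $attempts attempts: $*" >&2
  return 1
}
```

In the workflow it might be called as `wait_for 15 2 curl -sf http://localhost:8000/api/v1/health || { docker logs aria-ci; exit 1; }`, which keeps the same 15 x 2 s budget as the loop in the diff and surfaces the container logs when the API never comes up.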
diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,29 @@
+FROM python:3.11-slim
+
+# curl is needed for the HEALTHCHECK command below.
+RUN apt-get update \
+    && apt-get install -y --no-install-recommends curl \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /app
+
+# Create a non-root user before copying any files.
+RUN useradd --create-home --uid 1000 aria
+
+# Install dependencies first so this layer is cached when only source changes.
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy application source.
+COPY . .
+
+RUN chown -R aria:aria /app
+USER aria
+
+EXPOSE 8000
+
+# Health check hits /api/v1/health and marks the container unhealthy after 3 consecutive failures.
+HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
+    CMD curl -f http://localhost:8000/api/v1/health || exit 1
+
+CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
diff --git a/README.md b/README.md
@@ -549,6 +549,51 @@ uvicorn api.main:app --reload
 
 ---
 
+## Deployment
+
+ARIA ships as a single Docker image. No Python installation is required on the target machine — only Docker (local/VM) or a Kubernetes cluster (production). The same image works across all environments; what changes is how `conf.yaml` and secrets are injected.
+
+### Docker (local machine or VM)
+
+```bash
+# 1. Build
+docker build -t aria:latest .
+
+# 2. Run — mount your conf.yaml; pass secrets as env vars
+docker run -d \
+  --name aria \
+  -p 8000:8000 \
+  -v /path/to/conf.yaml:/etc/aria/conf.yaml:ro \
+  -e ARIA_CONFIG_PATH=/etc/aria/conf.yaml \
+  -e SNOW_PASSWORD=<your-password> \
+  -e ANTHROPIC_API_KEY=<your-key> \
+  -e SLACK_BOT_TOKEN=<your-token> \
+  aria:latest
+
+# 3. Verify
+curl http://localhost:8000/api/v1/health
+```
+
+### Kubernetes
+
+`conf.yaml` is delivered via a ConfigMap; secrets via a Kubernetes Secret or GCP Secret Manager (Workload Identity, no API key in the pod):
+
+```bash
+kubectl create namespace aria
+kubectl create configmap aria-config --from-file=conf.yaml=./conf.yaml -n aria
+kubectl create secret generic aria-secrets \
+  --from-literal=SNOW_PASSWORD=<pw> \
+  --from-literal=ANTHROPIC_API_KEY=<key> \
+  --from-literal=SLACK_BOT_TOKEN=<token> \
+  -n aria
+```
+
+Then apply a Deployment that mounts the ConfigMap at `/etc/aria/conf.yaml` and sets `ARIA_CONFIG_PATH=/etc/aria/conf.yaml`. For GCP clusters, set `llm.provider: vertex_ai` and `runtime.vault_backend: gcp` in `conf.yaml` — the pod authenticates via Workload Identity with no credentials in the container.
+
+**Full guide** (conf.yaml preparation, docker-compose, GKE Deployment + Service YAML, LLM provider selection, vault backend options): [documentation/guides/installation.md](documentation/guides/installation.md)
+
+---
+
 ## Acceptance criteria (Phase 1)
 
 Phase 1 is complete when all of the following pass on 10 consecutive test incidents:
@@ -580,8 +625,8 @@ Phase 1 is complete when all of the following pass on 10 consecutive test incide
 | Phase 1 | S8: ReAct loop trigger — cross-service log requests | ✅ Done |
 | Phase 1 | M7: Acceptance criteria validated on local environment | ✅ Done |
 | Phase 1.5 | S1: Structured logging — structlog, `run_id`, lifecycle events, RunRecord | ✅ Done |
-| **Phase 1.5** | **S2: Monitoring foundation — run store, REST API, Alpine.js dashboard, mode scaffold** | 🔜 Next |
-| Phase 1.5 | S3: Docker + `ARIA_CONFIG_PATH` + `VertexAILLMClient` + LLM provider DI | 🔜 Planned |
+| Phase 1.5 | S2: Monitoring foundation — run store, REST API, Alpine.js dashboard, mode scaffold | ✅ Done |
+| Phase 1.5 | S3: Docker + `ARIA_CONFIG_PATH` + `VertexAILLMClient` + LLM provider DI (incl. #84 security fix) | ✅ Done |
 | Phase 1.5 | S4: Testing infrastructure — UC1/UC2/UC3 cluster wiring, KB runbooks, CMDB validation | 🔜 Planned |
 | Phase 1.5 | S5: Round 2 acceptance testing — 30 incidents on UC1 + UC2 real infrastructure | 🔜 Planned |
 | Phase 1.5 | S6: GCP native connectors — BQ, Cloud Functions, Pub/Sub, GCS | 🔜 Planned |
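Review note on the README's Kubernetes section: it says to apply a Deployment that mounts the ConfigMap at `/etc/aria/conf.yaml` and sets `ARIA_CONFIG_PATH`, but the manifest itself lives only in the linked guide. A minimal sketch of what that Deployment might look like follows. The file name `aria-deployment.yaml`, the `app: aria` label, the single replica, the readiness probe, and the `aria:latest` image reference are all assumptions made for illustration, and this is not the project's official manifest:

```bash
# Write a minimal Deployment manifest to a local file (hypothetical file name).
cat > aria-deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aria
  namespace: aria
spec:
  replicas: 1
  selector:
    matchLabels:
      app: aria
  template:
    metadata:
      labels:
        app: aria
    spec:
      containers:
        - name: aria
          image: aria:latest   # assumption: replace with your registry path
          ports:
            - containerPort: 8000
          env:
            - name: ARIA_CONFIG_PATH
              value: /etc/aria/conf.yaml
          envFrom:
            - secretRef:
                name: aria-secrets   # Secret created in the README step
          volumeMounts:
            - name: config
              mountPath: /etc/aria/conf.yaml
              subPath: conf.yaml
              readOnly: true
          readinessProbe:
            httpGet:
              path: /api/v1/health
              port: 8000
      volumes:
        - name: config
          configMap:
            name: aria-config   # ConfigMap created in the README step
EOF

# Apply once the namespace, ConfigMap, and Secret exist:
# kubectl apply -f aria-deployment.yaml
```

Mounting with `subPath` places the single `conf.yaml` key at the exact path the app expects. One trade-off is that ConfigMap updates are not propagated to `subPath` mounts, so a config change needs a pod restart.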