75 changes: 75 additions & 0 deletions ai_plans/2026-06-21_web-task-summary-strip-env-details.md
@@ -0,0 +1,75 @@
# Web task summary: show strictly the user query, fold the environment block

Date: 2026-06-21
Area: self-hosted-cloudapi web view (task list + task detail)

## Problem

The task summary/title in the cloud web view shows machine-generated framing the
user never typed — e.g. "Current Mode / code". Root cause, proven from real data:

The first user turn that reaches the cloud arrives in Roo Code's **API-prompt
form**, not the clean UI text. `api_conversation_history.json` first user message:

```
<user_message>
uruchom wszystkie testy w langgrapha
</user_message> <environment_details>
# VSCode Visible Files
...
# Current Mode
<slug>code</slug>
<name>💻 Code</name>
<model>unsloth/GLM-5.2-GGUF:UD-Q3_K_XL</model>
...
</environment_details>
```

`_derive_title()` ([routers/web.py](../self-hosted-cloudapi/src/routers/web.py))
takes the first text-bearing message verbatim, so the `<user_message>` wrapper and
the `<environment_details>` block (mode, open tabs, file tree, cost…) bleed into
the title. The same raw text renders in the conversation body with no way to
separate the human query from the machine appendix.

## Fix

Treat the wrapped form as what it is: human query + machine appendix.

### Backend — `_derive_title` (routers/web.py)

Add `_strip_task_wrappers(text)`:

1. Remove `<environment_details>…</environment_details>`, including a trailing
   block that is never closed.
2. Unwrap the human message tag (`<user_message>`, `<task>`, or `<feedback>`) to
   its inner content.
3. Plain text (already clean) passes through unchanged.

`_derive_title` runs each candidate message through it before taking the first
non-empty line. Covers both the task list and the detail-page `<h1>`.
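
A minimal sketch of the three steps above, assuming regex-based stripping is acceptable. The name `strip_task_wrappers` mirrors the planned `_strip_task_wrappers`, but the patterns and the helper `first_non_empty_line` are illustrative assumptions, not the code in `routers/web.py`. Unlike step 3 as written, this sketch also trims surrounding whitespace.

```python
import re

# Hypothetical patterns; only the tag names come from the plan.
# The alternation with \Z also matches a trailing block that is never closed.
_ENV_BLOCK = re.compile(
    r"<environment_details>.*?(?:</environment_details>|\Z)", re.DOTALL
)
_MESSAGE_TAG = re.compile(r"<(user_message|task|feedback)>(.*?)</\1>", re.DOTALL)


def strip_task_wrappers(text: str) -> str:
    """Return the human query from an API-prompt-form user turn."""
    # 1. Drop the environment block (closed or trailing/unclosed).
    text = _ENV_BLOCK.sub("", text)
    # 2. Replace each wrapper tag pair with its inner content.
    text = _MESSAGE_TAG.sub(lambda m: m.group(2), text)
    # 3. Text without tags reaches this point untouched; trim whitespace only.
    return text.strip()


def first_non_empty_line(text: str) -> str:
    """Illustrative title step: first non-empty line of the cleaned text."""
    for line in strip_task_wrappers(text).splitlines():
        if line.strip():
            return line.strip()
    return ""
```

Applied to the wrapped turn quoted in the Problem section, this yields the bare query; a message with no tags comes back as-is apart from whitespace trimming.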

### Frontend — render.js conversation body

Add `userContentHtml(text)`: split off the `<environment_details>` block, unwrap
the message tag, render the clean query as markdown, and append the environment
block as a **collapsed `<details>`** ("Environment details"), so the full original
is one click away. This satisfies the "unfold to full length" requirement. It
applies to the text / user_feedback / user_feedback_diff rows. When no tags are
present, the output is identical to today's.

### CSS — app.css

Minimal styling for `details.env-details` (muted summary, monospace body).

## Tests

`tests/test_web_and_share.py`: a backfill whose first message is the wrapped
API-form turn — assert the rendered list/detail title is the bare query
("uruchom wszystkie testy w langgrapha"), with no `environment_details` / `Current
Mode` / `<user_message>` leakage.

## Out of scope

- Message role classification (the initial task currently renders under the
"Assistant" label) — separate concern, not touched here.
- Title length cap stays at 100 chars; the full prompt is now visible in the
conversation body.
57 changes: 57 additions & 0 deletions ai_plans/2026-06-22_authentik-group-gate-and-app-rename.md
@@ -0,0 +1,57 @@
# Authentik: gate Tumble Code by a group + rename the application

**Date:** 2026-06-22
**Scope:** `self-hosted-cloudapi/authentik/blueprints/`

## Goal

Two changes to the auto-provisioned Authentik blueprint, applied cleanly on a
fresh `docker compose up` (user will drop all `./.vol/*` first):

1. Provision a **group** so access to Tumble Code is controlled by group
membership — add a user to the group → they can sign in to Tumble Code.
2. **Rename** the application's display name from `Stork Code` → `Tumble Code`.

## Background (verified against Authentik docs)

- Application access in Authentik is governed by **policy bindings** on the
application. A binding whose `group` field is set is a plain _group-membership_
check — no separate policy object needed.
- **Default behaviour:** an application with _no_ bindings is open to everyone.
The moment one group binding is added, access is restricted to that group.
- **Superusers are not exempt** from application access bindings (superuser grants
_admin_ access, not _application_ access). The bootstrap `akadmin` account — the
one used to sign in during the extension OAuth flow — must therefore be a member
of the group, or it gets locked out of its own app. The blueprint adds `akadmin`
to the group on creation to prevent this.

Source: Authentik blueprint Models + Bindings overview docs.

## Changes (single file: `stork-code.yaml`)

Internal IDs stay (`slug: stork-code`, `client_id`, provider name) — these are
referenced by the api's `AUTHENTIK_APP_SLUG` / `AUTHENTIK_CLIENT_ID` and must not
change. Only the public display string changes, per the rebrand principle.

1. **New group entry** (`authentik_core.group`), id `tumble-code-group`,
name `Tumble Code Users`, with `akadmin` added as a member via
`!Find [authentik_core.user, [username, akadmin]]`.
2. **Application name** `Stork Code` → `Tumble Code`.
3. **New policy binding** (`authentik_policies.policybinding`) targeting the
application (`!KeyOf stork-code-application`) with `group`
(`!KeyOf tumble-code-group`), `order: 0`, `enabled: true` — this is what turns
on the group gate.

## How to use after `docker compose up`

- Sign in to Authentik admin as `akadmin` (already in the group → can use Tumble
Code immediately).
- To grant another person access: Directory → Groups → **Tumble Code Users** →
add the user. No blueprint edit needed.

## Not changed / why

- Slug, client id/secret, provider name, `AUTHENTIK_APP_SLUG` — internal IDs the
api builds endpoints from; renaming them would be a wider, riskier change and
isn't what was asked.
- Blueprint filename kept as `stork-code.yaml` (internal).
81 changes: 81 additions & 0 deletions ai_plans/2026-06-22_cloudapi-authentik-back-channel-host.md
@@ -0,0 +1,81 @@
# Fix Authentik back-channel 502 on OAuth callback (public-address ready)

Branch: `fix/cloudapi-authentik-back-channel-host`

## Symptom

After logging in to the bundled Authentik, the browser lands on
`GET http://localhost:8085/auth/clerk/callback?code=...&state=...` with
**502 Bad Gateway**.

## Root cause (proven, not inferred)

The "502" is **not** a reverse-proxy error — it is the cloud API's own error
page, returned at [browser.py:267](../self-hosted-cloudapi/src/routers/browser.py#L267)
when the back-channel token exchange to Authentik throws.

Evidence chain, gathered against the running stack:

1. `api` container log:
`Token exchange failed: Client error '404 Not Found' for url 'http://auth_server:9000/application/o/token/'`
2. Every Authentik `/application/o/*` route 404s on the back-channel, while
`/-/health/live/` returns 200 — so the container is reachable, the routes are not.
3. The only variable is the HTTP `Host` header. Probing the same URL:
- `Host: auth_server:9000` → **404**
- `Host: localhost` / `localhost:9000` / `auth.tumblecode.dev` / `evil.example.com` → **200**
4. Narrowed to the underscore: `under_score.example.com` → 404, `auth-server:9000` → 200.

**Authentik (Django) resolves the brand — and therefore serves its OAuth/OIDC
routes — from the `Host` header, and rejects hosts containing an underscore
(`auth_server` is not a valid RFC-1123 hostname) with a 404.** The compose
service is named `auth_server`, so the back-channel URL `http://auth_server:9000`
makes httpx send `Host: auth_server:9000` → 404 → token exchange fails → 502 page.
The browser flow works only because the front-channel host (`localhost:9000`) is valid.

The discovery doc's `issuer` merely echoes the request `Host`, and the token's real
`iss` is fixed at front-channel authorize time. The topology-independent fix is
therefore to make the back-channel present the **public front-channel host** as `Host`.
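
The underscore finding in the evidence chain can be reproduced offline with a plain RFC-1123 label check. The sketch below is an illustration only: `is_rfc1123_hostname` is a hypothetical helper, not Authentik's or Django's actual validation, and it does not handle IPv6 literals.

```python
import re

# One DNS label: letters, digits, hyphens; 1-63 chars; no leading/trailing hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_rfc1123_hostname(host: str) -> bool:
    """Check a Host header value (optionally with :port) against RFC 1123."""
    # Strip an optional port; IPv6 literals are out of scope for this sketch.
    hostname = host.rsplit(":", 1)[0] if ":" in host else host
    if not hostname or len(hostname) > 253:
        return False
    return all(_LABEL.fullmatch(label) for label in hostname.split("."))
```

Under this check `auth_server:9000` and `under_score.example.com` are invalid, while `auth-server:9000`, `localhost:9000`, and `auth.tumblecode.dev` are valid, which matches the 404/200 split observed in the probes.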

## Fix

Connect to the internal service name (for DNS) but send the front-channel host
(host of `AUTHENTIK_BASE_URL`) as `Host` on every server-to-server call. Works
identically for dev (`localhost:9000`) and prod (`auth.tumblecode.dev`).

- [config/auth.py](../self-hosted-cloudapi/config/auth.py): add
`get_back_channel_host_header()` → returns `urlsplit(authentik_base_url).netloc`
when `authentik_internal_url` is set, else `None`.
- [src/auth/authentik.py](../self-hosted-cloudapi/src/auth/authentik.py): add
`_back_channel_headers()` and apply it to `exchange_code_for_tokens`,
`get_userinfo`, `get_openid_configuration`.
- [.env.example](../self-hosted-cloudapi/.env.example): document the Host behaviour
and a full `app.tumblecode.dev` production block.
- [tests/test_back_channel_host.py](../self-hosted-cloudapi/tests/test_back_channel_host.py):
lock in the header value and that it is attached to all three calls.

No compose change needed: the Host override neutralises the underscore, so
`AUTHENTIK_INTERNAL_URL=http://auth_server:9000` stays valid.
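
A minimal sketch of the header logic described above. The function name `get_back_channel_host_header` comes from the plan, but the signature here (plain string arguments rather than a settings object) and the helper `back_channel_headers` are assumptions made so the example is self-contained.

```python
from typing import Optional
from urllib.parse import urlsplit


def get_back_channel_host_header(
    authentik_base_url: str, authentik_internal_url: Optional[str]
) -> Optional[str]:
    """Return the front-channel host to send as Host, or None.

    The override only applies when a separate internal URL is configured;
    otherwise the client's default Host header is already correct.
    """
    if not authentik_internal_url:
        return None
    # netloc keeps the port when one is present, e.g. "localhost:9000".
    return urlsplit(authentik_base_url).netloc


def back_channel_headers(
    authentik_base_url: str, authentik_internal_url: Optional[str]
) -> dict:
    """Hypothetical helper: headers to merge into each server-to-server call."""
    host = get_back_channel_host_header(authentik_base_url, authentik_internal_url)
    return {"Host": host} if host else {}
```

The resulting dict would be passed as the `headers` argument of each back-channel request, so the connection still targets the internal service name while the request carries the public host.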

## Production (app.tumblecode.dev)

```
API_BASE_URL=https://app.tumblecode.dev
AUTHENTIK_BASE_URL=https://auth.tumblecode.dev # front-channel host → sent as Host
AUTHENTIK_INTERNAL_URL=http://auth_server:9000 # back-channel (in-cluster)
AUTHENTIK_REDIRECT_URI=https://app.tumblecode.dev/auth/clerk/callback
CORS_ORIGINS=https://app.tumblecode.dev
AUTHENTIK_CLIENT_SECRET=<openssl rand -hex 32> # provider is confidential
```

The provider `client_type` is `confidential`, so a matching `client_secret` is
mandatory in production (the bundled stack already shares one via env). The api
will send `Host: auth.tumblecode.dev` on back-channel calls.

## Verification

- Unit: `pytest tests/test_back_channel_host.py` + auth suites → 22 passed.
- Live, against the running Authentik (simulating the patched code path):
- old (`Host: auth_server:9000`) → **404**
- new (`Host: localhost:9000`) + real client_secret + fake code → **400 `invalid_grant`**
— i.e. the request now reaches the token endpoint, client auth passes, only the
fake code is rejected. A real authorization code will succeed.
52 changes: 52 additions & 0 deletions ai_plans/2026-06-22_dockerize-cloud-backend.md
@@ -0,0 +1,52 @@
# Dockerize the self-hosted cloud backend

**Date:** 2026-06-22
**Scope:** `self-hosted-cloudapi/`

## Goal

Be able to run the self-hosted cloud API in a container.

## Finding

A `Dockerfile`, `.dockerignore`, and `docker-compose.yml` already existed and were
committed, but **the image did not build**. Proven by `docker build`:

```
OSError: Readme file does not exist: README.md
ERROR: process "/bin/sh -c uv sync --frozen --no-dev" did not complete successfully
```

### Root cause

- `pyproject.toml` declares `readme = "README.md"` under `[project]`.
- The final `RUN uv sync --frozen --no-dev` installs the project itself, so hatchling
reads project metadata and requires `README.md` to be present.
- `.dockerignore` excluded `*.md` (and `README.md`), so the file was not in the build
context → metadata validation fails.
- The earlier `uv sync ... --no-install-project` passes because it runs before
`COPY . .` and does not build the project, so it never touches the README.

## Fix

Add one line to `.dockerignore` (`!README.md`) so that `README.md` stays in the
build context while other markdown is still ignored.

```
*.md
!README.md
```

## Verification

1. `docker build -t roo-cloud-api:test .` — succeeds (was failing before).
2. `docker run ... uv run uvicorn src.main:app` — app imports cleanly through
uvicorn and reaches Pydantic settings validation; only stops on missing required
Authentik env vars, which `docker-compose.yml` supplies. Confirms the Python
entrypoint, dependency set, and app module are all sound in the image.

## Notes / possible follow-ups (not done)

- Container runs as root; a non-root `USER` could be added for hardening.
- A `HEALTHCHECK` and multi-stage build (smaller runtime image) are reasonable
future improvements but were out of scope for "make it build and run".
87 changes: 87 additions & 0 deletions ai_plans/2026-06-22_fix-fresh-db-bootstrap-crashloop.md
@@ -0,0 +1,87 @@
# Fix: api container crash-loop on a fresh database (Docker bring-up)

**Date:** 2026-06-22
**Scope:** `self-hosted-cloudapi/`

## Symptom

After `docker compose up -d`, every service is healthy **except `api`**, which is
stuck `Restarting (1)`. The backend is unreachable on `:8085`. Logs show:

```
sqlalchemy.exc.ProgrammingError: (...asyncpg...UndefinedTableError):
relation "authentik_state_store" does not exist
[SQL: ALTER TABLE authentik_state_store ALTER COLUMN created_at TYPE TIMESTAMP WITH TIME ZONE]
```

## Root cause (proven, not assumed)

The Dockerfile `CMD` runs `alembic upgrade head` **before** the app starts.
On a fresh `./.vol/postgres` volume that migration chain cannot build a schema:

- `a1b2c3d4e5f6_baseline.py` — `upgrade()` is `pass`. Creates **no tables**. Its
own docstring says it represents a pre-existing `create_all`'d DB you are meant
to `alembic stamp`.
- `b2c3d4e5f6a7_datetime_timezone.py` — immediately `ALTER`s `authentik_state_store`
(and `users`, `sessions`, …), tables that were never created → **crash**.
- `c3d4…`, `d4e5…`, `e5f6…` — all evolution-only (`add_column`, `create_index`).

The only thing that _creates_ tables is `Base.metadata.create_all` — in the app
lifespan ([src/main.py:30](../self-hosted-cloudapi/src/main.py#L30)), with the
ORM models as the single source of truth. But the app never starts, because
alembic crashes first in the `&&` chain.

So: **alembic-first ordering + a no-op baseline = a fresh DB can never bootstrap.**

A tempting non-fix is to make the baseline call `create_all`. That breaks too:
`create_all` produces the **head** schema, so the later `add_column` migrations
(`task_message_ts`, `task.workspace_path`) would then fail with _column already
exists_. The migrations are evolution steps for a _pre-head_ schema; they must not
be replayed against a freshly created head schema.

## Fix

Replace the blind `alembic upgrade head` with a small startup reconciler that
matches the project's actual design (models = source of truth; migrations = how
_existing_ deployments evolve):

- **Fresh DB** (no `users` table): `Base.metadata.create_all` builds the current
schema, then `alembic stamp head` records every migration as already applied
(without running the evolution steps).
- **Legacy DB** (tables exist, no `alembic_version`): follow the baseline's
documented path — `alembic stamp a1b2c3d4e5f6` then `alembic upgrade head` —
so an old pre-tz schema gets evolved.
- **Managed DB** (`alembic_version` present): `alembic upgrade head` as normal.
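
The three-way dispatch above can be sketched as a pure function over the set of existing table names. This is an illustration under stated assumptions: `classify_db_state` and `alembic_commands` are hypothetical names, the real probe lives in `src/db_bootstrap.py` and queries the live database, and checking `alembic_version` before `users` is a choice made here for the case where both conditions could apply.

```python
BASELINE_REVISION = "a1b2c3d4e5f6"  # baseline revision id named in the plan


def classify_db_state(table_names: set[str]) -> str:
    """Map the set of existing table names to FRESH / LEGACY / MANAGED."""
    if "alembic_version" in table_names:
        return "MANAGED"  # alembic already tracks this database
    if "users" in table_names:
        return "LEGACY"  # tables exist but were never stamped
    return "FRESH"  # nothing there yet


def alembic_commands(state: str) -> list[list[str]]:
    """Return the alembic CLI invocations to run for a given state."""
    if state == "FRESH":
        # Assumes create_all has already built the head schema;
        # stamping records it without replaying the evolution steps.
        return [["alembic", "stamp", "head"]]
    if state == "LEGACY":
        return [
            ["alembic", "stamp", BASELINE_REVISION],
            ["alembic", "upgrade", "head"],
        ]
    if state == "MANAGED":
        return [["alembic", "upgrade", "head"]]
    raise ValueError(f"unknown DB state: {state}")
```

Keeping the classification free of I/O makes each branch testable without a database; the entrypoint would only need to feed it the table names and run the returned commands in order.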

### Files

- `src/db_bootstrap.py` (new) — async probe of the live DB; prints
`FRESH` / `LEGACY` / `MANAGED` and runs `create_all` in the `FRESH` case. Uses
the same engine/models as the app, so there is one schema source of truth.
- `docker-entrypoint.sh` (new) — runs the probe, dispatches the correct alembic
command per state, then `exec`s uvicorn.
- `Dockerfile` — `CMD` now runs `docker-entrypoint.sh` (copied + `chmod +x`).

The app lifespan keeps its own idempotent `create_all` (harmless no-op once the
entrypoint has built the schema), so running the app outside Docker is unchanged.

## Verification

1. `docker compose down` + remove `./.vol/postgres` → truly fresh DB.
2. `docker compose up -d` → `api` reaches healthy/running, not restarting.
3. `docker compose logs api` shows `DB state: FRESH`, the stamp, and
`Application startup complete` — no `UndefinedTableError`.
4. `curl -fsS localhost:${PORT:-8085}/health` (or `/`) returns 200.
5. `docker compose exec api uv run alembic current` shows head
(`e5f6a7b8c9d0`), proving alembic and the schema agree.
6. Restart `api` → `DB state: MANAGED`, `upgrade head` no-op, still healthy
(idempotency).
7. `uv run pytest` stays green (entrypoint is Docker-only; no app code path
changed).

## Risks / follow-ups

- The `LEGACY` branch assumes a pre-tz schema (the baseline's documented
assumption). A legacy DB that was `create_all`'d at _head_ and never stamped
would fail `upgrade` on the `add_column` steps — but that is the pre-existing
documented contract, not introduced here, and not the Docker path.