128 changes: 128 additions & 0 deletions ai_plans/2026-06-19_finish-self-hosted-auth-flow.md
@@ -0,0 +1,128 @@
# Finish the self-hosted Authentik auth flow (operational + persistence)

**Date:** 2026-06-19
**Branch:** `feature/self-hosted-cloud-backend`
**Goal (user's words):** "We're using Authentik. Run Authentik, set up the app and
user account, integrate with the server, and ensure a user can authenticate and
**stay authenticated until logoff** even after returning next day/week."

---

## What was already true (verified in code, not assumed)

The auth _code_ is complete; the gap is operational. Evidence:

- **Browser OAuth flow** — [browser.py](../self-hosted-cloudapi/src/routers/browser.py):
PKCE → Authentik authorize → `/auth/clerk/callback` → user/session/ticket →
HTML bounce to `vscode://`. Authentik's own access/refresh tokens are used once
for `get_userinfo` and **never stored** → extension longevity does NOT depend on
the Authentik session lasting.
- **Durable credential** — the extension trades the single-use ticket for a
**client token** (opaque, SHA-256-hashed). `ClientToken.expires_at` is nullable
and **never set** → never expires server-side
([user.py](../self-hosted-cloudapi/src/models/user.py)).
`Session.expires_at` likewise never set; `is_active` flips to `False` only on
explicit `/v1/client/sessions/{id}/remove` (logoff).
- **Short-lived JWT** — [jwt_issuer.py](../self-hosted-cloudapi/src/auth/jwt_issuer.py)
`issue_session_token(expires_in=60)`. Re-minted on demand from the client token
via `POST /v1/client/sessions/{id}/tokens`.
- **Client-side persistence** — [WebAuthService.ts:214](../packages/cloud/src/WebAuthService.ts#L214)
stores `{clientToken, sessionId}` in VS Code `SecretStorage`, reloads on
`initialize()`, and `refreshSession()` re-mints a JWT on a `RefreshTimer`.

**Conclusion:** "return next day/week, stay logged in until logoff" is already the
design. The only missing pieces are operational: Authentik is down and the cloud API isn't running.
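
The credential chain above can be modelled in a few lines. This is an illustrative sketch only, not the code in `jwt_issuer.py` or `user.py`: the in-memory dictionaries, the `issue_client_token` and `mint_jwt` helpers, the `sid` claim name, and the HS256 signing key are all assumptions made to keep the example self-contained.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time

SECRET = b"demo-signing-key"  # assumption: stands in for the server's signing key

# Hypothetical in-memory stand-ins for the ClientToken and Session tables.
client_tokens = {}  # sha256(token) -> session_id; no expiry is ever recorded
sessions = {}       # session_id -> {"is_active": bool}


def _b64(data: bytes) -> str:
    """URL-safe base64 without padding, as used in JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def issue_client_token(session_id: str) -> str:
    """Create an opaque token and store only its SHA-256 hash."""
    token = secrets.token_urlsafe(32)
    client_tokens[hashlib.sha256(token.encode()).hexdigest()] = session_id
    sessions[session_id] = {"is_active": True}
    return token


def mint_jwt(client_token: str, session_id: str, expires_in: int = 60, now=None):
    """Return a short-lived JWT, or None if the token or session is invalid."""
    digest = hashlib.sha256(client_token.encode()).hexdigest()
    if client_tokens.get(digest) != session_id:
        return None
    if not sessions[session_id]["is_active"]:
        return None  # an explicit logoff flipped is_active to False
    issued = int(now if now is not None else time.time())
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps({"sid": session_id, "exp": issued + expires_in}).encode())
    sig = hmac.new(SECRET, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64(sig)}"
```

In this model a token issued today still mints a fresh JWT a week later, because nothing on the server side records an expiry; only setting `is_active` to `False` ends the session.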

## Environment findings

- Existing Authentik lives in compose project `llm` at
`/opt/docker/llm/docker-compose.yaml`: `auth_db` (postgres:16-alpine),
`auth_server`, `auth_worker` (goauthentik 2026.2.2). All **exited ~4 weeks ago**.
- Data survived on bind mounts: `/opt/docker/llm/vol/auth/{data,postgres,certs}`.
- `/opt/docker/llm/.env` still holds `AUTHENTIK_SECRET_KEY` (so existing encrypted
DB rows stay decryptable — must NOT change it) and `AUTH_PG_PASS`.
- **Two problems with reviving as-is:**
1. **No Redis** in the stack. Authentik requires Redis (Celery broker for the
worker, cache, Channels layer). `auth_worker` exited with code **1**, consistent with a
missing broker. Must add a `redis:alpine` service.
2. **Port 5432 conflict.** `auth_db` maps host `5432:5432`, but the unrelated,
currently-running `voicebot-database` holds host 5432.
- Cloud API config: [self-hosted-cloudapi/.env](../self-hosted-cloudapi/.env) →
`DATABASE_URL=postgresql://authentik:…@localhost:5432/stork_code`,
`AUTHENTIK_BASE_URL=http://localhost:9000`, `AUTHENTIK_APP_SLUG=stork-code`,
`AUTHENTIK_CLIENT_ID=nLV79xyh…`, `AUTHENTIK_REDIRECT_URI=http://localhost:8085/auth/clerk/callback`,
`API_BASE_URL=http://localhost:8085`. So the cloud API uses a **second database
`stork_code`** on the same Authentik Postgres, connecting as user `authentik`.
- Free host ports: 5544, 9000, 9443, 6379, 8085.

## Decisions (from the user)

1. **Revive the existing `/opt/docker/llm` stack** (keep `stork-code` app + users) and
**add the missing Redis**.
2. **Remap `auth_db` host port 5432 → 5544** (zero disruption to voicebot);
point the cloud API `.env` at 5544. Authentik's internal `auth_db:5432`
connection is over the compose network and unaffected.

## Plan

### 1. Edit `/opt/docker/llm/docker-compose.yaml`

- Add `auth_redis` (`redis:alpine`, healthcheck `redis-cli ping`, volume
`./vol/auth/redis:/data`, restart unless-stopped, **no host port** — internal only).
- Add `AUTHENTIK_REDIS__HOST: auth_redis` to `auth_server` and `auth_worker` env.
- Add `auth_redis` (condition: service_healthy) to `depends_on` of server + worker.
- Change `auth_db` host port `"5432:5432"` → `"5544:5432"`.

### 2. Bring up the auth stack

`docker compose -f /opt/docker/llm/docker-compose.yaml up -d auth_db auth_redis auth_server auth_worker`

- Verify `auth_db` healthy, `auth_worker` stays up (no exit 1), `auth_server` logs
show "Starting authentik server", `GET http://localhost:9000/-/health/ready/` → 200.
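
The readiness check can be scripted rather than eyeballed. The sketch below polls the `/-/health/ready/` endpoint named above until it answers 200; the helper names, retry count, and delay are assumptions, and the `fetch` and `sleep` parameters exist only so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status for url, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_ready(url, attempts=30, delay=2.0, fetch=http_status, sleep=time.sleep):
    """Poll url until it answers 200; return True on success, False otherwise."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_ready("http://localhost:9000/-/health/ready/")` after `docker compose up -d` would block for at most about a minute with these defaults.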

### 3. Verify/restore Authentik config

- Confirm an OAuth2 **Provider** exists whose client ID == `AUTHENTIK_CLIENT_ID`,
with redirect URI `http://localhost:8085/auth/clerk/callback`, bound to an
**Application** slug `stork-code`. Confirm scopes include `openid email profile`.
- Confirm at least one **user** exists (or create one + set a password).
- If the provider/app didn't survive, recreate it and sync `client_id`/`secret`
into `self-hosted-cloudapi/.env`.

### 4. Wire + migrate the cloud API DB

- Update `self-hosted-cloudapi/.env` `DATABASE_URL` port `5432` → `5544`.
- Ensure database `stork_code` exists on the Authentik Postgres (create if missing:
`CREATE DATABASE stork_code OWNER authentik;`).
- Run `uv run alembic upgrade head`.
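
The port change in `DATABASE_URL` is small enough to do by hand, but a scripted rewrite avoids touching the credentials by accident. A minimal sketch, assuming a plain `user:password@host:port` authority (no IPv6 literals); `with_port` is a hypothetical helper and the password in the usage note is a placeholder, not the real value:

```python
from urllib.parse import urlsplit, urlunsplit


def with_port(database_url: str, new_port: int) -> str:
    """Return database_url with its port replaced, leaving everything else intact."""
    parts = urlsplit(database_url)
    userinfo, _, hostport = parts.netloc.rpartition("@")
    # Drop the old port, if any; assumes the host is a name or IPv4 address.
    host = hostport.rsplit(":", 1)[0] if ":" in hostport else hostport
    netloc = f"{userinfo}@{host}:{new_port}" if userinfo else f"{host}:{new_port}"
    return urlunsplit(parts._replace(netloc=netloc))
```

For example, `with_port("postgresql://authentik:secret@localhost:5432/stork_code", 5544)` yields the same URL with `localhost:5544`.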

### 5. Start the cloud API

- `uv run uvicorn src.main:app --host 0.0.0.0 --port 8085` (background).
- Smoke: `GET /` → `{"status":"ok"}`; auth routes mounted.

### 6. End-to-end + persistence verification

- Drive the OAuth flow (curl through authorize → callback, or the extension) to a
ticket; `POST /v1/client/sign_ins` → capture client token + `created_session_id`;
`POST /v1/client/sessions/{id}/tokens` → **200 + jwt** (the line that used to 404);
`GET /v1/me` → 200.
- **Persistence proof:** with the same stored client token, re-call the tokens
endpoint (simulating "next day" after the 60 s JWT expired) → fresh jwt, no
re-login. Confirms client token + session never expire and the extension's
`RefreshTimer` path works.

## Out of scope (tracked separately)

- Authentik groups → `org_id` mapping ([browser.py:281](../self-hosted-cloudapi/src/routers/browser.py#L281)).
User did not ask for orgs; JWT simply omits `r.o`. Revisit only if org-scoped
features are needed.
- Anthropic streaming SSE conversion, Google/xAI providers, marketplace org
filtering, admin API — unrelated to the auth flow.

## Risk / rollback

- Compose edits are additive (Redis) + a host-port remap; revert the three hunks to
restore the original file. The `voicebot` project is never touched.
- `AUTHENTIK_SECRET_KEY` is left unchanged → existing encrypted data stays readable.
- Cloud API `.env` change is a single port number; revert to 5432 if 5544 is undesired.
124 changes: 124 additions & 0 deletions ai_plans/2026-06-19_web-task-list-and-viewer.md
@@ -0,0 +1,124 @@
# Self-hosted Cloud: web task list + read-only task viewer

**Date:** 2026-06-19
**Branch:** `feature/self-hosted-web-task-viewer` (stacked on `feature/self-hosted-cloud-backend`)
**Goal (user's words):** "see the list of tasks on web page (after backend run) and after
click on task, we should see the same flow as in Tumble Code."

---

## Context

Auth is finished (Authentik → cloud API on `:8085`, user signed in). Next: a web page listing
the user's tasks and a read-only conversation view per task. Investigation found the
extension's "share task to cloud" pipeline is **broken end-to-end on the backend** and there
is **no web frontend**. Live DB: 1 user, `tasks=0, task_messages=0, task_shares=0`.

**Decisions (user):** lightweight read-only renderer (not reusing the VS Code-coupled
`ChatRow.tsx`); list **shared tasks only**; server-rendered Jinja2 + minimal JS inside FastAPI
(no Node build).

## How task data arrives (verified)

Upload happens only on **Share** in the extension:
`ShareButton` → `shareCurrentTask` (webviewMessageHandler.ts:844) →
`CloudService.shareTask(taskId, visibility, clineMessages)` (CloudService.ts:315):

1. `POST /api/extension/share`; on **404** `TaskNotFoundError` →
2. `POST /api/events/backfill` (multipart `task.json` = full `ClineMessage[]`,
TelemetryClient.ts:238) → retry share.

`ClineMessage`: packages/types/src/message.ts:249.

## Four blockers (each verified)

- **A. Sharing disabled (button dead).** `enable_task_sharing` comes only from org settings;
no org → `cloud_settings=None` (settings_service.py:27) → client `canShareTask()` false →
Share button disabled (ShareButton.tsx:128).
- **B. Wrong status code.** `share_task` returns HTTP 200 `{success:false}` (share_service.py:25)
but client backfills only on HTTP 404 (CloudAPI.ts:97). First share silently fails.
- **C. No `Task` row.** `backfill_messages` inserts `TaskMessage` (FK → `tasks.id`) but never
creates the parent `Task` → IntegrityError. FK `task_messages_task_id_fkey` confirmed live.
- **D. No web UI.** `share_url` = relative `/shared/{id}`, unserved; no list endpoint; auth is
Bearer-JWT only (dependencies.py:15) — browser needs a cookie session.
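
Blocker B is easiest to see in a toy model of the client's share logic. The sketch below is not the extension code; `FakeBackend` and `share_task` are hypothetical stand-ins that only reproduce the rule "backfill when, and only when, the first share returns HTTP 404".

```python
class FakeBackend:
    """Hypothetical in-memory backend.

    missing_status is the HTTP status returned by share() for an unknown task.
    """

    def __init__(self, missing_status: int):
        self.missing_status = missing_status
        self.tasks = set()

    def share(self, task_id):
        if task_id in self.tasks:
            return 200, {"success": True, "share_url": f"/shared/{task_id}"}
        if self.missing_status == 404:
            return 404, {"detail": "task not found"}
        return 200, {"success": False}  # the behaviour described in blocker B

    def backfill(self, task_id, messages):
        self.tasks.add(task_id)
        return 200


def share_task(backend, task_id, messages):
    """Mirror the client rule: backfill only when the first share returns 404."""
    status, body = backend.share(task_id)
    if status == 404:
        backend.backfill(task_id, messages)
        status, body = backend.share(task_id)
    return body.get("success", False)
```

With `FakeBackend(200)` the first share comes back as a "successful" HTTP response carrying `success: false`, so the backfill never runs; with `FakeBackend(404)` the same call backfills and then succeeds.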

## Implementation

### 1. Backend fixes (no extension changes)

- **A** `settings_service.py`: org-less → return `OrganizationCloudSettings(enable_task_sharing=
True, allow_public_task_sharing=True)`. New `enable_task_sharing: bool = True` in
`config/settings.py` (env `ENABLE_TASK_SHARING`).
- **A.2 (follow-up, 2026-06-20):** the Share button stayed disabled after A even with the
correct settings live. Root cause: the extension caches org settings and only replaces them
when `version` changes (`CloudSettingsService.fetchSettings`, version check at
CloudSettingsService.ts:139). The org-less response hardcoded `version = 0`; the client had
cached `version:0, cloudSettings:null` at the login _before_ A, so the new (still `0`)
response was rejected as unchanged. Fix: org-less `version` is now content-derived
(`_content_version` = sha256 of the cloud-settings payload → 32-bit int), so it differs from
the stale `0` and auto-bumps on any future toggle. Client re-fetches hourly + on session
start, so a window reload (or sign-out/in) applies it immediately.
- **B** `routers/extension.py` share: raise `HTTPException(404)` when task missing.
- **C** `services/telemetry_service.py` `backfill_messages(user_id, …)`: get-or-create
`Task(id, user_id)`, delete existing `TaskMessage`s for the task, re-insert in order; pass
`current_user` from `routers/events.py`.
- **share_url absolute** `services/share_service.py`: `{api_base_url}/shared/{id}`,
manage `{api_base_url}/app/tasks/{id}`.
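
The content-derived version from A.2 can be sketched as follows. The plan only states "sha256 of the cloud-settings payload → 32-bit int"; the canonical JSON encoding and the choice of the first four digest bytes are assumptions here, and `content_version` is an illustrative name rather than the actual `_content_version` implementation.

```python
import hashlib
import json


def content_version(cloud_settings: dict) -> int:
    """Derive a stable 32-bit version number from the settings payload."""
    # Sort keys so that logically equal payloads hash to the same value.
    canonical = json.dumps(cloud_settings, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # first 4 bytes -> 0 .. 2**32 - 1
```

Because the client replaces its cache only when `version` changes, any toggle in the payload changes the hash and therefore invalidates the cached copy, while an unchanged payload keeps the same version across restarts.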

### 2. Browser session auth — `src/auth/web_session.py`

Reuse `generate_pkce_pair`, `get_authorize_url`, `store_oauth_state`, `get_oauth_state`,
`exchange_code_for_tokens`, `get_userinfo`, `get_or_create_user`, `create_session`. Reuse the
single `/auth/clerk/callback` redirect URI; branch on the stored `auth_redirect` marker
(`http(s)://` = web → set cookie + 302 `/app`; `vscode://` = existing bounce). Cookie
`tumble_session` = itsdangerous-signed `{session_id,user_id}`, 30-day, HttpOnly, SameSite=Lax.
`get_web_user` dependency validates the cookie + `Session.is_active`. Routes `/app/login`,
`/app/logout`.
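
For illustration, the signed-cookie idea looks roughly like this. The plan uses `itsdangerous`; the sketch swaps in a plain standard-library HMAC so it runs without extra dependencies, and `sign_cookie`, `verify_cookie`, and the `iat` field are hypothetical names, not the contents of `web_session.py`.

```python
import base64
import hashlib
import hmac
import json
import time

MAX_AGE = 30 * 24 * 3600  # 30 days, matching the cookie lifetime in the plan


def sign_cookie(secret: bytes, session_id: str, user_id: str, now=None) -> str:
    """Serialize the session reference and append an HMAC-SHA256 signature."""
    payload = {
        "session_id": session_id,
        "user_id": user_id,
        "iat": int(now if now is not None else time.time()),
    }
    body = base64.urlsafe_b64encode(json.dumps(payload, sort_keys=True).encode()).decode()
    mac = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{mac}"


def verify_cookie(secret: bytes, value: str, now=None):
    """Return the payload if the signature is valid and fresh, else None."""
    body, sep, mac = value.rpartition(".")
    if not sep:
        return None
    expected = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac.encode(), expected.encode()):
        return None
    payload = json.loads(base64.urlsafe_b64decode(body.encode()))
    current = int(now if now is not None else time.time())
    if current - payload["iat"] > MAX_AGE:
        return None
    return payload
```

A valid signature only proves the cookie was issued by the server; the `get_web_user` dependency described above still has to confirm that the referenced session is active.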

### 3. Web router + templates — `src/routers/web.py`, `src/web/templates/`, `src/web/static/`

- `GET /app` task list (own tasks, newest first, derived title + counts).
- `GET /app/tasks/{id}` detail (session + ownership).
- `GET /shared/{id}` public target (anon if visibility public, else session).
- Lightweight renderer: parse each `TaskMessage.message_data` `ClineMessage`, render per
say/ask type; vendored `marked` + minimal highlight CSS; dark theme like browser.py.
- Mount `Jinja2Templates` + `StaticFiles` in `main.py`; add `jinja2` dep.

### 4. Migration

`tasks` already has `user_id`, so add a migration only if the DDL changes. Current Alembic head: `b2c3d4e5f6a7`.

## Out of scope

Auto-sync all tasks; React component reuse; Authentik group→org_id mapping.

## Verification

**Status: implemented & verified (2026-06-19).**

Automated (`tests/test_web_and_share.py`, 9 new tests; full suite **29 passed**):

- B: `POST /api/extension/share` for an unknown task → **404**.
- C: `POST /api/events/backfill` creates the `Task` row + 3 `TaskMessage`s; re-share with a
shorter set **replaces** (count 1, still 1 Task) — idempotent.
- Web: `/app` without session → **303** `/app/login`; `/app` with session lists owned tasks
(derived title rendered); `/app/tasks/{id}` for a non-owner → **404**; `/shared/{id}`
public → 200 anon, private → 303 login, unknown → 404.
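
The behaviour checked for C (parent row created, re-share replaces rather than appends) can be reproduced against an in-memory SQLite database. This is a sketch under simplified assumptions: the two-table schema, the `position` column, and this `backfill_messages` signature are illustrative, not the project's SQLAlchemy models or service code.

```python
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway database with an enforced tasks -> messages FK."""
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    db.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, user_id TEXT NOT NULL)")
    db.execute(
        "CREATE TABLE task_messages ("
        " id INTEGER PRIMARY KEY,"
        " task_id TEXT NOT NULL REFERENCES tasks(id),"
        " position INTEGER NOT NULL,"
        " message_data TEXT NOT NULL)"
    )
    return db


def backfill_messages(db, user_id, task_id, messages):
    """Get-or-create the parent task, then replace its messages in order."""
    with db:  # one transaction: commit on success, roll back on error
        db.execute(
            "INSERT OR IGNORE INTO tasks (id, user_id) VALUES (?, ?)",
            (task_id, user_id),
        )
        db.execute("DELETE FROM task_messages WHERE task_id = ?", (task_id,))
        db.executemany(
            "INSERT INTO task_messages (task_id, position, message_data) VALUES (?, ?, ?)",
            [(task_id, i, json.dumps(m)) for i, m in enumerate(messages)],
        )
```

Inserting a message without the parent row raises `sqlite3.IntegrityError`, which is the analogue of the original failure; calling `backfill_messages` twice with different message lists leaves one task row and only the latest messages.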

Live smoke (uvicorn on throwaway sqlite):

- `/health` 200; `/app` → 303 `/app/login`; `/app/login` → 307 to Authentik authorize URL
(PKCE + state present).
- `/static/app.css` 200 `text/css`; `/static/render.js` 200 `text/javascript`;
vendored `marked.min.js` (35479 B) + `purify.min.js` (21496 B) served.
- A: `get_extension_settings(org_id=None)` →
`cloudSettings.enableTaskSharing == true`, `allowPublicTaskSharing == true`.

Remaining manual step (needs the real extension + Authentik, not scriptable here):
sign in to the extension, run a small task, click **Share**, confirm the live log shows
`share 404 → backfill 200 → share 200` and `/app` lists it, `/shared/{id}` renders it.

## Risk / rollback

Backend changes are additive + one status-code change; revert files to restore. Web routes are
new and isolated. No `AUTHENTIK_SECRET_KEY` change. New cookie uses existing `secret_key`.
91 changes: 91 additions & 0 deletions ai_plans/2026-06-20_fix-share-button-disabled-null-settings.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,91 @@
# Fix: Share button stays disabled — extension-settings emits JSON `null`, client Zod rejects it

Date: 2026-06-20
Branch: feature/self-hosted-web-task-viewer
Area: self-hosted-cloudapi (settings serialization)

## Symptom

In the extension Task header the Share (Share2) icon is rendered **disabled** (greyed
out, not clickable). Tooltip would read "sharingDisabledByOrganization".

## Root cause (proven with evidence)

The Share button is disabled in exactly one branch of `ShareButton.tsx`:

```
cloudIsAuthenticated && !sharingEnabled -> disabled
```

`sharingEnabled` comes from `CloudService.canShareTask()`, which returns
`settingsService.getSettings()?.cloudSettings?.enableTaskSharing`. That value is
populated by `CloudSettingsService.fetchSettings()` from `GET /api/extension-settings`.

Evidence collected:

1. **Backend returns the right value over HTTP (200):** for the real user
`user_2c8fdf212b024808aa7a1ba1a`, `organization.cloudSettings.enableTaskSharing`
is `true`. So the data is correct.

2. **But the client never stores it.** The extension's VS Code `globalState`
(`QUB-IT.tumble-code` → key `organization-settings`) is **`null`**. The fetch
never populated the cache.

3. **Why the cache is null — schema parse fails.** The backend (Pydantic) serializes
_unset_ `Optional` fields as JSON `null`:
`features:null, hiddenMcps:null, hideMarketplaceMcps:null, mcps:null,
providerProfiles:null, cloudSettings.recordTaskMessages:null, ...` and on the user
side `settings.taskSyncEnabled:null`.

The client schemas (`packages/types/src/cloud.ts`) declare these as `.optional()`,
which accepts `undefined` but **rejects `null`**. Running the real
`organizationSettingsSchema` / `userSettingsDataSchema` against the live response:

```
ORG parse success: false
cloudSettings.recordTaskMessages: Expected boolean, received null
features: Expected object, received null
hiddenMcps: Expected array, received null
... (10 issues)
USER parse success: false
settings.taskSyncEnabled: Expected boolean, received null
```

`parseExtensionSettingsResponse` therefore returns `{success:false}`,
`fetchSettings()` logs "Invalid extension settings format" and returns without
assigning `this.settings`. Cache stays `null` → `canShareTask()` → `false` →
button disabled.

4. **Fix verified:** stripping nulls (what `exclude_none=True` does) makes both
schemas parse, and `enableTaskSharing === true`.

This is a backend contract bug: "optional" in the client means _may be absent_, not
_may be explicit null_. The backend must omit unset optionals rather than emit `null`.

## Fix

Serialize the settings responses with nulls omitted. Add
`response_model_exclude_none=True` to both routes in
`self-hosted-cloudapi/src/routers/settings.py`:

- `GET /api/extension-settings`
- `PATCH /api/user-settings` (same model family, parsed by the same strict client
schema in `CloudSettingsService.updateUserSettings`)

`exclude_none` only drops `null` values; required non-null fields (`version`,
`defaultSettings: {}`, `allowList`, `cloudSettings.enableTaskSharing: true`,
`features: {}`) are preserved.
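
To make the effect concrete, the following plain-Python function approximates what dropping `None`-valued fields does to a nested payload. It is a stand-in for illustration, not Pydantic's implementation, and the sample payload is a made-up subset of the real response.

```python
def strip_none(value):
    """Recursively drop dict keys whose value is None; keep everything else."""
    if isinstance(value, dict):
        return {k: strip_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_none(v) for v in value]
    return value


# Hypothetical sample resembling the organization settings in the response.
raw = {
    "version": 1,
    "defaultSettings": {},
    "hiddenMcps": None,
    "cloudSettings": {"enableTaskSharing": True, "recordTaskMessages": None},
}
cleaned = strip_none(raw)
```

Here `cleaned` keeps `version`, the empty `defaultSettings` object, and `cloudSettings.enableTaskSharing`, while `hiddenMcps` and `cloudSettings.recordTaskMessages` are absent, which is the shape an `.optional()` schema accepts.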

## Post-fix activation

The running uvicorn has no `--reload`, and the client already cached `null`:

1. Restart the backend so the new serialization takes effect.
2. Reload the extension host window (or sign out/in). On the next fetch the response
parses, `organization-settings` is cached, and the Share button enables.

## Tests

Extend `self-hosted-cloudapi/tests/test_web_and_share.py` (or settings tests) to
assert the `/api/extension-settings` response contains **no `null` values** at any
nesting level, and that `organization.cloudSettings.enableTaskSharing` is present.
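
One way to express the "no `null` at any nesting level" assertion is a small recursive helper that reports where nulls occur, so a failing test names the offending fields. `null_paths` is a hypothetical helper, not an existing function in the test suite.

```python
def null_paths(value, prefix=""):
    """Return the dotted paths of every None found in a decoded JSON value."""
    found = []
    if value is None:
        found.append(prefix or "<root>")
    elif isinstance(value, dict):
        for key, item in value.items():
            found.extend(null_paths(item, f"{prefix}.{key}" if prefix else str(key)))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            found.extend(null_paths(item, f"{prefix}[{index}]"))
    return found
```

A test could then assert `null_paths(response.json()) == []`, where `response` is whatever the test client returns for `GET /api/extension-settings`.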