cortex

Deployment repo for a self-hosted Ollama instance. The image ships with two pre-pulled models:

qwen2.5:3b (default) — general chat, drafting, summarisation
gemma3:1b — small/fast for classification, tagging, extraction

This repo only owns the Ollama deployment. The application code that calls it (the /orgs/{orgSlug}/ai/chat API) lives in partners-connect/server — see the spec at partners-connect/server/specs/026-llm-chat-integration/.

What's here

cortex/
├── ollama/
│   └── Dockerfile               # builds an Ollama image with qwen2.5:3b + gemma3:1b pre-pulled
├── Dockerfile.deploy            # consumed by Clever Cloud, pulls the image from GHCR
├── docker-compose.yml           # local smoke test
├── .github/workflows/
│   ├── build-push-ollama.yaml   # CI: build & push to ghcr.io on every main push
│   └── deploy-ollama.yaml       # CD: deploy to Clever Cloud
└── DEPLOYMENT.md                # one-time Clever Cloud setup runbook

Local smoke test

docker compose up --build

Locally, Ollama is published on host port 11435 so it does not collide with a native Ollama install on the dev machine (which uses 11434). Inside the compose network, services still reach Ollama at http://ollama:11434. In production (Clever Cloud), the deploy workflow sets the OLLAMA_HOST env var so that Ollama listens on 8080, the port Clever Cloud's healthcheck polls (0.0.0.0:8080).
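The three ports are easy to mix up, so here is a minimal sketch of how a client might pick the base URL depending on where it runs. The function name, the context labels, and the example production hostname are hypothetical; only the ports and the `ollama` service name come from this README.

```python
# Default hostname and port for each place Ollama is reached from.
# Ports are taken from this README; the context names are illustrative.
ENDPOINTS = {
    "host": ("127.0.0.1", 11435),   # dev machine, via the compose port mapping
    "compose": ("ollama", 11434),   # other services inside the compose network
    "clever_cloud": (None, 8080),   # production; the hostname must be supplied
}


def ollama_base_url(context, hostname=None):
    """Return the base URL for the given context (hypothetical helper)."""
    default_host, port = ENDPOINTS[context]
    host = hostname or default_host
    if host is None:
        raise ValueError(f"a hostname is required for context {context!r}")
    return f"http://{host}:{port}"


print(ollama_base_url("host"))
print(ollama_base_url("compose"))
```

For production, pass whatever internal hostname the network group gives the Ollama app; this README does not specify it.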

First build takes ~4–5 minutes (pulls Ollama, then pulls qwen2.5:3b ~1.9 GB and gemma3:1b ~815 MB into the image).

Once running:

# List models baked into the image
curl -s http://127.0.0.1:11435/api/tags | jq

# Test inference
curl -s -X POST http://127.0.0.1:11435/api/generate \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen2.5:3b","prompt":"Reply with one word: hello","stream":false}'
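The same inference call can be sketched in Python using only the standard library. This is an illustration, not the code used by partners-connect/server; the helper names are hypothetical. It relies on the non-streaming `/api/generate` response carrying the generated text in its `response` field.

```python
import json
import urllib.request


def build_generate_request(base_url, model, prompt):
    """Build the POST request that mirrors the curl call above."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, model, prompt, timeout=120):
    """Send the request and return the generated text."""
    req = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Usage, with the compose stack running:
#   generate("http://127.0.0.1:11435", "qwen2.5:3b", "Reply with one word: hello")
```

Splitting request construction from sending keeps the payload easy to inspect without a running server.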

Stop:

docker compose down

Deployment pipeline

git push main ─▶ build-push-ollama.yaml ─▶ ghcr.io/<owner>/<repo>-ollama:<sha>
                                       └▶ ghcr.io/<owner>/<repo>-ollama:latest

manual trigger ─▶ deploy-ollama.yaml ─▶ Clever Cloud (single instance)

The image is built automatically on every push to main. Deploys are manual: run gh workflow run "CD - Deploy Ollama to Clever Cloud" -f image_tag=latest, or trigger the workflow from the GitHub Actions UI.
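To make the tagging scheme in the diagram concrete, here is a small sketch that derives both image references for a commit. The helper name and the example owner are hypothetical. Lowercasing is included because container repository names must be lowercase; whether the workflow tags with the full or a shortened SHA is not stated here, so the SHA is passed in as-is.

```python
def image_refs(owner, repo, sha):
    """Return the two references pushed for one commit: <sha> and latest."""
    # Repository names in image references must be lowercase.
    repository = f"ghcr.io/{owner}/{repo}-ollama".lower()
    return [f"{repository}:{sha}", f"{repository}:latest"]


# "Acme" is a placeholder owner, not the real organisation.
for ref in image_refs("Acme", "cortex", "abc123"):
    print(ref)
```

Deploying with image_tag set to a specific SHA instead of latest pins the deploy to a known build.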

First-time setup

See DEPLOYMENT.md for the one-time Clever Cloud setup (create apps, configure network group, set GitHub secrets).

Changing the default model

Edit the ollama pull lines in ollama/Dockerfile. Push to main to rebuild the image, then re-trigger the deploy workflow. Models get baked into the image, so a larger model means a larger image (and longer pulls on Clever Cloud).

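Having multiple models on disk doesn't multiply runtime memory: Ollama loads a model into RAM only when a request asks for it and unloads it after an idle period. Image size grows with every baked-in model, but RAM only needs to fit the models that are loaded at the same time. Recent Ollama versions can keep several models loaded when memory allows, so set OLLAMA_MAX_LOADED_MODELS=1 if the instance is sized for the largest single model.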

Sizing guide for the Clever Cloud instance (runtime RAM, with 8K context):

| Model | Disk | Runtime RAM | Min instance |
|---|---|---|---|
| `gemma3:1b` | 815 MB | ~2.0 GB | S (~2 GB), tight |
| `qwen2.5:1.5b` | 986 MB | ~1.8 GB | S (~2 GB) |
| `qwen2.5:3b` (default) | 1.9 GB | ~3.5 GB | M (~4 GB) |
| `llama3.2:3b` | 2.0 GB | ~3.5 GB | M (~4 GB) |
| `gemma3:4b` | 3.3 GB | ~5 GB | L (~8 GB) |
| `qwen2.5:7b` | 4.7 GB | ~6 GB | L (~8 GB) |
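The sizing rule can be sketched as a small lookup: disk usage adds up across baked-in models, while RAM and instance size follow the largest single model. The figures are copied from the table above; the function name is hypothetical, and the sketch assumes only one model is resident in RAM at a time.

```python
# (disk GB, runtime RAM GB at 8K context, minimum instance), from the sizing table.
SIZING = {
    "gemma3:1b": (0.815, 2.0, "S"),
    "qwen2.5:1.5b": (0.986, 1.8, "S"),
    "qwen2.5:3b": (1.9, 3.5, "M"),
    "llama3.2:3b": (2.0, 3.5, "M"),
    "gemma3:4b": (3.3, 5.0, "L"),
    "qwen2.5:7b": (4.7, 6.0, "L"),
}
INSTANCE_ORDER = ["S", "M", "L"]


def plan(models):
    """Return (total model disk GB, peak RAM GB, minimum instance)."""
    disk = sum(SIZING[m][0] for m in models)   # every model adds to the image
    ram = max(SIZING[m][1] for m in models)    # assumes one model loaded at a time
    instance = max((SIZING[m][2] for m in models), key=INSTANCE_ORDER.index)
    return round(disk, 3), ram, instance


print(plan(["qwen2.5:3b", "gemma3:1b"]))
```

For the default image this comes to roughly 2.7 GB of model data, about 3.5 GB of RAM, and an M instance.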

For runtime-pulled models (not baked in), attach a Clever Cloud FS Bucket add-on mounted at /root/.ollama so models persist across restarts.

Security note

The Ollama API has no authentication. The Clever Cloud Ollama app must NOT have a public domain — only the partners-connect server (on the same network group) should be able to reach it. See DEPLOYMENT.md.
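One way to sanity-check the isolation is a plain TCP probe run from a machine outside the network group. This is a generic sketch, not part of the repo; the hostname and port you probe depend on your Clever Cloud setup and are not specified in this README.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Usage from outside the network group (hostname is a placeholder):
#   is_reachable("ollama-app.example.invalid", 8080)  # expected: False
```

A False result from outside only shows that this particular address is closed; it does not replace checking the app's domain configuration as described in DEPLOYMENT.md.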
