cortex

Deployment repo for a self-hosted Ollama instance. The image ships with two pre-pulled models:

  • qwen2.5:3b (default) — general chat, drafting, summarisation
  • gemma3:1b — small/fast for classification, tagging, extraction

This repo only owns the Ollama deployment. The application code that calls it (the /orgs/{orgSlug}/ai/chat API) lives in partners-connect/server — see the spec at partners-connect/server/specs/026-llm-chat-integration/.

What's here

cortex/
├── ollama/
│   └── Dockerfile               # builds an Ollama image with qwen2.5:3b + gemma3:1b pre-pulled
├── Dockerfile.deploy            # consumed by Clever Cloud, pulls the image from GHCR
├── docker-compose.yml           # local smoke test
├── .github/workflows/
│   ├── build-push-ollama.yaml   # CI: build & push to ghcr.io on every main push
│   └── deploy-ollama.yaml       # CD: deploy to Clever Cloud
└── DEPLOYMENT.md                # one-time Clever Cloud setup runbook

Local smoke test

docker compose up --build

Host-side port is 11435 locally to avoid colliding with a native Ollama install on the dev machine (which uses 11434). Inside the compose network, services still reach Ollama at http://ollama:11434. In production (Clever Cloud), Ollama is forced to listen on 8080 via the OLLAMA_HOST env var (set by the deploy workflow) so Clever Cloud's healthcheck — which polls 0.0.0.0:8080 — succeeds.
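The local addresses above can be kept in one small helper so scripts don't hard-code ports. This is a minimal sketch: the function name `ollama_url` and the context labels `host` and `compose` are hypothetical, and production is left out because the Clever Cloud hostname isn't stated here.

```shell
# Hypothetical helper: map a calling context to the Ollama base URL described above.
ollama_url() {
  case "$1" in
    host)    echo "http://127.0.0.1:11435" ;;  # dev machine, through the compose port mapping
    compose) echo "http://ollama:11434" ;;     # another service on the compose network
    *)       echo "unknown context: $1" >&2; return 1 ;;
  esac
}
```

For example, `curl -s "$(ollama_url host)/api/tags"` queries the locally running container from the dev machine.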

First build takes ~4–5 minutes (pulls Ollama, then pulls qwen2.5:3b ~1.9 GB and gemma3:1b ~815 MB into the image).

Once running:

# List models baked into the image
curl -s http://127.0.0.1:11435/api/tags | jq

# Test inference
curl -s -X POST http://127.0.0.1:11435/api/generate \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen2.5:3b","prompt":"Reply with one word: hello","stream":false}'
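The model listing can also be checked automatically, for instance in a smoke-test script. The sketch below assumes the /api/tags response lists each model under a "name" field; the function names `has_model` and `check_models` are hypothetical, and it uses tr and grep instead of jq only to avoid the extra dependency.

```shell
# Hypothetical helper: succeed if the /api/tags JSON on stdin lists the given model.
# Whitespace is stripped first so compact and pretty-printed JSON both match.
has_model() {
  tr -d ' \n\t' | grep -qF "\"name\":\"$1\""
}

# Hypothetical wrapper: fetch /api/tags once and check every expected model.
check_models() {
  base_url="$1"; shift
  tags="$(curl -sf "$base_url/api/tags")" || {
    echo "Ollama not reachable at $base_url" >&2; return 1
  }
  for model in "$@"; do
    printf '%s' "$tags" | has_model "$model" || {
      echo "missing model: $model" >&2; return 1
    }
  done
}
```

With the compose stack running, `check_models http://127.0.0.1:11435 qwen2.5:3b gemma3:1b` returns a non-zero status if either model is missing.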

Stop:

docker compose down

Deployment pipeline

git push main ─▶ build-push-ollama.yaml ─▶ ghcr.io/<owner>/<repo>-ollama:<sha>
                                       └▶ ghcr.io/<owner>/<repo>-ollama:latest

manual trigger ─▶ deploy-ollama.yaml ─▶ Clever Cloud (single instance)

Image builds automatically on every push to main. Deploys are manual via gh workflow run "CD - Deploy Ollama to Clever Cloud" -f image_tag=latest (or the GitHub Actions UI).
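Deploying a pinned tag rather than latest makes rollbacks easier to reason about. The sketch below builds the image reference from the naming pattern in the diagram and triggers the deploy workflow for a given tag. The function names are hypothetical, and it assumes the workflow's image_tag input accepts the same tag string that CI pushed, which isn't confirmed here.

```shell
# Hypothetical helper: build the GHCR reference following the pattern in the diagram.
# Registry paths must be lowercase, so owner and repo are lowercased.
image_ref() {
  owner="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  repo="$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
  echo "ghcr.io/${owner}/${repo}-ollama:$3"
}

# Hypothetical wrapper: trigger the deploy workflow for a specific tag (default: latest).
deploy_tag() {
  gh workflow run "CD - Deploy Ollama to Clever Cloud" -f image_tag="${1:-latest}"
}
```

`image_ref <owner> <repo> <sha>` prints the reference to inspect or pull, and `deploy_tag <sha>` deploys that tag.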

First-time setup

See DEPLOYMENT.md for the one-time Clever Cloud setup (create apps, configure network group, set GitHub secrets).

Changing the default model

Edit the ollama pull lines in ollama/Dockerfile. Push to main to rebuild the image, then re-trigger the deploy workflow. Models get baked into the image, so a larger model means a larger image (and longer pulls on Clever Cloud).

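Ollama loads models into RAM on demand and unloads them when idle, so having multiple models on disk doesn't multiply runtime memory: image size grows, but RAM only needs to fit the models loaded at the same moment. Ollama may keep more than one model resident if memory allows; setting OLLAMA_MAX_LOADED_MODELS=1 limits that to the largest single model in use.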

Sizing guide for the Clever Cloud instance (runtime RAM, with 8K context):

Model                  Disk     Runtime RAM   Min instance
gemma3:1b              815 MB   ~2.0 GB       S (~2 GB), tight
qwen2.5:1.5b           986 MB   ~1.8 GB       S (~2 GB)
qwen2.5:3b (default)   1.9 GB   ~3.5 GB       M (~4 GB)
llama3.2:3b            2.0 GB   ~3.5 GB       M (~4 GB)
gemma3:4b              3.3 GB   ~5 GB         L (~8 GB)
qwen2.5:7b             4.7 GB   ~6 GB         L (~8 GB)
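The sizing rule in the table can be written as a quick lookup. This sketch treats the approximate instance sizes as 2048, 4096, and 8192 MB, which is an assumption; the exact memory of each Clever Cloud flavor should be checked before relying on it. The function name `min_instance` is hypothetical.

```shell
# Hypothetical helper: map required runtime RAM (in MB) to the smallest
# instance size from the table (S ~2 GB, M ~4 GB, L ~8 GB).
min_instance() {
  ram_mb="$1"
  if   [ "$ram_mb" -le 2048 ]; then echo "S"
  elif [ "$ram_mb" -le 4096 ]; then echo "M"
  elif [ "$ram_mb" -le 8192 ]; then echo "L"
  else echo "needs more than L" >&2; return 1
  fi
}
```

For example, `min_instance 3500` covers the default qwen2.5:3b row and `min_instance 6000` covers qwen2.5:7b.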

For runtime-pulled models (not baked in), attach a Clever Cloud FS Bucket add-on mounted at /root/.ollama so models persist across restarts.
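A runtime pull goes through Ollama's POST /api/pull endpoint. The sketch below is an illustration only: the function names are hypothetical, the base URL is whatever address the caller can reach, and older Ollama releases may expect the field name "name" instead of "model".

```shell
# Hypothetical helper: build the JSON body for /api/pull.
# "stream": false asks for a single final response instead of progress events.
pull_payload() {
  printf '{"model":"%s","stream":false}' "$1"
}

# Hypothetical wrapper: ask a running Ollama instance to pull a model at runtime.
pull_model() {
  curl -sf -X POST "$1/api/pull" \
    -H 'Content-Type: application/json' \
    -d "$(pull_payload "$2")"
}
```

Locally this would be called as `pull_model http://127.0.0.1:11435 qwen2.5:1.5b`. Without a persistent mount at /root/.ollama, the pulled model is lost when the container is replaced.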

Security note

The Ollama API has no authentication. The Clever Cloud Ollama app must NOT have a public domain — only the partners-connect server (on the same network group) should be able to reach it. See DEPLOYMENT.md.