CHORD Config Orchestrator — monitors and manages kotekan instances running on a cluster of nodes.
choco provides a web UI that shows the live status of every kotekan instance, detects when their configs drift from the desired state, and lets you push config updates. It talks to kotekan's built-in REST API, so no agent software is needed on the nodes.
Kotekan itself is deployed and managed on nodes by Ansible. choco only handles monitoring and config management.
- Python 3.10+
- A FreeIPA server for LDAP authentication (e.g. ipa1.auth.chord-observatory.ca)
- Kotekan instances reachable over HTTP (default port 12048)
Requires root (uses sudo internally):
git clone <this repo>
cd choco
sudo ./choco.sh install # install; prompts to overwrite existing configs
sudo ./choco.sh install --overwrite-configs # overwrite configs without prompting
sudo ./choco.sh install --keep-configs # keep existing configs without prompting
sudo $EDITOR /etc/choco/config.yaml # edit LDAP settings + secret_key
sudo systemctl restart choco
This installs choco as a system service with the following layout:
| Path | Contents |
|---|---|
| /opt/choco/.venv/ | System Python venv with choco installed |
| /etc/choco/config.yaml | choco configuration (chmod 600) |
| /etc/choco/configs/ | Kotekan config files (nodes.yaml, group dirs, .updatable/) |
The install script also:
- Creates a local .venv in the repo directory (editable install, owned by invoking user) for development
- Sets up iptables rules to redirect ports 443 -> 5000 and 80 -> 8080 (persisted via iptables-persistent)
- Installs and enables a systemd service that starts on boot and restarts on failure
- Seeds /etc/choco/configs/ from the repo's configs/ directory on first install; on subsequent installs, prompts whether to overwrite (use --overwrite-configs or --keep-configs to skip the prompt)
Re-running sudo ./choco.sh install is safe — it always syncs config.yaml from the local copy (with configs_dir rewritten to /etc/choco/configs), and iptables rules are deduplicated. If configs already exist you'll be prompted before overwriting.
sudo systemctl status choco # check status
sudo systemctl restart choco # restart after config changes
sudo journalctl -u choco -f # follow logs
./choco.sh run # run locally for development (extra args forwarded)
The install script creates a local .venv with an editable install, so code changes in the repo are picked up immediately:
./choco.sh run # run local code against config.yaml
./choco.sh test # run tests (extra args forwarded to pytest)
./choco.sh test -k test_kotekan # run specific tests
choco is configured via a config.yaml file and a config directory containing node/kotekan YAML files.
The install script creates /etc/choco/config.yaml from the template. Edit it:
server:
  host: 0.0.0.0
  port: 5000
  secret_key: change-me # Change this in production!
log_level: INFO
configs_dir: configs
fpga_master:
  host: chive.site.chord-observatory.ca
  port: 54321
  timeout: 5 # HTTP request timeout (seconds)
eop:
  intervals_before: 2 # Days of past entries (older stored entries are truncated on merge)
  intervals_after: 2 # Days of future entries (later stored entries are kept, never overwritten)
  endpoint: earth_rotation_data # Kotekan updatable config endpoint name
  state_file: eop-state.json # State file name (stored in configs_dir)
  service_unit: choco-eop-broadcast.service # systemd unit for last-run status
ldap:
  host: # e.g. ldaps://ipa1.auth.chord-observatory.ca
  port: 636
  use_ssl: true
  base_dn: # e.g. dc=auth,dc=chord-observatory,dc=ca
  user_dn: cn=users,cn=accounts
  user_login_attr: uid
  user_object_filter: "(objectclass=posixaccount)"
  bind_dn: # e.g. uid=choco,cn=users,cn=accounts,dc=auth,dc=chord-observatory,dc=ca
  bind_password:
config.yaml contains secrets and is chmod 600. Only config.yaml.template is checked into the repo.
choco authenticates against a FreeIPA LDAP directory. FreeIPA does not allow anonymous binds, so a bind account is required for user searches. The bind_dn can be a dedicated user account (e.g. uid=choco,cn=users,cn=accounts,...). The defaults are tuned for FreeIPA (cn=users,cn=accounts user DN, posixaccount object class, LDAPS on port 636).
The config directory (/etc/choco/configs/) is the source of truth for which nodes choco manages and what their base configs are.
/etc/choco/configs/
├── nodes.yaml # Node registry
├── vars.yaml # (optional) Shared Jinja2 template variables
├── .updatable/ # Per-node updatable config overrides (JSON)
│ └── cx/
│ └── cx27.json # Updatable values for cx27
├── cx/
│ └── cx27.yaml # Base kotekan config for cx27
└── recv/
└── recv1.j2 # Base kotekan config (Jinja2 template)
Defines the kotekan instances choco should monitor, organized into groups. Each node's base config lives at <group>/<node>.yaml (or .j2):
groups:
  cx:
    cx27: {host: cx27.site.chord-observatory.ca, port: 12048}
    cx42: {host: cx42.site.chord-observatory.ca, port: 12048, started: true}
  recv:
    recv1: {host: recv1.site.chord-observatory.ca, port: 12048}
The optional started field is a pre-discovery default for the desired runtime state. On startup, choco polls every node and preserves whatever kotekan is actually doing: reachable nodes that are running come up with started=True, idle ones with started=False, and unreachable nodes fall back to started=False. The nodes.yaml value is overwritten by this observation. The started state can also be toggled at runtime via the dashboard or the JSON API. Runtime toggles are ephemeral (reset on choco restart, at which point the discovery pass runs again).
Each file at <group>/<node>.yaml (or <group>/<node>.j2) contains the base kotekan config for that node. All base config files are rendered through Jinja2 using variables from vars.yaml (if present) to produce rendered configs, which are then merged with any updatable overrides to form the desired config that gets pushed to kotekan as JSON.
For example, a Jinja2 template cx/cx27.j2 might reference shared variables:
num_elements: {{ n_elem }}
log_level: info
These files can be edited directly on disk - choco watches for changes and picks them up automatically.
Kotekan configs can contain updatable blocks - sections marked with kotekan_update_endpoint that can be changed at runtime without restarting kotekan. When updatable values are set (via the web UI or by editing files on disk), they are stored as JSON files under .updatable/<group>/<node>.json:
{"updatable_config/gains": {"start_time": 1234, "coeff": 1.0}}
When a config is pushed, stored updatable values are merged into the rendered config to produce the desired config, which is sent to kotekan so it boots with the correct values immediately. These files are also watched - editing them on disk triggers an immediate push of the updatable values to the running kotekan instance (without a restart).
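The merge step can be illustrated with a minimal sketch. It assumes that each key in the stored JSON is a slash-separated path to a block in the rendered config and that the stored values replace same-named keys in that block; the function name merge_updatable and the sample config are hypothetical, not choco's actual code.

```python
import copy
import json


def merge_updatable(rendered: dict, updatable: dict) -> dict:
    """Return a new config with updatable values overlaid on the rendered one.

    Assumption: each key in `updatable` is a slash-separated path to a block
    in the config, and its values replace same-named keys in that block.
    """
    desired = copy.deepcopy(rendered)  # never mutate the rendered config
    for path, values in updatable.items():
        block = desired
        for part in path.strip("/").split("/"):
            block = block.setdefault(part, {})  # walk (or create) the path
        block.update(values)
    return desired


# Illustrative rendered config with one updatable block.
rendered = {
    "num_elements": 512,
    "updatable_config": {
        "gains": {
            "kotekan_update_endpoint": "json",
            "start_time": 0,
            "coeff": 0.5,
        }
    },
}
stored = json.loads('{"updatable_config/gains": {"start_time": 1234, "coeff": 1.0}}')

desired = merge_updatable(rendered, stored)
print(json.dumps(desired, indent=2))
```

The rendered config is deep-copied so that the base config and the desired config can still be compared separately after the merge.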
After installation, choco runs as a systemd service. Open https://<hostname> in a browser and log in with your LDAP credentials.
To run manually (e.g. for debugging):
sudo systemctl stop choco
/opt/choco/.venv/bin/choco /etc/choco/config.yaml
Every page (for logged-in users) shows a thin strip above the nav with two pill badges:
- FPGA: colour-coded readout from the fpga_master daemon. Green when /status responds and /get-frame0-time parses (timing is good); yellow when /status is reachable but timing can't be read; red when the daemon is unreachable; grey when no fpga_master block is configured. The tooltip carries the host, last-seen, error, and current frame0_ns.
- EOP: health of the choco-eop-broadcast.service systemd unit. Green when its last run succeeded within the last ~25 hours, yellow when stale, red when the last result was a failure, grey when the unit has never run or systemd isn't reachable (in which case choco falls back to the eop-state.json mtime).
The strip is refreshed every 30 seconds via htmx; the FPGA poller runs as a single gevent greenlet on the same cadence.
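The EOP badge rules above amount to a small decision function. The sketch below is an assumption about how they could be expressed: the function name eop_badge is hypothetical, the result string is assumed to follow systemd's Result property ("success" on a clean run), and the fallback to the eop-state.json mtime is left out.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=25)  # the "~25 hours" window from the text


def eop_badge(last_result, last_run, now):
    """Classify the EOP unit's health as a badge colour.

    last_result: result string of the last run, or None if unknown.
    last_run:    datetime of the last run, or None if it never ran.
    """
    if last_result is None or last_run is None:
        return "grey"    # never ran, or systemd not reachable
    if last_result != "success":
        return "red"     # last run failed
    return "green" if now - last_run <= STALE_AFTER else "yellow"


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh_badge = eop_badge("success", now - timedelta(hours=3), now)
stale_badge = eop_badge("success", now - timedelta(hours=30), now)
failed_badge = eop_badge("exit-code", now - timedelta(hours=1), now)
unknown_badge = eop_badge(None, None, now)
print(fresh_badge, stale_badge, failed_badge, unknown_badge)
```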
The main page shows a table of all registered nodes with live-updating columns: node name, status, config, sync state, and an Edit link.
Status indicators:
- Green (started): kotekan is running and config matches the desired state
- Yellow (stopped): kotekan is reachable but not running (ready for /start)
- Blue (syncing): config push in progress (kill → restart → start)
- Red (down): kotekan is unreachable
- Grey (unknown): not yet polled or state indeterminate
Each node also has two toggles:
- Started/stopped (green/yellow): desired runtime state. On startup choco discovers the actual state and sets this from what kotekan reports; the toggle then controls whether choco keeps kotekan running.
- Maintenance (orange = on, blue = normal): when on, choco observes the node but never writes to it (no /start, no /kill, no updatable POSTs). Useful when an operator is intervening on a node manually. Every node starts in maintenance mode after a choco restart; flip it off (per node, per group, or with the cluster-wide toggle) when you're ready for choco to reconcile drift.
Each scope (group header, dashboard header) has paired ▲/▼ and M/N buttons that flip every node in scope at once.
Status updates are pushed to the browser in real time via WebSockets - no need to refresh.
Click Edit on a node to manage its settings:
- Config selector — which base config file to use for this node.
- Config editor — edit the base config YAML. Save queues a base-config change (write to disk + restart). "Re-push Current" queues a forced re-push.
- Updatable config — edit individual updatable blocks. Changes are queued and pushed to kotekan's updatable endpoints without a restart.
The Edit nodes button on the dashboard opens /nodes, a drag-and-drop editor for nodes.yaml. Saving rewrites the YAML, rebuilds the in-memory registry from scratch (dropping queued changes), then automatically puts every node into maintenance mode and re-runs state discovery so each node's started/idle flag is set from the live kotekan runtime rather than a cold default. Take nodes back out of maintenance individually or via the cluster-wide toggle once you've reviewed the new layout. Config files on disk are not moved when nodes change groups — that's an operator task.
Config changes can also be submitted programmatically:
- POST /update/<group>: queue a change for all nodes in a group
- POST /update/<group>/<node>: queue a change for a single node
Both accept JSON with:
- {"action": "base_config", "config_content": "..."}
- {"action": "updatable_config", "endpoint": "...", "values": {...}}
- {"action": "set_started", "started": true} sets the started/stopped state
- {"action": "set_maintenance", "maintenance": true} puts the node(s) into or out of maintenance mode
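As a rough illustration, a client could assemble these requests with the Python standard library. The sketch below only builds the request object and does not send it; BASE_URL, the helper name build_update_request, and the sample endpoint value are assumptions for illustration.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://localhost:5000"  # assumption: choco listening locally on port 5000


def build_update_request(payload: dict, group: str,
                         node: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST to /update/<group>[/<node>]."""
    path = f"/update/{group}" if node is None else f"/update/{group}/{node}"
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint name and values, for illustration only.
req = build_update_request(
    {"action": "updatable_config",
     "endpoint": "updatable_config/gains",
     "values": {"start_time": 1234, "coeff": 1.0}},
    group="cx",
    node="cx27",
)
print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing it to urllib.request.urlopen, with an SSL context suited to the server's certificate.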
Read-only status endpoints:
- GET /api/status: per-node runtime status plus an aggregate summary
- GET /api/nodes: the node registry (groups/hosts/ports) as JSON
The /update/* and /api/* endpoints bypass auth when called from localhost, so from the choco host you can use curl directly (use -k since the cert is typically self-signed):
# Start a single node
curl -ks -X POST https://localhost:5000/update/<group>/<node> -H 'Content-Type: application/json' -d '{"action":"set_started","started":true}'
# Stop (idle) a single node
curl -ks -X POST https://localhost:5000/update/<group>/<node> -H 'Content-Type: application/json' -d '{"action":"set_started","started":false}'
# Start a whole group
curl -ks -X POST https://localhost:5000/update/<group> -H 'Content-Type: application/json' -d '{"action":"set_started","started":true}'
# Stop a whole group
curl -ks -X POST https://localhost:5000/update/<group> -H 'Content-Type: application/json' -d '{"action":"set_started","started":false}'
# Check status
curl -ks https://localhost:5000/api/status | jq .
Changes flow through a two-tier queue system:
Producers (web UI, API, file watcher, poll timer)
→ Input Queue (serialized — one submission at a time)
→ Node Queues (FIFO, each Node holds its own)
→ Worker Pool (locks a node's queue, drains items, syncs to remote)
Input queue — a single serialized entry point. Accepts changes for individual nodes or entire groups (fan-out). Submissions block each other so only one caller modifies the queues at a time.
Node queues — each Node holds a FIFO change queue. A pool of worker greenlets scans nodes for unlocked, non-empty queues. A worker locks a node's queue, drains all pending items (writing base config or updatable values to disk), then syncs to the remote kotekan instance:
- Base config changes: kill kotekan, wait for stopped, start with new config via POST /start
- Updatable-only changes: POST new values directly to updatable endpoints (no restart)
- Poll (no changes): compare desired config vs. running config; push if drift is detected
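The two-tier flow can be sketched in a few lines. This is a simplified illustration that uses plain threading locks and synchronous calls in place of choco's gevent worker pool; the names Node, InputQueue, and drain_one echo the terms used above but are not choco's actual code.

```python
import threading
from collections import deque


class Node:
    def __init__(self, name):
        self.name = name
        self.queue = deque()          # FIFO change queue
        self.lock = threading.Lock()  # held by the worker draining this node


class InputQueue:
    """Single serialized entry point that fans changes out to node queues."""

    def __init__(self, groups):
        self.groups = groups          # {group name: [Node, ...]}
        self._submit_lock = threading.Lock()

    def submit(self, change, group, node=None):
        with self._submit_lock:       # one submission at a time
            for target in self.groups[group]:
                if node is None or target.name == node:
                    target.queue.append(change)


def drain_one(nodes):
    """One worker pass: claim the first unlocked, non-empty node queue."""
    for node in nodes:
        if node.queue and node.lock.acquire(blocking=False):
            try:
                items = []
                while node.queue:
                    items.append(node.queue.popleft())
                return node.name, items  # a real worker would now sync to kotekan
            finally:
                node.lock.release()
    return None


cx = [Node("cx27"), Node("cx42")]
inq = InputQueue({"cx": cx})
inq.submit("poll", group="cx")                      # group fan-out
inq.submit("base_config", group="cx", node="cx27")  # single node

first = drain_one(cx)
second = drain_one(cx)
third = drain_one(cx)
print(first, second, third)
```

Draining everything in one pass lets a worker collapse several queued changes for a node into a single sync.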
Periodic polling — every 5 seconds, a poll item is submitted for every node. This detects drift and unreachable nodes even when no local changes are made. Status changes are pushed to browsers via WebSocket.
File watcher — the config directory is watched for changes:
- YAML/J2 files: reloads the affected node's config and queues a poll for it (vars.yaml changes re-render all nodes)
- .updatable/ JSON files: reloads the affected node's updatable store and queues a poll
Load errors are surfaced, not fatal. If a base config or updatable JSON file fails to parse, the affected node loads with a load_error and the service still starts. The dashboard shows the specific error (including the file name), and the sync loop refuses to push any config to that node until the error clears — pushing an incomplete desired_config could silently regress kotekan's runtime state. Errors clear automatically when the file becomes parseable again (file watcher reload) or when a fresh config is submitted via the UI / API (save_base / save_updatable). Stopped nodes (started: false) are still killed normally — load errors don't override the user's intent to stop a node.
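A minimal sketch of this "surface, don't crash" behaviour for an updatable JSON file follows. The helper names load_updatable and may_push are hypothetical, and the gating rule is a simplified reading of the paragraph above.

```python
import json


def load_updatable(text, filename):
    """Parse updatable JSON; return (values, load_error) instead of raising."""
    try:
        values = json.loads(text)
    except json.JSONDecodeError as exc:
        # Keep the file name in the message so it can be shown to the operator.
        return {}, f"{filename}: {exc.msg} (line {exc.lineno})"
    if not isinstance(values, dict):
        return {}, f"{filename}: expected a JSON object"
    return values, None


def may_push(started, load_error):
    """Pushing config is refused while a load error is present.

    Stopping a node needs no config, so it is not gated by this check.
    """
    return started and load_error is None


good, good_err = load_updatable('{"updatable_config/gains": {"coeff": 1.0}}', "cx27.json")
bad, bad_err = load_updatable('{"updatable_config/gains": ', "cx27.json")
print(good_err, "|", bad_err)
```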
Startup state discovery. When choco starts, Orchestrator.discover_node_states() probes every node in parallel and sets each node.started from the actual runtime state (STARTED → True, IDLE → False, unreachable → False). This happens before the regular poll loop and worker pool engage, so choco never "resets" a running node back to idle just because the local default was False.
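The mapping from observed state to the started flag can be sketched as follows. A thread pool stands in for choco's parallel probing, and a dictionary lookup stands in for the HTTP call to each node; the function names are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor


def started_from_state(state):
    """Map an observed kotekan state to the desired `started` flag."""
    return state == "STARTED"  # IDLE and unreachable (None) both give False


def discover(nodes, probe):
    """Probe all nodes in parallel; returns {node: started}."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        states = list(pool.map(probe, nodes))
    return {n: started_from_state(s) for n, s in zip(nodes, states)}


# Stand-in for the cluster; None models an unreachable node.
fake_cluster = {"cx27": "STARTED", "cx42": "IDLE", "recv1": None}
result = discover(list(fake_cluster), fake_cluster.get)
print(result)
```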
Maintenance mode. Every node has a maintenance flag that defaults to on at startup. When maintenance is on, all REST calls that mutate the node — Node.push_updatable, Node.start, and Node.kill — are no-ops (they log and return False), and Orchestrator._sync_node short-circuits before reaching them. Drift is still observed and the dashboard reflects the node's actual state, but choco never writes to a paused node, even to enforce started=False. Operators flip nodes out of maintenance once they're ready for choco to reconcile. Maintenance state is ephemeral; a choco restart puts everything back into maintenance and re-runs state discovery.
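One way to picture the maintenance guard is a decorator that turns mutating methods into logged no-ops. This is a sketch under that assumption; the decorator name skip_in_maintenance and the stripped-down Node class are hypothetical and do not claim to match choco's implementation.

```python
import functools
import logging

log = logging.getLogger("choco.sketch")


def skip_in_maintenance(method):
    """Turn a mutating method into a logged no-op while maintenance is on."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if self.maintenance:
            log.info("%s: maintenance on, skipping %s", self.name, method.__name__)
            return False
        return method(self, *args, **kwargs)
    return wrapper


class Node:
    def __init__(self, name):
        self.name = name
        self.maintenance = True  # defaults to on at startup
        self.calls = []          # records what would have been sent

    @skip_in_maintenance
    def start(self):
        self.calls.append("start")  # placeholder for POST /start
        return True

    @skip_in_maintenance
    def kill(self):
        self.calls.append("kill")   # placeholder for the kill request
        return True


node = Node("cx27")
blocked = node.start()      # maintenance on: nothing is sent
node.maintenance = False
allowed = node.start()      # maintenance off: the call goes through
print(blocked, allowed, node.calls)
```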
A companion oneshot service generates an Earth Orientation Parameter (EOP) table from IERS data and pushes it to every group as updatable_config under the earth_rotation_data endpoint.
Schedule — choco-eop-broadcast.service runs once on choco.service startup (After=choco.service, WantedBy=choco.service) and again daily at 12:00 UTC via choco-eop-broadcast.timer (Persistent=true, so a missed firing catches up on boot). One-off runs: sudo systemctl start choco-eop-broadcast.service.
Pipeline (jobs/eop_update.py):
- Read frame0_ns from fpga_master over TCP.
- Build a fresh EOP table on the UTC-midnight grid using astropy + IERS auto-download, covering (now − intervals_before, now + intervals_after) days.
- If configs_dir/eop-state.json exists, merge with stored state (policy below).
- Wait for choco's web port, then POST /update/<group> for every group in nodes.yaml.
- If all groups succeed, write the merged table back to eop-state.json. On any failure, the state file is left alone so the next run merges from a known-good baseline.
Merge policy — eop-state.json is the source of truth for what kotekan has been told; the merge protects continuity of any value that has already been pushed:
- No overwrite. Stored entries are never replaced, even if IERS data has been refined since they were committed. Past and future values are immutable once stored.
- No gap filling, no prepending. Fresh entries are added only when their timestamp is strictly greater than the latest surviving stored entry. We never insert between two existing stored entries — kotekan may be interpolating across that segment — and we never insert before the first stored entry.
- Conditional truncation. Stored entries older than intervals_before days are dropped, but only if the surviving stored set still contains at least one entry on either side of "now". If truncation would leave the table without an anchor before or after now, no truncation happens: preserving kotekan's ability to interpolate at the current instant takes priority over tidy bookkeeping.
The net effect is that the on-disk table grows forward over time (one new entry per day) and is trimmed from the past only when it's safe to do so.
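The three rules can be combined into one small function. The sketch below is one reading of the policy, not the code in jobs/eop_update.py: entries are modelled as (day, value) tuples sorted by day, the values are placeholder strings rather than real EOP parameters, and truncation is applied before fresh entries are appended.

```python
def merge_eop(stored, fresh, now, intervals_before):
    """Merge a fresh EOP table into the stored one.

    Entries are (day, value) tuples sorted by day; `now` and
    `intervals_before` use the same day units.
    """
    cutoff = now - intervals_before
    survivors = [e for e in stored if e[0] >= cutoff]
    has_before = any(day <= now for day, _ in survivors)
    has_after = any(day > now for day, _ in survivors)
    # Conditional truncation: drop old entries only if "now" stays bracketed.
    kept = survivors if (has_before and has_after) else list(stored)

    if not kept:
        return list(fresh)  # nothing stored yet: take the fresh table as-is
    latest = kept[-1][0]
    # No overwrite, no gap filling: append only strictly newer entries.
    return kept + [e for e in fresh if e[0] > latest]


fresh = [(12, "C2"), (13, "D2"), (14, "E2"), (15, "f")]

# Healthy state: day 10 and 11 are trimmed, day 15 is appended.
healthy = merge_eop([(10, "a"), (11, "b"), (12, "c"), (13, "d"), (14, "e")],
                    fresh, now=13.5, intervals_before=2)

# Stale state: truncation would remove every anchor, so nothing is dropped.
stale = merge_eop([(10, "a"), (11, "b")], fresh, now=13.5, intervals_before=2)
print(healthy)
print(stale)
```

In the healthy case the stored values for days 12 to 14 survive even though the fresh table offers different values for them, which is the "no overwrite" rule in action.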
./choco.sh test
Or manually:
source .venv/bin/activate
pytest tests/ -v
choco/
├── app.py # Flask app factory, SocketIO setup, entry point
├── auth.py # LDAP authentication (Flask-Login + Flask-LDAP3-Login)
├── web.py # Flask routes: dashboard, node edit, login/logout, /update/* JSON API
├── state.py # Node (identity, config state, change queue, kotekan REST client), Registry
├── sync.py # Queue-based sync: ChangeItem, InputQueue, Orchestrator worker pool
├── templates/ # Jinja2 templates (Pico CSS + htmx)
└── static/ # Static assets
jobs/
├── choco.service # Main systemd service (Type=notify)
├── choco-eop-broadcast.service # EOP update job (runs on choco start + daily timer)
├── choco-eop-broadcast.timer # Daily at 12:00 UTC
├── eop-broadcast.sh # Wrapper: finds venv, calls eop_update.py
├── eop_update.py # EOP pipeline: generate table, merge with state, push to choco
└── eop_utils.py # Vendored from kotekan (do not modify — update from upstream)
eop_utils.py is vendored from kotekan (tools/earth_orientation/eop_utils.py). It should not be modified in this repo.