chord-observatory/choco

choco

CHORD Config Orchestrator — monitors and manages kotekan instances running on a cluster of nodes.

choco provides a web UI that shows the live status of every kotekan instance, detects when their configs drift from the desired state, and lets you push config updates. It talks to kotekan's built-in REST API, so no agent software is needed on the nodes.

Kotekan itself is deployed and managed on nodes by Ansible. choco only handles monitoring and config management.

Requirements

  • Python 3.10+
  • A FreeIPA server for LDAP authentication (e.g. ipa1.auth.chord-observatory.ca)
  • Kotekan instances reachable over HTTP (default port 12048)

Installation

Requires root (uses sudo internally):

git clone <this repo>
cd choco

sudo ./choco.sh install                   # install; prompts to overwrite existing configs
sudo ./choco.sh install --overwrite-configs  # overwrite configs without prompting
sudo ./choco.sh install --keep-configs       # keep existing configs without prompting
sudo $EDITOR /etc/choco/config.yaml  # edit LDAP settings + secret_key
sudo systemctl restart choco

This installs choco as a system service with the following layout:

Path                      Contents
/opt/choco/.venv/         System Python venv with choco installed
/etc/choco/config.yaml    choco configuration (chmod 600)
/etc/choco/configs/       Kotekan config files (nodes.yaml, group dirs, .updatable/)

The install script also:

  • Creates a local .venv in the repo directory (editable install, owned by invoking user) for development
  • Sets up iptables rules to redirect ports 443 -> 5000 and 80 -> 8080 (persisted via iptables-persistent)
  • Installs and enables a systemd service that starts on boot and restarts on failure
  • Seeds /etc/choco/configs/ from the repo's configs/ directory on first install; on subsequent installs, prompts whether to overwrite (use --overwrite-configs or --keep-configs to skip the prompt)

Re-running sudo ./choco.sh install is safe — it always syncs config.yaml from the local copy (with configs_dir rewritten to /etc/choco/configs), and iptables rules are deduplicated. If configs already exist you'll be prompted before overwriting.

Service management

sudo systemctl status choco        # check status
sudo systemctl restart choco       # restart after config changes
sudo journalctl -u choco -f        # follow logs

Running manually

./choco.sh run                     # run locally for development (extra args forwarded)

Development

The install script creates a local .venv with an editable install, so code changes in the repo are picked up immediately:

./choco.sh run                     # run local code against config.yaml
./choco.sh test                    # run tests (extra args forwarded to pytest)
./choco.sh test -k test_kotekan   # run specific tests

Configuration

choco is configured via a config.yaml file and a config directory containing node/kotekan YAML files.

config.yaml

The install script creates /etc/choco/config.yaml from the template. Edit it:

server:
  host: 0.0.0.0
  port: 5000
  secret_key: change-me           # Change this in production!
  log_level: INFO

configs_dir: configs

fpga_master:
  host: chive.site.chord-observatory.ca
  port: 54321
  timeout: 5                     # HTTP request timeout (seconds)

eop:
  intervals_before: 2             # Days of past entries (older stored entries are truncated on merge)
  intervals_after: 2              # Days of future entries (later stored entries are kept, never overwritten)
  endpoint: earth_rotation_data   # Kotekan updatable config endpoint name
  state_file: eop-state.json     # State file name (stored in configs_dir)
  service_unit: choco-eop-broadcast.service  # systemd unit for last-run status

ldap:
  host:                           # e.g. ldaps://ipa1.auth.chord-observatory.ca
  port: 636
  use_ssl: true
  base_dn:                       # e.g. dc=auth,dc=chord-observatory,dc=ca
  user_dn: cn=users,cn=accounts
  user_login_attr: uid
  user_object_filter: "(objectclass=posixaccount)"
  bind_dn:                       # e.g. uid=choco,cn=users,cn=accounts,dc=auth,dc=chord-observatory,dc=ca
  bind_password:

config.yaml contains secrets and is chmod 600. Only config.yaml.template is checked into the repo.

LDAP Authentication (FreeIPA)

choco authenticates against a FreeIPA LDAP directory. FreeIPA does not allow anonymous binds, so a bind account is required for user searches. The bind_dn can be a dedicated user account (e.g. uid=choco,cn=users,cn=accounts,...). The defaults are tuned for FreeIPA (cn=users,cn=accounts user DN, posixaccount object class, LDAPS on port 636).

Config Directory

The config directory (/etc/choco/configs/) is the source of truth for which nodes choco manages and what their base configs are.

/etc/choco/configs/
├── nodes.yaml          # Node registry
├── vars.yaml           # (optional) Shared Jinja2 template variables
├── .updatable/         # Per-node updatable config overrides (JSON)
│   └── cx/
│       └── cx27.json   # Updatable values for cx27
├── cx/
│   └── cx27.yaml       # Base kotekan config for cx27
└── recv/
    └── recv1.j2        # Base kotekan config (Jinja2 template)

nodes.yaml - Node Registry

Defines the kotekan instances choco should monitor, organized into groups. Each node's base config lives at <group>/<node>.yaml (or .j2):

groups:
  cx:
    cx27: {host: cx27.site.chord-observatory.ca, port: 12048}
    cx42: {host: cx42.site.chord-observatory.ca, port: 12048, started: true}
  recv:
    recv1: {host: recv1.site.chord-observatory.ca, port: 12048}
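
As a rough illustration of how the registry maps onto the directory layout, the sketch below walks a parsed nodes.yaml structure and lists where each node's base config would be expected. The function names are hypothetical, the registry is written as a plain dict rather than loaded from YAML, and the fallback to port 12048 simply mirrors the default mentioned under Requirements. Which of .yaml or .j2 wins when both exist is not stated here, so both candidates are returned.

```python
from pathlib import Path


def iter_nodes(registry):
    """Yield (group, node, host, port) for every entry under 'groups'."""
    for group, nodes in registry.get("groups", {}).items():
        for name, spec in nodes.items():
            # Assumption: fall back to kotekan's default REST port.
            yield group, name, spec["host"], spec.get("port", 12048)


def base_config_candidates(configs_dir, group, node):
    """Return the two conventional locations for a node's base config."""
    root = Path(configs_dir) / group
    return [root / f"{node}.yaml", root / f"{node}.j2"]


# Same shape as the nodes.yaml example, written as a dict.
registry = {
    "groups": {
        "cx": {"cx27": {"host": "cx27.site.chord-observatory.ca", "port": 12048}},
        "recv": {"recv1": {"host": "recv1.site.chord-observatory.ca"}},
    }
}

for group, name, host, port in iter_nodes(registry):
    paths = base_config_candidates("/etc/choco/configs", group, name)
    print(group, name, host, port, [p.as_posix() for p in paths])
```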

The optional started field is a pre-discovery default for the desired runtime state. On startup, choco polls every node and preserves whatever kotekan is actually doing — reachable nodes that are running come up with started=True, idle ones with started=False, and unreachable nodes fall back to started=False. The nodes.yaml value is overwritten by this observation. The started state can also be toggled at runtime via the dashboard or the JSON API. Runtime toggles are ephemeral (reset on choco restart, at which point the discovery pass runs again).

Per-Node Config Files

Each file at <group>/<node>.yaml (or <group>/<node>.j2) contains the base kotekan config for that node. All base config files are rendered through Jinja2 using variables from vars.yaml (if present) to produce rendered configs, which are then merged with any updatable overrides to form the desired config that gets pushed to kotekan as JSON.

For example, a Jinja2 template cx/cx27.j2 might reference shared variables:

num_elements: {{ n_elem }}
log_level: info

These files can be edited directly on disk - choco watches for changes and picks them up automatically.

Updatable Config Overrides

Kotekan configs can contain updatable blocks - sections marked with kotekan_update_endpoint that can be changed at runtime without restarting kotekan. When updatable values are set (via the web UI or by editing files on disk), they are stored as JSON files under .updatable/<group>/<node>.json:

{"updatable_config/gains": {"start_time": 1234, "coeff": 1.0}}

When a config is pushed, stored updatable values are merged into the rendered config to produce the desired config, which is sent to kotekan so it boots with the correct values immediately. These files are also watched - editing them on disk triggers an immediate push of the updatable values to the running kotekan instance (without a restart).
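
The merge step described above can be sketched as follows. This is only an illustration: it assumes that a key such as "updatable_config/gains" is a slash-separated path to a nested block in the rendered config, and merge_updatable is a hypothetical name, not choco's actual function. The kotekan_update_endpoint value in the sample config is a placeholder.

```python
import copy
import json


def merge_updatable(rendered, updatable):
    """Return a new config with each updatable block's values overlaid."""
    desired = copy.deepcopy(rendered)  # never mutate the rendered config
    for path, values in updatable.items():
        block = desired
        # Assumption: the key is a slash-separated path into the config tree.
        for part in path.strip("/").split("/"):
            block = block.setdefault(part, {})
        block.update(values)
    return desired


rendered = {
    "log_level": "info",
    "updatable_config": {
        "gains": {"kotekan_update_endpoint": "json", "start_time": 0, "coeff": 0.5}
    },
}
stored = json.loads('{"updatable_config/gains": {"start_time": 1234, "coeff": 1.0}}')

desired = merge_updatable(rendered, stored)
print(json.dumps(desired, indent=2))
```

Working on a deep copy keeps the rendered config available for later drift comparisons, which seems useful given that rendered and desired configs are treated as distinct stages.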

Running

After installation, choco runs as a systemd service. Open https://<hostname> in a browser and log in with your LDAP credentials.

To run manually (e.g. for debugging):

sudo systemctl stop choco
/opt/choco/.venv/bin/choco /etc/choco/config.yaml

Web UI

Service status strip

Every page (for logged-in users) shows a thin strip above the nav with two pill badges:

  • FPGA — colour-coded readout from the fpga_master daemon. Green when /status responds and /get-frame0-time parses (timing is good); yellow when /status is reachable but timing can't be read; red when the daemon is unreachable; grey when no fpga_master block is configured. The tooltip carries the host, last-seen, error, and current frame0_ns.
  • EOP — health of the choco-eop-broadcast.service systemd unit. Green when its last run succeeded within the last ~25 hours, yellow when stale, red when the last result was a failure, grey when the unit has never run or systemd isn't reachable (in which case choco falls back to the eop-state.json mtime).
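
The colour rules for the two badges can be written down as a pair of small functions. This is a sketch of the rules as listed above, not choco's code: the function names and argument shapes are hypothetical, the 25-hour threshold is taken from the "~25 hours" figure, and the fallback to the eop-state.json mtime is left out.

```python
def fpga_badge(configured, status_ok, frame0_ok):
    """Colour of the FPGA pill from three observations."""
    if not configured:
        return "grey"    # no fpga_master block in config.yaml
    if not status_ok:
        return "red"     # daemon unreachable
    return "green" if frame0_ok else "yellow"


def eop_badge(last_result, age_hours, max_age_hours=25.0):
    """Colour of the EOP pill.

    last_result is "success", "failure", or None when the unit has never
    run or systemd could not be queried.
    """
    if last_result is None:
        return "grey"
    if last_result == "failure":
        return "red"
    return "green" if age_hours <= max_age_hours else "yellow"


print(fpga_badge(True, True, False), eop_badge("success", 3.0))
```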

The strip is refreshed every 30 seconds via htmx; the FPGA poller runs as a single gevent greenlet on the same cadence.

Dashboard

The main page shows a table of all registered nodes with live-updating columns: node name, status, config, sync state, and an Edit link.

Status indicators:

  • Green (started) — kotekan is running and config matches the desired state
  • Yellow (stopped) — kotekan is reachable but not running (ready for /start)
  • Blue (syncing) — config push in progress (kill → restart → start)
  • Red (down) — kotekan is unreachable
  • Grey (unknown) — not yet polled or state indeterminate

Each node also has two toggles:

  • Started/stopped (green/yellow) — desired runtime state. On startup choco discovers the actual state and sets this from what kotekan reports; the toggle then controls whether choco keeps kotekan running.
  • Maintenance (orange = on, blue = normal) — when on, choco observes the node but never writes to it (no /start, no /kill, no updatable POSTs). Useful when an operator is intervening on a node manually. Every node starts in maintenance mode after a choco restart; flip it off (per node, per group, or with the cluster-wide toggle) when you're ready for choco to reconcile drift.

Each scope (group header, dashboard header) has paired ▲/▼ and M/N buttons that flip every node in scope at once.

Status updates are pushed to the browser in real time via WebSockets - no need to refresh.

Node Edit

Click Edit on a node to manage its settings:

  • Config selector — which base config file to use for this node.
  • Config editor — edit the base config YAML. Save queues a base-config change (write to disk + restart). "Re-push Current" queues a forced re-push.
  • Updatable config — edit individual updatable blocks. Changes are queued and pushed to kotekan's updatable endpoints without a restart.

Edit Nodes (registry)

The Edit nodes button on the dashboard opens /nodes, a drag-and-drop editor for nodes.yaml. Saving rewrites the YAML, rebuilds the in-memory registry from scratch (dropping queued changes), then automatically puts every node into maintenance mode and re-runs state discovery so each node's started/idle flag is set from the live kotekan runtime rather than a cold default. Take nodes back out of maintenance individually or via the cluster-wide toggle once you've reviewed the new layout. Config files on disk are not moved when nodes change groups — that's an operator task.

JSON API

Config changes can also be submitted programmatically:

  • POST /update/<group> — queue a change for all nodes in a group
  • POST /update/<group>/<node> — queue a change for a single node

Both accept JSON with:

  • {"action": "base_config", "config_content": "..."}
  • {"action": "updatable_config", "endpoint": "...", "values": {...}}
  • {"action": "set_started", "started": true} — set the started/stopped state
  • {"action": "set_maintenance", "maintenance": true} — put the node(s) into or out of maintenance mode

Read-only status endpoints:

  • GET /api/status — per-node runtime status plus an aggregate summary
  • GET /api/nodes — the node registry (groups/hosts/ports) as JSON

The /update/* and /api/* endpoints bypass auth when called from localhost, so from the choco host you can use curl directly (use -k since the cert is typically self-signed):

# Start a single node
curl -ks -X POST https://localhost:5000/update/<group>/<node> -H 'Content-Type: application/json' -d '{"action":"set_started","started":true}'

# Stop (idle) a single node
curl -ks -X POST https://localhost:5000/update/<group>/<node> -H 'Content-Type: application/json' -d '{"action":"set_started","started":false}'

# Start a whole group
curl -ks -X POST https://localhost:5000/update/<group> -H 'Content-Type: application/json' -d '{"action":"set_started","started":true}'

# Stop a whole group
curl -ks -X POST https://localhost:5000/update/<group> -H 'Content-Type: application/json' -d '{"action":"set_started","started":false}'

# Check status
curl -ks https://localhost:5000/api/status | jq .
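
The same calls can be made from Python with only the standard library. The sketch below builds the request separately from sending it, so the payload can be inspected first. The helper names are hypothetical, the group and node names are examples, and the response body is returned as raw text because its format is not documented here. Disabling certificate verification mirrors curl's -k and is only reasonable against localhost.

```python
import json
import ssl
import urllib.request

BASE_URL = "https://localhost:5000"


def build_update_request(group, action, node=None, **fields):
    """Build a POST to /update/<group> or /update/<group>/<node>."""
    path = f"/update/{group}" + (f"/{node}" if node else "")
    body = json.dumps({"action": action, **fields}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def send(request, timeout=5):
    """Send a request, skipping TLS verification like `curl -k`."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, timeout=timeout, context=ctx) as resp:
        return resp.status, resp.read().decode("utf-8")


# Build (but do not send) a request that starts a single node.
req = build_update_request("cx", "set_started", node="cx27", started=True)
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Calling send(req) on the choco host would then submit the change.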

How Sync Works

Changes flow through a two-tier queue system:

Producers (web UI, API, file watcher, poll timer)
    → Input Queue (serialized — one submission at a time)
        → Node Queues (FIFO, each Node holds its own)
            → Worker Pool (locks a node's queue, drains items, syncs to remote)

Input queue — a single serialized entry point. Accepts changes for individual nodes or entire groups (fan-out). Submissions block each other so only one caller modifies the queues at a time.
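
A minimal sketch of the fan-out behaviour, under stated assumptions: it uses a standard-library lock in place of whatever gevent primitive choco actually uses, the class name is hypothetical, and node names are assumed to be unique across groups.

```python
import threading
from collections import deque


class InputQueueSketch:
    """Serialized entry point that fans changes out to per-node FIFO queues."""

    def __init__(self, groups):
        self._lock = threading.Lock()
        self._groups = groups  # {group: [node, ...]}
        self.node_queues = {
            node: deque() for members in groups.values() for node in members
        }

    def submit(self, item, group, node=None):
        """Queue item for one node, or for every node in the group."""
        targets = [node] if node else list(self._groups[group])
        with self._lock:  # only one caller touches the queues at a time
            for name in targets:
                self.node_queues[name].append(item)
        return targets


q = InputQueueSketch({"cx": ["cx27", "cx42"], "recv": ["recv1"]})
print(q.submit({"action": "set_started", "started": True}, "cx"))
print(q.submit({"action": "poll"}, "recv", node="recv1"))
```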

Node queues — each Node holds a FIFO change queue. A pool of worker greenlets scans nodes for unlocked, non-empty queues. A worker locks a node's queue, drains all pending items (writing base config or updatable values to disk), then syncs to the remote kotekan instance:

  • Base config changes — kill kotekan, wait for stopped, start with new config via POST /start
  • Updatable-only changes — POST new values directly to updatable endpoints (no restart)
  • Poll (no changes) — compare desired config vs. running config; push if drift is detected
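
The three cases above amount to a small decision function applied after a worker drains a node's queue. The sketch below is illustrative only: the "kind" field, the returned labels, and the plain equality check for drift are assumptions, not the actual ChangeItem structure or comparison used in sync.py.

```python
def plan_sync(items, desired, running):
    """Pick one sync action for a batch of drained queue items."""
    kinds = {item["kind"] for item in items}
    if "base_config" in kinds:
        return "restart"          # kill, wait for stopped, POST /start
    if "updatable_config" in kinds:
        return "post_updatable"   # POST to updatable endpoints, no restart
    # Only poll items were queued: compare desired and running configs.
    return "push" if desired != running else "noop"


batch = [{"kind": "updatable_config"}, {"kind": "base_config"}, {"kind": "poll"}]
print(plan_sync(batch, {"a": 1}, {"a": 1}))
```

A base config change takes precedence in this sketch because the restart already boots kotekan with the merged updatable values.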

Periodic polling — every 5 seconds, a poll item is submitted for every node. This detects drift and unreachable nodes even when no local changes are made. Status changes are pushed to browsers via WebSocket.

File watcher — the config directory is watched for changes:

  • YAML/J2 files — reloads the affected node's config and queues a poll for it (vars.yaml changes re-render all nodes)
  • .updatable/ JSON files — reloads the affected node's updatable store and queues a poll

Load errors are surfaced, not fatal. If a base config or updatable JSON file fails to parse, the affected node loads with a load_error and the service still starts. The dashboard shows the specific error (including the file name), and the sync loop refuses to push any config to that node until the error clears — pushing an incomplete desired_config could silently regress kotekan's runtime state. Errors clear automatically when the file becomes parseable again (file watcher reload) or when a fresh config is submitted via the UI / API (save_base / save_updatable). Stopped nodes (started: false) are still killed normally — load errors don't override the user's intent to stop a node.
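
One way to get this "surfaced, not fatal" behaviour is to have the loader return the error instead of raising it. The sketch below does that for an updatable JSON file. load_updatable and the error string format are hypothetical, and treating a missing file as "no overrides" is an assumption.

```python
import json
import tempfile
from pathlib import Path


def load_updatable(path):
    """Return (values, load_error); a bad file never raises."""
    path = Path(path)
    try:
        return json.loads(path.read_text()), None
    except FileNotFoundError:
        return {}, None  # assumption: no file simply means no overrides
    except (OSError, ValueError) as exc:
        # json.JSONDecodeError is a ValueError; keep the file name in the message.
        return {}, f"{path.name}: {exc}"


with tempfile.TemporaryDirectory() as tmp:
    bad = Path(tmp) / "cx27.json"
    bad.write_text('{"updatable_config/gains": ')  # truncated JSON
    values, error = load_updatable(bad)
    missing_values, missing_error = load_updatable(Path(tmp) / "cx42.json")

can_push = error is None  # the sync loop would skip pushes while this is False
print(error, can_push)
```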

Startup state discovery. When choco starts, Orchestrator.discover_node_states() probes every node in parallel and sets each node.started from the actual runtime state (STARTED → True, IDLE → False, unreachable → False). This happens before the regular poll loop and worker pool engage, so choco never "resets" a running node back to idle just because the local default was False.
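
The discovery rule can be sketched in a few lines. This is not the body of Orchestrator.discover_node_states(): it uses a thread pool from the standard library rather than gevent, and the probe is a stand-in that would be an HTTP status call in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def started_from_observation(runtime_state):
    """Map an observed state to the started flag.

    runtime_state is "STARTED", "IDLE", or None when the node is unreachable.
    """
    return runtime_state == "STARTED"


def discover(nodes, probe):
    """Probe all nodes in parallel and return {node: started}."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        states = list(pool.map(probe, nodes))
    return {node: started_from_observation(s) for node, s in zip(nodes, states)}


# Fake observations standing in for real probes.
observed = {"cx27": "STARTED", "cx42": "IDLE", "recv1": None}
result = discover(list(observed), observed.get)
print(result)
```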

Maintenance mode. Every node has a maintenance flag that defaults to on at startup. When maintenance is on, all REST calls that mutate the node (Node.push_updatable, Node.start, and Node.kill) are no-ops that log and return False, and Orchestrator._sync_node short-circuits before reaching them. Drift is still observed and the dashboard reflects the node's actual state, but choco never writes to a node in maintenance, even to enforce started=False. Operators flip nodes out of maintenance once they're ready for choco to reconcile. Maintenance state is ephemeral: a choco restart puts every node back into maintenance and re-runs state discovery.
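
The guard can be pictured as a single check in front of every mutating call. The class below is a hypothetical stand-in for Node in state.py, with the REST client replaced by an injected callable so the behaviour can be exercised without a kotekan instance.

```python
import logging

log = logging.getLogger("choco.sketch")


class NodeSketch:
    """Node whose mutating calls become no-ops while in maintenance."""

    def __init__(self, name, send):
        self.name = name
        self.maintenance = True  # defaults to on, as after a choco restart
        self._send = send        # stand-in for the kotekan REST client

    def _guarded(self, verb, *args):
        if self.maintenance:
            log.info("%s: skipping %s (maintenance mode)", self.name, verb)
            return False
        self._send(verb, *args)
        return True

    def start(self, config):
        return self._guarded("start", config)

    def kill(self):
        return self._guarded("kill")

    def push_updatable(self, endpoint, values):
        return self._guarded("push_updatable", endpoint, values)


calls = []
node = NodeSketch("cx27", lambda *args: calls.append(args))
blocked = node.kill()        # ignored: node is in maintenance
node.maintenance = False
allowed = node.kill()        # now reaches the (fake) REST client
print(blocked, allowed, calls)
```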

EOP Broadcast

A companion oneshot service generates an Earth Orientation Parameter (EOP) table from IERS data and pushes it to every group as updatable_config under the earth_rotation_data endpoint.

Schedule: choco-eop-broadcast.service runs once on choco.service startup (After=choco.service, WantedBy=choco.service) and again daily at 12:00 UTC via choco-eop-broadcast.timer (Persistent=true, so a missed firing catches up on boot). For a one-off run, use sudo systemctl start choco-eop-broadcast.service.

Pipeline (jobs/eop_update.py):

  1. Read frame0_ns from fpga_master over TCP.
  2. Build a fresh EOP table on the UTC-midnight grid using astropy + IERS auto-download, covering (now − intervals_before, now + intervals_after) days.
  3. If configs_dir/eop-state.json exists, merge with stored state (policy below).
  4. Wait for choco's web port, then POST /update/<group> for every group in nodes.yaml.
  5. If all groups succeed, write the merged table back to eop-state.json. On any failure, the state file is left alone so the next run merges from a known-good baseline.
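
To make the grid in step 2 concrete, the sketch below lists the UTC midnights that fall inside the window around "now". It only computes timestamps, with no astropy or IERS lookup. Whether the real job includes midnights exactly on the window boundaries, or extends one step beyond them, is not stated here, so the inclusive bounds are an assumption.

```python
from datetime import datetime, timedelta, timezone


def eop_grid(now, intervals_before, intervals_after):
    """UTC midnights within [now - before days, now + after days].

    `now` must be a timezone-aware datetime.
    """
    now = now.astimezone(timezone.utc)
    start = now - timedelta(days=intervals_before)
    end = now + timedelta(days=intervals_after)
    # First midnight at or after the start of the window.
    t = start.replace(hour=0, minute=0, second=0, microsecond=0)
    if t < start:
        t += timedelta(days=1)
    grid = []
    while t <= end:
        grid.append(t)
        t += timedelta(days=1)
    return grid


now = datetime(2024, 3, 10, 15, 30, tzinfo=timezone.utc)
grid = eop_grid(now, intervals_before=2, intervals_after=2)
for t in grid:
    print(t.isoformat())
```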

Merge policy: eop-state.json is the source of truth for what kotekan has been told; the merge protects continuity of any value that has already been pushed:

  • No overwrite. Stored entries are never replaced, even if IERS data has been refined since they were committed. Past and future values are immutable once stored.
  • No gap filling, no prepending. Fresh entries are added only when their timestamp is strictly greater than the latest surviving stored entry. We never insert between two existing stored entries — kotekan may be interpolating across that segment — and we never insert before the first stored entry.
  • Conditional truncation. Stored entries older than intervals_before days are dropped, but only if the surviving stored set still contains at least one entry on either side of "now". If truncation would leave the table without an anchor before or after now, no truncation happens — preserving kotekan's ability to interpolate at the current instant takes priority over tidy bookkeeping.

The net effect is that the on-disk table grows forward over time (one new entry per day) and is trimmed from the past only when it's safe to do so.
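
The three rules can be sketched as one function over timestamp-keyed tables. This is an interpretation of the policy as written above, not the code in jobs/eop_update.py: tables are plain dicts keyed by Unix seconds, merge_eop is a hypothetical name, an entry exactly at "now" is counted as an anchor on both sides, and an empty stored table simply yields the fresh one.

```python
DAY = 86400


def merge_eop(stored, fresh, now, intervals_before):
    """Merge fresh EOP entries into stored ones without overwriting."""
    if not stored:
        return dict(sorted(fresh.items()))

    # Conditional truncation: drop old entries only if anchors remain.
    cutoff = now - intervals_before * DAY
    survivors = {t: v for t, v in stored.items() if t >= cutoff}
    has_before = any(t <= now for t in survivors)
    has_after = any(t >= now for t in survivors)
    if not (has_before and has_after):
        survivors = dict(stored)  # truncating would lose an anchor

    # No overwrite, no gap filling, no prepending: append strictly newer only.
    latest = max(survivors)
    merged = dict(survivors)
    for t, v in fresh.items():
        if t > latest:
            merged[t] = v
    return dict(sorted(merged.items()))


now = 10.5 * DAY
stored = {d * DAY: f"old{d}" for d in (5, 9, 10, 11)}
fresh = {d * DAY: f"new{d}" for d in (9, 10, 11, 12)}
merged = merge_eop(stored, fresh, now, intervals_before=2)
print(merged)
```

In this example day 5 is truncated, days 9 to 11 keep their stored values, and only day 12 is taken from the fresh table.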

Tests

./choco.sh test

Or manually:

source .venv/bin/activate
pytest tests/ -v

Project Structure

choco/
├── app.py          # Flask app factory, SocketIO setup, entry point
├── auth.py         # LDAP authentication (Flask-Login + Flask-LDAP3-Login)
├── web.py          # Flask routes: dashboard, node edit, login/logout, /update/* JSON API
├── state.py        # Node (identity, config state, change queue, kotekan REST client), Registry
├── sync.py         # Queue-based sync: ChangeItem, InputQueue, Orchestrator worker pool
├── templates/      # Jinja2 templates (Pico CSS + htmx)
└── static/         # Static assets
jobs/
├── choco.service               # Main systemd service (Type=notify)
├── choco-eop-broadcast.service # EOP update job (runs on choco start + daily timer)
├── choco-eop-broadcast.timer   # Daily at 12:00 UTC
├── eop-broadcast.sh            # Wrapper: finds venv, calls eop_update.py
├── eop_update.py               # EOP pipeline: generate table, merge with state, push to choco
└── eop_utils.py                # Vendored from kotekan (do not modify — update from upstream)

eop_utils.py is vendored from kotekan (tools/earth_orientation/eop_utils.py). It should not be modified in this repo.
