A live dashboard + alerting daemon for Monad validators and full nodes. It reads the node's own journals, tracks block-level execution metrics, catches reorgs / retry-spikes / hardware issues in real time, and pings you on Telegram before things go sideways.
I wrote it because I wanted answers to "why is my box acting up"
rather than "is it behind the network". Everything it shows is local:
derived from monad-execution / monad-bft journals and a handful of
probes on the same machine. No network-wide data, no external RPC
dependency for the core view.
- Live dashboard at /: recent blocks, retry rate, TPS, gas, epoch progress, top retried contracts.
- Alerts for block stalls, retry-rate spikes, chain reorgs, reference-RPC lag, and assertion-like log patterns (CXX_ASSERT, RUST_PANIC, QC_OVERSHOOT, CHUNK_EXHAUSTION). Telegram by default, with dedup + hysteresis so you don't get flapping.
- Host probes: systemd state of monad services, key-backup age, TrieDB disk health, UDP config, filesystem usage, fd_limits.
- Alerts history at /alerts: persisted across restarts, filterable by window / severity.
- JSON API for everything, plus a single-call /api/window_summary that gives an aggregate + top contracts for any window up to 30 days.
- Independent watchdog: a tiny bash script on a 5-second systemd timer that pokes the dashboard and pages Telegram if monad-ops itself dies. The whole point is to alert when the thing that normally alerts you is what died.
Field notes from running this against Monad testnet, with data and explicit hypotheses — see OBSERVATIONS.md. First entry is the 23-reorg window observed 2026-04-19 / 2026-04-20.
A single process tails journalctl -u monad-execution, parses each
__exec_block record into a typed struct, feeds it through a handful
of rules, persists the result in SQLite, and serves a FastAPI
dashboard. A second worker fetches block receipts to attribute retry
activity to the contracts driving it. All SQLite writes happen off the
event loop so a slow aggregate query can't stall live ingestion. The
public /alerts page and JSON API are read-only — the dashboard
cannot write back to the node.
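
The "writes happen off the event loop" point can be sketched with the standard library alone. This is an illustration under assumptions, not the project's storage layer: the blocks table, its columns, and the function names are hypothetical. The blocking sqlite3 call runs in a worker thread via asyncio.to_thread, so the coroutine that ingests records only awaits it.

```python
import asyncio
import sqlite3


def write_block(db_path: str, number: int, retries: int) -> None:
    """Blocking insert; meant to run in a worker thread, not on the event loop."""
    # One short-lived connection per call keeps this sketch thread-safe.
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS blocks "
                "(number INTEGER PRIMARY KEY, retries INTEGER)"
            )
            conn.execute(
                "INSERT OR REPLACE INTO blocks (number, retries) VALUES (?, ?)",
                (number, retries),
            )
    finally:
        conn.close()


async def ingest(db_path: str, records: list[tuple[int, int]]) -> None:
    """Persist (block number, retry count) pairs without blocking the loop."""
    for number, retries in records:
        # The event loop stays free to serve other tasks while SQLite works.
        await asyncio.to_thread(write_block, db_path, number, retries)
```

Calling asyncio.run(ingest("data/state.db", [(1, 0), (2, 3)])) would write two rows. A real implementation would more likely keep one long-lived writer connection instead of opening one per record.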
- A Monad validator or full node on the same host, running under systemd (monad-ops tails journalctl -u monad-execution and journalctl -u monad-bft, and probes service state via systemctl).
- Python 3.12+ with SQLite ≥ 3.35.
- A Telegram bot for alerts — optional. Leave the token blank and the dashboard runs alert-free; you can wire Telegram later.
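
If you want to check the interpreter and SQLite versions before installing, a small preflight script like the one below can do it. This is a sketch, not something included in the repo, and the function name is made up. Only the two minimum versions come from the list above.

```python
import sqlite3
import sys

MIN_PYTHON = (3, 12)      # from the requirements above
MIN_SQLITE = (3, 35, 0)   # from the requirements above


def preflight(
    python: tuple[int, ...] = tuple(sys.version_info[:2]),
    sqlite: tuple[int, ...] = tuple(sqlite3.sqlite_version_info),
) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if tuple(python) < MIN_PYTHON:
        problems.append(f"Python {python} is older than {MIN_PYTHON}")
    if tuple(sqlite) < MIN_SQLITE:
        problems.append(f"SQLite {sqlite} is older than {MIN_SQLITE}")
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("\n".join(issues) if issues else "versions OK")
```

Note that sqlite3.sqlite_version_info reports the SQLite library linked into Python, which can differ from the sqlite3 command-line tool on the same host.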
monad-ops is intentionally a host process, not a container: every
interface point with the node (journals, systemctl is-active,
key-backup paths, host RPC) is shaped by host systemd. Wrapping it
in Docker means re-mounting all of those, and systemctl is-active
still won't see the host's units from inside the container.
git clone https://github.com/rustemar/monad-ops.git
cd monad-ops
python3 -m venv .venv
.venv/bin/pip install -e .
cp config.example.toml config.toml
# edit config.toml: [node].name, RPC URL, optional Telegram

Key sections of config.toml:
- [node]: display name, RPC URL, list of systemd services to probe.
- [alerts.telegram]: bot token (from @BotFather) and chat ID. Blank bot_token (or omitting the section) routes alerts to stdout.
- [persistence]: SQLite database path. Default is data/state.db.
- [enrichment]: receipts-enrichment worker settings.
The user running monad-ops must be in the systemd-journal group.
Add it once and re-login:
sudo usermod -aG systemd-journal "$(id -un)"

Or, if you don't want a global group change, uncomment the
SupplementaryGroups=systemd-journal line in
systemd/monad-ops.service.example so the unit grants journal access
only to the service.
.venv/bin/python -m monad_ops.cli run
# dashboard: http://127.0.0.1:8873

sudo cp systemd/monad-ops.service.example \
/etc/systemd/system/monad-ops.service
# edit the file: set User=, Group=, WorkingDirectory=, ExecStart=
sudo systemctl daemon-reload
sudo systemctl enable --now monad-ops.service

cp scripts/watchdog.env.example scripts/watchdog.env
chmod 600 scripts/watchdog.env
# fill in TG_BOT_TOKEN, TG_CHAT_ID, TG_TOPIC_ID
sudo cp systemd/monad-ops-watchdog.service.example \
/etc/systemd/system/monad-ops-watchdog.service
sudo cp systemd/monad-ops-watchdog.timer \
/etc/systemd/system/monad-ops-watchdog.timer
# edit the .service file: paths + User=
sudo systemctl daemon-reload
sudo systemctl enable --now monad-ops-watchdog.timer

monad-ops is editable-installed (pip install -e .), so a code
upgrade is just a fast-forward pull plus a service restart:
cd /opt/monad-ops # or wherever you cloned it
git pull --ff-only origin main
sudo systemctl restart monad-ops.service

Schema migrations are applied automatically on service start:
idempotent ALTER TABLE … ADD COLUMN is run for any new columns
introduced upstream. No manual SQL is required, and the migration is a
no-op once the column exists.
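
As a rough illustration of what "idempotent ALTER TABLE … ADD COLUMN" means in practice, the sketch below checks PRAGMA table_info before altering. It is not the migration code from storage.py; the helper name, the alerts table, and the severity column are hypothetical.

```python
import sqlite3


def ensure_column(conn: sqlite3.Connection, table: str, column: str, decl: str) -> bool:
    """Add `column` to `table` if it is missing; return True when it was added."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) rows.
    # Identifiers are interpolated, so only pass trusted, hard-coded names.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False  # already migrated: nothing to do
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alerts (id INTEGER PRIMARY KEY, rule TEXT)")
print(ensure_column(conn, "alerts", "severity", "TEXT DEFAULT 'info'"))  # True
print(ensure_column(conn, "alerts", "severity", "TEXT DEFAULT 'info'"))  # False
```

Running the same step on every start is safe because the second call finds the column and returns without touching the schema.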
Two cases that need an extra step (rare):
- pyproject.toml dependencies changed: also run .venv/bin/pip install -e . after the pull. Release notes will call this out when needed.
- config.toml schema changed (a new required field, a renamed key): diff your config.toml against config.example.toml and copy the delta over. Optional fields keep working as-is.
To see what changed before pulling: git fetch origin main && git log --oneline HEAD..origin/main.
A ready nginx template lives in
systemd/nginx-ops-dashboard.conf.example. It expects a TLS cert, a
<dashboard-domain> substitution, and an upstream on 127.0.0.1:8873.
It sets CSP / HSTS / X-Frame-Options / Permissions-Policy on every
response.
The full reference with curl examples lives at /api on any running
instance. In brief:
- GET /api/state - live snapshot (blocks, rolling metrics, epoch, reorg counter, reference-RPC lag).
- GET /api/blocks/sampled?from_ts_ms=&to_ts_ms=&points=300 - server-aggregated time-series for charts.
- GET /api/alerts/history?window=&severity=&limit= - persisted alerts.
- GET /api/reorgs - all observed reorgs, newest-first.
- GET /api/reorgs/{block_number}?window=N - per-event forensic trace (reorged block + ±N neighbors).
- GET /api/contracts/top_retried?since_ts_ms=&… - contracts ranked by re-execution.
- GET /api/window_summary?from_ts_ms=&to_ts_ms=&include_blocks= - single-call post-event report.
- GET /api/probes - sanitized host-probe status.
All JSON routes ship Access-Control-Allow-Origin: * so external
dashboards can pull from the browser. The HTML dashboard itself stays
on a strict CSP.
.venv/bin/pip install -e '.[dev]'
.venv/bin/python -m pytest -q
.venv/bin/ruff check .

Layout:
monad_ops/
├── cli.py # entry point + async task wiring
├── api/ # FastAPI app, endpoints, cache layer
├── collector/ # journal tailer, probes, reference RPC, epoch
├── dashboard/ # Jinja templates, static JS/CSS/Chart.js
├── enricher/ # eth_getBlockReceipts worker
├── parser/ # __exec_block / assertion line parsers
├── rules/ # stall, retry_spike, reorg, reference_lag,
│ # assertion — alert-emitting rules
├── alerts/ # sinks (Telegram, stdout, deduping)
├── state.py # in-memory snapshot + EpochTracker
├── storage.py # SQLite schema, migrations, aggregates
└── config.py # Pydantic settings
Tests live in tests/ (unit-level per module; no network in the
suite).
Apache-2.0 — see LICENSE.
