Vocal is a tool for managing netCDF data product standards and associated data product specifications. It is intended for datasets that follow the Climate and Forecast (CF) Conventions, but may also be used with non-CF-compliant datasets.
Vocal requires the udunits2 C library to be installed on your system. On Debian/Ubuntu-based systems this can be installed with:
sudo apt install libudunits2-dev
On macOS with Homebrew:
brew install udunits
The recommended way to install vocal is with uv:
uv tool install git+https://github.com/FAAM-146/vocal.git
This makes the vocal command available globally. Alternatively, to use vocal in a project:
uv add git+https://github.com/FAAM-146/vocal.git
Vocal can also be installed with pip:
pip install git+https://github.com/FAAM-146/vocal.git
Note that if you use pip directly, it is strongly recommended that you install into an isolated Python environment, for example one created with virtualenv.
Once installed, the vocal command should be available in your PATH:
$ vocal
Usage: vocal [OPTIONS] COMMAND [ARGS]...
Compliance checking and metadata management.
╭─ Commands ────────────────────────────────────────────────────────────────╮
│ autodoc Generate documentation from a project or product. │
│ build Create an example data file from a definition. │
│ check Check a netCDF file against standard and product definitions. │
│ fetch Fetch a vocal project or pack and register it. │
│ gatekeep Watch a folder and check files as they arrive. │
│ init Initialise a vocal project. │
│ register Register a vocal project or pack globally. │
│ release Produce a pack with a manifest, v{Y}/, and latest/. │
│ web Launch a web-based checker GUI. │
╰───────────────────────────────────────────────────────────────────────────╯
Vocal uses vocal projects to define standards for netCDF data. A vocal project consists of pydantic model definitions and their associated validators. Vocal then provides a mapping from netCDF data to these models, allowing the power of pydantic to be used for compliance checking.
Typically, as a data provider, you will be given a vocal project to check your data against.
The simplest way to obtain a vocal project is with the fetch command:
$ vocal fetch <url>
where <url> is the URL of the git repository containing the project. For private repositories or
repositories hosted outside of GitHub, pass the --git flag to use git directly:
$ vocal fetch --git <url>
Fetching a project registers it automatically. To register a project (or pack) you already have
on disk, point register at it:
$ vocal register <path>
register auto-detects the kind of resource from its marker file — a conventions.yaml at the
path is a project, a manifest.json is a pack — and registers it under the correct key. There is
no conventions-string flag: a project's identity (its name and version, e.g. MYSTD-1.0) comes
from its conventions.yaml. Pass -f/--force to overwrite an existing registration.
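The marker-file rule can be sketched in a few lines of Python. This is an illustration of the rule as described above, not vocal's own code: `detect_resource_kind` is a hypothetical helper, and checking for conventions.yaml before manifest.json when both are present is an assumption.

```python
import tempfile
from pathlib import Path


def detect_resource_kind(path):
    """Classify a directory as a 'project' or a 'pack' from its marker file.

    Hypothetical helper: it mirrors the rule described in the text only.
    """
    root = Path(path)
    if (root / "conventions.yaml").is_file():
        return "project"
    if (root / "manifest.json").is_file():
        return "pack"
    raise ValueError(f"{root} is neither a vocal project nor a pack")


# Demonstrate with two throwaway directories holding placeholder marker files.
with tempfile.TemporaryDirectory() as tmp:
    project_dir = Path(tmp) / "project"
    pack_dir = Path(tmp) / "pack"
    project_dir.mkdir()
    pack_dir.mkdir()
    (project_dir / "conventions.yaml").write_text("# placeholder\n")
    (pack_dir / "manifest.json").write_text("{}")

    kinds = (detect_resource_kind(project_dir), detect_resource_kind(pack_dir))
```

Keying the decision on a file inside the resource, rather than on a command-line flag, means the resource describes itself and cannot be registered under the wrong kind.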
To create a new vocal project, type vocal init -n <NAME>, where <NAME> is the standard's
name (e.g. MYSTD). By default the project is scaffolded in the current directory; pass
-d <directory> to scaffold it elsewhere, and --major / --minor to set the standard's
version (defaulting to 1 / 0). This writes a conventions.yaml recording the standard's
identity and module layout, plus an importable Python package named after the standard
(lower-cased, or overridden with -p/--project-directory):
./conventions.yaml
./mystd/__init__.py
./mystd/defaults.py
./mystd/models/__init__.py
./mystd/models/dimension.py
./mystd/models/variable.py
./mystd/models/group.py
./mystd/models/dataset.py
./mystd/attributes/__init__.py
./mystd/attributes/global_attributes.py
./mystd/attributes/group_attributes.py
./mystd/attributes/variable_attributes.py
The models directory contains the pydantic models which define the dataset,
groups, dimensions and variables. The attributes directory contains the pydantic models
for the attributes associated with the dataset (globals), groups and variables.
Product definitions are conventionally kept in a definitions directory alongside the project
(see Specifying data products), though this location can be
overridden at runtime.
Data product definitions are specified in YAML files, typically in the definitions directory.
A simple example of a product definition is:
meta:
  file_pattern: "example_data.nc"
  short_name: "example_data"
  description: "An example data product"
  references:
    - ["Reference 1", "https://example.com"]
    - ["Reference 2", "https://example.com"]
attributes:
  Conventions: "CF-1.8"
  title: "Example data"
  comment: <str: derived_from_file optional>
dimensions:
  - name: time
    size: null # null indicates unlimited dimension
  - name: height
    size: 32
variables:
  - meta:
      name: "example_variable"
      datatype: "<float32>"
      required: true
    attributes:
      long_name: "Example variable"
      units: "m"
      comment: <str: derived_from_file optional>
    dimensions:
      - time
      - height
This definition specifies a single required variable, example_variable, with dimensions time and height. Attributes may be literal values, or may be a placeholder indicating
that the value may change between files. In this case, the comment attribute is derived from the file. A typical attribute placeholder is <str: derived_from_file optional>, which indicates that the attribute is a string, and that it is optional. Array-valued attributes are also supported, for example <Array[int8]: derived_from_file optional> indicates that the attribute is an array of 8-bit integers, and is optional.
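To make the placeholder syntax concrete, here is a minimal sketch of how such a string could be told apart from a literal value. The grammar is inferred only from the two examples above; `parse_placeholder` is a hypothetical helper, vocal's own parser may accept more forms, and treating the `optional` keyword as omittable is an assumption.

```python
import re

# Assumed grammar, inferred from the two examples in the text:
#   <TYPE: derived_from_file [optional]>   with TYPE such as str or Array[int8]
PLACEHOLDER = re.compile(
    r"^<\s*(?P<type>[A-Za-z0-9_]+(?:\[[A-Za-z0-9_]+\])?)\s*:"
    r"\s*derived_from_file(?P<optional>\s+optional)?\s*>$"
)


def parse_placeholder(value):
    """Return (type_name, is_array, is_optional), or None for a literal value."""
    if not isinstance(value, str):
        return None
    match = PLACEHOLDER.match(value.strip())
    if match is None:
        return None
    type_name = match.group("type")
    is_array = type_name.startswith("Array[")
    return type_name, is_array, match.group("optional") is not None


scalar = parse_placeholder("<str: derived_from_file optional>")
array = parse_placeholder("<Array[int8]: derived_from_file optional>")
literal = parse_placeholder("CF-1.8")
```

A literal such as "CF-1.8" yields `None`, which a checker could treat as "compare the file's value for equality", while a parsed placeholder only constrains the type and presence of the attribute.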
The 'working' copy of a data product definition is typically stored in the definitions directory. However, it is possible that a data product definition may change over time. For example, a new version of a standard may be released, or a data product may be updated to include new variables. In this case, it is useful to be able to track the changes between versions of a data product definition.
To create a versioned release of a set of data product definitions, use the vocal release command:
$ vocal release -p <project_path> -v <version> -u <pack_repo_url> -o <output_dir>
This produces a pack: a self-describing, independently releasable catalogue of product definitions. The command writes a v<version>/ directory containing the versioned product definitions, plus a byte-identical latest/ directory holding a copy of the most recent release. Each product definition is a JSON file intended to be used with the check command, and each release directory carries a manifest.json recording the pack's identity and the standard it requires. Additionally a dataset_schema.json file is created, which is a JSON Schema representation of the pydantic model for the dataset, minus any validators.
The -u/--url value is the pack's GitHub repository URL — the repository you will publish the pack from. It is recorded in every manifest.json and is the identity consumers fetch the pack by (see Packs). On the first release in a fresh output directory --url is required; on subsequent releases it falls back to the URL recorded in <output>/latest/manifest.json, and supplying a different URL is a deliberate, explicit operation.
Publishing a pack is a normal git workflow: commit the v<version>/ and latest/ tree and cut a GitHub release from the repository. vocal release only produces the files locally; it does not create the GitHub release for you.
A pack is a versioned, self-describing catalogue of product definitions, produced with vocal release (see Versioning data product definitions). Where a project defines the standard, a pack holds the concrete product definitions authored against that standard, and is published and consumed independently of it.
Packs are hosted on GitHub, exactly like projects. A pack repository is a multi-version monorepo: it keeps every release's v{Y}/ directory plus a latest/ copy, so a single repository carries the full version history. To publish, commit the tree produced by vocal release and cut a GitHub release from the repository — the release's source archive then contains every version.
A pack's identity is its GitHub repository URL, recorded by vocal release --url into every manifest.json. There is no separate static-hosting URL; the repository is the pack.
Obtain a pack with the same fetch command used for projects:
$ vocal fetch <pack-repo-url>
By default this downloads the pack repository's latest GitHub release and registers it. For private repositories, non-GitHub hosts, or repositories with no published release, clone the repository directly with --git:
$ vocal fetch --git <pack-repo-url>
vocal fetch auto-detects whether a URL points at a project or a pack by inspecting the downloaded tree (a conventions.yaml at the root is a project; a latest/manifest.json is a pack), so you never have to tell it which kind of resource you are fetching. If a fetched repository is neither, the command reports a clear error.
A single fetch registers every version (v{Y}) the pack contains — you can then validate files authored against any historical version without re-fetching. The latest/ directory is a hosting artifact only and is not registered separately; "latest" is simply the highest registered version.
Fetching a pack that is already registered is gated to avoid silently clobbering what you have:
- a plain vocal fetch <pack-repo-url> on an already-registered pack reports that it is already fetched and hints at --update / --force;
- vocal fetch --update <pack-repo-url> picks up newly released versions and refreshes existing ones. Update is additive: it never removes a version you have already registered;
- vocal fetch --force <pack-repo-url> re-installs every version in the latest release regardless of what is registered, repairing a corrupted or partial install.
When you check a file, vocal routes it to the right pack using two global attributes on the file: vocal_definitions_url (the pack's GitHub repository URL) and vocal_definitions_version (the v{Y} release the file was authored against). The pack must already be fetched; if the named URL or version is not registered, vocal check reports a PackMissing error hinting at the vocal fetch <pack-repo-url> you need to run.
vocal_definitions_version is optional. When it is present, the file is checked against that exact registered version. When it is absent (but vocal_definitions_url is present), vocal falls back to the highest registered version for that pack.
Latest-version caveat. With vocal_definitions_version omitted, a file is validated against the locally newest registered version, which may differ from the version it was actually authored against. The version attribute is the precise pin; its absence means "latest". For reproducible checks, pin the version explicitly.
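The routing rules above can be summarised as a small function. This is a sketch under stated assumptions rather than vocal's implementation: the registry is modelled as a plain dict from pack URL to a set of integer versions, the version attribute is assumed to be stored as an integer, the `PackMissing` class here is a stand-in for the error that vocal check reports, and the URL is a made-up example.

```python
class PackMissing(LookupError):
    """Illustrative stand-in for the error reported by vocal check."""


def resolve_pack_version(file_attrs, registered):
    """Pick the registered pack version a file should be checked against.

    file_attrs: the file's global attributes as a dict.
    registered: hypothetical registry mapping pack URL -> set of int versions.
    """
    url = file_attrs.get("vocal_definitions_url")
    if url is None:
        return None  # the file carries no pack routing information
    versions = registered.get(url)
    if not versions:
        raise PackMissing(f"pack not fetched; try: vocal fetch {url}")
    pinned = file_attrs.get("vocal_definitions_version")
    if pinned is None:
        return max(versions)  # fall back to the highest registered version
    if pinned not in versions:
        raise PackMissing(f"version {pinned} not fetched; try: vocal fetch {url}")
    return pinned


PACK_URL = "https://example.com/org/data-products"  # made-up URL
registry = {PACK_URL: {1, 2, 3}}

pinned_result = resolve_pack_version(
    {"vocal_definitions_url": PACK_URL, "vocal_definitions_version": 2}, registry
)
unpinned_result = resolve_pack_version({"vocal_definitions_url": PACK_URL}, registry)
```

With the registry shown, the pinned file resolves to version 2 and the unpinned file to version 3. Fetching a fourth release would silently move the unpinned file to version 4, which is the caveat described above.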
Vocal can be used to check netCDF files against vocal projects and data product definitions. To do this, use the check command:
$ vocal check <file> -p <project_name> -d <definition>
This will check the file against the project and definition specified. If the file is valid, the command will return with exit code 0. If the file is invalid, the command will return with exit code 1. When checking against a product definition, all of the checks will be printed to the console. You can limit the output to warnings and errors only by using the -w flag, to errors only by using the -e flag, or to no output by using the -q flag. Comments are hidden by default; pass -c/--comments to show them. Use --no-color to disable coloured output.
For example,
$ vocal check <file> -p <project_name> -d <definition> -e
will check the file against the project and definition specified, and will only print errors to the console.
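Because the result is reported through the exit code, the check is easy to drive from a script. The wrapper below is a minimal sketch that uses only the flags documented above; the function names are hypothetical, and it assumes the vocal command is on your PATH when `file_is_valid` is called.

```python
import subprocess

# Output-level flags described in the text.
LEVEL_FLAGS = {"warnings": "-w", "errors": "-e", "quiet": "-q"}


def build_check_command(path, project=None, definition=None, level=None):
    """Assemble the argument list for `vocal check`."""
    argv = ["vocal", "check", str(path)]
    if project is not None:
        argv += ["-p", project]
    if definition is not None:
        argv += ["-d", definition]
    if level is not None:
        argv.append(LEVEL_FLAGS[level])  # raises KeyError on an unknown level
    return argv


def file_is_valid(path, **options):
    """Run the check; exit code 0 means valid, any other code is treated as invalid."""
    result = subprocess.run(build_check_command(path, **options))
    return result.returncode == 0


command = build_check_command(
    "example_data.nc", project="example_project", level="errors"
)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.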
A file can also be checked only against a project, without a data product definition:
$ vocal check <file> -p <project_name>
For example, to check a data file against a project standard:
$ vocal check example_data.nc -p example_project
Checking example_data.nc against example_project standard... OK!
Any errors will be printed to the console, indicating where in the file the error occurred, the reason for the error, and potentially the validator that failed.
$ vocal check example_data.nc -p example_project
Checking example_data.nc against example_project standard... ERROR!
✗ root -> groups -> instrument_group_1 -> attributes -> instrument_name: field required
If you omit -p, vocal resolves the project automatically from the file's Conventions
attribute, matching it against the registered projects. When the file also carries the
vocal_definitions_url (and optionally vocal_definitions_version) attributes, the matching
product definition is resolved from the corresponding registered pack as well, so neither -p
nor -d is needed (see Packs):
$ vocal check example_data.nc
Checking example_data.nc against MYSTD-1 standard... OK!
If no registered project matches the file's conventions, or the named pack/version is not
fetched, check reports a typed error explaining what to fetch or register.
Vocal includes a web-based checker GUI that can be launched with the web command:
$ vocal web
INFO: Started server process [12345]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8088 (Press CTRL+C to quit)
The host and port can be configured with the --host and --port options.
The web GUI can fetch projects and packs from URLs, and a fetched project's Python package is imported — i.e. runs on this machine — when a file is checked. The running server is therefore a code-execution surface, so it ships locked down by default:
- Binds to 127.0.0.1 by default. Only your own machine can reach it.
- Downloads are disabled by default. The "Add" affordance and the /add fetch route are turned off; the GUI checks files against projects and packs you have already registered (e.g. via vocal fetch on the CLI). Pass --allow-downloads to let GUI users fetch from URLs:
  $ vocal web --allow-downloads
- Exposing downloads to the network is refused. Combining a non-loopback --host (e.g. 0.0.0.0 or a LAN address) with --allow-downloads would make the GUI an unauthenticated remote code execution service, so vocal web refuses that combination unless you explicitly acknowledge the risk with --dangerously-allow-remote. A non-loopback bind without downloads is allowed (a read-only viewer) but prints a warning.
Residual risk (CSRF). The Add form carries no CSRF token, so when downloads
are enabled a malicious web page you visit could issue a cross-origin request to
the GUI on localhost and trigger a fetch. The default-off posture mitigates
this for the common case; only enable downloads on a machine you trust.
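The start-up policy described above amounts to a small decision table, sketched here with the standard ipaddress module. The function names and return labels are hypothetical, and two details are assumptions: that the name localhost counts as loopback, and that any other hostname is treated as non-loopback.

```python
import ipaddress


def is_loopback(host):
    """True for 'localhost' and for loopback IP literals such as 127.0.0.1 or ::1."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # assumption: unknown hostnames count as non-loopback


def startup_decision(host="127.0.0.1", allow_downloads=False,
                     dangerously_allow_remote=False):
    """Return 'ok', 'warn', 'refuse' or 'acknowledged' for a set of options."""
    if is_loopback(host):
        return "ok"  # only this machine can reach the server
    if not allow_downloads:
        return "warn"  # read-only viewer exposed to the network
    if dangerously_allow_remote:
        return "acknowledged"  # the risk was explicitly accepted
    return "refuse"  # remote fetch would mean remote code execution


decisions = [
    startup_decision(),
    startup_decision(host="0.0.0.0"),
    startup_decision(host="0.0.0.0", allow_downloads=True),
    startup_decision(host="0.0.0.0", allow_downloads=True,
                     dangerously_allow_remote=True),
]
```

The four calls cover the default, the read-only network viewer, the refused combination, and the explicitly acknowledged one.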
The gatekeep command turns vocal into a folder watcher: it polls a directory, checks each
new file, and routes it to a configurable action depending on whether it passes. It is intended
to sit in front of an archive or processing pipeline, letting only conforming files through.
$ vocal gatekeep -w <folder>
Each file is resolved from its own global attributes (exactly as vocal check does when given no
-p) and validated against both the standard and the matching product definition. A file that
resolves and validates is a pass; a file that fails validation — or that cannot be resolved
at all (it carries no recognised, registered standard and product) — is a fail. The
gatekeeper makes no distinction between "broken" and "not one of ours": anything that is not
demonstrably conforming is rejected.
The action taken on each outcome is configurable:
- Pass (--pass-action / -pa): none (the default, which leaves the file in place), move (to --pass-folder / -pf), or command (run --pass-command / -pc).
- Fail (--fail-action / -fa): move (the default, to --fail-folder / -ff), delete, or command (run --fail-command / -fc).
The default fail action is move rather than delete, and move requires its destination
folder, so a bare vocal gatekeep -w <folder> will stop and ask for --fail-folder. This is
deliberate: nothing is deleted or silently dropped until you have said where rejected files
should go. A typical invocation sorts files into two folders:
$ vocal gatekeep -w ./incoming -pa move -pf ./accepted -fa move -ff ./rejected
For command actions, the file path is substituted for a {} token in the command if present,
or appended as the final argument otherwise; the command is run without a shell. A command that
overruns --command-timeout/-ct (default 300 seconds) is killed and treated as a failure, so a
hung command cannot block the watcher.
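The substitution and timeout rules can be sketched as follows. This is an illustration, not the gatekeeper's code: the helper names are hypothetical, and splitting the command string with `shlex.split` is an assumption about how the argument list is formed.

```python
import shlex
import subprocess


def build_argv(command, file_path):
    """Substitute file_path for a {} token, or append it if no token is present."""
    argv = shlex.split(command)  # assumption: split using shell-like quoting rules
    if any("{}" in part for part in argv):
        return [part.replace("{}", file_path) for part in argv]
    return argv + [file_path]


def run_action(command, file_path, timeout=300):
    """Run the command without a shell; overrunning the timeout counts as failure."""
    try:
        result = subprocess.run(build_argv(command, file_path), timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # subprocess.run kills the child before raising
    return result.returncode == 0


with_token = build_argv("cp {} /archive/", "/incoming/a.nc")
without_token = build_argv("md5sum", "/incoming/a.nc")
```

Because the argument list is built before execution and no shell is involved, a file name containing spaces or shell metacharacters is passed through as a single argument.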
The folder is scanned immediately on startup and then every --frequency/-f seconds (default
300). A few safeguards keep the watcher well-behaved over long runs:
- A file that is still being written is skipped until it has been left untouched for a short period, so partially-copied files are never checked.
- Files are de-duplicated by path, size and modification time, so a file that passes and stays in the folder (for example under --pass-action none) is checked once, not on every cycle.
- Only one gatekeeper may watch a given folder at a time; a second invocation on the same folder exits immediately.
The watcher runs until interrupted (Ctrl-C).
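The first two safeguards can be illustrated with a short polling helper. This is a sketch of the idea only: the function names are hypothetical, and the five-second quiet period is an assumed value, since the text says only "a short period".

```python
import os
import tempfile
import time


def fingerprint(path):
    """Identify one version of a file by path, size and modification time."""
    stat = os.stat(path)
    return (os.path.abspath(path), stat.st_size, stat.st_mtime_ns)


def is_settled(path, quiet_seconds=5.0, now=None):
    """True once the file has gone unmodified for quiet_seconds."""
    now = time.time() if now is None else now
    return now - os.stat(path).st_mtime >= quiet_seconds


def files_to_check(folder, seen, quiet_seconds=5.0, now=None):
    """Yield settled files whose fingerprint has not been seen before."""
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path) or not is_settled(path, quiet_seconds, now):
            continue
        key = fingerprint(path)
        if key not in seen:
            seen.add(key)
            yield path


# Simulate three polling cycles over a folder holding one freshly written file.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "a.nc")
    with open(path, "wb") as handle:
        handle.write(b"placeholder")
    seen = set()
    too_soon = list(files_to_check(folder, seen, now=time.time()))
    first_pass = list(files_to_check(folder, seen, now=time.time() + 60))
    second_pass = list(files_to_check(folder, seen, now=time.time() + 60))
```

The file is skipped while it is still fresh, checked once after it has settled, and skipped again on the next cycle because its fingerprint is unchanged. Rewriting the file would change its size or modification time and so make it eligible for another check.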
Vocal can be used to create example data files from vocal projects and data product definitions. To do this, use the build command:
$ vocal build -p <project_name> -d <definition> -o <output_file>
This will create a netCDF file with sinusoidal data for each variable in the data product definition.
Vocal can generate human-readable documentation directly from a project or a product, so the
documentation never drifts from the definitions it describes. Use the autodoc command:
$ vocal autodoc --project <project_path> -o standard.html
There are two modes, and you must supply exactly one:
- --project <path> documents the abstract standard: it walks the project's pydantic model tree and reports the structure (groups, variables, dimensions and attributes) together with the rules and constraints each must satisfy. Point it at the importable project package (the project_directory inside the repo root, e.g. ~/.vocal/projects/FAAM-0/faam).
- --product <path> documents a concrete instance: it walks a product-pack JSON definition and reports the actual structure and values, without consulting the project.
The rendered output is controlled by --format/-f (default html) and written to --out/-o
(default autodoc.<ext>); pass -o - to write to stdout so the output can be piped. Add --open
to open the rendered file in a browser:
$ vocal autodoc --project ~/.vocal/projects/FAAM-0/faam --open
Internally, both modes parse their source into a shared documentation IR (intermediate representation) which is then handed to a registered renderer, so additional output formats can be added without changing the walkers.
As well as the vocal command, the vocal.utils module exposes a small
programmatic API for loading a project and pulling a product definition out of a
fetched pack from your own code. This is useful when you want to work with a
product definition as a pydantic model rather than
checking a file from the command line.
from vocal.utils import import_project, get_product
# Import a fetched project package by the path to its importable module
# (the project_directory inside the repo root, e.g. ~/.vocal/projects/FAAM-0/faam).
project = import_project("/home/dave/.vocal/projects/FAAM-0/faam")
# A fetched pack's product_root is the directory holding its v{Y}/ release
# directories (e.g. ~/.vocal/packs/<slug>).
pack_dir = "/home/dave/.vocal/packs/github-com-davesproson-faam-data-products-0"
# Resolve a product by its meta.short_name. The version may be a specific
# integer (e.g. 1) or "latest", which selects the highest-numbered v{Y}/.
product = get_product("core_1hz", project, "latest", pack_dir)
The functions are:
- import_project(path): import a project package from the path to its importable module (the project_directory inside the repo root) and return the imported module. The module exposes the project contract: models.Dataset, filecodec, and defaults.
- get_spec(short_name, project, version="latest", product_root=None): return the raw product definition (as a dict) whose meta.short_name matches short_name, or None if no product matches. version is a pack release number or "latest"; product_root is the pack directory holding the v{Y}/ release directories. When product_root is omitted it defaults to the project's repo root (useful only when the definitions are colocated with the project).
- get_product(short_name, project, version="latest", product_root=None): as get_spec, but returns the definition validated against the project's models.Dataset pydantic model rather than a raw dict.
Note. Resolving "latest" selects the highest-numbered v{Y}/ directory present under product_root, which is the locally newest fetched release. As with the Conventions / vocal_definitions_* resolution used by vocal check, pass an explicit version for reproducible results (see Packs).