davesproson/vocal

 
 


Vocal

Vocal is a tool for managing netCDF data product standards and associated data product specifications. It is intended for datasets that follow the Climate and Forecast (CF) Conventions, but may also be used with datasets that are not CF-compliant.

Dependencies

Vocal requires the udunits2 C library to be installed on your system. On Debian/Ubuntu-based systems this can be installed with:

sudo apt install libudunits2-dev

On macOS with Homebrew:

brew install udunits

Installation

With uv (recommended)

The recommended way to install vocal is with uv:

uv tool install git+https://github.com/FAAM-146/vocal.git

This makes the vocal command available globally. Alternatively, to use vocal in a project:

uv add git+https://github.com/FAAM-146/vocal.git

With pip

Vocal can also be installed with pip:

pip install git+https://github.com/FAAM-146/vocal.git

Note that if you use pip directly, it is strongly recommended that you install into an isolated Python environment, created with a tool such as virtualenv.

Once installed, the vocal command should be available in your PATH:

$ vocal

 Usage: vocal [OPTIONS] COMMAND [ARGS]...

 Compliance checking and metadata management.

╭─ Commands ────────────────────────────────────────────────────────────────╮
│ autodoc    Generate documentation from a project or product.              │
│ build      Create an example data file from a definition.                 │
│ check      Check a netCDF file against standard and product definitions.  │
│ fetch      Fetch a vocal project or pack and register it.                 │
│ gatekeep   Watch a folder and check files as they arrive.                 │
│ init       Initialise a vocal project.                                    │
│ register   Register a vocal project or pack globally.                     │
│ release    Produce a pack with a manifest, v{Y}/, and latest/.           │
│ web        Launch a web-based checker GUI.                                │
╰───────────────────────────────────────────────────────────────────────────╯

Vocal projects

Vocal uses vocal projects to define standards for netCDF data. A vocal project comprises pydantic model definitions and their associated validators. Vocal then provides a mapping from netCDF data to these models, so that pydantic's validation can be used for compliance checking.

Typically, as a data provider, you will be given a vocal project with which to check your data for compliance.

Obtaining a vocal project

The simplest way to obtain a vocal project is with the fetch command:

$ vocal fetch <url>

where <url> is the URL of the git repository containing the project. For private repositories or repositories hosted outside of GitHub, pass the --git flag to use git directly:

$ vocal fetch --git <url>

Registering a vocal project

Fetching a project registers it automatically. To register a project (or pack) you already have on disk, point register at it:

$ vocal register <path>

register auto-detects the kind of resource from its marker file — a conventions.yaml at the path is a project, a manifest.json is a pack — and registers it under the correct key. There is no conventions-string flag: a project's identity (its name and version, e.g. MYSTD-1.0) comes from its conventions.yaml. Pass -f/--force to overwrite an existing registration.
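The marker-file rule can be pictured with a short sketch. This is not vocal's own code: detect_resource_kind is a hypothetical helper, and the order in which the two markers are tested is an assumption. Only the two marker filenames come from the description above.

```python
from pathlib import Path


def detect_resource_kind(path: str) -> str:
    """Classify a directory as a 'project' or a 'pack' by its marker file.

    Hypothetical helper for illustration only; not part of vocal's API.
    """
    root = Path(path)
    # Assumption: a conventions.yaml takes precedence if both markers exist.
    if (root / "conventions.yaml").is_file():
        return "project"
    if (root / "manifest.json").is_file():
        return "pack"
    raise ValueError(f"{root} is neither a vocal project nor a pack")
```

A directory with neither marker raises an error here, mirroring the idea that a resource must be identifiable before it can be registered.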

Creating a new vocal project

To create a new vocal project, type vocal init -n <NAME>, where <NAME> is the standard's name (e.g. MYSTD). By default the project is scaffolded in the current directory; pass -d <directory> to scaffold it elsewhere, and --major / --minor to set the standard's version (defaulting to 1 / 0). This writes a conventions.yaml recording the standard's identity and module layout, plus an importable Python package named after the standard (lower-cased, or overridden with -p/--project-directory):

./conventions.yaml
./mystd/__init__.py
./mystd/defaults.py
./mystd/models/__init__.py
./mystd/models/dimension.py
./mystd/models/variable.py
./mystd/models/group.py
./mystd/models/dataset.py
./mystd/attributes/__init__.py
./mystd/attributes/global_attributes.py
./mystd/attributes/group_attributes.py
./mystd/attributes/variable_attributes.py

The models directory contains the pydantic models which define the dataset, groups, dimensions and variables. The attributes directory contains the pydantic models for the attributes associated with the dataset (globals), groups and variables.

Product definitions are conventionally kept in a definitions directory alongside the project (see Specifying data products), though this location can be overridden at runtime.

Specifying data products

Data product definitions are specified in YAML files, typically in the definitions directory.

A simple example of a product definition is:

meta:
    file_pattern: "example_data.nc"
    short_name: "example_data"
    description: "An example data product"
    references:
        - ["Reference 1", "https://example.com"]
        - ["Reference 2", "https://example.com"]
attributes:
    Conventions: "CF-1.8"
    title: "Example data"
    comment: <str: derived_from_file optional>
dimensions:
    - name: time
      size: null # null indicates unlimited dimension
    - name: height
      size: 32
variables:
    - meta:
        name: "example_variable"
        datatype: "<float32>"
        required: true
    attributes:
        long_name: "Example variable"
        units: "m"
        comment: <str: derived_from_file optional>
    dimensions:
        - time
        - height

This definition specifies a single required variable, example_variable, with dimensions time and height. Attributes may be literal values, or may be a placeholder indicating that the value may change between files. In this case, the comment attribute is derived from the file. A typical attribute placeholder is <str: derived_from_file optional>, which indicates that the attribute is a string, and that it is optional. Array-valued attributes are also supported, for example <Array[int8]: derived_from_file optional> indicates that the attribute is an array of 8-bit integers, and is optional.
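To make the placeholder syntax concrete, here is a minimal sketch of how such strings could be recognised. It is an illustration based only on the two placeholder forms shown above, not vocal's actual parser; the Placeholder type, the parse_placeholder function and the regular expression are all hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Placeholder(NamedTuple):
    datatype: str    # e.g. "str" or "int8"
    is_array: bool   # True for Array[...] placeholders
    source: str      # e.g. "derived_from_file"
    optional: bool   # True when the trailing "optional" keyword is present


# Assumed grammar: "<TYPE: SOURCE [optional]>" where TYPE is NAME or Array[NAME].
_PLACEHOLDER = re.compile(
    r"^<\s*(?:Array\[(?P<array>\w+)\]|(?P<scalar>\w+))\s*:\s*"
    r"(?P<source>\w+)(?P<optional>\s+optional)?\s*>$"
)


def parse_placeholder(value: str) -> Optional[Placeholder]:
    """Return a Placeholder, or None when the value is a plain literal."""
    match = _PLACEHOLDER.match(value)
    if match is None:
        return None
    array_type = match.group("array")
    return Placeholder(
        datatype=array_type or match.group("scalar"),
        is_array=array_type is not None,
        source=match.group("source"),
        optional=match.group("optional") is not None,
    )
```

Under this sketch a literal such as "Example variable" yields None, while both placeholder forms from the paragraph above yield a populated Placeholder.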

Versioning data product definitions

The 'working' copy of a data product definition is typically stored in the definitions directory. However, it is possible that a data product definition may change over time. For example, a new version of a standard may be released, or a data product may be updated to include new variables. In this case, it is useful to be able to track the changes between versions of a data product definition.

To create a versioned release of a set of data product definitions, use the vocal release command:

$ vocal release -p <project_path> -v <version> -u <pack_repo_url> -o <output_dir>

This produces a pack: a self-describing, independently releasable catalogue of product definitions. The command writes a v<version>/ directory containing the versioned product definitions, plus a byte-identical latest/ directory holding a copy of the most recent release. Each product definition is a JSON file intended to be used with the check command, and each release directory carries a manifest.json recording the pack's identity and the standard it requires. Additionally a dataset_schema.json file is created, which is a JSON Schema representation of the pydantic model for the dataset, minus any validators.

The -u/--url value is the pack's GitHub repository URL — the repository you will publish the pack from. It is recorded in every manifest.json and is the identity consumers fetch the pack by (see Packs). On the first release in a fresh output directory --url is required; on subsequent releases it falls back to the URL recorded in <output>/latest/manifest.json, and supplying a different URL is a deliberate, explicit operation.

Publishing a pack is a normal git workflow: commit the v<version>/ and latest/ tree and cut a GitHub release from the repository. vocal release only produces the files locally; it does not create the GitHub release for you.

Packs

A pack is a versioned, self-describing catalogue of product definitions, produced with vocal release (see Versioning data product definitions). Where a project defines the standard, a pack holds the concrete product definitions authored against that standard, and is published and consumed independently of it.

Hosting a pack

Packs are hosted on GitHub, exactly like projects. A pack repository is a multi-version monorepo: it keeps every release's v{Y}/ directory plus a latest/ copy, so a single repository carries the full version history. To publish, commit the tree produced by vocal release and cut a GitHub release from the repository — the release's source archive then contains every version.

A pack's identity is its GitHub repository URL, recorded by vocal release --url into every manifest.json. There is no separate static-hosting URL; the repository is the pack.

Fetching a pack

Obtain a pack with the same fetch command used for projects:

$ vocal fetch <pack-repo-url>

By default this downloads the pack repository's latest GitHub release and registers it. For private repositories, non-GitHub hosts, or repositories with no published release, clone the repository directly with --git:

$ vocal fetch --git <pack-repo-url>

vocal fetch auto-detects whether a URL points at a project or a pack by inspecting the downloaded tree (a conventions.yaml at the root is a project; a latest/manifest.json is a pack), so you never have to tell it which kind of resource you are fetching. If a fetched repository is neither, the command reports a clear error.

A single fetch registers every version (v{Y}) the pack contains — you can then validate files authored against any historical version without re-fetching. The latest/ directory is a hosting artifact only and is not registered separately; "latest" is simply the highest registered version.

Fetching a pack that is already registered is gated to avoid silently clobbering what you have:

  • a plain vocal fetch <pack-repo-url> on an already-registered pack reports that it is already fetched and hints at --update / --force;
  • vocal fetch --update <pack-repo-url> picks up newly released versions and refreshes existing ones. Update is additive — it never removes a version you have already registered;
  • vocal fetch --force <pack-repo-url> re-installs every version in the latest release regardless of what is registered, repairing a corrupted or partial install.

How packs are used at check time

When you check a file, vocal routes it to the right pack using two global attributes on the file: vocal_definitions_url (the pack's GitHub repository URL) and vocal_definitions_version (the v{Y} release the file was authored against). The pack must already be fetched; if the named URL or version is not registered, vocal check reports a PackMissing error hinting at the vocal fetch <pack-repo-url> you need to run.

vocal_definitions_version is optional. When it is present, the file is checked against that exact registered version. When it is absent (but vocal_definitions_url is present), vocal falls back to the highest registered version for that pack.

Latest-version caveat. With vocal_definitions_version omitted, a file is validated against the locally newest registered version, which may differ from the version it was actually authored against. The version attribute is the precise pin; its absence means "latest". For reproducible checks, pin the version explicitly.
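The resolution rule described in this section can be summarised in a few lines. The sketch below is illustrative rather than vocal's implementation: resolve_pack_version is a hypothetical function, the PackMissing class is a stand-in for the error named above, and representing versions as integers is an assumption.

```python
from typing import Iterable, Optional


class PackMissing(LookupError):
    """Stand-in for the PackMissing error named above (hypothetical class)."""


def resolve_pack_version(
    registered: Iterable[int], requested: Optional[int] = None
) -> int:
    """Pick the pack version a file should be checked against.

    registered: the versions fetched locally for the pack.
    requested:  the value of vocal_definitions_version, or None if absent.
    """
    versions = set(registered)
    if not versions:
        raise PackMissing("pack is not fetched")
    if requested is None:
        # No pin on the file: fall back to the highest registered version.
        return max(versions)
    if requested not in versions:
        raise PackMissing(f"version {requested} is not registered")
    return requested
```

The fallback branch is the source of the caveat: its result depends on what happens to be fetched locally, whereas the pinned branch always returns the same version or fails.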

Checking data products

Vocal can be used to check netCDF files against vocal projects and data product definitions. To do this, use the check command:

$ vocal check <file> -p <project_name> -d <definition>

This will check the file against the project and definition specified. If the file is valid, the command will return with exit code 0. If the file is invalid, the command will return with exit code 1. When checking against a product definition, all of the checks will be printed to the console. You can limit the output to warnings and errors only by using the -w flag, to errors only by using the -e flag, or to no output by using the -q flag. Comments are hidden by default; pass -c/--comments to show them. Use --no-color to disable coloured output.

For example,

$ vocal check <file> -p <project_name> -d <definition> -e

will check the file against the project and definition specified, and will only print errors to the console.
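Because validity is reported through the exit code, the check command is straightforward to drive from a script. The following is a minimal sketch using only the flags documented above; build_check_command and file_is_valid are hypothetical helper names, and the sketch assumes the vocal command is on your PATH.

```python
import subprocess
from typing import List, Optional


def build_check_command(
    path: str,
    project: Optional[str] = None,
    definition: Optional[str] = None,
    errors_only: bool = False,
) -> List[str]:
    """Assemble a 'vocal check' command line from the documented flags."""
    command = ["vocal", "check", path]
    if project is not None:
        command += ["-p", project]
    if definition is not None:
        command += ["-d", definition]
    if errors_only:
        command.append("-e")
    return command


def file_is_valid(path: str, **options) -> bool:
    """Run vocal check and report whether it exited with code 0."""
    result = subprocess.run(build_check_command(path, **options))
    return result.returncode == 0
```

For example, file_is_valid("example_data.nc", project="example_project", errors_only=True) runs the errors-only check and returns True only when the command exits with code 0.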

A file can also be checked only against a project, without a data product definition:

$ vocal check <file> -p <project_name>

For example, to check a data file against a project standard:

$ vocal check example_data.nc -p example_project

Checking example_data.nc against example_project standard... OK!

Any errors will be printed to the console, indicating where in the file the error occurred, the reason for the error, and potentially the validator that failed.

$ vocal check example_data.nc -p example_project

Checking example_data.nc against example_project standard... ERROR!
✗ root -> groups -> instrument_group_1 -> attributes -> instrument_name: field required

If you omit -p, vocal resolves the project automatically from the file's Conventions attribute, matching it against the registered projects. When the file also carries the vocal_definitions_url (and optionally vocal_definitions_version) attributes, the matching product definition is resolved from the corresponding registered pack as well, so neither -p nor -d is needed (see Packs):

$ vocal check example_data.nc

Checking example_data.nc against MYSTD-1 standard... OK!

If no registered project matches the file's conventions, or the named pack/version is not fetched, check reports a typed error explaining what to fetch or register.

Checking data products via the web interface

Vocal includes a web-based checker GUI that can be launched with the web command:

$ vocal web

INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8088 (Press CTRL+C to quit)

The host and port can be configured with the --host and --port options.

Security

The web GUI can fetch projects and packs from URLs, and a fetched project's Python package is imported — i.e. runs on this machine — when a file is checked. The running server is therefore a code-execution surface, so it ships locked down by default:

  • Binds to 127.0.0.1 by default. Only your own machine can reach it.

  • Downloads are disabled by default. The "Add" affordance and the /add fetch route are turned off; the GUI checks files against projects and packs you have already registered (e.g. via vocal fetch on the CLI). Pass --allow-downloads to let GUI users fetch from URLs:

    $ vocal web --allow-downloads
    
  • Exposing downloads to the network is refused. Combining a non-loopback --host (e.g. 0.0.0.0 or a LAN address) with --allow-downloads would make the GUI an unauthenticated remote code execution service, so vocal web refuses that combination unless you explicitly acknowledge the risk with --dangerously-allow-remote. A non-loopback bind without downloads is allowed (a read-only viewer) but prints a warning.
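The three rules above amount to a small decision table. The sketch below restates them in Python for clarity; it is not vocal's code. The function names and the "allow" / "warn" / "refuse" return values are hypothetical, and treating the name localhost as loopback is an assumption.

```python
import ipaddress


def is_loopback(host: str) -> bool:
    """Return True if host is a loopback address."""
    if host == "localhost":  # assumption: the name is treated as loopback
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal: treat unknown hostnames as non-loopback.
        return False


def bind_decision(
    host: str = "127.0.0.1",
    allow_downloads: bool = False,
    dangerously_allow_remote: bool = False,
) -> str:
    """Return 'allow', 'warn' or 'refuse' for a combination of options."""
    if is_loopback(host):
        return "allow"    # only this machine can reach the server
    if not allow_downloads:
        return "warn"     # read-only viewer exposed to the network
    if dangerously_allow_remote:
        return "allow"    # risk explicitly acknowledged
    return "refuse"       # remote downloads without acknowledgement
```

The only path to a network-exposed server with downloads enabled runs through the explicit acknowledgement flag.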

Residual risk (CSRF). The Add form carries no CSRF token, so when downloads are enabled a malicious web page you visit could issue a cross-origin request to the GUI on localhost and trigger a fetch. The default-off posture mitigates this for the common case; only enable downloads on a machine you trust.

Watching a folder with the gatekeeper

The gatekeep command turns vocal into a folder watcher: it polls a directory, checks each new file, and routes it to a configurable action depending on whether it passes. It is intended to sit in front of an archive or processing pipeline, letting only conforming files through.

$ vocal gatekeep -w <folder>

Each file is resolved from its own global attributes (exactly as vocal check does when given no -p) and validated against both the standard and the matching product definition. A file that resolves and validates is a pass; a file that fails validation — or that cannot be resolved at all (it carries no recognised, registered standard and product) — is a fail. The gatekeeper makes no distinction between "broken" and "not one of ours": anything that is not demonstrably conforming is rejected.

The action taken on each outcome is configurable:

  • Pass (--pass-action/-pa): none (the default — leave the file in place), move (to --pass-folder/-pf), or command (run --pass-command/-pc).
  • Fail (--fail-action/-fa): move (the default — to --fail-folder/-ff), delete, or command (run --fail-command/-fc).

The default fail action is move rather than delete, and move requires its destination folder, so a bare vocal gatekeep -w <folder> will stop and ask for --fail-folder. This is deliberate: nothing is deleted or silently dropped until you have said where rejected files should go. A typical invocation sorts files into two folders:

$ vocal gatekeep -w ./incoming -pa move -pf ./accepted -fa move -ff ./rejected

For command actions, the file path is substituted for a {} token in the command if present, or appended as the final argument otherwise; the command is run without a shell. A command that overruns --command-timeout/-ct (default 300 seconds) is killed and treated as a failure, so a hung command cannot block the watcher.
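As an illustration of the substitution rule, the sketch below builds an argument list from a command template and runs it without a shell. It is a guess at the general shape, not vocal's implementation: the helper names are hypothetical, splitting the template with shlex is an assumption, and so is treating a non-zero exit code as a failure.

```python
import shlex
import subprocess
from typing import List


def build_action_command(template: str, file_path: str) -> List[str]:
    """Substitute file_path for {} in the template, or append it."""
    args = shlex.split(template)  # assumption: POSIX-style splitting
    if any("{}" in arg for arg in args):
        return [arg.replace("{}", file_path) for arg in args]
    return args + [file_path]


def run_action(template: str, file_path: str, timeout: float = 300.0) -> bool:
    """Run the command without a shell; a timeout counts as a failure."""
    command = build_action_command(template, file_path)
    try:
        # subprocess.run kills the child if the timeout expires.
        result = subprocess.run(command, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

Because the path is passed as a single list element and no shell is involved, file names containing spaces or shell metacharacters are handed to the command unchanged.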

The folder is scanned immediately on startup and then every --frequency/-f seconds (default 300). A few safeguards keep the watcher well-behaved over long runs:

  • A file that is still being written is skipped until it has been left untouched for a short period, so partially-copied files are never checked.
  • Files are de-duplicated by path, size and modification time, so a file that passes and stays in the folder (for example under --pass-action none) is checked once, not on every cycle.
  • Only one gatekeeper may watch a given folder at a time; a second invocation on the same folder exits immediately.
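The first two safeguards can be sketched as a single predicate over a file's metadata. This is an illustration under stated assumptions, not the gatekeeper's actual logic: should_check is a hypothetical name, and the 10-second settle period is an invented default, since the length of the "short period" is not specified above.

```python
import os
import time
from typing import Optional, Set, Tuple

Fingerprint = Tuple[str, int, int]  # (absolute path, size, mtime in ns)


def should_check(
    path: str,
    seen: Set[Fingerprint],
    settle_seconds: float = 10.0,  # assumed value for the "short period"
    now: Optional[float] = None,
) -> bool:
    """Return True if the file is settled and has not been checked before."""
    stat = os.stat(path)
    now = time.time() if now is None else now
    if now - stat.st_mtime < settle_seconds:
        return False  # modified too recently; may still be being written
    key = (os.path.abspath(path), stat.st_size, stat.st_mtime_ns)
    if key in seen:
        return False  # same path, size and mtime: already checked
    seen.add(key)
    return True
```

A file that is later rewritten in place gets a new size or modification time, so its fingerprint changes and it becomes eligible for checking again.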

The watcher runs until interrupted (Ctrl-C).

Creating example data

Vocal can be used to create example data files from vocal projects and data product definitions. To do this, use the build command:

$ vocal build -p <project_name> -d <definition> -o <output_file>

This will create a netCDF file with sinusoidal data for each variable in the data product definition.

Generating documentation

Vocal can generate human-readable documentation directly from a project or a product, so the documentation never drifts from the definitions it describes. Use the autodoc command:

$ vocal autodoc --project <project_path> -o standard.html

There are two modes, and you must supply exactly one:

  • --project <path> documents the abstract standard — it walks the project's pydantic model tree and reports the structure (groups, variables, dimensions and attributes) together with the rules and constraints each must satisfy. Point it at the importable project package (the project_directory inside the repo root, e.g. ~/.vocal/projects/FAAM-0/faam).
  • --product <path> documents a concrete instance — it walks a product-pack JSON definition and reports the actual structure and values, without consulting the project.

The rendered output is controlled by --format/-f (default html) and written to --out/-o (default autodoc.<ext>); pass -o - to write to stdout so the output can be piped. Add --open to open the rendered file in a browser:

$ vocal autodoc --project ~/.vocal/projects/FAAM-0/faam --open

Internally, both modes parse their source into a shared documentation IR (intermediate representation) which is then handed to a registered renderer, so additional output formats can be added without changing the walkers.

Using vocal from Python

As well as the vocal command, the vocal.utils module exposes a small programmatic API for loading a project and pulling a product definition out of a fetched pack from your own code. This is useful when you want to work with a product definition as a pydantic model rather than checking a file from the command line.

from vocal.utils import import_project, get_product

# Import a fetched project package by the path to its importable module
# (the project_directory inside the repo root, e.g. ~/.vocal/projects/FAAM-0/faam).
project = import_project("/home/dave/.vocal/projects/FAAM-0/faam")

# A fetched pack's product_root is the directory holding its v{Y}/ release
# directories (e.g. ~/.vocal/packs/<slug>).
pack_dir = "/home/dave/.vocal/packs/github-com-davesproson-faam-data-products-0"

# Resolve a product by its meta.short_name. The version may be a specific
# integer (e.g. 1) or "latest", which selects the highest-numbered v{Y}/.
product = get_product("core_1hz", project, "latest", pack_dir)

The functions are:

  • import_project(path) — import a project package from the path to its importable module (the project_directory inside the repo root) and return the imported module. The module exposes the project contract: models.Dataset, filecodec, and defaults.
  • get_spec(short_name, project, version="latest", product_root=None) — return the raw product definition (as a dict) whose meta.short_name matches short_name, or None if no product matches. version is a pack release number or "latest"; product_root is the pack directory holding the v{Y}/ release directories. When product_root is omitted it defaults to the project's repo root (useful only when the definitions are colocated with the project).
  • get_product(short_name, project, version="latest", product_root=None) — as get_spec, but returns the definition validated against the project's models.Dataset pydantic model rather than a raw dict.

Note. Resolving "latest" selects the highest-numbered v{Y}/ directory present under product_root, which is the locally newest fetched release. As with the Conventions/vocal_definitions_* resolution used by vocal check, pass an explicit version for reproducible results (see Packs).

About

Managing definitions and vocabularies for netCDF data products. Note that this is forked from the work I did while at the FAAM Airborne Laboratory. All development now happens here.
