30 changes: 30 additions & 0 deletions README.md
@@ -246,6 +246,18 @@ Every TLD is identified by its **A-label** — the ASCII form, including `xn--`

The **U-label** — the rendered Unicode form (e.g. `москва`) — is display-only and appears solely in the `tld_unicode` field, alongside the A-label, never as a key or reference. Consumers that render a name resolve the A-label to `tld_unicode`; they never key on it.

## The typed graph

Alongside `tlds.json`, the build ships four derived reverse-index artifacts that model the root zone as a typed graph of four entity types plus one enum:

- **Domains** — the TLDs themselves (`tlds.json`).
- **Organizations** — registries, governance bodies, and infrastructure operators (`organizations.json`).
- **Places** — countries, dependent territories, subdivisions, cities, and supranational regions (`places.json`).
- **Cultures** — ethno-linguistic communities like the Basques or Welsh (`cultures.json`).
- **Agreement types** — the ICANN registry-agreement enum (`agreements.json`).

Each TLD relates to one or more Organizations through *roles* (Sponsor, Administrative Contact, Technical Contact, and, for gTLDs, ICANN Registry Operator), to zero or more Places (most ccTLDs map to one country; geographic gTLDs map to a city, subdivision, country, or supranational region), to an optional Culture, and to its agreement types. Each derived artifact is deterministic: it is rebuilt from `tlds.json`, joined with a hand-curated seed under `data/manual/` where one exists, so you can delete it and `make build` regenerates it. Every cross-file relationship is covered by referential-integrity tests, so a dangling foreign key or an orphaned record fails the test suite rather than shipping.
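
The referential-integrity guarantee can be pictured as a check over both sides of a reverse index. The sketch below is a minimal illustration, not the project's actual test suite. It assumes `tlds.json` and `cultures.json` have been loaded into plain lists of dicts and that each TLD record exposes its A-label under a hypothetical `tld` key; only `slug`, `tlds`, and `cultural_affiliation` are names taken from this README, and `find_integrity_errors` is a made-up helper name.

```python
from typing import Any


def find_integrity_errors(
    tlds: list[dict[str, Any]],
    cultures: list[dict[str, Any]],
) -> list[str]:
    """Return human-readable problems; an empty list means the graph is consistent."""
    # "tld" is an assumed key name for the A-label; the README does not name it.
    known_tlds = {record["tld"] for record in tlds}
    known_cultures = {culture["slug"] for culture in cultures}
    errors: list[str] = []

    # Forward direction: every foreign key must resolve to a culture record.
    for record in tlds:
        slug = record.get("cultural_affiliation")
        if slug is not None and slug not in known_cultures:
            errors.append(f"{record['tld']}: dangling cultural_affiliation {slug!r}")

    # Reverse direction: every reverse-index entry must resolve to a TLD,
    # and a culture that lists no TLDs at all counts as an orphan.
    for culture in cultures:
        if not culture["tlds"]:
            errors.append(f"{culture['slug']}: orphaned culture (no TLDs)")
        for label in culture["tlds"]:
            if label not in known_tlds:
                errors.append(f"{culture['slug']}: unknown TLD {label!r}")

    return errors
```

The same two-directional check would apply to the other artifacts by swapping in their foreign-key field and record list.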

## `organizations.json`

The `data/generated/organizations.json` file is the canonical record of the organizations that play roles for TLDs, with a reverse index of those roles. It is built from a hand-curated identity seed (`data/manual/organizations.json`) joined against `tlds.json`, and replaces the old per-role alias files.
@@ -254,6 +266,24 @@ Each org carries an editorial `display_name` and a stable kebab-case `slug` (the

> **Consolidated subset:** this currently covers the curated multi-source organizations only. The single-source long tail (orgs that appear under one exact name in one source) is not yet included, so the absence of a TLD's operator here does not mean it has none.

## `places.json`

The `data/generated/places.json` file is the canonical record of the places associated with TLDs, with a reverse index of their TLDs. Countries are derived mechanically from ccTLDs (ISO 3166-1 via `pycountry`); subdivisions, cities, and supranational regions come from a hand-curated seed (`data/manual/places.json`).

Each place carries a stable `slug` (ISO 3166-1 alpha-2 for countries, e.g. `gb`; a recognizable short name for subdivisions, e.g. `basque-country`; the TLD for cities, e.g. `amsterdam`), an English `name_en`, a `subtype` (`country` / `subdivision` / `city` / `supranational`), the `iso_code` where one exists, a `parent` slug for hierarchy (subdivision/city → country; dependent territory → sovereign), an optional `info_link`, and the `tlds` reverse index. A sparse `iso_designation` field carries ISO 3166-1 status for the special cases: `dependent_territory` (e.g. `bm` → `gb`), `exceptionally_reserved` (`ac`), `transitionally_reserved` (`su`), and `special_area` (`aq`). `places[]` is sorted by `slug`.

The United Kingdom is one place slugged `gb` (its ISO alpha-2), carrying both `.gb` and `.uk`; IDN ccTLDs fold into their country (e.g. `xn--p1ai` joins `ru`). Slugs and `tlds` are A-labels/ASCII; Unicode rendering is left to consumers.
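
The `parent` field makes the place hierarchy walkable. As a rough sketch, a consumer could resolve a place's ancestry as shown here. It assumes `places.json` has been loaded into a list of dicts and that top-level places either omit `parent` or set it to null, which this README does not specify; `ancestors` is a hypothetical helper name.

```python
from typing import Any


def ancestors(places: list[dict[str, Any]], slug: str) -> list[str]:
    """Follow `parent` links upward from `slug`, nearest ancestor first."""
    by_slug = {place["slug"]: place for place in places}
    chain: list[str] = []
    seen = {slug}
    # Assumption: a missing or null `parent` marks the top of the hierarchy.
    current = by_slug[slug].get("parent")
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in parent links at {current!r}")
        if current not in by_slug:
            raise KeyError(f"dangling parent {current!r}")
        chain.append(current)
        seen.add(current)
        current = by_slug[current].get("parent")
    return chain
```

With the dependent-territory example above, `ancestors(places, "bm")` would yield `["gb"]`, provided both records are present in `places`.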

## `cultures.json`

The `data/generated/cultures.json` file records the ethno-linguistic communities that at least one TLD claims affiliation with, with a reverse index of their TLDs. It is built from a hand-curated seed (`data/manual/cultures.json`) joined against each TLD's `cultural_affiliation` annotation.

Each culture carries a stable `slug` (the foreign key `cultural_affiliation` points at), an English `name_en`, an `info_link` to Wikipedia, an optional ISO 639 `language_code`, and the `tlds` reverse index. `cultures[]` is sorted by `slug`. The schema is intentionally minimal: descriptions and cross-artifact links belong on the canonical source (Wikipedia via `info_link`), not duplicated here.
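
The join described in this section can be sketched in a few lines. This is an illustration under assumptions rather than the project's build code: `build_cultures` is a hypothetical name, TLD records are assumed to expose their A-label under a `tld` key, seed entries that no TLD points at are dropped (one reading of "at least one TLD claims affiliation"), and sorting each `tlds` list is an assumption made only to keep the output deterministic.

```python
from typing import Any


def build_cultures(
    seed: list[dict[str, Any]],
    tlds: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Join the curated seed against each TLD's `cultural_affiliation`."""
    # Group A-labels by the culture slug they point at.
    tlds_by_culture: dict[str, list[str]] = {}
    for record in tlds:
        slug = record.get("cultural_affiliation")
        if slug is not None:
            tlds_by_culture.setdefault(slug, []).append(record["tld"])  # assumed key

    # A slug that is missing from the seed is a dangling foreign key.
    unknown = sorted(set(tlds_by_culture) - {culture["slug"] for culture in seed})
    if unknown:
        raise ValueError(f"cultural_affiliation not in seed: {unknown}")

    # Keep only cultures with at least one TLD and attach the reverse index.
    cultures = [
        {**culture, "tlds": sorted(tlds_by_culture[culture["slug"]])}
        for culture in seed
        if culture["slug"] in tlds_by_culture
    ]
    # `cultures[]` is sorted by `slug`, as stated above.
    return sorted(cultures, key=lambda culture: culture["slug"])
```

Because the output depends only on the two inputs and is sorted, running the function twice on the same data produces identical results, which is the property that lets a deleted artifact be regenerated.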

## `agreements.json`

The `data/generated/agreements.json` file is the ICANN registry-agreement-type enum with a reverse index of the gTLDs under each. Each record carries a canonical `slug` (`base` / `non_sponsored` / `brand` / `community` / `sponsored`), a friendly `display_name`, the verbatim ICANN string under `source_names.icann`, and the `tlds` reverse index. `agreements[]` is sorted by `slug`.

## Local usage

- `make deps` - Install the project dependencies
5 changes: 4 additions & 1 deletion bin/lint
@@ -7,10 +7,13 @@ if [ ${#paths[@]} -eq 0 ]; then
  paths=(src/ tests/)
fi

# Run all three linters even if an earlier one fails, so the developer
# Run all linters even if an earlier one fails, so the developer
# sees the full set of findings in one pass instead of round-tripping.
exit_code=0
uv run ruff check "${paths[@]}" || exit_code=$?
uv run ruff format --check "${paths[@]}" || exit_code=$?
uv run pyright "${paths[@]}" || exit_code=$?
# JSON parse check runs over the whole repo (independent of the path args) so a
# stray syntax error or committed merge-conflict marker fails the lint pass.
python3 bin/lint-json.py || exit_code=$?
exit $exit_code
47 changes: 47 additions & 0 deletions bin/lint-json.py
@@ -0,0 +1,47 @@
#!/usr/bin/env python3
"""Validate that every JSON file in the repo parses cleanly."""

import json
import sys
from collections.abc import Iterator
from pathlib import Path

EXCLUDED_DIRS = {".git", ".venv", "node_modules", "__pycache__"}

# Test fixtures that are intentionally invalid JSON.
EXCLUDED_FILES = {
    Path("tests/fixtures/metadata/corrupted-metadata.json"),
}


def find_json_files(root: Path) -> Iterator[Path]:
    """Yield every *.json file under root, skipping excluded dirs and files."""
    for path in root.rglob("*.json"):
        # Compare repo-relative parts so a parent directory of the checkout
        # that happens to share an excluded name cannot hide every file.
        rel = path.relative_to(root)
        if any(part in EXCLUDED_DIRS for part in rel.parts):
            continue
        if rel in EXCLUDED_FILES:
            continue
        yield path


def main() -> int:
    root = Path.cwd()
    bad: list[tuple[Path, str]] = []
    count = 0
    for path in find_json_files(root):
        count += 1
        try:
            json.loads(path.read_text(encoding="utf-8"))
        # ValueError covers json.JSONDecodeError and UnicodeDecodeError, so a
        # file that is not valid UTF-8 is reported instead of crashing the run.
        except (ValueError, OSError) as e:
            bad.append((path.relative_to(root), str(e)))

    if bad:
        for rel, err in bad:
            print(f"{rel}: {err}", file=sys.stderr)
        print(f"\n{len(bad)} of {count} JSON file(s) failed to parse.", file=sys.stderr)
        return 1

    print(f"{count} JSON file(s) parse cleanly.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
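
To see why a plain parse is enough to catch a committed merge-conflict marker, consider this small standalone sketch. The sample strings and the `parses` helper are made up for illustration and are not part of the script.

```python
import json


def parses(text: str) -> bool:
    """Return True if `text` is valid JSON, mirroring the linter's parse check."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return False
    return True


# A well-formed document and the same document after an unresolved merge.
clean = '{"slug": "gb"}'
conflicted = '<<<<<<< HEAD\n{"slug": "gb"}\n=======\n{"slug": "uk"}\n>>>>>>> branch\n'
```

Here `parses(clean)` is true while `parses(conflicted)` is false, because the first `<` is not a valid start of any JSON value.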