- AI Model: Claude Sonnet 4.6
- Actions Taken:
  - Added `config/enabled_languages.txt` — allowlist of explicitly enabled BCP-47 codes; empty by default, managed by automation.
  - Added `_load_enabled_languages()` in `scripts/generate_data.py` — reads the allowlist and seeds `all_langs` before scanning, so those codes appear in `coverage.json`/`stats.json` at 0% when no locale files exist yet.
  - Added `.github/ISSUE_TEMPLATE/new_language.yml` — issue form for requesting a new language.
  - Added `.github/workflows/enable_new_language.yml` — two-job workflow: (1) on issue open, validates the BCP-47 code and opens a PR for maintainer review; (2) on PR merge into `dev`, triggers `update_data.yml` for a data refresh.
  - Updated `index.html` `submitLangRequest()` — a BCP-47 code is now required, validated client-side, and embedded as a `<!-- NEW_LANGUAGE_META ... -->` block in the issue body so the workflow can parse it reliably.
- Oversight: Human review required (PR must be approved before language is enabled).
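The allowlist loader described above can be sketched roughly as follows — a minimal sketch only, assuming a plain newline-separated file; the actual `_load_enabled_languages()` in `scripts/generate_data.py` may handle edge cases differently:

```python
# Minimal sketch of an allowlist loader like _load_enabled_languages();
# the blank-line and "#"-comment handling here is an assumption.
from pathlib import Path

def load_enabled_languages(path="config/enabled_languages.txt"):
    """Return the set of explicitly enabled BCP-47 codes (empty if no file)."""
    p = Path(path)
    if not p.exists():
        return set()
    codes = set()
    for line in p.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):  # skip blanks and comments
            codes.add(line)
    return codes
```

Seeding `all_langs` with this set before the scan is what makes the codes show up in `coverage.json`/`stats.json` at 0% even when no locale files exist yet.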
- AI Model: Claude Sonnet 4.6
- Actions Taken:
  - Added `ovos_localize/datasets/slot_filling.py` — slot-filling / NER dataset: intent templates + slot names + known entity values from `.entity` files.
  - Added `ovos_localize/datasets/response_pairs.py` — intent→dialog response pairs derived from `context.triggers_dialog` (AST-extracted handler analysis); no string heuristics.
  - Added `ovos_localize/datasets/tts_corpus.py` — TTS training corpus built from all `.dialog` files across all languages; template-expanded and deduplicated.
  - Added `ovos_localize/datasets/skill_metadata.py` — multilingual skill name/description/examples/tags from `skill.json` files.
  - Updated `ovos_localize/datasets/__init__.py` to export all six generators.
  - Rewrote `scripts/generate_datasets.py` to wire up all generators; outputs to `data/datasets/{slot_filling,response_pairs,tts,skill_metadata}/`.
  - Added `test/unittests/test_datasets.py` — 28 unit tests covering all four generators.
  - Updated `FAQ.md` with a dataset table and an explanation of AST-based pairing.
- Oversight: 168 unit tests passing; generator verified against the live `data/skills/` corpus.
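A rough sketch of how intent templates plus known entity values could yield slot-filling examples — the `{slot}` placeholder syntax and the one-slot-per-template handling here are simplifying assumptions, not the actual `slot_filling.py` logic:

```python
# Illustrative sketch only: fills one "{slot}" placeholder per template with
# known entity values and records the character span of the filled slot.
import re

def fill_template(template, entity_values):
    """Yield (utterance, [(slot, start, end)]) pairs for each entity value."""
    m = re.search(r"\{(\w+)\}", template)
    if m is None:
        yield template, []  # no slot: plain utterance, no spans
        return
    slot = m.group(1)
    for value in entity_values.get(slot, []):
        utterance = template[:m.start()] + value + template[m.end():]
        yield utterance, [(slot, m.start(), m.start() + len(value))]
```

For example, `fill_template("set a timer for {duration}", {"duration": ["five minutes"]})` yields `("set a timer for five minutes", [("duration", 16, 28)])`.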
- AI Model: Claude Sonnet 4.6
- Actions Taken:
  - Fixed the `TypeError: Cannot read properties of undefined (reading 'type')` crash in `renderEditor()` (`index.html:1986`) — replaced `fileData.type` with the already-computed `fileType` variable; `fileData` is `undefined` in create mode.
  - In entity create mode, the source panel now shows intent files that use `{slotName}` (derived from `skill.files` — no extra fetch). The panel header changes from "Source" to "Used in intents".
  - The `fileHelp` message in create mode now names the slot and the intent count for context.
  - The source-language `<select>` is hidden in create mode (no source langs exist).
  - Updated `FAQ.md`.
- Oversight: 140 unit tests passing; JS syntax clean via node.
- AI Model: Claude Sonnet 4.6
- Actions Taken:
  - Extended the public pages list to include `#/stats`, `#/entities`, and `#/open-data` so they render without a saved profile.
  - Removed the permanent accent styling on the Open Data nav link (`index.html:96`).
  - Updated `FAQ.md`.
- Oversight: Verified via Chromium CDP — all three pages render without a profile.
- AI Model: Claude Sonnet 4.6
- Actions Taken:
  - Deleted stale dataset files using deprecated lang codes (`eu-EU.jsonl`, `eu.jsonl`, `es-LM.jsonl`, and their translation counterparts).
  - Added regenerated datasets with normalized codes (`eu-ES.jsonl`, `es-419.jsonl`).
  - Staged and committed all modified skill JSON, coverage, stats, repos, entities, and TSV files.
  - Updated `FAQ.md` to explain the file removal.
- Oversight: 140 unit tests passing.
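The renames in this entry imply a mapping from deprecated to normalized codes. A sketch of that cleanup — the `DEPRECATED_CODES` mapping below is inferred from the file names listed above and is purely illustrative:

```python
# Illustrative only: the deprecated-to-normalized mapping is inferred from
# the renamed files (eu-EU/eu -> eu-ES, es-LM -> es-419), not from real code.
from pathlib import Path

DEPRECATED_CODES = {"eu-EU": "eu-ES", "eu": "eu-ES", "es-LM": "es-419"}

def normalized_dataset_name(filename):
    """Rewrite a dataset filename whose stem is a deprecated lang code."""
    p = Path(filename)
    return DEPRECATED_CODES.get(p.stem, p.stem) + p.suffix
```

Keeping such a mapping in one place makes it easy to delete the stale files and regenerate under the normalized names in a single pass.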
- AI Model: Gemini 2.0 Flash
- Actions Taken:
  - Added `language_data>=1.1` to `pyproject.toml` to resolve a `ModuleNotFoundError` raised by `langcodes` during name lookups.
  - Added `PyYAML` to `pyproject.toml` to support parsing of `settingsmeta.yml` files.
  - Synced the local `.venv` using `uv`.
  - Verified that all 139 unit tests pass with 90% coverage.
- Oversight: Automated verification via `pytest`.
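The root cause here is that `langcodes` treats `language_data` as a separately installed backend for display-name lookups and imports it lazily. An explicit guard like this (illustrative, not the project's actual code) makes the optional dependency visible:

```python
# Illustrative guard: langcodes' display-name lookups import the separate
# language_data package lazily and raise ModuleNotFoundError without it.
import importlib.util

def has_language_data():
    """True when the optional language_data package is installed."""
    return importlib.util.find_spec("language_data") is not None
```

Pinning `language_data>=1.1` in `pyproject.toml`, as done above, removes the failure mode entirely rather than guarding at call sites.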
- AI Model: Gemini 2.0 Flash
- Actions Taken:
  - Created the `ovos_localize.datasets` package for generating ML datasets from parsed skills.
  - Implemented `classification.py` for NLU intent-classification datasets.
  - Implemented `translation.py` for parallel-corpus machine-translation datasets.
  - Created the pipeline script `scripts/generate_datasets.py` to auto-generate JSONL files.
  - Updated `.github/workflows/update_data.yml` to run the dataset generation in CI.
  - Updated `docs/index.md` to document the Open Data datasets.
- Oversight: Manual code review and local execution verified dataset generation success.
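One plausible shape for the `classification.py` output is one JSON object per utterance; the `text`/`label`/`lang` field names below are assumptions for illustration, since this log does not show the actual JSONL schema:

```python
# Hypothetical JSONL schema for the intent-classification dataset; the
# "text"/"label"/"lang" field names are illustrative assumptions.
import json

def write_classification_jsonl(examples, fp):
    """Write (utterance, intent, lang) triples, one JSON object per line."""
    for text, label, lang in examples:
        record = {"text": text, "label": label, "lang": lang}
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")
```

JSONL keeps each example independently parseable, which is why it splits cleanly into size-capped chunks later in the pipeline.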
- AI Model: Gemini 2.0 Flash
- Actions Taken:
  - Refactored `generate_data.py` and `generate_datasets.py` to enforce a 48 MB limit per file.
  - Implemented chunked JSON loading for per-skill detail files (e.g., `ovos-skill-days-in-history.json` is split into 2 chunks).
  - Updated `index.html` with a new `fetchSkill` helper that seamlessly handles multi-chunk skill data.
  - Updated the ML dataset generators to expand all sentence templates (`(a|b)`, `[optional]`) into unique utterances.
  - Implemented data cleaning for the ML datasets: lowercase, collapse extra whitespace, and deduplicate.
  - Refactored `dataset.tsv` generation to use expansion and splitting (now 100 MB+, split into 3 files).
  - Removed JSON indentation across all generated data to reduce file size.
- Oversight: Verified via local execution that all file sizes are < 50 MB and that content is expanded and cleaned.
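The template-expansion step above can be sketched as a small recursive rewriter, assuming `(a|b)` marks alternatives and `[x]` an optional part — a simplification of whatever the generators actually implement:

```python
# Illustrative sketch: expands "(a|b)" alternatives and "[x]" optionals into
# all concrete utterances, normalizing whitespace and deduplicating.
import re

def expand(template):
    """Return every concrete utterance a template can produce."""
    m = re.search(r"\(([^()]*)\)|\[([^\[\]]*)\]", template)
    if m is None:
        return [" ".join(template.split())]  # base case: clean whitespace
    head, tail = template[:m.start()], template[m.end():]
    if m.group(1) is not None:
        options = m.group(1).split("|")      # (a|b): pick one alternative
    else:
        options = [m.group(2), ""]           # [x]: present or absent
    results = []
    for opt in options:
        results.extend(expand(head + opt + tail))
    return list(dict.fromkeys(results))      # dedupe, preserving order
```

For example, `expand("turn (on|off) [the] light")` produces `["turn on the light", "turn on light", "turn off the light", "turn off light"]` — expanding before cleaning is what lets the dedupe step catch utterances that only differed in optional whitespace.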