- AI Model: Claude Sonnet 4.6
- Actions Taken:
  - Added `config/enabled_languages.txt` — allowlist of explicitly enabled BCP-47 codes; empty by default, managed by automation.
  - Added `_load_enabled_languages()` in `scripts/generate_data.py` — reads the allowlist and seeds `all_langs` before scanning, so those codes appear in `coverage.json`/`stats.json` at 0% when no locale files exist yet.
  - Added `.github/ISSUE_TEMPLATE/new_language.yml` — issue form for requesting a new language.
  - Added `.github/workflows/enable_new_language.yml` — two-job workflow: (1) on issue open, validates the BCP-47 code and opens a PR for maintainer review; (2) on PR merge into `dev`, triggers `update_data.yml` for a data refresh.
  - Updated `index.html` `submitLangRequest()` — a BCP-47 code is now required, validated client-side, and embedded as a `<!-- NEW_LANGUAGE_META ... -->` block in the issue body so the workflow can parse it reliably.
- Oversight: Human review required (PR must be approved before language is enabled).
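The allowlist loader described above can be sketched roughly as follows — a minimal sketch only, assuming a plain newline-separated file; the actual `_load_enabled_languages()` in `scripts/generate_data.py` may handle edge cases differently:

```python
# Minimal sketch of an allowlist loader like _load_enabled_languages();
# the blank-line and "#"-comment handling here is an assumption.
from pathlib import Path

def load_enabled_languages(path="config/enabled_languages.txt"):
    """Return the set of explicitly enabled BCP-47 codes (empty if no file)."""
    p = Path(path)
    if not p.exists():
        return set()
    codes = set()
    for line in p.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):  # skip blanks and comments
            codes.add(line)
    return codes
```

Seeding `all_langs` with this set before the scan is what makes the codes show up in `coverage.json`/`stats.json` at 0% even when no locale files exist yet.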
- AI Model: Claude Sonnet 4.6
- Actions Taken:
  - Added `ovos_localize/datasets/slot_filling.py` — slot-filling / NER dataset: intent templates + slot names + known entity values from `.entity` files.
  - Added `ovos_localize/datasets/response_pairs.py` — intent→dialog response pairs derived from `context.triggers_dialog` (AST-extracted handler analysis); no string heuristics.
  - Added `ovos_localize/datasets/tts_corpus.py` — TTS training corpus built from all `.dialog` files across all languages; template-expanded and deduplicated.
  - Added `ovos_localize/datasets/skill_metadata.py` — multilingual skill name/description/examples/tags from `skill.json` files.
  - Updated `ovos_localize/datasets/__init__.py` to export all six generators.
  - Rewrote `scripts/generate_datasets.py` to wire up all generators; outputs to `data/datasets/{slot_filling,response_pairs,tts,skill_metadata}/`.
  - Added `test/unittests/test_datasets.py` — 28 unit tests covering all four generators.
  - Updated `FAQ.md` with a dataset table and an explanation of AST-based pairing.
- Oversight: 168 unit tests passing; generator verified against the live `data/skills/` corpus.
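A rough sketch of how intent templates plus known entity values could yield slot-filling examples — the `{slot}` placeholder syntax and the one-slot-per-template handling here are simplifying assumptions, not the actual `slot_filling.py` logic:

```python
# Illustrative sketch only: fills one "{slot}" placeholder per template with
# known entity values and records the character span of the filled slot.
import re

def fill_template(template, entity_values):
    """Yield (utterance, [(slot, start, end)]) pairs for each entity value."""
    m = re.search(r"\{(\w+)\}", template)
    if m is None:
        yield template, []  # no slot: plain utterance, no spans
        return
    slot = m.group(1)
    for value in entity_values.get(slot, []):
        utterance = template[:m.start()] + value + template[m.end():]
        yield utterance, [(slot, m.start(), m.start() + len(value))]
```

For example, `fill_template("set a timer for {duration}", {"duration": ["five minutes"]})` yields `("set a timer for five minutes", [("duration", 16, 28)])`.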
- AI Model: Claude Sonnet 4.6
- Actions Taken:
  - Fixed the `TypeError: Cannot read properties of undefined (reading 'type')` crash in `renderEditor()` (`index.html:1986`) — replaced `fileData.type` with the already-computed `fileType` variable; `fileData` is `undefined` in create mode.
  - In entity create mode, the source panel now shows intent files that use `{slotName}` (derived from `skill.files` — no extra fetch). The panel header changes from "Source" to "Used in intents".
  - The `fileHelp` message in create mode now names the slot and the intent count for context.
  - The source-language `<select>` is hidden in create mode (no source langs exist).
  - Updated `FAQ.md`.
- Oversight: 140 unit tests passing; JS syntax clean via node.
- AI Model: Claude Sonnet 4.6
- Actions Taken:
  - Extended the public pages list to include `#/stats`, `#/entities`, and `#/open-data` so they render without a saved profile.
  - Removed the permanent accent styling on the Open Data nav link (`index.html:96`).
  - Updated `FAQ.md`.
- Oversight: Verified via Chromium CDP — all three pages render without a profile.
- AI Model: Claude Sonnet 4.6
- Actions Taken:
  - Deleted stale dataset files using deprecated lang codes (`eu-EU.jsonl`, `eu.jsonl`, `es-LM.jsonl`, and their translation counterparts).
  - Added regenerated datasets with normalized codes (`eu-ES.jsonl`, `es-419.jsonl`).
  - Staged and committed all modified skill JSON, coverage, stats, repos, entities, and TSV files.
  - Updated `FAQ.md` to explain the file removal.
- Oversight: 140 unit tests passing.
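The renames in this entry imply a mapping from deprecated to normalized codes. A sketch of that cleanup — the `DEPRECATED_CODES` mapping below is inferred from the file names listed above and is purely illustrative:

```python
# Illustrative only: the deprecated-to-normalized mapping is inferred from
# the renamed files (eu-EU/eu -> eu-ES, es-LM -> es-419), not from real code.
from pathlib import Path

DEPRECATED_CODES = {"eu-EU": "eu-ES", "eu": "eu-ES", "es-LM": "es-419"}

def normalized_dataset_name(filename):
    """Rewrite a dataset filename whose stem is a deprecated lang code."""
    p = Path(filename)
    return DEPRECATED_CODES.get(p.stem, p.stem) + p.suffix
```

Keeping such a mapping in one place makes it easy to delete the stale files and regenerate under the normalized names in a single pass.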
- AI Model: Gemini 2.0 Flash
- Actions Taken:
  - Added `language_data>=1.1` to `pyproject.toml` to resolve a `ModuleNotFoundError` raised by `langcodes` during name lookups.
  - Added `PyYAML` to `pyproject.toml` to support parsing of `settingsmeta.yml` files.
  - Synced the local `.venv` using `uv`.
  - Verified that all 139 unit tests pass with 90% coverage.
- Oversight: Automated verification via `pytest`.
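The root cause here is that `langcodes` treats `language_data` as a separately installed backend for display-name lookups and imports it lazily. An explicit guard like this (illustrative, not the project's actual code) makes the optional dependency visible:

```python
# Illustrative guard: langcodes' display-name lookups import the separate
# language_data package lazily and raise ModuleNotFoundError without it.
import importlib.util

def has_language_data():
    """True when the optional language_data package is installed."""
    return importlib.util.find_spec("language_data") is not None
```

Pinning `language_data>=1.1` in `pyproject.toml`, as done above, removes the failure mode entirely rather than guarding at call sites.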
- AI Model: Gemini 2.0 Flash
- Actions Taken:
  - Created the `ovos_localize.datasets` package for generating ML datasets from parsed skills.
  - Implemented `classification.py` for NLU intent-classification datasets.
  - Implemented `translation.py` for parallel-corpus machine-translation datasets.
  - Created the pipeline script `scripts/generate_datasets.py` to auto-generate JSONL files.
  - Updated `.github/workflows/update_data.yml` to run the dataset generation in CI.
  - Updated `docs/index.md` to document the Open Data datasets.
- Oversight: Manual code review and local execution verified dataset generation success.
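One plausible shape for the `classification.py` output is one JSON object per utterance; the `text`/`label`/`lang` field names below are assumptions for illustration, since this log does not show the actual JSONL schema:

```python
# Hypothetical JSONL schema for the intent-classification dataset; the
# "text"/"label"/"lang" field names are illustrative assumptions.
import json

def write_classification_jsonl(examples, fp):
    """Write (utterance, intent, lang) triples, one JSON object per line."""
    for text, label, lang in examples:
        record = {"text": text, "label": label, "lang": lang}
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")
```

JSONL keeps each example independently parseable, which is why it splits cleanly into size-capped chunks later in the pipeline.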
- AI Model: Gemini 2.0 Flash
- Actions Taken:
  - Refactored `generate_data.py` and `generate_datasets.py` to enforce a 48 MB limit per file.
  - Implemented chunked JSON loading for per-skill detail files (e.g., `ovos-skill-days-in-history.json` is split into 2 chunks).
  - Updated `index.html` with a new `fetchSkill` helper that seamlessly handles multi-chunk skill data.
  - Updated the ML dataset generators to expand all sentence templates (`(a|b)`, `[optional]`) into unique utterances.
  - Implemented data cleaning for the ML datasets: lowercase, collapse extra whitespace, and deduplicate.
  - Refactored `dataset.tsv` generation to use expansion and splitting (now 100 MB+, split into 3 files).
  - Removed JSON indentation across all generated data to reduce file size.
- Oversight: Verified via local execution that all file sizes are < 50 MB and that content is expanded and cleaned.
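The template-expansion step above can be sketched as a small recursive rewriter, assuming `(a|b)` marks alternatives and `[x]` an optional part — a simplification of whatever the generators actually implement:

```python
# Illustrative sketch: expands "(a|b)" alternatives and "[x]" optionals into
# all concrete utterances, normalizing whitespace and deduplicating.
import re

def expand(template):
    """Return every concrete utterance a template can produce."""
    m = re.search(r"\(([^()]*)\)|\[([^\[\]]*)\]", template)
    if m is None:
        return [" ".join(template.split())]  # base case: clean whitespace
    head, tail = template[:m.start()], template[m.end():]
    if m.group(1) is not None:
        options = m.group(1).split("|")      # (a|b): pick one alternative
    else:
        options = [m.group(2), ""]           # [x]: present or absent
    results = []
    for opt in options:
        results.extend(expand(head + opt + tail))
    return list(dict.fromkeys(results))      # dedupe, preserving order
```

For example, `expand("turn (on|off) [the] light")` produces `["turn on the light", "turn on light", "turn off the light", "turn off light"]` — expanding before cleaning is what lets the dedupe step catch utterances that only differed in optional whitespace.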