
Add zh/ja AI-lexicon files for stylometry overlap detection #104


Context

Surfaced from a deep-research audit of patina's evaluator paths. Two of the audit's P0s have already shipped (#96, #103); this is the remaining one, deferred because it requires native-language curation rather than a pure code change.

Problem

.patina.default.yaml currently restricts both stylometry and AI-lexicon overlap to en/ko:

```yaml
stylometry:
  languages: [ko, en]
lexicon:
  languages: [en, ko]   # zh/ja deferred (no curated lexicon yet)
```

Pattern packs already cover all four languages (ko/en/zh/ja, 6 packs each, ~28 patterns each). But for zh/ja runs:

  • lexicon/ai-zh.md — does not exist
  • lexicon/ai-ja.md — does not exist
  • Stylometry burstiness/MATTR thresholds are ko/en-calibrated; zh/ja word segmentation may need different bands

This is the asymmetry called out in the audit:

> The pattern catalog supports four languages, but the README's stylometric/AI-lexicon description lists only ~108 entries for EN and ~102 for KO. In other words, even though zh/ja have pattern support, their stylometry/lexicon calibration is likely weak.

In rewrite/ouroboros runs targeting zh or ja text, this means the LLM does most of the detection work alone — there's no statistical floor signal from lexicon overlap to back it up.
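
To make the gap concrete, here is a minimal sketch of that missing floor signal (illustrative only; `lexiconOverlap` and the loading path are assumptions, not patina's actual API):

```ts
// Minimal sketch of the "statistical floor" the LLM verdict would lean on:
// count curated AI-tell phrase hits per paragraph. `lexiconOverlap` is an
// illustration, not patina's actual scoring function.
function lexiconOverlap(paragraph: string, phrases: string[]): string[] {
  return phrases.filter((phrase) => paragraph.includes(phrase));
}

// Until lexicon/ai-zh.md exists, the zh phrase list is empty, so this signal
// is constantly zero and can neither corroborate nor contradict the LLM.
const zhPhrases: string[] = []; // would be loaded from lexicon/ai-zh.md
console.log(lexiconOverlap("总而言之，这项技术至关重要。", zhPhrases)); // [] today
```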

Scope

Lexicon files (this issue)

  • lexicon/ai-zh.md — 50–100 high-precision AI-tell phrases in Mandarin (e.g., 总而言之, 综上所述, 在数字时代, 让我们一起, 不仅...而且, 至关重要的是, etc.)
  • lexicon/ai-ja.md — 50–100 high-precision AI-tell phrases in Japanese (e.g., まとめると, 結論として, 〜することが重要です, 〜と言えるでしょう, デジタル時代において, etc.)
  • .patina.default.yaml — flip lexicon.languages to [en, ko, zh, ja] after files exist
  • Optional: lexicon/README.md curation guide (entry format, severity, profile context; a strawman entry shape follows this list)
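
The entry format is not pinned down by this issue; as a strawman for that curation guide, a parsed entry might carry roughly these fields (all names are placeholders, not an existing patina schema):

```ts
// Hypothetical parsed shape of one lexicon/ai-zh.md entry; every field
// name here is a placeholder, not a committed format.
interface LexiconEntry {
  phrase: string;            // e.g. "总而言之" or "結論として"
  severity: "low" | "medium" | "high";
  example: string;           // AI-typical usage the phrase should catch
  counterexample: string;    // legitimate human usage it must tolerate
  profiles?: string[];       // optional profile context for the entry
}
```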

Stylometry calibration (follow-up issue, not this one)

  • Decide if zh/ja need different burstiness/MATTR thresholds (CJK tokenization affects sentence-length CV and the TTR window; see the sketch after this list)
  • Calibration corpus for zh/ja (HC3 has Chinese pairs; ja needs sourcing)
  • Flip stylometry.languages only after thresholds are validated
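
A rough sketch of why the bands may not transfer (not patina's implementation; the 50-token MATTR window is an assumption):

```ts
// Word counts depend entirely on the segmenter, and zh/ja have no
// whitespace word boundaries, so ko/en-calibrated MATTR bands may shift.
function wordTokens(text: string, locale: string): string[] {
  const seg = new Intl.Segmenter(locale, { granularity: "word" });
  return [...seg.segment(text)]
    .filter((s) => s.isWordLike)
    .map((s) => s.segment);
}

// Moving-average type-token ratio (MATTR) over a fixed window.
function mattr(words: string[], window = 50): number {
  if (words.length === 0) return 0;
  if (words.length <= window) return new Set(words).size / words.length;
  let sum = 0;
  for (let i = 0; i + window <= words.length; i++) {
    sum += new Set(words.slice(i, i + window)).size / window;
  }
  return sum / (words.length - window + 1);
}

// Naive whitespace splitting sees one "word" here; Intl.Segmenter sees many.
// Burstiness (sentence-length CV) shifts the same way once sentence lengths
// are measured in segmenter tokens instead of whitespace tokens.
const ja = "結論として、デジタル時代においてこれは重要だと言えるでしょう。";
console.log(wordTokens(ja, "ja").length, ja.split(/\s+/).filter(Boolean).length);
```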

Acceptance criteria

For each language:

  • ≥ 50 lexicon entries, each with example + counterexample
  • High precision over recall: Wikipedia FP rate stays under 25% (the same bound enforced for ko/en)
  • Calibrated against ≥ 200 paragraphs (HC3 zh / curated ja AI-vs-human pairs)
  • Documented in README's stylometry section

Why this needs a human

LLM-drafted lexicons fail in two predictable ways:

  • Too-common words → every Wikipedia article gets flagged as AI
  • Too-narrow phrases → no detection lift beyond existing patterns

The Korean lexicon (90 entries) and the English lexicon (108 entries) were hand-curated by the maintainer against a 400-paragraph calibration corpus; zh/ja deserve the same bar.

Suggested approach

  1. Draft 30 high-confidence starter entries per language (LLM-assisted, manually filtered)
  2. Run against HC3-zh sample + author-collected ja AI/human pairs
  3. Iterate based on FP rate, then expand to 50–100 (see the sketch after this list)
  4. Land lexicon files first, flip config flag in a follow-up PR after corpus validation
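
A rough sketch of steps 2–3 (all names and the per-phrase cap are assumptions; only the 25% Wikipedia FP bound comes from the acceptance criteria above):

```ts
// Illustrative calibration loop, not patina tooling.
interface LabeledParagraph {
  text: string;
  aiGenerated: boolean; // true for HC3-zh / collected ja AI halves
}

// Share of human paragraphs that accumulate at least `minHits` phrase hits.
function falsePositiveRate(
  corpus: LabeledParagraph[],
  phrases: string[],
  minHits = 1,
): number {
  const human = corpus.filter((p) => !p.aiGenerated);
  const flagged = human.filter(
    (p) => phrases.filter((ph) => p.text.includes(ph)).length >= minHits,
  );
  return human.length === 0 ? 0 : flagged.length / human.length;
}

// Step 3: drop any single phrase that flags too much human text, re-measure,
// and only then expand the list toward 50-100 entries.
function pruneCommonPhrases(
  corpus: LabeledParagraph[],
  phrases: string[],
  perPhraseCap = 0.05, // assumption; the issue only fixes the aggregate 25% bound
): string[] {
  return phrases.filter((ph) => falsePositiveRate(corpus, [ph]) <= perPhraseCap);
}
```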

Out of scope

  • ONNX-based detector (defer; the audit's L-difficulty item, lower ROI)
  • Multilingual Intl.Segmenter integration (handled separately when threshold calibration starts)
  • Profile-specific zh/ja tone overrides
