Add Japanese language analysis support #13

Merged
lemur47 merged 1 commit into main from
feature/japanese-language-support
Feb 16, 2026

lemur47 commented Feb 16, 2026

Summary

  • Marker registry (marker_registry.py) — new MarkerSet frozen dataclass and get_markers(lang) dispatch function with lazy loading and caching, making the core library language-aware and extensible
  • Japanese markers (markers_ja.py) — all 12 marker categories culturally adapted for Japanese spiritual contexts (スピリチュアル, 霊感商法, カルト, 秘密結社, etc.)
  • Multi-language pipeline — lang parameter ("en" | "ja", default "en") flows through tech_analysis(), hybrid_score(), CLI (--lang), and API (lang field on POST /analyse)
  • NLP model cache — replaced singleton _nlp with _nlp_cache dict for per-language lazy loading (en_core_web_sm / ja_core_news_sm)
  • CI — ja_core_news_sm model added to test matrix
  • Strategy document — docs/STRATEGY.md articulating mission, SI definition, approach, audience, and roadmap
  • Zero regressions — all existing English behaviour unchanged (default lang="en"); markers.py, output.py, test_markers.py, and test_output.py untouched
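
For reviewers who want a mental model of the registry, the following is a minimal sketch of a frozen dataclass plus a cached, lazily loading dispatch function. It is not the code in marker_registry.py: the MarkerSet fields, the loader functions, the category name, and the error handling are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MarkerSet:
    """Immutable bundle of markers for one language (field layout is hypothetical)."""

    lang: str
    categories: dict[str, frozenset[str]]


def _load_en() -> MarkerSet:
    # Hypothetical stand-in for importing the English definitions on demand.
    return MarkerSet("en", {"example_category": frozenset({"secret society"})})


def _load_ja() -> MarkerSet:
    # Hypothetical stand-in for importing the Japanese definitions on demand.
    return MarkerSet("ja", {"example_category": frozenset({"秘密結社", "カルト"})})


# One loader per supported language; nothing is built until first requested.
_LOADERS: dict[str, Callable[[], MarkerSet]] = {"en": _load_en, "ja": _load_ja}
_cache: dict[str, MarkerSet] = {}


def get_markers(lang: str = "en") -> MarkerSet:
    """Return the MarkerSet for lang, building it once and reusing it afterwards."""
    if lang not in _LOADERS:
        raise ValueError(f"Unsupported language: {lang!r}")
    if lang not in _cache:
        _cache[lang] = _LOADERS[lang]()
    return _cache[lang]
```

Under these assumptions, get_markers("ja") returns the same object on every call, get_markers() falls back to English, and an unknown code such as "fr" raises ValueError instead of silently using English markers.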

New files (6)

  • src/si_protocols/marker_registry.py
  • src/si_protocols/markers_ja.py
  • tests/test_marker_registry.py
  • tests/test_markers_ja.py
  • examples/synthetic_suspicious_ja.txt
  • docs/STRATEGY.md

Modified files (7)

  • src/si_protocols/threat_filter.py — multi-model loading, lang param, marker dispatch
  • app/schemas.py — lang field on AnalyseRequest
  • app/main.py — pass lang through
  • tests/test_threat_filter.py — Japanese fixture texts and test classes
  • tests/test_api.py — Japanese API tests
  • .github/workflows/ci.yml — install ja_core_news_sm
  • CLAUDE.md — updated architecture and dev commands
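
The multi-model loading in threat_filter.py can be pictured roughly as below. This is a sketch under stated assumptions, not the merged code: the function name get_nlp, the MODEL_NAMES mapping, and the injectable loader parameter are hypothetical. Only the _nlp_cache name and the two spaCy model names come from the summary above.

```python
from __future__ import annotations

from typing import Any, Callable

# Language code -> spaCy model name (model names as listed in this PR).
MODEL_NAMES: dict[str, str] = {"en": "en_core_web_sm", "ja": "ja_core_news_sm"}

# Per-language cache replacing a single module-level _nlp singleton.
_nlp_cache: dict[str, Any] = {}


def _default_loader(model_name: str) -> Any:
    # spaCy is imported lazily so importing this module stays cheap.
    import spacy

    return spacy.load(model_name)


def get_nlp(lang: str = "en", loader: Callable[[str], Any] = _default_loader) -> Any:
    """Return the pipeline for lang, loading it only on first use.

    The loader argument is a hypothetical seam that lets tests avoid
    downloading real models.
    """
    if lang not in MODEL_NAMES:
        raise ValueError(f"Unsupported language: {lang!r}")
    if lang not in _nlp_cache:
        _nlp_cache[lang] = loader(MODEL_NAMES[lang])
    return _nlp_cache[lang]
```

With this shape, a process that only ever analyses English text never loads ja_core_news_sm, which is one way to keep the default lang="en" path unchanged.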

Test plan

  • 201 tests pass (up from 146), 99% coverage
  • Ruff lint clean
  • Ruff format clean
  • Pyright 0 errors
  • Bandit security scan clean
  • All pre-commit hooks pass
  • CI passes on Python 3.12 + 3.13
  • Review Japanese markers for domain accuracy

🤖 Generated with Claude Code

Introduce multi-language architecture with a marker registry abstraction,
Japanese marker definitions (スピリチュアル, 霊感商法, カルト, etc.),
and a `lang` parameter ("en" | "ja") flowing through the core library,
CLI (--lang), and REST API. All existing English behaviour is preserved
(default lang="en"). Includes Japanese NLP pipeline (ja_core_news_sm),
201 tests at 99% coverage, CI updates, and a strategy document.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cloudflare-workers-and-pages

Deploying si-protocols with Cloudflare Pages

Latest commit: b876a73
Status: ✅  Deploy successful!
Preview URL: https://5cbdd887.si-protocols.pages.dev
Branch Preview URL: https://feature-japanese-language-su.si-protocols.pages.dev

lemur47 commented Feb 16, 2026

I will be updating markers_ja.py later on.

lemur47 merged commit 3d4575a into main on Feb 16, 2026
8 checks passed