Current file: app/src/lib/tools/grammar-checker.ts
Current model: llama-3.3-70b
Current approach: Single prompt asking for corrections, changes list, tone adjustments, and tips. No programmatic grammar checking, no diff generation, no readability scoring.
Problems with current approach:
- No way to verify that corrections are actually correct (LLM may introduce new errors).
- The "changes made" list may not match the actual corrected version (inconsistency between sections).
- No quantitative readability metrics.
- Tone adjustment is subjective and not measured.
Upgrade plan:
| Step | Agent | Action |
|---|---|---|
| 1 | Pre-Analysis | Programmatic: Compute readability scores (Flesch-Kincaid, Gunning Fog). Count sentences, words, syllables. Detect language. Run language_tool_python for baseline grammar checks. |
| 2 | Correction Agent | Receive the original text, programmatic grammar findings, and target tone. Generate the corrected version with tone adjustments. |
| 3 | Diff Generator | Programmatic: Compute a word-level diff between original and corrected text. Auto-generate an accurate "changes made" list from the diff, not from LLM memory. |
| 4 | Post-Analysis | Programmatic: Re-compute readability scores on the corrected text. Show before/after comparison. |
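The readability scoring in steps 1 and 4 could look roughly like the sketch below. It uses the standard Flesch-Kincaid Grade Level and Gunning Fog formulas, but the syllable counter is a simple vowel-group heuristic and the "complex word" test is simplified to "three or more syllables". The names `ReadabilityReport`, `countSyllables`, and `computeReadability` are illustrative assumptions, not part of the existing file.

```typescript
// Sketch only: names and heuristics here are illustrative assumptions.
interface ReadabilityReport {
  sentences: number;
  words: number;
  syllables: number;
  fleschKincaidGrade: number;
  gunningFog: number;
}

// Heuristic syllable count: number of vowel groups, minus a silent trailing "e".
function countSyllables(word: string): number {
  const w = word.toLowerCase().replace(/[^a-z]/g, "");
  if (w.length === 0) return 0;
  const groups = w.match(/[aeiouy]+/g);
  let count = groups ? groups.length : 0;
  // A trailing "e" is usually silent ("make"), but not in "-le" endings ("table").
  if (count > 1 && /e$/.test(w) && !/le$/.test(w)) count -= 1;
  return Math.max(count, 1);
}

function computeReadability(text: string): ReadabilityReport {
  // Sentences: non-empty chunks between runs of terminal punctuation.
  const sentences = text.split(/[.!?]+/).filter((s) => s.trim().length > 0).length;
  // Words: alphabetic runs, allowing inner apostrophes ("don't").
  const wordList: string[] = text.match(/[A-Za-z]+(?:'[A-Za-z]+)*/g) || [];
  const words = wordList.length;

  let syllables = 0;
  let complexWords = 0; // simplified: any word with 3+ syllables
  for (const word of wordList) {
    const n = countSyllables(word);
    syllables += n;
    if (n >= 3) complexWords += 1;
  }

  // Avoid division by zero on empty input.
  if (sentences === 0 || words === 0) {
    return { sentences, words, syllables, fleschKincaidGrade: 0, gunningFog: 0 };
  }

  const wordsPerSentence = words / sentences;
  return {
    sentences,
    words,
    syllables,
    fleschKincaidGrade: 0.39 * wordsPerSentence + 11.8 * (syllables / words) - 15.59,
    gunningFog: 0.4 * (wordsPerSentence + 100 * (complexWords / words)),
  };
}
```

Calling `computeReadability` once on the original text and once on the corrected text gives the two reports needed for the before/after comparison. The heuristic syllable count will be off for some words, so scores should be treated as approximate.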
- The plan above is a reference layout, not a fixed design. You are free to extend or restructure the agent stack where it helps.
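Step 3 (Diff Generator) can be sketched as a word-level diff based on the longest common subsequence, with adjacent deletions and insertions grouped into single "changes made" entries. This is a minimal sketch under assumptions: tokens are whitespace-separated, and `DiffOp`, `diffWords`, and `summarizeChanges` are hypothetical names. A production version might use an existing diff library instead.

```typescript
// Sketch only: hypothetical names, whitespace tokenization.
type DiffOp = { kind: "equal" | "delete" | "insert"; word: string };

function diffWords(original: string, corrected: string): DiffOp[] {
  const a = original.split(/\s+/).filter((w) => w.length > 0);
  const b = corrected.split(/\s+/).filter((w) => w.length > 0);

  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..].
  const lcs: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    const row: number[] = [];
    for (let j = 0; j <= b.length; j++) row.push(0);
    lcs.push(row);
  }
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j] ? lcs[i + 1][j + 1] + 1 : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table from the start, emitting one operation per word.
  const ops: DiffOp[] = [];
  let x = 0;
  let y = 0;
  while (x < a.length && y < b.length) {
    if (a[x] === b[y]) {
      ops.push({ kind: "equal", word: a[x] });
      x++;
      y++;
    } else if (lcs[x + 1][y] >= lcs[x][y + 1]) {
      ops.push({ kind: "delete", word: a[x] });
      x++;
    } else {
      ops.push({ kind: "insert", word: b[y] });
      y++;
    }
  }
  while (x < a.length) ops.push({ kind: "delete", word: a[x++] });
  while (y < b.length) ops.push({ kind: "insert", word: b[y++] });
  return ops;
}

// Group each run of non-equal operations into one human-readable change.
function summarizeChanges(ops: DiffOp[]): string[] {
  const changes: string[] = [];
  let removed: string[] = [];
  let added: string[] = [];
  const flush = (): void => {
    if (removed.length > 0 && added.length > 0) {
      changes.push(`Replaced "${removed.join(" ")}" with "${added.join(" ")}"`);
    } else if (removed.length > 0) {
      changes.push(`Removed "${removed.join(" ")}"`);
    } else if (added.length > 0) {
      changes.push(`Added "${added.join(" ")}"`);
    }
    removed = [];
    added = [];
  };
  for (const op of ops) {
    if (op.kind === "equal") flush();
    else if (op.kind === "delete") removed.push(op.word);
    else added.push(op.word);
  }
  flush();
  return changes;
}
```

Because the list is derived only from the two texts, it cannot mention a change that did not happen. For example, diffing "She go to school yesterday" against "She went to school yesterday" yields the single entry `Replaced "go" with "went"`.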
Model suggestions to start with:
- Step 2: Try llama-3.3-70b (current model, good at natural language). Also try qwen-3-32b or gpt-oss-120b. For formal/technical writing, try deepseek-r1-0528.
Model selection guidance:
- You are free to pick any model from the Oxlo catalog based on your own testing and evaluation.
- The model suggestions above are starting points, not mandates. Try them first, and if they do not meet the accuracy target, experiment with alternatives.
Compare against: GPT 5.3 Thinking (strong at grammar and tone).
Acceptance criteria:
- Changes list must be auto-generated from actual diff (no hallucinated changes).
- Readability scores are computed programmatically (Flesch-Kincaid included).
- Corrected text does not introduce new grammatical errors (verified by re-running grammar checker).
- Overall quality matches or exceeds GPT 5.3 Thinking on 20 test cases.
- Overall accuracy is at least 80%.
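The "no new grammatical errors" criterion can be checked by running the same grammar checker on both texts and reporting any issue in the corrected text that was not already in the original. The sketch below assumes a hypothetical `GrammarIssue` shape and a synchronous `GrammarChecker` function. A real checker (for example a LanguageTool call) would likely be asynchronous and return richer data, and `stubCheck` is only a stand-in for demonstration.

```typescript
// Sketch only: GrammarIssue and GrammarChecker are assumed shapes, not a real API.
interface GrammarIssue {
  ruleId: string;
  matchedText: string;
}
type GrammarChecker = (text: string) => GrammarIssue[];

// Returns issues found in `corrected` that have no counterpart in `original`.
function findNewIssues(
  original: string,
  corrected: string,
  check: GrammarChecker
): GrammarIssue[] {
  // Count baseline issues by (rule, matched text) so duplicates are handled.
  const baseline: { [key: string]: number } = {};
  for (const issue of check(original)) {
    const key = issue.ruleId + "\u0000" + issue.matchedText;
    baseline[key] = (baseline[key] || 0) + 1;
  }
  const fresh: GrammarIssue[] = [];
  for (const issue of check(corrected)) {
    const key = issue.ruleId + "\u0000" + issue.matchedText;
    if (baseline[key] > 0) {
      baseline[key] -= 1; // pre-existing issue, not introduced by the correction
    } else {
      fresh.push(issue);
    }
  }
  return fresh;
}

// Stand-in checker for demonstration: flags two hard-coded mistakes.
const stubCheck: GrammarChecker = (text) => {
  const issues: GrammarIssue[] = [];
  if (text.indexOf("teh") !== -1) issues.push({ ruleId: "TYPO", matchedText: "teh" });
  if (text.indexOf("a apple") !== -1) issues.push({ ruleId: "A_AN", matchedText: "a apple" });
  return issues;
};
```

A test case passes this criterion when `findNewIssues(original, corrected, check)` returns an empty array. Matching issues by rule and matched text is a design choice for this sketch, since character offsets shift once the text is edited.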