Current file: app/src/lib/tools/bug-replayer.ts
Current model: deepseek-r1-0528
Current approach: Single prompt asking for reproduction steps, API calls, test script, expected vs actual, debugging checklist, and possible root causes. No structured bug analysis, no code validation, no environment-specific handling.
Problems with current approach:
- Generated cURL commands and test scripts may have syntax errors.
- Reproduction steps are generic and may not match the described environment.
- Test scripts reference libraries or APIs without verifying correctness.
- Root cause ranking is not informed by the error pattern; it is a generic list.
- No validation that the generated scripts are runnable.
Upgrade plan:
| Step | Agent | Action |
|------|-------|--------|
| 1 | Bug Classifier | Analyze the bug description and error output. Classify: bug type (API error, UI glitch, data corruption, race condition, auth failure), affected layer (frontend, backend, database, network), severity. |
| 2 | Environment Analyzer | Parse the environment string to identify: language, framework, runtime version, hosting platform. Use this to generate environment-specific reproduction code. |
| 3 | Reproduction Generator | Generate: numbered reproduction steps, cURL/fetch commands tailored to the environment, and a test script using the correct testing framework for the detected language. |
| 4 | Script Validator | Programmatic: Syntax-check the generated test script (e.g., parse Python with ast.parse, parse JS/TS with a lightweight parser). Validate cURL command syntax. |
| 5 | Root Cause Ranker | Using the classified bug type and error patterns, generate a ranked list of root causes with specific evidence from the error output. |
| 6 | Refinement Agent | If syntax validation fails on the test script, feed errors back and regenerate. Max 2 retries. |
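Step 4 could be sketched as follows for the JavaScript and cURL cases. This is a minimal sketch under stated assumptions, not the tool's implementation: `checkJsSyntax`, `tokenize`, and `checkCurl` are hypothetical names, the JavaScript check relies on the built-in `Function` constructor (which compiles without executing but rejects `import`/`export`, top-level `await`, and TypeScript syntax), and the cURL check is a heuristic rather than a full shell parser.

```typescript
// Hypothetical helpers for Step 4; none of these names exist in bug-replayer.ts.
type Check = { ok: boolean; errors: string[] };

// Compiles the source as a function body without executing it.
// Assumption: plain-script JavaScript only; module or TypeScript syntax
// needs a real parser instead.
function checkJsSyntax(source: string): Check {
  try {
    new Function(source);
    return { ok: true, errors: [] };
  } catch (err) {
    return { ok: false, errors: [err instanceof Error ? err.message : String(err)] };
  }
}

// Splits a shell-like command on whitespace, honouring single and double
// quotes. Returns null when a quote is left open. Backslash escapes inside
// quotes are not interpreted.
function tokenize(command: string): string[] | null {
  const tokens: string[] = [];
  let current = "";
  let quote: string | null = null;
  let inToken = false;
  for (let i = 0; i < command.length; i++) {
    const ch = command.charAt(i);
    if (quote !== null) {
      if (ch === quote) quote = null;
      else current += ch;
    } else if (ch === '"' || ch === "'") {
      quote = ch;
      inToken = true;
    } else if (/\s/.test(ch)) {
      if (inToken) {
        tokens.push(current);
        current = "";
        inToken = false;
      }
    } else {
      current += ch;
      inToken = true;
    }
  }
  if (quote !== null) return null;
  if (inToken) tokens.push(current);
  return tokens;
}

const DATA_FLAGS = ["-d", "--data", "--data-raw", "--data-binary"];

// Heuristic cURL check: command name, balanced quotes, an http(s) URL,
// and the HTTP method the scenario expects.
function checkCurl(command: string, expectedMethod: string): Check {
  // Join backslash-newline continuations before tokenizing.
  const tokens = tokenize(command.replace(/\\\r?\n/g, " "));
  if (tokens === null) return { ok: false, errors: ["unbalanced quote"] };

  const errors: string[] = [];
  if (tokens[0] !== "curl") errors.push("command does not start with curl");
  if (!tokens.some((t) => /^https?:\/\//.test(t))) errors.push("no http(s) URL found");

  let flagAt = tokens.indexOf("-X");
  if (flagAt < 0) flagAt = tokens.indexOf("--request");
  // Without -X, curl sends GET, or POST when a data flag is present.
  const implied = tokens.some((t) => DATA_FLAGS.indexOf(t) >= 0) ? "POST" : "GET";
  const method = (flagAt >= 0 ? tokens[flagAt + 1] ?? "" : implied).toUpperCase();
  if (method !== expectedMethod.toUpperCase()) {
    errors.push(`expected method ${expectedMethod}, found ${method || "none"}`);
  }
  return { ok: errors.length === 0, errors };
}
```

The error strings are kept as plain text so they can be passed straight back to the generator in Step 6. Python scripts would need a separate check, for example by calling out to `ast.parse` as the table suggests.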
- The agent stack above is a reference layout, not a fixed design. You are free to add, split, or extend agents where that improves results.
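The validate-and-regenerate loop described in Steps 4 and 6 might look like the sketch below. `generate` stands in for the model call and `validate` for the syntax checker. Both are injected so the loop can be exercised without a model, and all type and function names here are hypothetical rather than taken from the existing file.

```typescript
// Hypothetical types for Steps 4 and 6; not taken from bug-replayer.ts.
type Validation = { ok: true } | { ok: false; errors: string[] };
type Generate = (previousErrors: string[]) => Promise<string>;
type Validate = (script: string) => Validation;

interface RefinementResult {
  script: string;
  attempts: number; // total generations, including the first
  valid: boolean;
}

const MAX_RETRIES = 2; // "Max 2 retries" from Step 6

async function generateWithRefinement(
  generate: Generate,
  validate: Validate,
): Promise<RefinementResult> {
  let errors: string[] = [];
  let script = "";
  // One initial attempt plus up to MAX_RETRIES regenerations.
  for (let attempt = 1; attempt <= MAX_RETRIES + 1; attempt++) {
    script = await generate(errors);
    const result = validate(script);
    if (result.ok) return { script, attempts: attempt, valid: true };
    errors = result.errors; // fed back into the next generation
  }
  // Retries exhausted: return the last script, flagged as invalid.
  return { script, attempts: MAX_RETRIES + 1, valid: false };
}
```

Returning the last script with `valid: false` instead of throwing lets the caller decide whether to show an unvalidated script with a warning or omit it; which of those the tool should do is not specified in the plan.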
Model suggestions to start with:
- Step 1: Try deepseek-v3.2 or llama-4-maverick-17b for classification (lightweight task).
- Steps 3 and 6: Try qwen-3-coder-30b for code generation. Also try kimi-k2.6 (agentic coding, can generate environment-aware scripts).
- Step 5: Try deepseek-r1-0528 for root cause reasoning. Also try kimi-k2-thinking for deep analysis.
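One way to keep these suggestions easy to swap during evaluation is a per-step candidate table, sketched below. The step keys, `MODEL_CANDIDATES`, and `pickModel` are hypothetical names; only the model identifiers come from the list above, and steps without a suggestion are left out.

```typescript
// Hypothetical config shape; only the model identifiers come from the plan.
type PipelineStep =
  | "classifier" // Step 1
  | "reproductionGenerator" // Step 3
  | "rootCauseRanker" // Step 5
  | "refinement"; // Step 6

// First entry is the suggested starting point, the rest are alternatives.
const MODEL_CANDIDATES: Record<PipelineStep, readonly [string, ...string[]]> = {
  classifier: ["deepseek-v3.2", "llama-4-maverick-17b"],
  reproductionGenerator: ["qwen-3-coder-30b", "kimi-k2.6"],
  rootCauseRanker: ["deepseek-r1-0528", "kimi-k2-thinking"],
  refinement: ["qwen-3-coder-30b", "kimi-k2.6"],
};

// Returns the override for a step if one is given, else the first candidate.
function pickModel(
  step: PipelineStep,
  overrides: Partial<Record<PipelineStep, string>> = {},
): string {
  return overrides[step] ?? MODEL_CANDIDATES[step][0];
}
```

For example, `pickModel("rootCauseRanker", { rootCauseRanker: "kimi-k2-thinking" })` selects the alternative for a single evaluation run without editing the table.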
Model selection guidance:
- You are free to pick any model from the Oxlo catalog based on your own testing and evaluation.
- The model suggestions above are starting points, not mandates. Try them first, and if they do not meet the accuracy target, experiment with alternatives.
Compare against: GPT 5.3 Thinking & Claude Sonnet 4.6 Thinking.
Acceptance criteria:
- Generated test scripts pass syntax validation in 90%+ of cases.
- cURL commands are syntactically valid and use the correct HTTP method and headers for the described scenario.
- Environment-specific details (import paths, framework syntax) match the declared environment.
- Root cause ranking references specific evidence from the error output, not generic guesses.
- Overall quality matches or exceeds GPT 5.3 Thinking and Claude Sonnet 4.6 Thinking on the test cases.
- Overall accuracy at 80%+.
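The two numeric targets above (90%+ syntax validity, 80%+ accuracy) could be checked over a test set roughly as follows. This is a sketch under assumptions: `CaseResult`, `evaluate`, and the `judgedCorrect` flag are hypothetical, and how a case is judged correct is not defined in this plan.

```typescript
// Hypothetical per-test-case record; how `judgedCorrect` is decided
// (human review, rubric, comparison run) is an assumption left open here.
interface CaseResult {
  scriptSyntaxValid: boolean; // outcome of the Step 4 syntax check
  judgedCorrect: boolean;
}

const SYNTAX_PASS_TARGET = 0.9; // "90%+ of cases"
const ACCURACY_TARGET = 0.8; // "Overall accuracy at 80%+"

interface Evaluation {
  syntaxPassRate: number;
  accuracy: number;
  passes: boolean;
}

function evaluate(results: CaseResult[]): Evaluation {
  if (results.length === 0) throw new Error("no test cases to evaluate");
  const total = results.length;
  const syntaxPassRate = results.filter((r) => r.scriptSyntaxValid).length / total;
  const accuracy = results.filter((r) => r.judgedCorrect).length / total;
  return {
    syntaxPassRate,
    accuracy,
    passes: syntaxPassRate >= SYNTAX_PASS_TARGET && accuracy >= ACCURACY_TARGET,
  };
}
```

The remaining criteria (correct HTTP method and headers, environment match, evidence-based ranking) are qualitative and would need their own per-case checks before they could be folded into a single pass/fail result.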