Unstructured Text Formatter #25

@ms-shashank

Description

Current file: app/src/lib/tools/text-formatter.ts
Current model: llama-3.3-70b
Current approach: Single prompt asking LLM to reformat messy text. No structural analysis, no format validation, no data loss detection.

Problems with current approach:

  • LLM may silently drop rows or columns when reformatting.
  • Inferred headers may be incorrect.
  • Output format (CSV, JSON, YAML) may not parse correctly.
  • No verification that all input data is preserved in output.

Upgrade plan:

| Step | Agent | Action |
|------|-------|--------|
| 1 | Structure Detector | Programmatic: Analyze the input text to detect delimiter patterns (tabs, pipes, commas, fixed-width). Count potential rows and columns. |
| 2 | Parsing Agent | Using the detected structure hints, parse the messy text into a structured table. Infer column headers and data types. |
| 3 | Format Compiler | Programmatic: Convert the parsed data into the requested output format using Python serializers (csv module, json.dumps, yaml.dump). |
| 4 | Integrity Checker | Programmatic: Verify that the row count matches between input and output. Check that no data values were dropped. If discrepancies are found, flag them. |

  • The layout above is for reference only. You are free to extend or restructure the agent stack as needed.
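
As an illustration of Step 1, here is a minimal sketch of a programmatic structure detector. The function name `detect_structure`, the `consistency` score, and the "two or more spaces" heuristic for fixed-width text are assumptions made for this example, not part of the existing code in text-formatter.ts.

```python
import re
from collections import Counter

# Candidate delimiters named in Step 1; fixed-width is handled as a fallback.
CANDIDATES = {"tab": "\t", "pipe": "|", "comma": ","}


def detect_structure(text: str) -> dict:
    """Guess the delimiter and count potential rows and columns (hypothetical helper)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    best = {"delimiter": None, "rows": len(lines), "columns": 1, "consistency": 0.0}

    for name, delim in CANDIDATES.items():
        counts = [ln.count(delim) for ln in lines]
        if not counts or max(counts) == 0:
            continue
        # The most frequent per-line delimiter count suggests the column count.
        common, freq = Counter(counts).most_common(1)[0]
        if common == 0:
            continue
        consistency = freq / len(lines)
        if consistency > best["consistency"]:
            best.update(delimiter=name, columns=common + 1, consistency=consistency)

    if best["delimiter"] is None and lines:
        # Assumed heuristic: runs of two or more spaces separate fixed-width columns.
        widths = [len(re.split(r" {2,}", ln.strip())) for ln in lines]
        common, freq = Counter(widths).most_common(1)[0]
        if common > 1:
            best.update(delimiter="fixed-width", columns=common,
                        consistency=freq / len(lines))
    return best


hints = detect_structure("name|age|city\nAda|36|London\nLin|29|Oslo\n")
print(hints["delimiter"], hints["rows"], hints["columns"])
```

The resulting hints could be passed to the Parsing Agent in Step 2. This sketch does not handle quoted delimiters or rows wrapped in leading and trailing pipes, so a low `consistency` value is best treated as a signal that the LLM needs to resolve the ambiguity.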

Model suggestions to start with:

  • Step 2: Try llama-3.3-70b or qwen-3-32b for text parsing. Also try deepseek-v3.2 for structured extraction.
  • Since most of the work is programmatic (Steps 1, 3, 4), the LLM mainly helps with ambiguous parsing decisions.
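
To show what "programmatic" could mean for Step 3, the sketch below serializes already-parsed rows with the serializers the plan names. The function name `compile_output` and its signature are hypothetical. The YAML branch assumes the third-party PyYAML package is installed and is imported lazily so the CSV and JSON paths work without it.

```python
import csv
import io
import json


def compile_output(headers: list[str], rows: list[list[str]], fmt: str) -> str:
    """Serialize parsed rows with standard serializers instead of an LLM (sketch)."""
    for i, row in enumerate(rows):
        if len(row) != len(headers):
            raise ValueError(f"row {i} has {len(row)} cells, expected {len(headers)}")

    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(headers)
        writer.writerows(rows)  # the csv module quotes cells that contain commas
        return buf.getvalue()

    records = [dict(zip(headers, row)) for row in rows]
    if fmt == "json":
        return json.dumps(records, indent=2, ensure_ascii=False)
    if fmt == "yaml":
        import yaml  # PyYAML (third-party); assumed to be available
        return yaml.dump(records, sort_keys=False, allow_unicode=True)
    raise ValueError(f"unsupported format: {fmt}")


print(compile_output(["name", "note"], [["Ada", "a,b"]], "csv"))
```

Because the serializer, not the model, produces the final text, escaping and quoting follow the library's rules, which is what makes the "parseable in 100% of cases" criterion realistic.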

Model Selection Guidance

  • You are free to pick any model from the Oxlo catalog based on your own testing and evaluation.
  • The model suggestions above are starting points, not mandates. Try them first, and if they do not meet the accuracy target, experiment with alternatives.

Compare against: Claude Sonnet 4.6 Thinking (strong at structured text extraction).

Acceptance criteria:

  • Zero data loss: every value in the input must appear in the output (programmatically verified).
  • Output format must be parseable in 100% of cases.
  • Row count preservation verified programmatically.
  • Overall accuracy of 80% or higher.
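
The first three criteria can be checked without a model. The sketch below is one possible shape for the Step 4 Integrity Checker, assuming the input has already been split into rows of string cells (without the header row) and that the output is CSV or JSON. The names `parse_output` and `check_integrity` and the result dictionary are hypothetical.

```python
import csv
import io
import json
from collections import Counter


def parse_output(output: str, fmt: str) -> list[list[str]]:
    """Parse serialized output back into data rows; raises if it does not parse."""
    if fmt == "csv":
        return list(csv.reader(io.StringIO(output)))[1:]  # skip the header row
    if fmt == "json":
        data = json.loads(output)
        if not isinstance(data, list) or not all(isinstance(r, dict) for r in data):
            raise ValueError("expected a JSON array of objects")
        return [[str(v) for v in record.values()] for record in data]
    raise ValueError(f"unsupported format: {fmt}")


def check_integrity(input_rows: list[list[str]], output: str, fmt: str) -> dict:
    """Compare row counts and the multiset of cell values between input and output."""
    try:
        output_rows = parse_output(output, fmt)
    except (ValueError, csv.Error) as exc:
        return {"parseable": False, "error": str(exc), "ok": False}

    expected = Counter(cell.strip() for row in input_rows for cell in row)
    actual = Counter(cell.strip() for row in output_rows for cell in row)
    missing = expected - actual  # values present in the input but not the output
    rows_match = len(input_rows) == len(output_rows)
    return {
        "parseable": True,
        "row_count_match": rows_match,
        "missing_values": sorted(missing.elements()),
        "ok": rows_match and not missing,
    }


rows = [["Ada", "36"], ["Lin", "29"]]
print(check_integrity(rows, "name,age\nAda,36\n", "csv"))
```

Comparing multisets rather than sets means a value that appears twice in the input must also appear twice in the output. Values that are legitimately normalized during type inference (for example "1.50" becoming 1.5) would be flagged by this check, so a real implementation would need to decide how to treat them.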

Metadata

Labels: enhancement