PDF / Document Summarizer #26

@ms-shashank

Current file: app/src/lib/tools/pdf-summarizer.ts
Current model: deepseek-r1-0528
Current approach: Single prompt with pasted document text. No text preprocessing, no section detection, no entity extraction, no length management.

Problems with current approach:

  • Long documents exceed the context window and are silently truncated.
  • No extraction of specific data points (numbers, dates, amounts) - relies on LLM memory.
  • Cannot handle structured documents (tables, headers, lists) differently from prose.
  • Action items and risk factors are often missed in long documents.

Upgrade plan:

| Step | Agent | Action |
|------|-------|--------|
| 1 | Text Preprocessor | Programmatic: Clean the text. Detect document structure (headers, sections, lists, tables). Split into semantic chunks if the document exceeds the context window. |
| 2 | Entity Extractor | Programmatic: Extract numbers, dates, percentages, monetary amounts, and proper nouns using regex and NLP patterns. Build a structured data table. |
| 3 | Section Summarizer | Summarize each chunk/section individually. For long documents, this runs in parallel across chunks. |
| 4 | Synthesis Agent | Receive all section summaries and extracted entities. Generate the final executive summary, key findings, action items, and risk factors. |
  • The plan above is a reference layout only. You are free to extend or restructure the agent stack if needed.
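
A minimal sketch of what Step 1 could look like, offered as a starting point rather than the intended implementation. It splits on blank lines, treats markdown-style `#` lines and numbered lines such as `2.1 Scope` as headings (an assumed heuristic that will also catch single-line numbered list items), and packs paragraphs into chunks under a word budget. The names `Chunk`, `chunkDocument`, and `maxWords` are hypothetical and do not come from `pdf-summarizer.ts`.

```typescript
// Hypothetical types and names, for illustration only.
interface Chunk {
  heading: string; // nearest detected heading, or "" if none was seen yet
  text: string;
  wordCount: number;
}

// Assumed heading heuristic: "# Title" or "2.1 Scope" on a line of its own.
const HEADING_RE = /^(#{1,6}\s+\S.*|\d+(\.\d+)*\.?\s+[A-Z].*)$/;

function chunkDocument(raw: string, maxWords = 1500): Chunk[] {
  // Normalize line endings and collapse runs of blank lines.
  const cleaned = raw.replace(/\r\n?/g, "\n").replace(/\n{3,}/g, "\n\n");
  const chunks: Chunk[] = [];
  let heading = "";
  let buf: string[] = [];
  let words = 0;

  // Emit the buffered paragraphs as one chunk under the current heading.
  const flush = (): void => {
    if (buf.length === 0) return;
    chunks.push({ heading, text: buf.join("\n\n"), wordCount: words });
    buf = [];
    words = 0;
  };

  for (const block of cleaned.split("\n\n")) {
    const para = block.trim();
    if (para === "") continue;

    const lines = para.split("\n");
    let body = para;
    if (HEADING_RE.test(lines[0].trim())) {
      flush(); // a new section always starts a new chunk
      heading = lines[0].trim();
      body = lines.slice(1).join(" ").trim();
      if (body === "") continue;
    }

    // Slice the paragraph so that even a single oversized paragraph
    // is split instead of being truncated.
    const tokens = body.split(/\s+/);
    for (let i = 0; i < tokens.length; i += maxWords) {
      const piece = tokens.slice(i, i + maxWords);
      if (words > 0 && words + piece.length > maxWords) flush();
      buf.push(piece.join(" "));
      words += piece.length;
    }
  }
  flush();
  return chunks;
}
```

The word budget is a stand-in for a real token count, so `maxWords` would need to be tuned per model. Rejoining tokens with single spaces is fine for prose but flattens tables, which would need separate handling.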

Model suggestions to start with:

  • Step 3: Try deepseek-v3.2 or llama-4-maverick-17b for per-section summaries (fast, handles volume).
  • Step 4: Try deepseek-r1-0528 for synthesis reasoning. Also try kimi-k2.6 (131K context can handle large merged summaries well).
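
One way Steps 3 and 4 could be wired together is sketched below, with the model for each step kept in a small config object so alternatives can be swapped in during evaluation. `CallModel` is a hypothetical function signature standing in for whatever client the app actually uses, and `mapWithLimit`, `summarizeDocument`, and the prompt wording are illustrative assumptions rather than existing code.

```typescript
// Hypothetical signature: replace with the app's real model client.
type CallModel = (model: string, prompt: string) => Promise<string>;

interface PipelineModels {
  section: string;   // Step 3: per-section summaries
  synthesis: string; // Step 4: final synthesis
}

// Starting points taken from the suggestions above, not mandates.
const SUGGESTED_MODELS: PipelineModels = {
  section: "deepseek-v3.2",
  synthesis: "deepseek-r1-0528",
};

// Run fn over items with at most `limit` calls in flight, preserving input order.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  };
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

async function summarizeDocument(
  chunks: string[],
  entities: string[],
  callModel: CallModel,
  models: PipelineModels = SUGGESTED_MODELS,
  concurrency = 4,
): Promise<string> {
  // Step 3: summarize every chunk in parallel, bounded by `concurrency`.
  const summaries = await mapWithLimit(chunks, concurrency, (chunk, i) =>
    callModel(
      models.section,
      `Summarize section ${i + 1} of ${chunks.length}. ` +
        `Keep all figures, dates, and action items.\n\n${chunk}`,
    ),
  );

  // Step 4: one synthesis call over the merged summaries and extracted entities.
  const prompt = [
    "Write an executive summary, key findings, action items, and risk factors.",
    "Section summaries:",
    ...summaries.map((s, i) => `${i + 1}. ${s}`),
    "Extracted entities:",
    ...entities.map((e) => `- ${e}`),
  ].join("\n");
  return callModel(models.synthesis, prompt);
}
```

Passing `callModel` in as a parameter keeps the pipeline testable with a stub and makes it straightforward to rerun the same documents against a different `PipelineModels` value when comparing candidates.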

Model Selection Guidance

  • You are free to pick any model from the Oxlo catalog based on your own testing and evaluation.
  • The model suggestions above are starting points, not mandates. Try them first, and if they do not meet the accuracy target, experiment with alternatives.

Compare against: Claude Sonnet 4.6 Thinking (excellent at document analysis).

Acceptance criteria:

  • Handles documents up to 50,000 words without truncation (via chunking).
  • All extracted numbers, dates, and amounts are verified against source text.
  • Action items are correctly identified in 90%+ of documents that contain them.
  • Overall quality matches or exceeds Claude Sonnet 4.6 Thinking & ChatGPT 5.3 on 10 test documents.
  • Overall accuracy at 80%+.
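
The verification criterion could be checked programmatically along the following lines. This is a rough sketch under stated assumptions: the regex patterns are illustrative, English-centric, and far from exhaustive, matching is exact (so `1200` and `1,200` count as different values), and the names `Entity`, `extractEntities`, and `verifyAgainstSource` are hypothetical.

```typescript
type EntityKind = "money" | "percent" | "date" | "number";

interface Entity {
  kind: EntityKind;
  value: string; // exact matched text
  index: number; // offset in the scanned text
}

// Assumed patterns, listed in priority order; earlier kinds claim their span first.
const PATTERNS: [EntityKind, RegExp][] = [
  ["money", /[$€£]\s?\d+(?:,\d{3})*(?:\.\d+)?(?:\s?(?:thousand|million|billion))?/g],
  ["percent", /\d+(?:\.\d+)?\s?%/g],
  ["date", /\b\d{4}-\d{2}-\d{2}\b|\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},\s+\d{4}\b/g],
  ["number", /\b\d+(?:,\d{3})*(?:\.\d+)?\b/g],
];

function extractEntities(text: string): Entity[] {
  const found: Entity[] = [];
  const claimed: [number, number][] = [];
  for (const [kind, re] of PATTERNS) {
    re.lastIndex = 0; // the regexes are global and shared, so reset before each scan
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
      const start = m.index;
      const end = start + m[0].length;
      // Skip matches inside a span already claimed by a higher-priority kind,
      // e.g. the bare "12.5" inside "12.5%".
      if (claimed.some(([s, e]) => start < e && end > s)) continue;
      claimed.push([start, end]);
      found.push({ kind, value: m[0], index: start });
    }
  }
  return found.sort((a, b) => a.index - b.index);
}

// Split the entities found in a generated summary into those that also
// appear as entities in the source text and those that do not.
function verifyAgainstSource(
  summary: string,
  source: string,
): { verified: Entity[]; unverified: Entity[] } {
  const sourceValues = new Set(extractEntities(source).map((e) => e.value));
  const verified: Entity[] = [];
  const unverified: Entity[] = [];
  for (const e of extractEntities(summary)) {
    (sourceValues.has(e.value) ? verified : unverified).push(e);
  }
  return { verified, unverified };
}
```

A non-empty `unverified` list would flag a summary for regeneration or manual review. Comparing against the source's extracted entities rather than using a plain substring search avoids false positives such as `45` matching inside `1450`.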

Labels: enhancement (New feature or request)
