ShahMimansha/doc-intelligence

Document Intelligence System

A full-stack system that lets you upload documents, search their content, and get AI-powered summaries, key points, and question answering — built with FastAPI, React, and NVIDIA AI.


Features

  • Document Upload: Support for both raw text and file uploads (.txt, .pdf, etc.).
  • Smart Processing: Automatic text cleaning, chunking, and keyword extraction.
  • AI Analysis: Powered by NVIDIA AI for summaries and key point extraction.
  • Contextual Q&A: Ask questions on any document and get answers with confidence scores.
  • Keyword Search: Fast search across titles, content, and extracted tags.
  • Modern UI: Clean, dark-themed dashboard built with React.

File Structure

doc-intelligence/
│
├── main.py                        # Backend entry point
├── app/                           # FastAPI Backend
│   ├── routes/                    # API Endpoints (Documents, AI)
│   ├── models/                    # Database Models (SQLAlchemy)
│   ├── services/                  # Business Logic (AI, Processing)
│   └── schemas.py                 # Pydantic Schemas
│
├── frontend/                      # React UI
│   ├── src/                       # React components and styles
│   └── public/                    # Static assets
│
├── documents.db                   # SQLite database
├── .env                           # Configuration (API keys)
└── requirements.txt               # Backend dependencies

Setup

1. Backend Setup

# Install dependencies
pip install -r requirements.txt

# Configure .env (Add your NVIDIA_API_KEY)
# NVIDIA_API_KEY=your_key_here
# DATABASE_URL=sqlite:///./documents.db

# Run the server
uvicorn main:app --reload

Backend runs at: http://localhost:8000

2. Frontend Setup

cd frontend
npm install
npm start

Frontend runs at: http://localhost:3000


API Reference

Upload a Document

POST /documents

Accepts multipart/form-data for both raw text and file uploads.

Generate AI Summary

POST /documents/{id}/summary

Generates a 2-3 sentence summary, key points, and suggested tags.

Ask a Question

POST /documents/{id}/query

Selects the most relevant context by keyword overlap, then answers the question via NVIDIA AI.


Get All Documents

GET /documents
GET /documents?skip=0&limit=10

Get Document by ID

GET /documents/1

Search Documents

GET /documents/search?q=IoT

Response:

[
  {
    "id": 1,
    "title": "IoT Overview",
    "keywords": ["internet", "devices", "network"],
    "snippet": "...The Internet of Things (IoT) refers to the network of physical devices...",
    "created_at": "2024-01-15T10:30:00"
  }
]

Generate AI Summary

POST /documents/1/summary

Response:

{
  "document_id": 1,
  "title": "IoT Overview",
  "summary": "This document provides an overview of the Internet of Things, explaining how physical devices connect and communicate over networks.",
  "key_points": [
    "IoT connects physical devices to the internet",
    "Sensors collect and transmit data",
    "Applications include smart homes and industrial automation"
  ],
  "suggested_tags": ["iot", "sensors", "networking", "automation"]
}
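The AI Integration notes in this README say summary results are cached in the DB so repeated calls do not re-invoke the API. A minimal sketch of that cache-first pattern follows. It assumes a plain dict in place of the database table, and `get_or_create_summary` and `fake_generate` are hypothetical names rather than the project's own functions.

```python
from typing import Callable


def get_or_create_summary(
    document_id: int,
    content: str,
    cache: dict[int, dict],
    generate: Callable[[str], dict],
) -> dict:
    """Return the stored summary if one exists; otherwise generate and store it."""
    if document_id in cache:
        return cache[document_id]
    result = generate(content)
    cache[document_id] = result
    return result


# Illustration with a fake generator standing in for the real AI call.
calls: list[str] = []


def fake_generate(content: str) -> dict:
    calls.append(content)
    return {"summary": content[:20], "key_points": [], "suggested_tags": []}


cache: dict[int, dict] = {}
first = get_or_create_summary(1, "The Internet of Things connects devices.", cache, fake_generate)
second = get_or_create_summary(1, "The Internet of Things connects devices.", cache, fake_generate)
```

The second call is served from the cache, so the generator runs only once for that document.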

Ask a Question

POST /documents/1/query
Content-Type: application/json

{
  "question": "What are the main applications of IoT?"
}

Response:

{
  "document_id": 1,
  "question": "What are the main applications of IoT?",
  "answer": "According to the document, IoT is applied in smart homes, industrial automation, and healthcare monitoring.",
  "confidence": "high",
  "source_chunks": [
    "IoT applications span smart homes, factories, and medical devices..."
  ]
}
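For calling this endpoint from a script, a sketch using only the Python standard library might look like the following. The helper name `build_query_request` is hypothetical, and the base URL assumes the backend is running locally as described in Setup.

```python
import json
import urllib.request


def build_query_request(base_url: str, document_id: int, question: str) -> urllib.request.Request:
    """Build the POST request for /documents/{id}/query with a JSON body."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/documents/{document_id}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request(
    "http://localhost:8000", 1, "What are the main applications of IoT?"
)

# To send it against a running backend:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["answer"], reply["confidence"])
```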

Get Query History

GET /documents/1/history

Delete a Document

DELETE /documents/1

Running Tests

pytest tests/ -v

Approach

Processing Pipeline

The system runs a lightweight pipeline. Steps 1-3 run before a document is stored; steps 4-5 run later, when summaries are generated and questions are asked:

  1. Clean — strips extra whitespace, normalizes characters
  2. Chunk — splits content into ~1000-char overlapping segments at sentence boundaries
  3. Extract keywords — frequency-based keyword extraction after removing stop words
  4. Store Summaries — generated AI summaries are saved in the DB for persistence and speed.
  5. Save History — every user query and AI answer is logged in a history table.

This keeps the system fast and feature-rich without needing a vector database.
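The first three steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in processing.py: the function names, the tiny stop-word list, and the one-sentence overlap are all hypothetical choices.

```python
import re
import unicodedata
from collections import Counter

# Assumption: a short illustrative stop-word list; the project's own list is not shown here.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
    "for", "on", "with", "that", "this", "it", "as", "by",
}


def clean_text(text: str) -> str:
    """Normalize Unicode forms, collapse runs of whitespace, and trim the ends."""
    normalized = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", normalized).strip()


def chunk_text(text: str, max_chars: int = 1000, overlap_sentences: int = 1) -> list[str]:
    """Group whole sentences into chunks of roughly max_chars.

    The last overlap_sentences sentences of a chunk are repeated at the
    start of the next one, so neighbouring chunks share context.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def extract_keywords(text: str, top_n: int = 5) -> list[str]:
    """Return the top_n most frequent words that are not stop words."""
    words = re.findall(r"[a-z][a-z0-9]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]
```

Splitting on sentence-ending punctuation keeps each chunk readable on its own, at the cost of chunk sizes that only approximate the target length.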

AI Integration

NVIDIA AI handles two tasks:

  • Summary endpoint — given the document content, it returns a JSON with a summary, key points, and suggested tags. Results are cached in the DB so repeated calls don't re-invoke the API.
  • Query endpoint — the most relevant chunks are selected using keyword overlap scoring, then passed to the model with the question. It returns an answer and a confidence level (high/medium/low).
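Keyword overlap scoring for the query endpoint could be sketched like this. The names `tokenize` and `select_chunks` and the choice to count distinct shared words are assumptions for illustration; the scoring in ai_service.py may differ.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its distinct alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_chunks(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Rank chunks by the number of distinct words shared with the question.

    Ties keep document order, and chunks with no shared words are dropped.
    """
    question_words = tokenize(question)
    scored = [
        (len(question_words & tokenize(chunk)), index, chunk)
        for index, chunk in enumerate(chunks)
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [chunk for score, _, chunk in scored[:top_k] if score > 0]
```

The selected chunks would then be placed in the prompt alongside the question, and can also be returned to the caller as source_chunks.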

Search

Search uses simple string matching across title, content, and extracted keywords. No vector embeddings or FAISS are required; plain matching is sufficient at this scale.
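A sketch of that kind of search is shown below, including a snippet built around the first match. The `Doc` class is a hypothetical stand-in for the stored model, and case-insensitive matching and the snippet width are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a stored document row."""
    id: int
    title: str
    content: str
    keywords: list[str] = field(default_factory=list)


def make_snippet(content: str, query: str, radius: int = 40) -> str:
    """Return text around the first match, or the start of the content if the match is elsewhere."""
    pos = content.lower().find(query.lower())
    if pos == -1:
        return content[: 2 * radius]
    start = max(0, pos - radius)
    end = min(len(content), pos + len(query) + radius)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(content) else ""
    return prefix + content[start:end] + suffix


def search_documents(docs: list[Doc], query: str) -> list[dict]:
    """Keep documents whose title, content, or keywords contain the query."""
    needle = query.lower()
    results = []
    for doc in docs:
        haystacks = [doc.title, doc.content, *doc.keywords]
        if any(needle in text.lower() for text in haystacks):
            results.append({
                "id": doc.id,
                "title": doc.title,
                "keywords": doc.keywords,
                "snippet": make_snippet(doc.content, query),
            })
    return results
```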


AI Usage Notes

| Where | What was used | What was customized |
| --- | --- | --- |
| app/services/ai_service.py | NVIDIA AI API | Prompt design, JSON parsing, confidence scoring, fallback handling |
| app/services/processing.py | Written manually | Full custom implementation: chunking logic, keyword extraction, stop words list |
| app/routes/ | Written manually | All routing, caching logic, history saving |
| README examples | AI (assisted) | All curl examples verified manually |

The processing pipeline (processing.py) is entirely hand-written. AI was used only for the summary and Q&A features, where NVIDIA AI's language understanding adds real value.
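The JSON parsing and fallback handling listed for ai_service.py could follow a pattern like the one below. This is a hedged sketch, not the project's code: `parse_answer` is a hypothetical name, and defaulting to low confidence when the reply cannot be parsed is an assumption.

```python
import json
import re

VALID_CONFIDENCE = {"high", "medium", "low"}


def parse_answer(raw: str) -> dict:
    """Extract an answer and confidence level from a model reply.

    Falls back to the raw text with low confidence when the reply holds
    no usable JSON object or the confidence value is unrecognized.
    """
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
        except json.JSONDecodeError:
            data = None
        if isinstance(data, dict) and isinstance(data.get("answer"), str):
            confidence = str(data.get("confidence", "")).lower()
            if confidence not in VALID_CONFIDENCE:
                confidence = "low"
            return {"answer": data["answer"], "confidence": confidence}
    return {"answer": raw.strip(), "confidence": "low"}
```

Searching for the outermost braces tolerates replies where the model wraps its JSON in extra prose.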


Bonus Features Implemented

  • Confidence score on every query response (low / medium / high)
  • Query history saved and retrievable per document
  • Summary caching — avoids redundant API calls
  • Source chunks returned with each answer (citation-style)
  • Auto-generated Swagger UI at /docs

About

A Document Intelligence System built with FastAPI, SQLite, and NVIDIA AI API. Upload documents, search content, generate AI summaries, extract key points, and ask questions with confidence scoring. Includes a React frontend.
