Skip to content

HenrikMader/DataChecker_RAG

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PDF RAG Quality Analyzer 📄

A web application that analyzes PDF documents to determine their suitability for RAG (Retrieval-Augmented Generation) pipelines.

Features

  • User Authentication: Secure login system with user management
  • PDF to Markdown Conversion: Uses Docling for high-quality document conversion
  • AI-Powered Analysis: GPT-4o-mini evaluates the extracted content for RAG suitability
  • Rate Limiting: Configurable daily request limits per user
  • Detailed Reports: Get scores on text extraction, structure, coherence, and more
  • Download Markdown: Export the converted markdown for use in your RAG pipeline

Installation

  1. Clone the repository:
git clone https://github.com/HenrikMader/DataChecker_RAG.git
cd DataChecker_RAG
  1. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  1. Install dependencies:
pip install -r requirements.txt
  1. Set up your OpenAI API key (server-side):
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY
  1. Configure users in config.yaml (see User Management below)

Usage

Run the Streamlit app:

streamlit run app.py

Then open http://localhost:8501 in your browser.

Default Users

Username Password Role
admin admin123 admin
demo demo123 user

⚠️ Change these passwords before deploying!

User Management

Users are managed in config.yaml. To add a new user:

  1. Generate a hashed password:
import streamlit_authenticator as stauth
hash = stauth.Hasher(['your_password']).generate()[0]
print(hash)
  1. Add the user to config.yaml:
credentials:
  usernames:
    newuser:
      email: newuser@example.com
      first_name: New
      last_name: User
      password: <paste_hashed_password>
      roles:
        - user

Configuration

Rate Limiting

Edit config.yaml to change the daily request limit:

rate_limit:
  max_requests_per_day: 20

Cookie Settings

cookie:
  expiry_days: 30  # How long users stay logged in
  key: your_secret_key  # Change this!
  name: pdf_rag_auth

How It Works

  1. Upload: Drop your PDF file into the uploader
  2. Convert: Docling extracts text and structure, converting to Markdown
  3. Analyze: GPT-4o-mini evaluates the content quality for RAG use cases
  4. Review: Get a detailed quality report with scores and recommendations

Quality Metrics

The analyzer evaluates:

  • Text Extraction Quality (1-10): Readability, OCR accuracy
  • Structure Preservation (1-10): Headings, tables, lists formatting
  • Content Coherence (1-10): Logical flow, complete sentences
  • Information Density (1-10): Meaningful, retrievable content
  • Noise Level (1-10): Minimal boilerplate/irrelevant content

Requirements

  • Python 3.10+
  • OpenAI API key (configured server-side)
  • ~2GB disk space (for Docling models)

Security Notes

  • OpenAI API key is stored server-side only (not exposed to users)
  • User passwords are hashed with bcrypt
  • Rate limiting prevents API abuse
  • Consider adding config.yaml to .gitignore if it contains sensitive data

License

MIT

About

Short Data Checker Application for a RAG Pipeline.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages