PDF RAG Quality Analyzer 📄

A web application that analyzes PDF documents to determine their suitability for RAG (Retrieval-Augmented Generation) pipelines.

Features

User Authentication: Secure login system with user management
PDF to Markdown Conversion: Uses Docling for high-quality document conversion
AI-Powered Analysis: GPT-4o-mini evaluates the extracted content for RAG suitability
Rate Limiting: Configurable daily request limits per user
Detailed Reports: Get scores on text extraction, structure, coherence, and more
Download Markdown: Export the converted markdown for use in your RAG pipeline

Installation

Clone the repository:

git clone https://github.com/HenrikMader/DataChecker_RAG.git
cd DataChecker_RAG

Create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Set up your OpenAI API key (server-side):

cp .env.example .env
# Edit .env and add your OPENAI_API_KEY

Configure users in config.yaml (see User Management below)

Usage

Run the Streamlit app:

streamlit run app.py

Then open http://localhost:8501 in your browser.

Default Users

Username	Password	Role
admin	admin123	admin
demo	demo123	user

⚠️ Change these passwords before deploying!

User Management

Users are managed in config.yaml. To add a new user:

Generate a hashed password:

import streamlit_authenticator as stauth
hash = stauth.Hasher(['your_password']).generate()[0]
print(hash)

Add the user to config.yaml:

credentials:
  usernames:
    newuser:
      email: newuser@example.com
      first_name: New
      last_name: User
      password: <paste_hashed_password>
      roles:
        - user

Configuration

Rate Limiting

Edit config.yaml to change the daily request limit:

rate_limit:
  max_requests_per_day: 20

Cookie Settings

cookie:
  expiry_days: 30  # How long users stay logged in
  key: your_secret_key  # Change this!
  name: pdf_rag_auth

How It Works

Upload: Drop your PDF file into the uploader
Convert: Docling extracts text and structure, converting to Markdown
Analyze: GPT-4o-mini evaluates the content quality for RAG use cases
Review: Get a detailed quality report with scores and recommendations

Quality Metrics

The analyzer evaluates:

Text Extraction Quality (1-10): Readability, OCR accuracy
Structure Preservation (1-10): Headings, tables, lists formatting
Content Coherence (1-10): Logical flow, complete sentences
Information Density (1-10): Meaningful, retrievable content
Noise Level (1-10): Minimal boilerplate/irrelevant content

Requirements

Python 3.10+
OpenAI API key (configured server-side)
~2GB disk space (for Docling models)

Security Notes

OpenAI API key is stored server-side only (not exposed to users)
User passwords are hashed with bcrypt
Rate limiting prevents API abuse
Consider adding config.yaml to .gitignore if it contains sensitive data

License

MIT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PDF RAG Quality Analyzer 📄

Features

Installation

Usage

Default Users

User Management

Configuration

Rate Limiting

Cookie Settings

How It Works

Quality Metrics

Requirements

Security Notes

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
app.py		app.py
config.yaml		config.yaml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

PDF RAG Quality Analyzer 📄

Features

Installation

Usage

Default Users

User Management

Configuration

Rate Limiting

Cookie Settings

How It Works

Quality Metrics

Requirements

Security Notes

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages