A web application that analyzes PDF documents to determine their suitability for RAG (Retrieval-Augmented Generation) pipelines.
- User Authentication: Secure login system with user management
- PDF to Markdown Conversion: Uses Docling for high-quality document conversion
- AI-Powered Analysis: GPT-4o-mini evaluates the extracted content for RAG suitability
- Rate Limiting: Configurable daily request limits per user
- Detailed Reports: Get scores on text extraction, structure, coherence, and more
- Download Markdown: Export the converted markdown for use in your RAG pipeline
- Clone the repository:
git clone https://github.com/HenrikMader/DataChecker_RAG.git
cd DataChecker_RAG- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate- Install dependencies:
pip install -r requirements.txt- Set up your OpenAI API key (server-side):
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY- Configure users in
config.yaml(see User Management below)
Run the Streamlit app:
streamlit run app.pyThen open http://localhost:8501 in your browser.
| Username | Password | Role |
|---|---|---|
| admin | admin123 | admin |
| demo | demo123 | user |
⚠️ Change these passwords before deploying!
Users are managed in config.yaml. To add a new user:
- Generate a hashed password:
import streamlit_authenticator as stauth
hash = stauth.Hasher(['your_password']).generate()[0]
print(hash)- Add the user to
config.yaml:
credentials:
usernames:
newuser:
email: newuser@example.com
first_name: New
last_name: User
password: <paste_hashed_password>
roles:
- userEdit config.yaml to change the daily request limit:
rate_limit:
max_requests_per_day: 20cookie:
expiry_days: 30 # How long users stay logged in
key: your_secret_key # Change this!
name: pdf_rag_auth- Upload: Drop your PDF file into the uploader
- Convert: Docling extracts text and structure, converting to Markdown
- Analyze: GPT-4o-mini evaluates the content quality for RAG use cases
- Review: Get a detailed quality report with scores and recommendations
The analyzer evaluates:
- Text Extraction Quality (1-10): Readability, OCR accuracy
- Structure Preservation (1-10): Headings, tables, lists formatting
- Content Coherence (1-10): Logical flow, complete sentences
- Information Density (1-10): Meaningful, retrievable content
- Noise Level (1-10): Minimal boilerplate/irrelevant content
- Python 3.10+
- OpenAI API key (configured server-side)
- ~2GB disk space (for Docling models)
- OpenAI API key is stored server-side only (not exposed to users)
- User passwords are hashed with bcrypt
- Rate limiting prevents API abuse
- Consider adding
config.yamlto.gitignoreif it contains sensitive data
MIT