A Python tool for semantic analysis and interactive visualization of PDF document collections.
- ✅ Automatic text extraction from PDFs
- ✅ Multilingual language detection (German, English, French, Spanish, Italian)
- ✅ Semantic similarity calculation (TF-IDF + Cosine similarity)
- ✅ Automatic document clustering
- ✅ Interactive HTML network visualization
- ✅ Double-click to open PDFs
- ✅ Detailed cluster analysis
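
The TF-IDF + cosine similarity step listed above can be sketched in a few lines. This is a dependency-free illustration of the idea, not PDFGraph's own code: the tool lists scikit-learn in its requirements, and its actual implementation is not shown here. The function names `tfidf_vectors` and `cosine_similarity`, the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Stand-ins for text extracted from three PDFs.
docs = [
    "neural networks learn representations",
    "deep neural networks learn features",
    "tomato soup recipe with basil",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])
sim_02 = cosine_similarity(vecs[0], vecs[2])
print(f"doc0 vs doc1: {sim_01:.2f}")
print(f"doc0 vs doc2: {sim_02:.2f}")
```

Documents that share vocabulary score above zero, while documents with no terms in common score exactly 0.0, which is why unrelated PDFs end up without a connecting edge.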
- Python 3.7+
- pip
pip install -r requirements.txt
requirements.txt content:
PyPDF2
networkx
matplotlib
scikit-learn
numpy
pyvis
langdetect
from PDFGraph import main
main(
folder_path="./my_pdfs",
similarity_threshold=0.25, # Connection strength (0.0-1.0)
n_clusters=3, # Number of clusters
multilingual=True, # Enable multilingual processing
repulsion_strength=2.5, # Node repulsion (1.0-5.0)
spring_k=10.0 # Overall distance (3.0-15.0)
)
The program generates:
- network_interactive.html - Interactive network visualization
- Console output - Language distribution, cluster analysis, similarity values
| Action | Description |
|---|---|
| Double-click | Opens the PDF file |
| Drag & Drop | Move nodes around |
| Scroll | Zoom in/out |
| Hover | Shows similarity value |
| Multi-select | Select multiple nodes |
- repulsion_strength: Controls repulsion of weak connections
  - Low (1.0): Compact layout
  - High (5.0): Strong dispersion
- spring_k: Distance between all nodes
  - Low (3.0): Compact network
  - High (15.0): Widespread distribution
- similarity_threshold: Minimum similarity for connections
  - Low (0.1): Many connections
  - High (0.5): Only strong connections
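
To illustrate how the minimum-similarity setting changes the number of connections, here is a minimal sketch that filters a pairwise similarity matrix into an edge list. The helper `edges_above_threshold` and the sample matrix are hypothetical, and whether PDFGraph compares with `>=` or `>` is an assumption.

```python
def edges_above_threshold(similarity, threshold):
    """Return (i, j, score) for each unordered pair that meets the threshold."""
    edges = []
    n = len(similarity)
    for i in range(n):
        # Start at i + 1 to skip self-similarity and mirrored pairs.
        for j in range(i + 1, n):
            score = similarity[i][j]
            if score >= threshold:  # assumption: inclusive comparison
                edges.append((i, j, score))
    return edges


# Made-up symmetric similarity matrix for four documents.
similarity = [
    [1.00, 0.62, 0.12, 0.31],
    [0.62, 1.00, 0.27, 0.08],
    [0.12, 0.27, 1.00, 0.45],
    [0.31, 0.08, 0.45, 1.00],
]

loose = edges_above_threshold(similarity, 0.1)
default = edges_above_threshold(similarity, 0.25)
strict = edges_above_threshold(similarity, 0.5)
print(len(loose), len(default), len(strict))
```

With this sample data, raising the threshold from 0.1 to 0.5 prunes the graph from five edges down to one, which matches the "many connections" versus "only strong connections" description.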
- Clusters → Different node colors
- Green edges → High similarity (>0.7)
- Blue edges → Medium similarity (>0.5)
- Gray edges → Low similarity (>0.3)
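
The color bands above can be expressed as a small lookup. This is a sketch under assumptions: the function name `edge_color` is hypothetical, the checks run from the highest band down so each edge gets exactly one color, and values at or below 0.3 return `None` because the list does not say how such edges are drawn.

```python
def edge_color(similarity):
    """Map a similarity score to the edge color bands described in the list."""
    if similarity > 0.7:
        return "green"  # high similarity
    if similarity > 0.5:
        return "blue"   # medium similarity
    if similarity > 0.3:
        return "gray"   # low similarity
    return None         # not covered by the documented bands


for score in (0.85, 0.6, 0.4, 0.28):
    print(score, edge_color(score))
```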
- 🇩🇪 German
- 🇬🇧 English
- 🇫🇷 French
- 🇪🇸 Spanish
- 🇮🇹 Italian
Multilingual document collections are automatically detected and processed.
# 1. Copy PDFs to a folder
mkdir ./my_pdfs
cp *.pdf ./my_pdfs/
# 2. Run PDFGraph
python PDFGraph.py
# 3. Open the generated HTML file
network_interactive.html
- "No PDFs found" → Make sure the PDF folder exists and contains .pdf files
- "pyvis not installed" → Run: pip install pyvis
- "No text extracted" → The PDF might be scanned/image-based (OCR required)
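
When diagnosing a "No PDFs found" message, it can help to check the folder independently of the tool. The following is a minimal sketch using only the standard library; `find_pdfs` is a hypothetical helper and not part of PDFGraph, and the case-insensitive extension match is an assumption.

```python
import tempfile
from pathlib import Path


def find_pdfs(folder_path):
    """Return the sorted PDF files directly inside folder_path."""
    folder = Path(folder_path)
    if not folder.is_dir():
        raise FileNotFoundError(f"Folder does not exist: {folder}")
    # Match .pdf and .PDF alike; subfolders are not searched.
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demo on a throwaway folder containing one PDF and one non-PDF file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.pdf").write_bytes(b"%PDF-1.4")
    (Path(tmp) / "notes.txt").write_text("not a pdf")
    found_names = [p.name for p in find_pdfs(tmp)]
    print(found_names)
```

An empty list means the folder exists but holds no .pdf files, while a `FileNotFoundError` means the path itself is wrong.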
- ~5-10 seconds for 10-20 PDFs
- ~30 seconds for 50+ PDFs
- Depends on text volume and CPU
Version: 1.0 | Author: Sebastian Meyer | Date: October 2025