🇷🇸 Srpska verzija / Serbian version
A simple demonstration application for Named Entity Recognition (NER) and Named Entity Linking (NEL) using spaCy models with a minimal GUI.
- ✅ Easy Installation: Automated installers for Windows (PowerShell) and Linux/Mac (Bash)
- ✅ Python Version Check: Ensures Python 3.10 or higher is installed
- ✅ Virtual Environment: Automatically creates and manages a virtual environment
- ✅ Flexible Dependencies: Choose between standard spaCy or spacy-transformers
- ✅ Simple GUI: User-friendly interface built with tkinter
- ✅ Model Management: Load custom trained models from the models/ directory
- ✅ Text Processing: Process any text and extract named entities
- ✅ Cyrillic Transliteration: Automatic transliteration from Cyrillic to Latin script for better NER accuracy
- ✅ Smart Text Chunking: Automatically handles large texts by chunking on paragraph boundaries
- ✅ Visual Output: Generate beautiful HTML visualizations using displaCy
- ✅ Output Management: Save all outputs to data/outputs/ with timestamps
- ✅ Comprehensive Testing: Full test suite with pytest
NEL_Demo/
├── install.ps1              # Windows installer (PowerShell)
├── install.sh               # Linux/Mac installer (Bash)
├── requirements.txt         # Python dependencies
├── README.md                # This file
├── src/
│   ├── gui.py               # Main GUI application
│   └── text_chunker.py      # Text chunking module for large documents
├── tests/
│   └── test_text_chunker.py # Test suite for text chunking
├── models/                  # Place your trained models here
│   └── {model_name}/
│       └── model-best/      # Your trained spaCy model
├── inputs/                  # Input text files
│   └── sample_text.txt      # Sample text file
├── data/
│   └── outputs/             # HTML visualization outputs
└── venv/                    # Virtual environment (created by installer)
- Python: 3.10 or higher
- Operating System: Windows, Linux, or macOS
- spaCy Model: A trained spaCy model placed in models/{model_name}/model-best/
Windows:
- Open PowerShell
- Navigate to the project directory
- Run the installer:
.\install.ps1
Linux/Mac:
- Open a terminal
- Navigate to the project directory
- Run the installer:
./install.sh
The installer will:
- ✅ Check if Python 3.10+ is installed
- ✅ Create a virtual environment in venv/
- ✅ Activate the virtual environment
- ✅ Upgrade pip to the latest version
- ✅ Ask you to choose between:
  - Standard spaCy (faster, smaller)
  - spacy-transformers (more accurate, larger)
- ✅ Install all required dependencies
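For readers who want to see what the first two installer steps amount to, here is a minimal Python sketch. It is not the contents of install.ps1 or install.sh; the function names `python_is_supported` and `create_environment` are hypothetical, and only the standard library `venv` module is assumed.

```python
import sys
import venv
from pathlib import Path

MIN_VERSION = (3, 10)  # minimum Python version stated in this README


def python_is_supported(version=None):
    """Return True if the given (or running) interpreter is 3.10 or newer."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= MIN_VERSION


def create_environment(target="venv"):
    """Create a virtual environment with pip available (hypothetical helper)."""
    if not python_is_supported():
        raise RuntimeError("Python 3.10 or higher is required")
    venv.create(Path(target), with_pip=True)
```

Calling `create_environment()` from the project directory would create venv/ there; activating it and installing the dependencies is still left to the shell.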
A Serbian NER+NEL model (trsic4-CNN-ner-nel) is already installed in the models/ directory and ready to use. No additional setup is required!
If you have a trained spaCy model:
- Create a directory: models/{your_model_name}/
- Place your trained model in: models/{your_model_name}/model-best/
The structure should look like:
models/
└── your_model_name/
    └── model-best/
        ├── config.cfg
        ├── meta.json
        ├── tokenizer
        ├── ner/
        └── ... (other model files)
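The layout above can be checked programmatically. The following sketch assumes that a usable model is any directory under models/ whose model-best/ folder contains config.cfg and meta.json; the function `find_models` is hypothetical and is not taken from src/gui.py.

```python
from pathlib import Path


def find_models(models_root="models"):
    """Return {model_name: path to model-best} for every usable model directory."""
    root = Path(models_root)
    found = {}
    if not root.is_dir():
        return found
    for candidate in sorted(root.iterdir()):
        best = candidate / "model-best"
        # Assumption: a model counts as usable if both files below are present
        if (best / "config.cfg").is_file() and (best / "meta.json").is_file():
            found[candidate.name] = best
    return found
```

A GUI dropdown could be filled from `find_models().keys()`, and the selected path passed to `spacy.load`.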
Windows:
.\venv\Scripts\Activate.ps1
python src/gui.py
Linux/Mac:
source venv/bin/activate
python src/gui.py
- Select a Model:
  - Choose your model from the dropdown
  - Click "Load Model" to load it
  - Wait for the confirmation message
- Configure Processing Options:
  - Transliterate Cyrillic to Latin: Enabled by default (if cyrtranslit is installed)
  - This option automatically converts Cyrillic text to Latin before processing for better entity recognition
- Enter Text:
  - Type or paste text into the input area
  - Or click "Load Sample Text" for a demo
  - Or click "Load from File" to load a text file from the inputs/ folder
- Process Text:
  - Click "Process Text (NER)" to analyze the text
  - View entities in the results section
  - HTML visualization is automatically saved
- View Results:
  - Click "View Last Output" to open the HTML in your browser
  - Click "Open Output Folder" to see all saved outputs
The application includes automatic Cyrillic-to-Latin transliteration to improve NER accuracy when using models trained primarily on Latin script:
- Automatic Conversion: Converts Cyrillic text to Latin script (supports Serbian, Montenegrin, Macedonian, Russian, Ukrainian, Kazakh, and Bulgarian) before processing
- Enabled by Default: The transliteration option is checked by default (if cyrtranslit is installed)
- Toggleable: Can be disabled via the checkbox if you prefer to process Cyrillic text directly
- Preserves Entities: Latin text remains unchanged; only Cyrillic characters are transliterated
- Better Accuracy: Models trained on Latin script typically perform better with transliterated text
Example: The Cyrillic text "Новак Ђоковић рођен у Београду" is automatically transliterated to "Novak Đoković rođen u Beogradu" before being sent to the NER pipeline.
Note: If you have a model specifically trained on Cyrillic text, you can disable this option by unchecking the "Transliterate Cyrillic to Latin before processing" checkbox.
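The optional transliteration step described above might look roughly like this. The sketch uses the `to_latin` function of the cyrtranslit package with the Serbian language code; the wrapper `prepare_text` and its parameters are hypothetical, and the application's real code may differ.

```python
try:
    import cyrtranslit  # optional dependency listed in this README
except ImportError:
    cyrtranslit = None  # transliteration is skipped when the package is missing


def prepare_text(text, transliterate=True, lang_code="sr"):
    """Return text in Latin script when enabled and cyrtranslit is available."""
    if not transliterate or cyrtranslit is None:
        return text
    # Characters outside the Cyrillic mapping (e.g. Latin letters) pass through
    return cyrtranslit.to_latin(text, lang_code)
```

With cyrtranslit installed, `prepare_text("Новак Ђоковић рођен у Београду")` performs the conversion shown in the example above; with `transliterate=False` the text is returned untouched, which mirrors unchecking the checkbox.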
Try this sample text:
Apple Inc. is an American multinational technology company headquartered
in Cupertino, California. Tim Cook is the CEO of Apple. The company was
founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976.
The application will:
- Extract entities like "Apple Inc." (ORG), "Tim Cook" (PERS), "Cupertino" (LOC)
- Link entities to Wikidata (NEL) with Q-IDs where available
- Show entity labels and positions
- Generate an HTML visualization with highlighted entities
- Save the output to data/outputs/ner_output_YYYYMMDD_HHMMSS.html
Note: The Serbian NER+NEL model (trsic4-CNN-ner-nel) recognizes these entity types: PERS (person), LOC (location), ORG (organization), EVENT, DEMO (demonym), IDEO (ideology), PRODUCT, ROLE, and WORK.
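As a rough illustration of how entities, labels, positions, and Wikidata IDs can be read from a spaCy pipeline, consider the sketch below. It relies on the standard spaCy attributes `doc.ents`, `label_`, `start_char`, `end_char`, and `kb_id_`; the helper names `describe_entities` and `analyze` are hypothetical, and whether `kb_id_` is filled depends on the model containing an entity linker.

```python
from pathlib import Path


def describe_entities(doc):
    """Return (text, label, start, end, kb_id) for each entity in a spaCy Doc."""
    rows = []
    for ent in doc.ents:
        # kb_id_ is an empty string when no knowledge-base ID was assigned
        kb_id = ent.kb_id_ or None
        rows.append((ent.text, ent.label_, ent.start_char, ent.end_char, kb_id))
    return rows


def analyze(model_dir, text):
    """Load a pipeline from model_dir and describe the entities found in text."""
    import spacy  # imported lazily so describe_entities works without spaCy

    nlp = spacy.load(Path(model_dir))
    return describe_entities(nlp(text))
```

For example, `analyze("models/trsic4-CNN-ner-nel/model-best", text)` would return one tuple per recognized entity, assuming the bundled model lives at that path.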
The application automatically uses chunking for any text with multiple paragraphs:
- Smart Chunking: Paragraphs are grouped into chunks of up to 100,000 characters each, so chunk boundaries always fall between paragraphs and the logical structure of the text is preserved
- Automatic Processing: Each chunk is processed separately with spaCy NER
- Merged Output: All chunks are combined into a single HTML visualization
- Visual Separation: Section breaks are added between chunks in the output
- Better Context: Processing text with paragraph boundaries helps spaCy maintain clearer context for entity recognition
Single-paragraph texts are processed directly, without chunking overhead. Splitting only on paragraph boundaries keeps each chunk a coherent unit for NER while maintaining the readability and structure of the original text.
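The grouping of paragraphs into size-limited chunks can be sketched as follows. This is an illustration of the general technique only, not the code in src/text_chunker.py; the names `split_paragraphs` and `chunk_paragraphs` are hypothetical, and paragraphs are assumed to be separated by blank lines.

```python
import re

MAX_CHUNK_CHARS = 100_000  # chunk size limit mentioned in this README


def split_paragraphs(text):
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def chunk_paragraphs(text, max_chars=MAX_CHUNK_CHARS):
    """Group consecutive paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for paragraph in split_paragraphs(text):
        # +2 accounts for the blank line that joins paragraphs inside a chunk
        extra = len(paragraph) + (2 if current else 0)
        if current and size + extra > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
            extra = len(paragraph)
        current.append(paragraph)
        size += extra
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each returned chunk could then be passed to the spaCy pipeline separately, and the per-chunk HTML merged into one page afterwards.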
Each processed text generates an HTML file with:
- Original text with highlighted entities
- Color-coded entity types
- Interactive visualization
- Timestamp in the filename
Output files are saved in: data/outputs/
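A timestamped output file of the form ner_output_YYYYMMDD_HHMMSS.html could be produced along these lines. The functions `output_path` and `save_html` are hypothetical; the HTML string itself would typically come from spaCy's `displacy.render(doc, style="ent", page=True)`, which is assumed here rather than shown.

```python
from datetime import datetime
from pathlib import Path


def output_path(output_dir="data/outputs", now=None):
    """Build a path like data/outputs/ner_output_YYYYMMDD_HHMMSS.html."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"ner_output_{stamp}.html"


def save_html(html, output_dir="data/outputs"):
    """Write the HTML string to a new timestamped file and return its path."""
    path = output_path(output_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path
```

Because the timestamp has one-second resolution, two outputs saved within the same second would share a filename; a real implementation may want to guard against that.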
- Install Python 3.10 or higher from python.org
- Make sure to check "Add Python to PATH" during installation
- Make sure you've placed a trained model in models/{model_name}/model-best/
- Check that the model directory structure is correct
- Try downloading a pre-trained model (see "Setting Up a Model")
- Verify the model files are complete and not corrupted
- Make sure the model is compatible with your spaCy version
- Try re-downloading or re-training the model
- Make sure you've activated the virtual environment
- Check that all dependencies are installed: pip list
- On Linux, you may need to install tkinter: sudo apt-get install python3-tk
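As a complement to pip list, a quick check from inside the activated environment can show which modules are not importable. This is a small sketch using only the standard library; the function name `missing_modules` and its default module list are assumptions based on the dependencies named in this README.

```python
from importlib.util import find_spec


def missing_modules(names=("spacy", "tkinter", "cyrtranslit")):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]
```

An empty list from `missing_modules()` suggests the environment is set up; if "tkinter" appears in the result on Linux, installing python3-tk as described above is the likely fix.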
To train a custom NER+NEL model with spaCy:
- Prepare your training data
- Create a spaCy project or config
- Train the model:
python -m spacy train config.cfg --output ./models/my_model
- The trained model will be in models/my_model/model-best/
For more information, see the spaCy training documentation.
For better accuracy, use transformer-based models:
- Install spacy-transformers during setup (option 2)
- Train or download a transformer model
- Place it in the models directory
Note: Transformer models are larger and slower but more accurate.
Core dependencies (installed automatically):
- spacy>=3.7.0 - Core NLP library
- cyrtranslit>=1.0.0 - Cyrillic-to-Latin transliteration
- tkinter-tooltip>=2.0.0 - GUI tooltips (optional)
Optional:
- spacy-transformers - For transformer-based models
Development dependencies:
- pytest - For running tests
The project includes comprehensive tests for the text chunking functionality.
To run the tests:
# Activate the virtual environment first
# Windows:
.\venv\Scripts\Activate.ps1
# Linux/Mac:
source venv/bin/activate
# Install pytest (if not already installed)
pip install pytest
# Run all tests
python -m pytest tests/test_text_chunker.py -v
# Run specific test class
python -m pytest tests/test_text_chunker.py::TestChunkText -v
The test suite includes:
- Paragraph splitting tests: Verify correct handling of various paragraph formats
- Text chunking tests: Ensure proper chunking at different size limits
- HTML merging tests: Validate correct merging of multiple HTML outputs
- Edge case tests: Test Unicode, special characters, very long sentences
- Integration tests: End-to-end workflow validation
This project is dedicated to the public domain under CC0 1.0 Universal (CC0 1.0) Public Domain Dedication - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
For issues or questions:
- Check the troubleshooting section
- Visit spaCy documentation
- Open an issue on GitHub
Made by:
- TESLA - Text Embeddings - Serbian Language Applications
- Language Resources and Technologies Society - Jerteh
Happy Entity Recognition! 🎯