A Streamlit-based application, built as a personal project, for analyzing German language proficiency and text complexity. LexiTrack provides detailed linguistic analysis, error detection, and CEFR level assessment for German learner texts.
- Word Count - Total number of words in the text
- Sentence Count - Number of sentences analyzed
- Average Words per Sentence - Readability metric
- Lexical Diversity Score - Measures vocabulary richness and uniqueness
- MTLD (Measure of Textual Lexical Diversity) - Advanced lexical diversity metric less sensitive to text length
- Repetition Density - Analyzes word repetition patterns with and without stop words
- Clause Density - Counts subordinate and relative clauses
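As a rough illustration of the first three metrics in the list above, the sketch below computes word count, sentence count, and average words per sentence with plain regular expressions. It assumes a simple letter-based tokenizer and punctuation-based sentence splitting; the app itself relies on spaCy, so its counts may differ. The function name `basic_stats` is hypothetical.

```python
import re


def basic_stats(text: str) -> dict:
    """Approximate word/sentence statistics (illustrative, not the app's code)."""
    # Words: runs of Unicode letters, so umlauts and ß are kept; digits are ignored.
    words = re.findall(r"[^\W\d_]+", text)
    # Sentences: split after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    avg = len(words) / len(sentences) if sentences else 0.0
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_words_per_sentence": avg,
    }


stats = basic_stats("Ich bin ein Student und ich lerne Deutsch.")
print(stats["word_count"], stats["sentence_count"])
```

Punctuation-based splitting will miscount abbreviations such as "z.B.", which is one reason a parser-based sentence segmenter is preferable in practice.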
The app identifies and flags common errors in German learner texts:
- Subordinate Clause Accuracy - Checks correct use of subordinate conjunctions (weil, dass, obwohl, wenn, etc.)
- Article Accuracy - Detects incorrect article usage (der, die, das, ein, eine, etc.)
- Verb Morphology - Identifies verb conjugation and tense errors
- Preposition Case Checker - Validates dative and accusative prepositions
- Capitalization Errors - Flags incorrect capitalization (especially for nouns)
- Spelling Check - Detects misspelled words in German and English
- Case Errors (Pronoun Case Heuristic) - Identifies incorrect pronoun cases
- V2 Word Order - Checks verb position in main clauses
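To give a feel for what a rule-based check of this kind can look like, here is a minimal sketch of a preposition case heuristic. It is an assumption-laden illustration, not LexiTrack's implementation: it only looks at a preposition immediately followed by an article, covers a small set of prepositions that always take one case, and skips two-way prepositions such as "in" or "auf". All names (`check_preposition_case`, the word sets) are hypothetical.

```python
import re

# Prepositions that always govern one case (two-way prepositions are skipped).
DATIVE_PREPOSITIONS = {"aus", "bei", "mit", "nach", "seit", "von", "zu"}
ACCUSATIVE_PREPOSITIONS = {"durch", "für", "gegen", "ohne", "um"}

# Article forms that can occur in each case ("den" is accusative masculine
# singular and also dative plural).
DATIVE_ARTICLES = {"dem", "der", "den", "einem", "einer"}
ACCUSATIVE_ARTICLES = {"den", "die", "das", "einen", "eine", "ein"}
ALL_ARTICLES = DATIVE_ARTICLES | ACCUSATIVE_ARTICLES | {"des", "eines"}


def check_preposition_case(text: str) -> list:
    """Return (preposition, article, expected_case) for suspicious pairs."""
    tokens = [w.lower() for w in re.findall(r"[^\W\d_]+", text)]
    issues = []
    for prep, nxt in zip(tokens, tokens[1:]):
        if nxt not in ALL_ARTICLES:
            continue  # only preposition + article pairs are inspected
        if prep in DATIVE_PREPOSITIONS and nxt not in DATIVE_ARTICLES:
            issues.append((prep, nxt, "dative"))
        elif prep in ACCUSATIVE_PREPOSITIONS and nxt not in ACCUSATIVE_ARTICLES:
            issues.append((prep, nxt, "accusative"))
    return issues


print(check_preposition_case("Ich fahre mit das Auto."))
```

A surface heuristic like this cannot tell a preposition from a separable verb particle, so it would need part-of-speech information from a parser before its flags could be trusted.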
- Language Complexity Index (LCI) - Custom metric combining multiple linguistic features
- CEFR Level Estimation - Automatically assigns proficiency level (A1-C2)
- Error Rate Calculation - Overall error frequency per word count
- Proficiency Classification - Beginner, Intermediate, or Advanced
- Interactive Plotly charts and metrics
- Color-coded CEFR level display
- Detailed error breakdowns
- Top repeated words analysis
- Python 3.10+
- 2GB RAM minimum (for spaCy and HanTa models)
- ~500MB disk space for language models
All dependencies are listed in requirements.txt:
- streamlit - Web app framework
- spacy - NLP library for German text processing
- pandas - Data manipulation and analysis
- plotly - Interactive visualizations
- pyspellchecker - Spelling error detection
- HanTa - German lemmatizer and morphological tagger
git clone https://github.com/AmanPhadke/LexiTrack-German_Language_Performance_Tracker.git
cd LexiTrack-German_Language_Performance_Tracker
python -m venv venv
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
pip install -r requirements.txt
python -m spacy download de_core_news_md
The file morphmodel_ger.pgz must be in the project root directory. If it's missing, HanTa will attempt to download it automatically on first run.
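Before launching the app, it can help to confirm that both language resources are in place. The sketch below is one possible check, not part of LexiTrack: it relies on the fact that spaCy models are installed as regular Python packages, so the standard library can look up `de_core_news_md` without loading it, and it tests for `morphmodel_ger.pgz` in a given directory. The function name `check_setup` is hypothetical.

```python
import importlib.util
from pathlib import Path


def check_setup(project_root: str = ".") -> dict:
    """Report whether the spaCy model and the HanTa model file are available."""
    # spaCy models install as Python packages, so find_spec can locate them.
    spacy_model_ok = importlib.util.find_spec("de_core_news_md") is not None
    # The HanTa model is expected in the project root.
    hanta_model_ok = (Path(project_root) / "morphmodel_ger.pgz").is_file()
    return {
        "de_core_news_md": spacy_model_ok,
        "morphmodel_ger.pgz": hanta_model_ok,
    }


for name, ok in check_setup().items():
    print(f"{name}: {'found' if ok else 'missing'}")
```

Run it from the project root inside the activated virtual environment; a "missing" entry points to the corresponding install step.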
Start the Streamlit app:
streamlit run app.py
The application will open in your default web browser at http://localhost:8501
- Paste or type your German essay/text in the text area
- Minimum recommended length: 60+ words for accurate analysis
- Click the "Analyze" button to begin
The app displays results in three tabs:
- Shows your Language Complexity Index (LCI)
- Displays estimated CEFR Level (A1-C2)
- Color-coded proficiency indicator
- Basic text statistics (word count, sentences, etc.)
- Lexical diversity metrics
- MTLD score
- Repetition density analysis
- Clause density breakdown
- Subordinate clause accuracy
- Article usage errors
- Verb morphology issues
- Preposition case errors
- Capitalization mistakes
- Spelling errors
- Pronoun case errors
- V2 word order violations
- Total Error Count with error rate
Range: 0 to 1
What it measures: Ratio of unique words to total words
Higher is better: Indicates richer vocabulary usage
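A score of this kind is commonly computed as a type-token ratio. The sketch below shows that calculation under simple assumptions: lowercased, letter-only tokens and no lemmatization. LexiTrack's own tokenization may differ, so its scores need not match. The function name `type_token_ratio` is hypothetical.

```python
import re


def type_token_ratio(text: str) -> float:
    """Unique words divided by total words, in the range 0 to 1."""
    tokens = [w.lower() for w in re.findall(r"[^\W\d_]+", text)]
    if not tokens:
        return 0.0  # convention chosen here for empty input
    return len(set(tokens)) / len(tokens)


# "der" appears twice, so 4 unique words out of 5.
print(type_token_ratio("Der Hund und der Mann"))
```

Because longer texts inevitably reuse words, this ratio tends to fall as text length grows, which limits comparisons between texts of different sizes.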
What it measures: Average vocabulary richness across the text
Why it's useful: Less affected by text length than traditional measures
Typical range: 30-90 (higher = more diverse vocabulary)
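For readers curious how MTLD works, here is a compact sketch of one common formulation: the text is scanned token by token, a "factor" is counted each time the running type-token ratio drops to the conventional 0.72 threshold, a partial factor is added for the remainder, and the result is averaged over a forward and a backward pass. Details such as the `<=` comparison and the fallback when no factor completes are assumptions of this sketch and vary between implementations; it is not taken from LexiTrack's code.

```python
def _mtld_pass(tokens, threshold):
    """One directional pass: number of tokens divided by the factor count."""
    factors = 0.0
    types = set()
    count = 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count <= threshold:
            factors += 1.0  # a full factor is complete; start a new segment
            types.clear()
            count = 0
    if count > 0:
        # Partial factor for the leftover segment.
        ttr = len(types) / count
        factors += (1.0 - ttr) / (1.0 - threshold)
    if factors == 0:
        # Assumption: if the TTR never drops, fall back to the text length.
        return float(len(tokens))
    return len(tokens) / factors


def mtld(tokens, threshold=0.72):
    """Mean of the forward and backward passes (0.0 for empty input)."""
    tokens = [t.lower() for t in tokens]
    if not tokens:
        return 0.0
    forward = _mtld_pass(tokens, threshold)
    backward = _mtld_pass(tokens[::-1], threshold)
    return (forward + backward) / 2.0


print(mtld("ich gehe und ich gehe und ich gehe".split()))
```

The score can be read as the average number of words the writer sustains before vocabulary starts repeating, which is why very short texts give unstable values.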
Measured with & without stop words
- With stop words: Includes common function words (e.g. der, und, ist)
- Without stop words: Focuses on content words (nouns, verbs, adjectives)
Higher values: May indicate repetitive writing
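The exact formula behind this metric is not stated here, so the following sketch assumes one plausible definition: the percentage of tokens that repeat an earlier token, computed once over all words and once after removing stop words. The stop word set is a tiny illustrative sample, and the names `repetition_density` and `GERMAN_STOP_WORDS` are hypothetical.

```python
import re

# Small illustrative sample; a real list would be much longer.
GERMAN_STOP_WORDS = {
    "der", "die", "das", "ein", "eine", "und", "oder", "ich", "du", "er",
    "sie", "es", "wir", "ist", "bin", "sind", "nicht", "zu", "in", "mit",
}


def repetition_density(text: str, drop_stop_words: bool = False) -> float:
    """Percentage of tokens that repeat an earlier token (assumed definition)."""
    tokens = [w.lower() for w in re.findall(r"[^\W\d_]+", text)]
    if drop_stop_words:
        tokens = [t for t in tokens if t not in GERMAN_STOP_WORDS]
    if not tokens:
        return 0.0
    repeated = len(tokens) - len(set(tokens))
    return 100.0 * repeated / len(tokens)


sentence = "Ich bin ein Student und ich lerne Deutsch."
print(repetition_density(sentence))                        # all words
print(repetition_density(sentence, drop_stop_words=True))  # content words only
```

Comparing the two values shows whether repetition comes mostly from unavoidable function words or from reused content words.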
What it measures: Number of subordinate and relative clauses
Why it matters: Complex clause use is a marker of advanced proficiency
Calculation: Total Errors / Word Count × 100
Interpretation:
- 0-2% = Excellent
- 2-5% = Good
- 5-10% = Fair
- 10%+ = Needs improvement
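The calculation and the bands above translate directly into a few lines of Python. This is a minimal sketch with hypothetical function names; because the listed ranges share their boundary values, it assumes each band includes its lower bound and excludes its upper bound.

```python
def error_rate(total_errors: int, word_count: int) -> float:
    """Errors per 100 words: total errors / word count * 100."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return total_errors / word_count * 100


def interpret_error_rate(rate: float) -> str:
    """Map an error rate (in percent) to the bands listed above."""
    # Assumption: a boundary value belongs to the higher band, e.g. 2.0 is "Good".
    if rate < 2:
        return "Excellent"
    if rate < 5:
        return "Good"
    if rate < 10:
        return "Fair"
    return "Needs improvement"


rate = error_rate(total_errors=1, word_count=8)
print(rate, interpret_error_rate(rate))
```

For very short texts a single flagged error dominates the rate: one error in eight words already gives 12.5%.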
- A1 - Beginner
- A2 - Elementary
- B1 - Intermediate
- B2 - Upper Intermediate
- C1 - Advanced
- C2 - Mastery
- The analysis uses spaCy's dependency parser, which has limitations with complex sentence structures
- Beginner texts (A1-A2) may produce false positives in error detection due to non-standard sentence structure
- Results are most accurate for texts with:
- Clear sentence boundaries
- Standard German grammar
- At least 60 words (recommended)
- Grammar detection is rule-based and heuristic-driven
- Not all errors may be caught (especially nuanced ones)
- Some false positives may occur, especially with:
- Ambiguous sentence structures
- Poetry or creative writing
- Colloquial German
- Write clear, well-structured sentences
- Use standard German grammar and spelling
- For best results, submit 60+ word texts
- Review suggested errors but use your judgment
- Use results as learning guidance, not absolute truth
1. Language Processing
- spaCy (v3.5.0) - Tokenization, POS tagging, dependency parsing
- HanTa - German lemmatization and morphological analysis
2. Analysis Modules
- Basic Features - Word/sentence counts and content ratios
- Lexical Analysis - Diversity scores and MTLD calculation
- Error Detection - Grammar and spelling error identification
- Proficiency Assessment - CEFR level calculation
3. Visualization
- Streamlit - Web UI and interactive components
- Plotly - Interactive charts and metrics
Solution:
python -m spacy download de_core_news_md
Solution: Ensure morphmodel_ger.pgz is in the project root, or it will auto-download on first run
Possible causes:
- First time loading models (normal)
- Large text input (>10,000 words)
- Insufficient RAM
Solution: Restart the app or reduce text size
Why: spaCy has limitations with complex or non-standard structures
Solution: Treat error flags as suggestions and verify them against German grammar resources (see the notice at the top of the error tab)
Input: "Ich bin ein Student und ich lerne Deutsch."
Results:
- Word Count: 8
- Sentence Count: 1
- Lexical Diversity: 0.75
- MTLD: 6.5
- Repetition Density: 12.5%
- Subordinate Clauses: 0
- Errors Found: 1 (capitalization: "Student" after lowercase)
- Estimated CEFR Level: A1
- Error Rate: 12.5%
Contributions are welcome! Areas for improvement:
- Enhanced error detection accuracy
- Support for more languages
- Better handling of colloquial German
- Additional proficiency assessment metrics
This project is open source. Please check the LICENSE file for details.
Aman Phadke
GitHub: @AmanPhadke
- spaCy - Industrial-strength NLP
- HanTa - German morphological analysis
- Streamlit - Making ML apps easy to build
- Plotly - For Vocabulary Growth visualization
- Pandas - For building DataFrames and general data handling
For issues, suggestions, or feedback:
- Open a GitHub issue
- Provide details about the problem or suggestion
- Include example text if reporting errors
Potential future features:
- Multi-language support
- Advanced ML-based error detection
- Text difficulty scoring
- Writing style analysis
- Personalized learning recommendations
- Export analysis reports
Last Updated: March 2026
Version: 1.0.0