A comprehensive, production-ready pipeline for processing, cleaning, and managing Pashto language datasets. This system combines advanced NLP techniques, web scraping, PDF processing, and dataset management capabilities specifically designed for the Pashto language.
- 🔤 Text Processing: Advanced Pashto text normalization, tokenization, and cleaning
- 🌐 Web Scraping: Multi-source data collection from Pashto websites and news sources
- 📄 PDF Processing: Extract and process text from Pashto PDF documents
- 🎯 Quality Control: Built-in quality assessment and deduplication
- 📊 Dataset Management: Create, manage, and export Pashto datasets
- 🔄 Pipeline Orchestration: Flexible, configurable processing pipelines
- 📈 Monitoring: Comprehensive logging and progress tracking
- Unicode Support: Full Unicode compatibility for Pashto script
- Character Normalization: Standardize Pashto character variations
- Digit Normalization: Convert between Western, Pashto, and Arabic numerals
- Diacritic Handling: Optional diacritic removal and standardization
- Tokenization: Context-aware Pashto word and sentence tokenization
- Stopwords: Built-in Pashto stopword lists
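Digit normalization, for example, comes down to a character translation table. The snippet below is an illustrative sketch independent of the package's actual API (`PashtoNormalizer` handles this internally), using only the Python standard library:

```python
# Map Pashto (U+06F0-U+06F9) and Arabic (U+0660-U+0669) digits to Western 0-9.
PASHTO_DIGITS = "۰۱۲۳۴۵۶۷۸۹"   # Extended Arabic-Indic digits used in Pashto
ARABIC_DIGITS = "٠١٢٣٤٥٦٧٨٩"   # Arabic-Indic digits
WESTERN_DIGITS = "0123456789"

TO_WESTERN = str.maketrans(PASHTO_DIGITS + ARABIC_DIGITS, WESTERN_DIGITS * 2)

def normalize_digits(text: str) -> str:
    """Convert any Pashto or Arabic digits in `text` to Western digits."""
    return text.translate(TO_WESTERN)

print(normalize_digits("کال ۲۰۲۴"))  # → کال 2024
```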
```text
Pashto-Processing-pipeline/
├── pashto_pipeline/          # Main package - Core processing modules
│   ├── core/                 # Pipeline orchestration
│   ├── preprocessing/        # Text normalization and tokenization
│   └── utils/                # Utility functions
│
├── code/pashto_dataset/      # Extended dataset processing system
│   ├── pipeline/             # Advanced pipeline management
│   ├── text_processor/       # Text processing modules
│   ├── scrapers/             # Web scraping tools
│   ├── pdf_processor/        # PDF extraction
│   └── dataset_manager/      # Dataset creation and management
│
├── docs/                     # Documentation
│   ├── guides/               # User guides and tutorials
│   ├── api/                  # API documentation
│   └── troubleshooting/      # Common issues and FAQ
│
├── examples/                 # Example code and configurations
│   ├── python/               # Python examples
│   ├── config/               # Sample configurations
│   └── bash/                 # Shell scripts
│
├── tests/                    # Test suite
│   ├── unit/                 # Unit tests
│   └── integration/          # Integration tests
│
├── data/                     # Data directories
│   ├── raw/                  # Raw input data
│   └── processed/            # Processed output data
│
└── models/                   # Trained models and resources
```
```bash
# Clone the repository
git clone https://github.com/tasal9/Pashto-Processing-pipeline.git
cd Pashto-Processing-pipeline

# Run the installation script
bash install.sh

# Activate the virtual environment
source venv/bin/activate
```

Or install manually:

```bash
# Clone the repository
git clone https://github.com/tasal9/Pashto-Processing-pipeline.git
cd Pashto-Processing-pipeline

# Create a virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -e .
```

Basic usage:

```python
from pashto_pipeline import PashtoNormalizer, PashtoTokenizer

# Initialize components
normalizer = PashtoNormalizer(
    normalize_whitespace=True,
    normalize_digits='western'
)
tokenizer = PashtoTokenizer(preserve_punctuation=True)

# Process text
text = "سلام دنیا! دا د پښتو متن پروسس کولو یوه ساده بېلګه ده."
normalized = normalizer.normalize(text)
tokens = tokenizer.tokenize(normalized)
print(f"Tokens: {tokens}")
```

Building a multi-step pipeline:

```python
from pashto_pipeline import TextProcessingPipeline, PashtoNormalizer, PashtoTokenizer

# Create pipeline
pipeline = TextProcessingPipeline()

# Add processing steps
pipeline.add_step('normalize', PashtoNormalizer().normalize)
pipeline.add_step('tokenize', PashtoTokenizer().tokenize)

# Process text
result = pipeline.process("سلام دنیا!", verbose=True)
print(result)

# Batch processing
texts = ["زه په کابل کې اوسېږم.", "پښتو یوه ښکلې ژبه ده."]
results = pipeline.process_batch(texts)
```

Advanced text processing and quality filtering:

```python
from code.pashto_dataset.text_processor.text_normalizer import PashtoTextNormalizer
from code.pashto_dataset.text_processor.quality_filter import QualityFilter

# Initialize advanced components
normalizer = PashtoTextNormalizer()
quality_filter = QualityFilter()

# Process and filter text
text = "سلام دنیا"
normalized = normalizer.normalize(text)
is_quality = quality_filter.is_high_quality(normalized)
print(f"Text quality: {'High' if is_quality else 'Low'}")
```

- Installation Guide
- Configuration Guide
- Usage Tutorials
- Best Practices
- API Reference
- Troubleshooting
- FAQ
Check the examples/ directory for:
- Python Examples: examples/python/
- Configuration Files: examples/config/
- Shell Scripts: examples/bash/
The pipeline is highly configurable. See examples/config/ for sample configuration files:
```yaml
# example: basic_config.yaml
pipeline:
  name: "pashto_processing"
  version: "1.0"

processing:
  normalize:
    unicode_form: "NFC"
    remove_diacritics: false
    normalize_digits: "western"
  tokenize:
    preserve_punctuation: true
    lowercase: false
```

- TextProcessingPipeline: Main pipeline orchestrator
- PashtoNormalizer: Unicode and character normalization
- PashtoTokenizer: Word and sentence tokenization
- StopwordsRemover: Remove common Pashto stopwords
- Web Scrapers: Collect data from Pashto websites
- PDF Processor: Extract text from PDFs
- Quality Filter: Assess text quality
- Deduplicator: Remove duplicate content
- Dataset Manager: Create and manage datasets
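As an illustration of what the Deduplicator stage does conceptually, exact duplicates can be dropped by hashing whitespace-normalized text. This is a sketch only; the real component's interface and matching strategy may differ (e.g. fuzzy or near-duplicate detection):

```python
import hashlib

def deduplicate(texts):
    """Remove exact duplicates, ignoring whitespace differences.

    Keeps the first occurrence of each distinct text, preserving order.
    """
    seen = set()
    unique = []
    for text in texts:
        # Collapse runs of whitespace before hashing so trivially
        # reformatted copies are treated as duplicates.
        key = hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

docs = ["سلام دنیا", "سلام  دنیا", "پښتو ژبه"]
print(deduplicate(docs))  # keeps one copy of the duplicated sentence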
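```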
Run the test suite:
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=pashto_pipeline --cov-report=html

# Run a specific test file
pytest tests/unit/test_normalizer.py
```

- Web Scraping: Pashto news sites, forums, and blogs
- PDF Documents: Books, articles, and reports
- Text Files: Plain text, CSV, JSON
- Databases: SQL and NoSQL databases
- APIs: Integration with external Pashto language APIs
- Pashto Dialects: Southern, Northern, and Western Pashto
- Script: Perso-Arabic script (Pashto variant)
- Encodings: UTF-8, UTF-16, legacy encodings
- Normalization: NFC, NFD, NFKC, NFKD forms
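Unicode normalization matters especially for text extracted from PDFs, where Perso-Arabic presentation-form ligatures are common. A small sketch with Python's standard unicodedata module (not the package's own API):

```python
import unicodedata

# NFKC folds compatibility characters, such as presentation-form
# ligatures found in legacy and PDF-extracted text, down to standard
# letters; NFC keeps the canonical composed form for storage.
raw = "\uFEFB"                               # ARABIC LIGATURE LAM WITH ALEF, isolated form
folded = unicodedata.normalize("NFKC", raw)  # → "لا" (lam + alef, two code points)

print(len(raw), len(folded))  # → 1 2
```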
- Memory Efficient: Optimized for large datasets
- Batch Processing: Process multiple texts efficiently
- Progress Tracking: Real-time monitoring with tqdm
- Parallel Processing: Multi-threading support (where applicable)
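The batch and parallel options above can be approximated with the standard library alone; the sketch below uses a thread pool and assumes nothing about the pipeline's own process_batch signature:

```python
from concurrent.futures import ThreadPoolExecutor

def process_in_parallel(texts, process_fn, max_workers=4):
    """Apply process_fn to each text concurrently, preserving input order.

    Illustrative only; threads help when process_fn is I/O-bound
    (scraping, PDF reads), while CPU-bound steps may need processes.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_fn, texts))

results = process_in_parallel(["  سلام دنیا  ", "  پښتو ژبه  "], str.strip)
print(results)
```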
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Pashto NLP Research Community
- Unicode Consortium for Pashto script support
- Contributors and maintainers
- Issues: GitHub Issues
- Documentation: See the docs/ directory
- Examples: See the examples/ directory
- Add more Pashto-specific NLP tools
- Improve dialect detection
- Add part-of-speech tagging
- Support for speech-to-text integration
- Enhanced OCR for Pashto handwriting
- Cloud deployment guides
Made with ❤️ for the Pashto language community