Comprehensive testing guide for the textxtract package.
The project uses pytest for all tests with support for both synchronous and asynchronous testing.
Install the package with all optional dependencies for complete testing:
pip install textxtract[all]
pip install pytest pytest-asyncio# Run all tests
pytest
# Run with verbose output
pytest -v
# Run specific test file
pytest tests/test_sync.py
pytest tests/test_async.py
# Run tests with coverage
pytest --cov=textxtract# Run only synchronous tests
pytest tests/test_sync.py
# Run only asynchronous tests
pytest tests/test_async.py
# Run exception handling tests
pytest tests/test_exceptions.py
# Run edge case tests
pytest tests/test_edge_cases.pytests/
├── __init__.py
├── test_sync.py # Synchronous extractor tests
├── test_async.py # Asynchronous extractor tests
├── test_exceptions.py # Error handling tests
├── test_edge_cases.py # Edge cases and validation
└── files/ # Sample test files
├── text_file.txt
├── text_file.pdf
├── text_file.docx
├── markdown.md
├── text.csv
├── text.json
├── text.html
├── text.xml
└── ...
The test suite covers all supported file types:
| File Type | Test Files | Sync Tests | Async Tests |
|---|---|---|---|
| Plain Text | text_file.txt, text_file.text |
✅ | ✅ |
| Markdown | markdown.md |
✅ | ✅ |
text_file.pdf |
✅ | ✅ | |
| Word | text_file.docx |
✅ | ✅ |
| Legacy Word | text_file.doc |
✅ | ✅ |
| Rich Text | text_file.rtf |
✅ | ✅ |
| HTML | text.html |
✅ | ✅ |
| CSV | text.csv |
✅ | ✅ |
| JSON | text.json |
✅ | ✅ |
| XML | text.xml |
✅ | ✅ |
| ZIP | text_zip.zip |
✅ | ✅ |
- ✅ File path extraction (
extractor.extract("/path/to/file.pdf")) - ✅ Bytes extraction (
extractor.extract(file_bytes, "file.pdf")) - ✅ Both sync and async methods
- ✅ Error handling for unsupported types
- ✅ Context manager usage
- ✅
FileTypeNotSupportedErrorfor unsupported extensions - ✅
InvalidFileErrorfor corrupted/missing files - ✅
ExtractionErrorfor extraction failures - ✅
ValueErrorfor missing filename with bytes input
import pytest
from pathlib import Path
from textxtract import SyncTextExtractor
from textxtract.core.exceptions import FileTypeNotSupportedError
def test_custom_file_extraction():
extractor = SyncTextExtractor()
# Test with file path
text = extractor.extract("path/to/test/file.txt")
assert isinstance(text, str)
assert len(text) > 0
# Test with bytes
with open("path/to/test/file.txt", "rb") as f:
file_bytes = f.read()
text = extractor.extract(file_bytes, "file.txt")
assert isinstance(text, str)
assert len(text) > 0import pytest
from textxtract import AsyncTextExtractor
@pytest.mark.asyncio
async def test_async_extraction():
extractor = AsyncTextExtractor()
text = await extractor.extract("path/to/test/file.txt")
assert isinstance(text, str)
assert len(text) > 0import pytest
from textxtract import SyncTextExtractor
from textxtract.core.exceptions import (
FileTypeNotSupportedError,
InvalidFileError
)
def test_error_handling():
extractor = SyncTextExtractor()
# Test unsupported file type
with pytest.raises(FileTypeNotSupportedError):
extractor.extract(b"dummy", "file.unsupported")
# Test missing file
with pytest.raises(InvalidFileError):
extractor.extract("nonexistent_file.txt")
# Test missing filename with bytes
with pytest.raises(ValueError):
extractor.extract(b"dummy bytes")import psutil
import os
from textxtract import SyncTextExtractor
def test_memory_usage():
process = psutil.Process(os.getpid())
initial_memory = process.memory_info().rss
extractor = SyncTextExtractor()
# Process large file
text = extractor.extract("large_file.pdf")
final_memory = process.memory_info().rss
memory_increase = final_memory - initial_memory
# Assert reasonable memory usage
assert memory_increase < 100 * 1024 * 1024 # Less than 100MBimport asyncio
import pytest
from textxtract import AsyncTextExtractor
@pytest.mark.asyncio
async def test_concurrent_extraction():
extractor = AsyncTextExtractor()
files = ["file1.txt", "file2.pdf", "file3.docx"]
# Process files concurrently
tasks = [extractor.extract(file) for file in files]
results = await asyncio.gather(*tasks, return_exceptions=True)
# Verify all succeeded
for result in results:
assert isinstance(result, str)
assert len(result) > 0[tool:pytest]
minversion = 6.0
addopts = -ra -q --tb=short
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
asyncio_mode = auto
markers =
slow: marks tests as slow (deselect with '-m "not slow"')
integration: marks tests as integration tests# Run only fast tests
pytest -m "not slow"
# Run only integration tests
pytest -m integration
# Run tests with specific keyword
pytest -k "test_sync"
# Run tests and stop on first failure
pytest -x# Maximum verbosity
pytest -vvv
# Show local variables in tracebacks
pytest --tb=long
# Show stdout/stderr
pytest -simport pytest
def test_with_debugger():
extractor = SyncTextExtractor()
# Set breakpoint
pytest.set_trace()
text = extractor.extract("test_file.txt")
assert text# Generate coverage report
pytest --cov=textxtract --cov-report=html
# View coverage in terminal
pytest --cov=textxtract --cov-report=term-missing
# Generate XML coverage for CI
pytest --cov=textxtract --cov-report=xml# Generate JUnit XML for CI systems
pytest --junitxml=test-results.xmlname: Tests
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: [3.9, 3.10, 3.11, 3.12]
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
pip install -e .[all]
pip install pytest pytest-asyncio pytest-cov
- name: Run tests
run: pytest --cov=textxtractBefore submitting changes, ensure:
- All existing tests pass
- New features have corresponding tests
- Error conditions are tested
- Both sync and async methods are tested
- Documentation examples are tested
- Performance regressions are checked
- Memory leaks are verified
Missing dependencies:
pip install textxtract[all] pytest pytest-asyncioImport errors:
# Ensure correct import
from textxtract import SyncTextExtractor # Correct
from text_extractor import SyncTextExtractor # WrongAsync test issues:
# Ensure pytest-asyncio is installed and configured
pytest.mark.asyncio # Required for async testsFile not found errors:
# Use absolute paths in tests
TEST_FILES_DIR = Path(__file__).parent / "files"
file_path = TEST_FILES_DIR / "test_file.txt"For more testing help, see the API Reference or Usage Guide.