Complete API reference for the textxtract package.
The synchronous text extractor for blocking operations.
from textxtract import SyncTextExtractor

SyncTextExtractor(config: Optional[ExtractorConfig] = None)

Parameters:
config (optional): Configuration object for customizing extraction behavior

extract(
    source: Union[Path, str, bytes],
    filename: Optional[str] = None,
    config: Optional[dict] = None
) -> str

Extract text synchronously from a file path or bytes.
Parameters:
source: File path (Path/str) or file bytes
filename: Required if source is bytes, optional for file paths
config: Optional configuration overrides
Returns:
str: Extracted text
Raises:
ValueError: If filename is missing when source is bytes
FileTypeNotSupportedError: If file extension is not supported
InvalidFileError: If file is invalid or corrupted
ExtractionError: If extraction fails
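The filename rule above presumably exists because the file type is identified from the extension, which raw bytes do not carry. The following is a minimal sketch of that calling convention using only the standard library. It is not the package's implementation, and `resolve_extension` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_extension(source: Union[Path, str, bytes],
                      filename: Optional[str] = None) -> str:
    """Return the lowercase extension that would identify the file type.

    Hypothetical helper: mirrors the documented rule that bytes input
    needs an accompanying filename, while a path carries its own name.
    """
    if isinstance(source, bytes):
        if filename is None:
            raise ValueError("filename is required when source is bytes")
        return Path(filename).suffix.lower()
    # For path input, an explicit filename (if given) takes precedence here;
    # that precedence is an assumption of this sketch.
    return Path(filename or source).suffix.lower()


print(resolve_extension("reports/document.PDF"))       # .pdf
print(resolve_extension(b"%PDF-1.7", "document.pdf"))  # .pdf
```

Calling `resolve_extension(b"...")` without a filename raises `ValueError`, matching the first entry in the list of raised exceptions.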
Examples:
from pathlib import Path
from textxtract import SyncTextExtractor

extractor = SyncTextExtractor()
# From file path
text = extractor.extract("document.pdf")
text = extractor.extract(Path("document.pdf"))
# From bytes (filename required)
with open("document.pdf", "rb") as f:
    file_bytes = f.read()
text = extractor.extract(file_bytes, "document.pdf")
# With custom config
config = {"encoding": "utf-8", "max_file_size": 50 * 1024 * 1024}
text = extractor.extract("document.pdf", config=config)

# As a context manager
with SyncTextExtractor() as extractor:
    text = extractor.extract("document.pdf")

The asynchronous text extractor for non-blocking operations.
from textxtract import AsyncTextExtractor

AsyncTextExtractor(config: Optional[ExtractorConfig] = None)

Parameters:
config (optional): Configuration object for customizing extraction behavior

async extract(
    source: Union[Path, str, bytes],
    filename: Optional[str] = None,
    config: Optional[dict] = None
) -> str

Extract text asynchronously from a file path or bytes using a thread pool.
Parameters:
source: File path (Path/str) or file bytes
filename: Required if source is bytes, optional for file paths
config: Optional configuration overrides
Returns:
str: Extracted text
Raises:
ValueError: If filename is missing when source is bytes
FileTypeNotSupportedError: If file extension is not supported
InvalidFileError: If file is invalid or corrupted
ExtractionError: If extraction fails
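The description above says the asynchronous extractor runs extraction in a thread pool. The general pattern can be sketched with `asyncio.to_thread` (Python 3.9+). Here `slow_extract` is a hypothetical blocking stand-in, not a textxtract function, and the package's own mechanism may differ.

```python
import asyncio
import time
from typing import List


def slow_extract(name: str) -> str:
    """Hypothetical blocking stand-in for a synchronous extraction call."""
    time.sleep(0.1)  # simulate blocking file parsing
    return f"text from {name}"


async def extract_all(names: List[str]) -> List[str]:
    # Each blocking call runs in a worker thread, so the event loop stays
    # free and the calls overlap instead of running back to back.
    tasks = [asyncio.to_thread(slow_extract, name) for name in names]
    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*tasks)


results = asyncio.run(extract_all(["a.pdf", "b.docx", "c.txt"]))
print(results)
```

Because `extract` is documented as `async`, the same `asyncio.gather` call should also work over `extractor.extract(path)` coroutines from a single `AsyncTextExtractor`.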
Examples:
import asyncio
from textxtract import AsyncTextExtractor

async def extract_text():
    extractor = AsyncTextExtractor()
    # From file path
    text = await extractor.extract("document.pdf")
    # From bytes
    with open("document.pdf", "rb") as f:
        file_bytes = f.read()
    text = await extractor.extract(file_bytes, "document.pdf")
    return text

text = asyncio.run(extract_text())

# As an async context manager (must run inside a coroutine)
async def extract_with_context():
    async with AsyncTextExtractor() as extractor:
        return await extractor.extract("document.pdf")

text = asyncio.run(extract_with_context())

Configuration class for customizing extraction behavior.
from textxtract.core import ExtractorConfig

ExtractorConfig(
    encoding: str = "utf-8",
    max_file_size: int = 100 * 1024 * 1024,  # 100 MB
    logging_level: str = "INFO"
)

Parameters:
encoding: Default text encoding
max_file_size: Maximum file size in bytes
logging_level: Logging verbosity level
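Since `max_file_size` is expressed in bytes, it can help to compute limits from megabytes and to check a file before handing it to an extractor. This is a small sketch using only the standard library. `mb` and `within_limit` are hypothetical helper names, and how the package itself enforces the limit is not shown here.

```python
import os
import tempfile


def mb(n: float) -> int:
    """Convert megabytes to bytes, using 1 MB = 1024 * 1024 bytes."""
    return int(n * 1024 * 1024)


def within_limit(path: str, max_file_size: int = mb(100)) -> bool:
    """Return True if the file at `path` is no larger than `max_file_size` bytes."""
    return os.path.getsize(path) <= max_file_size


# Demonstrate with a 2 KB temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 2048)
    tmp_path = tmp.name

print(mb(50))                        # 52428800
print(within_limit(tmp_path))        # True
print(within_limit(tmp_path, 1024))  # False
os.remove(tmp_path)
```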
Example:
from textxtract import SyncTextExtractor
from textxtract.core import ExtractorConfig

config = ExtractorConfig(
    encoding="utf-8",
    max_file_size=50 * 1024 * 1024,  # 50 MB
    logging_level="DEBUG"
)
extractor = SyncTextExtractor(config)

All exceptions are in the textxtract.core.exceptions module.
Base exception for all extraction-related errors.
from textxtract.core.exceptions import ExtractionError

Raised when the file extension is not supported.
from textxtract.core.exceptions import FileTypeNotSupportedError

Raised when the file is invalid, corrupted, or not found.
from textxtract.core.exceptions import InvalidFileError

Raised when extraction exceeds the allowed timeout.
from textxtract.core.exceptions import ExtractionTimeoutError

Example Error Handling:
from textxtract import SyncTextExtractor
from textxtract.core.exceptions import (
    ExtractionError,
    FileTypeNotSupportedError,
    InvalidFileError
)

extractor = SyncTextExtractor()
try:
    text = extractor.extract("document.pdf")
except FileTypeNotSupportedError as e:
    print(f"Unsupported file type: {e}")
except InvalidFileError as e:
    print(f"Invalid file: {e}")
except ExtractionError as e:
    print(f"Extraction failed: {e}")

Data class containing file information.
from textxtract.core.utils import FileInfo

filename: str - Name of the file
size_bytes: int - File size in bytes
size_mb: float - File size in megabytes
size_kb: float - File size in kilobytes (property)
extension: str - File extension
is_temp: bool - Whether file is temporary
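To make the relationship between the stored fields and the `size_kb` property concrete, here is a hypothetical stand-in with the same field names. It is an illustration only: the real `FileInfo` may be constructed differently, and the 1 KB = 1024 bytes convention is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class FileInfoSketch:
    """Hypothetical stand-in mirroring the documented FileInfo fields."""
    filename: str
    size_bytes: int
    size_mb: float
    extension: str
    is_temp: bool = False

    @property
    def size_kb(self) -> float:
        # Derived from size_bytes; assumes 1 KB = 1024 bytes.
        return self.size_bytes / 1024


info = FileInfoSketch(
    filename="document.pdf",
    size_bytes=3 * 1024 * 1024,
    size_mb=3.0,
    extension=".pdf",
)
print(info.size_kb)  # 3072.0
```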
| Extension | Handler Class | Optional Dependency |
|---|---|---|
| .txt, .text | TXTHandler | None |
| .pdf | PDFHandler | pymupdf |
| .docx | DOCXHandler | python-docx |
| .doc | DOCHandler | antiword |
| .md | MDHandler | markdown, beautifulsoup4 |
| .rtf | RTFHandler | striprtf |
| .html, .htm | HTMLHandler | beautifulsoup4, lxml |
| .csv | CSVHandler | None |
| .json | JSONHandler | None |
| .xml | XMLHandler | lxml |
| .zip | ZIPHandler | None |

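The table above can be turned into a quick pre-flight check that reports whether a file's extension is listed and which optional dependencies it needs. This is a sketch built only from the table contents. `OPTIONAL_DEPS` and `required_deps` are hypothetical names, not part of textxtract.

```python
from pathlib import Path
from typing import Optional, Tuple

# Extension -> optional dependencies, copied from the table above.
OPTIONAL_DEPS = {
    ".txt": (),
    ".text": (),
    ".pdf": ("pymupdf",),
    ".docx": ("python-docx",),
    ".doc": ("antiword",),
    ".md": ("markdown", "beautifulsoup4"),
    ".rtf": ("striprtf",),
    ".html": ("beautifulsoup4", "lxml"),
    ".htm": ("beautifulsoup4", "lxml"),
    ".csv": (),
    ".json": (),
    ".xml": ("lxml",),
    ".zip": (),
}


def required_deps(filename: str) -> Optional[Tuple[str, ...]]:
    """Return the optional dependencies for a file, or None if the extension is not listed."""
    return OPTIONAL_DEPS.get(Path(filename).suffix.lower())


print(required_deps("notes.MD"))    # ('markdown', 'beautifulsoup4')
print(required_deps("data.csv"))    # ()
print(required_deps("photo.png"))   # None
```

An empty tuple means the extension is listed and needs no extra dependency, while `None` means the extension does not appear in the table.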
The package uses Python's standard logging module with the following loggers:
textxtract.sync - Synchronous extractor logs
textxtract.aio - Asynchronous extractor logs
textxtract.utils - Utility function logs
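Because these logger names share the `textxtract` prefix, the standard `logging` hierarchy applies: a level set on `textxtract` is inherited by the child loggers unless a child sets its own. The sketch below uses only the standard library and assumes nothing else has already set levels on these loggers.

```python
import logging

# Quiet default for everything under the "textxtract" namespace...
logging.getLogger("textxtract").setLevel(logging.WARNING)
# ...but verbose output for the asynchronous extractor only.
logging.getLogger("textxtract.aio").setLevel(logging.DEBUG)

sync_level = logging.getLogger("textxtract.sync").getEffectiveLevel()
aio_level = logging.getLogger("textxtract.aio").getEffectiveLevel()
print(logging.getLevelName(sync_level))  # WARNING
print(logging.getLevelName(aio_level))   # DEBUG
```

Levels only control filtering. A handler, for example one installed by `logging.basicConfig()`, is still needed for the records to be written anywhere.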
Configure logging:
import logging
# Set debug level for detailed logs
logging.basicConfig(level=logging.DEBUG)
# Or configure a specific logger
logger = logging.getLogger("textxtract")
logger.setLevel(logging.INFO)

The package is fully typed. Import types for better IDE support:
from typing import Union, Optional
from pathlib import Path
from textxtract import SyncTextExtractor, AsyncTextExtractor
from textxtract.core import ExtractorConfig
from textxtract.core.exceptions import ExtractionError
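
One way to use these types in application code is to annotate helpers against the documented `extract` signature. The sketch below describes that signature with a `typing.Protocol`, so the helper accepts any object with a matching method. `Extractor`, `word_count`, and `FakeExtractor` are hypothetical names introduced here so the example runs without textxtract installed.

```python
from pathlib import Path
from typing import Optional, Protocol, Union

Source = Union[Path, str, bytes]


class Extractor(Protocol):
    """Structural type for anything with the documented extract() signature."""

    def extract(self, source: Source, filename: Optional[str] = None,
                config: Optional[dict] = None) -> str: ...


def word_count(extractor: Extractor, source: Source,
               filename: Optional[str] = None) -> int:
    """Extract text and count whitespace-separated words."""
    return len(extractor.extract(source, filename).split())


class FakeExtractor:
    """Hypothetical stand-in for a real extractor, used only for this demo."""

    def extract(self, source: Source, filename: Optional[str] = None,
                config: Optional[dict] = None) -> str:
        return "hello typed world"


print(word_count(FakeExtractor(), "document.pdf"))  # 3
```

A `SyncTextExtractor` instance should satisfy the same protocol, since its `extract` method is documented with this signature.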