All module files have been implemented following the professional, production-ready template pattern. Every function is:
- ✅ Self-documenting with comprehensive docstrings
- ✅ Registered dynamically via the `@register_function` decorator
- ✅ Type-hinted for IDE support
- ✅ Audit-logged for compliance tracking
- ✅ Error-handled with meaningful exceptions
- ✅ Test-ready, with clear examples in docstrings

Fundamental data cleaning operations.

| File | Functions | Purpose |
|---|---|---|
| `columns.py` | `clean_column_headers`, `make_unique_columns` | Header standardization & deduplication |
| `types.py` | `convert_data_types`, `clean_numeric_column`, `clean_boolean_column`, `clean_date_column` | Type conversion & parsing |
| `duplicates.py` | `find_duplicates`, `deduplicate_by_priority`, `remove_duplicates` | Duplicate detection & removal |
| `missing.py` | `coerce_empty_to_nan`, `fill_missing` | Missing value handling |
| `outliers.py` | `detect_outliers_iqr`, `remove_outliers_iqr`, `remove_outliers_zscore`, `flag_outliers`, `cap_outliers`, `winsorize_outliers` | Outlier detection & handling |
| `text.py` | `clean_text_column`, `standardize_text_values`, `clean_categorical_column` | Text normalization |

Total: 20 functions
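
The table lists the IQR-based outlier helpers without showing the rule they apply. As a rough illustration only, here is a minimal sketch of the usual 1.5 × IQR fences on a plain list of numbers. The names `iqr_bounds`, `flag_iqr_outliers` and `cap_to_iqr` are hypothetical and are not the toolkit's functions, whose signatures and behavior are not shown in this document. The quartiles use linear interpolation, which is also the default in pandas `Series.quantile`.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_iqr_outliers(values, k=1.5):
    """Return one boolean per value: True where it falls outside the fences."""
    lower, upper = iqr_bounds(values, k)
    return [v < lower or v > upper for v in values]


def cap_to_iqr(values, k=1.5):
    """Clip every value into the [lower, upper] fence range."""
    lower, upper = iqr_bounds(values, k)
    return [min(max(v, lower), upper) for v in values]


amounts = [10, 12, 11, 13, 12, 95]   # illustrative data, not from the toolkit
print(iqr_bounds(amounts))           # fences around the bulk of the data
print(flag_iqr_outliers(amounts))    # only the last value is flagged
print(cap_to_iqr(amounts))           # the last value is clipped to the upper fence
```

Flagging, removing and capping differ only in what is done with the values outside the fences, which is presumably why the module offers several variants.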

Feature engineering for machine learning.

| File | Functions | Purpose |
|---|---|---|
| `categorical.py` | `limit_cardinality`, `rare_category_handler`, `encode_categorical_variables` | Categorical feature engineering |
| `datetime.py` | `extract_date_features`, `create_period_keys`, `create_fiscal_calendar_features`, `lag_features` | Time-series feature engineering |

Total: 7 functions

Domain-specific financial data utilities.

| File | Functions | Purpose |
|---|---|---|
| `parsing.py` | `parse_currency`, `parse_percentage`, `clean_accounting_negative` | Financial value parsing |
| `rules.py` | `impute_by_rule`, `detect_outliers_groupwise`, `seasonality_aware_outliers`, `validate_sign_conventions`, `check_balanced_entries` | Finance-specific validation |
| `entities.py` | `standardize_entity_names`, `strip_legal_suffixes`, `normalize_reference_codes` | Entity & reference standardization |

Total: 11 functions
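
To make "financial value parsing" concrete, the sketch below shows one way currency strings, accounting-style negatives and percentages could be turned into floats. It is an assumption-laden illustration, not the code in `parsing.py`: `parse_amount` and `parse_percent` are hypothetical names, and the sketch assumes `.` as the decimal separator and `,` as the thousands separator.

```python
import re

# Everything except digits, the decimal point and a minus sign is treated as noise
_NON_NUMERIC = re.compile(r"[^\d.\-]")


def parse_amount(text: str) -> float:
    """Parse strings such as '$1,234.50', '(1,234.50)' or '-$5.00' into a float."""
    cleaned = text.strip()
    # Accounting convention: a value wrapped in parentheses is negative
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    digits = _NON_NUMERIC.sub("", cleaned)
    if digits.startswith("-"):
        negative = True
        digits = digits[1:]
    try:
        value = float(digits)
    except ValueError:
        raise ValueError(f"Cannot parse amount from {text!r}") from None
    return -value if negative else value


def parse_percent(text: str) -> float:
    """Parse '12.5%' into the fraction 0.125."""
    return parse_amount(text.strip().rstrip("%")) / 100


print(parse_amount("$1,234.50"))
print(parse_amount("(1,234.50)"))
print(parse_percent("12.5%"))
```

Raising `ValueError` on unparseable input, rather than returning `NaN`, is a design choice made for this sketch; a column-level cleaner may prefer to coerce and report instead.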

Safe data reading and writing.

| File | Functions | Purpose |
|---|---|---|
| `readers.py` | `read_csv_safely`, `read_excel_safely`, `chunked_processing` | Safe file input |
| `writers.py` | `export_parquet`, `export_validation_report` | Safe file output |

Total: 5 functions

Data integrity and constraint validation.

| File | Functions | Purpose |
|---|---|---|
| `schema.py` | `standardize_schema`, `validate_required_fields`, `validate_category_set` | Schema validation |
| `ranges.py` | `validate_data_ranges` | Range constraint validation |
| `integrity.py` | `assert_primary_key`, `check_referential_integrity`, `check_time_continuity`, `check_data_consistency`, `reconciliation_check` | Data integrity checks |

Total: 9 functions

High-level orchestration workflows.

| File | Functions | Purpose |
|---|---|---|
| `quick_clean.py` | `quick_clean`, `quick_clean_finance` | Pre-built cleaning pipelines |

Total: 2 functions

Data profiling and change tracking.

| File | Functions | Purpose |
|---|---|---|
| `profiling.py` | `infer_and_report_types`, `missingness_profile`, `get_data_summary`, `memory_profile`, `profile_report`, `quick_check`, `info` | Data profiling & discovery |
| `delta.py` | `snapshot_dataset`, `compare_snapshots`, `delta_report` | Change tracking |

Total: 10 functions
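
Change tracking is only named in the table above, so the following sketch shows one plausible approach: hash each row by its key, then diff two snapshots. This is an assumption about how such a comparison could work, not a description of `delta.py`. `snapshot_rows` and `compare_row_snapshots` are hypothetical names, and rows are plain dictionaries to keep the example dependency-free.

```python
import hashlib
import json


def snapshot_rows(rows, key):
    """Map each row's key value to a SHA-256 hash of the row's contents."""
    snapshot = {}
    for row in rows:
        # sort_keys makes the hash independent of dictionary ordering
        payload = json.dumps(row, sort_keys=True, default=str)
        snapshot[row[key]] = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return snapshot


def compare_row_snapshots(before, after):
    """Return the keys that were added, removed, or whose content hash changed."""
    added = sorted(after.keys() - before.keys())
    removed = sorted(before.keys() - after.keys())
    changed = sorted(k for k in before.keys() & after.keys() if before[k] != after[k])
    return {"added": added, "removed": removed, "changed": changed}


# Illustrative data only
old = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
new = [{"id": 2, "amount": 275}, {"id": 3, "amount": 40}]

delta = compare_row_snapshots(snapshot_rows(old, "id"), snapshot_rows(new, "id"))
print(delta)
```

Storing hashes instead of full rows keeps a snapshot small, at the cost of knowing only that a row changed, not which field.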

Helper functions and infrastructure.

| File | Functions | Purpose |
|---|---|---|
| `logging.py` | `AuditLog`, `audit_log`, `get_global_audit_log` | Audit trail tracking |
| `security.py` | `mask_sensitive_fields`, `anonymize_identifiers` | Data security |
| `types.py` | `optimize_dtypes` | Memory optimization |

Total: 6 functions + classes
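
The audit infrastructure in `logging.py` is listed but not shown. The sketch below is a minimal in-memory audit log with timestamps and JSON export, intended only to illustrate the idea. `SimpleAuditLog`, `record`, `to_json` and `get_log` are hypothetical names; of the real API, only `get_global_audit_log()` and `to_list()` appear in this document.

```python
import json
from datetime import datetime, timezone


class SimpleAuditLog:
    """In-memory list of audit entries (illustrative, not the toolkit's AuditLog)."""

    def __init__(self):
        self._entries = []

    def record(self, operation, **details):
        """Append one entry with a UTC timestamp and free-form details."""
        entry = {
            "operation": operation,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "details": details,
        }
        self._entries.append(entry)
        return entry

    def to_list(self):
        """Return a shallow copy so callers cannot reorder the log itself."""
        return list(self._entries)

    def to_json(self):
        """Serialize all entries, for example for a compliance export."""
        return json.dumps(self._entries, indent=2, default=str)


# A single module-level instance plays the role of a global log
_GLOBAL_LOG = SimpleAuditLog()


def get_log():
    return _GLOBAL_LOG


log = get_log()
log.record("drop_duplicates", rows_before=120, rows_after=117)
print(log.to_json())
```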
| Module | Count |
|---|---|
| Core | 20 |
| Features | 7 |
| Finance | 11 |
| IO | 5 |
| Validation | 9 |
| Pipelines | 2 |
| Reporting | 10 |
| Utilities | 6 |
| TOTAL | 70 |

Module template:

```python
"""
Module docstring explaining purpose.
"""
from __future__ import annotations

import pandas as pd

from fda_toolkit.registry import register_function
from fda_toolkit.utils.logging import audit_log


@register_function(
    name="your_function_name",       # Display name
    category="Functional Category",  # Grouping for discovery
    module="submodule.filename",     # Module path
)
def your_function(df: pd.DataFrame) -> pd.DataFrame:
    """
    Comprehensive docstring with:
    - One-line summary
    - Extended explanation
    - Args with types and defaults
    - Returns with type
    - Raises with error conditions
    - Example usage (doctest compatible)
    """
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame")

    # Implementation
    result = df.copy()

    # Log the operation
    audit_log("your_function_name", before=None, after=result)
    return result
```

Basic usage:

```python
import fda_toolkit as ftk

# Read safely
df = ftk.read_csv_safely('data.csv')

# Quick profile
ftk.quick_check(df)

# Clean quickly
df = ftk.quick_clean(df)

# Export safely
ftk.export_parquet(df, 'output.parquet')
```

Discover available functions:

```python
# Get reference table
funcs = ftk.info()
print(funcs)

# Filter by category
finance_funcs = ftk.info(category='Finance')
```

Finance pipeline:

```python
df = ftk.quick_clean_finance(
    df,
    primary_key='invoice_id',
    date_cols=['invoice_date', 'due_date'],
    currency_cols=['amount', 'tax'],
)
```

Inspect the audit log:

```python
from fda_toolkit.utils.logging import get_global_audit_log

log = get_global_audit_log()
print(log.to_list())  # All operations recorded
```

- Functions self-register via decorator
- No manual `__all__` updates needed
- New functions appear automatically in `info()`
- Every function logs its operation
- Timestamp and details recorded automatically
- Export as JSON for compliance
- Full type hints on all parameters and returns
- IDE autocomplete support
- Mypy-compatible
- Clear error messages
- Specific exception types (TypeError, ValueError, etc.)
- Input validation on all public functions
- Every function has docstring
- Examples in docstring (doctest compatible)
- Cross-module consistency
- `copy` parameter on most functions (default `True` for safety)
- Optional chunked processing for large files
- Memory profiling available
- Returns new DataFrames by default
- Never modifies input unless explicitly requested
- Works with all pandas dtypes
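
The list above says functions self-register through a decorator and then show up in `info()`. The mechanics are not shown in this document, so here is a minimal sketch of how a decorator-based registry can work. `register`, `list_functions` and `_REGISTRY` are hypothetical stand-ins for `@register_function` and `info()`, and the two decorated functions are demo placeholders rather than toolkit code.

```python
from typing import Callable, Dict, List, Optional

# Maps a display name to the metadata captured at decoration time
_REGISTRY: Dict[str, Dict[str, str]] = {}


def register(name: str, category: str, module: str) -> Callable:
    """Return a decorator that records metadata and hands the function back unchanged."""

    def decorator(func: Callable) -> Callable:
        doc = (func.__doc__ or "").strip()
        _REGISTRY[name] = {
            "name": name,
            "category": category,
            "module": module,
            # The first docstring line doubles as a short description
            "summary": doc.splitlines()[0] if doc else "",
        }
        return func

    return decorator


def list_functions(category: Optional[str] = None) -> List[Dict[str, str]]:
    """Return registered entries sorted by name, optionally filtered by category."""
    rows = sorted(_REGISTRY.values(), key=lambda row: row["name"])
    if category is not None:
        rows = [row for row in rows if row["category"] == category]
    return rows


@register(name="demo_parse_currency", category="Finance", module="finance.parsing")
def demo_parse_currency(value: str) -> float:
    """Convert a currency string to a float."""
    return float(value.replace("$", "").replace(",", ""))


@register(name="demo_row_count", category="Reporting", module="reporting.profiling")
def demo_row_count(rows: list) -> int:
    """Return the number of rows."""
    return len(rows)


print([row["name"] for row in list_functions()])
print(list_functions(category="Finance"))
```

Because registration happens as a side effect of importing the module, a design like this only lists functions whose modules have actually been imported, which is why a package `__init__.py` typically imports every submodule.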

Create `tests/test_<module>/test_<file>.py`:

```python
import pandas as pd

from fda_toolkit.core.columns import clean_column_headers


def test_clean_column_headers_basic():
    df = pd.DataFrame({'Name ': [1], 'Age (years)': [2]})
    out = clean_column_headers(df)
    assert out.columns.tolist() == ['name', 'age_years']


def test_clean_column_headers_duplicates():
    # A dict literal cannot hold two 'A' keys, so pass the duplicate labels explicitly
    df = pd.DataFrame([[1, 2]], columns=['A', 'A'])
    out = clean_column_headers(df)
    assert 'a_1' in out.columns


if __name__ == '__main__':
    test_clean_column_headers_basic()
    test_clean_column_headers_duplicates()
    print("✅ All tests pass!")
```

- Run tests: `pytest tests/`
- Generate docs: `pdoc src/fda_toolkit -o docs`
- Build package: `pip install build`, then `python -m build`
- Extend with your domain logic:
  - Add custom validators to `validation/business_rules.py`
  - Add domain-specific functions following the template
  - Use `@register_function` to auto-discover them

```
fda_toolkit/
├── __init__.py              # Public API (pandas-like)
├── registry.py              # @register_function decorator
├── core/
│   ├── columns.py           # ✅ 2 functions
│   ├── types.py             # ✅ 4 functions
│   ├── duplicates.py        # ✅ 3 functions
│   ├── missing.py           # ✅ 2 functions
│   ├── outliers.py          # ✅ 6 functions
│   └── text.py              # ✅ 3 functions
├── features/
│   ├── categorical.py       # ✅ 3 functions
│   └── datetime.py          # ✅ 4 functions
├── finance/
│   ├── parsing.py           # ✅ 3 functions
│   ├── rules.py             # ✅ 5 functions
│   └── entities.py          # ✅ 3 functions
├── io/
│   ├── readers.py           # ✅ 3 functions
│   └── writers.py           # ✅ 2 functions
├── validation/
│   ├── schema.py            # ✅ 3 functions
│   ├── ranges.py            # ✅ 1 function
│   └── integrity.py         # ✅ 5 functions
├── pipelines/
│   └── quick_clean.py       # ✅ 2 functions
├── reporting/
│   ├── profiling.py         # ✅ 7 functions
│   └── delta.py             # ✅ 3 functions
└── utils/
    ├── logging.py           # ✅ Audit infrastructure
    ├── security.py          # ✅ 2 functions
    └── types.py             # ✅ 1 function
```

- ✅ 70 production-ready functions
- ✅ 8 cohesive modules
- ✅ Professional pandas-like API
- ✅ Dynamic function discovery
- ✅ Comprehensive audit trail
- ✅ Type hints throughout
- ✅ Clear error handling
- ✅ Docstring examples
- ✅ Memory-aware options
- ✅ Enterprise-grade structure

Your FDA Toolkit is ready for real-world data challenges! 🚀