Open-source, offline-first Python tool to parse credit card PDF statements from Indian banks into structured data.
Indian bank credit card statements are password-protected PDFs with inconsistent formats — making expense tracking painful. StmtForge solves this:
- Parse PDF statements from Indian banks (Currently implemented: HDFC, ICICI, SBI, Axis, Kotak, Yes, CSB, Federal, IDFC First)
- 100% offline — no data leaves your machine, no cloud APIs, no telemetry
- Hybrid extraction — deterministic regex parsers → table extraction → OCR → local LLM (Ollama)
- One command: `pip install stmtforge` and start analyzing your credit card spend
Built for anyone in India who wants to track credit card expenses without trusting third-party apps with their financial data.
StmtForge includes a Streamlit analytics dashboard with interactive charts, filters, and CSV export.
Analytics: total spend, monthly trends, category breakdown, top merchants, bank & card comparison, daily heatmap, drill-downs
The Card Optimizer page provides advanced credit card selection intelligence:
- Scope selector — filter by All Market Cards, My Cards Only, or a custom selection
- Card recommendations — rank all cards by net annual value against your actual spend profile
- Spend Profile with per-transaction advice — for every debit transaction, see which of your cards and which market card earns the most rewards
- N-card portfolio: use the slider to pick the number of cards (1–8); a greedy algorithm builds the combination step by step, aiming to maximise combined net annual rewards while minimising overlap
- Simulator — adjust hypothetical spend amounts category-by-category and watch rankings update live
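
The greedy N-card selection described above can be sketched as follows. This is a minimal illustration, not StmtForge's actual `best_n_card_combo`: the `Card` class, `portfolio_value`, and `greedy_portfolio` are hypothetical names, and the reward model (each category earns at the best rate among the chosen cards, minus all annual fees) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Card:
    """Hypothetical card model: a flat default reward rate plus per-category overrides."""
    name: str
    annual_fee: float
    default_rate: float = 0.0
    rates: dict = field(default_factory=dict)  # category -> reward rate, e.g. 0.05 for 5%


def portfolio_value(cards, spend):
    """Net annual value, assuming each category is paid with the best card held."""
    if not cards:
        return 0.0
    rewards = sum(
        amount * max(c.rates.get(category, c.default_rate) for c in cards)
        for category, amount in spend.items()
    )
    return rewards - sum(c.annual_fee for c in cards)


def greedy_portfolio(cards, spend, n):
    """Add the card with the largest marginal gain until n cards are held or nothing helps."""
    chosen, remaining = [], list(cards)
    while remaining and len(chosen) < n:
        best = max(remaining, key=lambda c: portfolio_value(chosen + [c], spend))
        if portfolio_value(chosen + [best], spend) <= portfolio_value(chosen, spend):
            break  # an extra card would only add fees or overlap
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

A greedy pass like this is a heuristic: it is fast and penalises overlapping cards naturally, but it is not guaranteed to find the best possible combination.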
| Feature | Description |
|---|---|
| 9 bank-specific parsers | Dedicated parsers for HDFC, ICICI, SBI, Axis, Kotak, Yes, CSB, Federal, IDFC First |
| PDF unlock & parse | Auto-decrypts password-protected statements (DOB, PAN, custom patterns) |
| Hybrid extraction pipeline | Deterministic → table → OCR → local LLM fallback chain |
| Local LLM via Ollama | Qwen / Mistral / Llama3 for unstructured statement parsing |
| Gmail auto-fetch | Read-only OAuth2 — downloads statement PDFs from Gmail automatically |
| User card registry | Auto-detects your cards from transaction history and stores them in user_cards DB table |
| Per-transaction card advisor | For each debit transaction, shows which of your cards and which market card earns the most |
| N-card portfolio optimizer | Greedy algorithm selects a set of N cards, aiming to maximise combined annual rewards |
| Scope selector | Card Optimizer supports All Market Cards / My Cards Only / Custom Selection scopes |
| Multi-card tracking | Track spend across multiple cards and banks |
| Auto-categorization | Rule-based merchant classification (Shopping, Food, Travel, EMI, etc.) |
| Transaction deduplication | Hash-based dedup with incremental processing |
| Streamlit dashboard | Interactive Plotly charts, sidebar filters, CSV export |
| Privacy-first design | PII redacted from logs, HMAC pseudonymization, DPDP-aligned |
| CLI interface | stmtforge run, stmtforge dashboard, stmtforge init |
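
The hash-based deduplication listed in the table can be illustrated with a short sketch. The field names (`date`, `amount`, `description`, `card`) and the helpers `txn_hash` and `add_new` are assumptions made for this example, not StmtForge's actual schema or API.

```python
import hashlib


def txn_hash(txn):
    """Stable fingerprint built from the fields assumed to identify a transaction."""
    parts = [
        txn["date"],                                   # ISO date string, e.g. "2025-11-19"
        f'{float(txn["amount"]):.2f}',                 # 499 and "499.00" both become "499.00"
        " ".join(txn["description"].upper().split()),  # ignore case and extra whitespace
        txn["card"],
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def add_new(transactions, seen_hashes):
    """Return only transactions not seen before; seen_hashes is updated in place."""
    fresh = []
    for txn in transactions:
        digest = txn_hash(txn)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            fresh.append(txn)
    return fresh
```

Persisting `seen_hashes` between runs is what would make processing incremental: re-parsing an already imported statement adds nothing. One caveat of this simple key is that two genuinely identical purchases on the same day would collapse into one.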
StmtForge now uses a separate, public rules repository for card benefits data:
- Repository: https://github.com/madhav921/stmtforge-cards-db
- Purpose: transparent, versioned, community-contributable card rules
- Runtime model: `stmtforge init` clones the rules repo locally into `data/cards-db`, then seeds `data/cards/` for offline evaluation
The core package is the only PyPI install target. Rules updates happen through the local git clone.
See architecture blueprint: docs/PRODUCTION_ARCHITECTURE.md
pip install stmtforge

Optional extras:
pip install "stmtforge[gmail]" # Gmail fetch support
pip install "stmtforge[ocr]" # OCR fallback support
pip install "stmtforge[all]" # Gmail + OCR extras

git clone https://github.com/madhav921/stmt-forge.git
cd stmt-forge
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # macOS / Linux
pip install -e ".[dev]" # developer tools only
pip install -e ".[dev,all]" # developer tools + Gmail + OCR extras

| Requirement | Purpose |
|---|---|
| Python 3.11+ | Runtime |
| Ollama (optional) | Local LLM for unstructured PDF parsing |
| Google Cloud project (optional) | Gmail API — not needed for manual PDF import |
| Tesseract OCR binary (optional) | Required by OCR fallback when stmtforge[ocr] is installed |
| qpdf (optional) | Fallback PDF decryption |
# 1. Set up project
mkdir ~/my-statements && cd ~/my-statements
stmtforge init # creates config.yaml, .env.example, data/, and clones the cards repo
# 2. Configure PDF passwords
cp .env.example .env # then edit .env with your passwords
# 3. (Optional) Set up local LLM
ollama pull qwen2.5:3b
# 4. Run the pipeline
stmtforge run --local # parse local PDFs
stmtforge run --full # Gmail fetch + parse
stmtforge run --folder path/to/pdfs # specific folder
# 5. View insights
stmtforge dashboard

To refresh the local rules snapshot later:
git -C data/cards-db pull
stmtforge init

Manual PDF import: Drop PDFs into `data/raw_pdfs/<bank>/` and run `stmtforge run --local`. No Gmail setup needed.
Where do my PDFs go?
Place your password-protected statement PDFs inside `data/raw_pdfs/<bank>/`, for example:
- `data/raw_pdfs/hdfc/5268XXXXXXXXXX38_19-11-2025.pdf`
- `data/raw_pdfs/sbi/7411XXXXXXXXXXXX_15062024.pdf`
- `data/raw_pdfs/idfc/601000XXXXXXXX_24112025_110414520.pdf`

The `<bank>` folder name must match one of the supported bank keys: `hdfc`, `sbi`, `icici`, `axis`, `kotak`, `yes`, `csb`, `federal`, `idfc`. StmtForge will auto-unlock and parse all PDFs found recursively. You can also sub-organise by month (`data/raw_pdfs/hdfc/2025_11/`); both flat and nested layouts are supported.
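
To make the folder convention concrete, here is a small sketch that walks a `raw_pdfs` directory and groups PDFs by bank key. The `discover_pdfs` helper is hypothetical, and skipping unknown folder names is an assumption of this sketch rather than documented StmtForge behaviour.

```python
from pathlib import Path

# Bank keys as listed above.
BANK_KEYS = {"hdfc", "sbi", "icici", "axis", "kotak", "yes", "csb", "federal", "idfc"}


def discover_pdfs(root):
    """Map each supported bank key to the PDFs found under root/<bank>/, recursively."""
    root = Path(root)
    found = {}
    if not root.is_dir():
        return found
    for bank_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        key = bank_dir.name.lower()
        if key not in BANK_KEYS:
            continue  # unknown folder names are skipped in this sketch
        pdfs = sorted(
            p for p in bank_dir.rglob("*")
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
        if pdfs:
            found[key] = pdfs
    return found
```

Because `rglob` descends into subfolders, a flat layout and a month-nested layout such as `hdfc/2025_11/` yield the same result.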
| Bank | Parser | Card Detection |
|---|---|---|
| HDFC Bank | hdfc_parser | Swiggy, Tata Neu, Millennia, etc. |
| ICICI Bank | icici_parser | Amazon Pay, Coral, Platinum, etc. |
| SBI Card | sbi_parser | Cashback, Elite, SimplyCLICK, etc. |
| Axis Bank | axis_parser | Neo, Flipkart, Ace, etc. |
| Kotak Mahindra | kotak_parser | 811, League Platinum, etc. |
| Yes Bank | yes_parser | Marquee, Prosperity, etc. |
| CSB Bank | csb_parser | Edge, etc. |
| Federal Bank | federal_parser | Signet, Scapia, etc. |
| IDFC First Bank | idfc_first_parser | First Select, Classic, WOW, etc. |
| Any other bank | generic_parser + LLM | Auto-detected |
Statement formats change over time. Open an issue if a parser produces incorrect results.
PDF → Unlock → Bank Parser → Table Extraction → OCR → LLM → Validate → SQLite → Dashboard
- PDF Unlock — Tries password combos (DOB, PAN, custom) via pikepdf
- Bank Parser — Bank-specific regex parser extracts transactions directly
- Fallback Chain — Table extraction (pdfplumber) → Layout text → OCR (Tesseract) → Local LLM (Ollama)
- Validation — Date normalization, amount bounds, dedup, confidence scoring
- Categorization — Rule-based merchant → category mapping
- Storage: SQLite with 6 tables (`transactions`, `statements_metadata`, `user_cards`, `gmail_messages`, `extraction_log`, `pipeline_state`)
- User Card Registry: `user_cards` table auto-populated from transaction history; maps card name fragments to 29 YAML card definitions
- Card Optimizer: `CardAdvisor` and `best_n_card_combo` consume the `user_cards` registry for scope-aware recommendations
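
The fallback chain in the steps above can be sketched as an ordered list of extractors tried one after another. The `run_chain` function and the stage callables are hypothetical stand-ins; the real pipeline also applies validation and confidence scoring, which this sketch leaves out.

```python
def run_chain(pdf_path, extractors):
    """Try each (name, extractor) pair in order.

    Returns (name, records) for the first stage that yields any records,
    or (None, []) if every stage comes back empty or fails.
    """
    for name, extract in extractors:
        try:
            records = extract(pdf_path)
        except Exception:
            records = []  # a failing stage falls through to the next one
        if records:
            return name, records
    return None, []
```

Ordering the stages from cheapest and most deterministic (bank-specific regex) to most expensive (local LLM) means the slower stages only run for statements the earlier ones cannot handle.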
StmtForge is built around a local-first, zero-upload architecture.
| Aspect | Guarantee |
|---|---|
| Processing | 100% local — no cloud, no external APIs |
| Storage | Local SQLite + local files only |
| Telemetry | None — no analytics, no phone-home |
| Log privacy | PII auto-redacted (emails, phones, PAN, card numbers) |
| PDF passwords | .env → memory only; never logged or stored in DB |
| Gmail | Optional, read-only OAuth2; revoke anytime at Google Permissions |
See SECURITY.md for vulnerability reporting and full security policy.
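
As an illustration of the log redaction and HMAC pseudonymization mentioned above, here is a minimal sketch. The regular expressions, placeholder labels, and the `redact` and `pseudonymize` helpers are assumptions for this example; StmtForge's actual patterns may differ.

```python
import hashlib
import hmac
import re

# Illustrative patterns only; applied in this order.
PATTERNS = [
    ("[EMAIL]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[PAN]", re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")),           # PAN: 5 letters, 4 digits, 1 letter
    ("[CARD]", re.compile(r"\b\d(?:[ -]?\d){12,18}\b")),            # 13 to 19 digits, optional separators
    ("[PHONE]", re.compile(r"(?<!\d)(?:\+91[ -]?)?[6-9]\d{9}(?!\d)")),  # Indian mobile numbers
]


def redact(text):
    """Replace anything that looks like PII with a fixed placeholder."""
    for label, pattern in PATTERNS:
        text = pattern.sub(label, text)
    return text


def pseudonymize(value, key):
    """Keyed, deterministic pseudonym: same value and key always give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
```

A keyed HMAC, unlike a plain hash, cannot be reversed by brute-forcing the small space of card numbers without the secret key, while still letting records for the same card be grouped together.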
stmtforge init creates a config.yaml with these sections:
| Section | Purpose |
|---|---|
| gmail | Sender domains, search keywords, attachment filters |
| credit_cards | Your banks and card names |
| pdf_passwords | Password patterns (from .env) |
| parsers | Email/filename → bank mapping, card identifiers |
| categories | Merchant → category rules |
| database | SQLite path |
| llm | Ollama model, URL, temperature |
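
To show how merchant-to-category rules like those in the `categories` section might be applied, here is a minimal sketch. The keyword lists, the first-match-wins ordering, and the `categorize` helper are illustrative assumptions, not the shipped rule set.

```python
# Ordered rules: earlier entries win, so "AMAZON EMI" is classed as EMI, not Shopping.
RULES = [
    ("EMI", ("EMI", "INSTALMENT")),
    ("Food", ("SWIGGY", "ZOMATO")),
    ("Shopping", ("AMAZON", "FLIPKART")),
    ("Travel", ("IRCTC", "MAKEMYTRIP")),
]


def categorize(description, rules=RULES, default="Other"):
    """Return the first category whose keyword appears in the merchant description."""
    text = description.upper()
    for category, keywords in rules:
        if any(keyword in text for keyword in keywords):
            return category
    return default
```

Keeping the rules as plain data makes them easy to move into `config.yaml` and edit without touching code.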
from stmtforge.parsers.base_parser import BaseParser, parse_date, parse_amount

class MyBankParser(BaseParser):
    BANK_NAME = "mybank"

    def parse(self, pdf_path):
        records = [...]  # Extract transactions
        return self._get_standard_df(records)

Register in `src/stmtforge/parsers/registry.py` and add mappings in `config_template.yaml`.
See CONTRIBUTING.md for details.
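
For the "Extract transactions" step of a new parser, a standalone sketch of turning one statement line into a record may help. The line layout (date, description, amount, optional `Cr` suffix) is a made-up example, and `normalize_date`, `normalize_amount`, and `parse_line` are hypothetical helpers, not the `parse_date` and `parse_amount` functions shipped in `base_parser`.

```python
import re
from datetime import datetime

DATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%d/%m/%y")

# Assumed layout: "<date> <description> <amount>[ Cr]"
LINE_RE = re.compile(
    r"^(\d{2}[/-]\d{2}[/-]\d{2,4})\s+(.+?)\s+([\d,]+\.\d{2}(?:\s*Cr)?)\s*$",
    re.IGNORECASE,
)


def normalize_date(text):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_amount(text):
    """'1,23,456.78 Cr' -> (123456.78, 'credit'); amounts without 'Cr' are debits."""
    cleaned = text.strip()
    kind = "credit" if cleaned.upper().endswith("CR") else "debit"
    number = re.search(r"\d[\d,]*(?:\.\d+)?", cleaned)
    if number is None:
        return None, kind
    return float(number.group().replace(",", "")), kind


def parse_line(line):
    """Parse one transaction line into a dict, or return None for non-transaction lines."""
    match = LINE_RE.match(line.strip())
    if not match:
        return None
    date = normalize_date(match.group(1))
    amount, kind = normalize_amount(match.group(3))
    if date is None or amount is None:
        return None
    return {"date": date, "description": match.group(2), "amount": amount, "type": kind}
```

A real parser would collect such records for every line of the extracted text before handing them to the base class.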
stmt-forge/
├── src/stmtforge/ # Package source
│ ├── cli.py # CLI entry point
│ ├── run_pipeline.py # Pipeline orchestrator
│ ├── hybrid_pipeline.py # Hybrid extraction engine
│ ├── parsers/ # 9 bank parsers + generic + categorizer
│ ├── dashboard/ # Streamlit analytics app
│ │ ├── app.py # Main dashboard (Analytics, Statements, Parse PDF)
│ │ └── pages/
│ │     └── 2_Card_Optimizer.py # Card Optimizer with scope selector, N-card portfolio
│ ├── pdf_processing/ # PDF unlock & text extraction
│ ├── llm/ # Ollama client & prompts
│ ├── gmail/ # Gmail OAuth & fetcher
│ ├── database/ # SQLite layer (user_cards table, sync methods)
│ ├── suggestor/ # Card recommendation engine
│ │ ├── card_db.py # 29-card YAML database loader
│ │ ├── card_advisor.py # Per-transaction best-card advisor (CardAdvisor)
│ │ ├── optimizer.py # rank_cards + best_n_card_combo (greedy N-card portfolio)
│ │ ├── spend_vector.py # SpendVector builder from DB transactions
│ │ └── report.py # HTML report exporter
│ ├── validator/ # Transaction validation
│ └── utils/ # Config, logging, privacy, hashing
├── data/
│ ├── cards/ # 29 YAML card definition files
│ └── ccanalyser.db # SQLite DB (transactions, user_cards, metadata)
├── tests/ # Test suite (80 tests)
├── pyproject.toml # Build config
└── README.md
Bug reports, new bank parsers, and code fixes welcome. See CONTRIBUTING.md.