| license | cc-by-4.0 |
|---|---|
| pretty_name | CanFinBench |
| paperswithcode_id | canfinbench |

CanFinBench is the first public benchmark for evaluating large language models on Canadian financial regulatory knowledge, compliance reasoning, and model-governance judgment. It is designed to test the specific capabilities that Canadian federally regulated financial institutions (FRFIs) require from AI systems deployed under OSFI Guideline E-23 (Model Risk Management, in force May 1, 2027).
Existing financial LLM benchmarks (FinQA, PIXIU/FinBen, FinEval, CNFinBench) focus on US SEC filings, Chinese regulations, or general numerical reasoning. No public benchmark encodes Canadian regulatory frameworks. Yet by May 2027, every Canadian bank, insurer, and trust company must validate AI models under OSFI E-23 — creating an urgent, unmet need for standardized evaluation.
CanFinBench fills this gap by encoding:
- OSFI Guideline E-23 — Model risk management, AI governance, lifecycle requirements
- FINTRAC/PCMLTFA — AML/KYC, suspicious transaction reasoning
- OSFI B-20 — Mortgage stress test, MQR, LTI limits
- IFRS 9 ECL — Expected credit loss staging, Canadian implementation
- Basel III / OSFI CAR — Capital adequacy, Canadian output floor deferral
- PIPEDA / Quebec Law 25 — Privacy obligations for AI systems
- CASL — AI-driven marketing compliance
- Compliance-first, not trivia-first. Models tend to score well on regulatory QA ("What is the MQR?") but fail on compliance reasoning ("Given this drift scenario, classify the inherent vs. residual risk"). CanFinBench over-indexes on the latter.
- Three task tiers: MCQ governance reasoning (Task A), scenario-based risk judgment (Task B), and compliance-drift red-teaming (Task C).
- Primary source citations. Every item cites the exact guideline clause, section, or statutory provision it tests, which makes evaluation auditable and reproducible.
- Bilingual (EN/FR). Canada's officially bilingual context and Quebec's AMF guideline (French-only source text) are represented.
- Living benchmark. Items are versioned and refreshed quarterly as OSFI/FINTRAC/AMF guidance evolves, turning regulatory churn into a feature.
- GitHub: https://github.com/CrillyPienaah/CanFinBench
- Leaderboard: https://huggingface.co/spaces/CrillyPienaah/CanFinBench-Leaderboard (coming soon)
- Portfolio: https://chris-pienaah-portfolio.vercel.app/projects/canfinbench
Each instance is a JSON object with the following fields:
```json
{
  "id": "cfb-e23-001",
  "task_type": "mcq_governance",
  "domain": "osfi_e23",
  "difficulty": "hard",
  "question": "A federally regulated bank is deploying an autonomous LLM for real-time mortgage pricing...",
  "choices": ["A) The size of the underlying asset portfolio.", "B) The model's level of autonomy...", "C) ...", "D) ..."],
  "answer": "B",
  "explanation": "OSFI E-23 explicitly lists 'level of autonomy' as a qualitative risk-rating factor...",
  "regulatory_source": "OSFI Guideline E-23",
  "regulatory_section": "Section 3.2 — Model Risk Rating",
  "language": "en",
  "version": "0.1.0"
}
```

| Field | Type | Description |
|---|---|---|
| `id` | string | Unique identifier. Format: cfb-{domain}-{number} |
| `task_type` | string | One of: mcq_governance, scenario_judgment, compliance_drift |
| `domain` | string | Regulatory domain: osfi_e23, fintrac, b20, ifrs9, basel3, pipeda, casl |
| `difficulty` | string | easy, medium, hard, expert |
| `question` | string | The question or scenario prompt |
| `choices` | list[string] | Answer choices for MCQ items (null for open-ended) |
| `answer` | string | Correct answer key (A/B/C/D) or gold-standard response |
| `explanation` | string | Detailed explanation citing the regulatory source |
| `regulatory_source` | string | Primary regulatory document |
| `regulatory_section` | string | Specific section/clause |
| `language` | string | en or fr |
| `version` | string | Dataset version when item was added |

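
The field table above can be turned into a lightweight item check. The following is a minimal sketch, not part of the dataset tooling: the function name `validate_item`, the regular expression for the `id` format, and the rule that the answer key must match the first character of one of the choices are all assumptions made for illustration. The sample item is the one shown in the card, with its truncated strings left as they appear.

```python
import re

# Allowed values are copied from the field table; the checks are illustrative.
TASK_TYPES = {"mcq_governance", "scenario_judgment", "compliance_drift"}
DOMAINS = {"osfi_e23", "fintrac", "b20", "ifrs9", "basel3", "pipeda", "casl"}
DIFFICULTIES = {"easy", "medium", "hard", "expert"}
LANGUAGES = {"en", "fr"}
STRING_FIELDS = (
    "id", "task_type", "domain", "difficulty", "question", "answer",
    "explanation", "regulatory_source", "regulatory_section", "language", "version",
)
# Assumed reading of "cfb-{domain}-{number}": a short domain tag, then digits.
ID_PATTERN = re.compile(r"^cfb-[a-z0-9]+-\d+$")


def validate_item(item: dict) -> list[str]:
    """Return a list of problems found in one item (an empty list means it passed)."""
    problems = [f"{f}: expected a string" for f in STRING_FIELDS
                if not isinstance(item.get(f), str)]
    if problems:
        return problems
    for field, allowed in (("task_type", TASK_TYPES), ("domain", DOMAINS),
                           ("difficulty", DIFFICULTIES), ("language", LANGUAGES)):
        if item[field] not in allowed:
            problems.append(f"{field}: unexpected value {item[field]!r}")
    if not ID_PATTERN.match(item["id"]):
        problems.append("id: does not look like cfb-{domain}-{number}")
    choices = item.get("choices")
    if choices is not None:  # null means an open-ended item with a free-text answer
        if not isinstance(choices, list) or not all(isinstance(c, str) for c in choices):
            problems.append("choices: expected a list of strings or null")
        elif item["answer"] not in {c[:1] for c in choices}:
            problems.append("answer: key does not match any choice label")
    return problems


example_item = {
    "id": "cfb-e23-001",
    "task_type": "mcq_governance",
    "domain": "osfi_e23",
    "difficulty": "hard",
    "question": "A federally regulated bank is deploying an autonomous LLM for real-time mortgage pricing...",
    "choices": ["A) The size of the underlying asset portfolio.",
                "B) The model's level of autonomy...", "C) ...", "D) ..."],
    "answer": "B",
    "explanation": "OSFI E-23 explicitly lists 'level of autonomy' as a qualitative risk-rating factor...",
    "regulatory_source": "OSFI Guideline E-23",
    "regulatory_section": "Section 3.2 — Model Risk Rating",
    "language": "en",
    "version": "0.1.0",
}

print(validate_item(example_item) or "item passed all checks")
```

A check like this only covers structure. It says nothing about whether the cited clause actually supports the answer, which still needs review against the primary source.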
| Split | Items | Description |
|---|---|---|
| train | 40 | Development/few-shot examples with full explanations |
| test | 10 | Held-out evaluation set (answers withheld in leaderboard) |

Note: A private held-out test set is maintained separately for the official leaderboard to prevent contamination.
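
For items where answers are available, multiple-choice accuracy can be computed locally. The sketch below is one possible approach rather than the official leaderboard scorer: `score_mcq` is a hypothetical helper, the rule that a missing prediction counts as wrong is an assumption, and the toy items are placeholders that only borrow the field names from the schema.

```python
from collections import defaultdict


def score_mcq(items, predictions):
    """Return (overall accuracy, per-domain accuracy) over items that have choices.

    `items` is a list of item dicts; `predictions` maps item id -> predicted key.
    A missing prediction is counted as incorrect (an assumption of this sketch).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        if item.get("choices") is None:
            continue  # open-ended items need a different scoring method
        domain = item["domain"]
        total[domain] += 1
        predicted = predictions.get(item["id"], "").strip().upper()
        if predicted == item["answer"].strip().upper():
            correct[domain] += 1
    per_domain = {d: correct[d] / total[d] for d in total}
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return overall, per_domain


# Placeholder items for illustration only; these are not real dataset entries.
toy_items = [
    {"id": "cfb-e23-001", "domain": "osfi_e23", "choices": ["A) ...", "B) ..."], "answer": "B"},
    {"id": "cfb-e23-002", "domain": "osfi_e23", "choices": ["A) ...", "B) ..."], "answer": "A"},
    {"id": "cfb-b20-001", "domain": "b20", "choices": ["A) ...", "C) ..."], "answer": "C"},
    {"id": "cfb-fintrac-001", "domain": "fintrac", "choices": None, "answer": "free text"},
]
toy_predictions = {"cfb-e23-001": "b", "cfb-e23-002": "C", "cfb-b20-001": "C"}

overall, per_domain = score_mcq(toy_items, toy_predictions)
print(f"overall={overall:.2f}", per_domain)
```

Reporting per-domain accuracy alongside the overall figure keeps a strong domain from masking a weak one, which matters when domains have very different item counts.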
All items are grounded in primary regulatory documents:
| Source | Version | URL |
|---|---|---|
| OSFI Guideline E-23 | September 2025 | osfi-bsif.gc.ca |
| OSFI Guideline B-20 | November 2023 | osfi-bsif.gc.ca |
| OSFI CAR Guideline | 2026 | osfi-bsif.gc.ca |
| PCMLTFA / FINTRAC | 2025 amendments | fintrac-canafe.gc.ca |
| IFRS 9 (OSFI advisory) | 2017/2024 | osfi-bsif.gc.ca |
| PIPEDA | 2024 | priv.gc.ca |
| Quebec Law 25 | September 2023 | legisquebec.gouv.qc.ca |
| CASL | 2014 (as amended) | fightspam.gc.ca |
| AMF AI Guideline (draft) | July 2025 | lautorite.qc.ca |
Items were created and validated by the dataset author (MPS Analytics, Applied Machine Intelligence, Northeastern University) against primary regulatory text. Each item:
- Is grounded in a specific, cited clause of the primary regulatory document
- Has been cross-checked against at least one secondary source (law firm commentary, OSFI FAQs)
- Includes a detailed explanation that can serve as a teaching document
This dataset contains no personal information. All scenarios are synthetic and constructed from public regulatory documents.
CanFinBench aims to improve the reliability and safety of AI systems deployed in Canadian financial services — a domain where errors can cause material harm to consumers, financial stability, and regulatory compliance. By establishing a public standard, we hope to:
- Enable transparent benchmarking of LLMs for regulated financial use cases
- Support Canadian banks in OSFI E-23 compliance
- Advance research on compliance reasoning in LLMs
- Items reflect Canadian regulatory frameworks as of the dataset version date. International frameworks (US, EU, UK) are out of scope for v0.1.
- Regulatory guidance evolves; items may become outdated as OSFI/FINTRAC/AMF update their guidelines.
- The current dataset is English-dominant; the French split will be expanded in v0.2.
- v0.1 covers 50 items — sufficient for development but not for statistically robust benchmarking. Target for v1.0 is 500+ items across all domains.
- Task C (compliance-drift red-teaming) is the most novel task type and has the fewest items in v0.1; this will be the primary expansion in v0.2.
- The private held-out test set for the official leaderboard is maintained separately and not released publicly.
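
To make the sample-size limitation above concrete, the sketch below computes a 95% Wilson score interval for an observed accuracy at several benchmark sizes. The 80% accuracy figure is a hypothetical value chosen for illustration, not a result from any model, and `wilson_interval` is a helper written for this example.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Same hypothetical 80% accuracy at the test-split size, the v0.1 size, and the v1.0 target.
for n in (10, 50, 500):
    low, high = wilson_interval(n * 4 // 5, n)
    print(f"n={n:>3}: interval=[{low:.3f}, {high:.3f}] width={high - low:.3f}")
```

The interval narrows as the item count grows, which is why small differences between models on a 10-item or 50-item set should be read with caution.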
Christopher Crilly Pienaah, MPS Analytics (Applied Machine Intelligence), Northeastern University (2026)
- Portfolio: https://chris-pienaah-portfolio.vercel.app
- GitHub: https://github.com/CrillyPienaah
- LinkedIn: https://linkedin.com/in/christopher-crilly-pienaah
This dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
You are free to share and adapt the material for any purpose, provided you give appropriate credit, provide a link to the license, and indicate if changes were made.
```bibtex
@dataset{pienaah2026canfinbench,
  author = {Pienaah, Christopher Crilly},
  title = {CanFinBench: Canadian Financial Regulatory LLM Benchmark},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/CrillyPienaah/CanFinBench},
  version = {0.1.0},
  license = {CC BY 4.0},
  note = {First public benchmark for evaluating LLMs on Canadian financial regulatory knowledge. Covers OSFI E-23, FINTRAC/PCMLTFA, B-20, IFRS 9, Basel III, PIPEDA, and CASL.}
}
```

Contributions, corrections, and domain expansions are welcome. Please open an issue or pull request on GitHub.
To contribute items, please follow the item schema above and ensure every item includes:
- A specific primary regulatory source citation
- A detailed explanation
- Expert validation
| Version | Date | Changes |
|---|---|---|
| 0.1.0 | June 2026 | Initial release — 50 items across Task A/B/C, OSFI E-23, FINTRAC, B-20 |
| 0.2.0 | Q3 2026 (planned) | IFRS 9 + Basel III domains; French split; expanded to 200 items |
| 1.0.0 | Q4 2026 (planned) | Full 500+ items; private leaderboard test set; arXiv paper |