CanFinBench: Canadian Financial Regulatory LLM Benchmark

license: cc-by-4.0
language: en, fr
pretty_name: CanFinBench

Dataset Description

CanFinBench is the first public benchmark for evaluating large language models on Canadian financial regulatory knowledge, compliance reasoning, and model-governance judgment. It is designed to test the specific capabilities that Canadian federally regulated financial institutions (FRFIs) require from AI systems deployed under OSFI Guideline E-23 (Model Risk Management, in force May 1, 2027).

Why CanFinBench?

Existing financial LLM benchmarks (FinQA, PIXIU/FinBen, FinEval, CNFinBench) focus on US SEC filings, Chinese regulations, or general numerical reasoning. No public benchmark encodes Canadian regulatory frameworks. Yet by May 2027, every federally regulated Canadian bank, insurer, and trust company must validate AI models under OSFI E-23, which creates an urgent, unmet need for standardized evaluation.

CanFinBench fills this gap by encoding:

OSFI Guideline E-23 — Model risk management, AI governance, lifecycle requirements
FINTRAC/PCMLTFA — AML/KYC, suspicious transaction reasoning
OSFI B-20 — Mortgage stress test, MQR, LTI limits
IFRS 9 ECL — Expected credit loss staging, Canadian implementation
Basel III / OSFI CAR — Capital adequacy, Canadian output floor deferral
PIPEDA / Quebec Law 25 — Privacy obligations for AI systems
CASL — AI-driven marketing compliance

Key Design Principles

Compliance-first, not trivia-first. Models tend to score well on factual regulatory QA ("What is the MQR?") but struggle with compliance reasoning ("Given this drift scenario, classify the inherent vs. residual risk"). CanFinBench deliberately over-indexes on the latter.
Three task tiers: MCQ governance reasoning (Task A), scenario-based risk judgment (Task B), and compliance-drift red-teaming (Task C).
Primary source citations. Every item cites the exact guideline clause, section, or statutory provision it tests — enabling auditable, reproducible evaluation.
Bilingual (EN/FR). Canada's officially bilingual context and Quebec's AMF guideline (French-only source text) are represented.
Living benchmark. Items are versioned and refreshed quarterly as OSFI/FINTRAC/AMF guidance evolves — turning regulatory churn into a feature.

Dataset Homepage

GitHub: https://github.com/CrillyPienaah/CanFinBench
Leaderboard: https://huggingface.co/spaces/CrillyPienaah/CanFinBench-Leaderboard (coming soon)
Portfolio: https://chris-pienaah-portfolio.vercel.app/projects/canfinbench

Dataset Structure

Data Instances

Each instance is a JSON object with the following fields:

{
  "id": "cfb-e23-001",
  "task_type": "mcq_governance",
  "domain": "osfi_e23",
  "difficulty": "hard",
  "question": "A federally regulated bank is deploying an autonomous LLM for real-time mortgage pricing...",
  "choices": ["A) The size of the underlying asset portfolio.", "B) The model's level of autonomy...", "C) ...", "D) ..."],
  "answer": "B",
  "explanation": "OSFI E-23 explicitly lists 'level of autonomy' as a qualitative risk-rating factor...",
  "regulatory_source": "OSFI Guideline E-23",
  "regulatory_section": "Section 3.2 — Model Risk Rating",
  "language": "en",
  "version": "0.1.0"
}
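As an illustration of how such an instance might be consumed, the following minimal sketch renders an MCQ item as a prompt and pulls a letter out of a model reply. The prompt template and the helper names `build_prompt` and `extract_choice` are assumptions made for this example and are not part of the dataset or its evaluation harness. The parser assumes the model was instructed to lead with its answer letter.

```python
import re
from typing import Optional

# Fields copied from the example instance above; elided text ("...") is kept as-is.
item = {
    "id": "cfb-e23-001",
    "task_type": "mcq_governance",
    "question": "A federally regulated bank is deploying an autonomous LLM for real-time mortgage pricing...",
    "choices": [
        "A) The size of the underlying asset portfolio.",
        "B) The model's level of autonomy...",
        "C) ...",
        "D) ...",
    ],
    "answer": "B",
}


def build_prompt(item: dict) -> str:
    """Render an MCQ item as plain text (hypothetical template)."""
    lines = [item["question"], ""]
    lines.extend(item["choices"])  # choices already carry their "A)" labels
    lines.append("")
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)


def extract_choice(response: str) -> Optional[str]:
    """Return the leading A-D letter of a reply, or None if there is none.

    Assumes the reply starts with the letter, optionally wrapped as "(B)" or "B)".
    """
    match = re.match(r"\s*\(?([ABCD])\b", response)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(build_prompt(item))
    reply = "B) The model's level of autonomy"  # stand-in for a model response
    print(extract_choice(reply) == item["answer"])
```

A stricter parser may be needed for free-form replies, since a sentence that happens to begin with the word "A" would be read as choice A.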

Data Fields

Field	Type	Description
`id`	string	Unique identifier. Format: `cfb-{domain tag}-{number}`, e.g. `cfb-e23-001` (the tag can be shorter than the `domain` value)
`task_type`	string	One of: `mcq_governance`, `scenario_judgment`, `compliance_drift`
`domain`	string	Regulatory domain: `osfi_e23`, `fintrac`, `b20`, `ifrs9`, `basel3`, `pipeda`, `casl`
`difficulty`	string	`easy`, `medium`, `hard`, `expert`
`question`	string	The question or scenario prompt
`choices`	list[string]	Answer choices for MCQ items (null for open-ended)
`answer`	string	Correct answer key (A/B/C/D) or gold-standard response
`explanation`	string	Detailed explanation citing the regulatory source
`regulatory_source`	string	Primary regulatory document
`regulatory_section`	string	Specific section/clause
`language`	string	`en` or `fr`
`version`	string	Dataset version when item was added
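The field table above can be turned into a lightweight validity check, which may be useful when contributing items. This is a sketch only: the function name `validate_item` is hypothetical, and the rule that `mcq_governance` items must carry `choices` and an A/B/C/D answer is an assumption inferred from the field descriptions, not a documented constraint.

```python
# Allowed values taken from the Data Fields table.
ALLOWED = {
    "task_type": {"mcq_governance", "scenario_judgment", "compliance_drift"},
    "domain": {"osfi_e23", "fintrac", "b20", "ifrs9", "basel3", "pipeda", "casl"},
    "difficulty": {"easy", "medium", "hard", "expert"},
    "language": {"en", "fr"},
}

STRING_FIELDS = [
    "id", "task_type", "domain", "difficulty", "question", "answer",
    "explanation", "regulatory_source", "regulatory_section", "language", "version",
]


def validate_item(item: dict) -> list:
    """Return a list of human-readable problems; an empty list means the item passed."""
    errors = []
    for field in STRING_FIELDS:
        if not isinstance(item.get(field), str):
            errors.append(f"{field}: expected a string")
    for field, allowed in ALLOWED.items():
        value = item.get(field)
        if isinstance(value, str) and value not in allowed:
            errors.append(f"{field}: unexpected value {value!r}")
    choices = item.get("choices")
    if choices is not None and not (
        isinstance(choices, list) and all(isinstance(c, str) for c in choices)
    ):
        errors.append("choices: expected a list of strings or null")
    # Assumption: Task A (mcq_governance) is the only multiple-choice task type.
    if item.get("task_type") == "mcq_governance":
        if not choices:
            errors.append("choices: required for MCQ items")
        if item.get("answer") not in {"A", "B", "C", "D"}:
            errors.append("answer: expected one of A/B/C/D for MCQ items")
    return errors
```

Calling `validate_item` on the example instance from Data Instances should return an empty list, while an item with, say, `"difficulty": "extreme"` would yield one error string.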

Data Splits

Split	Items	Description
`train`	40	Development/few-shot examples with full explanations
`test`	10	Held-out evaluation set (answers withheld on the leaderboard)

Note: A private held-out test set is maintained separately for the official leaderboard to prevent contamination.
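For local development on the `train` split, where answers are available, per-domain MCQ accuracy could be computed along the following lines. This is a hedged sketch rather than the official scoring procedure: the function name `accuracy_by_domain`, the `predictions` mapping (item `id` to predicted letter), and the choice to skip open-ended items are all assumptions for illustration.

```python
from collections import defaultdict


def accuracy_by_domain(items, predictions):
    """Exact-match accuracy per regulatory domain, over MCQ items only.

    items:       iterable of dicts following the schema above
    predictions: dict mapping item id -> predicted answer letter (hypothetical format)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        if item.get("choices") is None:
            continue  # open-ended items need a different scoring method
        domain = item["domain"]
        total[domain] += 1
        # A missing prediction counts as incorrect.
        if predictions.get(item["id"]) == item["answer"]:
            correct[domain] += 1
    return {domain: correct[domain] / total[domain] for domain in total}


if __name__ == "__main__":
    # Toy items with made-up ids, only to show the call shape.
    items = [
        {"id": "x-1", "domain": "osfi_e23", "choices": ["A) ...", "B) ..."], "answer": "B"},
        {"id": "x-2", "domain": "osfi_e23", "choices": ["A) ...", "B) ..."], "answer": "A"},
        {"id": "x-3", "domain": "fintrac", "choices": ["A) ...", "B) ..."], "answer": "A"},
    ]
    print(accuracy_by_domain(items, {"x-1": "B", "x-2": "B", "x-3": "A"}))
```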

Dataset Creation

Source Data

All items are grounded in primary regulatory documents:

Source	Version	URL
OSFI Guideline E-23	September 2025	osfi-bsif.gc.ca
OSFI Guideline B-20	November 2023	osfi-bsif.gc.ca
OSFI CAR Guideline	2026	osfi-bsif.gc.ca
PCMLTFA / FINTRAC	2025 amendments	fintrac-canafe.gc.ca
IFRS 9 (OSFI advisory)	2017/2024	osfi-bsif.gc.ca
PIPEDA	2024	priv.gc.ca
Quebec Law 25	September 2023	legisquebec.gouv.qc.ca
CASL	2014 (as amended)	fightspam.gc.ca
AMF AI Guideline (draft)	July 2025	lautorite.qc.ca

Annotation Process

Items were created and validated by the dataset author (MPS Analytics, Applied Machine Intelligence, Northeastern University) against primary regulatory text. Each item:

Is grounded in a specific, cited clause of the primary regulatory document
Has been cross-checked against at least one secondary source (law firm commentary, OSFI FAQs)
Includes a detailed explanation that can serve as a teaching document

Personal and Sensitive Information

This dataset contains no personal information. All scenarios are synthetic and constructed from public regulatory documents.

Considerations for Using the Data

Social Impact

CanFinBench aims to improve the reliability and safety of AI systems deployed in Canadian financial services — a domain where errors can cause material harm to consumers, financial stability, and regulatory compliance. By establishing a public standard, we hope to:

Enable transparent benchmarking of LLMs for regulated financial use cases
Support Canadian banks in OSFI E-23 compliance
Advance research on compliance reasoning in LLMs

Discussion of Biases

Items reflect Canadian regulatory frameworks as of the dataset version date. International frameworks (US, EU, UK) are out of scope for v0.1.
Regulatory guidance evolves; items may become outdated as OSFI/FINTRAC/AMF update their guidelines.
The current dataset is English-dominant; the French split will be expanded in v0.2.

Other Known Limitations

v0.1 covers 50 items — sufficient for development but not for statistically robust benchmarking. Target for v1.0 is 500+ items across all domains.
Task C (compliance-drift red-teaming) is the most novel task type and has the fewest items in v0.1; this will be the primary expansion in v0.2.
The private held-out test set for the official leaderboard is maintained separately and not released publicly.
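To make the sample-size limitation concrete, the sketch below computes a 95% Wilson score interval for an observed accuracy at several item counts. The Wilson interval is a standard choice for binomial proportions and is used here as an illustrative assumption, not as the benchmark's reporting method. The 70% accuracy figure is a made-up example value.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Same hypothetical 70% accuracy at three sizes: the public test split (10),
    # the whole v0.1 dataset (50), and the v1.0 target (500).
    for correct, total in [(7, 10), (35, 50), (350, 500)]:
        low, high = wilson_interval(correct, total)
        print(f"n={total:>3}: {low:.3f} to {high:.3f} (width {high - low:.3f})")
```

With 7 of 10 items correct, the interval runs from roughly 0.40 to 0.89, so two models that differ by 20 accuracy points cannot be reliably separated on the 10-item test split. At 500 items the width shrinks to about 0.08.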

Additional Information

Dataset Curators

Christopher Crilly Pienaah, MPS Analytics (Applied Machine Intelligence), Northeastern University (2026)

Portfolio: https://chris-pienaah-portfolio.vercel.app
GitHub: https://github.com/CrillyPienaah
LinkedIn: https://linkedin.com/in/christopher-crilly-pienaah

Licensing Information

This dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

You are free to share and adapt the material for any purpose, provided you give appropriate credit, provide a link to the license, and indicate if changes were made.

Citation Information

@dataset{pienaah2026canfinbench,
  author    = {Pienaah, Christopher Crilly},
  title     = {CanFinBench: Canadian Financial Regulatory LLM Benchmark},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/CrillyPienaah/CanFinBench},
  version   = {0.1.0},
  license   = {CC BY 4.0},
  note      = {First public benchmark for evaluating LLMs on Canadian financial regulatory knowledge. Covers OSFI E-23, FINTRAC/PCMLTFA, B-20, IFRS 9, Basel III, PIPEDA, and CASL.}
}

Contributions

Contributions, corrections, and domain expansions are welcome. Please open an issue or pull request on GitHub.

To contribute items, please follow the item schema above and ensure every item includes:

A specific primary regulatory source citation
A detailed explanation
Validation by a domain expert

Version History

Version	Date	Changes
0.1.0	June 2026	Initial release — 50 items across Task A/B/C, OSFI E-23, FINTRAC, B-20
0.2.0	Q3 2026 (planned)	IFRS 9 + Basel III domains; French split; expanded to 200 items
1.0.0	Q4 2026 (planned)	Full 500+ items; private leaderboard test set; arXiv paper
