CrillyPienaah/CanFinBench

---
license: cc-by-4.0
language:
- en
- fr
pretty_name: CanFinBench
tags:
- finance
- regulation
- compliance
- canada
- osfi
- llm-evaluation
- benchmark
- model-risk
- banking
- fintrac
- ifrs9
- basel-iii
task_categories:
- question-answering
- text-classification
task_ids:
- multiple-choice-qa
- open-domain-qa
size_categories:
- n<1K
annotations_creators:
- expert-generated
language_creators:
- expert-generated
multilinguality:
- monolingual
- translation
source_datasets:
- original
paperswithcode_id: canfinbench
dataset_info:
  features:
  - name: id
    dtype: string
  - name: task_type
    dtype: string
  - name: domain
    dtype: string
  - name: difficulty
    dtype: string
  - name: question
    dtype: string
  - name: choices
    sequence: string
  - name: answer
    dtype: string
  - name: explanation
    dtype: string
  - name: regulatory_source
    dtype: string
  - name: regulatory_section
    dtype: string
  - name: language
    dtype: string
  - name: version
    dtype: string
  splits:
  - name: train
    num_bytes: 180000
    num_examples: 40
  - name: test
    num_bytes: 45000
    num_examples: 10
  download_size: 225000
  dataset_size: 225000
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train.jsonl
  - split: test
    path: data/test.jsonl
---

CanFinBench: Canadian Financial Regulatory LLM Benchmark


Dataset Description

CanFinBench is the first public benchmark for evaluating large language models on Canadian financial regulatory knowledge, compliance reasoning, and model-governance judgment. It is designed to test the specific capabilities that Canadian federally regulated financial institutions (FRFIs) require from AI systems deployed under OSFI Guideline E-23 (Model Risk Management, in force May 1, 2027).

Why CanFinBench?

Existing financial LLM benchmarks (FinQA, PIXIU/FinBen, FinEval, CNFinBench) focus on US SEC filings, Chinese regulations, or general numerical reasoning. No public benchmark encodes Canadian regulatory frameworks. Yet by May 2027, every Canadian bank, insurer, and trust company must validate AI models under OSFI E-23 — creating an urgent, unmet need for standardized evaluation.

CanFinBench fills this gap by encoding:

  • OSFI Guideline E-23 — Model risk management, AI governance, lifecycle requirements
  • FINTRAC/PCMLTFA — AML/KYC, suspicious transaction reasoning
  • OSFI B-20 — Mortgage stress test, MQR, LTI limits
  • IFRS 9 ECL — Expected credit loss staging, Canadian implementation
  • Basel III / OSFI CAR — Capital adequacy, Canadian output floor deferral
  • PIPEDA / Quebec Law 25 — Privacy obligations for AI systems
  • CASL — AI-driven marketing compliance

Key Design Principles

  1. Compliance-first, not trivia-first. Models often score well on regulatory recall questions ("What is the MQR?") but tend to fail on compliance reasoning ("Given this drift scenario, classify the inherent vs. residual risk"). CanFinBench deliberately weights its items toward the latter.

  2. Three task tiers: MCQ governance reasoning (Task A), scenario-based risk judgment (Task B), and compliance-drift red-teaming (Task C).

  3. Primary source citations. Every item cites the exact guideline clause, section, or statutory provision it tests — enabling auditable, reproducible evaluation.

  4. Bilingual (EN/FR). Canada's officially bilingual context and Quebec's AMF guideline (French-only source text) are represented.

  5. Living benchmark. Items are versioned and refreshed quarterly as OSFI/FINTRAC/AMF guidance evolves — turning regulatory churn into a feature.
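Because the three task tiers probe different abilities, results are likely easier to interpret when reported per `task_type` rather than as one pooled number. The sketch below shows one way to do that. The function name `accuracy_by_task_type`, the `predictions` mapping (item `id` to model answer), and the exact-match comparison are assumptions for illustration; exact match is only sensible for letter-keyed MCQ items, and open-ended gold-standard responses would need a different grader.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by_task_type(items: List[dict], predictions: Dict[str, str]) -> Dict[str, float]:
    """Exact-match accuracy grouped by the item's task_type (illustrative only)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        tier = item["task_type"]
        total[tier] += 1
        # A missing prediction counts as wrong.
        predicted = predictions.get(item["id"], "").strip()
        if predicted == item["answer"].strip():
            correct[tier] += 1
    return {tier: correct[tier] / total[tier] for tier in total}
```

Keeping the tiers separate avoids a strong Task A score masking weak Task B or Task C performance.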

Dataset Homepage


Dataset Structure

Data Instances

Each instance is a JSON object with the following fields:

{
  "id": "cfb-e23-001",
  "task_type": "mcq_governance",
  "domain": "osfi_e23",
  "difficulty": "hard",
  "question": "A federally regulated bank is deploying an autonomous LLM for real-time mortgage pricing...",
  "choices": ["A) The size of the underlying asset portfolio.", "B) The model's level of autonomy...", "C) ...", "D) ..."],
  "answer": "B",
  "explanation": "OSFI E-23 explicitly lists 'level of autonomy' as a qualitative risk-rating factor...",
  "regulatory_source": "OSFI Guideline E-23",
  "regulatory_section": "Section 3.2 — Model Risk Rating",
  "language": "en",
  "version": "0.1.0"
}
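
As an illustration of how an MCQ instance like the one above might be consumed, the following sketch renders an item into a prompt and pulls an answer letter out of a model reply. The names `format_mcq_prompt` and `extract_choice`, the prompt wording, and the regular expression are assumptions made for this example and are not part of CanFinBench. The choice strings are assumed to already carry their letter prefixes, as in the instance shown.

```python
import re
from typing import Optional


def format_mcq_prompt(item: dict) -> str:
    """Render an MCQ item as a plain-text prompt (hypothetical template)."""
    lines = [item["question"], ""]
    # Choices already include their letter prefix, e.g. "A) ...".
    lines.extend(item["choices"])
    lines.append("")
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)


def extract_choice(reply: str) -> Optional[str]:
    """Return the first standalone capital A-D in a reply, or None.

    This is deliberately naive: a reply that starts with the English
    article "A" would be read as choice A.
    """
    match = re.search(r"\b([A-D])\b", reply)
    return match.group(1) if match else None
```

A stricter harness would probably constrain the output format further (for example, requiring the letter on its own line) rather than rely on this kind of pattern matching.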

Data Fields

| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier. Format: cfb-{domain}-{number} |
| task_type | string | One of: mcq_governance, scenario_judgment, compliance_drift |
| domain | string | Regulatory domain: osfi_e23, fintrac, b20, ifrs9, basel3, pipeda, casl |
| difficulty | string | easy, medium, hard, expert |
| question | string | The question or scenario prompt |
| choices | list[string] | Answer choices for MCQ items (null for open-ended) |
| answer | string | Correct answer key (A/B/C/D) or gold-standard response |
| explanation | string | Detailed explanation citing the regulatory source |
| regulatory_source | string | Primary regulatory document |
| regulatory_section | string | Specific section/clause |
| language | string | en or fr |
| version | string | Dataset version when item was added |
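
The field constraints listed above lend themselves to a simple automated check. The sketch below is one possible validator, not an official tool: `validate_item`, the `ALLOWED` table, and `ID_PATTERN` are hypothetical names, the id pattern is a loose reading of `cfb-{domain}-{number}` that accepts ids such as `cfb-e23-001`, and the rule that any item carrying `choices` must use an A-D answer key is an assumption.

```python
import re
from typing import List

REQUIRED_FIELDS = (
    "id", "task_type", "domain", "difficulty", "question", "choices",
    "answer", "explanation", "regulatory_source", "regulatory_section",
    "language", "version",
)
ALLOWED = {
    "task_type": {"mcq_governance", "scenario_judgment", "compliance_drift"},
    "domain": {"osfi_e23", "fintrac", "b20", "ifrs9", "basel3", "pipeda", "casl"},
    "difficulty": {"easy", "medium", "hard", "expert"},
    "language": {"en", "fr"},
}
# Assumed reading of "cfb-{domain}-{number}"; accepts "cfb-e23-001".
ID_PATTERN = re.compile(r"^cfb-[a-z0-9_]+-\d+$")


def validate_item(item: dict) -> List[str]:
    """Return a list of problems; an empty list means the item passed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in item]
    if problems:
        return problems
    for name, allowed in ALLOWED.items():
        value = item[name]
        if not isinstance(value, str) or value not in allowed:
            problems.append(f"unexpected {name}: {value!r}")
    if not ID_PATTERN.match(str(item["id"])):
        problems.append(f"unexpected id format: {item['id']!r}")
    choices = item["choices"]
    if choices is not None:
        # Assumption: items that carry choices are answered with a letter key.
        if not isinstance(choices, list) or not all(isinstance(c, str) for c in choices):
            problems.append("choices must be a list of strings or null")
        elif item["answer"] not in {"A", "B", "C", "D"}:
            problems.append(f"answer key not in A-D: {item['answer']!r}")
    return problems
```

Returning a list of messages instead of raising on the first failure makes it easier to review every problem in a contributed item at once.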

Data Splits

| Split | Items | Description |
|---|---|---|
| train | 40 | Development/few-shot examples with full explanations |
| test | 10 | Held-out evaluation set (answers withheld in leaderboard) |

Note: A private held-out test set is maintained separately for the official leaderboard to prevent contamination.
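
For readers working from a local copy of the repository, the sketch below reads the two split files using only the Python standard library. It assumes the files sit at `data/train.jsonl` and `data/test.jsonl` under the repository root, with one JSON object per line; `load_jsonl` and `load_splits` are hypothetical helper names introduced for this example.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_jsonl(path) -> List[dict]:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    items = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                items.append(json.loads(line))
    return items


def load_splits(root) -> Dict[str, List[dict]]:
    """Load the train and test splits from <root>/data/ (assumed layout)."""
    root = Path(root)
    return {
        split: load_jsonl(root / "data" / f"{split}.jsonl")
        for split in ("train", "test")
    }
```

The same files could presumably also be loaded with the Hugging Face `datasets` library via its generic JSON loader (`load_dataset("json", data_files=...)`), which returns typed dataset objects instead of plain dicts.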


Dataset Creation

Source Data

All items are grounded in primary regulatory documents:

| Source | Version | URL |
|---|---|---|
| OSFI Guideline E-23 | September 2025 | osfi-bsif.gc.ca |
| OSFI Guideline B-20 | November 2023 | osfi-bsif.gc.ca |
| OSFI CAR Guideline | 2026 | osfi-bsif.gc.ca |
| PCMLTFA / FINTRAC | 2025 amendments | fintrac-canafe.gc.ca |
| IFRS 9 (OSFI advisory) | 2017/2024 | osfi-bsif.gc.ca |
| PIPEDA | 2024 | priv.gc.ca |
| Quebec Law 25 | September 2023 | legisquebec.gouv.qc.ca |
| CASL | 2014 (as amended) | fightspam.gc.ca |
| AMF AI Guideline (draft) | July 2025 | lautorite.qc.ca |

Annotation Process

Items were created and validated by the dataset author (MPS Analytics, Applied Machine Intelligence, Northeastern University) against primary regulatory text. Each item:

  1. Is grounded in a specific, cited clause of the primary regulatory document
  2. Has been cross-checked against at least one secondary source (law firm commentary, OSFI FAQs)
  3. Includes a detailed explanation that can serve as a teaching document

Personal and Sensitive Information

This dataset contains no personal information. All scenarios are synthetic and constructed from public regulatory documents.


Considerations for Using the Data

Social Impact

CanFinBench aims to improve the reliability and safety of AI systems deployed in Canadian financial services — a domain where errors can cause material harm to consumers, financial stability, and regulatory compliance. By establishing a public standard, we hope to:

  • Enable transparent benchmarking of LLMs for regulated financial use cases
  • Support Canadian banks in OSFI E-23 compliance
  • Advance research on compliance reasoning in LLMs

Discussion of Biases

  • Items reflect Canadian regulatory frameworks as of the dataset version date. International frameworks (US, EU, UK) are out of scope for v0.1.
  • Regulatory guidance evolves; items may become outdated as OSFI/FINTRAC/AMF update their guidelines.
  • The current dataset is English-dominant; the French split will be expanded in v0.2.

Other Known Limitations

  • v0.1 covers 50 items — sufficient for development but not for statistically robust benchmarking. Target for v1.0 is 500+ items across all domains.
  • Task C (compliance-drift red-teaming) is the most novel task type and has the fewest items in v0.1; this will be the primary expansion in v0.2.
  • The private held-out test set for the official leaderboard is maintained separately and not released publicly.
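
To make the sample-size limitation concrete, the sketch below computes a Wilson score interval for an observed accuracy. The choice of the Wilson interval and the `wilson_interval` helper are assumptions made for this illustration, not a method prescribed by CanFinBench; `z = 1.96` corresponds to an approximate 95% confidence level.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (correct / total)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half_width, centre + half_width


# 8 of 10 correct on the public test split.
low, high = wilson_interval(8, 10)
print(f"{low:.2f} to {high:.2f}")
```

For 8 correct answers out of 10, the interval spans roughly 0.49 to 0.94, so two models that differ by one or two items on the 10-item test split cannot be meaningfully separated. At the planned 500+ items the same observed accuracy would yield a much narrower interval.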

Additional Information

Dataset Curators

Christopher Crilly Pienaah, MPS Analytics (Applied Machine Intelligence), Northeastern University (2026)

Licensing Information

This dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

You are free to share and adapt the material for any purpose, provided you give appropriate credit, provide a link to the license, and indicate if changes were made.

Citation Information

@dataset{pienaah2026canfinbench,
  author    = {Pienaah, Christopher Crilly},
  title     = {CanFinBench: Canadian Financial Regulatory LLM Benchmark},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/CrillyPienaah/CanFinBench},
  version   = {0.1.0},
  license   = {CC BY 4.0},
  note      = {First public benchmark for evaluating LLMs on Canadian financial regulatory knowledge. Covers OSFI E-23, FINTRAC/PCMLTFA, B-20, IFRS 9, Basel III, PIPEDA, and CASL.}
}

Contributions

Contributions, corrections, and domain expansions are welcome. Please open an issue or pull request on GitHub.

To contribute items, please follow the item schema above and ensure every item includes:

  • A specific primary regulatory source citation
  • A detailed explanation
  • Expert validation

Version History

| Version | Date | Changes |
|---|---|---|
| 0.1.0 | June 2026 | Initial release: 50 items across Task A/B/C, OSFI E-23, FINTRAC, B-20 |
| 0.2.0 | Q3 2026 (planned) | IFRS 9 + Basel III domains; French split; expanded to 200 items |
| 1.0.0 | Q4 2026 (planned) | Full 500+ items; private leaderboard test set; arXiv paper |
