Merged
34 commits
491bd0d
feat: add example notebooks for common research workflows (issue #9)
novis10813 Feb 12, 2026
5abe8f7
fix: auto-prepare data in FactorAnalyzer methods
novis10813 Feb 12, 2026
2fb6c6d
fix: correct notebook column references and API usage
novis10813 Feb 13, 2026
29a5c78
docs: document safe_* operations semantics and add regression tests
novis10813 Feb 13, 2026
4f44bcd
fix: align daily processing boundaries to UTC midnight
novis10813 Feb 13, 2026
5a4f7b7
Merge pull request #31 from novis10813/feat/safe-operations-docs-tests
lilinoct18-coder Feb 13, 2026
c309ef5
Merge pull request #30 from novis10813/feat/example-notebooks
lilinoct18-coder Feb 13, 2026
ef7ddc3
refactor(data): extract date range calculation to shared utility func…
novis10813 Feb 13, 2026
856878f
Merge pull request #32 from novis10813/fix/daily-boundary-alignment
lilinoct18-coder Feb 13, 2026
35756a2
restore(universe): re-apply checklist pipeline after temporary revert
novis10813 Feb 13, 2026
d258c88
Merge pull request #33 from novis10813/restore/universe-checklist
lilinoct18-coder Feb 13, 2026
5859162
docs(universe): add user guide and integrate mask workflow
novis10813 Feb 13, 2026
d5faaae
docs(examples): add universe checklist workflow notebook
novis10813 Feb 14, 2026
d385f98
Merge pull request #35 from novis10813/feat/universe-examples
lilinoct18-coder Feb 14, 2026
f120634
Merge pull request #34 from novis10813/feat/universe-docs
lilinoct18-coder Feb 14, 2026
4538204
test(data): migrate cache tests to storage backend API
novis10813 Feb 14, 2026
22fe3d2
fix(data): align date range to UTC midnight
novis10813 Feb 14, 2026
a02281e
fix(universe): use run_async wrappers and warn on full tag fetch
novis10813 Feb 14, 2026
a8b82d3
fix(analyzer): re-prepare data when requested periods are missing
novis10813 Feb 14, 2026
f14a061
refactor(backtest): remove redundant mask reapplication
novis10813 Feb 14, 2026
5b5b2d2
fix(data): make explicit end_date inclusive by day
novis10813 Feb 14, 2026
df90c68
fix(universe): exclude missing listing dates in MinListingAge
novis10813 Feb 14, 2026
e37420b
fix(universe): require symbols and handle CoinGecko symbol collisions
novis10813 Feb 14, 2026
a255541
test(factors): normalize minute literal style in safe ops helper
novis10813 Feb 14, 2026
65641be
Merge pull request #36 from novis10813/chore/recover-tests-data-stash
lilinoct18-coder Feb 16, 2026
d759e39
fix: put coingecko url into constants.py
novis10813 Feb 16, 2026
28a2244
fix(test): correct mock target and missing imports in universe tests
novis10813 Feb 16, 2026
696112d
Merge pull request #37 from novis10813/fix/gemini-review-fixes
lilinoct18-coder Feb 17, 2026
1ed3410
chore(release): bump version to 0.4.0
novis10813 Feb 17, 2026
b9aace4
chore(release): add CHANGELOG for v0.4.0
novis10813 Feb 17, 2026
8b1c00a
docs(release): add 0.4.0 release notes
novis10813 Feb 17, 2026
c0e1990
Merge remote-tracking branch 'origin/main' into release/0.4.0
novis10813 Feb 17, 2026
a25e6b4
chore(lock): regenerate uv.lock for Python 3.13 (fix s3transfer parse)
novis10813 Feb 17, 2026
0e7a830
fix: restore dev code and fix pandas 3.0 timestamp compatibility
novis10813 Feb 17, 2026
12 changes: 8 additions & 4 deletions AGENTS.md
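@@ -15,6 +15,9 @@ Factorium is a quantitative factor analysis and backtesting framework. Main modules:

## `safe_` function pattern

+ > **Full documentation**: see [`docs/dev/safe-operations.md`](docs/dev/safe-operations.md)
+ > **Regression tests**: [`tests/factors/test_safe_operations.py`](tests/factors/test_safe_operations.py)
+

In this project, functions prefixed with `safe_` (for example `safe_mean`, `safe_sum`, `safe_div`) are designed to keep calculations strict and safe, which is especially important when computing financial factors.

### Common characteristics
@@ -28,9 +31,9 @@ Factorium is a quantitative factor analysis and backtesting framework. Main modules:
* **Data sufficiency check**: for example, `safe_corr` confirms before computing that there are enough valid data points (e.g. more than 2).

3. **safe_div consistency rules**:
- * **Threshold**: use `POSITION_EPSILON` (`1e-10`) to detect a denominator close to 0.
- * **Missing-value return**: the Pandas path returns `np.nan`; the Polars path returns `null` (`pl.lit(None)` is recommended).
- * **Semantics**: a denominator of 0, or one with `abs(denominator) <= POSITION_EPSILON`, is treated as missing to avoid producing `inf`.
+ * **Threshold**: use `EPSILON` (`1e-10`, defined in `factorium.constants`) to detect a denominator close to 0. `POSITION_EPSILON` is a backward-compatible alias.
+ * **Missing-value return**: the Pandas/NumPy path returns `np.nan`; the Polars path returns `null` (using `pl.lit(None)`).
+ * **Semantics**: a denominator of 0, or one with `abs(denominator) <= EPSILON`, is treated as missing to avoid producing `inf`.

### Examples

@@ -71,7 +74,8 @@ docs/
│   └── backtest.md
└── dev/ # developer docs
    ├── testing.md
-   └── regression-operators.md
+   ├── regression-operators.md
+   └── safe-operations.md
```

### Local preview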
27 changes: 27 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,27 @@
# Changelog

All notable changes to this project will be documented in this file.

## [0.4.0] - 2026-02-17
### Added
- VectorizedBacktester: standardized `signal -> exposure -> weight` pipeline and portfolio schemes (market‑neutral, long‑only, top‑N patterns). (see #5)
- Factor analysis: multi‑horizon IC decay and flexible targets for `FactorAnalyzer` / reports. (see #4)
- Factor correlation utilities and clustering analysis (correlation matrix + visualizations). (see #6)
- Factor orthogonalization utilities (`cs_neutralize` / residual‑based orthogonalization). (see #7)
- Additional backtest metrics: Sortino ratio, Calmar ratio, win rate and improved metrics handling.
- New example notebooks demonstrating multi‑factor workflows and orthogonalization (`examples/04_multi_factor_combination.ipynb`).
- Extensive unit and integration tests for backtest, factor ops, and Polars paths.

### Changed
- `Backtester` is now an alias for `VectorizedBacktester` (Polars‑based implementation).
- Internal refactors and Polars migration improvements for TS/CS operators and analyzer.

### Fixed
- Various bug fixes and test stabilizations across data loading and backtest path.

### Notes
- Backward compatibility: No breaking API changes expected for typical user workflows. See `docs/dev/migration-guide.md` for migration notes if you rely on internal/edge APIs.

---

(Full changelog & commit list available in the release PR.)
250 changes: 250 additions & 0 deletions docs/dev/safe-operations.md
@@ -0,0 +1,250 @@
# Safe Operations Semantics

This document formalizes the **safe operation semantics** used throughout Factorium for
numerical safety in factor calculations and backtesting. These conventions ensure
deterministic, reproducible results and prevent silent errors from corrupting financial signals.

---

## Core Principles

1. **Strict NaN Propagation** — Any `NaN` (or `null` in Polars) in the input window
causes the entire window result to be `NaN`/`null`.
2. **Division Safety** — Division by values within `EPSILON` of zero returns `NaN`/`null`
instead of `inf`.
3. **Window Completeness** — Rolling operations require a full window; partial windows
produce `NaN`/`null`.

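These rules are intentionally **stricter** than common library defaults. For example,
Pandas reductions such as `Series.mean()` skip `NaN` unless `skipna=False` is passed,
Polars aggregations ignore `null` values, and floating-point division by zero in NumPy
yields `inf` rather than a missing value. In quantitative finance, silently ignoring
missing data can produce misleading signals, so Factorium opts for explicit failure.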
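
To make the contrast concrete, here is a minimal pure-Python sketch comparing a NaN-skipping mean with the strict rule described above. It is an illustration only, not Factorium code; `skipna_mean` and `strict_mean` are hypothetical names.

```python
import math

def skipna_mean(values):
    """Mean over the non-NaN values only (the lenient behaviour)."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan

def strict_mean(values):
    """Mean that becomes NaN as soon as any input is NaN (the strict rule)."""
    if not values or any(math.isnan(v) for v in values):
        return math.nan
    return sum(values) / len(values)

window = [1.0, math.nan, 3.0]
lenient = skipna_mean(window)  # the gap is silently ignored
strict = strict_mean(window)   # the gap is surfaced as NaN
```

With the lenient version the missing observation simply disappears from the average, which is the kind of silent behaviour the strict rules are meant to rule out.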

---

## Constants

All numeric thresholds are defined in `factorium.constants`:

| Constant | Value | Purpose |
|----------|-------|---------|
| `EPSILON` | `1e-10` | Near-zero threshold for safe division and degenerate-case detection |
| `POSITION_EPSILON` | `EPSILON` (alias) | Legacy alias used in `backtest.utils`; identical to `EPSILON` |
| `MIN_PERIODS_PER_YEAR` | `1.0` | Lower bound for `periods_per_year` in metrics |
| `MAX_PERIODS_PER_YEAR` | `525960.0` | Upper bound (minute-level data) |

```python
# factorium/constants.py
EPSILON = 1e-10
```
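
The upper bound appears to correspond to the number of minutes in a 365.25-day year. The sketch below checks that arithmetic and shows one way a value could be forced into the documented range. The helper `clamp_periods_per_year` is hypothetical; whether Factorium clamps, warns, or raises on out-of-range values is not stated here.

```python
MIN_PERIODS_PER_YEAR = 1.0
MAX_PERIODS_PER_YEAR = 525960.0

# Minutes in a 365.25-day year: 365.25 days * 24 hours * 60 minutes.
minutes_per_year = 365.25 * 24 * 60

def clamp_periods_per_year(value):
    """Hypothetical helper: force a value into the documented bounds."""
    return max(MIN_PERIODS_PER_YEAR, min(MAX_PERIODS_PER_YEAR, value))
```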

---

## Safe Division

The safe division pattern appears in three contexts in the codebase:

### 1. `backtest.utils.safe_divide(a, b, default=np.nan)`

A general-purpose safe division function supporting scalar, NumPy array, and Pandas Series inputs.

**Rules:**
- If `b` is `NaN` → return `default` (default: `np.nan`)
- If `|b| <= EPSILON` → return `default`
- Otherwise → return `a / b`

```python
import numpy as np
from factorium.backtest.utils import safe_divide

safe_divide(1.0, 0.0) # → np.nan
safe_divide(1.0, 1e-15) # → np.nan (within EPSILON)
safe_divide(1.0, 2.0) # → 0.5
safe_divide(1.0, np.nan) # → np.nan
```

**Supported input types:**

| Type of `b` | Near-zero detection | NaN detection |
|-------------|-------------------|---------------|
| Scalar (`float`, `int`) | `abs(b) <= EPSILON` | `np.isnan(b)` |
| `np.ndarray` | `np.abs(b) <= EPSILON` | `np.isnan(b)` |
| `pd.Series` | `b.abs() <= EPSILON` | `b.isna()` |
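
The rules above can be sketched for the scalar case in a few lines of pure Python. This is a simplified model under the documented rules, not the actual `safe_divide` source; `safe_divide_scalar` is a hypothetical name and the array and Series branches are omitted.

```python
import math

EPSILON = 1e-10  # mirrors the documented value in factorium.constants

def safe_divide_scalar(a, b, default=math.nan):
    """Scalar-only sketch of the rules listed above."""
    if math.isnan(b):      # NaN denominator
        return default
    if abs(b) <= EPSILON:  # zero or near-zero denominator (boundary inclusive)
        return default
    return a / b
```

Passing a different `default` (for example `0.0`) changes what is returned for the two guarded cases without touching the normal division path.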

### 2. Factor Division (`Factor.__truediv__`)

The `Factor / Factor` and `Factor / scalar` operations use Polars expressions with the
same EPSILON threshold:

```python
import polars as pl
from factorium.constants import EPSILON

# Polars path (Factor / Factor)
factor_div = (
    pl.when(pl.col("other").abs() <= EPSILON)
    .then(pl.lit(None))  # → null (Polars' missing-value marker)
    .otherwise(pl.col("factor") / pl.col("other"))
)

# Polars path (Factor / scalar); `other` is the scalar divisor
scalar_div = (
    pl.when(pl.lit(other).abs() <= EPSILON)
    .then(pl.lit(None))
    .otherwise(pl.col("factor") / pl.lit(other))
)
```

**Key difference:** The Polars path returns `null` (not `NaN`) for near-zero denominators.
This is consistent with Polars conventions where `null` represents missing data.
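
As a rough model of the `null` behaviour, the pure-Python sketch below uses `None` to stand in for Polars `null`. It is not the Factor implementation; `divide_with_null` is a hypothetical name, and treating a missing numerator as missing output is an assumption.

```python
EPSILON = 1e-10

def divide_with_null(numerators, denominators):
    """Element-wise division; near-zero or missing denominators give None."""
    out = []
    for num, den in zip(numerators, denominators):
        if num is None or den is None or abs(den) <= EPSILON:
            out.append(None)  # missing, not NaN and not inf
        else:
            out.append(num / den)
    return out

result = divide_with_null([1.0, 2.0, 3.0], [2.0, 0.0, None])
```

The guarded positions come back as `None` rather than as a float `NaN`, so downstream code has to test for missingness (`is None` here, `is_null()` in Polars) instead of using a NaN check.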

### 3. `MathOpsMixin.inverse()`

```python
# 1 / factor with safe division
inverse_expr = (
    pl.when(pl.col("factor").abs() <= EPSILON)
    .then(pl.lit(None))
    .otherwise(1 / pl.col("factor"))
)
```

---

## Strict NaN Propagation in Rolling Operations

All time-series operations (`ts_*`) follow **strict NaN propagation**: if any value within
the rolling window is `NaN`/`null`, or if the window is not full, the result is `NaN`/`null`.

### Mechanism

Polars rolling functions control this via the `min_samples` parameter:

```python
# min_samples=window yields null until the window is full
pl.col("factor").rolling_mean(window_size=window, min_samples=window).over("symbol")
```

When `min_samples == window_size`, Polars will return `null` if:
- The window has fewer than `window` non-null values
- Any value in the window is `null`
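
The effect of `min_samples == window_size` can be modelled in pure Python, with `None` standing in for `null`. This is a sketch of the semantics only, not of how Polars computes rolling statistics; `strict_rolling_mean` is a hypothetical name.

```python
def strict_rolling_mean(values, window):
    """Rolling mean that yields None unless the window is full and has no None."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        if len(chunk) < window or any(v is None for v in chunk):
            out.append(None)
        else:
            out.append(sum(chunk) / window)
    return out

series = [1.0, 2.0, 3.0, None, 5.0, 6.0, 7.0]
means = strict_rolling_mean(series, window=3)
```

A single missing value blanks out every window that contains it, so with `window=3` one `None` removes three consecutive results.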

### Operations Using This Pattern

| Operation | Polars Function | EPSILON Check |
|-----------|----------------|---------------|
| `ts_mean` | `rolling_mean(min_samples=window)` | No |
| `ts_std` | `rolling_std(min_samples=window)` | No |
| `ts_sum` | `rolling_sum(min_samples=window)` | No |
| `ts_min` | `rolling_min(min_samples=window)` | No |
| `ts_max` | `rolling_max(min_samples=window)` | No |
| `ts_median` | `rolling_median(min_samples=window)` | No |
| `ts_kurtosis` | `rolling_kurtosis(min_samples=window)` | No |
| `ts_skewness` | `rolling_skew(min_samples=window)` | No |
| `ts_rank` | `rolling_rank(min_samples=window)` | Yes (constant std check) |
| `ts_scale` | min/max + division | Yes (range <= EPSILON) |
| `ts_zscore` | mean/std + division | Yes (std <= EPSILON) |
| `ts_corr` | manual cov / (std_x × std_y) | Yes (either std <= EPSILON) |
| `ts_beta` | manual cov / var | Yes (var <= EPSILON) |
| `ts_cv` | std / \|mean\| | Yes (adds 1e-10 bias term) |

### Explicit NaN-in-Window Mask

For operations requiring EPSILON checks, an explicit NaN mask is computed:

```python
nan_in_window = (
    (pl.col("factor").is_null() | pl.col("factor").is_nan())
    .cast(pl.Int64)
    .rolling_max(window_size=window, min_samples=window)
    .over("symbol")
    .fill_null(1)  # Treat incomplete windows as having NaN
)
```

This mask is `> 0` if **any** value in the window is `NaN` or `null`, or if the window is
not fully populated. Result computation then uses:

```python
pl.when(nan_in_window > 0).then(pl.lit(None)).otherwise(computed_expr)
```
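
The mask logic can be traced by hand with a small pure-Python model for a single symbol. This is a sketch under the description above, not Factorium code; `nan_in_window_mask` is a hypothetical name and `None` stands in for `null`.

```python
import math

def nan_in_window_mask(values, window):
    """1 if the trailing window is incomplete or holds None/NaN, else 0."""
    flags = [1 if v is None or math.isnan(v) else 0 for v in values]
    mask = []
    for i in range(len(flags)):
        if i + 1 < window:
            mask.append(1)  # incomplete window, mirrors fill_null(1)
        else:
            mask.append(max(flags[i - window + 1: i + 1]))
    return mask

values = [1.0, math.nan, 3.0, 4.0, 5.0]
mask = nan_in_window_mask(values, window=2)
# Apply the mask the same way as the when/then/otherwise expression.
guarded = [None if m > 0 else v for m, v in zip(mask, values)]
```

Here the first position is masked because the window is incomplete, and the next two are masked because their windows contain the `NaN`.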

---

## Strict NaN Propagation in Cross-Sectional Operations

Cross-sectional operations (`cs_*`) apply a **strict NaN mask across the entire
cross-section** at each time step:

```python
# If ANY symbol has NaN at time t, ALL symbols get NaN at time t
nan_mask = (pl.col("factor").is_null() | pl.col("factor").is_nan()).any().over("end_time")
```

### Operations Using This Pattern

| Operation | EPSILON Check | Special Handling |
|-----------|---------------|------------------|
| `cs_rank` | No | Returns rank / count |
| `cs_zscore` | No (std=0 → ±inf, but caught by NaN mask) | — |
| `cs_demean` | No | — |
| `cs_winsorize` | No | Clips to quantile bounds |
| `cs_neutralize` | Yes (var_x < EPSILON → null) | OLS regression |
| `cs_mean` / `cs_median` | No | — |
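
To illustrate the all-or-nothing mask, the following pure-Python sketch demeans values per timestamp and blanks the whole cross-section when any value is missing. It is a simplified model, not the `cs_demean` implementation; `strict_cs_demean` and the tuple layout are assumptions made for the example.

```python
import math
from collections import defaultdict

def strict_cs_demean(rows):
    """rows: list of (end_time, symbol, value) tuples.

    Returns demeaned values, or None for every symbol at a timestamp
    where any value is None or NaN.
    """
    by_time = defaultdict(list)
    for end_time, _symbol, value in rows:
        by_time[end_time].append(value)

    out = []
    for end_time, symbol, value in rows:
        group = by_time[end_time]
        if any(v is None or math.isnan(v) for v in group):
            out.append((end_time, symbol, None))
        else:
            out.append((end_time, symbol, value - sum(group) / len(group)))
    return out

rows = [
    ("t1", "BTC", 1.0), ("t1", "ETH", 3.0),
    ("t2", "BTC", 2.0), ("t2", "ETH", math.nan),
]
demeaned = strict_cs_demean(rows)
```

At `t2` the valid BTC value is discarded along with the missing ETH value, which is the trade-off this strict mask makes.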

---

## Degenerate-Case Handling

Beyond NaN propagation and division safety, specific operations have additional
degenerate-case guards:

| Operation | Degenerate Condition | Result |
|-----------|---------------------|--------|
| `ts_rank` | `std < EPSILON` (all values identical) | `null` |
| `ts_scale` | `max - min <= EPSILON` (no range) | `null` |
| `ts_zscore` | `std <= EPSILON` (no variance) | `null` |
| `ts_corr` | `std_x <= EPSILON` or `std_y <= EPSILON` | `null` |
| `ts_beta` | `var_x <= EPSILON` | `null` |
| `cs_neutralize` | `var_x <= EPSILON` | `null` |
| `cs_neutralize` (engine) | `std(x) < EPSILON` | `NaN` (NumPy path) |
| `inverse()` | `|factor| <= EPSILON` | `null` |
| `log()` | `factor <= 0` | `null` |
| `sqrt()` | `factor <= 0` | `null` |
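
As an illustration of the guard pattern in the table above, here is a minimal pure-Python sketch of a z-score with the `std <= EPSILON` check. It is not Factorium's implementation: `guarded_zscore` is a hypothetical name, and both the sample standard deviation and the choice of the last value in the window are assumptions.

```python
import statistics

EPSILON = 1e-10

def guarded_zscore(window_values):
    """Z-score of the last value in the window, or None when std <= EPSILON."""
    mean = statistics.fmean(window_values)
    std = statistics.stdev(window_values)  # sample std (ddof=1), an assumption
    if std <= EPSILON:
        return None  # degenerate window: no variance to scale by
    return (window_values[-1] - mean) / std

flat = guarded_zscore([5.0, 5.0, 5.0])
rising = guarded_zscore([1.0, 2.0, 3.0])
```

Without the guard, the constant window would divide zero by zero and leak a `NaN` into the factor instead of an explicit missing value.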

---

## Edge Cases

### Empty Data

- `safe_divide` with empty `pd.Series` → empty `pd.Series`
- `neutralize_weights` with empty input → empty `pd.Series(dtype=float)`
- Factor operations on empty DataFrames → empty result DataFrame

### All NaN Input

- Rolling operations → all `null` output (no valid windows)
- Cross-sectional operations → all `null` output (NaN mask activates)

### Single-Element Window

- `ts_std(window=1)` → always `0.0` (single-value std is 0)
- `ts_corr(window=1)` → all `null` (needs window >= 2)
- `ts_beta(window=1)` → all `null` (needs window >= 2)

### Infinity Handling

Some operations explicitly replace `inf` / `-inf` with `null`:

```python
# ts_scale, ts_zscore
z_expr = pl.when(z_expr.is_finite()).then(z_expr).otherwise(pl.lit(None))

# ts_jumpiness, ts_vr
result._lf = result._lf.with_columns(
    pl.col("factor").replace(float("inf"), None).replace(float("-inf"), None)
)
```
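
The two patterns are not quite equivalent: a finiteness test also rejects `NaN`, while replacing only `inf` and `-inf` leaves `NaN` in place. The pure-Python sketch below models that difference, with `None` standing in for `null`; `keep_finite` and `replace_inf` are hypothetical names, not Factorium functions.

```python
import math

def keep_finite(values):
    """Model of the is_finite() filter: inf, -inf and NaN all become None."""
    return [v if v is not None and math.isfinite(v) else None for v in values]

def replace_inf(values):
    """Model of the replace() filter: only inf and -inf become None."""
    return [None if v is not None and math.isinf(v) else v for v in values]

data = [1.0, math.inf, -math.inf, math.nan]
filtered = keep_finite(data)
replaced = replace_inf(data)
```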

---

## Summary Table

| Pattern | Where Used | Threshold | Missing Value |
|---------|-----------|-----------|---------------|
| Safe division | `safe_divide`, `__truediv__`, `inverse` | `EPSILON` (1e-10) | `NaN` (Pandas/NumPy) / `null` (Polars) |
| Strict NaN propagation (rolling) | All `ts_*` operations | `min_samples=window` | `null` |
| Strict NaN propagation (cross-section) | All `cs_*` operations | `.any().over("end_time")` | `null` |
| Variance/std guard | `ts_corr`, `ts_beta`, `ts_rank`, `cs_neutralize` | `EPSILON` | `null` |
| Range guard | `ts_scale` | `EPSILON` | `null` |
| Infinity filter | `ts_zscore`, `ts_scale`, `ts_jumpiness`, `ts_vr` | `is_finite()` | `null` |
1 change: 1 addition & 0 deletions docs/index.md
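@@ -78,6 +78,7 @@ uv add factorium
| [Quickstart](getting-started/quickstart.md) | Five-minute getting-started tutorial |
| [Data acquisition](getting-started/data-acquisition.md) | Download and load market data |
| [Bar aggregation](user-guide/bar.md) | Different types of bar (candlestick) aggregation |
+ | [Universe and Checklist](user-guide/universe.md) | Build asset-pool masks (Universe / Checklist) and connect them to factors and backtests |
| [Factor](user-guide/factor.md) | Factor computation and operators |
| [Factor analysis](user-guide/analyzer.md) | Analysis tools such as IC and quantile returns |
| [Strategy backtesting](user-guide/backtest.md) | Vectorized backtesting and weight constraints |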