
fix: categorical encoding bug in Delta-method CI (review of #1432)#1444

Open
kf-rahman wants to merge 1 commit into py-why:repo-assist/fix-issue-336-linear-regression-asymptotic-ci-4b5b9900c6c0a820 from kf-rahman:pr-1432-upstream

Conversation

@kf-rahman

Fixes the categorical encoding bug identified in my review of #1432.

Changes

  • _ate_and_se_for_treatment: use self._observed_common_causes.shape[1] instead of len(self._observed_common_causes_names) to count encoded columns (categorical variables expand via one-hot encoding)
  • Derive n_effect_modifiers from len(em_means) where em_means comes from the already-encoded DataFrame
  • Add an assert that catches column ordering mismatches early instead of silently producing wrong CIs
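The column-count mismatch behind the first change can be shown in isolation. A minimal sketch (the column name `W0` and the toy data are invented for illustration): a 3-level categorical expands into k-1 = 2 encoded columns under one-hot encoding with `drop_first=True`, so counting variable names undercounts the encoded width.

```python
import pandas as pd

# A single categorical common cause with 3 levels
df = pd.DataFrame({"W0": ["a", "b", "c", "a"]})
names = list(df.columns)                       # ["W0"], so len(names) == 1

# One-hot encode the way _encode() does (drop_first=True)
encoded = pd.get_dummies(df, drop_first=True)  # columns: W0_b, W0_c

print(len(names))        # 1 -- what len(..._names) counts (the bug)
print(encoded.shape[1])  # 2 -- the actual encoded width (the fix)
```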

Tests added (merged with existing tests in TestLinearRegressionAsymptoticCI)

  • test_ci_no_error_continuous_common_cause
  • test_ci_no_error_categorical_common_cause
  • test_ci_uses_actual_encoded_column_count_not_name_count — regression test for this exact bug
  • test_ci_contains_estimate

All existing tests are preserved.

… add tests

The overall approach is correct and well-structured:
- Correctly identifies the Delta method as the solution: for ATE = c'β,
  Var(ATE) = c'Σc using model.cov_params() from statsmodels
- Correctly uses scipy.stats.t with model.df_resid for finite-sample CIs
- Correctly scales by (treatment_value - control_value) consistent with
  the existing no-modifier code path
- The max(var_ate, 0.0) guard against floating-point negatives is good practice
- _estimate_std_error and _estimate_confidence_intervals are both updated
  consistently via the shared _ate_and_se_for_treatment helper

Bug: _ate_and_se_for_treatment used len(names) to count columns when
building the contrast vector, but categorical variables are one-hot encoded
by _encode() and expand into multiple columns (k-1 columns for k categories,
with drop_first=True). This made interaction_start point at the wrong
coefficient index, silently producing incorrect CIs with no error raised.

Concretely: a 3-level categorical common cause W produces 2 encoded columns,
but len(observed_common_causes_names) = 1, so interaction_start was off by 1,
selecting a confounder dummy coefficient instead of the T·X interaction term.

The same issue affected n_effect_modifiers when effect modifiers are
categorical — len(effect_modifier_names) would undercount encoded columns,
causing the em_means slice to be too short.
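The off-by-one index can be traced through a hypothetical design-matrix layout (column names and the `[intercept, treatment, confounders, interactions]` ordering are invented for this sketch, consistent with the bug description above):

```python
# Layout for a 3-level categorical W and one effect modifier X0
columns = ["const", "T", "W_b", "W_c", "T*X0"]

n_common_causes_wrong = 1  # len(observed_common_causes_names): one name, "W"
n_common_causes_right = 2  # encoded width: W_b and W_c

# Interaction terms begin after the intercept, treatment, and confounders
interaction_start_wrong = 2 + n_common_causes_wrong  # index 3
interaction_start_right = 2 + n_common_causes_right  # index 4

print(columns[interaction_start_wrong])  # W_c  -- a confounder dummy (bug)
print(columns[interaction_start_right])  # T*X0 -- the intended interaction
```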

1. Replace len(self._observed_common_causes_names) with
   self._observed_common_causes.shape[1] to count actual encoded columns

2. Derive n_effect_modifiers from len(em_means) where em_means comes from
   self._effect_modifiers.mean(axis=0).to_numpy() — the already-encoded
   DataFrame — so the count always matches the actual column layout

3. Add an assert that n_params equals the expected total, turning silent
   wrong-index bugs into an immediate, descriptive error if column ordering
   ever changes in _build_features
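The three fixes above can be sketched together with invented stand-ins for the estimator's private attributes (the real fields live on the estimator and are not reproduced here; the `[intercept, treatment, confounders, interactions]` parameter layout is an assumption of this sketch):

```python
import numpy as np
import pandas as pd

# Stand-ins for the already-encoded DataFrames the estimator holds
observed_common_causes = pd.get_dummies(
    pd.DataFrame({"W": ["a", "b", "c", "a"]}), drop_first=True
)
effect_modifiers = pd.get_dummies(
    pd.DataFrame({"X": ["u", "v", "u", "w"]}), drop_first=True
)

# Fix 1: count actual encoded columns, not variable names
n_common_causes = observed_common_causes.shape[1]        # 2 (W_b, W_c)

# Fix 2: derive the modifier count from the encoded means vector
em_means = effect_modifiers.mean(axis=0).to_numpy()
n_effect_modifiers = len(em_means)                       # 2 (X_v, X_w)

# Fix 3: assert the fitted parameter count matches the assumed layout,
# so a future reordering in _build_features fails loudly, not silently
n_params = 1 + 1 + n_common_causes + n_effect_modifiers  # const + T + rest
fitted_params = np.zeros(n_params)                       # stand-in for model.params
assert len(fitted_params) == n_params, (
    "parameter layout mismatch: did _build_features column ordering change?"
)
```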

- test_ci_no_error_continuous_common_cause: baseline, no raise for continuous W
- test_ci_no_error_categorical_common_cause: no raise for 3-level categorical W
- test_ci_uses_actual_encoded_column_count_not_name_count: regression test
  that explicitly verifies shape[1] > len(names) for categorical W and that
  the internal assert passes (proving the right index is used)
- test_ci_contains_estimate: CI brackets the estimated ATE value

All 11 tests pass (7 existing + 4 new).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
github-actions bot added a commit that referenced this pull request Apr 8, 2026
…hod CI

_ate_and_se_for_treatment was computing interaction_start using
len(variable_names) instead of the actual number of encoded columns.
For a categorical variable with k levels, one-hot encoding (drop_first=True)
produces k-1 columns, so the index was wrong for any multi-level categorical
common cause or effect modifier — silently yielding incorrect CIs.

Fixes:
- Use self._observed_common_causes.shape[1] (encoded width) instead of
  len(self._observed_common_causes_names) for n_common_causes
- Use self._effect_modifiers.mean(axis=0).to_numpy() (from encoded DataFrame)
  and derive n_effect_modifiers from its length
- Add an assertion that checks n_params == expected_params to catch any
  future column-ordering regressions loudly rather than silently

Tests added:
- test_ci_no_error_with_categorical_common_cause: verifies a 3-level
  categorical common cause produces valid CIs
- test_ci_uses_encoded_column_count_not_name_count: regression test
  that verifies finite bounds and positive SE for a 4-level categorical
  common cause (the original bug scenario)

Bug reported and fix approach credited to @kf-rahman (PR #1444 / issue #336).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>