[Repo Assist] fix: implement asymptotic CI/SE via Delta method for LinearRegressionEstimator with effect modifiers#1432
Conversation
… in LinearRegressionEstimator (issue #336) The _estimate_confidence_intervals and _estimate_std_error methods in LinearRegressionEstimator previously raised NotImplementedError when effect modifiers were present. Implement the Delta method (Gelman & Hill, ARM Book Ch.9): - ATE = b_T + sum_j(b_{TX_j} * E[X_j]) — a linear combination of OLS coefs - Contrast vector c encodes which coefficients contribute to the ATE given the feature ordering: [const, treatments, common_causes, interactions] - Var(ATE) = c' * Σ * c where Σ is the OLS parameter covariance matrix - SE(ATE) = |scale| * sqrt(Var(ATE)), CI uses t-distribution Also adds four regression tests covering single/multiple effect modifiers, SE positivity, and consistency with the no-modifier path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
… add tests ## What the AI's PR (py-why#1432) got right The overall approach is correct and well-structured: - Correctly identifies the Delta method as the solution: for ATE = c'β, Var(ATE) = c'Σc using model.cov_params() from statsmodels - Correctly uses scipy.stats.t with model.df_resid for finite-sample CIs - Correctly scales by (treatment_value - control_value) consistent with the existing no-modifier code path - max(var_ate, 0.0) guard against floating point negatives is good practice - _estimate_std_error and _estimate_confidence_intervals are both updated consistently via the shared _ate_and_se_for_treatment helper ## What needed fixing Bug: _ate_and_se_for_treatment used len(names) to count columns when building the contrast vector, but categorical variables are one-hot encoded by _encode() and expand into multiple columns (k-1 columns for k categories, with drop_first=True). This made interaction_start point at the wrong coefficient index, silently producing incorrect CIs with no error raised. Concretely: a 3-level categorical common cause W produces 2 encoded columns, but len(observed_common_causes_names) = 1, so interaction_start was off by 1, selecting a confounder dummy coefficient instead of the T·X interaction term. The same issue affected n_effect_modifiers when effect modifiers are categorical — len(effect_modifier_names) would undercount encoded columns, causing the em_means slice to be too short. ## Fixes applied 1. Replace len(self._observed_common_causes_names) with self._observed_common_causes.shape[1] to count actual encoded columns 2. Derive n_effect_modifiers from len(em_means) where em_means comes from self._effect_modifiers.mean(axis=0).to_numpy() — the already-encoded DataFrame — so the count always matches the actual column layout 3. Add an assert that n_params equals the expected total, turning silent wrong-index bugs into an immediate, descriptive error if column ordering ever changes in _build_features ## Tests added (TestLinearRegressionAsymptoticCI) - test_ci_no_error_continuous_common_cause: baseline, no raise for continuous W - test_ci_no_error_categorical_common_cause: no raise for 3-level categorical W - test_ci_uses_actual_encoded_column_count_not_name_count: regression test that explicitly verifies shape[1] > len(names) for categorical W and that the internal assert passes (proving the right index is used) - test_ci_contains_estimate: CI brackets the estimated ATE value All 11 tests pass (7 existing + 4 new). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kf-rahman
left a comment
There was a problem hiding this comment.
Hi — thanks for the automated draft. I reviewed the code and the overall approach is solid, but there is a bug with categorical variables that needs fixing before this can be merged. Here's my full review.
What the PR gets right
The Delta method is the correct approach. For ATE = c'β, Var(ATE) = c'Σc using model.cov_params() from statsmodels is the standard, textbook solution. For OLS it's actually exact, not just an approximation.
Specific things done well:
scipy.stats.twithmodel.df_residfor finite-sample CIs — correct- Scaling by
(treatment_value - control_value)is consistent with the existing no-modifier code path max(var_ate, 0.0)guard against floating point negatives is good defensive coding- Both
_estimate_std_errorand_estimate_confidence_intervalsare updated via the shared_ate_and_se_for_treatmenthelper — clean design
Bug: categorical variables produce silently wrong CIs
_ate_and_se_for_treatment counts common cause and effect modifier columns using variable name counts:
n_common_causes = len(self._observed_common_causes_names) # counts names
n_effect_modifiers = len(self._effect_modifier_names) # counts namesBut _encode() one-hot encodes categorical variables with drop_first=True, so a variable with k levels becomes k-1 columns, not 1. This means interaction_start points at the wrong coefficient index — silently, with no error raised.
Concrete example: a 3-level categorical common cause W produces 2 encoded columns, but len(names) = 1. So interaction_start is off by 1 and grabs a confounder dummy coefficient instead of the T·X interaction term. The same issue applies to categorical effect modifiers.
I verified this with a synthetic dataset:
len(observed_common_causes_names) = 1 ← what the PR uses
observed_common_causes.shape[1] = 2 ← actual encoded columns
interaction_start (buggy): 3 → coefficient 'x3' (a W dummy — wrong)
interaction_start (fixed): 4 → coefficient 'x4' (the T·X term — correct)
Fixes
1. Use encoded column counts instead of name counts:
# Replace:
n_common_causes = len(self._observed_common_causes_names)
n_effect_modifiers = len(self._effect_modifier_names)
em_means = np.asarray(self._effect_modifiers.mean(axis=0))
# With:
n_common_causes = self._observed_common_causes.shape[1] if self._observed_common_causes is not None else 0
em_means = self._effect_modifiers.mean(axis=0).to_numpy()
n_effect_modifiers = len(em_means)2. Add an assert to catch ordering mismatches early (instead of silently wrong CIs):
assert n_params == 1 + n_treatments + n_common_causes + n_treatments * n_effect_modifiers, (
f"Model has {n_params} params but expected "
f"{1 + n_treatments + n_common_causes + n_treatments * n_effect_modifiers}. "
"Column ordering assumption in _ate_and_se_for_treatment may be broken."
)3. Add tests covering categorical common causes — the existing tests only use continuous variables and would not catch this bug. See branch kf-rahman/dowhy:fix/issue-336-categorical-encoding for the full implementation with 4 new tests in TestLinearRegressionAsymptoticCI:
test_ci_no_error_continuous_common_causetest_ci_no_error_categorical_common_causetest_ci_uses_actual_encoded_column_count_not_name_count(regression test for this exact bug)test_ci_contains_estimate
All 11 tests pass (7 existing + 4 new).
The fix is straightforward — happy to help get this merged once the categorical encoding issue is addressed.
|
Hi @kf-rahman thank you for this review of the PR and for catching this implementation bug! Yes, could you push your fix to the branch for this PR? repo-assist/fix-issue-336-linear-regression-asymptotic-ci-4b5b9900c6c0a820 Once you do that, we can run the full suite of tests and merge it in |
|
I don't have write access to push directly to the branch, so I've opened a PR targeting it instead: #1444 It contains the fix merged with your existing tests — all 4 original tests are preserved and the 4 new categorical encoding tests are added on top. @emrekiciman |
…hod CI _ate_and_se_for_treatment was computing interaction_start using len(variable_names) instead of the actual number of encoded columns. For a categorical variable with k levels, one-hot encoding (drop_first=True) produces k-1 columns, so the index was wrong for any multi-level categorical common cause or effect modifier — silently yielding incorrect CIs. Fixes: - Use self._observed_common_causes.shape[1] (encoded width) instead of len(self._observed_common_causes_names) for n_common_causes - Use self._effect_modifiers.mean(axis=0).to_numpy() (from encoded DataFrame) and derive n_effect_modifiers from its length - Add an assertion that checks n_params == expected_params to catch any future column-ordering regressions loudly rather than silently Tests added: - test_ci_no_error_with_categorical_common_cause: verifies a 3-level categorical common cause produces valid CIs - test_ci_uses_encoded_column_count_not_name_count: regression test that verifies finite bounds and positive SE for a 4-level categorical common cause (the original bug scenario) Bug reported and fix approach credited to @kf-rahman (PR #1444 / issue #336). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
|
Commit pushed:
|
|
🤖 This is an automated update from Repo Assist. Thank you What was changed (commit
PR #1444 (which contained the same fix) can now be closed since the changes are incorporated here.
Note 🔒 Integrity filter blocked 125 itemsThe following items were blocked because they don't meet the GitHub integrity level.
To allow these resources, lower tools:
github:
min-integrity: approved # merged | approved | unapproved | none
|
…asymptotic-ci-4b5b9900c6c0a820
🤖 This is an automated PR from Repo Assist, an AI assistant.
Closes #336.
Root Cause
LinearRegressionEstimator._estimate_confidence_intervalsand_estimate_std_errorboth raisedNotImplementedErrorwhenever effect modifiers were present. The TODO comment pointed to Gelman & Hill ARM Book Chapter 9.Fix: Delta Method
When effect modifiers are present, the Average Treatment Effect is a linear combination of OLS coefficients:
where
cis the contrast vector (matching the feature column ordering produced by_build_features:[const, treatments, common_causes, interactions]) andΣis the OLS parameter covariance matrix (model.cov_params()).The implementation:
_ate_and_se_for_treatment(treatment_index)— builds the contrast vectorc, computesATE = c'βandSE = sqrt(c'Σc)._estimate_confidence_intervalsloops over all treatments, applies the t-distribution margin (scipy.stats.t.ppfwithmodel.df_residdegrees of freedom) and returns shape(n_treatments, 2)matching the existing no-modifier return shape._estimate_std_errorreturns per-treatment SEs scaled by|treatment_value - control_value|.Multiple treatments and multiple effect modifiers are both handled correctly.
Changes
dowhy/causal_estimators/linear_regression_estimator.py— new_ate_and_se_for_treatmenthelper; replacedraise NotImplementedErrorin_estimate_confidence_intervalsand_estimate_std_errortests/causal_estimators/test_linear_regression_estimator.py— addedTestLinearRegressionAsymptoticCIwith 4 tests:NotImplementedErrorraised for single treatment + single EMTest Status
ast.parseon both changed files)black --checkpassesisort --checkpassesNote
🔒 Integrity filter blocked 46 items
The following items were blocked because they don't meet the GitHub integrity level.
list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".To allow these resources, lower
min-integrityin your GitHub frontmatter: