Skip to content

TwoStageDiD methodology validation (PR-B): did2s parity + exact GMM variance#545

Merged
igerber merged 1 commit into
mainfrom
feature/two-stage-methodology
Jun 23, 2026
Merged

TwoStageDiD methodology validation (PR-B): did2s parity + exact GMM variance#545
igerber merged 1 commit into
mainfrom
feature/two-stage-methodology

Conversation

@igerber

@igerber igerber commented Jun 23, 2026

Copy link
Copy Markdown
Owner

Summary

  • TwoStageDiD methodology validation (PR-B) of the Gardner (2022) / did2s review pair. Adds paper-grounded Verified Components and a did2s cross-language parity fixture, and flips the METHODOLOGY_REVIEW.md TwoStageDiD row to Complete.
  • Corrects a ~1% inaccuracy in the GMM sandwich variance surfaced by the did2s parity. The variance computed gamma_hat exactly (sparse) but derived its residuals from the iterative alternating-projection first-stage FE (_iterative_fe, ~1e-7 convergence on unbalanced untreated panels). It now re-solves the Stage-1 FE exactly (reusing the gamma_hat factorization), and _build_fe_design gains an intercept column so its column space spans the grand mean (the prior intercept-free design omitted it; the exact residual is first-order sensitive). SE now matches did2s to ~1e-7; the point estimate is unchanged (iterative FE; ImputationDiD equivalence preserved at 1e-10). Mirrors the same-class fix in ImputationDiD's exact-sparse variance.

Methodology references

  • Method name(s): TwoStageDiD (Gardner two-stage difference-in-differences)
  • Paper / source link(s): Gardner, J. (2022), Two-stage differences in differences, arXiv:2207.05943; R reference did2s::did2s() (Butts & Gardner)
  • Intentional deviations from the source (and why): No finite-sample multiplier (matches did2s; documented as Deviation from R in REGISTRY.md). The multiplier bootstrap and the vcov_type narrowing are library extensions (Gardner prescribes analytical GMM SEs only; did2s defaults bootstrap=FALSE). The Eq. (5) P̄-average estimand and the fn. 8 full-sample first-stage variant are paper-permitted but not exposed (tracked in TODO.md).

Validation

  • Tests added/updated: tests/test_methodology_two_stage.py (new — five Gardner-section Verified Component classes: §3 procedure/eqs. 4&6, §3.3 GMM variance, fn. 19 + Proposition 5 identification, library deviations, plus TestTwoStageDiDParityR); benchmarks/R/generate_did2s_golden.R + benchmarks/data/did2s_golden.json + did2s_test_panel.csv (new R parity fixture, committed so CI needs no R). did2s parity: overall + event-study ATT (abs=1e-6) and SE (abs=1e-7). Regression: full tests/test_two_stage.py (120) plus a 2315-test sweep across all TwoStageDiD/survey test files — green. black/ruff clean.
  • Backtest / simulation / notebook evidence: N/A

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

@github-actions

Copy link
Copy Markdown

Overall Assessment

⚠️ Needs changes — one P1 finding.

Executive Summary

  • The analytical TwoStageDiD GMM variance change matches the documented Gardner / did2s contract: global GMM bread, unit-clustered meat, no finite-sample multiplier.
  • The exact Stage-1 residual re-solve is documented in REGISTRY.md, so it is not a methodology deviation.
  • The bootstrap path still builds GMM scores from the old iterative residuals, so n_bootstrap > 0 can report SEs from a different influence function than the now-corrected analytical GMM path.
  • I could not run tests: this environment lacks pytest and numpy.

Methodology

Finding 1

Severity: P1
Location: diff_diff/two_stage_bootstrap.py:L106-L124, diff_diff/two_stage_bootstrap.py:L336-L350, diff_diff/two_stage_bootstrap.py:L495-L508, diff_diff/two_stage_bootstrap.py:L566-L579; contrast with fixed analytical path in diff_diff/two_stage.py:L2989-L3029
Impact: The PR fixes analytical GMM SEs by re-solving Stage-1 FE exactly and recomputing Stage-2 residuals from the exact residualized outcome. However, the multiplier bootstrap still reconstructs eps_10 from the iterative _y_tilde and receives eps_2 computed from the same iterative residualized outcome. Since bootstrap results override reported SE/CI/p-values when n_bootstrap > 0, users can still get the pre-fix approximate influence-function SEs silently on the bootstrap path. This is the same residual source the PR identifies as moving analytical SEs by about 1%.
Concrete fix: Refactor the exact residual / exact Stage-2 residual construction from _compute_gmm_variance into a shared helper and use it inside _compute_cluster_S_scores, or update _compute_cluster_S_scores to solve theta_exact with the same (X'10 W X10) factorization and recompute eps_2 internally from exact y_tilde. Add a regression test comparing bootstrap S-score construction against the exact analytical S-score inputs on the did2s parity panel, at least for static and event-study paths.

Code Quality

No additional P1/P2 code-quality findings in the changed implementation. The new exact-residual code is localized and follows the existing sparse-factorization plus warned dense-fallback pattern.

Performance

No blocking findings. The exact Stage-1 re-solve reuses the existing sparse factorization, so the added work is a small extra RHS solve plus a Stage-2 solve.

Maintainability

Severity: P3
Impact: Analytical and bootstrap GMM score construction now duplicate related logic but no longer share the exact-residual behavior, which caused the P1 above.
Concrete fix: Share the exact score-input builder between analytical variance and bootstrap to keep future methodology fixes single-sourced.

Tech Debt

No untracked deferrable tech-debt blocker found. The paper-permitted but unexposed Eq. (5) / fn. 8 variants are tracked in TODO.md, so they are informational only.

Security

No security findings. The added benchmark CSV/JSON fixture appears synthetic and contains no secrets or PII.

Documentation/Tests

Severity: P2
Impact: The new methodology tests cover analytical did2s parity and only assert that multiplier bootstrap returns a finite SE. They would not catch the bootstrap-path residual mismatch above.
Concrete fix: Add a bootstrap-specific regression test that exercises TwoStageDiD(n_bootstrap=...) on an unbalanced untreated panel and verifies its S-score inputs use the exact Stage-1 residual path, not the iterative _y_tilde path.

Path to Approval

  1. Update TwoStageDiD bootstrap score construction to use the same exact Stage-1 residual and exact Stage-2 residual logic as _compute_gmm_variance.
  2. Add tests for static and event-study bootstrap score construction on the did2s parity panel or another unbalanced untreated panel where iterative-vs-exact residuals differ.
  3. Run tests/test_methodology_two_stage.py and the existing TwoStageDiD bootstrap tests.

@igerber igerber force-pushed the feature/two-stage-methodology branch from 2acfbe0 to c5602bb Compare June 23, 2026 15:17
@github-actions

Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: c5602bbe689a512be906879079d82599059aa39b


Overall Assessment

Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • Previous P1 is resolved: bootstrap S-score construction now uses the shared exact Stage-1 / Stage-2 residual helper, not the old iterative residual path.
  • The affected method is TwoStageDiD: Gardner two-stage DiD, analytical GMM variance, and multiplier-bootstrap GMM scores.
  • The exact Stage-1 re-solve and _build_fe_design intercept are documented in REGISTRY.md, so they are not methodology defects.
  • No new NaN-inference, parameter propagation, or control-group logic defects found in the changed code.
  • Could not run tests here: pytest and numpy are not installed. Static git diff --check passed.

Methodology

Finding M1

Severity: P3 informational
Impact: The PR changes TwoStageDiD variance mechanics, but the implementation matches the documented registry contract: global GMM bread, unit-clustered meat, no finite-sample multiplier, exact Stage-1 residual re-solve, and documented multiplier-bootstrap extension. The prior bootstrap residual mismatch is addressed through _exact_gmm_residuals() and all bootstrap specs call it via _compute_cluster_S_scores().
Concrete fix: No action required. Relevant locations: diff_diff/two_stage.py:L2916-L3003, diff_diff/two_stage.py:L3203-L3293, diff_diff/two_stage_bootstrap.py:L66-L123, diff_diff/two_stage_bootstrap.py:L407-L420, diff_diff/two_stage_bootstrap.py:L556-L569, diff_diff/two_stage_bootstrap.py:L617-L630, docs/methodology/REGISTRY.md:L1405-L1430.

Code Quality

No findings. The exact residual construction is now centralized, and the changed inference paths continue to use the existing NaN-safe inference conventions.

Performance

No findings. The added exact Stage-1 solve reuses the same Stage-1 sparse factorization already needed for gamma_hat; the extra Stage-2 solve is small relative to the FE solve.

Maintainability

No findings. Sharing _exact_gmm_residuals() between analytical variance and bootstrap removes the divergence that caused the previous P1.

Tech Debt

Finding T1

Severity: P3 informational
Impact: Gardner Eq. (5) period-average and fn. 8 full-sample first-stage variants remain unexposed, but this is explicitly tracked in TODO.md and documented in the methodology review.
Concrete fix: No action required for this PR. Relevant location: TODO.md:L156.

Security

No findings. The added benchmark CSV/JSON fixture appears synthetic and contains no secrets or PII.

Documentation/Tests

Finding D1

Severity: P3 informational
Impact: The new white-box bootstrap exact-residual test covers the static S-score path. Event-study and group bootstrap paths are not separately asserted, though they now share the same _compute_cluster_S_scores() helper, so this is not a blocker.
Concrete fix: Optional: add an event-study bootstrap score regression that compares bread @ (S.T @ S) @ bread from _compute_cluster_S_scores() against the analytical event-study GMM variance on an unbalanced untreated panel. Relevant test: tests/test_methodology_two_stage.py:L439-L493.

Verification: Not run. This environment lacks pytest and numpy.

…ariance

PR-B of the TwoStageDiD (Gardner 2022) methodology validation. Adds
paper-grounded Verified Components and a did2s cross-language parity fixture,
and corrects a ~1% inaccuracy in the GMM sandwich variance surfaced by the
parity check.

Tests / docs:
- tests/test_methodology_two_stage.py: §3 procedure (eqs. 4/6), §3.3 GMM
  variance, fn.19 always-treated + Prop.5, the live covariate path, balance_e,
  vcov_type narrowing, a skip-guarded did2s R parity class, and a white-box
  test that the bootstrap GMM scores use the exact residual path.
- benchmarks/R/generate_did2s_golden.R + did2s_golden.json + did2s_test_panel.csv:
  did2s::did2s() golden (analytical corrected clustered SE, bootstrap=FALSE).
- REGISTRY / paper-review / METHODOLOGY_REVIEW / CHANGELOG / TODO updated;
  tracker row flipped to Complete.

Variance fix (diff_diff/two_stage.py, diff_diff/two_stage_bootstrap.py):
- The GMM sandwich derived its residuals from the iterative
  alternating-projection first-stage FE (_iterative_fe, ~1e-7 on unbalanced
  untreated panels) while computing gamma_hat exactly, leaving the SE ~1% off the
  analytical sandwich. Both the analytical variance (_compute_gmm_variance) and
  the multiplier bootstrap (_compute_cluster_S_scores) now re-solve the Stage-1
  FE EXACTLY via a shared _exact_gmm_residuals helper (reusing the gamma_hat
  factorization), so the GMM influence function is single-sourced. This matters
  for the bootstrap because its SE overrides the analytical SE when
  n_bootstrap > 0.
- _build_fe_design gains an intercept column so its column space spans the grand
  mean (the prior intercept-free design omitted it; the exact residual is
  first-order sensitive). Standard full-rank two-way FE, matching fixest / did2s.
- Point estimate unchanged (iterative FE; ImputationDiD equivalence preserved at
  1e-10); only the variance uses exact residuals. Unidentified-FE obs (rank-
  deficient / Prop.5) fall back to the iterative residual. SE now matches did2s
  to ~1e-7. Mirrors ImputationDiD's exact-sparse variance fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the feature/two-stage-methodology branch from c5602bb to 278e541 Compare June 23, 2026 15:34
@github-actions

Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 278e54151094a0c044e62f17bbae9cc7a9f4c417


Overall Assessment

Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • Affected method: TwoStageDiD Gardner two-stage DiD, analytical GMM sandwich variance, and multiplier-bootstrap GMM scores.
  • The prior bootstrap residual concern remains resolved: static, event-study, and group bootstrap paths all route through _compute_cluster_S_scores(), which now shares _exact_gmm_residuals() with analytical variance.
  • The exact Stage-1 re-solve, _build_fe_design intercept, and no finite-sample multiplier are documented in REGISTRY.md, so they are not methodology defects.
  • R did2s parity is now pinned by committed golden fixtures for overall and event-study ATT/SE.
  • I could not run pytest here because numpy is not installed. Static git diff --check passed.

Methodology

Finding M1

Severity: P3 informational
Impact: The PR changes TwoStageDiD variance mechanics, but the changed implementation matches the documented methodology contract: global GMM bread, unit-clustered meat, no finite-sample multiplier, exact Stage-1 residual re-solve, and shared exact residuals for bootstrap influence scores. These are documented in REGISTRY.md, including the exact re-solve/intercept note and the did2s no-FSA deviation label.
Concrete fix: No action required. Relevant locations: diff_diff/two_stage.py:L2916-L3003, diff_diff/two_stage.py:L3203-L3293, diff_diff/two_stage_bootstrap.py:L66-L123, diff_diff/two_stage_bootstrap.py:L407-L420, diff_diff/two_stage_bootstrap.py:L556-L569, diff_diff/two_stage_bootstrap.py:L617-L630, docs/methodology/REGISTRY.md:L1424-L1430.

Code Quality

No findings. The exact residual construction is centralized and reused by analytical variance and bootstrap paths, reducing the prior divergence.

Performance

No findings. The exact Stage-1 residual solve reuses the same sparse factorization needed for gamma_hat; the additional Stage-2 re-solve is small relative to the FE solve.

Maintainability

No findings. The changed signatures propagated through all visible callers in the diff, and the previous optional event-study bootstrap coverage gap is now explicitly covered.

Tech Debt

Finding T1

Severity: P3 informational
Impact: Gardner’s Eq. (5) period-average estimand and fn. 8 full-sample first-stage variant remain unexposed, but this is explicitly tracked in TODO.md and documented as out of scope.
Concrete fix: No action required for this PR. Relevant location: TODO.md:L156.

Security

No findings. The added CSV/JSON benchmark fixture appears synthetic and contains no secrets or PII.

Documentation/Tests

Finding D1

Severity: P3 informational
Impact: The prior optional coverage note is addressed: there is now a multi-column event-study bootstrap score regression asserting bread @ (S.T @ S) @ bread equals analytical GMM variance elementwise, plus committed did2s parity fixtures for overall/event-study ATT and SE.
Concrete fix: No action required. Relevant locations: tests/test_methodology_two_stage.py:L485-L541, tests/test_methodology_two_stage.py:L770-L820, benchmarks/R/generate_did2s_golden.R:L82-L138, benchmarks/data/did2s_golden.json:L13-L20.

Verification

  • git diff --check: passed.
  • Tests not run: this environment lacks numpy (ModuleNotFoundError: No module named 'numpy').

@igerber igerber added the ready-for-ci Triggers CI test workflows label Jun 23, 2026
@igerber igerber merged commit 215364f into main Jun 23, 2026
33 of 34 checks passed
@igerber igerber deleted the feature/two-stage-methodology branch June 23, 2026 17:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant