OxyFormer is a causal AI framework for estimating the effects of altitude-derived oxygen exposure on cancer outcomes. It combines physiologic treatment engineering, attention-based tabular transformer pretraining, transferred county embeddings, orthogonal double machine learning, site-level heterogeneity analysis, and technical report generation.
This repository is a research codebase for county-level causal modeling and representation learning. It is not a clinical decision-support tool and should not be used for patient care.
OxyFormer is built around a simple idea: if oxygen availability is a biologically meaningful exposure, then the right modeling stack should do more than fit a single regression. In this repository, the workflow is:
- Construct a county-level oxygen exposure proxy from elevation and barometric pressure.
- Fuse county-level public-health, demographic, socioeconomic, and behavioral covariates.
- Pretrain a tabular transformer to learn reusable county embeddings.
- Transfer those embeddings into a cross-fitted continuous-treatment DML estimator.
- Estimate total and mediator-aware direct effects for all-cancer and site-specific endpoints.
- Generate funder-facing HTML and PDF reports.
- Treatment engineering: converts elevation into an inspired-oxygen proxy and a normalized `hypoxia_burden`.
- Public-data fusion: merges local files with ACS, PLACES, SVI, and RUCC county covariates.
- Representation learning: trains a dual-view masked tabular transformer with auxiliary health heads.
- Causal estimation: fits partial-linear DML models with cross-fitted ridge nuisance regressions.
- Endpoint reuse: extends the same learned representation across all-cancer and site-level models.
- Reporting: exports both an HTML white paper and a LaTeX/PDF technical paper.
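
The treatment-engineering step can be sketched in a few lines. This is a minimal illustration, not the code in `phase1_county_pipeline.py`: the standard-atmosphere pressure formula, the constants, and the function names are assumptions chosen for the example. Only the final definition, `hypoxia_burden = 1 - oxygen_fraction_of_sea_level`, comes from this README.

```python
# Assumed constants (standard-atmosphere values, not taken from the repository).
SEA_LEVEL_PRESSURE_MMHG = 760.0
WATER_VAPOR_PRESSURE_MMHG = 47.0  # saturated water vapor at body temperature
OXYGEN_FRACTION = 0.2093          # oxygen fraction of dry air


def barometric_pressure_mmhg(elevation_m: float) -> float:
    """Standard-atmosphere pressure at an elevation in meters (troposphere only)."""
    return SEA_LEVEL_PRESSURE_MMHG * (1.0 - 2.25577e-5 * elevation_m) ** 5.25588


def inspired_oxygen_mmhg(elevation_m: float) -> float:
    """Inspired oxygen partial pressure: FiO2 * (barometric - water vapor)."""
    pressure = barometric_pressure_mmhg(elevation_m)
    return OXYGEN_FRACTION * (pressure - WATER_VAPOR_PRESSURE_MMHG)


def hypoxia_burden(elevation_m: float) -> float:
    """hypoxia_burden = 1 - oxygen_fraction_of_sea_level."""
    oxygen_fraction_of_sea_level = (
        inspired_oxygen_mmhg(elevation_m) / inspired_oxygen_mmhg(0.0)
    )
    return 1.0 - oxygen_fraction_of_sea_level


# Hypothetical example: a county at roughly 1,600 m mean elevation.
example_burden = hypoxia_burden(1609.0)
```

Under these assumptions a sea-level county has a burden of zero, and the burden rises monotonically with elevation.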
| Path | Purpose |
|---|---|
| `phase1_county_pipeline.py` | Builds the baseline county table and oxygen-derived exposure variables from local input files. |
| `phase2_public_data.py` | Downloads and normalizes county-level public covariates from PLACES, SVI, RUCC, and ACS. |
| `phase25_feature_engineering.py` | Curates a smaller modeling feature set and prepares model matrices. |
| `phase26_foundation_model.py` | Trains the OxyFormer tabular transformer and exports learned county embeddings. |
| `phase3_causal_demo.py` | Fits the main cross-fitted DML models for total and direct effects. |
| `phase4_robustness.py` | Runs seed-sensitivity and leave-one-state-out robustness checks. |
| `plot_phase3_site_forest.py` | Renders lightweight forest plots from site-level effect tables. |
| `build_white_paper_report.py` | Generates the HTML white paper. |
| `build_technical_paper_assets.py` | Generates PDF-ready figures and LaTeX inputs for the technical paper. |
| `report/technical_white_paper.tex` | Main LaTeX source for the technical paper. |
| `causal_ai_blueprint.md` | Long-term roadmap for the broader causal AI program. |
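
The leave-one-state-out check listed for `phase4_robustness.py` can be pictured as follows. This is a sketch under assumptions, not the script's actual logic: it takes the state to be the first two digits of a zero-padded five-digit county FIPS code, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def state_from_fips(county_fips: str) -> str:
    """Return the two-digit state code from a county FIPS code."""
    # Zero-pad first: codes read as integers lose their leading zero.
    return str(county_fips).zfill(5)[:2]


def leave_one_state_out(
    county_fips: List[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_state, kept_rows, dropped_rows) once per state."""
    rows_by_state: Dict[str, List[int]] = defaultdict(list)
    for row, fips in enumerate(county_fips):
        rows_by_state[state_from_fips(fips)].append(row)
    for state in sorted(rows_by_state):
        dropped = rows_by_state[state]
        dropped_set = set(dropped)
        kept = [row for row in range(len(county_fips)) if row not in dropped_set]
        yield state, kept, dropped


# A few example county FIPS codes, used only for illustration.
fips_codes = ["08031", "08059", "06037", "48201"]
splits = list(leave_one_state_out(fips_codes))
```

Each split would then be passed to the estimator with the `kept_rows` only, so that the effect estimate can be compared across held-out states.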
Phase 1 expects the following files in the repository root:
- `BYAREA_COUNTY.csv`
- `counties_x_elevation.csv`
- `ACSST5Y2021.S1901-Data.csv`
- `DiabetesPercentage.csv`
- `ObesityAll.csv`
Phase 2 stores downloaded public data in:
- `data/public/raw`
- `data/public/metadata`
These raw public files are intentionally ignored by Git because they are large and can be regenerated locally.
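
Because the Phase 1 inputs are not tracked in Git, it may help to confirm they are present before starting a run. The helper below is a hypothetical convenience, not part of the repository; it only checks for the five file names that Phase 1 expects in the repository root.

```python
from pathlib import Path
from typing import List

# File names that Phase 1 expects in the repository root.
PHASE1_INPUTS = [
    "BYAREA_COUNTY.csv",
    "counties_x_elevation.csv",
    "ACSST5Y2021.S1901-Data.csv",
    "DiabetesPercentage.csv",
    "ObesityAll.csv",
]


def missing_phase1_inputs(repo_root: str = ".") -> List[str]:
    """Return the expected Phase 1 input files that are absent from repo_root."""
    root = Path(repo_root)
    return [name for name in PHASE1_INPUTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_phase1_inputs(".")
    if missing:
        print("Missing Phase 1 inputs:", ", ".join(missing))
    else:
        print("All Phase 1 inputs found.")
```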
OxyFormer currently runs as a script-based repository rather than a packaged module.
- Python 3.10+
- `numpy`
- `pandas`
- `requests`
- `matplotlib`
- `torch`
- `pdflatex`, if you want to build the PDF technical paper

A minimal install looks like:

    pip install numpy pandas requests matplotlib torch

Run all commands from the repository root.

Build the baseline county table:

    python3 phase1_county_pipeline.py --folder . --output-dir outputs/phase1

Download public data:

    python3 phase2_public_data.py --root-dir . --acs-year 2024 --download

Build the merged public-data table:

    python3 phase2_public_data.py --root-dir . --acs-year 2024 --build

Prepare the modeling feature set:

    python3 phase25_feature_engineering.py --root-dir .

Train the OxyFormer tabular transformer:

    python3 phase26_foundation_model.py --root-dir .

This writes learned county embeddings, training history, and model weights into `outputs/phase26`.

All-cancer models only:

    python3 phase3_causal_demo.py --root-dir . --use-foundation-embeddings

All-cancer + supported site-specific models:

    python3 phase3_causal_demo.py --root-dir . --use-foundation-embeddings --include-site-models

Robustness checks:

    python3 phase4_robustness.py --root-dir . --use-foundation-embeddings

Forest plots:

    python3 plot_phase3_site_forest.py --root-dir .

HTML white paper:

    python3 build_white_paper_report.py

PDF technical paper assets:

    python3 build_technical_paper_assets.py

PDF technical paper:

    cd report
    pdflatex -interaction=nonstopmode -output-directory=../outputs/report technical_white_paper.tex
    pdflatex -interaction=nonstopmode -output-directory=../outputs/report technical_white_paper.tex

| Artifact | Path |
|---|---|
| Phase 1 summary | `outputs/phase1/phase1_summary.json` |
| Phase 2 summary | `outputs/phase2/phase2_summary.json` |
| Phase 2.5 summary | `outputs/phase25/phase25_summary.json` |
| OxyFormer model summary | `outputs/phase26/phase26_summary.json` |
| Main causal summary | `outputs/phase3/phase3_summary.json` |
| Robustness summary | `outputs/phase4/phase4_summary.json` |
| HTML report | `outputs/report/white_paper.html` |
| PDF technical paper | `outputs/report/technical_white_paper.pdf` |
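
To see which phases have produced output so far, the summary files in the table can be loaded in one pass. This is a small sketch with a hypothetical helper name. It makes no assumption about the keys inside each JSON file, since this README does not document their schema.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Summary paths as listed in the artifact table, relative to the repository root.
SUMMARY_PATHS = {
    "phase1": "outputs/phase1/phase1_summary.json",
    "phase2": "outputs/phase2/phase2_summary.json",
    "phase25": "outputs/phase25/phase25_summary.json",
    "phase26": "outputs/phase26/phase26_summary.json",
    "phase3": "outputs/phase3/phase3_summary.json",
    "phase4": "outputs/phase4/phase4_summary.json",
}


def load_summaries(repo_root: str = ".") -> Dict[str, Any]:
    """Load every summary JSON that exists; skip phases that have not run yet."""
    loaded: Dict[str, Any] = {}
    for phase, relative_path in SUMMARY_PATHS.items():
        path = Path(repo_root) / relative_path
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                loaded[phase] = json.load(handle)
    return loaded


if __name__ == "__main__":
    for phase, summary in load_summaries(".").items():
        print(phase, "->", type(summary).__name__)
```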
- Treatment: `hypoxia_burden = 1 - oxygen_fraction_of_sea_level`
- Primary outcomes: `all_cancer_incidence_rate`, `all_cancer_mortality_rate`
- Representation model: dual-view masked tabular transformer with auxiliary health targets
- Causal head: cross-fitted partial-linear DML
- Nuisance learners: ridge regression for treatment and outcome models
- Geography adjustment: state dummy variables inferred from county FIPS
- Site-level extension: high-support cancer endpoints for incidence and mortality
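
The causal head described in this list can be illustrated with a compact, NumPy-only sketch of cross-fitted partial-linear DML with ridge nuisance models. It is not the code in `phase3_causal_demo.py`: the function names, the fold count, the ridge penalty, and the synthetic data are all assumptions made for the example. In the real pipeline the covariate matrix would presumably hold the county covariates, state dummies, and transferred embeddings.

```python
import numpy as np


def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    X_centered = X_train - x_mean
    gram = X_centered.T @ X_centered + alpha * np.eye(X_centered.shape[1])
    coef = np.linalg.solve(gram, X_centered.T @ (y_train - y_mean))
    return y_mean + (X_test - x_mean) @ coef


def cross_fitted_plr_dml(X, t, y, n_folds=5, alpha=1.0, seed=0):
    """Partial-linear DML: regress outcome residuals on treatment residuals."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    t_res = np.empty(n)
    y_res = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Nuisance models are fit on the other folds only (cross-fitting).
        t_hat = ridge_fit_predict(X[train], t[train], X[test_idx], alpha)
        y_hat = ridge_fit_predict(X[train], y[train], X[test_idx], alpha)
        t_res[test_idx] = t[test_idx] - t_hat
        y_res[test_idx] = y[test_idx] - y_hat
    theta = float(t_res @ y_res / (t_res @ t_res))
    # Heteroskedasticity-robust standard error from the orthogonal score.
    score = (y_res - theta * t_res) * t_res
    se = float(np.sqrt(np.mean(score**2) / n) / np.mean(t_res**2))
    return theta, se


# Synthetic data with a known treatment effect of 2.0 (illustration only).
rng = np.random.default_rng(42)
n_rows = 2000
X_demo = rng.normal(size=(n_rows, 5))
t_demo = X_demo @ np.array([0.5, -0.3, 0.2, 0.0, 0.1]) + rng.normal(size=n_rows)
y_demo = (
    2.0 * t_demo
    + X_demo @ np.array([1.0, 0.5, -0.5, 0.3, 0.0])
    + rng.normal(size=n_rows)
)
theta_hat, se_hat = cross_fitted_plr_dml(X_demo, t_demo, y_demo)
```

Residualizing both treatment and outcome on the covariates before the final regression is what makes the estimate orthogonal to small errors in the nuisance fits. Cross-fitting keeps each county's residuals out of the models that predicted them.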
This repository currently produces two report formats:
- HTML white paper with left-hand navigation bookmarks at `outputs/report/white_paper.html`
- LaTeX/PDF technical paper at `outputs/report/technical_white_paper.pdf`
Both reports are generated from the same underlying model outputs and are intended for technical communication, strategy, and fundraising.
Large raw data files and large generated artifacts are intentionally ignored via .gitignore and .ignore.
That includes:
- root-level local source files such as `BYAREA_COUNTY.csv` and `BYAREA_COUNTY.TXT`
- downloaded public raw files under `data/public/raw`
- large generated matrices and embeddings under `outputs/phase2`, `outputs/phase25`, and `outputs/phase26`
- Python cache files and LaTeX intermediate files
This repository is released under a restrictive proprietary license. See LICENSE for details.
To use, adapt, distribute, or commercialize this codebase, model, or associated artifacts, first obtain written permission from the repository owner(s).