Add quickstart, ethics docs, model card, and CI

LEANDERANTONY · LEANDERANTONY · commit 708ef0aed23b · 2026-03-27T21:25:57.000+05:30
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,32 @@
+name: CI
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Check out repository
+        uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version-file: .python-version
+
+      - name: Sync dependencies
+        run: uv sync --frozen
+
+      - name: Compile source tree
+        run: uv run python -m compileall src
+
+      - name: Run tests
+        run: uv run pytest
diff --git a/README.md b/README.md
@@ -1,5 +1,7 @@
 # Multimodal Pancreatic Cancer Detection
 
+[![CI](https://github.com/LEANDERANTONY/Multimodal_Cancer_Detection/actions/workflows/ci.yml/badge.svg)](https://github.com/LEANDERANTONY/Multimodal_Cancer_Detection/actions/workflows/ci.yml)
+[![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/release/python-3110/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
 
 Multimodal Pancreatic Cancer Detection is a bias-aware research repository for pancreatic cancer detection using CT imaging and urinary biomarkers. The implemented workflow combines bias-aware CT preprocessing, ResNet50-based CT classification, a seven-feature biomarker MLP, and exploratory multimodal fusion under synthetic pairing constraints.
@@ -52,7 +54,10 @@ Those numbers should be read carefully:
 
 Supporting project documentation lives in:
 
+- `docs/quickstart.md`
 - `docs/architecture.md`
+- `docs/data_and_ethics.md`
+- `docs/model_card.md`
 - `ROADMAP.md`
 - `DEVLOG.md`
 - `docs/timeline.md`
diff --git a/docs/data_and_ethics.md b/docs/data_and_ethics.md
@@ -0,0 +1,61 @@
+# Data And Ethics
+
+This project uses medical imaging and biomarker data related to pancreatic cancer detection. Because the work sits in a health context, the main ethical obligation extends beyond model quality to careful handling of privacy, bias, uncertainty, and claims.
+
+## Data Handling
+
+- raw clinical data is intentionally kept local and is not tracked in Git
+- processed datasets are also kept local because they remain research assets, not public benchmark files
+- thesis files, checkpoints, embeddings, and presentation materials are treated as local-only by default
+- contributors should never commit patient-level source data, exported scans, or derived files that could create privacy or governance issues
+
+## Privacy Posture
+
+- this repository is structured to avoid publishing raw patient data
+- tracked artifacts are limited to lightweight code, documentation, reports, and curated figures
+- any future sharing of sample data should use explicitly approved, de-identified, non-sensitive examples
+
+## Bias And Scientific Validity
+
+Bias is a central concern in this project, especially for the CT branch.
+
+The dissertation-aligned interpretation is:
+
+- CT results are strong, but they may still contain residual dataset-of-origin signal
+- the biomarker branch is the cleanest reproducible result in the repository
+- fusion is exploratory because CT and biomarker cohorts are not patient-paired
+
+That means high headline metrics should not be read as proof of clinical readiness.
+
+## Intended Use
+
+This repository is intended for:
+
+- research documentation
+- thesis support
+- method development
+- portfolio demonstration of multimodal and bias-aware ML work
+
+This repository is not intended for:
+
+- clinical decision support
+- patient triage
+- diagnosis in real care settings
+- unattended deployment in healthcare environments
+
+## Ethical Reporting Expectations
+
+When describing the project publicly, keep these points explicit:
+
+- the CT branch required extensive bias-aware preprocessing and still has unresolved domain-generalization questions
+- the biomarker branch is more defensible than the fusion branch as a standalone positive result
+- decision-level and feature-level fusion were evaluated under synthetic pairing assumptions, not true paired-patient multimodal data
+- the project is a research artifact, not a validated clinical system
+
+## Future Ethical And Methodological Improvements
+
+- external validation on additional CT cohorts
+- domain-adversarial CT training to suppress dataset-of-origin shortcuts
+- clearer uncertainty reporting and calibration tracking
+- real paired multimodal cohorts instead of synthetic pairing
+- stronger dataset documentation and governance notes if data-sharing constraints change
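
Note on `docs/data_and_ethics.md`: the rule that contributors should never commit patient-level data could be backed by a small path check. The following is a minimal sketch, not code from this repository. It assumes the local-only top-level directories are the ones named in the docs added by this commit (`data/`, `models/`, `embeddings/`, `thesis/`), and the function name `find_forbidden_paths` is hypothetical.

```python
from pathlib import PurePosixPath

# Assumption: local-only top-level directories, taken from the docs in this commit.
LOCAL_ONLY_DIRS = ("data", "models", "embeddings", "thesis")


def find_forbidden_paths(paths, local_only=LOCAL_ONLY_DIRS):
    """Return the repo-relative paths whose first component is a local-only directory."""
    forbidden = []
    for raw in paths:
        parts = PurePosixPath(raw).parts
        if parts and parts[0] in local_only:
            forbidden.append(raw)
    return forbidden
```

One possible use is to pass it the output of `git diff --cached --name-only` in a pre-commit hook and abort the commit when the returned list is non-empty.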
diff --git a/docs/model_card.md b/docs/model_card.md
@@ -0,0 +1,91 @@
+# Model Card
+
+This document summarizes the main modelling components represented in the repository.
+
+## Project Scope
+
+The repository studies pancreatic cancer detection from two modalities:
+
+- CT imaging
+- urinary biomarkers
+
+It also includes exploratory multimodal fusion, but the fusion branch should not be treated as a clinically validated multimodal system because the cohorts are not patient-paired.
+
+## CT Model
+
+### Model
+
+- architecture: ResNet50-based classifier
+- input: processed CT slice images after bias-aware preprocessing, orientation correction, segmentation, and cropping
+- task: cancer vs control classification
+
+### Strengths
+
+- very strong tracked performance in the current repository outputs
+- supported by a more mature preprocessing and analysis pipeline than the earlier notebook-only versions
+
+### Limitations
+
+- residual domain structure may still be present in learned features
+- strong performance may partially reflect shortcut signal rather than purely pathological signal
+- evaluation is still research-stage, not deployment-grade external validation
+
+## Biomarker Model
+
+### Model
+
+- architecture: MLP
+- input features:
+  - `age`
+  - `plasma_CA19_9`
+  - `creatinine`
+  - `LYVE1`
+  - `REG1B`
+  - `TFF1`
+  - `REG1A`
+- task: cancer vs non-cancer classification
+
+### Strengths
+
+- this is the clearest reproducible positive result in the repository
+- uses a compact and interpretable feature set relative to the CT branch
+
+### Limitations
+
+- still evaluated as a research model
+- performance should not be generalized beyond the study setting without additional external validation
+
+## Fusion Models
+
+### Covered Strategies
+
+- decision-level weighted fusion
+- feature-level embedding fusion
+- label-matched and label-mismatched sanity controls
+
+### Interpretation
+
+- fusion experiments are methodologically useful
+- current fusion results are exploratory only
+- they should not be described as evidence of true multimodal clinical benefit because the modalities are not patient-paired
+
+## Training And Runtime Context
+
+- dependency manager: `uv`
+- main orchestration surface: `notebooks/01_multimodal_cancer_detection.ipynb`
+- reusable implementation surface: `src/`
+- tracked outputs: `reports/` and `figures/`
+- local-only assets: `data/`, `models/`, `embeddings/`, `thesis/`
+
+## Current Best Reading Of Results
+
+- biomarker branch: strongest defensible standalone result
+- CT branch: promising but scientifically ambiguous because of residual bias risk
+- fusion branch: exploratory and hypothesis-generating rather than clinically validated
+
+## Future Model Directions
+
+- domain-adversarial CT training with a gradient reversal layer
+- stronger external validation
+- uncertainty and calibration improvements
+- true paired multimodal evaluation if paired cohorts become available
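
Note on `docs/model_card.md`: the model card names decision-level weighted fusion under synthetic pairing but does not show how either works. The sketch below illustrates one common reading and is not the repository's implementation. It assumes that synthetic pairing means matching each CT prediction with a randomly chosen biomarker prediction of the same label, and that the weighted fusion is a convex combination of the two branch probabilities. The names `pair_by_label`, `weighted_fusion`, and `w_ct` are hypothetical.

```python
import random


def pair_by_label(ct_samples, bio_samples, seed=0):
    """Pair each CT sample with a random biomarker sample that has the same label.

    Each sample is a (probability, label) tuple. The result is a list of
    (ct_probability, biomarker_probability, label) tuples.
    """
    rng = random.Random(seed)
    by_label = {}
    for prob, label in bio_samples:
        by_label.setdefault(label, []).append(prob)

    pairs = []
    for ct_prob, label in ct_samples:
        candidates = by_label.get(label)
        if not candidates:
            continue  # no biomarker sample with this label, so skip this CT sample
        pairs.append((ct_prob, rng.choice(candidates), label))
    return pairs


def weighted_fusion(p_ct, p_bio, w_ct=0.5):
    """Decision-level fusion as a convex combination of two branch probabilities."""
    if not 0.0 <= w_ct <= 1.0:
        raise ValueError("w_ct must be in [0, 1]")
    return w_ct * p_ct + (1.0 - w_ct) * p_bio
```

Because the pairs are built from labels rather than from shared patients, any gain measured this way says nothing about true multimodal benefit, which is the caveat the model card already states.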
diff --git a/docs/quickstart.md b/docs/quickstart.md
@@ -0,0 +1,83 @@
+# Quickstart
+
+This project is maintained as a research repository, not a packaged application. The fastest way to get oriented is to separate what works without local data from what requires your private processed assets.
+
+## 1. Set up the environment
+
+```powershell
+uv sync
+```
+
+This uses:
+
+- `pyproject.toml`
+- `uv.lock`
+- `.python-version`
+
+## 2. Run the lightweight validation path
+
+These checks do not require the local CT/biomarker datasets.
+
+```powershell
+uv run python -m compileall src
+uv run pytest
+```
+
+These commands validate the reusable helper layers, fusion utilities, and report-summary code paths tracked in Git.
+
+## 3. Review the tracked outputs first
+
+If you want the fastest understanding of the project without touching local data:
+
+- inspect `reports/final_summary.json`
+- inspect `reports/model_comparison.csv`
+- inspect the curated figures under `figures/`
+- read `docs/architecture.md` and `docs/model_card.md`
+
+## 4. Run the main notebook with local processed data
+
+Open:
+
+```text
+notebooks/01_multimodal_cancer_detection.ipynb
+```
+
+The maintained notebook flow is processed-data-first. In normal use it expects local artifacts under:
+
+- `data/processed/`
+- `reports/`
+- `models/` for local checkpoints when relevant
+
+The notebook should not need raw data for ordinary analysis runs.
+
+## 5. Rebuild processed data only when needed
+
+Use preprocessing scripts only if you are intentionally regenerating processed assets from local raw data:
+
+```powershell
+uv run python src/data/preprocess/ct_preprocess.py
+uv run python src/data/preprocess/biomarker_preprocess.py
+```
+
+Those scripts are rebuild steps, not required for everyday notebook execution.
+
+## 6. Expected local-only assets
+
+These stay outside Git and are expected to exist only on local machines with approved access:
+
+- `data/raw/`
+- `data/processed/`
+- `models/`
+- `embeddings/`
+- `thesis/`
+
+## 7. Practical reading order
+
+If you are new to the repo, use this order:
+
+1. `README.md`
+2. `docs/quickstart.md`
+3. `docs/architecture.md`
+4. `docs/data_and_ethics.md`
+5. `docs/model_card.md`
+6. `notebooks/01_multimodal_cancer_detection.ipynb`
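
Note on `docs/quickstart.md`: section 6 lists the local-only directories a full notebook run expects, so a short check can report which of them are absent before the notebook is opened. This is a minimal sketch, not code from this repository. The directory list is copied from section 6, and the function name `missing_local_assets` is hypothetical.

```python
from pathlib import Path

# Directories listed under "Expected local-only assets" in docs/quickstart.md.
EXPECTED_LOCAL_DIRS = ("data/raw", "data/processed", "models", "embeddings", "thesis")


def missing_local_assets(repo_root, expected=EXPECTED_LOCAL_DIRS):
    """Return the expected local-only directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_dir()]
```

Calling `missing_local_assets(".")` from the repository root returns an empty list when every expected directory is present. Per the quickstart, a missing `data/raw/` does not block ordinary notebook runs, since those work from processed data.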