Commit 708ef0a

committed
Add quickstart, ethics docs, and CI
1 parent cc72eba commit 708ef0a

5 files changed

Lines changed: 272 additions & 0 deletions

.github/workflows/ci.yml

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+name: CI
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Check out repository
+        uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version-file: .python-version
+
+      - name: Sync dependencies
+        run: uv sync --frozen
+
+      - name: Compile source tree
+        run: uv run python -m compileall src
+
+      - name: Run tests
+        run: uv run pytest
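
The three `run:` steps in this workflow can be reproduced locally before pushing. The following is a minimal sketch, not part of the commit: the names `CI_COMMANDS` and `run_ci_locally` are hypothetical, and it assumes `uv` is already installed and on `PATH`.

```python
import subprocess

# The same three commands as the workflow's sync, compile, and test steps.
CI_COMMANDS = [
    ["uv", "sync", "--frozen"],
    ["uv", "run", "python", "-m", "compileall", "src"],
    ["uv", "run", "pytest"],
]


def run_ci_locally(commands=CI_COMMANDS, runner=subprocess.run):
    """Run each command in order and stop at the first failure.

    Returns the exit code of the first failing command, or 0 if all pass.
    `runner` is injectable so the control flow can be exercised without uv.
    """
    for command in commands:
        print("$", " ".join(command))
        result = runner(command)
        if result.returncode != 0:
            return result.returncode
    return 0
```

Calling `run_ci_locally()` from the repository root should approximate what the `test` job does, although the hosted runner's operating system and Python build may differ from a local machine.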

README.md

Lines changed: 5 additions & 0 deletions
@@ -1,5 +1,7 @@
 # Multimodal Pancreatic Cancer Detection
 
+[![CI](https://github.com/LEANDERANTONY/Multimodal_Cancer_Detection/actions/workflows/ci.yml/badge.svg)](https://github.com/LEANDERANTONY/Multimodal_Cancer_Detection/actions/workflows/ci.yml)
+[![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/release/python-3110/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
 
 Multimodal Pancreatic Cancer Detection is a bias-aware research repository for pancreatic cancer detection using CT imaging and urinary biomarkers. The implemented workflow combines bias-aware CT preprocessing, ResNet50-based CT classification, a seven-feature biomarker MLP, and exploratory multimodal fusion under synthetic pairing constraints.
@@ -52,7 +54,10 @@ Those numbers should be read carefully:
 
 Supporting project documentation lives in:
 
+- `docs/quickstart.md`
 - `docs/architecture.md`
+- `docs/data_and_ethics.md`
+- `docs/model_card.md`
 - `ROADMAP.md`
 - `DEVLOG.md`
 - `docs/timeline.md`

docs/data_and_ethics.md

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+# Data And Ethics
+
+This project uses medical imaging and biomarker data related to pancreatic cancer detection. Because the work sits in a health context, the main ethical obligation is not just model quality, but careful handling of privacy, bias, uncertainty, and claims.
+
+## Data Handling
+
+- raw clinical data is intentionally kept local and is not tracked in Git
+- processed datasets are also kept local because they remain research assets, not public benchmark files
+- thesis files, checkpoints, embeddings, and presentation materials are treated as local-only by default
+- contributors should never commit patient-level source data, exported scans, or derived files that could create privacy or governance issues
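
One way to back the "never commit" rule with a check is to flag any path that sits under a local-only directory before it is committed. This is a hedged sketch, not something this commit adds: `LOCAL_ONLY_DIRS` and `find_local_only_paths` are hypothetical names, and the directory list assumes the local-only locations named in this commit's other docs (`data/`, `models/`, `embeddings/`, `thesis/`).

```python
from pathlib import PurePosixPath

# Assumption: top-level directories that the docs describe as local-only.
LOCAL_ONLY_DIRS = {"data", "models", "embeddings", "thesis"}


def find_local_only_paths(paths):
    """Return the paths whose first component is a local-only directory."""
    flagged = []
    for raw in paths:
        # Normalize Windows separators so both path styles are handled.
        parts = PurePosixPath(raw.replace("\\", "/")).parts
        if parts and parts[0] in LOCAL_ONLY_DIRS:
            flagged.append(raw)
    return flagged
```

The input could be the output of `git diff --cached --name-only` split into lines. A non-empty result would be a reason to stop and review the staged files.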
+
+## Privacy Posture
+
+- this repository is structured to avoid publishing raw patient data
+- tracked artifacts are limited to lightweight code, documentation, reports, and curated figures
+- any future sharing of sample data should use explicitly approved, de-identified, non-sensitive examples
+
+## Bias And Scientific Validity
+
+Bias is a central concern in this project, especially for the CT branch.
+
+The dissertation-aligned interpretation is:
+
+- CT results are strong, but they may still contain residual dataset-of-origin signal
+- the biomarker branch is the cleanest reproducible result in the repository
+- fusion is exploratory because CT and biomarker cohorts are not patient-paired
+
+That means high headline metrics should not be read as proof of clinical readiness.
+
+## Intended Use
+
+This repository is intended for:
+
+- research documentation
+- thesis support
+- method development
+- portfolio demonstration of multimodal and bias-aware ML work
+
+This repository is not intended for:
+
+- clinical decision support
+- patient triage
+- diagnosis in real care settings
+- unattended deployment in healthcare environments
+
+## Ethical Reporting Expectations
+
+When describing the project publicly, keep these points explicit:
+
+- the CT branch required extensive bias-aware preprocessing and still has unresolved domain-generalization questions
+- the biomarker branch is more defensible than the fusion branch as a standalone positive result
+- decision-level and feature-level fusion were evaluated under synthetic pairing assumptions, not true paired-patient multimodal data
+- the project is a research artifact, not a validated clinical system
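
To make "synthetic pairing" concrete, the sketch below pairs CT and biomarker samples at random within each label. It is an illustration only and may not match the repository's actual pairing code: `synthetic_label_matched_pairs`, the `(identifier, label)` tuple layout, and the `seed` parameter are all assumptions.

```python
import random


def synthetic_label_matched_pairs(ct_items, biomarker_items, seed=0):
    """Pair CT and biomarker samples that share a label, at random.

    Each item is an (identifier, label) tuple. The pairs are synthetic:
    the two halves come from different people and agree only on the label.
    """
    rng = random.Random(seed)
    pairs = []
    for label in sorted({item_label for _, item_label in ct_items}):
        ct_ids = [i for i, item_label in ct_items if item_label == label]
        bio_ids = [i for i, item_label in biomarker_items if item_label == label]
        rng.shuffle(ct_ids)
        rng.shuffle(bio_ids)
        # zip truncates to the smaller group, so no sample is reused.
        pairs.extend((c, b, label) for c, b in zip(ct_ids, bio_ids))
    return pairs
```

Because each pair shares nothing but its class label, any gain measured on such pairs cannot show that the two modalities carry complementary information about the same patient.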
+
+## Future Ethical And Methodological Improvements
+
+- external validation on additional CT cohorts
+- domain-adversarial CT training to suppress dataset-of-origin shortcuts
+- clearer uncertainty reporting and calibration tracking
+- real paired multimodal cohorts instead of synthetic pairing
+- stronger dataset documentation and governance notes if data-sharing constraints change
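
For the calibration-tracking item, one commonly used summary is a binned expected calibration error. The sketch below is a framework-free version for binary predictions. The function name and the choice of ten equal-width bins are assumptions, and it compares predicted positive-class probability with the observed positive rate rather than top-label confidence with accuracy.

```python
def expected_calibration_error(probabilities, labels, n_bins=10):
    """Binned calibration error for binary predictions.

    probabilities: predicted probability of the positive class, in [0, 1].
    labels: true labels, each 0 or 1.
    """
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    if not probabilities:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))

    total = len(probabilities)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_probability = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        # Weight each bin's gap by the share of samples it holds.
        ece += (len(members) / total) * abs(mean_probability - positive_rate)
    return ece
```

A value near zero means predicted probabilities track observed frequencies. The number depends on the bin count, so it is best reported alongside `n_bins`.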

docs/model_card.md

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
+# Model Card
+
+This document summarizes the main modelling components represented in the repository.
+
+## Project Scope
+
+The repository studies pancreatic cancer detection from two modalities:
+
+- CT imaging
+- urinary biomarkers
+
+It also includes exploratory multimodal fusion, but the fusion branch should not be treated as a clinically validated multimodal system because the cohorts are not patient-paired.
+
+## CT Model
+
+### Model
+
+- architecture: ResNet50-based classifier
+- input: processed CT slice images after bias-aware preprocessing, orientation correction, segmentation, and cropping
+- task: cancer vs control classification
+
+### Strengths
+
+- very strong tracked performance in the current repository outputs
+- supported by a more mature preprocessing and analysis pipeline than the earlier notebook-only versions
+
+### Limitations
+
+- residual domain structure may still be present in learned features
+- strong performance may partially reflect shortcut signal rather than purely pathological signal
+- evaluation is still research-stage, not deployment-grade external validation
+
+## Biomarker Model
+
+### Model
+
+- architecture: MLP
+- input features:
+  - `age`
+  - `plasma_CA19_9`
+  - `creatinine`
+  - `LYVE1`
+  - `REG1B`
+  - `TFF1`
+  - `REG1A`
+- task: cancer vs non-cancer classification
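
The seven input features can be turned into a fixed-order vector with a small helper. This is a sketch under stated assumptions: the feature names come from the list above, but `BIOMARKER_FEATURES`, `to_feature_vector`, the ordering, and the choice to reject missing values rather than impute them are illustrative and may differ from the code in `src/`.

```python
# Feature names as listed in the model card; the order here is an assumption.
BIOMARKER_FEATURES = (
    "age",
    "plasma_CA19_9",
    "creatinine",
    "LYVE1",
    "REG1B",
    "TFF1",
    "REG1A",
)


def to_feature_vector(record):
    """Convert a dict of biomarker values into a list of seven floats."""
    missing = [name for name in BIOMARKER_FEATURES if record.get(name) is None]
    if missing:
        raise ValueError(f"missing biomarker features: {missing}")
    return [float(record[name]) for name in BIOMARKER_FEATURES]
```

Keeping the feature order in one constant reduces the chance that training and inference code disagree about column positions.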
+
+### Strengths
+
+- this is the clearest reproducible positive result in the repository
+- uses a compact and interpretable feature set relative to the CT branch
+
+### Limitations
+
+- still evaluated as a research model
+- performance should not be generalized beyond the study setting without additional external validation
+
+## Fusion Models
+
+### Covered Strategies
+
+- decision-level weighted fusion
+- feature-level embedding fusion
+- label-matched and label-mismatch sanity controls
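
Decision-level weighted fusion is usually a convex combination of the per-branch probabilities. The sketch below shows that form; `weighted_decision_fusion` and the default `ct_weight=0.5` are assumptions, since the model card does not state the weighting the repository uses.

```python
def weighted_decision_fusion(p_ct, p_biomarker, ct_weight=0.5):
    """Blend two positive-class probabilities with a single weight.

    ct_weight=1.0 reproduces the CT branch alone; 0.0 reproduces the
    biomarker branch alone.
    """
    if not 0.0 <= ct_weight <= 1.0:
        raise ValueError("ct_weight must be in [0, 1]")
    for name, value in (("p_ct", p_ct), ("p_biomarker", p_biomarker)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1]")
    return ct_weight * p_ct + (1.0 - ct_weight) * p_biomarker
```

For example, `weighted_decision_fusion(0.8, 0.4, ct_weight=0.5)` averages the two branches. Running the same function on label-mismatched pairs is one way to implement the sanity control listed above, because a fusion score that stays high on mismatched pairs points to one branch dominating.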
+
+### Interpretation
+
+- fusion experiments are methodologically useful
+- current fusion results are exploratory only
+- they should not be described as evidence of true multimodal clinical benefit because the modalities are not patient-paired
+
+## Training And Runtime Context
+
+- dependency manager: `uv`
+- main orchestration surface: `notebooks/01_multimodal_cancer_detection.ipynb`
+- reusable implementation surface: `src/`
+- tracked outputs: `reports/` and `figures/`
+- local-only assets: `data/`, `models/`, `embeddings/`, `thesis/`
+
+## Current Best Reading Of Results
+
+- biomarker branch: strongest defensible standalone result
+- CT branch: promising but scientifically ambiguous because of residual bias risk
+- fusion branch: exploratory and hypothesis-generating rather than clinically validated
+
+## Future Model Directions
+
+- domain-adversarial CT training with a gradient reversal layer
+- stronger external validation
+- uncertainty and calibration improvements
+- true paired multimodal evaluation if paired cohorts become available
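
A gradient reversal layer is the identity in the forward pass and multiplies the gradient by a negative factor in the backward pass, so the feature extractor is pushed away from whatever the domain classifier can exploit. The sketch below is framework-free and shows only that behavior. The class name and `strength` parameter are hypothetical, and a real implementation would hook into an autograd system (for example a custom `torch.autograd.Function`) rather than expose `backward` by hand.

```python
class GradientReversal:
    """Identity forward pass; negated, scaled gradient in the backward pass."""

    def __init__(self, strength=1.0):
        # strength plays the role of the lambda coefficient in
        # domain-adversarial training.
        self.strength = strength

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, upstream_gradient):
        # The gradient flowing back to the feature extractor is reversed.
        return [-self.strength * g for g in upstream_gradient]
```

This is planned work according to the list above, so nothing here describes code that exists in the repository today.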

docs/quickstart.md

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+# Quickstart
+
+This project is maintained as a research repository, not a packaged application. The fastest way to get oriented is to separate what works without local data from what requires your private processed assets.
+
+## 1. Set up the environment
+
+```powershell
+uv sync
+```
+
+This uses:
+
+- `pyproject.toml`
+- `uv.lock`
+- `.python-version`
+
+## 2. Run the lightweight validation path
+
+These checks do not require the local CT/biomarker datasets.
+
+```powershell
+uv run python -m compileall src
+uv run pytest
+```
+
+That validates the reusable helper layers, fusion utilities, and report-summary code paths tracked in Git.
+
+## 3. Review the tracked outputs first
+
+If you want the fastest understanding of the project without touching local data:
+
+- inspect `reports/final_summary.json`
+- inspect `reports/model_comparison.csv`
+- inspect the curated figures under `figures/`
+- read `docs/architecture.md` and `docs/model_card.md`
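
The two report files can be inspected with the standard library alone. The sketch below makes no assumption about their schema: it only lists the top-level JSON keys and the CSV column names. `summarize_tracked_reports` is a hypothetical helper, not something the repository provides.

```python
import csv
import json
from pathlib import Path


def summarize_tracked_reports(reports_dir="reports"):
    """Return the top-level keys and column names of the tracked reports."""
    reports = Path(reports_dir)

    with open(reports / "final_summary.json", encoding="utf-8") as handle:
        summary = json.load(handle)

    with open(reports / "model_comparison.csv", newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

    return {
        "summary_keys": sorted(summary) if isinstance(summary, dict) else [],
        "comparison_columns": list(rows[0].keys()) if rows else [],
        "comparison_rows": len(rows),
    }
```

Running it from the repository root with `uv run python` gives a quick view of what the reports contain before opening the notebook.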
+
+## 4. Run the main notebook with local processed data
+
+Open:
+
+```text
+notebooks/01_multimodal_cancer_detection.ipynb
+```
+
+The maintained notebook flow is processed-data-first. In normal use it expects local artifacts under:
+
+- `data/processed/`
+- `reports/`
+- `models/` for local checkpoints when relevant
+
+The notebook should not need raw data for ordinary analysis runs.
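
Before opening the notebook, it can help to confirm that the expected directories are present. This is a small sketch with hypothetical names (`EXPECTED_DIRS`, `missing_local_dirs`). It treats `models/` like the other two even though the text above marks it as needed only when checkpoints are relevant.

```python
from pathlib import Path

# Directories the notebook is described as reading from.
EXPECTED_DIRS = ("data/processed", "reports", "models")


def missing_local_dirs(repo_root=".", expected=EXPECTED_DIRS):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in expected if not (root / name).is_dir()]
```

An empty list means all three locations exist. It says nothing about whether their contents are complete or current.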
+
+## 5. Rebuild processed data only when needed
+
+Use preprocessing scripts only if you are intentionally regenerating processed assets from local raw data:
+
+```powershell
+uv run python src/data/preprocess/ct_preprocess.py
+uv run python src/data/preprocess/biomarker_preprocess.py
+```
+
+Those scripts are rebuild steps, not required for everyday notebook execution.
+
+## 6. Expected local-only assets
+
+These stay outside Git and are expected to exist only on local machines with approved access:
+
+- `data/raw/`
+- `data/processed/`
+- `models/`
+- `embeddings/`
+- `thesis/`
+
+## 7. Practical reading order
+
+If you are new to the repo, use this order:
+
+1. `README.md`
+2. `docs/quickstart.md`
+3. `docs/architecture.md`
+4. `docs/data_and_ethics.md`
+5. `docs/model_card.md`
+6. `notebooks/01_multimodal_cancer_detection.ipynb`
