De-identify clinical text, summarize findings, and tag case risk (low / moderate / high) — via CLI, REST API, or Streamlit.
Not a medical device. For demonstration and clinical decision support research only.
- Ingest & validate clinical text (radiology / pathology / discharge / ecg / echo / others).
- PHI anonymization (emails, phones, MRNs/IDs, address & name hints, dates*).
- Sentence splitting for downstream NLP.
- Summarization (abstractive): 2–3 sentence summary of key findings.
- Risk tagging (hybrid): NLI model + heuristics → low | moderate | high with probabilities.
- Explainability: UI highlights risk-related keywords; meta includes model names/modes & PHI counts.
* Date redaction is configurable.
- Three interfaces: CLI • REST API (FastAPI) • Streamlit web app
- PHI handling: mask or salted-hash; counts reported in responses
- Models (baseline):
  - Summarizer: google/flan-t5-base
  - Risk NLI: facebook/bart-large-mnli (configurable)
- Config via .env (limits, model names, thresholds, device)
- CPU-only friendly; optional CUDA/MPS if available
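The mask and salted-hash PHI strategies listed above can be illustrated with a minimal sketch. It is not the project's anonymizer: the two regex patterns, the function name `anonymize`, and the `[KIND:digest]` tag format are assumptions made for illustration, and only emails and phone numbers are covered.

```python
import hashlib
import re

# Illustrative patterns only (assumption): the project's real rules are not shown here.
PHI_RE = re.compile(
    r"(?P<email>[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"
    r"|(?P<phone>\+?\d[\d\s().-]{7,}\d)"
)


def anonymize(text, strategy="mask", mask_char="█", salt="mediscan"):
    """Replace matched PHI spans in one pass and return (clean_text, counts)."""
    if strategy not in ("mask", "hash"):
        raise ValueError("strategy must be 'mask' or 'hash'")
    counts = {"email": 0, "phone": 0}

    def _replace(match):
        kind = match.lastgroup  # name of the alternative that matched
        counts[kind] += 1
        value = match.group(0)
        if strategy == "hash":
            # Salted SHA-256, truncated for readability (hypothetical tag format).
            digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
            return f"[{kind.upper()}:{digest[:10]}]"
        # Masking keeps the original length but discards the value.
        return mask_char * len(value)

    return PHI_RE.sub(_replace, text), counts


if __name__ == "__main__":
    note = "Contact jane.doe@example.org or 555-123-4567."
    print(anonymize(note, strategy="mask"))
    print(anonymize(note, strategy="hash"))
```

In this sketch, hashing maps the same identifier to the same token on every run with the same salt, so repeated mentions stay linkable. Masking only preserves the length of the original value.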
Language & Runtime
- Python 3.9+ (tested on 3.11), CPU-first with optional CUDA/MPS via PyTorch
Core Libraries
- FastAPI + Uvicorn — REST API server
- Typer — CLI entrypoints
- Pydantic v2 + pydantic-settings — schema & config management
- regex, nltk, langdetect, numpy — preprocessing, sentence split, language hints
NLP / Models (Hugging Face)
- transformers, torch, accelerate, sentencepiece
- Summarizer: google/flan-t5-base (abstractive)
- Risk Tagger (NLI): facebook/bart-large-mnli (zero-shot label probabilities)
- scikit-learn: small utilities (e.g., softmax/normalization if needed)
UI
- Streamlit, Altair, pandas — interactive web app, charts, and tables
Build & Quality
- pyproject.toml (setuptools) — packaging
- Optional dev extras: pytest, pytest-cov, ruff, mypy
Storage/State
- Stateless API; models are cached locally by transformers in the user's Hugging Face cache
# 1) Project root
Set-Location C:\path\to\mediscan-iq
# 2) Create & activate venv
py -3.11 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -U pip wheel setuptools
# 3) Install package (editable) + dev extras
pip install -e .[dev]
# 4) NLTK punkt (once); PowerShell has no heredoc syntax, so use -c
python -c "import nltk; nltk.download('punkt'); print('punkt ready')"
cd /path/to/mediscan-iq
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip wheel setuptools
# Quotes stop zsh from treating .[dev] as a glob pattern
pip install -e ".[dev]"
python - <<'PY'
import nltk; nltk.download('punkt'); print("punkt ready")
PY
Copy the example and adjust values:
cp .env.example .env
The baseline config.py usually declares accepted_report_types as a List[str], so write the value as a JSON array in .env:
# Server
API_HOST=0.0.0.0
API_PORT=8000
# Limits
MAX_CHARS=20000
# Accepted report types (JSON array)
ACCEPTED_REPORT_TYPES=["radiology","pathology","discharge","ecg","echo","others"]
# PHI
ANONYMIZE_STRATEGY=hash # mask | hash
MASK_CHAR=█
HASH_SALT=mediscan
KEEP_DATES=false
REDUCE_WHITESPACE=true
# Models
SUMMARIZER_MODEL=google/flan-t5-base
RISK_NLI_MODEL=facebook/bart-large-mnli
RISK_THRESHOLD_HIGH=0.64
RISK_THRESHOLD_MODERATE=0.42
# Runtime
DEVICE_PREFERENCE=auto # auto | cpu | cuda | mps
SEED=42
If your config.py uses a *_raw CSV string with a parsing property, change ACCEPTED_REPORT_TYPES to: radiology,pathology,discharge,ecg,echo,others
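The difference between the two .env formats can be seen in a small parser. pydantic-settings decodes complex field types such as `List[str]` from JSON, which is why the array form is needed there, while a `*_raw` string field leaves the splitting to your own code. The helper below is a hypothetical stand-in for such a parsing property, not the project's implementation. It accepts either format so the two can be compared side by side.

```python
import json


def parse_report_types(raw):
    """Parse ACCEPTED_REPORT_TYPES given as a JSON array or a CSV string.

    Hypothetical helper for illustration; not taken from config.py.
    """
    raw = raw.strip()
    if raw.startswith("["):
        # JSON array form: ["radiology","pathology",...]
        values = json.loads(raw)
    else:
        # CSV form: radiology,pathology,...
        values = raw.split(",")
    # Normalize whitespace and case; drop empty entries such as a trailing comma.
    return [str(v).strip().lower() for v in values if str(v).strip()]


if __name__ == "__main__":
    print(parse_report_types('["radiology","pathology","ecg"]'))
    print(parse_report_types("radiology, pathology, ecg,"))
```

Both calls yield the same list, so a mismatch between the field type and the .env format is the likely cause whenever a valid-looking report type is rejected.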
# Ingest (validate + anonymize + split)
python -m mediscan_iq.cli ingest .\sample_data\example_radiology.txt -t radiology
# Summarize (anonymize → abstractive summary)
python -m mediscan_iq.cli summarize .\sample_data\example_radiology.txt -t radiology
# Risk tagging (anonymize → NLI + heuristics)
python -m mediscan_iq.cli risk .\sample_data\example_radiology.txt
# Full pipeline (anonymize → split → summarize → risk)
python -m mediscan_iq.cli analyze .\sample_data\example_radiology.txt -t radiology
If installed as a console script:
mediscan-iq analyze path/to/report.txt -t radiology
python -m uvicorn mediscan_iq.api:app --host 0.0.0.0 --port 8000 --reload
Health probe:
Invoke-RestMethod http://localhost:8000/health
- GET /health → {"status":"ok","version":"<semver>"}
- POST /ingest → anonymized text, sentences, PHI counts
- POST /analyze → summary, risk level + probabilities, anonymized text, sentences, meta
PowerShell — /ingest
$body = @{ text = "Patient John Doe ..."; report_type = "radiology" } | ConvertTo-Json
Invoke-RestMethod -Uri "http://localhost:8000/ingest" -Method Post -ContentType "application/json" -Body $body
PowerShell — /analyze (multi-line)
$text = @"
FINDINGS: Mild cardiomegaly. No acute pulmonary edema or effusion.
IMPRESSION: Cardiomegaly without acute cardiopulmonary process.
"@
$body = @{ text = $text; report_type = "radiology" } | ConvertTo-Json -Depth 4
Invoke-RestMethod -Uri "http://localhost:8000/analyze" -Method Post -ContentType "application/json" -Body $body
curl — /analyze
curl -s -X POST http://localhost:8000/analyze \
-H "Content-Type: application/json" \
-d '{"text":"FINDINGS: Mild cardiomegaly. IMPRESSION: No acute disease.","report_type":"radiology"}'
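The same /analyze call can be made from Python using only the standard library. This is a sketch under the assumption that the server from the uvicorn command is running on localhost:8000. The function names `build_analyze_request` and `analyze` are hypothetical, and the response is returned as a plain dict because its exact field names are not specified here.

```python
import json
import urllib.request


def build_analyze_request(text, report_type="radiology",
                          base_url="http://localhost:8000"):
    """Build the POST request for /analyze without sending it."""
    payload = json.dumps({"text": text, "report_type": report_type})
    return urllib.request.Request(
        url=f"{base_url}/analyze",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, report_type="radiology",
            base_url="http://localhost:8000", timeout=120):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_analyze_request(text, report_type, base_url)
    # A generous timeout is used because the first call may download models.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    result = analyze("FINDINGS: Mild cardiomegaly. IMPRESSION: No acute disease.")
    print(json.dumps(result, indent=2))
```

Splitting request construction from sending keeps the payload easy to inspect when debugging a 422 response.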
pip install streamlit altair pandas
streamlit run .\app_streamlit.py
- Paste or upload a note → click Analyze.
- See summary, risk badge, probability chart, anonymized text with highlights, and meta.
Create files quickly (PowerShell):
@'
FINDINGS: Mild cardiomegaly. No acute pulmonary edema or pleural effusion.
IMPRESSION: Cardiomegaly without acute cardiopulmonary process.
'@ | Set-Content .\CXR_mild_cardiomegaly.txt -Encoding UTF8
@'
FINDINGS: Acute subarachnoid hemorrhage within the suprasellar cisterns and anterior interhemispheric fissure.
IMPRESSION: Acute subarachnoid hemorrhage. Urgent neurosurgical evaluation recommended.
'@ | Set-Content .\HeadCT_SAH.txt -Encoding UTF8
@'
HISTORY: Fever, productive cough.
FINDINGS: New right upper lobe air-space consolidation with air bronchograms.
IMPRESSION: Right upper lobe pneumonia. Recommend therapy.
'@ | Set-Content .\CXR_RUL_pneumonia.txt -Encoding UTF8
@'
ADMISSION: Acute decompensated heart failure with volume overload.
COURSE: IV diuresis net -2.5L in 48h; renal function improved; EF 40-45%; small pleural effusions.
DISCHARGE: Furosemide 40 mg, Spironolactone 25 mg, low-sodium diet, cardiology follow-up.
'@ | Set-Content .\Discharge_ADHF_long.txt -Encoding UTF8
- CXR_mild_cardiomegaly → Summary mentions mild cardiomegaly; no acute process. Risk typically moderate (keyword “cardiomegaly”) or low if thresholds are stricter; either is acceptable if the summary is correct.
- HeadCT_SAH → Summary includes acute subarachnoid hemorrhage. Risk = high; high risk probability on top.
- CXR_RUL_pneumonia → Summary includes RUL pneumonia / consolidation. Risk = moderate; moderate risk on top.
- Discharge_ADHF_long → Summary captures ADHF, diuresis, EF ~40–45%, small effusions, meds. Risk = moderate (may be low if conservative). Summary should remain coherent despite length.
Focus on correct clinical concepts and the top-ranked risk bucket rather than exact probability numbers.
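Checking only the top-ranked bucket, as suggested above, can be scripted with a small helper. The sketch assumes the risk probabilities are available as a dict keyed by bucket name. The function names and the example numbers are invented for illustration and are not real model output.

```python
def top_bucket(probabilities):
    """Return the bucket name with the highest probability."""
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    return max(probabilities, key=probabilities.get)


def check_case(probabilities, acceptable):
    """True if the top-ranked bucket is one of the acceptable outcomes."""
    return top_bucket(probabilities) in acceptable


if __name__ == "__main__":
    # Made-up numbers standing in for a response to the HeadCT_SAH sample.
    sah = {"low": 0.05, "moderate": 0.20, "high": 0.75}
    print(top_bucket(sah), check_case(sah, {"high"}))

    # For the cardiomegaly sample, both moderate and low are acceptable.
    cxr = {"low": 0.40, "moderate": 0.45, "high": 0.15}
    print(top_bucket(cxr), check_case(cxr, {"moderate", "low"}))
```

Passing a set of acceptable buckets mirrors the expectations above, where some samples may land in either of two buckets depending on thresholds.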
- First run downloads models; subsequent runs are fast.
- CPU-only works; set DEVICE_PREFERENCE=cuda if you have a GPU.
- For richer summaries, increase:
  - SUMMARIZER_MAX_OUTPUT_TOKENS=160
  - SUMMARIZER_NUM_BEAMS=6 (when using seq2seq)
- For stricter triage tuning:
  - RISK_THRESHOLD_HIGH=0.58–0.65
  - RISK_THRESHOLD_MODERATE=0.40–0.50
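One way to read the two thresholds is as cut-offs on a single risk score between 0 and 1. That reading is an assumption: how the project combines NLI probabilities with heuristics is not described here, and the function name `bucket_for` is hypothetical. The sketch shows how moving RISK_THRESHOLD_HIGH within the suggested range changes the bucket for a borderline score.

```python
def bucket_for(score, high=0.64, moderate=0.42):
    """Map a risk score in [0, 1] to low / moderate / high.

    Defaults mirror RISK_THRESHOLD_HIGH and RISK_THRESHOLD_MODERATE from the
    example .env; the single-score model itself is an assumption.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if moderate > high:
        raise ValueError("moderate threshold must not exceed high threshold")
    if score >= high:
        return "high"
    if score >= moderate:
        return "moderate"
    return "low"


if __name__ == "__main__":
    borderline = 0.60
    print(bucket_for(borderline))             # default thresholds
    print(bucket_for(borderline, high=0.58))  # lower high threshold
```

Under this reading, lowering the high threshold moves borderline cases into the high bucket, so more reports are escalated.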
- BaseSettings has moved to pydantic-settings
  Install & update import: pip install pydantic-settings
  In config.py: from pydantic_settings import BaseSettings
- Unsupported report type even though value looks correct
  Check .env format for ACCEPTED_REPORT_TYPES:
  - If List[str] field → JSON array
  - If *_raw: str + property → CSV
- 422 Unprocessable Entity on /analyze
  Ensure the JSON body has both fields, {"text":"...","report_type":"radiology"}, and the request sends the Content-Type: application/json header.
- “temperature ignored” notice
  Harmless on some HF models; disappears if temperature is only passed when sampling.
- The anonymizer removes common PHI heuristically (emails, phones, dates*, IDs, name/address hints).
- Do not ingest directly identifiable patient data without reviewing your PHI policy.
- This repository is not a medical device; clinicians remain responsible for decisions.
This project is licensed under the MIT License.
- Abdulvahap Mutlu — MediScan-IQ