朝花夕拾 (Postscript) : Built in 2024, back when large models were still in their "village idiot" era — every line of code had to be hand-typed, every bug hunted down the hard way. A small memorial to my undergraduate years, and to the age of artisanal, hand-crafted programming.
A pure CNN segmentation network that outperforms mainstream Transformer-based models across three public benchmarks — without any attention mechanisms, Transformer modules, or deformable convolutions. Built entirely from standard components: 3×3 Conv + BatchNorm + ReLU.
Core insight: The underperformance of traditional CNNs in medical segmentation stems not from a lack of attention mechanisms, but from three long-overlooked architectural adaptation issues when migrating ResNet from classification to dense prediction.
📖 Chinese version: README_CN.md
Most recent work follows an "additive innovation" pattern — stacking Transformers, attention gates, and deformable convolutions. ResUnet takes the opposite approach: fix the architectural flaws in ResNet itself. Three targeted changes:
Left: baseline Res34-UNet | Right: ResUnet v1.0 — MaxPool removed, layer1 skip dropped, decoder channels preserved.
The traditional ResNet Stem module applies 7×7 conv (stride 2) + MaxPool, downsampling input to 1/4 resolution immediately. Fine for ImageNet classification (only needs to know what), but in medical segmentation you need to know where the boundary is — pancreas and gallbladder shrink to 1-2 pixels after 4× downsampling.
Fix: Remove the MaxPool layer. Stem downsampling drops from 4× to 2×, and the encoder's max downsampling from 32× to 16×.
Result: Dice 79.02% → 82.03% (+3.01%), HD95 26.92 → 17.65 (-9.27), Pancreas 61.39% → 65.49% (+4.10%)
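To make the resolution argument concrete, the sketch below walks a 224×224 input (the default --img_size) through the standard ResNet34 stride pattern with and without the stem MaxPool. The function name and stage labels are illustrative, not taken from the repository.

```python
def stage_resolutions(img_size, use_maxpool):
    """Spatial size of the feature map after each encoder stage.

    Assumes the standard ResNet34 stride pattern: stride-2 stem conv,
    optional stride-2 MaxPool, then layer1..layer4 with strides 1, 2, 2, 2.
    """
    stages = [("stem_conv", 2)]
    if use_maxpool:
        stages.append(("maxpool", 2))
    stages += [("layer1", 1), ("layer2", 2), ("layer3", 2), ("layer4", 2)]

    sizes = {}
    size = img_size
    for name, stride in stages:
        size //= stride
        sizes[name] = size
    return sizes


baseline = stage_resolutions(224, use_maxpool=True)
no_pool = stage_resolutions(224, use_maxpool=False)
print(baseline["layer1"], baseline["layer4"])  # 56 7
print(no_pool["layer1"], no_pool["layer4"])    # 112 14
```

Under these assumptions, a structure 8 pixels wide in the input spans 2 pixels at layer1 in the baseline and 4 pixels once MaxPool is removed.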
Traditional U-Net blindly preserves all skip connections. But layer1's high-resolution features are contaminated with low-level imaging noise (CT beam-hardening streaks, MRI motion artifacts), which disrupt high-level semantic fusion when fed into the decoder.
Fix: Remove layer1's skip connection, keeping only layer2–layer4 features. Decomposing the layer1 feature as $F_{\text{layer1}} = \mathcal{I}_{\text{spatial}} + N_{\text{noise}}$: when deep semantics are strong enough, the noise term $N_{\text{noise}}$ far outweighs the spatial localization value $\mathcal{I}_{\text{spatial}}$.
Result: Gallbladder Dice 65.53% → 70.92% (+5.39%)
Traditional symmetric decoders habitually compress channel counts after every feature fusion step ("fuse then compress"). Acceptable for large objects in natural images, but small-organ features in medical images are already extremely sparse — compressing channels actively discards weak but critical signals.
Fix (three sub-components):
- Bottleneck Enhancement: Dual 3×3 convs at encoder tail to strengthen context (Dice +0.96%)
- Channel-Preserving Fusion: Layer3 keeps 512 channels (768→512, not 768→256)
- Channel-Preserving Upsampling: Layer2 keeps 384→384, not 384→192
Result: Dice 83.50% → 84.42% (+0.92%), Aorta +1.56%, Right Kidney +2.95%
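Channel preservation is not free. As a rough sketch of the cost, assuming each fusion step is a single bias-free 3×3 convolution (an assumption; the actual decoder blocks may stack several layers), the weight counts of the compressed and preserved variants compare as follows. The helper name is hypothetical.

```python
def conv3x3_weights(in_channels, out_channels):
    """Weight count of one bias-free 3x3 convolution."""
    return 3 * 3 * in_channels * out_channels


# (input channels, compressed output, preserved output), from the list above
fusion_steps = {
    "layer3 fusion": (768, 256, 512),
    "layer2 upsampling": (384, 192, 384),
}

for name, (c_in, c_small, c_kept) in fusion_steps.items():
    small = conv3x3_weights(c_in, c_small)
    kept = conv3x3_weights(c_in, c_kept)
    print(f"{name}: {small:,} -> {kept:,} weights ({kept / small:.1f}x)")
# layer3 fusion: 1,769,472 -> 3,538,944 weights (2.0x)
# layer2 upsampling: 663,552 -> 1,327,104 weights (2.0x)
```

Doubling the output channels doubles the weights of each such layer, which is the price paid for keeping the weak small-organ signals.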
| Dataset | Modality | Classes | ResUnet Dice | Previous Best | Params |
|---|---|---|---|---|---|
| Synapse | Abdominal CT | 9 | 84.42% | PVT-EMCAD-B2 83.63% | 21.3M |
| ACDC | Cardiac MRI | 4 | 92.17% | PVT-EMCAD-B2 92.12% | 21.3M |
| AVT | Aortic Angiography | 2 | 87.11% | DAE-Former 85.02% | 21.3M |
HD95 also leads: Synapse 13.03 (next best 15.68), AVT 4.44 mm (next best 8.96 mm, about 2× worse).
Runs on a single GPU — all experiments on one RTX 3090.
Synapse multi-organ CT — ResUnet vs. competing methods across 8 abdominal organs.
Left: AVT aortic angiography | Right: ACDC cardiac MRI.
Multiple U-Net variants are implemented for comparison:
| Model | Type | Notes |
|---|---|---|
| ResUnet (v0.1–v1.0) | Pure CNN | ResNet34 encoder + asymmetric decoder |
| ResUnet++ | Pure CNN | Dense skip connections |
| Vanilla U-Net | Pure CNN | Classic baseline |
| EMCADNet | CNN + Channel Attention | PVTv2/ResNet encoder + EMCAD decoder |
| TransUNet | CNN + ViT Hybrid | ResNet50 + ViT encoder |
| SwinUnet | Pure Transformer | Swin Transformer encoder-decoder |
Each step corresponds to one specific paper improvement; every file under Synapse_u/model/ carries a top-of-file docstring explaining its role.
| Version | Paper role | Key change |
|---|---|---|
| v0.1 | baseline | ResNet34 (no pretrain) + full stem + all 5 skips |
| v0.2 | training config | + ImageNet pretrain (not part of the ablation) |
| v0.3 | Improvement ① | Drop MaxPool from encoder stem (resolution preservation) |
| v0.4 | Improvement ② | Drop layer1 skip (hierarchical feature selection) |
| v0.5 | Improvement ③a | Bottleneck enhancement |
| v0.6 | Improvement ③b | Layer3 channel preservation |
| v0.7 | unused branch | Re-introduces layer1 skip (opposite of ②, kept for reference) |
| v1.0 | ★ Final model | Improvement ③c: up2 channel preservation, all changes integrated |
conda env create -f ResUnet3.yml
conda activate ResUnet
python -c "import torch; print(torch.cuda.is_available())"
Requirements: Python 3.9.20 · PyTorch 2.4.1 · CUDA 12.4
data/ResCNN/data/
├── Synapse/
│ ├── train_npz/ # Training .npz files (image + label)
│ └── test_vol/ # Testing .h5 volumes
├── AVT/
│ ├── train_npz/
│ └── test_vol/ # .npy.h5 volumes
└── ACDC/ # Separate structure
Split files are in Synapse_u/lists/, one case ID per line.
Unified entry point train.py: pick the model with --model, the dataset with --dataset.
# Final paper model
python train.py --model ResUnet1_0 --dataset Synapse
python train.py --model ResUnet1_0 --dataset AVT
# Ablation ladder (the three paper improvements applied step by step)
python train.py --model ResUnet0_1 --dataset Synapse # baseline
python train.py --model ResUnet0_3 --dataset Synapse # +Improvement (1) drop MaxPool
python train.py --model ResUnet0_4 --dataset Synapse # +Improvement (2) drop layer1 skip
python train.py --model ResUnet0_5 --dataset Synapse # +Improvement (3a) bottleneck
python train.py --model ResUnet0_6 --dataset Synapse # +Improvement (3b) layer3 channels
# Comparison models
python train.py --model EMCAD --dataset Synapse
python train.py --model TransUnet --dataset Synapse
python train.py --model SwinUnet --dataset Synapse
python train.py --model ResUnetpp --dataset AVT
python train.py --model Unet --dataset AVT
# List every supported model with its paper role
python train.py --list-models
# Laptop / 3060 smoke test (a single tiny epoch, not a real training run; just verifies the pipeline)
python train.py --model ResUnet1_0 --dataset Synapse --max_epochs 1 --batch_size 1 --eval_interval 1
The legacy train_Synapse_*.py / train_AVT_*.py scripts are kept for reference and remain equivalent to train.py. New work should use train.py.
| Argument | Default | Description |
|---|---|---|
| --model | required | See python train.py --list-models |
| --dataset | Synapse | Synapse or AVT |
| --max_epochs | 200 | Training epochs |
| --batch_size | 8 | Batch size per GPU |
| --base_lr | 0.02 | Initial LR (SGD) |
| --img_size | 224 | Input spatial size |
| --eval_interval | 25 | Eval interval (epochs) |
- Loss: 0.4 × CrossEntropyLoss + 0.6 × DiceLoss (0.3 + 0.7 for ACDC)
- Optimizer: SGD, momentum=0.9, weight_decay=1e-4, polynomial LR decay
- Evaluation: Starts at epoch >= max_epochs/1.5, runs every eval_interval, plus last 3 epochs
- Checkpoints: Saved under checkpoints_save/<model_name>_SGD_<lr>_<epochs>_<batch>_<dataset>/
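The training settings above can be sketched in a few lines. The decay exponent of 0.9 and the modulo-based interval check are assumptions (both are common conventions, but this README states neither), the helper names are hypothetical, and epochs are taken as 0-indexed to match the epoch_199 checkpoint names used elsewhere in this README.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Polynomial LR decay; power=0.9 is an assumed, commonly used value."""
    return base_lr * (1.0 - iteration / max_iterations) ** power


def should_evaluate(epoch, max_epochs, eval_interval):
    """Evaluate on the last 3 epochs, or every eval_interval epochs once
    epoch >= max_epochs / 1.5 (the modulo check is an assumption)."""
    if epoch >= max_epochs - 3:
        return True
    return epoch >= max_epochs / 1.5 and epoch % eval_interval == 0


def combined_loss(ce_loss, dice_loss, ce_weight=0.4, dice_weight=0.6):
    """Weighted sum of the two loss terms (0.3 / 0.7 for ACDC)."""
    return ce_weight * ce_loss + dice_weight * dice_loss


eval_epochs = [e for e in range(200) if should_evaluate(e, 200, 25)]
print(eval_epochs)             # [150, 175, 197, 198, 199]
print(poly_lr(0.02, 0, 1000))  # 0.02
```

With the defaults (200 epochs, interval 25), this reading of the schedule evaluates only five times per run, which keeps the expensive 3D inference off the early epochs.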
Unified entry point test.py:
python test.py --model ResUnet1_0 --dataset Synapse \
--model_name vResUnet1_0_v1_0_0_SGD_0.02_200_8_Synapse/vResUnet1_0_v1_0_0_epoch_199.pth
python test.py --model EMCAD --dataset AVT \
--model_name vEMCADNet_v1_0_0_SGD_0.02_200_8_AVT/vEMCADNet_v1_0_0_epoch_199.pth
# Laptop smoke test (only first case, no .nii.gz output)
python test.py --model ResUnet1_0 --dataset Synapse \
--model_name vResUnet1_0_v1_0_0_SGD_0.02_200_8_Synapse/vResUnet1_0_v1_0_0_epoch_199.pth \
--max_cases 1 --is_savenii False
| Argument | Description |
|---|---|
| --model | Required, must match the model used during training |
| --dataset | Synapse or AVT |
| --model_name | Path relative to checkpoints_save/ |
| --max_cases | Run only the first N cases; 0 = all (default) |
| --is_savenii | True to save .nii.gz predictions (default False) |
The legacy test_Synapse.py / test_AVT.py / test_*_TransUnet.py scripts are kept, but test.py is preferred: it shares the same MODEL_REGISTRY as train.py, so the model class never gets mismatched with the weights.
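Why a shared registry prevents mismatches can be illustrated with a minimal sketch. This is not the repository's code: the registry contents, the placeholder classes, and build_model are hypothetical stand-ins for whatever train.py and test.py actually import.

```python
class ResUnet1_0:  # placeholder standing in for the real model class
    def __init__(self, num_classes):
        self.num_classes = num_classes


class Unet:  # placeholder standing in for the real model class
    def __init__(self, num_classes):
        self.num_classes = num_classes


# One table of name -> (class, paper role), imported by both entry points.
MODEL_REGISTRY = {
    "ResUnet1_0": (ResUnet1_0, "final model"),
    "Unet": (Unet, "classic baseline"),
}


def build_model(name, num_classes):
    """Look up a model by its --model name and instantiate it."""
    if name not in MODEL_REGISTRY:
        known = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(f"unknown model {name!r}; choose from: {known}")
    model_cls, _role = MODEL_REGISTRY[name]
    return model_cls(num_classes=num_classes)


model = build_model("ResUnet1_0", num_classes=9)
print(type(model).__name__)  # ResUnet1_0
```

Because both scripts resolve the --model string through the same table, a checkpoint trained as ResUnet1_0 cannot be silently loaded into a different class unless the user passes a different name.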
- Dice Score (DSC): Overlap between prediction and ground truth (higher is better)
- 95% Hausdorff Distance (HD95): Boundary distance (lower is better)
Computed via medpy.metric.binary. Results reported per-class and averaged over foreground classes.
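For reference, the Dice score for one class reduces to a short overlap computation. The sketch below is a plain-Python illustration on flat masks, not the medpy implementation. Treating two empty masks as a perfect match and treating label 0 as background are conventions assumed here, and both function names are hypothetical.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary masks (0/1 values)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # both masks empty: treated as a perfect match here
    return 2.0 * intersection / total


def mean_foreground_dice(pred_labels, target_labels, num_classes):
    """Average per-class Dice over classes 1..num_classes-1 (0 = background)."""
    scores = []
    for cls in range(1, num_classes):
        pred_mask = [int(v == cls) for v in pred_labels]
        target_mask = [int(v == cls) for v in target_labels]
        scores.append(dice_score(pred_mask, target_mask))
    return sum(scores) / len(scores)


pred = [0, 1, 1, 1, 0, 0]
target = [0, 0, 1, 1, 1, 0]
print(dice_score(pred, target))  # 2 * 2 / (3 + 3), about 0.667
```

HD95 is omitted from the sketch because it needs surface extraction and distance transforms, which is what the medpy routine provides.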
ResUnet/
├── train.py / test.py # ★ Unified entry points (recommended)
├── train_Synapse_*.py / train_AVT_*.py / test_*.py # Legacy scripts, kept for reference
├── Synapse_u/ # Core library
│ ├── trainer.py # Training loop
│ ├── utils.py # DiceLoss, metrics, 3D inference
│ ├── datasets/ # Dataset classes
│ ├── model/ # All model implementations
│ │ ├── ResUnet0_1.py ~ 0_7.py, ResUnet1_0.py # Ablation ladder (each file documents its role)
│ │ ├── ResUnetpp.py, Unet.py # Baselines
│ │ ├── v3_8_19.py, v3_8_19_2.py # PPM exploration (not used in paper)
│ │ ├── EMCAD/ # EMCADNet
│ │ ├── TransUNet/ # ViT hybrid
│ │ └── SwinUnet/ # Swin Transformer
│ └── lists/ # Train/test splits
├── ACDC/ # ACDC dataset standalone subproject (kept as-is)
├── checkpoints_save/ # Training outputs (auto-generated, gitignored)
└── data/ResCNN/data/ # Dataset storage (gitignored)
Training outputs at checkpoints_save/<model_name>_...:
- <name>_epoch_<N>.pth: Model weights at epoch N
- log/: TensorBoard event files
- <name>_<timestamp>_loss.csv: Training loss
- <name>_<timestamp>_dice.png / _hd95.png: Metric plots
- <name>_<timestamp>_results.csv: Evaluation data
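Because test.py takes --model_name relative to checkpoints_save/, it can help to build that path programmatically. The helper below is a hypothetical convenience, not part of the repository. It assumes the directory naming pattern given in the training details and the vResUnet1_0_v1_0_0 prefix seen in the test.py examples.

```python
from pathlib import PurePosixPath


def checkpoint_relpath(model_name, lr, epochs, batch_size, dataset, epoch):
    """Relative path of one checkpoint under checkpoints_save/."""
    run_dir = f"{model_name}_SGD_{lr}_{epochs}_{batch_size}_{dataset}"
    return str(PurePosixPath(run_dir) / f"{model_name}_epoch_{epoch}.pth")


print(checkpoint_relpath("vResUnet1_0_v1_0_0", 0.02, 200, 8, "Synapse", 199))
# vResUnet1_0_v1_0_0_SGD_0.02_200_8_Synapse/vResUnet1_0_v1_0_0_epoch_199.pth
```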
- Input: single-channel grayscale → replicated to 3 channels for ImageNet-pretrained encoder compatibility
- TransUNet requires additional ViT pretrained weights; paths are set in vit_seg_configs.py
- Multi-GPU supported via --n_gpu and nn.DataParallel
- ACDC dataset uses a separate codebase under ACDC/; use Synapse_u/ as the primary reference
If you find this work useful, please cite:
@mastersthesis{resunet2026,
title={ResUnet: Fixing ResNet's Adaptation Defects for Medical Image Segmentation},
year={2026},
school={Shaoxing University}
}



