Description
Problem Statement
We conducted an ablation study comparing model performance with 2-frame versus 5-frame inputs. Surprisingly, the results show only minimal differences (<0.1% on most metrics), despite the significant difference in input information.
Experimental Setup
Model: DeltaFlow
Configuration:
Voxel size: ${voxel_size}
Point cloud range: ${point_cloud_range}
Planes: [16, 32, 64, 128, 256, 256, 128, 64, 32, 16]
Num_layer: [2, 2, 2, 2, 2, 2, 2, 2, 2] (MinkUnet 18)
Decay factor: 0.4
Decoder option: default
Results
2-Frame Input (without history frames):

5-Frame Input (with history frames):

Key Observations
Both configurations achieve nearly identical performance across most evaluation metrics.
The difference stays within 1% on the majority of metrics.
Additional experiments show that the 2-frame version converges to performance levels very close to the 5-frame version within 5 training epochs.
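To make the comparison easier to reproduce, the per-metric gap can be computed as both an absolute and a relative difference, since a figure like "<0.1%" can be read either way. The helper below is a minimal sketch only: `compare_metrics`, its arguments, and the `tol` threshold are hypothetical names and are not part of the DeltaFlow codebase.

```python
def compare_metrics(baseline, candidate, tol=0.001):
    """Compare two result dictionaries metric by metric.

    baseline, candidate: dicts mapping metric name -> float value
        (e.g. the 2-frame and 5-frame evaluation results).
    tol: absolute difference above which a metric is flagged
        (hypothetical threshold, pick one that suits the metric scale).
    """
    rows = {}
    # Only metrics present in both runs can be compared.
    for name in sorted(set(baseline) & set(candidate)):
        diff = candidate[name] - baseline[name]
        if baseline[name] != 0:
            rel = diff / abs(baseline[name])
        else:
            rel = float("nan")  # relative change is undefined for a zero baseline
        rows[name] = {
            "abs_diff": diff,
            "rel_diff": rel,
            "flagged": abs(diff) > tol,
        }
    return rows
```

Calling `compare_metrics(results_2frame, results_5frame)` on the two evaluation outputs would list which metrics, if any, move by more than the chosen tolerance.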
Question/Concern
Given the substantial difference in input information (2 frames vs 5 frames), we would expect more significant performance variation. The minimal observed difference raises questions about:
Whether the model is effectively utilizing the additional temporal information from 5-frame inputs
Potential issues with the implementation or configuration
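One configuration value worth checking against these questions is the decay factor of 0.4. The sketch below assumes, purely for illustration, that the k-th history frame is scaled by `decay ** k` and that the 5-frame input adds three history frames. Neither assumption is confirmed by the configuration above, and `history_weights` is a hypothetical helper, not a DeltaFlow function.

```python
def history_weights(decay, num_history):
    """Geometric weights for history frames.

    Assumption (not confirmed by the source): history frame k,
    with k = 1 the most recent, is scaled by decay ** k.
    """
    return [decay ** k for k in range(1, num_history + 1)]


# Assumed setup: decay factor 0.4 from the config, three history frames.
weights = history_weights(0.4, 3)
for k, w in enumerate(weights, start=1):
    print(f"history frame {k}: weight {w:.4f}")
```

Under these assumptions the weights fall to roughly 0.4, 0.16, and 0.064, so the oldest frames would contribute little to the fused features. If the model does weight history this way, rerunning the 5-frame configuration with a larger decay factor would be one way to test whether the temporal information is being used.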