Hi @Linwei-Chen,
Thank you for your great work and for releasing the code for FDAM! I'm currently trying to integrate it into the official Mask DINO (Detectron2 version) for instance segmentation, following the results reported in Table 3 of your ICCV paper.
I understand the standard FDAM integration shown in the README is for a ViT backbone Block:
- Replace `Attention` with `AttentionwithAttInv`.
- Add `GroupDynamicScale` after Attention.
- Add `GroupDynamicScale` after MLP.
However, Mask DINO's Transformer Decoder uses object queries (`tgt`) of shape `(num_queries, batch, dim)`, which lack a spatial structure (H, W). It seems `GroupDynamicScale` (which uses `rfft2` and requires a `(B, C, H, W)` input) cannot be directly applied to these sequences.
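To make the mismatch concrete, here is a minimal sketch using only `torch.fft.rfft2`. The sizes are hypothetical and chosen for illustration; nothing here is taken from the FDAM code itself.

```python
import torch

# Hypothetical sizes, for illustration only.
num_queries, batch, dim = 300, 2, 256
H, W = 32, 32

# A spatial feature map: rfft2 transforms the last two dims, i.e. (H, W).
feat = torch.randn(batch, dim, H, W)
feat_freq = torch.fft.rfft2(feat)
print(feat_freq.shape)  # torch.Size([2, 256, 32, 17])

# Object queries have no (H, W) axes. rfft2 still runs, but it transforms
# over (batch, dim), which is not a spatial frequency decomposition.
tgt = torch.randn(num_queries, batch, dim)
tgt_freq = torch.fft.rfft2(tgt)
print(tgt_freq.shape)  # torch.Size([300, 2, 129])
```

So the call does not fail outright on `tgt`, but the result mixes the batch and channel axes, which is why I assume it is not the intended usage.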
- AttInv: I successfully integrated `AttentionwithAttInv` into `DeformableTransformerDecoderLayer` to replace `self_attn`.
- FreqScale: I am stuck here. I see `GroupDynamicScale` requires spatial dimensions.
Could you please clarify the intended integration path for Mask DINO?
- Is the FreqScale component used at all for the Mask DINO experiments? If so, what is the correct location? For example, should it be applied to the `memory` features (Key/Value) from the Pixel Decoder (which do have spatial structure)?
- Or does the Mask DINO integration in the paper only use the AttInv part?
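If the `memory` route is the intended one, this is roughly what I had in mind. It is only a sketch under my own assumptions: `apply_per_level` is a hypothetical helper, the `memory` layout `(B, sum(H_l * W_l), C)` is assumed, and `nn.Identity()` stands in for `GroupDynamicScale`, whose constructor arguments I have not tried to reproduce here.

```python
import torch
import torch.nn as nn


def apply_per_level(memory, spatial_shapes, spatial_module):
    """Apply a (B, C, H, W) module to each level of a flattened memory tensor.

    memory:         (B, sum(H_l * W_l), C)  -- assumed layout
    spatial_shapes: list of (H_l, W_l) tuples, one per feature level
    spatial_module: any module mapping (B, C, H, W) -> (B, C, H, W)
    """
    B, _, C = memory.shape
    sizes = [h * w for h, w in spatial_shapes]
    outs = []
    for chunk, (h, w) in zip(memory.split(sizes, dim=1), spatial_shapes):
        # (B, H*W, C) -> (B, C, H, W): restore the spatial grid of this level.
        x = chunk.transpose(1, 2).reshape(B, C, h, w)
        x = spatial_module(x)
        # (B, C, H, W) -> (B, H*W, C): flatten back to the sequence layout.
        outs.append(x.flatten(2).transpose(1, 2))
    return torch.cat(outs, dim=1)


# Hypothetical multi-scale shapes; nn.Identity() is a placeholder module.
spatial_shapes = [(16, 16), (8, 8), (4, 4)]
memory = torch.randn(2, sum(h * w for h, w in spatial_shapes), 256)
out = apply_per_level(memory, spatial_shapes, nn.Identity())
print(out.shape)  # torch.Size([2, 336, 256])
```

With the identity placeholder the output equals the input, which at least confirms that the reshape round trip preserves token order.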
Any guidance or pointers to relevant configs/files in the repo would be greatly appreciated. Thank you in advance for your time and help!
Best,
zetao