[Question] How to correctly integrate FDAM (especially FreqScale) into Mask DINO (Detectron2)? #7

@justdays

Hi @Linwei-Chen,

Thank you for your great work and for releasing the code for FDAM! I'm currently trying to integrate it into the official Mask DINO (Detectron2 version) for instance segmentation, following the results reported in Table 3 of your ICCV paper.

As I understand it, the standard FDAM integration shown in the README targets a ViT backbone Block:

  1. Replace Attention with AttentionwithAttInv.
  2. Add GroupDynamicScale after Attention.
  3. Add GroupDynamicScale after MLP.

However, Mask DINO's Transformer Decoder operates on object queries (tgt) of shape (num_queries, batch, dim), which have no spatial structure (H, W). Since GroupDynamicScale uses rfft2 and expects a (B, C, H, W) input, it does not seem directly applicable to these query sequences.
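To make the shape mismatch concrete, here is a minimal sketch using only `torch.fft.rfft2`. The tensor sizes (batch 2, 256 channels, 32x32 map, 300 queries) are illustrative assumptions, not values taken from the FDAM or Mask DINO configs.

```python
import torch

# Illustrative sizes only (assumptions, not taken from any config).
B, C, H, W = 2, 256, 32, 32
num_queries = 300

feature_map = torch.randn(B, C, H, W)   # spatial feature map
tgt = torch.randn(num_queries, B, C)    # object queries: no (H, W) axes

# rfft2 transforms the last two dimensions, here (H, W).
spec_map = torch.fft.rfft2(feature_map, norm="ortho")
print(spec_map.shape)  # torch.Size([2, 256, 32, 17])

# On the queries the same call does not raise, but it treats
# (batch, dim) as if it were a 2-D spatial grid.
spec_tgt = torch.fft.rfft2(tgt, norm="ortho")
print(spec_tgt.shape)  # torch.Size([300, 2, 129])
```

The second call is the worrying case: nothing fails, but the transform mixes the batch and channel axes, so the resulting "frequencies" have no spatial meaning.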

My progress so far:

  1. AttInv: I successfully integrated AttentionwithAttInv into DeformableTransformerDecoderLayer, replacing self_attn.
  2. FreqScale: I am stuck here, because GroupDynamicScale requires spatial dimensions that the object queries do not have.

Could you please clarify the intended integration path for Mask DINO?

  • Is the FreqScale component used at all for the Mask DINO experiments? If so, what is the correct location? For example, should it be applied to the memory features (Key/Value) from the Pixel Decoder (which do have spatial structure)?
  • Or does the Mask DINO integration in the paper only use the AttInv part?
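For reference, the first option could look roughly like the sketch below. It assumes the flattened multi-scale memory has shape (B, sum(H_l * W_l), C) with a list of per-level (H_l, W_l) shapes, as in Deformable-DETR-style encoders. `SpectralScaleStandIn` and `apply_per_level` are hypothetical names introduced for illustration; the stand-in is not GroupDynamicScale and only demonstrates the reshape round trip. Whether this matches the setup used in the paper is exactly what I am asking.

```python
import torch
import torch.nn as nn


class SpectralScaleStandIn(nn.Module):
    """Hypothetical placeholder for a spatial module such as GroupDynamicScale.

    NOT the FDAM implementation: it rescales the rfft2 spectrum with one
    learnable gain per channel, initialised to 1 (identity at init).
    """

    def __init__(self, channels):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, x):  # x: (B, C, H, W)
        h, w = x.shape[-2:]
        spec = torch.fft.rfft2(x, norm="ortho")
        return torch.fft.irfft2(spec * self.gain, s=(h, w), norm="ortho")


def apply_per_level(memory, spatial_shapes, module):
    """Apply a (B, C, H, W) module to each level of a flattened memory.

    memory: (B, sum(H_l * W_l), C); spatial_shapes: list of (H_l, W_l).
    """
    sizes = [h * w for h, w in spatial_shapes]
    outs = []
    for chunk, (h, w) in zip(memory.split(sizes, dim=1), spatial_shapes):
        b, _, c = chunk.shape
        fmap = chunk.transpose(1, 2).reshape(b, c, h, w)  # tokens -> map
        fmap = module(fmap)
        outs.append(fmap.flatten(2).transpose(1, 2))      # map -> tokens
    return torch.cat(outs, dim=1)


# Illustrative three-level pyramid (assumed sizes).
spatial_shapes = [(32, 32), (16, 16), (8, 8)]
memory = torch.randn(2, sum(h * w for h, w in spatial_shapes), 256)
out = apply_per_level(memory, spatial_shapes, SpectralScaleStandIn(256))
print(out.shape)  # torch.Size([2, 1344, 256])
```

The output keeps the flattened layout, so in principle it could be passed to the decoder's cross-attention unchanged. The object queries themselves are not touched.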

Any guidance or pointers to relevant configs/files in the repo would be greatly appreciated. Thank you in advance for your time and help!

Best,
zetao
