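Why is the MoE layer only added to the last decoder layer? Every layer's predictions go through both one-to-one and one-to-many matching during training, so if only the last layer uses MoE and the other layers use a plain FFN, does this affect training performance? #21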

@yangrongkun

Description

No description provided.
