# learnable alpha parameter for entmax, initialized to 1.33
self.alpha = torch.nn.Parameter(torch.tensor(1.33))
attention_probs = entmax_bisect(attention_scores, alpha=self.alpha, dim=-1)
I used the AdamW optimizer directly for backpropagation and found that the learned value of alpha kept decreasing and eventually dropped below 1.
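For context, here is a minimal sketch of how I wired this up; the module name, loss, and dummy data below are only placeholders for illustration, not my actual model:

import torch
from entmax import entmax_bisect

# Minimal module wrapping the learnable alpha, mirroring the snippet above.
class EntmaxAttention(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = torch.nn.Parameter(torch.tensor(1.33))

    def forward(self, attention_scores):
        return entmax_bisect(attention_scores, alpha=self.alpha, dim=-1)

# Toy training loop with AdamW; the loss and data are placeholders.
attn = EntmaxAttention()
optimizer = torch.optim.AdamW(attn.parameters(), lr=1e-2)
scores = torch.randn(8, 16)
target = torch.softmax(torch.randn(8, 16), dim=-1)

for step in range(200):
    optimizer.zero_grad()
    probs = attn(scores)
    loss = torch.nn.functional.mse_loss(probs, target)
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        # monitoring alpha here is where I see it decrease below 1
        print(step, attn.alpha.item())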
May I ask if I used the entmax method incorrectly?