What if the mixed audio contains only one speaker, and he/she is not the enrolled speaker?

First, thank you for your great work!
I've tried your pretrained model, but it can't handle the case in the title: it just outputs the original audio.

In the non-causal case, maybe I can use spk_model to classify the whole audio and output a zero tensor when the speaker does not match the enrollment.
But in the causal case, what should I do?
I've found that ecapa_tdnn can output frame-level embeddings, but I'm not sure whether they are discriminative enough.
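
To make the two ideas concrete, here is a rough sketch of what I have in mind. It is only an illustration under assumptions: the embeddings are taken as given (for example from spk_model or ecapa_tdnn), the function names `gate_non_causal` and `gate_causal` are hypothetical and not part of this repo, and the threshold of 0.5 is an arbitrary placeholder that would need tuning. The causal variant only uses embeddings seen so far, so whether it works depends on the frame-level embeddings being discriminative, which is exactly what I'm unsure about.

```python
import numpy as np


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def gate_non_causal(audio, mix_emb, enroll_emb, threshold=0.5):
    """Non-causal: compare one utterance-level embedding with the enrollment.

    Returns the audio unchanged if the speakers match, otherwise zeros.
    `threshold` is an assumed value, not taken from the repo.
    """
    if cosine(mix_emb, enroll_emb) >= threshold:
        return audio
    return np.zeros_like(audio)


def gate_causal(chunks, frame_embs, enroll_emb, threshold=0.5):
    """Causal: gate each chunk using the running mean of past embeddings.

    `chunks` and `frame_embs` are parallel sequences: one embedding per
    audio chunk. No future information is used for any decision.
    """
    running_sum = np.zeros_like(enroll_emb, dtype=np.float64)
    gated = []
    for count, (chunk, emb) in enumerate(zip(chunks, frame_embs), start=1):
        running_sum += emb
        running_mean = running_sum / count
        keep = cosine(running_mean, enroll_emb) >= threshold
        gated.append(chunk if keep else np.zeros_like(chunk))
    return gated
```

The running mean makes early decisions noisy and later ones more stable, so the first chunks may be gated incorrectly.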