Description
I tried the baseline LSTM model on my own version of the dataset. I downloaded the videos and built the labels with the make_dataset.py script, but the labels in my dataset don't match the original ones. I still tested the model on this modified dataset, using the average of the user_summary annotations as the evaluation labels, and got an F-score of about 0.30. Using the maximum value instead gave better results, with an F-score of 0.52. A rough sketch of how I build those evaluation labels is shown below.
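This is roughly how I collapse the per-user annotations (user_summary, shape num_users x num_frames, binary) into a single label vector. The 0.5 threshold for the averaged labels is my own choice, not something taken from the repo:

```python
import numpy as np

def f_score(pred, gt):
    """Frame-level F1 between a predicted binary summary and binary labels."""
    overlap = float((pred * gt).sum())
    if overlap == 0:
        return 0.0
    precision = overlap / pred.sum()
    recall = overlap / gt.sum()
    return 2 * precision * recall / (precision + recall)

def build_labels(user_summary, mode="avg"):
    """Collapse the per-user annotations into one binary label vector."""
    if mode == "avg":
        # average over users, then binarize (0.5 threshold is my assumption)
        return (user_summary.mean(axis=0) >= 0.5).astype(int)
    # "max": a frame counts as ground truth if any user selected it
    return user_summary.max(axis=0).astype(int)
```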
Later, I evaluated the model against gt_score, converting it into shot summaries the same way we do during training (see the sketch after this paragraph). With this setup I got an average F-score of 0.70, but the F1-score varied a lot.
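For reference, this is roughly how I turn the frame-level gt_score into a binary shot summary before computing the F-score. The change_points come from the h5 file; the 15% frame budget and the greedy shot selection (instead of the usual 0/1 knapsack) are simplifications on my side:

```python
import numpy as np

def score_to_summary(gt_score, change_points, budget_ratio=0.15):
    """Select whole shots by average gt_score until ~15% of frames are used."""
    n_frames = len(gt_score)
    # average importance of each shot [start, end] (inclusive)
    shots = [(start, end, gt_score[start:end + 1].mean())
             for start, end in change_points]
    # pick highest-scoring shots first (greedy stand-in for knapsack)
    order = sorted(range(len(shots)), key=lambda i: shots[i][2], reverse=True)
    summary = np.zeros(n_frames, dtype=int)
    budget = int(budget_ratio * n_frames)
    used = 0
    for i in order:
        start, end, _ = shots[i]
        length = end - start + 1
        if used + length > budget:
            continue
        summary[start:end + 1] = 1
        used += length
    return summary
```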
As you can see in the image, the F1-score keeps fluctuating. Is this way of evaluating valid, and does the unstable F-score indicate a problem?
