Description
I tried the baseline LSTM model on my own version of the dataset. I downloaded the videos and built the labels with the make_dataset.py script, but the labels in my dataset don't match the original ones. I still tested the model on this modified dataset, using the average of the user_summary annotations as the evaluation labels, and got an F-score of about 0.30. Using the maximum value instead gave better results, with an F-score of 0.52. A rough sketch of how I build those evaluation labels is shown below.
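This is roughly how I collapse the per-user annotations (user_summary, shape num_users x num_frames, binary) into a single label vector. The 0.5 threshold for the averaged labels is my own choice, not something taken from the repo:

```python
import numpy as np

def f_score(pred, gt):
    """Frame-level F1 between a predicted binary summary and binary labels."""
    overlap = float((pred * gt).sum())
    if overlap == 0:
        return 0.0
    precision = overlap / pred.sum()
    recall = overlap / gt.sum()
    return 2 * precision * recall / (precision + recall)

def build_labels(user_summary, mode="avg"):
    """Collapse the per-user annotations into one binary label vector."""
    if mode == "avg":
        # average over users, then binarize (0.5 threshold is my assumption)
        return (user_summary.mean(axis=0) >= 0.5).astype(int)
    # "max": a frame counts as ground truth if any user selected it
    return user_summary.max(axis=0).astype(int)
```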
Later, I evaluated the model against gt_score, converting it into shot summaries the same way we do during training (see the sketch after this paragraph). With this setup I got an average F-score of 0.70, but the F1-score varied a lot.
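For reference, this is roughly how I turn the frame-level gt_score into a binary shot summary before computing the F-score. The change_points come from the h5 file; the 15% frame budget and the greedy shot selection (instead of the usual 0/1 knapsack) are simplifications on my side:

```python
import numpy as np

def score_to_summary(gt_score, change_points, budget_ratio=0.15):
    """Select whole shots by average gt_score until ~15% of frames are used."""
    n_frames = len(gt_score)
    # average importance of each shot [start, end] (inclusive)
    shots = [(start, end, gt_score[start:end + 1].mean())
             for start, end in change_points]
    # pick highest-scoring shots first (greedy stand-in for knapsack)
    order = sorted(range(len(shots)), key=lambda i: shots[i][2], reverse=True)
    summary = np.zeros(n_frames, dtype=int)
    budget = int(budget_ratio * n_frames)
    used = 0
    for i in order:
        start, end, _ = shots[i]
        length = end - start + 1
        if used + length > budget:
            continue
        summary[start:end + 1] = 1
        used += length
    return summary
```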
As you can see in the image, the F1-score keeps fluctuating. Is this way of evaluating valid, and does the unstable F-score indicate a problem?
