
Build a binary classifier that labels each block of text as either a cue block or a discussion block.
Test different block sizes (a moving window over sentences), e.g. 1, 2, 3, 4, 5, or 10 sentences.
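The moving-window idea above could be sketched roughly as follows. This is only a minimal illustration, not the project's actual code: the function name `sentence_blocks`, the `step` parameter, and the constant `CANDIDATE_SIZES` are all hypothetical, and it assumes the transcript has already been split into a list of sentences.

```python
from typing import List

# Candidate block sizes taken from the notes; the set to try is open-ended.
CANDIDATE_SIZES = (1, 2, 3, 4, 5, 10)


def sentence_blocks(sentences: List[str], size: int, step: int = 1) -> List[List[str]]:
    """Return moving-window blocks of `size` consecutive sentences.

    Assumption: a transcript shorter than `size` yields one (short) block
    rather than no blocks at all.
    """
    if size < 1 or step < 1:
        raise ValueError("size and step must be positive")
    if not sentences:
        return []
    if len(sentences) <= size:
        return [list(sentences)]
    last_start = len(sentences) - size
    return [sentences[i:i + size] for i in range(0, last_start + 1, step)]


if __name__ == "__main__":
    demo = ["Call to order.", "Any comments?", "I think so.", "Motion passes."]
    for size in (1, 2, 3):
        print(size, sentence_blocks(demo, size))
```

With `step=1` the windows overlap, so each sentence appears in up to `size` blocks. Whether blocks should overlap at all is a design choice the notes leave open.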
After finding the block size whose classifier performs best, use that trained classifier to generate a signal that is 0 for discussion blocks and 1 for cue blocks.
So the sequence the classifier generates for a transcript may look something like the bottom sequence in the image: (1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1).
The top sequence is created by assuming that each minutes item always has an "intro cue", some discussion, and then an "outro cue". So generate this sequence as (1, 0, 1) * M, where M is the number of minutes items. For example, for three minutes items the generated sequence is (1, 0, 1, 1, 0, 1, 1, 0, 1).
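Generating the template sequence is simple enough to show directly. A minimal sketch, where the function name `template_sequence` and the `pattern` parameter are hypothetical names chosen for illustration:

```python
from typing import List, Sequence


def template_sequence(num_items: int, pattern: Sequence[int] = (1, 0, 1)) -> List[int]:
    """Repeat the (intro cue, discussion, outro cue) pattern once per minutes item."""
    if num_items < 0:
        raise ValueError("num_items must be non-negative")
    return list(pattern) * num_items


if __name__ == "__main__":
    # Three minutes items, as in the example above.
    print(template_sequence(3))  # [1, 0, 1, 1, 0, 1, 1, 0, 1]
```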
Finally, perform dynamic time warping / sequence alignment on these two sequences to find the best path.
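One way the alignment step might look is a textbook dynamic time warping routine with backtracking. This is a sketch under assumptions, not necessarily the variant the project will use: the function name `dtw_path` is hypothetical, the local cost is the absolute difference between the two 0/1 values, and the allowed steps are the standard match, insertion, and deletion moves with no window constraint.

```python
from typing import List, Sequence, Tuple


def dtw_path(a: Sequence[float], b: Sequence[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the DTW distance between `a` and `b` and one optimal warping path.

    The path is a list of (index_in_a, index_in_b) pairs from (0, 0)
    to (len(a) - 1, len(b) - 1).
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    inf = float("inf")
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path


if __name__ == "__main__":
    template = [1, 0, 1, 1, 0, 1, 1, 0, 1]
    predicted = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1]
    distance, path = dtw_path(template, predicted)
    print(distance)
    print(path)
```

For the two example sequences from these notes the runs of 1s and 0s occur in the same order, so the warping path can stretch each template element over the matching run and the distance is 0. Each template index in the path can then be mapped back to the minutes item it belongs to (index // 3 for the (1, 0, 1) pattern).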
Evaluate overall performance with Pk / WindowDiff.
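For reference, both metrics can be computed with a sliding window over boundary indicators. The sketch below follows one common formulation and rests on several assumptions: each segmentation is encoded as a 0/1 list where 1 marks a segment boundary after that sentence (how boundaries are derived from the alignment is not covered here), the window size `k` defaults to half the average reference segment length, and the names `pk`, `window_diff`, and `default_window` are hypothetical. Published definitions differ slightly in window handling, so scores may not match other implementations exactly.

```python
from typing import Optional, Sequence


def default_window(ref: Sequence[int]) -> int:
    """Half the average reference segment length (a common convention)."""
    num_segments = sum(ref) + 1
    return max(1, round(len(ref) / (2 * num_segments)))


def _validate(ref: Sequence[int], hyp: Sequence[int], k: int) -> None:
    if len(ref) != len(hyp):
        raise ValueError("reference and hypothesis must have the same length")
    if not 1 <= k <= len(ref):
        raise ValueError("window size k must be between 1 and len(ref)")


def pk(ref: Sequence[int], hyp: Sequence[int], k: Optional[int] = None) -> float:
    """Fraction of windows where ref and hyp disagree on whether any boundary is present."""
    k = default_window(ref) if k is None else k
    _validate(ref, hyp, k)
    windows = len(ref) - k + 1
    errors = sum(
        (sum(ref[i:i + k]) > 0) != (sum(hyp[i:i + k]) > 0) for i in range(windows)
    )
    return errors / windows


def window_diff(ref: Sequence[int], hyp: Sequence[int], k: Optional[int] = None) -> float:
    """Fraction of windows where ref and hyp contain a different number of boundaries."""
    k = default_window(ref) if k is None else k
    _validate(ref, hyp, k)
    windows = len(ref) - k + 1
    errors = sum(sum(ref[i:i + k]) != sum(hyp[i:i + k]) for i in range(windows))
    return errors / windows


if __name__ == "__main__":
    reference = [0, 0, 1, 0, 0, 0, 1, 0, 0]
    hypothesis = [0, 1, 0, 0, 0, 0, 1, 0, 0]
    print(pk(reference, hypothesis), window_diff(reference, hypothesis))
```

Both scores are error rates, so lower is better and 0 means the hypothesis agrees with the reference in every window.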