chore(scripts): add segmented-transcript backfill script#121

Closed
ssarunic wants to merge 2 commits into main from feat/drop-legacy-transcript-tabs

Conversation

@ssarunic

Owner

Follow-up to #120 (which dropped the legacy-blended/shadow transcript tabs). That PR merged the frontend change but not the script that made it possible — this PR adds it.

What

scripts/backfill_segmented_transcripts.py — a one-off, idempotent repair that re-runs the spec #18 segmented cleanup (TranscriptCleaningProcessor) on episodes that have a cleaned Markdown transcript but no AnnotatedTranscript JSON sidecar. This is what backfilled the 5 stragglers that predated the segmented pipeline, bringing coverage to 866/866 so the legacy tab could be removed.

  • Dry-run by default; --apply to write; --episode-id to target specific rows; --force to re-clean episodes that already have a sidecar.
  • Skips episodes that already have a sidecar unless --force, so re-running is safe and won't re-spend LLM tokens.
  • Already ran successfully against the production DB (5/5 episodes).
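
For reviewers who want the flag semantics at a glance, here is a minimal sketch of the selection and dry-run behaviour described above. It is not the script's actual code: the `Episode` dataclass, its field names, `select_targets`, `run`, and the `clean` callback are hypothetical stand-ins, and treating `--episode-id` as a repeatable integer flag is an assumption.

```python
import argparse
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical stand-in for a database row; field names are assumptions.
    id: int
    has_cleaned_markdown: bool
    has_sidecar: bool


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Backfill segmented transcripts (sketch)")
    parser.add_argument("--apply", action="store_true",
                        help="write changes; without this flag the run is a dry-run")
    parser.add_argument("--episode-id", type=int, action="append", default=None,
                        help="restrict to specific episodes (assumed repeatable)")
    parser.add_argument("--force", action="store_true",
                        help="re-clean episodes that already have a sidecar")
    return parser


def select_targets(episodes, episode_ids=None, force=False):
    """Return episodes that have cleaned Markdown but no sidecar (or all matches with force)."""
    targets = []
    for ep in episodes:
        if episode_ids is not None and ep.id not in episode_ids:
            continue  # not one of the requested rows
        if not ep.has_cleaned_markdown:
            continue  # nothing to re-clean
        if ep.has_sidecar and not force:
            continue  # already backfilled; skipping keeps re-runs cheap
        targets.append(ep)
    return targets


def run(argv, episodes, clean):
    """Parse flags, pick targets, and call clean(ep) only when --apply is given."""
    args = build_parser().parse_args(argv)
    targets = select_targets(episodes, args.episode_id, args.force)
    for ep in targets:
        if args.apply:
            clean(ep)  # stand-in for the real cleanup call
            print(f"cleaned episode {ep.id}")
        else:
            print(f"[dry-run] would clean episode {ep.id}")
    return [ep.id for ep in targets]
```

Because the skip check looks at whether a sidecar already exists, a second `--apply` run selects nothing unless `--force` is passed, which is what makes re-running safe.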

Note

Uses print() for CLI output, consistent with the sibling scripts/backfill_feed_holes.py. scripts/ is one-off operational tooling, not application code.

ssarunic added 2 commits May 30, 2026 20:07
All episodes now carry segmented transcripts, so the "Legacy blended"
alternate-rendering tab and the never-populated "Shadow" debug tab no
longer earn their place in the transcript panel.

- Remove the sub-tab toggle, the TranscriptSubTab type, SubTabButton,
  and the ShadowTranscript type / shadow response field.
- TranscriptViewer stays as the fallback renderer for raw / mid-pipeline
  / loading episodes (the backend `content` field still feeds it); only
  the user-facing legacy/shadow *tabs* are removed.

tsc + frontend transcript tests green.

The prerequisite for dropping the legacy-blended tab: re-runs the spec #18
segmented cleanup (TranscriptCleaningProcessor) on episodes that have a
cleaned Markdown transcript but no AnnotatedTranscript JSON sidecar, so
every episode gets a segmented view. Used to backfill the 5 stragglers
that predated the segmented pipeline; idempotent and re-runnable
(skips episodes that already have a sidecar unless --force).

Dry-run by default; --apply to write; --episode-id to target specific rows.

@ssarunic

Owner Author

Superseded by a cleaner single-file branch off current main (this branch's merge-base predated #120's squash-merge, so its diff showed 4 unrelated no-op frontend files). Reopening as a fresh PR containing only the backfill script.

@ssarunic ssarunic closed this May 30, 2026
@ssarunic ssarunic deleted the feat/drop-legacy-transcript-tabs branch May 30, 2026 18:47