IrdiZ/albsub


🇦🇱 AlbSub: Subtitle Translation Pipeline for Albanian

Translate movie subtitles into Albanian using any LLM. Built because we couldn't find English subs for a 2004 Italian comedy at 3 AM in January.


The Problem

I wanted to watch Christmas in Love (2004), a classic Italian cinepanettone with Boldi & De Sica. The movie is in Italian. English subtitles? Don't exist. Albanian subtitles? Forget about it.

This isn't a one-off problem. Thousands of movies (Italian, Turkish, Greek, Indian) are loved by Albanian audiences but have zero Albanian subtitle coverage. The existing subtitle databases (OpenSubtitles, Subscene, Podnapisi) have virtually nothing in Albanian. What does exist is often machine-translated garbage that misses cultural context, humor, and natural speech.

Albanian is one of the most underserved languages in the subtitle ecosystem.

The implications go beyond just watching movies:

  • Albanian diaspora (estimated 10M+ worldwide) consumes foreign media daily with no subtitle support
  • Albanian film education suffers: students can't study foreign cinema in their language
  • Cultural accessibility: older generations who don't speak English are locked out of global entertainment
  • The Albanian art scene: directors, screenwriters, and filmmakers lose exposure to international storytelling techniques when they can't access foreign films with quality translations

The Solution

AlbSub is a CLI pipeline that takes subtitle files (.srt) in any source language and produces high-quality Albanian translations using LLMs. Not Google Translate. Not a lookup table. Actual contextual, natural, colloquial Albanian: the kind that sounds like a human translator wrote it.

Key Features

  • 🌍 Multi-language input: Italian, English, Turkish, Greek, French, German, Spanish, and more → Albanian
  • 🤖 Any LLM backend: OpenAI, Anthropic, local Ollama models, or any OpenAI-compatible API
  • 📊 Live progress tracking: real-time progress bar with ETA, blocks translated, speed
  • ✅ Line validation: automatically checks that every block has the correct number of lines (no dropped second lines, no truncated dialogue)
  • 🔄 Batch processing: translates in configurable batches for speed and reliability
  • 🔁 Auto-retry: failed blocks are automatically retried with exponential backoff
  • 📝 SRT-aware: preserves timestamps, HTML tags (<i>, <b>), speaker labels ([Name]), and subtitle formatting
  • 🎭 Context-aware: sends surrounding blocks as context so the LLM understands the scene, not just isolated lines
  • 🔍 Validation report: post-translation report showing block count match, line count match, empty block detection
  • ⚡ Parallel workers: configurable concurrency for faster translation

Architecture

┌──────────────┐     ┌───────────────┐     ┌──────────────────┐     ┌───────────────┐
│  Input .srt  │────▶│  SRT Parser   │────▶│  Batch Chunker   │────▶│  LLM Workers  │
│  (any lang)  │     │  (validate)   │     │  (configurable)  │     │  (parallel)   │
└──────────────┘     └───────────────┘     └──────────────────┘     └───────┬───────┘
                                                                            │
                     ┌───────────────┐     ┌──────────────────┐             │
                     │  Output .srt  │◀────│  Validator       │◀────────────┘
                     │  (Albanian)   │     │  (line matching) │
                     └───────────────┘     └──────────────────┘
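
One thing the "LLM Workers (parallel)" stage has to get right is ordering: batches may finish out of order, but subtitles must come back in sequence. The sketch below shows one way to do that in Python with a thread pool. The names `translate_batches` and `fake_translate` are hypothetical, and the LLM call is replaced by a stand-in; this is an assumption about how such a stage could work, not AlbSub's confirmed design.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def translate_batches(
    batches: Sequence[List[str]],
    translate_batch: Callable[[List[str]], List[str]],
    workers: int = 2,
) -> List[str]:
    """Translate batches concurrently and return blocks in original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, even when a later
        # batch finishes first, so subtitle order is preserved.
        results = list(pool.map(translate_batch, batches))
    return [block for batch in results for block in batch]


def fake_translate(batch: List[str]) -> List[str]:
    # Stand-in for a real LLM call; it only tags each block.
    return [f"sq:{text}" for text in batch]


if __name__ == "__main__":
    print(translate_batches([["a", "b"], ["c"]], fake_translate, workers=2))
```

Threads fit here because the work is I/O-bound (waiting on an API), so the GIL is not the bottleneck.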

Pipeline Steps

  1. Parse: read the .srt file and extract blocks (number, timestamp, text lines)
  2. Detect language: auto-detect the source language or accept a user override
  3. Chunk: group blocks into batches (default: 50 blocks per batch)
  4. Translate: send each batch to the configured LLM with:
    • System prompt enforcing Albanian translation rules
    • Context window (previous 3 blocks for continuity)
    • Strict instruction to preserve line count per block
  5. Validate: for each translated block:
    • Line count matches original ✓
    • No empty lines where original had text ✓
    • HTML tags preserved ✓
    • Speaker labels preserved ✓
    • Timestamps unchanged ✓
  6. Retry: any failed validation → re-translate that block with explicit error feedback
  7. Assemble: write validated blocks to the output .srt
  8. Report: print a summary with total blocks, pass rate, and any remaining issues
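
As a rough illustration of steps 1 and 3, here is a small Python sketch that parses SRT text into blocks and groups them into batches with the preceding blocks attached as context. `Block`, `parse_srt`, and `chunk_with_context` are hypothetical names, and the handling of malformed chunks (silently skipped) is an assumption; the real parser may behave differently.

```python
import re
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Block:
    index: int          # subtitle number
    timestamp: str      # e.g. "00:00:01,000 --> 00:00:02,000"
    lines: List[str]    # one entry per text line


def parse_srt(text: str) -> List[Block]:
    """Split SRT text into blocks of (number, timestamp, text lines)."""
    cleaned = text.replace("\r\n", "\n").lstrip("\ufeff").strip()
    blocks = []
    # Blocks are separated by one or more blank lines.
    for chunk in re.split(r"\n\s*\n", cleaned):
        rows = chunk.split("\n")
        if len(rows) < 2 or not rows[0].strip().isdigit() or "-->" not in rows[1]:
            continue  # this sketch skips malformed chunks
        blocks.append(Block(int(rows[0].strip()), rows[1].strip(), rows[2:]))
    return blocks


def chunk_with_context(
    blocks: List[Block], batch_size: int = 50, context_window: int = 3
) -> Iterator[Tuple[List[Block], List[Block]]]:
    """Yield (context, batch) pairs; context is the blocks just before the batch."""
    for start in range(0, len(blocks), batch_size):
        context = blocks[max(0, start - context_window):start]
        yield context, blocks[start:start + batch_size]
```

Keeping each block's text as a list of lines makes the later line-count check a simple length comparison.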

Usage

# Basic usage: Italian to Albanian using Claude
albsub translate movie.ita.srt -o movie.alb.srt --language it --provider anthropic

# Using OpenAI
albsub translate movie.srt -o movie.alb.srt --language en --provider openai --model gpt-4o

# Using local Ollama model
albsub translate movie.srt -o movie.alb.srt --language tr --provider ollama --model llama3

# With config file
albsub translate movie.srt -o movie.alb.srt --language el --config albsub.config.yml

# Parallel workers for speed
albsub translate movie.srt -o movie.alb.srt --language it --workers 4

# Validate an existing translation
albsub validate original.srt translated.srt

# Dry run: show what would be translated without calling the API
albsub translate movie.srt -o movie.alb.srt --language it --dry-run

Configuration

# albsub.config.yml
provider: anthropic          # anthropic | openai | ollama | custom
model: claude-sonnet-4-20250514       # any model the provider supports
api_key: ${ANTHROPIC_API_KEY}  # env var reference
base_url: null               # custom endpoint (for ollama, vllm, etc.)

translation:
  target: sq                 # Albanian (ISO 639-1)
  batch_size: 50             # blocks per API call
  context_window: 3          # surrounding blocks for context
  workers: 2                 # parallel translation workers
  max_retries: 3             # retry failed blocks

validation:
  strict_line_count: true    # enforce matching line counts
  check_empty: true          # flag empty translations
  check_tags: true           # verify HTML tag preservation
  check_labels: true         # verify speaker label preservation

style:
  formality: colloquial      # colloquial | neutral | formal
  dialect: standard          # standard | gheg | tosk
  preserve_slang: true       # attempt to find Albanian equivalents for slang
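
The `${ANTHROPIC_API_KEY}` value above is an environment variable reference. How AlbSub resolves it isn't documented here; the following is a minimal Python sketch of one way to do it, operating on an already-parsed config dictionary. `resolve_env` is a hypothetical helper, and raising on a missing variable is an assumed policy.

```python
import os
import re
from typing import Any, Mapping

# Matches references of the form ${NAME}
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(value: Any, env: Mapping[str, str] = os.environ) -> Any:
    """Recursively replace ${VAR} references in strings with values from env."""
    if isinstance(value, str):
        def lookup(match: "re.Match[str]") -> str:
            name = match.group(1)
            if name not in env:
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _ENV_REF.sub(lookup, value)
    if isinstance(value, dict):
        return {key: resolve_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env(item, env) for item in value]
    return value  # numbers, booleans, and None pass through unchanged
```

Failing loudly on an unset variable avoids sending an empty API key and getting a confusing authentication error later.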

Supported Source Languages

| Language | Code | Quality |
|----------|------|---------|
| Italian | it | ⭐⭐⭐⭐⭐ (tested extensively) |
| English | en | ⭐⭐⭐⭐⭐ |
| Turkish | tr | ⭐⭐⭐⭐ |
| Greek | el | ⭐⭐⭐⭐ |
| French | fr | ⭐⭐⭐⭐ |
| German | de | ⭐⭐⭐⭐ |
| Spanish | es | ⭐⭐⭐⭐ |
| Serbian | sr | ⭐⭐⭐⭐ |
| Arabic | ar | ⭐⭐⭐ |
| Hindi | hi | ⭐⭐⭐ |

Quality depends on the LLM's training data for that language pair. Italian/English → Albanian works best since most LLMs have strong coverage of all three.

Validation System

The #1 problem with LLM subtitle translation is dropped lines. A 2-line subtitle block comes back as 1 line, losing half the dialogue. AlbSub solves this:

Original (Italian):                   Bad Translation:               AlbSub Output:
───────────────────                   ────────────────               ──────────────
[Guido] <i>Questo sono io,</i>        [Guido] <i>This is me,</i>     [Guido] <i>Ky jam unë,</i>
<i>Guido Baldi. Ho 54 anni.</i>       (LINE MISSING!)                <i>Guido Baldi. Jam 54 vjeç.</i>

Every block is validated post-translation. If line counts don't match, the block is automatically re-sent to the LLM with an explicit correction prompt. This runs up to 3 times before flagging it for manual review.
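
A minimal Python sketch of this validate-and-retry loop follows. The function names, the regexes for tags and speaker labels, and the wording of the correction prompt are assumptions for illustration. The sketch also reads "up to 3 times" as three retries after the first attempt, which may not match the actual implementation.

```python
import re
from typing import Callable, List, Optional

TAG = re.compile(r"</?[ib]>")        # <i>, </i>, <b>, </b>
LABEL = re.compile(r"^\[[^\]]+\]")   # leading speaker label such as [Guido]


def _label(line: str) -> Optional[str]:
    match = LABEL.match(line)
    return match.group() if match else None


def validate_block(original: List[str], translated: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the block passed."""
    if len(translated) != len(original):
        # Per-line checks are meaningless when the lines don't align.
        return [f"expected {len(original)} lines, got {len(translated)}"]
    problems = []
    for n, (src, dst) in enumerate(zip(original, translated), start=1):
        if src.strip() and not dst.strip():
            problems.append(f"line {n} is empty")
        if TAG.findall(src) != TAG.findall(dst):
            problems.append(f"line {n} changed HTML tags")
        if _label(src) != _label(dst):
            problems.append(f"line {n} changed speaker label")
    return problems


def translate_with_validation(
    original: List[str],
    translate: Callable[[List[str], Optional[str]], List[str]],
    max_retries: int = 3,
) -> Optional[List[str]]:
    """Translate a block, re-sending it with error feedback until it validates."""
    feedback = None
    for _ in range(max_retries + 1):
        candidate = translate(original, feedback)
        problems = validate_block(original, candidate)
        if not problems:
            return candidate
        feedback = "Fix these issues: " + "; ".join(problems)
    return None  # caller flags the block for manual review
```

Passing the list of problems back to the model is what makes the retry more than a blind re-roll: the second prompt can say exactly which line went missing.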

Why LLMs Beat Traditional Machine Translation

Google Translate for subtitles gives you:

  • ❌ Literal word-for-word translation
  • ❌ No understanding of humor, sarcasm, or cultural context
  • ❌ Formal register when the character is being casual
  • ❌ No awareness that this is dialogue, not a document

LLMs give you:

  • ✅ Natural, conversational Albanian
  • ✅ Humor and cultural references adapted (not just translated)
  • ✅ Correct register: casual when characters are casual, formal when formal
  • ✅ Context from surrounding dialogue
  • ✅ Understanding of speaker labels and scene context

The Origin Story

January, 3 AM. I wanted to watch Christmas in Love (2004), a Boldi & De Sica Italian Christmas comedy. The movie exists in Italian. English subtitles? Scraped the entire internet (OpenSubtitles, Subscene, Podnapisi, SubDL, obscure forums): nothing. Found Italian .srt files, ran them through a translation pipeline I built on the spot, and had English subs in 15 minutes.

Then I thought: if English subs don't exist for a popular Italian comedy, what about Albanian? Albanian subtitles are virtually nonexistent for foreign films. Millions of Albanian speakers worldwide are consuming Turkish dramas, Italian comedies, and Greek films, all without subtitle support.

That's how AlbSub was born. A tool that can take any .srt file in any language and produce quality Albanian subtitles using the LLM of your choice.

Contributing

PRs welcome. Especially:

  • New language pair testing and quality reports
  • Albanian dialect support (Gheg/Tosk)
  • Performance optimizations
  • Additional LLM provider integrations

Translation Results: Side-by-Side Comparisons

All translations below were generated by AlbSub using GPT-4o with these settings: batch size 25, context window 3, temperature 0.3. Every run had a 100% validation pass rate.


🇮🇹→🇦🇱 Italian to Albanian: Christmas in Love (2004)

Classic cinepanettone with Boldi & De Sica. The film that started this whole project.

| # | 🇮🇹 Italian (Original) | 🇦🇱 Albanian (AlbSub) |
|---|------------------------|------------------------|
| 4 | Questo sono io, Guido Baldi. Ho 54 anni. | Ky jam unë, Guido Baldi. Kam 54 vjeç. |
| 11 | e una moglie splendida, mai tradita. Finché non è arrivata lei. | dhe një grua e mrekullueshme, kurrë e tradhtuar. Derisa erdhi ajo. |
| 12 | Sofia, russa di Siberia, 25 anni, bella da far paura! | Sofia, ruse nga Siberia, 25 vjeç, e bukur sa të tremb! |
| 16 | Mi sono innamorato di lei come un bimbo. | U dashurova me të si një fëmijë. |
| 18 | - Tieni, amore. - Cos'è? | - Ja, dashuri. - Çfarë është? |
| 19 | - Buon compleanno! L'ho ricordato. | - Gëzuar ditëlindjen! E mbajta mend. |

🇬🇧→🇦🇱 English to Albanian

| # | 🇬🇧 English (Original) | 🇦🇱 Albanian (AlbSub) |
|---|------------------------|------------------------|
| 1 | Good morning everyone! | Mirëmëngjes të gjithëve! |
| 3 | I can't believe this happened. | Nuk mund ta besoj që ndodhi kjo. |
| 5 | I went to the market with my mother. | Shkova në treg me mamanë time. |
| 9 | Life is beautiful, but also difficult. | Jeta është e bukur, por edhe e vështirë. |
| 14 | [Julia] I really hope so. With all my heart. | [Julia] Shpresoj shumë. Me gjithë zemër. |
| 16 | Don't forget the keys! | Mos harro çelësat! |

🇮🇹 vs 🇬🇧 → 🇦🇱 Cross-Language Consistency

The same dialogue translated from both Italian and English sources, to show that AlbSub produces consistent Albanian regardless of source language.

| 🇮🇹 Italian | 🇬🇧 English | 🇦🇱 from Italian | 🇦🇱 from English |
|-------------|-------------|------------------|------------------|
| Buongiorno a tutti! | Good morning everyone! | Mirëmëngjes të gjithëve! | Mirëmëngjes të gjithëve! |
| Come stai oggi? Tutto bene? | How are you today? Everything okay? | Si je sot? Çdo gjë mirë? | Si jeni sot? Gjithçka në rregull? |
| Sono andato al mercato con mia madre. | I went to the market with my mother. | Shkova në treg me mamin. | Shkova në treg me mamanë time. |
| Grazie di tutto, amico mio. | Thank you for everything, my friend. | Faleminderit për gjithçka, miku im. | Faleminderit për gjithçka, miku im. |

✅ Consistent meaning across source languages · ✅ Natural phrasing variation · ✅ Speaker labels & HTML tags preserved

⚠️ A Note on AI Translation

Let's be real: no AI will ever match a native Albanian speaker. AlbSub gets you 90% of the way there, fast. It handles structure, context, formatting, and produces surprisingly natural Albanian. But it's still an LLM at the end of the day. It might mix up gender (Kjo vs Ky), pick a slightly awkward phrasing, or miss a cultural nuance that only a native would catch.

The point isn't to replace human translators. It's to give you working subtitles in minutes when there's no human translator available, which, for Albanian, is almost always. Watch the movie tonight, not next month.

For production-quality subtitles, run AlbSub first, then have a native speaker do a quick pass. You'll save hours compared to translating from scratch.

License

MIT


Made with 🔥 by Irdi Zeneli

Because every language deserves subtitles.
