Translation Memory Tester for Machine Translation

A CLI tool for testing translation memory (TM) systems by generating test data, creating TMX files, and calculating Levenshtein-based match scores.

Core idea

Many machine translation (MT) tools, such as DeepL Translate, now support translation memories. When a source segment matches an entry in the translation memory, the MT system prefers the human-validated translation over the AI-generated one. This improves the consistency and quality of the translation while keeping the flexibility of AI translation when there is no match.
But how well does this work in practice? This tool lets you find out. Here's the basic flow:

  • Generate text in English and translate it to German.
  • Build a translation memory based on these texts and the TMX standard.
  • Create a variant of the source text.
  • Upload the translation memory to your machine translation service. Services that support translation memories generally accept the TMX format. Example: DeepL lets you upload TMX files in the customization hub.
  • Translate the source text variant with the MT service.
  • Evaluate how well the TM integration works in the MT output. Compare with the matching report generated by this tool to see if matches were inserted as expected.

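You will probably notice big differences between the matches the report predicts and the matches that actually show up in the MT output. These are usually caused by segmentation misalignment. This tool lets you experiment with different segmentation approaches to see which one works better.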

Features

  • Text Generation: Generate English source text about electric vehicles in cities using the Claude API
  • Smart Segmentation: Segment text into TM-style units (short or long mode)
  • German Translation: Translate segments to casual, fluent German
  • TMX Export: Export to TMX 1.4 format for import into OmegaT, DeepL, memoQ, etc.
  • Variation Generation: Create test variations targeting specific match percentages (100%, 95-99%, 85-94%, 70-84%, 50-69%, 0-49%)
  • Match Scoring: Calculate Levenshtein-based similarity scores
  • Reports: Generate JSON and HTML reports with diff highlighting
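
To make "Levenshtein-based similarity" concrete, here is a minimal sketch of how such a score can be computed. The function names and the normalisation by the longer string's length are assumptions for illustration; the tool's actual formula may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def match_score(source: str, candidate: str) -> float:
    """Similarity in percent. Dividing by the longer length is an assumption."""
    longest = max(len(source), len(candidate))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1 - levenshtein(source, candidate) / longest)
```

With this definition, a single changed character in a four-character segment gives 75%, which shows why short segments drop out of the high match bands quickly.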

Installation

Prerequisites

  • Python 3.10+
  • UV package manager (recommended)

Setup

# Clone the repository
git clone https://github.com/rhofkens/translation-memory-tester.git
cd translation-memory-tester

# Create virtual environment and install
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

API Key Configuration

Set your Anthropic API key:

export ANTHROPIC_API_KEY='your-api-key'

Quick Start

Run the complete pipeline with a single command:

tmtest run-all --output-dir ./my-test

This will:

  1. Generate ~500 words of English text about EVs
  2. Segment the text into TM units
  3. Translate to German (casual style)
  4. Export to TMX format
  5. Generate variations for different match levels
  6. Calculate match scores and generate reports
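
The six steps above map onto the individual commands documented below. As a rough sketch, they could be chained from Python like this. It only uses commands and options shown in this README; that `run-all` executes exactly this sequence with these file names is an assumption, and `pipeline_commands` and `run_pipeline` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def pipeline_commands(out_dir: str, mode: str = "short") -> list[list[str]]:
    """Build the argv lists for each pipeline step (file names follow 'Output Files')."""
    out = Path(out_dir)

    def p(name: str) -> str:
        return str(out / name)

    return [
        ["tmtest", "generate", "--output", p("source.json")],
        ["tmtest", "segment", p("source.json"),
         "--output", p("segments.json"), "--mode", mode],
        ["tmtest", "translate", p("segments.json"), "--output", p("translated.json")],
        ["tmtest", "export-tmx", p("translated.json"), "--output", p("memory.tmx")],
        ["tmtest", "variate", p("segments.json"), "--output", p("variations.json")],
        ["tmtest", "match", p("variations.json"),
         "--tm", p("memory.tmx"), "--output", p("report.json")],
    ]


def run_pipeline(out_dir: str, mode: str = "short", dry_run: bool = True) -> list[list[str]]:
    """Print the commands (dry run) or execute them, stopping at the first failure."""
    commands = pipeline_commands(out_dir, mode)
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
    return commands
```

Running the steps separately like this is mainly useful when you want to inspect or edit an intermediate file, for example to re-segment without regenerating the source text.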

Commands

tmtest run-all

Run the complete pipeline.

tmtest run-all --output-dir ./output --verbose --segment-mode short

Options:

  • --output-dir, -o: Output directory (default: ./output)
  • --verbose, -V: Show detailed output
  • --segment-mode, -m: short for better TM reuse, long for full sentences

tmtest generate

Generate English source text.

tmtest generate --output source.json

tmtest segment

Segment source text into TM units.

tmtest segment source.json --output segments.json --mode short

Options:

  • --mode, -m: short (splits at conjunctions) or long (full sentences)
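
The difference between the two modes can be illustrated with a small sketch. This is not the tool's implementation: the conjunction list, the rule of splitting only after a comma, and the `segment` function are all assumptions made for the example.

```python
import re

# Hypothetical conjunction list; the tool's real list is not documented here.
CONJUNCTIONS = ("and", "but", "or", "because", "while", "although")

# Sentence boundary: whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Clause boundary: whitespace between a comma and one of the conjunctions.
_CLAUSE_BREAK = re.compile(r"(?<=,)\s+(?=(?:%s)\b)" % "|".join(CONJUNCTIONS))


def segment(text: str, mode: str = "long") -> list[str]:
    """Split text into full sentences ('long') or clause-level units ('short')."""
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    if mode == "long":
        return sentences
    if mode != "short":
        raise ValueError(f"unknown mode: {mode!r}")
    segments: list[str] = []
    for sentence in sentences:
        segments.extend(_CLAUSE_BREAK.split(sentence))
    return segments
```

For the text "Electric cars are quiet, but charging takes time. Cities are adapting.", long mode yields two segments and short mode yields three, because the first sentence is split before "but". Shorter units are more likely to recur unchanged, which is the reason short mode tends to give better TM reuse.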

tmtest translate

Translate segments to German.

tmtest translate segments.json --output translated.json

tmtest validate

Validate German translations for coherence.

tmtest validate translated.json
tmtest validate translated.json --fix --output fixed.json

tmtest export-tmx

Export to TMX format.

tmtest export-tmx translated.json --output memory.tmx
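
For orientation, a TMX 1.4 file is an XML document with a header and a body of translation units, each holding one segment per language. The sketch below builds such a file with the Python standard library. The `build_tmx` function and the header values (tool name, version, `segtype`) are placeholders chosen for the example, not what `tmtest export-tmx` actually writes.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="en", tgt_lang="de"):
    """Build a minimal TMX 1.4 tree from (source, target) string pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "tmx-sketch",   # placeholder value
        "creationtoolversion": "0.1",   # placeholder value
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.ElementTree(tmx)
```

Calling `build_tmx([("Hello.", "Hallo.")]).write("memory.tmx", encoding="utf-8", xml_declaration=True)` writes the file to disk. Note that the granularity of each `tu` follows directly from the segmentation mode used earlier.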

tmtest variate

Generate variations for different match percentages.

tmtest variate segments.json --output variations.json

tmtest match

Match variations against TM and generate reports.

# JSON report
tmtest match variations.json --tm memory.tmx --output report.json

# HTML report with diff highlighting
tmtest match variations.json --tm memory.tmx --output report.html --format html
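
The idea behind diff highlighting can be sketched with Python's `difflib`. This example compares word by word and wraps changes in `<del>` and `<ins>` tags; the `highlight_diff` function and the choice of tags are assumptions, and the HTML report produced by the tool may look different.

```python
import difflib
import html


def highlight_diff(tm_source: str, variant: str) -> str:
    """Return HTML marking, word by word, what changed from tm_source to variant."""
    old_words, new_words = tm_source.split(), variant.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old = html.escape(" ".join(old_words[i1:i2]))
        new = html.escape(" ".join(new_words[j1:j2]))
        if tag == "equal":
            parts.append(new)
            continue
        if old:  # words removed or replaced
            parts.append(f"<del>{old}</del>")
        if new:  # words inserted or used as replacement
            parts.append(f"<ins>{new}</ins>")
    return " ".join(parts)
```

A word-level diff is easier to read than a character-level one, but it is a presentation choice only; the match score itself is still based on Levenshtein distance.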

Output Files

| File | Description |
|---|---|
| source.json | Generated English text with word count |
| segments.json | Segmented text with segment IDs |
| translated.json | Segments with German translations |
| memory.tmx | TMX file for import into TM systems |
| variations.json | Test variations tagged by intended match category |
| report.json | Match results with scores and statistics |
| report.html | Visual report with diff highlighting |

Match Categories

| Category | Score Range | Variation Strategy |
|---|---|---|
| Exact | 100% | Identical segments |
| Near-exact | 95-99% | Punctuation/capitalization changes |
| High Fuzzy | 85-94% | Synonym substitutions |
| Medium Fuzzy | 70-84% | Phrase-level changes |
| Low Fuzzy | 50-69% | Significant rewrites |
| No Match | 0-49% | New content |
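
Mapping a score to one of these categories is a simple threshold lookup. The sketch below is one way to do it; `match_category` is a hypothetical name, and assigning fractional scores that fall between two listed ranges (for example 94.5) to the lower band is an assumption, since the ranges above are given in whole percentages.

```python
# Lower bound of each fuzzy band, checked from highest to lowest.
BANDS = (
    (95, "Near-exact"),
    (85, "High Fuzzy"),
    (70, "Medium Fuzzy"),
    (50, "Low Fuzzy"),
)


def match_category(score: float) -> str:
    """Return the match category for a similarity score in percent (0-100)."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score == 100:
        return "Exact"
    for lower_bound, name in BANDS:
        if score >= lower_bound:
            return name
    return "No Match"
```

For example, `match_category(97.2)` returns "Near-exact" and `match_category(49)` returns "No Match".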

TMX Compatibility

The generated TMX files are compatible with:

  • OmegaT
  • DeepL Translation Memory
  • memoQ
  • SDL Trados
  • Any TMX 1.4-compliant system

Development

# Install dev dependencies
uv pip install -e ".[dev]"

# Run tests
pytest

# Format code
ruff format src/
ruff check src/ --fix

License

MIT
