Building a RAG System for Translation Memory

How I built a Retrieval-Augmented Generation system that combines the power of LLMs with traditional translation memory for better translation consistency.

Traditional Translation Memory (TM) systems work on exact or fuzzy matching—they find segments similar to what you’re translating and suggest previous translations. But what if we could make TM smarter by combining it with Large Language Models?

That’s exactly what I built: a RAG (Retrieval-Augmented Generation) system for translation.

The Problem with Traditional TM

Standard TM matching has limitations:

  • Fuzzy matching is literal: A 70% match might have completely different meaning
  • No semantic understanding: “The car is red” and “The vehicle is crimson” are seen as very different (see the quick comparison after this list)
  • Context blindness: The same source segment might need different translations depending on context
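
To make that second point concrete, here's a small side-by-side (just a sketch, using Python's difflib as a stand-in fuzzy matcher and the same multilingual embedding model used later in this post; the exact numbers will vary):

from difflib import SequenceMatcher
from sentence_transformers import SentenceTransformer, util

a = "The car is red"
b = "The vehicle is crimson"

# Surface-level fuzzy matching only compares characters
fuzzy_score = SequenceMatcher(None, a, b).ratio()

# Embedding similarity compares meaning
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
emb_a, emb_b = model.encode([a, b])
semantic_score = util.cos_sim(emb_a, emb_b).item()

print(f"Fuzzy ratio:       {fuzzy_score:.2f}")
print(f"Cosine similarity: {semantic_score:.2f}")

The point isn't the exact numbers: the fuzzy ratio is driven by shared characters, while the embedding score is driven by shared meaning.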

Enter RAG

RAG systems work in two phases:

  1. Retrieval: Find relevant information from a knowledge base
  2. Generation: Use an LLM to generate output informed by that retrieved context

For translation, this means:

  1. Retrieval: Find semantically similar previous translations
  2. Generation: Ask the LLM to translate while considering those examples

The Architecture

Here’s my setup:

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│ Source Text │────▶│  Embeddings  │────▶│ Vector DB   │
└─────────────┘     └──────────────┘     └─────────────┘


┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│ Translation │◀────│     LLM      │◀────│  Retrieved  │
└─────────────┘     └──────────────┘     │   Examples  │
                                         └─────────────┘

Implementation Highlights

1. Creating Embeddings from TM

from sentence_transformers import SentenceTransformer
import chromadb

# Load a multilingual embedding model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Initialize vector database
client = chromadb.Client()
collection = client.create_collection("translation_memory")

def index_tm(tm_entries):
    """Index TM entries for semantic search"""
    for entry in tm_entries:
        # Create embedding from source + target concatenated
        text = f"{entry['source']} ||| {entry['target']}"
        embedding = model.encode(text).tolist()
        
        collection.add(
            embeddings=[embedding],
            documents=[text],
            metadatas=[{"source": entry['source'], "target": entry['target']}],
            ids=[entry['id']]
        )

2. Semantic Retrieval

def find_similar_translations(source_text, n_results=5):
    """Find semantically similar previous translations"""
    query_embedding = model.encode(source_text).tolist()
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )
    
    return results['metadatas'][0]

3. LLM-Powered Translation

def translate_with_rag(source_text, target_lang, client):
    """Translate using retrieved TM examples as context"""
    
    # Retrieve similar translations
    examples = find_similar_translations(source_text)
    
    # Format examples for the prompt
    examples_text = "\n".join([
        f"Source: {ex['source']}\nTranslation: {ex['target']}"
        for ex in examples
    ])
    
    prompt = f"""You are a professional translator. Translate the following text to {target_lang}.

Use these previous translations as reference for terminology and style:

{examples_text}

Now translate:
Source: {source_text}
Translation:"""
    
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.content[0].text

Results

In my testing with technical documentation:

Metric                     Traditional TM + MT    RAG System
Terminology consistency    78%                    94%
Style adherence            65%                    89%
Post-editing time          Baseline               -35%

The biggest improvement? Terminology consistency. The RAG system naturally picks up domain-specific terms from the retrieved examples.

Lessons Learned

1. Embedding model matters: Multilingual models work better than translating everything to English first.

2. Chunk size affects retrieval: For translation, sentence-level chunks work better than paragraphs.

3. Number of examples: 3-5 examples hit the sweet spot. More than that and the LLM gets confused.

4. Freshness weighting: Recent translations should be weighted higher—terminology evolves.
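
Here's a rough sketch of what that freshness weighting could look like. It assumes each TM entry also stores a Unix timestamp in its metadata, which the indexing code above doesn't do yet, and it uses a crude distance-to-similarity conversion:

import time

def rerank_by_recency(metadatas, distances, half_life_days=180):
    """Re-rank retrieved TM entries so newer translations score higher."""
    now = time.time()
    scored = []
    for meta, dist in zip(metadatas, distances):
        age_days = (now - meta.get("timestamp", now)) / 86400
        recency = 0.5 ** (age_days / half_life_days)  # exponential decay with a configurable half-life
        similarity = 1.0 - dist  # rough conversion of vector distance to a similarity score
        scored.append((similarity * recency, meta))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [meta for _, meta in scored]

You'd call it with results['metadatas'][0] and results['distances'][0] from the retrieval step.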

What’s Next

I’m currently experimenting with:

  • Hybrid retrieval: Combining semantic search with traditional fuzzy matching (rough sketch below)
  • Domain classification: Automatically selecting the most relevant TM subset
  • Fine-tuning embeddings: Training custom embeddings on translation pairs
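
For the hybrid retrieval idea, the simplest version I can picture just blends the two scores. Here's a sketch, with difflib standing in for a proper fuzzy matcher and alpha as a made-up blending weight:

from difflib import SequenceMatcher

def hybrid_score(query, candidate_source, semantic_similarity, alpha=0.7):
    """Blend embedding similarity with a character-level fuzzy ratio."""
    fuzzy = SequenceMatcher(None, query, candidate_source).ratio()
    return alpha * semantic_similarity + (1 - alpha) * fuzzy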

Try It Yourself

The core concept is simpler than it looks. If you have a TM and access to an LLM API, you can build a basic version in an afternoon.

Start with a small, domain-specific TM. The results are most impressive when you have consistent, high-quality previous translations to draw from.


Building something similar? I’d love to hear about your approach. Get in touch!