Machine Translation

Custom MT engines

Standard machine translation engines are trained on general data and often fail to translate specialized industry content accurately, misinterpreting terminology and brand voice.

Custom MT engines solve this by learning from your unique linguistic assets (like Translation Memories). This tailored approach produces more accurate, brand-aligned translations from the start, which reduces editing costs, speeds up project timelines, and ensures a consistent voice across all content.

How it works

We use your curated translation memories and glossaries to train a neural machine translation (NMT) engine. Tilde offers this as a full-service solution, managing asset preparation, training, and evaluation.

Step 1: Contact us

Reach out to us for pricing and to begin: https://tilde.ai/contact-us/

We will need to know which language pairs and translation domains you are interested in.

Step 2: Data collection

Training a custom engine requires at least 100,000 unique translation sentence pairs in your specific domain. At that volume, we combine your comparatively small dataset with our own much larger dataset, which often contains tens of millions of examples.

If your domain is very specialized and you can provide over a million examples, we might train the engine on your data only.

We prefer TMX format. See data requirements below.
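Before sending a translation memory, it can help to estimate how many unique sentence pairs it contains. The following is a minimal sketch, assuming a TMX file whose translation units carry one segment per language and whose language codes match exactly (for example "en" and "lv"). The function name, the sample data, and the exact-match rule are illustrative assumptions, not part of Tilde's tooling.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

MIN_PAIRS = 100_000  # minimum number of unique pairs stated above


def unique_pairs(tmx_text, src_lang, tgt_lang):
    """Return the set of unique (source, target) segment pairs in a TMX string."""
    root = ET.fromstring(tmx_text)
    pairs = set()
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files may use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() flattens inline markup and keeps all contained text.
                segs[lang] = "".join(seg.itertext()).strip()
        src, tgt = segs.get(src_lang.lower()), segs.get(tgt_lang.lower())
        if src and tgt:  # skip units with a missing or empty side
            pairs.add((src, tgt))
    return pairs


# Hypothetical sample: three translation units, one of them a duplicate.
SAMPLE = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu><tuv xml:lang="en"><seg>Open the valve.</seg></tuv>
        <tuv xml:lang="lv"><seg>Atveriet vārstu.</seg></tuv></tu>
    <tu><tuv xml:lang="en"><seg>Open the valve.</seg></tuv>
        <tuv xml:lang="lv"><seg>Atveriet vārstu.</seg></tuv></tu>
    <tu><tuv xml:lang="en"><seg>Close the valve.</seg></tuv>
        <tuv xml:lang="lv"><seg>Aizveriet vārstu.</seg></tuv></tu>
  </body>
</tmx>"""

pairs = unique_pairs(SAMPLE, "en", "lv")
print(len(pairs), "unique pairs; meets minimum:", len(pairs) >= MIN_PAIRS)
```

Real translation memories often use regional codes such as "en-US", so the language matching would likely need to be loosened for production use.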

Step 3: Data pre-processing

We will evaluate the quality of your data and how well it matches the domain, and we will communicate any issues to you.

The optimized linguistic data will be split into train, tune, and test datasets, used respectively to train the model, refine its parameters, and measure its performance.
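The idea of a three-way split can be sketched as follows. This is an illustration only: the random shuffle, the fixed seed, and the set sizes are assumptions, and the actual pre-processing may select tune and test data differently (for example, from representative documents).

```python
import random


def split_pairs(pairs, tune_size, test_size, seed=0):
    """Deduplicate sentence pairs and split them into train, tune, and test lists."""
    unique = sorted(set(pairs))  # dedupe so the same pair cannot land in two sets
    if tune_size + test_size >= len(unique):
        raise ValueError("not enough data left for training")
    random.Random(seed).shuffle(unique)  # fixed seed keeps the split reproducible
    tune = unique[:tune_size]
    test = unique[tune_size:tune_size + test_size]
    train = unique[tune_size + test_size:]
    return train, tune, test


# Hypothetical toy data: ten distinct sentence pairs.
toy_pairs = [(f"source {i}", f"target {i}") for i in range(10)]
train, tune, test = split_pairs(toy_pairs, tune_size=2, test_size=2)
print(len(train), len(tune), len(test))
```

Keeping the three sets disjoint matters because a test sentence that also appears in the training data would make the engine look better than it is.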

Step 4: Engine training and evaluation

Tilde trains the engine and assesses its improvement over generic MT using both text-overlap metrics such as BLEU and ChrF++, which measure how closely the output matches reference translations, and semantic metrics such as COMET, which measure how well the meaning is preserved.
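To give a feel for how a text-overlap metric compares two engines, here is a simplified character n-gram F-score in the spirit of ChrF. It is a teaching sketch, not the official ChrF++ implementation (which also uses word n-grams and corpus-level statistics), and the example sentences are invented. In practice an established scoring tool would be used.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level character n-gram F-score between 0 and 1."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta=2 weights recall more heavily than precision.
    return (1 + beta**2) * p * r / (beta**2 * p + r)


# Hypothetical outputs from a generic and a custom engine for one test sentence.
reference = "Tighten the pressure valve."
generic_output = "Close the pressure tap."
custom_output = "Tighten the pressure valve"

print("generic:", round(simple_chrf(generic_output, reference), 3))
print("custom: ", round(simple_chrf(custom_output, reference), 3))
```

A higher score for the custom output on a held-out test set is the kind of evidence used to show improvement over generic MT, although overlap metrics alone do not capture meaning, which is why semantic metrics are used alongside them.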

Step 5: Engine testing and feedback

Your Customer Success Manager (CSM) will notify you when your engine is ready for evaluation. During the evaluation period specified in that notice, you can test the engine with your own content.

If retraining an existing engine, use this time to compare the new version against the old one. Once satisfied with the quality, provide your final approval to your CSM.

We will analyse the feedback and implement changes if deemed necessary and possible.

Step 6: Maintenance

To stay current with new terminology and style, engines should be retrained periodically. A significant amount of new data is crucial for retraining to achieve the best results.

What are the data requirements?

Data is the single most important element in building a powerful and accurate Machine Translation (MT) system. To create the best possible MT engine for you, we need to understand the data you have. Here are answers to some common questions about our data requirements.

What is "parallel data" and why do you need it for training?

Parallel data is a collection of sentences or text segments in a source language paired with their professionally translated counterparts in a target language. It's the primary fuel for training a custom MT engine.

For training, the ideal parallel data has four key qualities:

Large Amount - The best results come from at least one million sentence pairs. We can work with 100,000 to a million, but anything less is too small for training, though it can still be valuable for testing.
Relevant Domain - The subject matter must match your needs. For example, if you need a legal MT system, medical data won't help.
Relevant Language Pair - The languages must match your project. English-Spanish data is not useful for building a Latvian-Estonian system.
High Quality - The texts must be accurate translations of each other.

If you have this data, please share it with us as early as possible for analysis.
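The volume thresholds in this document can be summarized in a small helper. This is a sketch of the guidance as worded here, not a rule applied automatically: the function name is hypothetical, and the treatment of a dataset of exactly one million pairs is an assumption, since the text only distinguishes "100,000 to a million" from "over a million".

```python
def training_approach(pair_count):
    """Map a count of unique in-domain sentence pairs to the guidance above."""
    if pair_count < 100_000:
        return "too small for training; may still be valuable for testing"
    if pair_count <= 1_000_000:
        return "trainable when combined with Tilde's larger dataset"
    return "large enough that training on your data alone may be considered"


for count in (20_000, 250_000, 3_000_000):
    print(f"{count:>9,} pairs: {training_approach(count)}")
```

Volume is only one of the four qualities listed above, so a large dataset in the wrong domain or of poor quality would still not be suitable.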

Do you also need data in just the target language?

Yes. This is called monolingual data. Large amounts of high-quality text in the target language can help the MT system learn to produce more fluent and natural-sounding output. We are primarily interested in target language data, not source language data.

What kind of data is needed for testing and evaluation?

This is also parallel data, but its purpose is different from training data.

Development Data - We use this to make key decisions while building the MT engine.
Evaluation Data - We use this to measure the quality of the final MT model before delivering it to you.

We need 500 to 2,000 high-quality sentence pairs for each of these sets, ideally from a few documents that perfectly represent the content you'll be translating. MT system training cannot begin without these data sets.

If you don't have this data, we strongly encourage you to create it, for instance by translating a few representative documents. This service would come at an additional cost.

What if my data is in PDFs, Word documents, or on a website?

If your data isn't in a simple, translation-ready format (like a translation memory), it will require preparation. This can include scraping your website or processing a large number of PDF and Word documents to extract the text.

This data collection and preparation process requires the involvement of our data team and will incur additional costs. We believe it's important for this possibility and its associated costs to be clear and defined for everyone from the start.

I have terminology lists or glossaries. Are those useful?

Absolutely. If you plan to use the MT engine with specific terminology, the system is only as good as the glossary it's given. If you have a glossary, you will need to share it with us. If you don't have one, you will need to create and manage one for the best results.

Will you use my data to test your system against competitors?

To provide you with concrete evidence of how our system performs, we like to compare its output against other major MT providers. To do this, we need to translate your evaluation data set using their systems.

We will only do this with your explicit written consent. Because we cannot guarantee data security with third-party providers, any data used for this purpose must not contain sensitive or confidential information.

How to start using your engine

Once the internal training and scoring are complete, the final steps are to evaluate and deploy your new engine for production use.

Here's how to start using your newly trained engine in different scenarios. You can identify your custom engine by its domain name; public engines use the General or General AI domain.

For simple text & file translation
For Trados, memoQ, or API Integrations
For Online CAT tool

This is the most straightforward way to use your engine directly on the Tilde MT platform.

Navigate to https://translate.tilde.ai.
Select your language pair (e.g., English to German).
In the domain menu, select the correct domain (e.g., Custom, Technical).

Retraining an existing engine

When an existing engine is retrained, a new version will be made available for your evaluation under a temporary, versioned domain name, such as "Custom v2" or "Technical v2025," while your original engine remains active.

During the evaluation period, you can directly compare the performance of the new version against the original. To test the retrained engine in your integrations, you will need to create a new, temporary access key specifically for the versioned domain (e.g., for "Custom v2").

Once you approve the new engine, it will replace the old one by taking over the original domain name. For example, "Custom v2" will become "Custom," and the old version will be turned off. At that moment, all your existing integrations using the original domain (including Trados, memoQ, and the API) will automatically start using the improved engine with no further changes required from you.

How do I report translation errors or quality issues?

Your feedback is essential for the continuous improvement and long-term performance of your custom MT engine. If you encounter translations that do not meet your quality standards, we encourage you to report them.

The primary channel for all quality-related feedback is support@tilde.com.

Providing actionable feedback

To help our team investigate and resolve the issue effectively, please provide us with specific examples. For each error, please include:

The original word or sentence that was translated.
The incorrect translation produced by the custom engine.
A brief description of what is wrong with the translation (e.g., "wrong terminology," "grammatically incorrect," "unnatural phrasing").
The correct/preferred translation - how the text should have been translated.
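If you collect many examples, keeping them in a consistent structure makes them easier to send and review. The sketch below models the four items listed above as a small record and formats it as plain text for an email. The class and field names and the example content are hypothetical; no particular format is required.

```python
from dataclasses import dataclass, fields


@dataclass
class TranslationIssue:
    source_text: str            # original word or sentence
    engine_output: str          # incorrect translation from the custom engine
    problem: str                # brief description of what is wrong
    preferred_translation: str  # how the text should have been translated

    def missing_fields(self):
        """Return the names of fields that are empty or whitespace only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def as_text(self):
        """Format the issue as plain text suitable for pasting into an email."""
        return (
            f"Source: {self.source_text}\n"
            f"Engine output: {self.engine_output}\n"
            f"Problem: {self.problem}\n"
            f"Preferred translation: {self.preferred_translation}"
        )


# Hypothetical example of a terminology error.
issue = TranslationIssue(
    source_text="Tighten the pressure valve.",
    engine_output="Aizveriet spiediena krānu.",
    problem="wrong terminology",
    preferred_translation="Pievelciet spiediena vārstu.",
)
if not issue.missing_fields():
    print(issue.as_text())
```

Checking for empty fields before sending helps ensure every report includes all four items, which is what makes the feedback actionable.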

What happens to your feedback?

Your feedback is logged and reviewed by our expert teams. This information is incredibly valuable as it helps us identify any immediate patterns or issues that need attention.