Evaluation of Machine Translation Output

Several methods exist to ensure the validity of machine translation (MT) output. A rudimentary one is “round-trip translation”: the original text is machine translated into the target language, and the result is then translated back into the original language so that it can be compared with the original. As the quality of machine translation continues to improve, a reliable method of evaluation becomes increasingly necessary. Currently, two main types of evaluation are used for machine translation: human and automated. A great deal of research aims to streamline evaluation so that the time, effort and cost savings of machine translation can be fully realised.
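The round-trip check described above can be sketched in a few lines. This is only an illustration under stated assumptions: `translate` is a hypothetical callable standing in for whatever MT system is being tested (it is not a real library API), and word-sequence similarity from Python's standard `difflib` is just one possible way to compare the back-translation with the original.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical signature: translate(text, source_lang, target_lang) -> str
Translator = Callable[[str, str, str], str]


def round_trip_similarity(text: str, translate: Translator,
                          source_lang: str, target_lang: str) -> float:
    """Translate text forward and back, then compare the result with the input."""
    forward = translate(text, source_lang, target_lang)
    back = translate(forward, target_lang, source_lang)
    # Compare lower-cased word sequences; 1.0 means the round trip
    # reproduced the original wording exactly.
    original_words = text.lower().split()
    back_words = back.lower().split()
    return SequenceMatcher(None, original_words, back_words).ratio()
```

A high score should be read with caution: errors made in the forward direction can be undone on the way back, so a good round trip does not by itself show that the target-language text is good.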

Human evaluation will always be required for quality control. Currently, the field lacks a standardised method and set of criteria for evaluating MT, and research is under way to test different methods and develop the necessary protocols. Scoring criteria typically used alongside human evaluation include Word Error Rate (WER), which is an algorithm-based test, and Subjective Sentence Error Rate (SSER), in which translations are assigned to different quality classes and then scored. Both criteria have limitations: neither reliably catches translation errors that involve complex issues such as nuances of language, historical context, and culture.
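As an illustration of the algorithm-based side, WER is commonly computed as the word-level edit distance between the MT output and a reference translation, divided by the number of words in the reference. The sketch below assumes whitespace tokenisation and a single reference; the function name is illustrative rather than taken from any particular tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference,
    divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev holds the previous row of the edit-distance table:
    # the distance between the first i-1 reference words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing from output
                            curr[j - 1] + 1,      # extra word in output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

For the reference "the cat sat on the mat" and the output "the cat sit on mat", two edits are needed (one substitution and one missing word), so the score is 2 / 6. Lower is better, and the value can exceed 1 when the output is much longer than the reference.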

Many human evaluation metrics have been proposed. The most commonly used manual evaluation metrics are fluency and adequacy, for which human judges are presented with the following definitions, with no additional instructions:




Adequacy          Fluency
All Meaning       Flawless English
Most Meaning      Good English
Much Meaning      Non-native English
Little Meaning    Disfluent English





The following figure (from the proceedings volume of the 2006 NAACL/HLT Workshop on Machine Translation) illustrates an annotation tool for manual judgement of the adequacy and fluency of machine translation output. In this example, translations from 5 randomly selected systems for a randomly selected sentence are presented, with no additional information.


Similarly, the DQF Tools developed by TAUS adopt an elaborate scoring approach: they evaluate segment by segment against multiple criteria, namely adequacy, fluency and error typology, and automatically calculate the total average.
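The idea of scoring segment by segment and then averaging can be sketched as follows. This is not the calculation used by the DQF Tools, which is not specified here; the data layout (one dictionary of numeric scores per segment) and the function name are assumptions made purely for illustration.

```python
from statistics import mean


def average_scores(segment_scores):
    """Average per-segment scores for each criterion and overall.

    segment_scores is a list of dicts such as {"adequacy": 4, "fluency": 3},
    one per evaluated segment (a hypothetical layout for this sketch).
    """
    if not segment_scores:
        raise ValueError("no segments to score")
    criteria = sorted(segment_scores[0])
    # Mean score per criterion across all segments.
    per_criterion = {c: mean(s[c] for s in segment_scores) for c in criteria}
    # Overall figure: the mean of the per-criterion averages.
    overall = mean(per_criterion.values())
    return per_criterion, overall
```

Averaging the per-criterion means gives every criterion equal weight; a real tool might weight criteria differently or treat error typology counts separately.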

Another example is Asia Online Language Studio, which includes a highly configurable human evaluation tool that can be used to measure aspects of linguistic quality.



Automated (or automatic) evaluation also applies standards to assess the quality of MT output. Different metrics have been developed to mimic human evaluation in order to speed up the overall translation process. At their best, these metrics serve as an aid that reduces the volume of output a human has to process. One challenge is the subjective nature of translation and of language itself: two human evaluators could discuss, compare and contrast many fine points of language and still disagree, and therefore would not rate the quality of a translation exactly the same.

One of the first metrics to correlate consistently with human judgements of quality is BLEU, which compares a machine-translated segment with one or more human reference translations and scores the translation on a scale of 0 to 1. Following its success, other metrics, such as NIST, have been modelled after it and have brought further advances to the research and methodology used for evaluation. WER (Word Error Rate), originally developed for testing speech recognition systems, counts the word-level edits (substitutions, insertions and deletions) needed to turn the MT output into a reference translation and divides that count by the number of words in the reference. Two more recently developed metrics are METEOR and LEPOR. METEOR was designed to address weaknesses of BLEU, for example by taking recall into account and by matching stems and synonyms as well as exact words. LEPOR, the more recent of the two, combines several components, including a length penalty, precision, recall and an n-gram position difference penalty, with the aim of agreeing more closely with human judgement.
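To make the 0 to 1 scale concrete, here is a simplified, sentence-level sketch of a BLEU-style score. It assumes whitespace tokenisation, a single reference and no smoothing, so it is not a substitute for an established implementation; BLEU is normally reported over a whole test corpus rather than a single sentence. The function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Geometric mean of clipped 1..max_n-gram precisions times a brevity penalty."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference,
        # so repeating a correct word cannot inflate the score.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: penalise hypotheses shorter than the reference.
    if len(hyp) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An output identical to the reference scores 1.0, while an output sharing no words with it scores 0.0. Because there is no smoothing, any sentence with no matching 4-gram also scores 0.0, which is one reason sentence-level scores from a sketch like this should be treated as rough indications only.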

MT is followed by post-editing, which requires the skills and experience of professional translators. Evaluating the machine translation output is important for deciding whether it is feasible to move on to the post-editing step and for estimating how much effort that step will take. In Machine Translation Post-Editing, skilled human translators refine the target-language output to make sure that the meaning and purpose of the source document are retained and that the information is properly conveyed to native speakers of the target language.

