Day: May 14, 2018

A New Way to Measure NMT Quality

A New Way to Measure NMT Quality

Neural Machine Translation (NMT) systems produce very high quality translations, and are poised to radically change the professional translation industry. These systems require quality feedback / scores on an ongoing basis. Today, the prevalent method is via Bilingual Evaluation Understudy (BLEU), but methods like this are no longer fit for purpose.

A better approach is to have a number of native speakers assess NMT output and rate the quality of each translation. One Hour Translation (OHT) is doing just that: our new NMT index is released in late April 2018 and fully available for the translation community to use.

A new age of MT

NMT marks a new age in automatic machine translation. Unlike technologies developed over the past 60 years,  the well-trained and tested NMT systems that are available today,  have the potential to replace human translators.

Aside from processing power, the main factors that impact NMT performance are:

  •      the amount and quality of initial training materials, and
  •      an ongoing quality-feedback process

For a NMT system to work well, it needs to be properly trained, i.e. “fed” with hundreds of thousands (and in some cases millions) of correct translations. It also requires feedback on the quality of the translations it produces.

NMT is the future of translation. It is already much better than previous MT technologies, but issues with training and quality assurance are impeding progress.

NMT is a “disruptive technology” that will change the way most translations are performed. It has taken over 50 years, but machine translation can now be used to replace human translators in many cases.

So what is the problem?

While NMT systems could potentially revolutionize the translation market, their development and adoption are hampered by the lack of quality input, insufficient means of testing the quality of the translations and the challenge of providing translation feedback.

These systems also require a lot of processing power, an issue which should be solved in the next few years, thanks to two main factors. Firstly, Moore’s law, which predicts that processing power doubles every 18 months, also applies to NMT, meaning that processing power will continue to increase exponentially. Secondly, as more companies become aware of the cost benefit of using NMT, more and more resources will be allocated for NMT systems.

Measuring quality is a different and more problematic challenge. Today, algorithms such as BLEU, METEOR, and TER try to predict automatically what a human being would say about the quality of a given machine translation. While these tests are fast, easy, and inexpensive to run (because they are simply software applications), their value is very limited. They do not provide an accurate quality score for the translation, and they fail to estimate what a human reviewer would say about the translation quality (a quick scan of the text in question by a human would reveal the issues with the existing quality tests).

Simply put, translation quality scores generated by computer programs that predict what a human would say about the translation are just not good enough.

With more major corporations including Google, Amazon, Facebook, Bing, Systran, Baidu, and Yandex joining the game, producing an accurate quality score for NMT translations becomes a major problem that has a direct negative impact on the adoption of NMT systems.

There must be a better way!

We need a better way to evaluate NMT systems, i.e. something that replicates the original intention more closely and can mirror what a human would say about the translation.

The solution seems simple: instead of having some software try to predict what a human would say about the translation, why not just ask enough people to rate the quality of each translation? While this solution is simple, direct, and intuitive, doing it right and in a way that is statistically significant means running numerous evaluation projects at one time.

NMT systems are highly specialized, meaning that if a system has been trained using travel and tourism content, testing it with technical material will not produce the best results. Thus, each type of material has to be tested and scored separately. In addition, the rating must be done for every major language pair, since some NMT engines perform better in particular languages. Furthermore, to be statistically significant, at least 40 people need to rate each project per language, per type of material, per engine. Besides that, each project should have at least 30 strings.

Checking one language pair with one type of material translated with one engine is relatively straightforward: 40 reviewers each check and rate the same neural machine translation consisting of about 30 strings. This approach produces relatively solid (statistically significant) results, and repeating it over time also produces a trend, i.e. making it possible to find out whether or not the NMT system is getting better.

The key to doing this one isolated evaluation is selecting the right reviewers and making sure they do their job correctly. As one might expect, using freelancers for the task requires some solid quality control procedures to make sure the answers are not “fake” or “random.”

At that magnitude (one language, one type of material, one NMT engine, etc), the task is manageable, even when run manually. It becomes more difficult when an NMT vendor, user, or LSP wants to test 10 languages and 10 different types of material with 40 reviewers each. In this case, each test requires between 400 reviewers (1 NMT engine x 1 type of material x 10 language pairs x 40 reviewers) and 4,000 reviewers (1 NMT engine x 10 types of material x 10 language pairs x 40 reviewers).

Running a human based quality score is a major task, even for just one NMT vendor. It requires up to 4,000 reviewers working on thousands of projects.

This procedure is relevant for every NMT vendor who wants to know the real value of their system and obtain real human feedback for the translations it produces.

The main challenge is of course finding, testing, screening, training, and monitoring thousands of reviewers in various countries and languages — monitoring their work while they handle tens of thousands of projects in parallel.

The greater good – industry level quality score

Looking at the greater good,  what is really needed is a standardised NMT quality score for the industry to employ, measuring all of the various systems using the same benchmark, strings, and reviewers, in order to compare like for like performance. Since the performance of NMT systems can vary dramatically between different types of materials and languages, a real human-based comparison using the same group of linguists and the same source material is the only way to produce real comparative results. Such scores will be useful both for the individual NMT vendor or user and for the end customer or LSP trying to decide which engine to use.

To produce the same tests on an industry-relevant level is a larger undertaking. Using 10 NMT engines, 10 types of material, 10 language pairs and 40 reviewers, the parameters of the project can be outlined as follows:

  •      Assuming the top 10 language pairs are evaluated, ie EN > ES, FR, DE, PT-BR, AR, RU, CN, JP, IT and KR;
  •      10 types of material – general, legal, marketing, finance, gaming, software, medical, technical, scientific, and tourism;
  •      10 leading (web-based) engines – Google, Microsoft (Bing), Amazon, DeepL, Systran, Baidu, Promt, IBM Watson, Globalese and Yandex;
  •      40 reviewers rating each project;
  •      30 strings per test; and
  •      12 words on average per string

This comes to a total of 40,000 separate tests (10 language pairs x 10 types of material x 10 NMT engines x 40 reviewers), each with at least 30 strings, i.e. 1,200,000 strings of 12 words each, resulting in an evaluation of approximately 14.4 million words. This evaluation is needed to create just one instance (!) of a real, comparative, human-based NMT quality index.

The challenge is clear: to produce just one instance of a real viable and useful NMT score, 4,000 linguists need to evaluate 1,200,000 strings equating to well over 14 million words!

The magnitude of the project, the number of people involved and the requirement to recruit, train, and monitor all the reviewers, as well as making sure, in real time, that they are doing the job correctly, are obviously daunting tasks, even for large NMT players, and certainly for traditional translation agencies.

Completing the entire process within a reasonable time (e.g. less than one day), so that the results are “fresh” and relevant makes it even harder.

There are not many translation agencies with the capacity, technology, and operational capability to run a project of that magnitude on a regular basis.

This is where One Hour Translation (OHT) excels. They have recruited, trained, and tested thousands of linguists in over 50 languages, and already run well over 1,000,000 NMT rating and testing projects for our customers. By the end of April 2018, they published the first human-based NMT quality index (initially covering several engines and domains and later expanding), with the goal of promoting the use of NMT across the industry.

A word about the future

In the future, a better NMT quality index can be built using the same technology NMT is built on, i.e. deep-learning neural networks. Building a Neural Quality system is just like building a NMT system. The required ingredients are high quality translations, high volume, and quality rating / feedback.

With these ingredients, it is possible to build a deep-learning, neural network based quality control system that will read the translation and score it like a human does. Once the NMT systems are working smoothly and a reliable, human based, quality score/feedback developed, , the next step will be to create a neural quality score.

Once a neural quality score is available, it will be further possible to have engines improve each other, and create a self-learning and self-improving translation system by linking the neural quality score to the NMT  (obviously it does not make sense to have a closed loop system as it cannot improve without additional external data).

With additional external translation data, this system will “teach itself” and learn to improve without the need for human feedback.

Google has done it already. Its AI subsidiary, DeepMind, developed AlphaGo, a neural network computer program that beat the world’s (human) Go champion. AlphaGo is now improving, becoming better and better, by playing against itself again and again – no people involved.

Reference: https://bit.ly/2HDXbTf