Tag: Computational Linguistics

Corporates Going All in on Neural Machine Translation Research

A number of research directions have been a staple since NMT began to appear on arXiv. Research on improving NMT output and the processes used by NMT systems, for example, is almost always present. Other research directions have only recently gained steam, such as low-resource languages, that is, languages for which little training data is available “in the wild.”

Graduate Interviews: It Is Hypocritical to Assume a Translator’s Role is Sacrosanct

Last September, shortly after her graduation from the Johannes Gutenberg University of Mainz, Ms Ekaterini Ntouska joined Lionbridge Poland to work as a Linguistic Game Tester. Thanks to her flawless results in the Memsource Student Certification Program and her positive, friendly, and highly proactive approach, Ekaterini also became a Translation Intern at PureFluent through our Talent Endorsement Program.

Edit Distance in Translation Industry

In computational linguistics, edit distance, most commonly Levenshtein distance, is a way of quantifying how dissimilar two strings (e.g., words) are by counting the minimum number of operations required to transform one string into the other. More generally, the edit distance between two strings a and b is the total weight of the minimum-weight series of edit operations that transforms a into b. One of the simplest sets of edit operations is the one defined by Levenshtein in 1966:

1- Insertion.

2- Deletion.

3- Substitution.

In Levenshtein’s original definition, each of these operations has unit cost (except that substituting a character for itself has zero cost), so the Levenshtein distance is equal to the minimum number of operations required to transform a into b.

For example, the Levenshtein distance between “kitten” and “sitting” is 3. A minimal edit script that transforms the former into the latter is:

  • kitten – sitten (substitution of “s” for “k”).
  • sitten – sittin (substitution of “i” for “e”).
  • sittin – sitting (insertion of “g” at the end).
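
The unit-cost definition above can be sketched in a few lines of Python. This is a minimal illustration using the standard dynamic-programming approach (keeping only two rows of the table), not a reference implementation; the function name `levenshtein` is chosen here for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of unit-cost insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # previous[j] = distance between the processed prefix of a and b[:j];
    # it starts as the row for the empty prefix of a (j insertions).
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The printed value matches the three-step edit script listed above.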

What are the applications of edit distance in the translation industry?

1- Spell Checkers

In automatic spelling correction, edit distance is used to determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question.
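
As a rough sketch of this idea, the Python snippet below ranks dictionary entries by their Levenshtein distance to a misspelled word. The helper name `suggest`, the tiny word list, and the cutoff `max_distance=2` are assumptions made for illustration; real spell checkers typically add word frequencies and faster lookup structures.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def suggest(word, dictionary, max_distance=2):
    """Return dictionary entries within max_distance edits of word,
    closest first (ties broken alphabetically).
    max_distance is an illustrative cutoff, not a standard value."""
    scored = sorted((levenshtein(word, entry), entry) for entry in dictionary)
    return [entry for distance, entry in scored if distance <= max_distance]


# Hypothetical mini-dictionary for the example.
dictionary = ["translation", "translator", "transition", "localization"]
print(suggest("translaton", dictionary))
```

Here “translation” and “translator” are each one edit away from the misspelling, so both are offered as candidates, while the more distant entries are filtered out.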

2- Machine Translation Evaluation and Post Editing

Edit distance can be used to compare a postedited file to the machine translated output that was the starting point for the postediting. When you calculate the edit distance, you are calculating the “effort” that the posteditor made to bring the machine translation up to a certain quality level. Starting from the same source content and the same MT output, a light postediting and a full postediting will produce different edit distances. The full postediting, which targets human quality, is expected to show the higher edit distance because more changes are needed. In this way, edit distance lets you quantify both light and full postediting.

Therefore, the edit distance is a kind of “word count” measure of the effort, similar in a way to the word count used to quantify the work of translators throughout the localization industry. It also helps in evaluating the quality of an MT engine by comparing the raw MT output to the version postedited by a human translator.
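
To make the comparison concrete, here is a minimal Python sketch that measures edit distance at the word level between raw MT output and two postedited versions. The sentences are invented for illustration, and dividing by the length of the postedited text is only one possible normalization, assumed here for the example; it is not the formula of any particular tool or metric.

```python
def edit_distance(source, target):
    """Unit-cost edit distance between two sequences (here, word lists)."""
    previous = list(range(len(target) + 1))
    for i, item_s in enumerate(source, start=1):
        current = [i]
        for j, item_t in enumerate(target, start=1):
            cost = 0 if item_s == item_t else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def postedit_effort(mt_output, postedited):
    """Return (word-level edit distance, distance per postedited word).
    Whitespace tokenization and this normalization are assumptions."""
    mt_tokens = mt_output.split()
    pe_tokens = postedited.split()
    distance = edit_distance(mt_tokens, pe_tokens)
    return distance, distance / max(len(pe_tokens), 1)


# Invented example: one MT output, two levels of postediting.
raw_mt = "the cat sit on mat"
light_pe = "the cat sits on mat"
full_pe = "the cat is sitting on the mat"

print(postedit_effort(raw_mt, light_pe))  # light postediting: fewer edits
print(postedit_effort(raw_mt, full_pe))   # full postediting: more edits
```

In this toy case the light postediting changes one word and the full postediting changes three, which mirrors the expectation that full postediting yields the higher edit distance.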

3- Fuzzy Match

In translation memories, edit distance underlies fuzzy matching, the technique of finding strings that match a pattern approximately (rather than exactly). Translation memories provide suggestions to translators, and fuzzy match scores are used to estimate the effort needed to improve those suggestions.
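
A fuzzy match lookup can be sketched along the same lines. In the Python example below, the percentage formula (one minus the distance divided by the longer string's length), the 75% threshold, and the names `fuzzy_score` and `best_match` are all assumptions for illustration; commercial translation tools compute their match scores in their own ways.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def fuzzy_score(segment: str, candidate: str) -> float:
    """Similarity as a percentage; 100 means identical strings.
    This formula is an illustrative assumption."""
    longest = max(len(segment), len(candidate))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(segment, candidate) / longest)


def best_match(segment, memory, threshold=75.0):
    """Return (stored source, stored translation, score) for the closest
    translation memory entry, or None if nothing reaches the threshold."""
    if not memory:
        return None
    source = max(memory, key=lambda stored: fuzzy_score(segment, stored))
    score = fuzzy_score(segment, source)
    return (source, memory[source], score) if score >= threshold else None


# Hypothetical two-entry translation memory (English to German).
memory = {
    "Click the Save button.": "Klicken Sie auf die Schaltfläche Speichern.",
    "Close the window.": "Schließen Sie das Fenster.",
}
print(best_match("Click the Save button", memory))
print(best_match("Restart the server now", memory))
```

The first query differs from a stored segment only by the missing full stop, so it comes back as a high fuzzy match; the second has no sufficiently similar entry and returns None.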