Edit Distance in Translation Industry

In computational linguistics, edit distance is a way of quantifying how dissimilar two strings (e.g., words) are by counting the minimum number of operations required to transform one string into the other. More generally, the edit distance between two strings a and b is the weight of the minimum-weight series of edit operations that transforms a into b. One of the simplest sets of edit operations is the one defined by Levenshtein in 1966, which gives the Levenshtein distance its name:

1- Insertion.

2- Deletion.

3- Substitution.

In Levenshtein’s original definition, each of these operations has unit cost (except that substitution of a character by itself has zero cost), so the Levenshtein distance is equal to the minimum number of operations required to transform a to b.

For example, the Levenshtein distance between “kitten” and “sitting” is 3. A minimal edit script that transforms the former into the latter is:

  • kitten → sitten (substitution of “s” for “k”).
  • sitten → sittin (substitution of “i” for “e”).
  • sittin → sitting (insertion of “g” at the end).
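The definition above can be turned into a short dynamic-programming routine. The following is a minimal sketch, assuming unit costs as in Levenshtein's definition; the function name `levenshtein` is chosen here for illustration and does not come from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of unit-cost insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] = distance between the prefix of a processed so far
    # and the first j characters of b. Start with the empty prefix of a.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning the first i characters of a into "" takes i deletions.
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1  # substituting a character by itself is free
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths.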

What are the applications of edit distance in the translation industry?

1- Spell Checkers

Automatic spelling correction uses edit distance to find candidate corrections for a misspelled word: it selects the words from a dictionary that have a low distance to the word in question.
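As a rough illustration of this idea, the sketch below ranks dictionary entries by their distance to a misspelled word. The word list, the function name `suggest`, and the cutoff of 2 edits are assumptions made for the example; real spell checkers typically use much larger dictionaries and faster lookup structures.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance between a and b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def suggest(word, dictionary, max_distance=2):
    """Return dictionary entries within max_distance edits of word,
    closest first (ties broken alphabetically)."""
    scored = sorted((levenshtein(word, entry), entry) for entry in dictionary)
    return [entry for distance, entry in scored if distance <= max_distance]


# Hypothetical mini-dictionary, for illustration only.
WORDS = ["translation", "transition", "translator", "localization"]

print(suggest("translaton", WORDS))  # ['translation', 'translator']
```

Both suggestions are one edit away from the misspelling, while "transition" is three edits away and falls outside the cutoff.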

2- Machine Translation Evaluation and Post Editing

Edit distance can be used to compare a postedited file with the machine-translated output that was the starting point for the postediting. When you calculate the edit distance, you are calculating the “effort” that the posteditor made to raise the quality of the machine translation to a certain level. Starting from the same source content and the same MT output, a light postediting and a full postediting will produce different edit distances. The full postediting, which targets human-level quality, is expected to have the higher edit distance because it requires more changes. In this way, the edit distance lets you measure and compare light and full postediting.

Therefore, the edit distance is a kind of “word count” measure of effort, similar in a way to the word count used to quantify the work of translators throughout the localization industry. It also helps in evaluating the quality of an MT engine by comparing the raw MT output with the version postedited by a human translator.
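One way to make this comparison concrete is to compute the edit distance over words instead of characters and divide it by the segment length. The sketch below assumes whitespace tokenization and normalization by the longer of the two segments; both choices, along with the name `postedit_effort` and the sample sentences, are illustrative assumptions and not a standard industry metric.

```python
def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two sequences
    (strings or lists of tokens)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (item_a != item_b),  # substitution
            ))
        previous = current
    return previous[-1]


def postedit_effort(mt_output: str, postedited: str) -> float:
    """Share of words changed between raw MT and its postedited version:
    0.0 means untouched, values near 1.0 mean almost fully rewritten."""
    mt_tokens = mt_output.split()
    pe_tokens = postedited.split()
    longest = max(len(mt_tokens), len(pe_tokens))
    if longest == 0:
        return 0.0  # two empty segments: nothing to edit
    return edit_distance(mt_tokens, pe_tokens) / longest


# Made-up segments, for illustration only.
mt = "the cat sat in the mat"
light = "the cat sat on the mat"
full = "a cat was sitting on the mat"

print(round(postedit_effort(mt, light), 2))  # 0.17 (1 of 6 words changed)
print(round(postedit_effort(mt, full), 2))   # 0.57 (4 edits over 7 words)
```

Averaging such a score over all segments of a file gives a single figure that can be compared across postediting levels or across MT engines.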

3- Fuzzy Match

In translation memories, edit distance underlies fuzzy matching, the technique of finding strings that match a pattern approximately (rather than exactly). Translation memories provide suggestions to translators, and the fuzzy match level indicates how much effort is needed to turn a suggestion into the final translation.
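A fuzzy match score can be sketched as a percentage derived from the edit distance. The example below assumes a character-level distance, a score of 100 × (1 − distance / length of the longer segment), and a 75% threshold; commercial translation tools use their own formulas, so the names `fuzzy_score` and `best_match`, the threshold, and the sample memory are all hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance between a and b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def fuzzy_score(segment: str, tm_source: str) -> float:
    """Similarity in percent: 100 for identical strings, lower as more edits are needed."""
    longest = max(len(segment), len(tm_source))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(segment, tm_source) / longest)


def best_match(segment, memory, threshold=75.0):
    """Return (score, source, target) for the closest memory entry at or
    above the threshold, or None if nothing is similar enough."""
    scored = [(fuzzy_score(segment, src), src, tgt) for src, tgt in memory.items()]
    candidates = [entry for entry in scored if entry[0] >= threshold]
    return max(candidates, key=lambda entry: entry[0], default=None)


# Made-up translation memory (source segment -> stored translation).
memory = {
    "Click the Save button.": "Cliquez sur le bouton Enregistrer.",
    "Close the door.": "Fermez la porte.",
}

match = best_match("Click the Save button now.", memory)
if match is not None:
    score, source, target = match
    print(f"{score:.0f}% match: {source!r} -> {target!r}")
```

Here the new segment differs from the stored one by four inserted characters out of 26, so it is offered as a match of roughly 85%, and the translator only has to adapt the stored translation.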
