Translation memory: a computer-aided translation program. In essence it is a database (the “memory”) that stores translated sentences (translation units or segments) together with their respective source segments. For each new segment to be translated, the program scans the database for a previous source segment that matches the new segment exactly or approximately (a fuzzy match) and, if one is found, suggests the corresponding target segment as a possible translation. A translator can then accept, modify or reject the suggested translation.
Translation memory system: refers to a type of machine-aided human translation tool that stores previous translations and offers these translations when identical or similar sentences are encountered when translating new materials.
Similarity match: a type of matching scheme for the free-form queries in a computer-aided translation system. The queries are first passed through the system and the browser performs a similarity match between the internal representation of the queries and the internal representation of each sentence in the database. In this way, both surface similarity and structural similarities can be matched.
Source: A Dictionary of Translation Technology, Chan Sin-wai, The Chinese University Press, 2004
• A perfect or exact match occurs when a new source language segment is completely identical, including spelling, punctuation and inflections, to the old segment found in the database, that is, in the TM.
• Unlike a perfect match, a fuzzy match occurs when an old and a new source language segment are similar but not exactly identical. Even a very small difference such as punctuation leads to a fuzzy match.
As the degree of similarity between old source segments in the database or memory and new source text segments currently being translated may vary, an algorithm is used to calculate a percentage which expresses the degree of match. The higher the percentage of the fuzzy match, the closer the similarity between the two source language segments. The threshold percentage can be set by the user at a high level, for instance at 90%, to restrict the retrieval of old source language segments to those containing only small differences from the new source language segment. In contrast, the threshold can be set at a low level, for instance at 10%, to allow the translation memory to retrieve segments only weakly related to the new segment. Segments that mean the same thing but differ in format, such as dates, measurements, times and spellings, all fall in the fuzzy match category, although they are differently categorized. Some systems allow for the automatic processing of such changes. Polysemous and homonymous words, that is, homographs, always need careful handling and present a challenge.
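The effect of the threshold can be illustrated with a minimal sketch. It assumes Python's difflib.SequenceMatcher ratio as a stand-in similarity measure; real TM systems use their own scoring. The names fuzzy_score and retrieve, and the sample memory, are hypothetical.

```python
from difflib import SequenceMatcher

def fuzzy_score(new_segment, old_segment):
    """Similarity as a percentage (0-100). A stand-in metric, not any vendor's formula."""
    return 100.0 * SequenceMatcher(None, new_segment, old_segment).ratio()

def retrieve(new_segment, memory, threshold=90.0):
    """Return (score, source, target) for stored units at or above the threshold, best first."""
    hits = [(fuzzy_score(new_segment, src), src, tgt) for src, tgt in memory]
    return sorted((h for h in hits if h[0] >= threshold), reverse=True)

# Hypothetical memory of (source, target) pairs.
memory = [
    ("Press the red button.", "Appuyez sur le bouton rouge."),
    ("Close the cover before use.", "Fermez le couvercle avant utilisation."),
]

# A high threshold keeps only the near-identical segment;
# a low threshold also lets the weakly related one through.
strict = retrieve("Press the red button!", memory, threshold=90.0)
loose = retrieve("Press the red button!", memory, threshold=10.0)
```

With the high threshold only the segment that differs in punctuation is returned; with the low threshold the unrelated segment is retrieved as well.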
Segmentation is the process of breaking a text up into units consisting of a word or a string of words that is linguistically acceptable. Segmentation is needed in order for a TM to perform the matching (perfect and fuzzy) process. A pair of old source and target language texts is usually segmented into individual pairs of sentences. However, not all parts of texts, particularly specialist texts, are in a sentence format. Exceptions include headings, lists and bullet points. As a result, different units of segmentation are needed. A translator can decide the length of a segment but often punctuation is used as an indicator. A segment is then allocated a unique number or tag by the system. It is important to note that while segmentation is quite natural for Latin-based alphabets, it is rather alien to languages such as Chinese, Thai and Vietnamese, which are written continuously without any spaces between characters. Thus, other methods of segmentation are required to determine the beginning and ending of a segment in such cases. New segments can be added to the TM while translating, and alternatively previously translated source language texts and their translations can be entered into the memory through a process of text alignment.
Source: Translation and Technology, C.K. Quah, Palgrave Macmillan, 2006
Most simply, a TM can be viewed as a list of source text segments explicitly aligned with their target text counterparts. The resulting structure is sometimes referred to as a parallel corpus or a bitext. Translation units are stored in the TM database. Some sophisticated TM programs use a type of technology called a neural network to store information. A neural network allows information to be retrieved more quickly than a sequential search technique. The essential idea behind a TM system is that it allows a translator to reuse or recycle previously translated segments. Reusing a previous translation in a new text is sometimes referred to as “leveraging”.
How does a TM system work? This technology works by automatically comparing a new source text against a database of texts that have already been translated. When a translator has a new segment to translate, the TM system consults the database to see if this new segment corresponds to a previously translated segment. If a matching segment is found, the TM system presents the translator with the previous translation, and the translator decides whether or not to incorporate it into the new translation.
Segmentation: In most instances, the basic unit of segmentation is the sentence. However, not all text is written in sentence form. Headings, list items and table cells are familiar elements of text, but they may not strictly qualify as sentences. Therefore, many TM systems allow the user to define other units of segmentation in addition to sentences. These units can include sentence fragments or entire paragraphs. Deciding what constitutes a segment is not a trivial task. How can the TM system identify sentences? Punctuation marks such as periods, exclamation points, and question marks are typically used. Problematic cases are abbreviations, section headings, and embedded sentences. Some of these problems can be resolved by incorporating stop lists (e.g., lists of abbreviations that do not indicate the end of a sentence, such as “Mrs.” and “e.g.”) into the TM system. An additional issue is the fact that the segmentation units used in the source text may not correspond exactly to those used in the translation. This lack of one-to-one correspondence can create difficulties for automatic alignment programs.
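Punctuation-based segmentation with a stop list can be sketched as follows. This is a simplified illustration, not the rule set of any particular TM system: the abbreviation list is hypothetical and deliberately short, and headings, lists and embedded sentences are not handled.

```python
import re

# Hypothetical stop list: abbreviations whose final period does not end a sentence.
ABBREVIATIONS = {"Mrs.", "Mr.", "Dr.", "e.g.", "i.e.", "etc."}

def segment(text):
    """Split text at '.', '!' or '?' followed by whitespace, skipping listed abbreviations."""
    segments, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        last_word = candidate.split()[-1]
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, so keep scanning
        segments.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)  # whatever remains after the last boundary
    return segments

result = segment("Mrs. Smith called. Use a tool, e.g. a hammer! Is it done?")
```

Here the periods in "Mrs." and "e.g." are not treated as boundaries, so the text yields three segments.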
Matches: most TM systems present the user with a number of different types of segment matches. The most common types are exact, fuzzy, and term matches. Research is being done on full and sub-segment matches. Exact matches, also called perfect matches, are the most straightforward.
An exact match is 100% identical to the segment that the translator is currently translating, both linguistically and in terms of formatting. The process used by the TM system to identify perfectly matching segments is one of strict pattern matching. This means that the two strings must be identical in every way, including spelling, punctuation, inflection, numbers, and even formatting. Any segment in the new source text that does not match an original segment precisely will not produce an exact match. The translator is not forced to accept the translation proposed by the TM system. Even though a segment may be identical, translators are concerned with translating complete texts rather than isolated segments, so it is important to read the proposed translation in its new context to be sure that it is both stylistically appropriate and semantically correct.
Full matches occur when a new source segment differs from a stored TM unit only in terms of so-called variable elements, which are sometimes referred to as “placeables” or “named entities”. Variable elements include numbers, dates, times, currencies, measurements, and sometimes proper names. These elements typically require some kind of special treatment in a text. TM systems need to ignore variable elements for matching purposes.
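A full match can be sketched by masking variable elements before comparison. The sketch below handles numbers only and rests on two stated assumptions: the numbers occur in the same order in the source and the stored target, and no locale-specific reformatting (decimal separators, units) is needed. The names NUMBER, mask and full_match are hypothetical.

```python
import re

# Assumed pattern for variable elements: digits with optional decimal or thousands groups.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def mask(segment):
    """Replace every number with a placeholder so it is ignored during matching."""
    return NUMBER.sub("{NUM}", segment)

def full_match(new_segment, stored_source, stored_target):
    """If the segments differ only in numbers, return the stored target with the new numbers."""
    if mask(new_segment) != mask(stored_source):
        return None
    new_numbers = NUMBER.findall(new_segment)
    if len(NUMBER.findall(stored_target)) != len(new_numbers):
        return None  # cannot map the numbers one to one, so leave it to the translator
    replacements = iter(new_numbers)
    # Assumption: numbers appear in the same order in the source and the target.
    return NUMBER.sub(lambda m: next(replacements), stored_target)

suggestion = full_match(
    "Tighten 4 screws to 2.5 Nm.",
    "Tighten 6 screws to 3 Nm.",
    "Serrez 6 vis à 3 Nm.",
)
```

The two source segments are identical once the numbers are masked, so the stored translation is reused with the new values inserted.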
Fuzzy matches are approximate or partial matches. A fuzzy match retrieves a segment that is similar, but not identical, to the new source segment. Some TM systems use color coding to illustrate various types of differences between the new source text segment and the retrieved segment. The degree of similarity in a fuzzy match can range from 1% to 99%, and the user generally has the ability to set the sensitivity threshold to allow the TM system to locate previously translated segments that may differ only slightly from the new source text segment or segments that vary greatly. If the sensitivity threshold is set too high, there is a risk that the TM will produce “silence”: potentially useful partial matches will not be retrieved. However, if it is set too low, the system will produce “noise”: the suggested translations that are retrieved will be too different from the new source text segment and therefore not helpful. When the threshold is very low, a match may be made on the basis of very general words (“the”, “and”) and the overall content of the retrieved segment may contain little of value for helping the translator to translate the new segment. Many translators prefer to set the threshold somewhere between 60% and 70%. Although fuzzy matching can be useful, it requires careful proofreading and editing to ensure that the proposed translation is appropriate for inclusion in the new target text.
Term matches are found through the process of active terminology recognition, which essentially constitutes automatic dictionary lookup. If one or more terms are recognized as being in the term base, the TM system points to the appropriate term records and the translator can then make use of the relevant information contained there. This means that when no exact or fuzzy matches are found for source text segments, the translator might at least find some translation equivalents for individual terms in the term base.
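Active terminology recognition can be approximated as a lookup of each term-base entry in the segment. This sketch assumes a hypothetical term base held in a dictionary and matches whole words case-insensitively; it does no lemmatization, so inflected forms such as plurals would be missed.

```python
import re

# Hypothetical term base: source-language term -> target-language equivalent.
TERM_BASE = {
    "hard disk": "disque dur",
    "power supply": "bloc d'alimentation",
}

def recognize_terms(segment, term_base):
    """Return (term, equivalent) pairs for every term-base entry found in the segment."""
    found = []
    for term, equivalent in term_base.items():
        pattern = r"\b" + re.escape(term) + r"\b"  # whole-word match only
        if re.search(pattern, segment, flags=re.IGNORECASE):
            found.append((term, equivalent))
    return found

terms = recognize_terms(
    "Disconnect the power supply before removing the hard disk.", TERM_BASE
)
```

Even when the segment as a whole has no exact or fuzzy match, the translator is offered equivalents for the two recognized terms.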
Sub-segment matching falls partway between fuzzy and term matching. In fuzzy matching, the two segments must have a number of elements in common in order for a match to be established. In term matching, the new source segment is compared against entries in the term base. In the case of sub-segment matching, the elements that are compared are smaller chunks of segments. This means that a match can be retrieved between two small chunks of segments, even if the complete segments do not have a high degree of overall similarity. When both segments contain a chunk that is very similar indeed, there is a possibility that the translator may be able to reuse that chunk. Further refined, a combined full segment/sub-segment approach allows the TM system to automatically compare the new source text segment against the stored TM. It will begin by examining complete segments, first looking for exact matches and then for fuzzy matches, and if no such match is found at the segment level, it will compare increasingly smaller chunks in an effort to find a match. In this way, the translator may be presented with sub-segment matches originating from several different segments, even if none of those complete segments qualified as a fuzzy match.
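The combined full-segment/sub-segment approach can be sketched as a cascade: exact match first, then fuzzy match, then increasingly smaller chunks. The sketch assumes word n-grams as the "chunks" and difflib's ratio as the fuzzy measure; both are illustrative choices, and the function names and thresholds are hypothetical.

```python
from difflib import SequenceMatcher

def word_ngrams(segment, n):
    """All runs of n consecutive words in the segment, lowercased."""
    words = segment.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def lookup(new_segment, memory, threshold=0.7, min_chunk=3):
    """Try exact, then fuzzy, then shared word chunks. Returns (kind, results)."""
    for src, tgt in memory:
        if src == new_segment:
            return "exact", [(src, tgt)]
    fuzzy = [(src, tgt) for src, tgt in memory
             if SequenceMatcher(None, new_segment, src).ratio() >= threshold]
    if fuzzy:
        return "fuzzy", fuzzy
    # Sub-segment level: compare increasingly smaller chunks, longest first.
    for n in range(len(new_segment.split()), min_chunk - 1, -1):
        chunks = word_ngrams(new_segment, n)
        hits = [(chunk, src, tgt) for src, tgt in memory
                for chunk in chunks & word_ngrams(src, n)]
        if hits:
            return "sub-segment", hits
    return "none", []

memory = [
    ("Click the Start button to open the menu.",
     "Cliquez sur le bouton Démarrer pour ouvrir le menu."),
    ("Restart the computer.", "Redémarrez l'ordinateur."),
]
kind, hits = lookup("If the printer fails, click the Start button and wait.", memory)
```

In this example neither stored segment qualifies as an exact or fuzzy match, but the four-word chunk shared with the first segment is still retrieved.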
This strategy is similar to the approach used in example-based machine translation (EBMT). The principal difference between a TM as a support tool and a full-fledged EBMT system is basically a question of who has the primary responsibility for analysis of the segments and formulation of the target text. With a TM, that responsibility lies with the translator, whereas with EBMT, the computer is responsible for producing a complete draft of a target text, though this may still need to be post-edited by a human translator.
No matches: in which case the translator must translate from scratch. Another option is to use a machine translation system to translate the portions of the source text for which no match was found in the TM.
There are two main ways in which translations can be entered into the TM database: through interactive translation or through post-translation alignment. Interactive translation has the potential to produce a TM that is high in quality but initially low in volume, whereas post-translation alignment has the potential to produce a TM that is higher in volume but (possibly) lower in quality. It is entirely possible to build a TM using a combination of both.
Interactive translation is the most straightforward way for translators to construct a TM, adding translation units to the memory as they go along. Each time the translator translates a source text segment, the paired translation unit can be stored in the TM database. Once a segment has been translated and stored, it immediately becomes part of the TM. This means that if that segment, or a similar one, occurs again in the text, even in the very next sentence, the previous translation is suggested to the translator automatically. The translator then has the choice of accepting the previous translation or editing it if the context requires a change. Note that many TM systems can also be networked, which means that multiple translators can contribute to one TM, and the volume of data that it contains can be built more quickly. In a networked situation, it is possible to give different types of privileges to different users in order to exercise some form of quality control. For example, all users can be given permission to consult the TM, but the ability to add new TUs can be restricted to revisers or senior translators.
Working with an existing TM: there are two main methods – interactive mode and batch mode. A translator working in interactive mode proceeds to work through the new source text segment by segment, and the TM system attempts to match the segments stored in the database against the new source text segments. As each new segment is translated, the TU is immediately added to the TM and is available for reuse the next time an identical or similar segment is encountered. The second method is batch translation, sometimes referred to as pre-translation, which most TM systems also allow. Here a user can run a complete source text through the system, and whenever it finds an exact match, it will automatically replace the new source text segment with the translation that is stored in the TM. Segments for which no match is found must later be translated by either a human translator or a machine-translation system. In either case, the entire text must then be post-edited by a human translator to ensure that the replacements made by the system were correct. If the translator makes changes to any matches that were inserted automatically, these changes can subsequently be added to the TM to keep it up to date.
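Batch mode can be sketched in a few lines. The sketch assumes a hypothetical TM held as a dictionary from source segment to translation and covers exact matches only; the function names and the status labels are illustrative.

```python
def pretranslate(source_segments, memory):
    """Insert stored translations for exact matches and flag everything else as untranslated."""
    output = []
    for segment in source_segments:
        if segment in memory:
            output.append({"source": segment, "target": memory[segment], "status": "exact"})
        else:
            output.append({"source": segment, "target": None, "status": "untranslated"})
    return output

def update_memory(memory, reviewed_units):
    """After post-editing, store the reviewed pairs so the TM stays up to date."""
    for unit in reviewed_units:
        if unit["target"] is not None:
            memory[unit["source"]] = unit["target"]

memory = {"Open the file.": "Ouvrez le fichier."}
draft = pretranslate(["Open the file.", "Print the file."], memory)

# The unmatched segment is translated later, by a human or an MT system, then stored.
draft[1]["target"] = "Imprimez le fichier."
update_memory(memory, draft)
```

After the update, the newly translated segment would be found as an exact match the next time the same text is run through the system.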
TM systems are often integrated with other tools:
- With terminology-management systems – the TM system compares the source text segments against the previously translated segments stored in the TM database and at the same time, using a process known as active terminology recognition, the terminology-management system compares the individual terms contained in each source text segment against the terms contained in the term base. If the term is recognized as being in the term base, the translator’s attention is drawn to the fact that an entry exists for this term, and the translator can view the term record and then insert the term from the record directly into the target text.
- With bilingual concordancers – which allow the user to retrieve all instances of a specific search string and view these occurrences in their immediate context. This means that a translator can ask to see all the occurrences of any text fragment (not just a pre-defined segment) that appear anywhere in the TM, along with their translation equivalents. This allows the translator to quickly view the search string in context together with its translations, which may not always be the same.
- With machine translation systems – where a new source text is first compared against a TM, which will replace those segments for which exact matches are retrieved. The segments that are still untranslated can be fed into a machine translation system, which produces a draft translation. The entire document is then passed on to a human translator for post-editing. The final translation can be aligned with the original source text and stored in the TM database for future reuse.
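The bilingual concordancer described in the list above can be sketched as a simple substring search over the stored pairs. The sketch assumes a hypothetical memory of (source, target) tuples and a case-insensitive search; it returns whole segment pairs and does not try to locate the equivalent of the search string inside the target.

```python
def concordance(search_string, memory):
    """Return every (source, target) pair whose source segment contains the search string."""
    needle = search_string.lower()
    return [(src, tgt) for src, tgt in memory if needle in src.lower()]

memory = [
    ("Turn off the power.", "Coupez le courant."),
    ("Check the power cable.", "Vérifiez le câble d'alimentation."),
    ("Close the door.", "Fermez la porte."),
]
hits = concordance("power", memory)
```

The two hits show the same search string rendered differently in context, which is the kind of variation a concordance search is meant to expose.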
Source: Computer-Aided Translation Technology, Lynne Bowker, University of Ottawa Press, 2002
Most current commercial TM systems offer a quantitative evaluation of the match in the form of a score, often expressed as a percentage, and sometimes called a fuzzy match score or similar. How this “score” is arrived at can be quite complex, and is not usually made explicit in commercial systems, for proprietary reasons.
In all systems, matching is essentially based on character-string similarity, but many systems allow the user to indicate weightings for other factors, such as the source of the example, formatting differences, and even significance of certain words. The character-string similarity calculation uses the well-established concept of “sequence comparison”, also known as the “string-edit distance” because of its use in spell checkers, or more formally the “Levenshtein distance” after the Russian mathematician who introduced the measure. The string-edit distance is a measure of the minimum number of insertions, deletions and substitutions needed to change one sequence of letters into another. For example, using insertions and deletions only, changing “waiter” into “waitress” requires one deletion and three insertions; if a substitution is also counted as a single operation, three operations suffice. The measure can be adjusted to weight in favor of insertions, deletions or substitutions, or to favor contiguous deletions over non-contiguous ones. In fact, sequence comparison of this kind, which works on any sequences of symbols (characters, words, digits, etc.), has a huge number of applications, ranging from file comparison in computers, to speech recognition (sound waves represented as sequences of digits), comparison of genetic sequences such as DNA, and image processing. In fact, anything that can be digitized can be compared using Levenshtein distance.
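The string-edit distance can be computed with the standard dynamic-programming recurrence, sketched below with unit cost for each insertion, deletion and substitution. The conversion to a percentage is only one common normalization and is an assumption here, since commercial systems do not publish their formulas. Under unit costs the distance between “waiter” and “waitress” is 3; a script that uses only insertions and deletions (one deletion and three insertions) costs 4.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning sequence a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute x by y (free if they are equal)
            ))
        previous = current
    return previous[-1]

def match_percentage(a, b):
    """Assumed normalization: 100 * (1 - distance / length of the longer sequence)."""
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(a, b) / longest)

char_distance = levenshtein("waiter", "waitress")
# The same function works on any sequence of symbols, for example lists of words.
word_distance = levenshtein("the red button".split(), "the green button".split())
score = match_percentage("waiter", "waitress")
```

Because the function compares arbitrary sequences, a TM system could apply it at the character level, the word level, or both, and weight the operations differently.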
Source: “Translation Memory Systems”, Harold Somers, Computers and Translation, A translator’s guide, 2003