Tag: human-aided machine translation

How can word counts differ within the same tool on different machines? (2)

Have you ever run a word count with the same document on two different machines and received different word counts?

Well, here is what can have an impact on the word count statistics:

  • The use of a TM on one machine and no TM on the other can produce different word counts. A project with no TM uses the default counting settings, which may have been adjusted in the TM you actually use: for example, the setting that determines whether hyphenated words count as one word or two.
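The hyphen setting alone is enough to shift a count. Here is a minimal Python sketch of the idea; the function name and the whitespace-based tokenization are illustrative, not taken from any particular CAT tool:

```python
import re

def count_words(text, split_hyphens=False):
    """Count whitespace-separated tokens, optionally treating
    hyphenated words as multiple words."""
    if split_hyphens:
        text = text.replace("-", " ")
    return len(re.findall(r"\S+", text))

sentence = "A well-known state-of-the-art engine"

print(count_words(sentence))                      # 4: hyphenated words count as one
print(count_words(sentence, split_hyphens=True))  # 8: each hyphen part counts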

Read the rest here

Translation Automation with a Human Touch

Automation is advancing quickly in the translation industry, too. Translation management systems are becoming comprehensive service platforms with numerous functionalities to help your company reach the highest level of efficiency possible. But although there are almost always brilliant technological solutions available for every single problem or action, a human touch can sometimes make the difference.

Find out more here


Transitioning to a post-editing machine translation business model

When someone argues that MT engines produce poor results, the first thing I ask is when they last tested machine translation. Many in the industry are still basing their opinion on results from years ago, which are no longer valid. The reality is that machine translation is cheaper, faster, more secure, and of increasingly better quality. LSPs that are slow to adopt this technology will not be able to compete in this new market.

Read more about machine translation post-editing.

Here’s Why Neural Machine Translation is a Huge Leap Forward

When building rules-based machine translation systems, linguists and computer scientists joined forces to write thousands of rules for translating text from one language to another. This was good enough for monolingual reviewers to be able to get the general idea of important documents in an otherwise unmanageable body of content in a language they couldn’t read. But for the purposes of actually creating good translations, this approach has obvious flaws: it’s time-consuming and, naturally, results in low quality translations.

Read more here

Translation Memory and Survival

Written by: Eman Elbehiry

The brothers Homer and Langley Collyer were killed by “hoarding”, by storing things in their house, which led psychologists to file “hoarding” under the umbrella of psychological disorders. But when it comes to translation, sorry, psychology! You’re wrong this time. Why? Because in translation, “hoarding” is a big sign of perfectionism and maturity.

How many times have you looked at an outstanding translation of yours and wished you could use it again? How many times have you come across a sentence you would bet your life you had translated before, but could not remember in which file or which project? How many times have you translated similar texts and wished for something that could help you get the job done in half the time?
Though you were stuck at merely “wishing”, hoarding and storing this translation has found its way to you through translation memories (TMs).

While the CAT tool divides the whole text into segments, the translation memory is where all your translation is stored in units, exactly as it was saved: sentences, paragraphs, headlines, or titles. That means it stores each segment together with its language pair, so you can come back to it in time of need. As SDL’s description of translation memory puts it, “When a translator’s jobs regularly contain the same kinds of phrases and sentences, a translation memory will drastically increase the speed of translation.” This makes it an essential component of any CAT tool par excellence.

Later, when you summon this translation memory to re-use the “stored” translation, it starts suggesting translations to you. You can accept a suggestion as is, enhance it, modify it, or replace it with a better one; the translation memory is smart enough to keep updating itself with whatever you added and improved. If the translator accepts the exact suggested translation, the tool calls it an exact match, scored as 100%. We can see this crystal clear in texts that include a lot of repeated patterns. A partial overlap between a new segment and a segment saved in the TM, typically scored from 50% up to 99% depending on the degree of similarity, is called a “fuzzy match”.

From the above, translation memory is the best fit for texts that include repetitions or similarities, which makes it most suitable for technical and legal translation, with their specialized and frequently repeated expressions and vocabulary. Moreover, if you are working on one project in a team, it is very possible that each translator has his or her own distinctive expressions and vocabulary in mind; working with one shared translation memory keeps your documents coherent and cohesive, and everyone on the same track.
As a result, translation memory saves time and effort, which reduces the cost of long-term projects. It helps you deliver the best quality possible, and its uses are practically unlimited.

Genius, huh? Wondering how it works? Here is a hint:
The mechanism is called “edit distance”. Its role is to measure how dissimilar two entities (e.g. words, segments, units) are. In the case of a fuzzy match, it approximates how close two patterns are and suggests the closest stored translation to the translator, who has the power to accept or modify it, so that the memory keeps improving.
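Edit distance is easy to sketch in Python. Below is the classic Levenshtein distance turned into the kind of percentage a TM reports; real CAT tools weight words, tags, and formatting differently, so their scores will not match this toy formula exactly:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))  # dp[j] = distance from a[:i] to b[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # delete a[i-1]
                        dp[j - 1] + 1,                  # insert b[j-1]
                        prev + (a[i - 1] != b[j - 1]))  # substitute
            prev = cur
    return dp[n]

def fuzzy_match(new_segment, tm_segment):
    """Similarity score in percent, the way a TM might report it."""
    dist = edit_distance(new_segment, tm_segment)
    longest = max(len(new_segment), len(tm_segment)) or 1
    return round(100 * (1 - dist / longest))

print(fuzzy_match("Press the red button", "Press the red button"))   # 100: exact match
print(fuzzy_match("Press the red button", "Press the blue button"))  # 81: fuzzy match
```

The closer a new segment is to a stored one, the cheaper it is to edit the suggestion into the final translation, which is why TMs surface high-percentage matches first.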

The translation memory allows you to use it hundreds of times: include it in whatever project you like and keep updating it. But the question here is, “What if I have never used one before; will my past treasure go in vain?” The answer is NO!
You can actually “align” the segments of your past work to build a translation memory out of them; we will explain alignment in another article. Your TM will also be your own home-made dictionary for searching your previous work for any term or sentence. In addition, you can share your translation memory and use the ones shared with you, so that you have a solid, time-saving base. The reviewer can share their part with you so that everything stays updated and flawless; this is usually seen in online tools. And yes, there is a real movement of exchanging translation memories. “Sharing is caring”, right?!

And here emerges another question: what if I am using a specific tool while my peers or my reviewers are using another? How can I share my TM with them, and how do they share theirs?
For exactly this there is an exchange format: TMX (Translation Memory eXchange). It allows you to import and export translation memories among many different tools. Never easier!
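Under the hood a TMX file is just structured XML: translation units, each holding one segment per language. A rough sketch of that shape can be generated with Python’s standard library; the header values (“demo” and so on) are placeholders, not any real tool’s output:

```python
# Build a one-unit TMX document: <tmx> -> <header> + <body> -> <tu> -> <tuv>/<seg>.
import xml.etree.ElementTree as ET

tmx = ET.Element("tmx", version="1.4")
ET.SubElement(tmx, "header", {
    "creationtool": "demo", "creationtoolversion": "1.0",
    "segtype": "sentence", "o-tmf": "demo", "adminlang": "en",
    "srclang": "en", "datatype": "plaintext",
})
body = ET.SubElement(tmx, "body")
tu = ET.SubElement(body, "tu")  # one translation unit = one stored segment pair
for lang, text in [("en", "Press the red button."), ("es", "Pulse el botón rojo.")]:
    tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
    ET.SubElement(tuv, "seg").text = text

print(ET.tostring(tmx, encoding="unicode"))
```

Because every tool reads and writes this same unit structure, a memory built in one CAT tool can be imported into another with nothing lost but tool-specific metadata.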

According to Martín-Mor (2011), “the use of TM systems does have an effect on the quality of the translated texts, especially on novices, but experienced translators are able to avoid it.” And here we say to all translators, experienced or beginners: “hoarding” is not killing, and there is nothing better than hoarding and storing years of hard work, embracing experience. This storing and hoarding is exactly “what doesn’t kill you makes you stronger”.

Share with us a story in which you used a TM and it was super beneficial!

Nimdzi Language Technology Atlas

For this first version, Nimdzi has mapped over 400 different tools, and the list is growing quickly. The Atlas consists of an infographic accompanied by a curated spreadsheet with software listings for various translation and interpreting needs.

As the language industry becomes more technical and complex, there is a growing need for easy-to-understand materials explaining available tech options. The Nimdzi Language Technology Atlas provides a useful view into the relevant technologies available today.

Software users can quickly find alternatives for their current tools and evaluate market saturation in each segment at a glance. Software developers can identify competition and find opportunities in the market with underserved areas.

Reference: https://bit.ly/2ticEyT

Six takeaways from LocWorld 37 in Warsaw

Over the past weekend, Warsaw welcomed Localization World 37, which gathered over 380 language industry professionals. Here is what Nimdzi gathered from conversations at this premier industry conference.

1. A boom in data processing services

A new market has formed around preparing data to train machine learning algorithms. Among Lionbridge, Pactera, appen, and Welocalize, the leading LSPs that have staked a claim in this sector, revenue from these services already exceeds USD 100 million.

Pactera calls it “AI Enablement Services”, Lionbridge and Welocalize have labelled it “Global services for Machine Intelligence”, and appen prefers the title “data for machine learning enhanced by human touch”. What these companies really do is a variety of human tasks at scale:

  • Audio transcription
  • Proofreading
  • Annotation
  • Dialogue management

Humans help to train voice assistants and chatbots, image-recognition programs, and whatever else the Silicon Valley disruptors decide to unleash upon the world. One prime example came at the beginning of this year, when Lionbridge recorded thousands of children pronouncing scripted phrases for a child-voice recognition engine.

Machine learning and AI are the second biggest area for venture investment, according to dealroom.co. According to the International Data Corporation’s (IDC) forecast, spending is likely to grow almost fivefold in the next five years, from USD 12 billion in 2017 to USD 57.6 billion. Companies will need lots of accurate data to train their AI, hence there is a significant business opportunity in data cleaning. Compared to crowdsourcing platforms like Clickworker and Figure Eight, LSPs have broader human resource management competence and can compete for a large slice of the market.

2. LSP AI: Separating fact from fantasy

Artificial intelligence was a hot topic at #LocWorld 37, but apart from the advances in machine translation, nothing radically new was presented. If any LSPs use machine learning for linguist selection, ad-hoc workflow building, or predictive quality analytics, they didn’t show it.

On the other hand, everyone is chiming in on the new buzzword. In a virtual show of hands at the AI panel discussion, an overwhelming proportion of LSP representatives voted that they already use some AI in their business. That’s pure exaggeration, to put it mildly.

3. Introducing Game Global

Locworld’s Game Localization Roundtable expanded this year into a fully-fledged sister conference – the Game Global Forum. The two-day event gathered just over 100 people, including teams from King.com, Electronic Arts, Square Enix, Ubisoft, Wooga, Zenimax / Bethesda, Sony, SEGA, Bluehole and other gaming companies.

We spoke to participants on the buying side who believe the content to be very relevant, and vendors were happy with pricing – for roughly EUR 500, they were able to network with the world’s leading game localization buyers. This is much more affordable than the EUR 3,300+ price tag for the rival IQPC Game QA and Localization Conference.

Given the success of Game Global and the continued operation of the Brand2Global event, it’s fair to assume there is room for more industry-specific localization conferences.

4. TMS-buying rampage

Virtually every client company we’ve spoken to at Locworld is looking for a new translation management system. Some were looking for their first solution while others were migrating from heavy systems to more lightweight cloud-based solutions. This trend has been picked up by language technology companies which brought a record number of salespeople and unveiled new offerings.

Every buyer talked about the need for integration as well as end-to-end automation, and shared the “unless there is an integration, I won’t buy” sentiment. Both TMS providers and custom development companies such as Spartan Software are fully booked, churning out new connectors until the end of 2018.

5. Translation tech and LSPs gear up for media localization

Entrepreneurs following the news have noticed that all four of the year’s fastest organically-growing companies are in the business of media localization. Their success made ripples that reached the general language services crowd. LSP voiceover and subtitling studios are overloaded, and conventional CAT-tools will roll out media localization capabilities this year. MemoQ will have a subtitle editor with video preview, and a bigger set of features is planned to be released by GlobalLink.

These features will make it easier for traditional LSPs to hop on the departed train of media localization. However, LSP systems won’t compare to specialized software packages that are tailored to dubbing workflow, detecting and labeling individual characters who speak in videos, tagging images with metadata, and the like.

Reference: https://bit.ly/2JZpkSM

A Gentle Introduction to Neural Machine Translation

One of the earliest goals for computers was the automatic translation of text from one language to another.

Automatic or machine translation is perhaps one of the most challenging artificial intelligence tasks given the fluidity of human language. Classically, rule-based systems were used for this task, which were replaced in the 1990s with statistical methods. More recently, deep neural network models achieve state-of-the-art results in a field that is aptly named neural machine translation.

In this post, you will discover the challenge of machine translation and the effectiveness of neural machine translation models.

After reading this post, you will know:

  • Machine translation is challenging given the inherent ambiguity and flexibility of human language.
  • Statistical machine translation replaces classical rule-based systems with models that learn to translate from examples.
  • Neural machine translation models fit a single model rather than a pipeline of fine-tuned models and currently achieve state-of-the-art results.

Let’s get started.

What is Machine Translation?

Machine translation is the task of automatically converting source text in one language to text in another language.

In a machine translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language.

— Page 98, Deep Learning, 2016.

Given a sequence of text in a source language, there is no one single best translation of that text to another language. This is because of the natural ambiguity and flexibility of human language. This makes the challenge of automatic machine translation difficult, perhaps one of the most difficult in artificial intelligence:

The fact is that accurate translation requires background knowledge in order to resolve ambiguity and establish the content of the sentence.

— Page 21, Artificial Intelligence, A Modern Approach, 3rd Edition, 2009.

Classical machine translation methods often involve rules for converting text in the source language to the target language. The rules are often developed by linguists and may operate at the lexical, syntactic, or semantic level. This focus on rules gives the name to this area of study: Rule-based Machine Translation, or RBMT.

RBMT is characterized with the explicit use and manual creation of linguistically informed rules and representations.

— Page 133, Handbook of Natural Language Processing and Machine Translation, 2011.

The key limitations of the classical machine translation approaches are both the expertise required to develop the rules, and the vast number of rules and exceptions required.

What is Statistical Machine Translation?

Statistical machine translation, or SMT for short, is the use of statistical models that learn to translate text from a source language to a target language, given a large corpus of examples.

This task of using a statistical model can be stated formally as follows:

Given a sentence T in the target language, we seek the sentence S from which the translator produced T. We know that our chance of error is minimized by choosing that sentence S that is most probable given T. Thus, we wish to choose S so as to maximize Pr(S|T).

— A Statistical Approach to Machine Translation, 1990.

This formal specification makes explicit the maximization of the probability of the output sequence given the input sequence. It also makes explicit the notion of a suite of candidate translations and the need for a search process, or decoder, to select the single most likely translation from the model’s output probability distribution.
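Written out with Bayes’ rule, using the same symbols as the 1990 paper quoted above, the search the decoder performs is:

```latex
\hat{S}
  = \operatorname*{arg\,max}_{S} \Pr(S \mid T)
  = \operatorname*{arg\,max}_{S} \frac{\Pr(T \mid S)\,\Pr(S)}{\Pr(T)}
  = \operatorname*{arg\,max}_{S} \Pr(T \mid S)\,\Pr(S),
```

since $\Pr(T)$ is fixed for a given input sentence. The two remaining factors, the translation model $\Pr(T \mid S)$ and the language model $\Pr(S)$, are precisely the kind of separately tuned components that make up the SMT pipeline discussed below.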

Given a text in the source language, what is the most probable translation in the target language? […] how should one construct a statistical model that assigns high probabilities to “good” translations and low probabilities to “bad” translations?

— Page xiii, Syntax-based Statistical Machine Translation, 2017.

The approach is data-driven, requiring only a corpus of examples with both source and target language text. This means linguists are no longer required to specify the rules of translation.

This approach does not need a complex ontology of interlingua concepts, nor does it need handcrafted grammars of the source and target languages, nor a hand-labeled treebank. All it needs is data—sample translations from which a translation model can be learned.

— Page 909, Artificial Intelligence, A Modern Approach, 3rd Edition, 2009.

Quickly, the statistical approach to machine translation outperformed the classical rule-based methods to become the de-facto standard set of techniques.

Since the inception of the field at the end of the 1980s, the most popular models for statistical machine translation […] have been sequence-based. In these models, the basic units of translation are words or sequences of words […] These kinds of models are simple and effective, and they work well for many language pairs

— Syntax-based Statistical Machine Translation, 2017.

The most widely used techniques were phrase-based and focused on translating sub-sequences of the source text piecewise.

Statistical Machine Translation (SMT) has been the dominant translation paradigm for decades. Practical implementations of SMT are generally phrase-based systems (PBMT) which translate sequences of words or phrases where the lengths may differ

— Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016.

Although effective, statistical machine translation methods suffered from a narrow focus on the phrases being translated, losing the broader nature of the target text. The hard focus on data-driven approaches also meant that methods may have ignored important syntax distinctions known by linguists. Finally, the statistical approaches required careful tuning of each module in the translation pipeline.

What is Neural Machine Translation?

Neural machine translation, or NMT for short, is the use of neural network models to learn a statistical model for machine translation.

The key benefit of the approach is that a single system can be trained directly on source and target text, no longer requiring the pipeline of specialized systems used in statistical machine translation.

Unlike the traditional phrase-based translation system which consists of many small sub-components that are tuned separately, neural machine translation attempts to build and train a single, large neural network that reads a sentence and outputs a correct translation.

— Neural Machine Translation by Jointly Learning to Align and Translate, 2014.

As such, neural machine translation systems are said to be end-to-end systems as only one model is required for the translation.

The strength of NMT lies in its ability to learn directly, in an end-to-end fashion, the mapping from input text to associated output text.

— Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016.

Encoder-Decoder Model

Multilayer Perceptron neural network models can be used for machine translation, although the models are limited by a fixed-length input sequence where the output must be the same length.

These early models have been greatly improved upon recently through the use of recurrent neural networks organized into an encoder-decoder architecture that allow for variable length input and output sequences.

An encoder neural network reads and encodes a source sentence into a fixed-length vector. A decoder then outputs a translation from the encoded vector. The whole encoder–decoder system, which consists of the encoder and the decoder for a language pair, is jointly trained to maximize the probability of a correct translation given a source sentence.

— Neural Machine Translation by Jointly Learning to Align and Translate, 2014.

Key to the encoder-decoder architecture is the ability of the model to encode the source text into an internal fixed-length representation called the context vector. Interestingly, once encoded, different decoding systems could be used, in principle, to translate the context into different languages.

… one model first reads the input sequence and emits a data structure that summarizes the input sequence. We call this summary the “context” C. […] A second model, usually an RNN, then reads the context C and generates a sentence in the target language.

— Page 461, Deep Learning, 2016.
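The data flow is easy to see in a toy Python sketch. This is not a trained system: the embeddings are random and the function names (`encode`, `decode_step`) are made up for illustration. It only shows how a variable-length sentence becomes one fixed-length context vector that the decoder alone consumes:

```python
import random

random.seed(0)
D = 8  # dimensionality of the fixed-length context vector

# Random, untrained embeddings -- the point is the shapes, not the values.
src_vocab = ["the", "cat", "sat", "down"]
src_embed = {w: [random.gauss(0, 1) for _ in range(D)] for w in src_vocab}
tgt_vocab = ["le", "chat"]
tgt_embed = {w: [random.gauss(0, 1) for _ in range(D)] for w in tgt_vocab}

def encode(tokens):
    """Encoder: variable-length input -> one fixed-length context vector.
    Mean pooling stands in for a real RNN/Transformer encoder."""
    return [sum(src_embed[t][i] for t in tokens) / len(tokens) for i in range(D)]

def decode_step(context):
    """One greedy decoder step: score target words against the context only."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return max(tgt_vocab, key=lambda w: dot(tgt_embed[w], context))

c1 = encode(["the", "cat", "sat"])
c2 = encode(["the", "cat", "sat", "down"])
print(len(c1), len(c2))  # 8 8 -- same size whatever the sentence length
```

Because every sentence, however long, must squeeze through those D numbers, long inputs lose detail; that is the limitation that motivates attention, discussed next.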

Encoder-Decoders with Attention

Although effective, the Encoder-Decoder architecture has problems with long sequences of text to be translated.

The problem stems from the fixed-length internal representation that must be used to decode each word in the output sequence.

The solution is the use of an attention mechanism that allows the model to learn where to place attention on the input sequence as each word of the output sequence is decoded.

Using a fixed-sized representation to capture all the semantic details of a very long sentence […] is very difficult. […] A more efficient approach, however, is to read the whole sentence or paragraph […], then to produce the translated words one at a time, each time focusing on a different part of the input sentence to gather the semantic details required to produce the next output word.

— Page 462, Deep Learning, 2016.

The encoder-decoder recurrent neural network architecture with attention is currently the state-of-the-art on some benchmark problems for machine translation. This architecture is used at the heart of the Google Neural Machine Translation system, or GNMT, which powers the Google Translate service.

… current state-of-the-art machine translation systems are powered by models that employ attention.

— Page 209, Neural Network Methods in Natural Language Processing, 2017.

Although effective, neural machine translation systems still suffer from some issues, such as scaling to larger vocabularies of words and the slow speed of training the models. These are the current areas of focus for large production neural translation systems, such as the Google system.

Three inherent weaknesses of Neural Machine Translation […]: its slower training and inference speed, ineffectiveness in dealing with rare words, and sometimes failure to translate all words in the source sentence.

— Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016.

Reference: https://bit.ly/2Cx8zxI