“Okay Google”! How Do You Interpret Me?



The Sphinx offers you a riddle to solve before you continue reading:

A blind Chinese man's house is on fire, and he goes to his deaf and mute American neighbor to ask for help. How can the blind man tell the mute neighbor about the catastrophe if the two don't speak each other's languages?

New riddles appear every day, and technology keeps offering fascinating solutions to them, especially in the field of machine translation and transcription.

Transcription was the first phase of Google's progress in machine translation: it takes what you "say" and displays it as typed text on the screen. That makes it a good solution to the Sphinx's riddle.

The blind man speaks into a transcription machine, which converts his speech into translated, typed text that the mute neighbor can read in no time.

But the Sphinx isn't technologically advanced. Well, at least we are!

In "Sequence-to-Sequence Models Can Directly Translate Foreign Speech", a paper published on arXiv (the preprint server run by Cornell University), the researchers describe a powerful new method of translation: an encoder-decoder deep neural network that directly translates speech in one language into text in another. The model is trained end to end and does not need transcripts of the source-language speech as supervision.

In other words, the researchers skip the step in which an automatic speech recognition engine first transcribes the source-language audio, and translate directly from the audio instead.

Are we there yet?

Not yet. At the 17th Machine Translation Summit (MT Summit XVII), held in Dublin in 2019, researchers presented work on translating audio directly into audio. Instead of a pipeline that takes your speech, transcribes it, translates the transcript, and then reads the translation aloud in a synthesized voice, the whole conversion happens in one more direct step.
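The difference between the older pipeline and the direct approach is easiest to see in the shape of the data at each step. Below is a minimal Python sketch under stated assumptions: the `recognize`, `translate`, `synthesize` and `speech_to_speech` callables are hypothetical placeholders, not a real Google or research API, and the toy versions only mimic each stage's inputs and outputs so the data flow is visible.

```python
from typing import Callable, List

Audio = List[float]  # a waveform as a list of amplitude samples


def cascade(
    audio: Audio,
    recognize: Callable[[Audio], str],   # speech -> source-language text
    translate: Callable[[str], str],     # source text -> target-language text
    synthesize: Callable[[str], Audio],  # target text -> speech
) -> Audio:
    """Classic pipeline: three separate models, each fed the previous output."""
    transcript = recognize(audio)
    translation = translate(transcript)
    return synthesize(translation)


def direct(audio: Audio, speech_to_speech: Callable[[Audio], Audio]) -> Audio:
    """End-to-end: a single model maps source audio straight to target audio."""
    return speech_to_speech(audio)


# Hypothetical toy stand-ins; they do no real recognition or translation.
def toy_recognize(audio: Audio) -> str:
    return "ni hao"


def toy_translate(text: str) -> str:
    return {"ni hao": "hello"}.get(text, text)


def toy_synthesize(text: str) -> Audio:
    return [float(len(text))]  # placeholder "waveform"


def toy_speech_to_speech(audio: Audio) -> Audio:
    return [5.0]  # pretends to do all three steps at once


print(cascade([0.1, -0.2], toy_recognize, toy_translate, toy_synthesize))  # [5.0]
print(direct([0.1, -0.2], toy_speech_to_speech))  # [5.0]
```

The cascade produces two pieces of text along the way (a transcript and its translation); the direct version never does, which is the point of the audio-to-audio research.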

The proposals went even further, to include personalization and gender. So when a man speaks Chinese, his words will be interpreted into English in that same man's voice, and the same applies when the speaker is a woman. No more machine-like voices, only the speaker's own natural human voice.

But how does the machine understand voice AND produce an interpretation of it?

I Am Cortana, What Can I Do Now?

A good question was asked on Quora: how does a computer understand a language when its own language is only 0s and 1s? One answer also helps explain how Google and other voice recognition systems respond to us:

“Sound is made up of information, just like words, or numbers. Sound is analog in nature, and is represented by frequency (pitch) and amplitude (loudness) over time. More complex sound, i.e. songs, contain multiple frequencies being played at the same time.”
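To make the quoted description concrete, here is a small sketch using NumPy (an assumption; the article names no tools). It builds one second of a "complex" sound from two tones, 440 Hz and 660 Hz at different loudness levels, then uses a Fourier transform to recover which frequencies are present and how loud each one is. The sample rate and the tone choices are arbitrary.

```python
import numpy as np

SAMPLE_RATE = 8000                 # samples per second (arbitrary choice)
N = SAMPLE_RATE                    # one second of audio
t = np.arange(N) / SAMPLE_RATE     # the instant at which each sample is taken

# A "complex" sound: a loud 440 Hz tone plus a quieter 660 Hz tone
signal = 1.0 * np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)

# The FFT turns amplitude-over-time into amplitude-per-frequency;
# dividing by N/2 rescales each peak to the tone's original amplitude
spectrum = np.abs(np.fft.rfft(signal)) / (N / 2)
bin_hz = SAMPLE_RATE / N           # frequency step between bins: 1.0 Hz here

strongest = np.argsort(spectrum)[-2:]   # indices of the two largest peaks
print(sorted(float(i * bin_hz) for i in strongest))                     # [440.0, 660.0]
print(round(float(spectrum[440]), 3), round(float(spectrum[660]), 3))   # 1.0 0.5
```

Because both tones complete a whole number of cycles in the one-second window, each lands exactly on a single frequency bin; real recordings are messier.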

Sound in this form, amplitude changing over time, is what we feed a neural network when we train it. The audio is broken into tiny measurements called "samples", each far shorter than a click; speech systems commonly take 16,000 samples per second. Put all of these samples back together in order and you get the full audio. That's the usual way of doing it.
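The sampling idea can be sketched in plain Python. This is a minimal illustration, not how any particular product works: the 16,000-samples-per-second rate and the 25 ms chunk size are assumed values, and `sample_tone` and `split_into_frames` are hypothetical helper names. The sketch samples a pure tone, cuts the stream into short chunks, and checks that joining the chunks in order gives back the original audio.

```python
import math

SAMPLE_RATE = 16_000   # samples per second, a common rate for speech (assumed)
FRAME_MS = 25          # chunk length in milliseconds (assumed)


def sample_tone(freq_hz: float, seconds: float, rate: int = SAMPLE_RATE) -> list[float]:
    """Measure a pure tone's amplitude at `rate` evenly spaced instants per second."""
    n = round(seconds * rate)
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]


def split_into_frames(samples: list[float], frame_len: int) -> list[list[float]]:
    """Cut the sample stream into consecutive, non-overlapping chunks."""
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]


audio = sample_tone(440, seconds=0.1)        # 0.1 s of a 440 Hz tone
frame_len = SAMPLE_RATE * FRAME_MS // 1000   # 400 samples per 25 ms chunk
frames = split_into_frames(audio, frame_len)

print(len(audio))    # 1600
print(len(frames))   # 4

# Joining the chunks back together in order recovers the original audio exactly
rejoined = [s for frame in frames for s in frame]
print(rejoined == audio)  # True
```

Real speech systems often use overlapping windows (for example, 25 ms chunks taken every 10 ms); the chunks here don't overlap so that reassembly is a simple concatenation.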


The rising trend now is to skip the intermediate language altogether. For example, suppose one machine is trained to interpret Chinese into English, and another is trained to interpret English into Arabic. The interpretation machine of the future will take in Chinese... and VOILA, Arabic COMES OUT directly, in one move and in no time.
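A toy Python sketch can contrast the two routes. It is deliberately simplistic and is not how neural zero-shot translation actually works: the word tables `ZH_EN` and `EN_AR` are made-up two-word examples, and the direct Chinese-to-Arabic table is simply derived from them. It only shows the difference between going through English in two hops and getting Arabic out in one.

```python
# Made-up word tables for illustration; real systems are neural, not lookups
ZH_EN = {"你好": "hello", "朋友": "friend"}
EN_AR = {"hello": "مرحبا", "friend": "صديق"}


def lookup(words: list[str], table: dict[str, str]) -> list[str]:
    """Word-by-word lookup; unknown words pass through unchanged."""
    return [table.get(w, w) for w in words]


def translate_pivot(words: list[str]) -> list[str]:
    """Two hops: Chinese -> English, then English -> Arabic."""
    return lookup(lookup(words, ZH_EN), EN_AR)


# One Chinese -> Arabic table derived from the two above,
# so Arabic comes out in a single step with no English in between
ZH_AR = {zh: EN_AR[en] for zh, en in ZH_EN.items() if en in EN_AR}


def translate_direct(words: list[str]) -> list[str]:
    """One hop: Chinese -> Arabic."""
    return lookup(words, ZH_AR)


sentence = ["你好", "朋友"]
print(translate_pivot(sentence))
print(translate_direct(sentence))  # same Arabic words, one step instead of two
```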

A final apology to Shakespeare, who once believed that "Tomorrow will never come", because technology is finally here.

Try Google's interpretation and other good interpretation apps, and share your experience with us.

