Statistical analysis beats algorithms for language translation, Google says, but where will the massive volumes of linguistic data it needs come from?
Google Translate is currently best known for being a quick and dirty way to render Web pages or short text snippets in another language. But according to Der Spiegel, the next step for the core technology behind that service is a device that amounts to the universal translator from “Star Trek.”
Google isn’t alone, either. Apparently everyone from Facebook to Microsoft is ramping up similar ambitions: to create services that eradicate language barriers as we currently know them. A realistic goal or still science fiction? And at what cost?
Machine translation has been around in one form or another for decades, but has always lagged far behind translations produced by human hands. Much of the software written to perform machine translation involved defining different languages’ grammars and dictionaries, a difficult and inflexible process.
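To see why the old approach was so brittle, consider a toy sketch of a rule-based translator (purely illustrative, not any real system): a hand-written dictionary plus a hand-written reordering rule. Every new construction a language throws at it requires yet another rule, which is why these systems were so difficult to build and maintain.

```python
# Toy rule-based translator: hypothetical English-to-French dictionary
# plus one hard-coded grammar rule. Illustrative only.
EN_FR = {"the": "la", "blue": "bleue", "house": "maison"}

def translate(sentence):
    # Word-for-word dictionary lookup (unknown words pass through).
    words = [EN_FR.get(w, w) for w in sentence.split()]
    # Hand-coded grammar rule: French adjectives usually follow the noun.
    # Note the rule is hard-wired per adjective -- the inflexibility in action.
    for i in range(len(words) - 1):
        if words[i] == "bleue":
            words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

translate("the blue house")  # -> "la maison bleue"
```

Each exception (irregular verbs, idioms, gender agreement) demands more hand-written rules, and none of the work transfers to the next language pair.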
Google’s approach, under the guidance of engineer Franz Och, was to replace all that with a purely statistical approach. Looking at masses of data in parallel — for instance, the English and French translations of various public-domain texts — produced far better translations than the old algorithm-driven method. The bigger the corpus, or body of parallel texts, the better the results. (The imploding costs of storage and processing power over the last couple of decades have also helped.)
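The core statistical idea can be sketched in a few lines (a deliberately crude illustration, not Google's actual system): count how often each English word co-occurs with each French word across aligned sentence pairs, and score candidates by how exclusively they co-occur. The toy corpus below is invented for the example.

```python
from collections import defaultdict

# Tiny hypothetical parallel corpus of aligned English/French sentences.
parallel = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("the blue house", "la maison bleue"),
]

# Count co-occurrences of each English word with each French word.
cooc = defaultdict(lambda: defaultdict(int))
fr_totals = defaultdict(int)
for en, fr in parallel:
    for f in fr.split():
        fr_totals[f] += 1
    for e in en.split():
        for f in fr.split():
            cooc[e][f] += 1

def best_translation(word):
    """Pick the French word that co-occurs most exclusively with `word`."""
    cands = cooc.get(word)
    if not cands:
        return None
    # Normalizing by the French word's total frequency stops common
    # function words like "la" from winning every time.
    return max(cands, key=lambda f: cands[f] / fr_totals[f])

best_translation("house")  # -> "maison"
best_translation("car")    # -> "voiture"
```

With three sentences the counts are barely meaningful; with billions of sentence pairs the same statistics become remarkably sharp, which is why corpus size matters so much.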
If Google’s plan is to build its own technology from scratch, Facebook’s strategy appears to be to acquire it. Back in August, Facebook picked up language translation software company Mobile Technologies, a move a Facebook director of product management described as “an investment in our long-term product roadmap.” Among Mobile Technologies’ products is Jibbigo, an app that translates speech.
From these projects alone, it’s easy to see a common element: the backing of a company with tons of real-world linguistic data at its disposal. Google and Microsoft both have search engines that harvest the Web in real time; Facebook has more than a billion users chatting away. All of this constitutes a massive data trove that can be mined to build a translation corpus.
The big unanswered question so far: If Google, Facebook, Microsoft, and the rest plan on using real-time conversations to generate a corpus for translations, will any of that data be anonymized? Is that even possible? An opt-in program that lets people volunteer their conversations for the corpus seems like the best approach. But given these companies’ track records, isn’t it more likely they’ll simply roll such harvesting into a terms-of-service agreement?