Microsoft announced on its blog today that it will showcase a breakthrough in speech recognition technology at the upcoming Interspeech 2011 event. While Microsoft is staying relatively quiet about the breakthrough itself, it did provide some background on the technology that led to it.
Speech recognition technology has been in development for decades, but it has been a very long and slow process. Due to the dynamic patterns of human language, combined with the millions of variations of speech by individuals, it has been hard to transform these audible sounds into information that a computer can process accurately.
That is not to say we haven’t come a long way. Current commercially available technology is behind applications such as voice-to-text software and automated phone services. Accuracy is the main goal of this software, and voice-to-text typically achieves this by having the user “train” the software during setup and by adapting more closely to the user’s speech patterns over time. Fundamentally, this technology works by breaking speech down into fragments, commonly called “phonemes”, which are the building blocks of spoken English (there are about 30 or so). More state-of-the-art speech recognizers use even shorter fragments, known as “senones”, that make up these phonemes.
But this technology has its flaws. Because many phonemes and senones sound very similar, the training period can be very annoying as the machine attempts to “learn” your speech patterns. Also, voice-to-text technologies pretty much limit themselves to one person per device. To complicate matters further, spoken commands must be phrased in a certain order for the device to recognize them.
However, this new breakthrough technology does not require the user to “train” the system at all, but instead involves “real-time, speaker-independent, automatic speech recognition.” To put it another way: true recognition of human speech. The nuts and bolts of this technology are a bit esoteric for most people, but they involve artificial neural networks (ANNs), and in particular “deep” neural networks (DNNs).
“Others have tried context-dependent ANN models,” Dong Yu of Microsoft Research observes, “using different architectural approaches that did not perform as well. It was an amazing moment when we suddenly saw a big jump in accuracy when working on voice-based Internet search. We realized that by modeling senones directly using DNNs, we had managed to outperform state-of-the-art conventional CD-GMM-HMM large-vocabulary speech-recognition systems by a relative error reduction of more than 16 percent. This is extremely significant when you consider that speech recognition has been an active research area for more than five decades.”
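A quick note on that “relative error reduction” figure: it compares the errors the new system makes against the errors the old one made, rather than simply subtracting one percentage from the other. Here is a minimal Python sketch of the arithmetic. The function name and both error rates are hypothetical, picked purely for illustration; they are not Microsoft’s reported numbers.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the fraction of the baseline's errors that the new system removes."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates for illustration only (not Microsoft's figures).
baseline = 0.30  # assumed: the conventional system gets 30% of words wrong
improved = 0.25  # assumed: the new system gets 25% of words wrong

reduction = relative_error_reduction(baseline, improved)
print(f"Relative error reduction: {reduction:.1%}")
```

With these made-up numbers, the absolute gap is only five percentage points, but the new system makes about one sixth fewer mistakes than the old one, which is why a relative figure in the teens can still be a big deal.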
Essentially, this means we are a major step closer to fully responsive voice interaction with computers. It’s hard not to jump to thoughts of AI being a very real and foreseeable part of our future. With advances in robotics keeping in step with voice recognition software, it seems inevitable that these two worlds will collide. But that’s just my
For more information about their research, you can click on the above link to Microsoft’s blog article.