Microsoft’s experimental speech recognition software has achieved a technological breakthrough: it recognises the words in a conversation as accurately as a person does.
“Even five years ago, I wouldn’t have thought we could have achieved this. I just wouldn’t have thought it would be possible,” said Harry Shum, executive vice president, Microsoft Artificial Intelligence and Research group.
In a paper published Monday, a team of researchers and engineers in Microsoft Artificial Intelligence and Research reported a speech recognition system that makes the same or fewer errors than humans.
The researchers reported a word error rate (WER) of 5.9%, down from the 6.3% WER the team reported just last month.
The breakthrough will have wide-ranging effects on consumer and business products that can be significantly improved by speech recognition. Microsoft plans to use the technology in its consumer entertainment devices such as the Xbox, in accessibility tools such as instant speech-to-text transcription, and in its personal digital assistant, Cortana.
“We’ve reached human parity. This is an historic achievement,” said Xuedong Huang, Microsoft’s chief speech scientist, in a statement.
Parity, however, does not mean perfection. It means that the error rate, the rate at which the computer misheard a word, is the same as that of a person hearing the same conversation.
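As a concrete illustration of the metric, word error rate is conventionally computed as the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the recogniser’s output, divided by the number of reference words. Below is a minimal sketch of that standard calculation, not Microsoft’s own implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference word count,
    found with a Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One misheard word in a ten-word sentence gives a 10% WER:
print(word_error_rate("the quick brown fox jumps over the lazy dog today",
                      "the quick brown fox jumps over a lazy dog today"))  # 0.1
```

On this scale, Microsoft’s reported 5.9% corresponds to roughly one word in seventeen being misrecognised.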
The neural language model used in the software can capture relationships between words, not just their sounds. In practice, this means the speech recognition system can recognise and search for synonyms.
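The idea of modelling word relationships is often illustrated with word vectors, where related words sit close together and similarity is measured by cosine distance. The sketch below uses hand-picked toy vectors purely for illustration; a real language model learns such representations from large amounts of text:

```python
import math

# Toy word vectors, hand-picked for illustration only (a trained model
# would learn these from data). Related words get nearby vectors.
vectors = {
    "fast":  [0.9, 0.1, 0.0],
    "quick": [0.85, 0.15, 0.05],
    "slow":  [-0.8, 0.2, 0.1],
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, negative for opposed ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Near-synonyms score higher than unrelated or opposite words:
print(cosine(vectors["fast"], vectors["quick"]) > cosine(vectors["fast"], vectors["slow"]))  # True
```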
“To reach the human parity milestone, the team used Microsoft’s Computational Network Toolkit, a homegrown system for deep learning that the research team has made available on GitHub via an open source license,” said a Microsoft blog post published on Tuesday.
Despite major progress in vision and speech recognition in recent years, the researchers caution that there is still much work to be done.
Microsoft did not announce when the software used in the study will make its way into commercial products.
Google, Apple, and others have recently been publicising their own efforts to teach neural networks how to recognise speech and other sound patterns.
In May, Google Inc. unveiled its Magenta project, which applies neural networks to composing music.
In August, researchers at Stanford University ran an experiment with text messages in which a group of 32 texters (aged 19-32), typing on an Apple iPhone, competed against the Chinese tech giant Baidu’s speech recognition software. Baidu’s software was not only three times faster than the human typists, it was also more accurate.