DeepMind Technologies Limited, a British AI company that Google acquired in 2014, has made a breakthrough in machine-generated speech. According to the company’s blog post, the new technology reduces the gap between existing systems and human speech by over 50%. The new system is called WaveNet, and the post describes it as “a deep generative model of raw audio waveforms.”
People increasingly use speech to communicate with devices ranging from smartphones to cars. If you have used one, you will have noticed that the voice does not sound quite human. This is because many computer-generated speech programs build a complete sentence by stitching together short recordings of a human speaker. Even though the resulting sentences are understandable, they often feel unnatural.
Blind tests were conducted with listeners for US English and Mandarin Chinese, and WaveNet-generated speech was found to sound more natural than Google’s existing text-to-speech programs. However, there will be no applications that you can download to use WaveNet anytime soon, as it requires too much computational power. WaveNet works on the raw audio sample by sample, and for each sample it generates, it has to make a prediction of what the sound wave should look like.
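To give a feel for why sample-by-sample generation is so costly, here is a minimal Python sketch of the loop. It is not DeepMind's code: `toy_next_sample_distribution` is a hypothetical stand-in for the trained network, and the 256 amplitude levels are an assumption made for illustration. The point is only that every new sample requires one full prediction that depends on everything generated so far.

```python
import math
import random

NUM_LEVELS = 256  # assumption: each audio sample is one of 256 amplitude levels


def toy_next_sample_distribution(history, num_levels=NUM_LEVELS):
    """Hypothetical stand-in for the trained model.

    Returns one probability per amplitude level, favouring levels
    close to the most recent sample so the toy waveform stays smooth.
    """
    previous = history[-1] if history else num_levels // 2
    weights = [math.exp(-abs(level - previous) / 4.0) for level in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, seed=0):
    """Generate audio one sample at a time, feeding each result back in."""
    rng = random.Random(seed)
    audio = []
    for _ in range(num_samples):
        # One prediction per output sample: this is the expensive step.
        probs = toy_next_sample_distribution(audio)
        next_sample = rng.choices(range(NUM_LEVELS), weights=probs, k=1)[0]
        audio.append(next_sample)
    return audio


if __name__ == "__main__":
    print(generate(16))
```

Because each sample depends on the ones before it, the loop cannot simply be run in parallel, and a single second of audio at thousands of samples per second means thousands of predictions.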
The blog post described several differences between Google’s TTS system and WaveNet and included audio examples of both. Having listened to both, we found that WaveNet did sound more natural than the existing system. The post also mentioned that the new system can distinguish between different types of voices, such as male and female. It can also generate the sounds of breathing and mouth movements.
DeepMind also stated that the system is more expensive than any of the existing ones, but that it helps create more natural-sounding audio.