Devices such as Google Home, the Amazon Echo, Android Auto, and the new Pixel phones suggest that, in the not-too-distant future, practically everything will be done by voice.

For that to be possible, however, new virtual assistants must not only answer user questions correctly but also have a pleasant, human-sounding voice.

Google finally seems to have succeeded with a new artificial intelligence speech generation system called Tacotron 2, which in the demo audio clips is virtually indistinguishable from a human voice.

What is Tacotron 2 about?

Google has published a new research paper (not yet peer-reviewed) titled “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions”, which details how a new text-to-speech system works and reports remarkably natural-sounding results.

The new system, called Tacotron 2, uses two neural networks:

  1. The first converts the text into a mel spectrogram, a visual representation of audio frequencies over time.
  2. The second, called WaveNet, reads that spectrogram and generates the corresponding audio waveform.
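To make the idea of a spectrogram more concrete, here is a minimal Python sketch that turns a short audio signal into a grid of "frequency strength over time". It is only an illustration and not the code used by Tacotron 2: the function name `spectrogram`, the frame size, the hop size, the sample rate and the test tone are all assumptions chosen for the example, and it computes a plain linear-frequency spectrogram rather than the mel-scaled one named in the paper's title.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Split the signal into overlapping frames and return, for each frame,
    the magnitude of every frequency bin (a plain DFT, no external libraries)."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window softens the frame edges to reduce spectral leakage.
        windowed = [
            s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        magnitudes = []
        # Only the first half of the bins is needed for a real-valued signal.
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# Hypothetical input: a pure 1000 Hz tone sampled at 8000 Hz.
SAMPLE_RATE = 8000
TONE_HZ = 1000
FRAME_SIZE = 64
signal = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(512)]

spec = spectrogram(signal, frame_size=FRAME_SIZE, hop=32)

# Find the strongest frequency bin in the first frame and convert it to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(len(spec), "frames x", len(spec[0]), "bins; strongest frequency:", peak_hz, "Hz")
```

Each row of `spec` is one moment in time and each column is one frequency band, which is why a spectrogram can be drawn as an image. In the system described above, the first network predicts a grid of this kind from text, and the second network turns it back into sound.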

WaveNet was developed by DeepMind, the artificial intelligence research laboratory owned by Alphabet, Google's parent company. Since its launch in 2016, it has been used to generate the voice of Google's virtual assistant, Google Assistant.

The results: a voice indistinguishable from a human's

Alongside the research paper, Google has launched a website where you can listen to audio samples of its new system pronouncing really complex sentences.

Finally, the website includes a section entitled "Tacotron 2 or human?", with pairs of audio clips in which the artificial intelligence system and a person pronounce the same phrase.

The goal is to let you compare the Tacotron 2 voice with a human voice without knowing in advance which is which, and they really are indistinguishable.

Website with Tacotron 2 examples

Judging by these results, when Tacotron 2 is ready to go commercial and replace WaveNet as the voice of Google Assistant, it will be a huge step forward in the user experience of Google's voice-controlled devices.

However, it should be noted that the system has only been trained to imitate a specific female voice. To achieve a male voice or a different female voice, it would be necessary to retrain the system.

The research article is available at the following link: “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions”, published on arXiv in December 2017.
