Microsoft has revealed its latest research in text-to-speech AI with VALL-E, as reported by Engadget. VALL-E can simulate someone's voice from only a three-second audio sample.
How It Works
According to Ars Technica, the generated speech can match the speaker's timbre and emotional tone. It can also reproduce the acoustics of the room in which the sample was recorded.
Microsoft calls VALL-E a "neural codec language model." It builds on EnCodec, Meta's AI-powered audio compression neural net, and generates audio from text input and a short sample of the speaker's voice.
VALL-E was trained on 60,000 hours of English speech from more than 7,000 speakers in Meta's LibriLight audio library. For a good result, the voice it tries to mimic should closely match a voice in the training data. The model then uses what it learned from that data to infer how the speaker would sound saying the desired text.
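To make the "codec" part of that description more concrete: a codec-based model handles audio as a sequence of discrete tokens, much as a text model handles words. The sketch below is only a toy illustration of that idea. It uses classic mu-law quantization as a stand-in for the learned codec, and the function names (encode, decode, sine) and the 8 kHz test tone are assumptions made for the example, not anything taken from Microsoft's or Meta's code.

```python
import math

MU = 255  # 8-bit mu-law: token values range from 0 to 255


def encode(samples):
    """Map float samples in [-1, 1] to integer tokens in [0, 255]."""
    tokens = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip out-of-range samples
        # Compress the amplitude logarithmically, keeping the sign.
        y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
        tokens.append(int(round((y + 1) / 2 * MU)))
    return tokens


def decode(tokens):
    """Map integer tokens back to approximate float samples."""
    samples = []
    for t in tokens:
        y = 2 * t / MU - 1
        # Invert the logarithmic compression.
        x = math.copysign((math.pow(1 + MU, abs(y)) - 1) / MU, y)
        samples.append(x)
    return samples


def sine(freq_hz, seconds, rate):
    """Generate a test tone standing in for a recorded voice prompt."""
    n = int(seconds * rate)
    return [0.5 * math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]


# A three-second "prompt" at an assumed 8 kHz sample rate.
prompt = sine(220.0, 3.0, 8000)
tokens = encode(prompt)      # audio -> discrete tokens
restored = decode(tokens)    # discrete tokens -> audio
max_error = max(abs(a - b) for a, b in zip(prompt, restored))
```

In a real system the tokens would come from a learned neural codec rather than a fixed formula, and a language model would predict new tokens for the target text before they are decoded back into sound. The toy version only shows the round trip from audio to tokens and back.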
The research team demonstrates how well this works on the VALL-E GitHub page. For each phrase they want the AI to say, there is a three-second prompt from the speaker to mimic. Next comes a "ground truth" recording of the same speaker saying another phrase for comparison, then a "baseline" sample from conventional text-to-speech synthesis, and finally the VALL-E sample.
Results were mixed: some samples sounded machine-generated, while others sounded like a real person.
Model Improvements
Microsoft plans to improve the model by scaling up the training data. The team is also working on reducing unclear or missed words.
The code will not be open source, possibly to limit the risks of an AI that can put words in someone's mouth. Instead, the team says it will follow the "Microsoft AI Principles" in further developing the model.
It will be interesting to see whether VALL-E is ever released to the public. It could be used to generate custom celebrity voices or to simulate a particular person's voice for advertising a product. Even if it is not released, it raises the bar for text-to-speech AI and may lead to a Siri-like assistant that speaks in your own voice.
AI text-to-speech (TTS) has been around for some time, and research like this keeps improving it. TTS already has applications in natural language processing, voice interfaces, and game development, among others. As speech synthesis gets better, more applications will follow.
VALL-E is an AI that turns text into speech in a specific person's voice, using only a short voice clip as a guide. It was trained on a large audio library to learn from thousands of voices. The research team has published demo samples showing VALL-E's capabilities. It is not clear whether VALL-E will be released to the public. However, the research is bound to lead to much-improved text-to-speech and general AI.