Microsoft’s new AI bot VALL-E can be trained with only a three-second audio sample

A team of Microsoft researchers has created an innovative text-to-speech AI model named VALL-E. Once trained, it can replicate a person’s voice almost perfectly, and it needs only a three-second audio sample of that voice. The researchers claim that once the AI tool learns a specific voice, VALL-E can synthesize audio of that person saying anything, in a way that attempts to preserve the speaker’s emotional tone as well as the acoustic environment the speaker was recorded in.

According to its developers, VALL-E could be used for high-quality text-to-speech applications; for speech editing, in which a recording of a person’s voice is altered to match an edited text transcript; and, in combination with other generative AI models like GPT-3, for content creation.

VALL-E is built on EnCodec, a neural audio codec technique that Meta revealed in October 2022. In contrast to conventional text-to-speech systems, which typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. It analyzes a person’s voice, breaks the signal down into discrete tokens, and then uses its training data to match what it “knows” about how that voice would sound speaking words beyond the sample.
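Meta’s EnCodec codec, unlike VALL-E itself, is open source, so the tokenization step described above can be shown concretely. The sketch below uses the `encodec` Python package to turn a short clip into discrete codec codes; the file name is a placeholder, and this illustrates only the codec stage, not VALL-E.

```python
# Minimal sketch: turning a short audio clip into discrete codec codes
# with Meta's open-source EnCodec package (pip install encodec).
# "prompt.wav" is a placeholder path; this is only the tokenization
# step that VALL-E-style models build on, not VALL-E itself.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # 6 kbps -> 8 codebooks per frame

wav, sr = torchaudio.load("prompt.wav")  # e.g. a 3-second voice prompt
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))  # list of (codes, scale) pairs

codes = torch.cat([c for c, _ in frames], dim=-1)  # [batch, n_codebooks, n_frames]
print(codes.shape)
```

At this setting the 24 kHz model emits 8 parallel codebook indices per frame at 75 frames per second, so a three-second prompt becomes roughly 225 frames of tokens, the kind of compact, discrete representation a language model can be trained over.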
Microsoft trained its new VALL-E voice-synthesis abilities on LibriLight, an audio library assembled by Meta, the parent company of Facebook. It contains 60,000 hours of English-language speech from more than 7,000 different speakers, primarily extracted from LibriVox public domain audiobooks. For Microsoft’s new AI bot to produce an acceptable result, the voice in the three-second sample must closely resemble a voice in the training data.
In addition to preserving a speaker’s vocal timbre and emotional tone, VALL-E can also imitate the “acoustic environment” of the sample audio. For example, if the sample came from a phone call, the synthesized output will carry the acoustic and frequency characteristics of a phone call: in other words, it will sound like a phone call. Furthermore, Microsoft’s samples (included in the “Synthesis of Diversity” section) demonstrate that VALL-E can generate varied renditions of the same text by changing the random seed used during generation. Microsoft AI Research says it is creating artificial intelligence machines that complement human reasoning to augment and enrich our experience and competencies.
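Since VALL-E itself has not been released, the mechanism behind that “Synthesis of Diversity” behavior can only be sketched generically: the decoder samples its output tokens, so different random seeds yield different but equally valid takes of the same text and prompt. The toy sampler below is hypothetical, with a uniform distribution standing in for a real model’s next-token predictions; only the seeding effect is the point.

```python
# Toy demonstration of "diversity via random seed": a seeded sampler
# reproduces the same token sequence for the same seed, and a new seed
# gives a new "take". The uniform logits are a stand-in for a real
# VALL-E-style model, which Microsoft has not released.
import torch

def sample_codec_tokens(next_token_logits, n_frames: int, seed: int) -> list:
    g = torch.Generator().manual_seed(seed)  # fix the sampling RNG
    tokens = []
    for _ in range(n_frames):
        probs = torch.softmax(next_token_logits(tokens), dim=-1)
        tokens.append(torch.multinomial(probs, 1, generator=g).item())
    return tokens

toy_model = lambda history: torch.zeros(1024)  # uniform over 1024 codec tokens

print(sample_codec_tokens(toy_model, 8, seed=0))  # one "take"
print(sample_codec_tokens(toy_model, 8, seed=1))  # a different "take"
print(sample_codec_tokens(toy_model, 8, seed=0))  # identical to the first
```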
WHAT IS MICROSOFT’S VALL-E?
VALL-E is essentially a text-to-speech (TTS) system: you feed it a script of text, and it turns that text into audio. In the past, such software has always generated audio that either sounds incredibly robotic or costs an arm and a leg for “human voices”. VALL-E, a neural codec language model trained on 60,000 hours of English speech, produces results that are about as close to a human talking as current systems get. Microsoft has claimed that its AI tool can “significantly outperform” other TTS tools on the market. What actually makes it stand out isn’t its ability to sound like you; it’s the ability to capture emotion in speech, which is what makes it sound like someone is actually talking.
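For readers wondering what a “neural codec language model” looks like in practice, here is a heavily simplified sketch, not Microsoft’s architecture: a decoder-only transformer reads phoneme tokens followed by codec tokens and predicts the next codec token. All layer sizes and vocabularies below are arbitrary illustrations.

```python
# Heavily simplified "neural codec language model" sketch: predict the
# next discrete codec token given text (phoneme) tokens and the codec
# tokens generated so far. Illustrative only; not Microsoft's VALL-E.
import torch
import torch.nn as nn

class TinyCodecLM(nn.Module):
    def __init__(self, n_phonemes=100, n_codec=1024, d=256, layers=4, heads=4):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phonemes, d)
        self.code_emb = nn.Embedding(n_codec, d)
        block = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(d, n_codec)

    def forward(self, phonemes, codes):
        # Concatenate text and audio tokens into one sequence. (A causal
        # mask everywhere keeps the sketch short; a real system would let
        # the model attend to the full text prompt bidirectionally.)
        x = torch.cat([self.phone_emb(phonemes), self.code_emb(codes)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=mask)
        return self.head(h[:, phonemes.size(1):])  # logits for next codec tokens

model = TinyCodecLM()
phonemes = torch.randint(0, 100, (1, 12))  # a short "sentence"
codes = torch.randint(0, 1024, (1, 75))    # ~1 second of codec tokens
print(model(phonemes, codes).shape)        # torch.Size([1, 75, 1024])
```

At inference time a model like this would be sampled token by token, and the resulting code sequence handed back to the codec’s decoder to produce a waveform.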
USING MICROSOFT’S VALL-E
At this point, Microsoft has not released a free-to-use version the way OpenAI did with ChatGPT. The company has, however, posted a set of samples on its website showing the range of results the tool can produce. Of course, while the tool could help people who have lost the ability to speak, it could also be used to create convincing deepfake audio of well-known personalities. Between this, ChatGPT, and DALL-E, we’ll soon be living in a world where we won’t be able to distinguish between content created by humans and content created by machines.
Don’t fall off the VALL-E
The voice-matching AI was trained on 60,000 hours of English speech data and is prompted with three-second voice clips: given a clip of a particular voice, it generates new speech in that voice. Microsoft shared examples of VALL-E’s work on GitHub. Some sound authentic; others still have a robotic tone to them. With a bigger sample of voices, the feature seems set to open a new dimension in vocal imitation.
The rapid development and evolution of AI continue to raise ethical issues. What do you do when someone can capture a mere three seconds of your voice and use it to make you say something you’d never say? You could be cancelled for actions you never took.