[vc_row][vc_column][vc_column_text dp_text_size=”size-4″]Microsoft's new text-to-speech AI can replicate a voice, including its tone and pitch, from a brief three-second audio clip. Despite being a complicated system under the hood, VALL-E, a "neural codec language model", is extremely simple to use and only requires an audio sample and the text to be spoken.
The developers of the programme are confident it can be applied to high-quality text-to-speech tasks, including speech editing and audio content creation. Microsoft's system builds on EnCodec, the neural audio codec Meta unveiled in October of the previous year.
VALL-E analyses how someone sounds and breaks that information down into discrete components, generating discrete audio codec codes from the text and a short acoustic prompt. It then draws on what it learned during training to predict how that voice would sound delivering a different phrase.
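To make the idea of "discrete audio codec codes" concrete, here is a toy sketch in Python. It is not EnCodec's real algorithm (EnCodec uses a learned neural encoder with residual vector quantization); it simply shows how a continuous waveform can be turned into a sequence of integer tokens, the kind of discrete symbols a codec language model like VALL-E is trained to predict.

```python
import numpy as np

# Toy illustration only: uniform scalar quantization, NOT EnCodec's
# learned neural codec. It shows how continuous audio samples become
# a sequence of discrete integer codes.

def quantize(wave, n_codes=256):
    """Map samples in [-1, 1] to integer codes in [0, n_codes - 1]."""
    codes = np.round((wave + 1.0) / 2.0 * (n_codes - 1)).astype(int)
    return np.clip(codes, 0, n_codes - 1)

def dequantize(codes, n_codes=256):
    """Map integer codes back to approximate samples in [-1, 1]."""
    return codes / (n_codes - 1) * 2.0 - 1.0

t = np.linspace(0, 1, 8000)                # one second at 8 kHz
wave = 0.5 * np.sin(2 * np.pi * 440 * t)   # a 440 Hz tone
codes = quantize(wave)                     # discrete token sequence
recon = dequantize(codes)                  # lossy reconstruction

print(codes[:5])                           # the first few integer tokens
print(float(np.max(np.abs(wave - recon)))) # small reconstruction error
```

A real codec language model would then treat sequences like `codes` the way a text language model treats words: given a prompt of tokens, it predicts the tokens that should come next.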
The speech-synthesis abilities of VALL-E were trained on audio from a library that Meta put together, containing 60,000 hours of English speech from more than 7,000 speakers. For a successful outcome, the voice in the three-second sample must closely resemble a voice in that training data.
By altering the random seed used in the generation process, the model can produce variations in voice tone, as the samples provided by Microsoft show. VALL-E can also imitate the acoustic environment of the sample audio, for example making a voice sound as though it is being heard over the phone.
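The role of the random seed can be sketched in a few lines. The probabilities and the `sample_tokens` helper below are made up for illustration, they stand in for VALL-E's sampling-based decoding step, not its real API: the model and its output distribution stay fixed, but a different seed draws a different token sequence, while the same seed reproduces the same one.

```python
import numpy as np

# Hypothetical stand-in for a sampling-based decoder: `probs` is a
# made-up next-token distribution, not real model output.

def sample_tokens(probs, length, seed):
    """Draw `length` tokens from a fixed distribution, seeded."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(probs), size=length, p=probs)

probs = np.array([0.5, 0.3, 0.2])     # toy 3-token vocabulary
a = sample_tokens(probs, 20, seed=1)  # one "rendition"
b = sample_tokens(probs, 20, seed=2)  # a different seed -> a variation
c = sample_tokens(probs, 20, seed=1)  # same seed -> identical output

print(a.tolist())
print(b.tolist())
```

This is why re-running generation with a new seed yields a noticeably different delivery of the same sentence, while fixing the seed makes a result reproducible.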
Although speech-generating software is already widely used by news sites, it typically requires a lot of input, and the resulting voice lacks a human-like quality, unable to convey expression or inflection. VALL-E is more sophisticated because it needs less input yet produces better, more accurate results. The programme, however, poses significant risks if the model is misused, such as spoofing voice identification or impersonating a speaker.
[/vc_column_text][/vc_column][/vc_row]