Microsoft’s New AI Text-to-Speech Model Is Too Realistic To Be Released

July 18, 2024
Text-to-Speech

The New York Post has reported that Microsoft researchers have created an AI text-to-speech technology so realistic that its output can be mistaken for a human voice.

Microsoft claims that VALL-E 2 is the first AI voice software of its kind to come close to “achieving human parity.” VALL-E 2 needs only a few seconds of audio to replicate a voice accurately. Until now, subtleties in language could be used to identify the output of other audio generation models as AI-generated. However, the audio generated by VALL-E 2 is reportedly indistinguishable from an individual’s speech. Microsoft Research developers claim that VALL-E 2 can generate “accurate, natural speech in the exact voice of the original speaker, comparable to human performance.”

VALL-E 2 is capable of synthesizing both short phrases and lengthy sentences. The tool relies on two techniques, Grouped Code Modeling and Repetition Aware Sampling, to achieve this. Repetition Aware Sampling varies the system’s speech and gives it a more natural feel by preventing repetitive sounds or phrases from occurring during the decoding process. Grouped Code Modeling is used to produce results faster.
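To make the two techniques more concrete, here is a minimal Python sketch of how they might work in principle. It is an illustration under stated assumptions, not Microsoft's implementation: the function names, the parameters `top_p`, `window`, `max_ratio`, and `group_size`, and their default values are all hypothetical. The sketch assumes that the decoder first draws a token with nucleus (top-p) sampling and, if that token already dominates the recent history, redraws from the full distribution to break the loop. Grouping is shown simply as chunking a code sequence so that fewer decoding steps are needed.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample a token id from the smallest set of top tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, rng, top_p=0.8, window=10, max_ratio=0.1):
    """Hypothetical sketch: fall back to plain random sampling when a token repeats too often.

    probs     -- probability of each token id at the current step
    history   -- token ids already decoded
    window    -- how many recent tokens to inspect (assumed value)
    max_ratio -- repetition share that triggers the fallback (assumed value)
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # The candidate is over-represented in the recent window, so draw
            # again from the full distribution instead of the truncated one.
            token = rng.choices(list(range(len(probs))), weights=probs, k=1)[0]
    return token


def group_codes(codes, group_size):
    """Hypothetical sketch: chunk a code sequence so each decoding step covers a whole group."""
    return [codes[i:i + group_size] for i in range(0, len(codes), group_size)]


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.9, 0.05, 0.05]
    stuck_history = [0] * 10
    print(repetition_aware_sample(probs, stuck_history, rng))
    print(group_codes(list(range(6)), 2))
```

In this toy setup, nucleus sampling alone would always return token 0, because that token by itself exceeds the `top_p` threshold. The fallback is what lets the other tokens appear once token 0 fills the recent window. Grouping six codes in pairs leaves three steps instead of six, which is the intuition behind the speed-up.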

To assess how well VALL-E 2 handled more complex tasks, the researchers used ELLA-V, an evaluation framework for zero-shot text-to-speech synthesis, and also compared VALL-E 2 against audio samples from two English-language databases, LibriSpeech and VCTK. In the end, the system outperformed its competitors “in speech robustness, naturalness, and speaker similarity.”

Microsoft maintains that VALL-E 2 will not be made available to the public anytime soon, labeling it “purely a research project.” The tech giant has also stated that any suspected abuse of the program can be reported through an online portal. 

The company stated on its website, “Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public.” Microsoft went on to state, “It may carry potential risks in the misuse of the model, such as spoofing voice identification or impersonating a specific speaker.”

