Microsoft has developed a new artificial intelligence (AI) speech generator that the company says is so convincing it will not be released to the public.
VALL-E 2 is a text-to-speech (TTS) generator that can reproduce the voice of a human speaker using just a few seconds of audio.
Microsoft researchers said VALL-E 2 was capable of generating “accurate, natural speech in the exact voice of the original speaker, comparable to human performance,” in a paper that appeared June 17 on the pre-print server arXiv. In other words, the new AI voice generator is convincing enough to be mistaken for a real person — at least, according to its creators.
“VALL-E 2 is the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time,” the researchers wrote in the paper. “Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases.”
Human parity in this context means that speech generated by VALL-E 2 matched or exceeded the quality of human speech in benchmarks used by Microsoft.
The AI engine achieves this through two key features: “Repetition Aware Sampling” and “Grouped Code Modeling.”
Repetition Aware Sampling improves the way the AI converts text into speech by addressing repetitions of “tokens” — small units of language, like words or parts of words — preventing infinite loops of sounds or phrases during the decoding process. In other words, this feature helps vary VALL-E 2’s pattern of speech, making it sound more fluid and natural.
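A minimal sketch of what such a decoding rule might look like, assuming nucleus (top-p) sampling as the default strategy; the window size, repetition threshold and top_p value here are illustrative assumptions, not details taken from Microsoft’s paper:

```python
import numpy as np

def repetition_aware_sample(probs, history, window=10, threshold=0.3,
                            top_p=0.9, rng=None):
    """Illustrative sketch of repetition-aware decoding (not Microsoft's code).

    Draw a token with nucleus (top-p) sampling; if that token already
    dominates the recent decoding history, resample from the full
    distribution to break the loop. window, threshold and top_p are
    assumed values chosen for illustration.
    """
    rng = rng or np.random.default_rng()

    # Nucleus sampling: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalize and sample.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    token = rng.choice(keep, p=probs[keep] / probs[keep].sum())

    # If this token repeats too often in the recent window, fall back to
    # random sampling over the whole distribution instead.
    recent = history[-window:]
    if recent and recent.count(int(token)) / len(recent) >= threshold:
        token = rng.choice(len(probs), p=probs)
    return int(token)
```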
Grouped Code Modeling, meanwhile, improves efficiency by reducing the sequence length — or the number of individual tokens that the model processes in a single input sequence. This speeds up how quickly VALL-E 2 generates speech and helps manage difficulties that come with processing long strings of sounds.
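As a rough illustration of the idea, the sketch below folds a flat sequence of codec tokens into fixed-size groups, so an autoregressive model would take one step per group rather than one per token; the group size is an arbitrary assumption, not Microsoft’s configuration:

```python
import numpy as np

def group_codes(codes, group_size=2):
    """Fold a flat codec-token sequence into fixed-size groups, so an
    autoregressive model takes one step per group rather than one per
    token. group_size here is an arbitrary choice for illustration.
    """
    # Pad so the length divides evenly into groups.
    pad = (-len(codes)) % group_size
    padded = np.concatenate([codes, np.zeros(pad, dtype=codes.dtype)])
    # Each row is one model step covering group_size original tokens.
    return padded.reshape(-1, group_size)

codes = np.arange(7)          # 7 codec tokens
steps = group_codes(codes)    # 4 model steps instead of 7
```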
The researchers used audio samples from speech libraries LibriSpeech and VCTK to assess how well VALL-E 2 matched recordings of human speakers. They also used ELLA-V — an evaluation framework designed to measure the accuracy and quality of generated speech — to determine how effectively VALL-E 2 handled more complex speech generation tasks.
“Our experiments, conducted on the LibriSpeech and VCTK datasets, have shown that VALL-E 2 surpasses previous zero-shot TTS systems in speech robustness, naturalness, and speaker similarity,” the researchers wrote. “It is the first of its kind to reach human parity on these benchmarks.”
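For a sense of what a “speaker similarity” benchmark measures, the sketch below scores cosine similarity between speaker embeddings of the reference recording and the generated audio, a standard proxy in TTS evaluation; the embedding model and the paper’s exact metric pipeline are assumptions not reproduced here:

```python
import numpy as np

def speaker_similarity(emb_ref, emb_gen):
    """Cosine similarity between speaker embeddings of the reference
    recording and the generated audio. The embeddings are assumed to
    come from some speaker-verification model; the paper's exact
    pipeline is not reproduced here.
    """
    return float(np.dot(emb_ref, emb_gen) /
                 (np.linalg.norm(emb_ref) * np.linalg.norm(emb_gen)))
```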
The researchers pointed out in the paper that the quality of VALL-E 2’s output depended on the length and quality of speech prompts — as well as environmental factors like background noise.
“Purely a research project”
Despite its capabilities, Microsoft will not release VALL-E 2 to the public, citing the risk of misuse. The decision comes amid growing concern over voice cloning and deepfake technology; other AI companies, such as OpenAI, have placed similar restrictions on their voice tech.
“VALL-E 2 is purely a research project. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public,” the researchers wrote in a blog post. “It may carry potential risks in the misuse of the model, such as spoofing voice identification or impersonating a specific speaker.”
That said, they did suggest AI speech tech could see practical applications in the future. “VALL-E 2 could synthesize speech that maintains speaker identity and could be used for educational learning, entertainment, journalistic, self-authored content, accessibility features, interactive voice response systems, translation, chatbot, and so on,” the researchers added.
They continued: “If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model.”