Microsoft can synthesize your voice with just a 3-second clip

Microsoft researchers announced a new text-to-speech AI model called VALL-E that can closely simulate a person’s voice when given a three-second audio sample. Once it learns a specific voice, VALL-E can synthesize audio of that person saying anything, and it attempts to preserve the speaker’s emotional tone and the background noise of the original recording.

The researchers note that because VALL-E can synthesize speech that maintains speaker identity, the model carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker.

You can find audio samples and the paper here.

It sure would make breaking into Werner Brandes’s office (as in the 1992 movie Sneakers) a lot easier than convincing your friend to record snippets of a really terrible date.
