DeepMind’s new AI generates soundtracks and dialogue for videos | TechCrunch

DeepMind, Google’s AI research lab, says it is developing AI technology to generate soundtracks for videos.

In a post on its official blog, DeepMind says it sees the technology, V2A (short for “video-to-audio”), as a crucial piece of the AI-generated media puzzle. While many organizations, including DeepMind, have developed AI models that generate videos, those models can’t create sound effects to sync with the footage they generate.

“Video generation models are advancing at an incredible pace, but many current systems can only generate silent output,” writes DeepMind. “V2A technology [could] become a promising approach for bringing generated movies to life.”

DeepMind’s V2A technology takes the description of a soundtrack (e.g. “jellyfish pulsating under water, marine life, ocean”) paired with a video to create music, sound effects and even dialogue that matches the characters and tone of the video, watermarked by DeepMind’s deepfakes-fighting SynthID technology. The AI model that powers V2A, a diffusion model, was trained on a combination of sounds and dialogue transcripts as well as video clips, DeepMind says.
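
As a rough sketch of the workflow described above (this is not DeepMind’s actual API; every name here is invented for illustration):

```python
# Hypothetical sketch of a V2A-style interface, based only on the article's
# description. AudioTrack, generate_audio_for_video and
# apply_synthid_watermark are all invented names, not DeepMind's API.
from dataclasses import dataclass


@dataclass
class AudioTrack:
    samples: list[float]      # mono waveform
    sample_rate: int = 48_000


def apply_synthid_watermark(audio: AudioTrack) -> AudioTrack:
    # Stand-in for DeepMind's SynthID watermarking step, which embeds an
    # imperceptible signature in the generated output.
    return audio


def generate_audio_for_video(
    frames: list[bytes],
    prompt: str | None = None,  # e.g. "jellyfish pulsating under water, marine life, ocean"
) -> AudioTrack:
    # Per the article: a diffusion model conditions on the video's raw
    # pixels (and, optionally, a text prompt), then denoises random noise
    # into music, sound effects or dialogue synced to the footage.
    audio = AudioTrack(samples=[0.0] * 48_000)  # placeholder waveform
    return apply_synthid_watermark(audio)       # output is watermarked
```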

“By training on video, audio and the additional annotations, our technology learns to associate specific audio events with various visual scenes, while responding to the information provided in the annotations or transcripts,” according to DeepMind.
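
Read literally, that suggests training records pairing a clip’s video and audio with text. Purely as an illustration, one such record might look like this (the field names and values are invented, not DeepMind’s schema):

```python
# Hypothetical shape of one training record implied by the quote above;
# everything here is invented for illustration.
training_example = {
    "video": "clip_00042.mp4",
    "audio": "clip_00042.wav",
    "annotations": ["rain on a tin roof", "distant thunder"],  # described sound events
    "transcript": "We should head inside.",                    # spoken dialogue
}
```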

Mum’s the word on whether any of the training data was copyrighted, and whether the data’s creators were informed of DeepMind’s work. We’ve reached out to DeepMind for clarification and will update this post if we hear back.

AI-powered sound-generating tools aren’t new. Startup Stability AI released one just last week, and ElevenLabs launched one in May. Nor are models that create sound effects for video. A Microsoft project can generate talking and singing videos from a still image, and platforms like Pika and GenreX have trained models to take a video and make a best guess at what music or effects are appropriate in a given scene.

But DeepMind claims its V2A technology is unique in that it can understand the raw pixels from a video and automatically sync the generated sounds with the footage, optionally without a description.
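
In terms of the hypothetical interface sketched earlier, the optional description might work like this:

```python
# Both calls reuse the hypothetical generate_audio_for_video sketch above;
# load_frames is likewise an invented helper, stubbed here.
def load_frames(path: str) -> list[bytes]:
    # Decode a video file into raw frames (stubbed for illustration).
    return []


frames = load_frames("archival_clip.mp4")

# With a text hint steering the soundtrack:
scored = generate_audio_for_video(frames, prompt="1920s ragtime piano, projector hum")

# Or, as DeepMind claims is possible, from the raw pixels alone:
auto = generate_audio_for_video(frames)
```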

V2A isn’t perfect, and DeepMind admits as much. Because the underlying model wasn’t trained on many videos with artifacts or distortions, it doesn’t create particularly high-quality audio for them. And in general, the generated audio isn’t especially convincing; my colleague Natasha Lomas described it as “a smorgasbord of stereotypical sounds,” and I can’t say I disagree.

For these reasons, and to prevent misuse, DeepMind says it won’t be releasing the technology to the public anytime soon, if ever.

“To ensure our V2A technology can have a positive impact on the creative community, we’re gathering diverse perspectives and insights from leading creators and filmmakers, and using this valuable feedback to inform our ongoing research and development,” writes DeepMind. “Before we consider opening access to it to the wider public, our V2A technology will undergo rigorous safety assessments and testing.”

DeepMind pitches its V2A technology as a particularly useful tool for archivists and people working with historical footage. But generative AI along these lines also threatens to upend the film and television industry. It’s going to take some seriously strong labor protections to ensure that media-generating tools don’t eliminate jobs, or, as the case may be, entire professions.

