Microsoft Released VibeVoice-1.5B: An Open-Source Text-to-Speech Model that can Synthesize up to 90 Minutes of Speech with Four Distinct Speakers

Microsoft has released VibeVoice-1.5B, an open-source text-to-speech (TTS) model capable of synthesizing up to 90 minutes of continuous speech with up to four distinct speakers. The model combines a Large Language Model backbone with acoustic and semantic tokenizers to support extended multi-speaker conversations with natural turn-taking and consistent vocal identities.

VibeVoice-1.5B is available under the MIT license, making it accessible to researchers and developers. It requires about 7 GB of GPU memory, allowing users with consumer-grade GPUs like the RTX 3060 to run multi-speaker synthesis. Supported languages are English and Chinese, and the model can also perform cross-lingual synthesis and singing voice generation.
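To put the roughly 7 GB figure in context, here is a minimal back-of-the-envelope sketch of how much memory the language-model weights alone would occupy at different numeric precisions. The helper names (`weight_memory_gb`, `fits`), the precision table, and the idea of treating everything else as a single `overhead_gb` number are illustrative assumptions, not part of Microsoft's release; the real footprint also depends on the tokenizers, activations, and sequence length.

```python
# Rough estimate only: counts the bytes needed to store model weights.
# All names and the overhead figure are hypothetical, for illustration.

BYTES_PER_PARAM = {
    "float32": 4,
    "bfloat16": 2,
    "float16": 2,
    "int8": 1,
}


def weight_memory_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits(num_params: float, dtype: str, vram_gb: float, overhead_gb: float) -> bool:
    """True if weights plus an assumed fixed overhead fit in the given VRAM."""
    return weight_memory_gb(num_params, dtype) + overhead_gb <= vram_gb


if __name__ == "__main__":
    params = 1.5e9  # parameter count implied by the model name
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gb(params, dtype):.1f} GB for weights")
```

Under these assumptions, 16-bit weights for a 1.5-billion-parameter backbone take about 3 GB, which leaves the remainder of the reported 7 GB for everything the sketch does not model.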

Microsoft plans to expand this line with a larger, 7-billion-parameter streaming-optimized model in the future. It is also embedding safety measures such as audio watermarks and restrictions against misuse, including voice impersonation and disinformation. With a permissive license and modest hardware requirements, the release makes extended, natural, multi-speaker audio generation available to a much wider group of researchers and developers.
