Investing.com -- Alibaba Group Holding Ltd ADR (NYSE:BABA) has introduced Qwen2.5-Omni, its new flagship model in the Qwen series. The end-to-end model is designed for comprehensive multimodal perception, processing inputs such as text, images, audio, and video, and providing real-time streaming responses through both text generation and natural speech synthesis.
Key features include the Thinker-Talker architecture, which lets the model perceive text, image, audio, and video inputs while generating text and natural speech responses simultaneously. The model also introduces a novel position embedding, dubbed TMRoPE (Time-aligned Multimodal RoPE), which synchronizes the timestamps of video inputs with audio.
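For a rough intuition of what time alignment means here, the sketch below is a simplified illustration, not Alibaba's implementation: it assigns audio and video tokens temporal positions derived from their timestamps, so tokens captured at the same instant share a position. The tick rate and data structures are assumptions made purely for illustration.

```python
# Illustrative sketch only: a simplified view of time-aligned positions for
# interleaved audio/video tokens. The real TMRoPE in Qwen2.5-Omni is a rotary
# position embedding; the tick rate and field names here are assumptions.
from dataclasses import dataclass

@dataclass
class Token:
    modality: str     # "audio" or "video"
    timestamp: float  # seconds from the start of the clip

def time_aligned_positions(tokens: list[Token], ticks_per_second: int = 25) -> list[int]:
    """Map each token to a temporal position derived from its timestamp,
    so audio and video tokens from the same instant share a position."""
    return [int(tok.timestamp * ticks_per_second) for tok in tokens]

# Audio and video tokens sampled at the same moments get identical temporal ids.
clip = [Token("video", 0.0), Token("audio", 0.0), Token("video", 0.04), Token("audio", 0.04)]
print(time_aligned_positions(clip))  # [0, 0, 1, 1]
```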
The model is designed for fully real-time interaction, supporting chunked input and immediate output. It surpasses many existing streaming and non-streaming alternatives in the robustness and naturalness of its speech generation. Qwen2.5-Omni shows strong performance across all modalities, outperforming the similarly sized Qwen2-Audio on audio tasks and matching Qwen2.5-VL-7B on vision-language tasks.
Qwen2.5-Omni employs the Thinker-Talker architecture, where the Thinker functions like a brain, processing and understanding inputs from text, audio, and video modalities. It generates high-level representations and corresponding text. The Talker operates like a human mouth, taking in the high-level representations and text produced by the Thinker and outputting discrete tokens of speech fluidly.
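To make that division of labor concrete, the following is a purely illustrative sketch of the Thinker-Talker data flow described above; the class and method names are hypothetical stand-ins, not the released API, and the bodies are placeholders.

```python
# Hypothetical sketch of the Thinker-Talker split: the Thinker understands
# multimodal input and produces representations plus text; the Talker turns
# those into discrete speech tokens. Names and logic are illustrative only.
from dataclasses import dataclass

@dataclass
class ThinkerOutput:
    hidden_states: list[float]  # high-level representations
    text: str                   # generated text response

class Thinker:
    def understand(self, text=None, audio=None, video=None) -> ThinkerOutput:
        # In the real model this is a multimodal transformer; here we just
        # echo the input to show the interface between the two components.
        return ThinkerOutput(hidden_states=[0.0], text=f"response to: {text}")

class Talker:
    def speak(self, thinker_out: ThinkerOutput) -> list[int]:
        # Consumes the Thinker's representations and text, and emits discrete
        # speech tokens (stand-in integers here) in a streaming fashion.
        return [hash(ch) % 1024 for ch in thinker_out.text]

thinker, talker = Thinker(), Talker()
out = thinker.understand(text="Hello")
speech_tokens = talker.speak(out)  # text and speech come from the same pass
```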
A comprehensive evaluation of Qwen2.5-Omni shows strong performance across all modalities when compared with similarly sized single-modality models such as Qwen2.5-VL-7B and Qwen2-Audio, as well as closed-source models like Gemini-1.5-Pro. On tasks requiring the integration of multiple modalities, such as OmniBench, Qwen2.5-Omni achieves state-of-the-art performance.
In the near future, Alibaba plans to enhance the model's ability to follow voice commands and improve audio-visual collaborative understanding. The company also aims to integrate more modalities towards an omni-model.
The Qwen2.5-Omni model is now publicly available on platforms like Hugging Face, ModelScope, DashScope, and GitHub. Users can experience the model's interactive features through a demo or join discussions on Discord.
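As a quick orientation rather than an official quickstart, loading the open weights from Hugging Face might look roughly like the sketch below with a recent transformers release; the repository id, class names, and generate arguments follow the public model card at the time of writing and should be verified there.

```python
# Hedged sketch: loading Qwen2.5-Omni from Hugging Face with transformers.
# Class names and the return_audio argument are taken from the model card
# and may require a recent transformers version; confirm before use.
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model_id = "Qwen/Qwen2.5-Omni-7B"  # assumed repo id; confirm on Hugging Face
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)

# A minimal text-only exchange; audio, image, and video inputs go through the
# processor in the same way, per the model card's multimodal examples.
messages = [{"role": "user", "content": [{"type": "text", "text": "Introduce yourself."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, return_tensors="pt").to(model.device)

# return_audio=False (assumed per the model card) skips speech synthesis here.
output_ids = model.generate(**inputs, max_new_tokens=128, return_audio=False)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```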
This content was originally published on Investing.com