Alibaba Unveils Qwen3-Omni Multimodal AI, Outperforms Rivals
- tech360.tv
Alibaba Group Holding on Tuesday unveiled a new suite of artificial intelligence models, including Qwen3-Omni, a flagship multimodal system. Developers stated two variants of Qwen3-Omni outperformed OpenAI’s GPT-4o and Google’s Gemini 2.5-Flash in benchmark tests.

The new model rivals OpenAI’s GPT-4o, launched in May 2024, and Google’s popular "Nano Banana" image editor, intensifying competition domestically and internationally. Qwen3-Omni processes text, audio, image, and video inputs, responding with text and audio.
The development team stated on social media that Qwen3-Omni was the first native end-to-end multimodal system to unify text, images, audio, and video in one model.
Benchmark tests assessed audio recognition and comprehension, along with image and video understanding. These tests showed Qwen3-Omni variants surpassed their predecessor, Qwen2.5-Omni-7B, as well as GPT-4o and Gemini 2.5-Flash.
Lin Junyang, a researcher on the Qwen team under Alibaba’s cloud unit, attributed the improvements to various foundational audio and image projects. Lin stated, "This year, our audio team has spent great efforts on building large-scale audio data sets for both pretraining and post-training," adding that they "combined everything … to build our Qwen3-Omni."

Qwen3-Omni supports inputs in 119 text languages and understands 19 spoken languages, including English, Chinese, Japanese, Spanish, Arabic, and Urdu. It can generate spoken responses in 10 languages, among them English, Chinese, French, German, Russian, Italian, Spanish, Portuguese, Japanese, and Korean.
The model’s multilingual and multimodal capabilities extend its utility beyond text-based conversations. According to an Alibaba video demonstration, hardware equipped with Qwen3-Omni, cameras, microphones, and speakers could perceive images and sounds and respond aloud.
Three variants of the Qwen3-Omni series are now available on open-source hosting platforms, including Hugging Face and GitHub. Alibaba also launched an updated open-source image tool, Qwen-Image-Edit-2509, on Tuesday.
Additionally, a proprietary speech model, Qwen3-TTS-Flash, was released, available exclusively through the company’s cloud computing platform. The team noted the new image tool improved image consistency during editing, while the speech model could produce "expressive voices" with humanlike timbres and adapt its tone to match input text.
These model releases precede Alibaba Cloud’s annual Apsara Conference, taking place from Wednesday to Friday in Hangzhou, capital of eastern China’s Zhejiang province.
- Alibaba unveiled Qwen3-Omni, a new multimodal artificial intelligence model.
- Two Qwen3-Omni variants reportedly outperformed OpenAI’s GPT-4o and Google’s Gemini 2.5-Flash in benchmark tests.
- Qwen3-Omni processes text, audio, image, and video inputs, and supports 119 text and 19 spoken languages.
Source: SCMP