While everyone’s talking about Anthropic’s leak-filled week, Alibaba quietly released a model that beats Google’s Gemini 3.1 Pro on several benchmarks. And the best part? It’s fully open source.
What can Qwen 3.5 Omni do?
Qwen 3.5 Omni is a natively multimodal large language model: text, images, audio, and video aren't bolted on after the fact but trained together from the start. That joint training makes a real difference in quality, especially on tasks that cross modalities, such as answering a spoken question about what's happening in a video.
The numbers are impressive: the model can process over 10 hours of audio input and analyze more than 400 seconds of 720p video sampled at 1 FPS. Speech recognition works across 113 languages and dialects, and speech generation covers 36 languages. And then there's voice cloning: the model can imitate a voice from a sample.
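What would that look like in practice? Here's a rough sketch, assuming Qwen 3.5 Omni ships with a Hugging Face transformers integration in the style of earlier Qwen-Omni releases. The model ID, the input files, and the exact classes are my assumptions, not confirmed details:

```python
# Sketch only: assumes a transformers integration similar to earlier
# Qwen-Omni releases. Model ID and exact classes are hypothetical.
from transformers import AutoModel, AutoProcessor

model_id = "Qwen/Qwen3.5-Omni-Flash"  # hypothetical repo name
processor = AutoProcessor.from_pretrained(model_id)
# The exact model class may differ once official support lands.
model = AutoModel.from_pretrained(model_id, device_map="auto")

# Qwen-Omni-style chat turns can mix modalities in a single message.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "demo_clip.mp4"},  # placeholder file
            {"type": "audio", "audio": "question.wav"},   # placeholder file
            {"type": "text", "text": "Summarize the clip and answer the spoken question."},
        ],
    }
]

# The processor turns text, frames, and waveforms into one batch of tensors.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```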
Alibaba released three variants: Plus (the flagship), Flash (faster, more efficient), and Light (for edge devices). All with a 256,000-token context window.
Why does this matter?
Two reasons. First, Qwen 3.5 Omni outperforms Google’s Gemini 3.1 Pro on several audio understanding benchmarks. That’s remarkable for an open-source model anyone can run locally. Second, it shows how quickly the gap between proprietary and open models is closing.
For developers, this means you can now run, on your own hardware, a multimodal model that competes with the best proprietary solutions. No API costs, no data leaving your systems, full control.
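To make the "no data leaving your systems" point concrete: one plausible setup is serving the weights locally behind an OpenAI-compatible endpoint (for example with vLLM, assuming it gains support for this architecture) and pointing a standard client at localhost. The model ID and port below are placeholders:

```python
# Sketch: querying a locally hosted model via an OpenAI-compatible API.
# Assumes a local server was started first, e.g. (hypothetical model ID):
#   vllm serve Qwen/Qwen3.5-Omni-Flash --port 8000
from openai import OpenAI

# base_url points at your own machine, so no request leaves your network;
# the api_key is just a placeholder that the local server ignores.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-Omni-Flash",  # hypothetical model ID
    messages=[
        {"role": "user", "content": "Summarize Mixture-of-Experts in two sentences."}
    ],
)
print(response.choices[0].message.content)
```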
The bigger picture
Qwen 3.5 Omni is part of an entire model family. The 397B-parameter variant with a Mixture-of-Experts architecture is the flagship, but the smaller models are impressive too. Alibaba has been consistently leveling up in recent months and now delivers one of the most complete open-source ecosystems in AI.
What impresses me most is the combination of breadth (text + image + audio + video) and depth (113 languages, 10 hours of audio input, voice cloning). These aren't just checkbox features: this is a serious contender for anyone building multimodal AI applications.