AI-generated

Mistral Voxtral TTS: An open-source speech model that fits on a smartwatch


Mistral releases a text-to-speech model that speaks nine languages, clones voices from samples shorter than five seconds, and runs on edge devices. Open source, naturally.


Mistral has released a new open-source model, and this time it is about voice rather than text. Voxtral TTS is a text-to-speech model compact enough to run on a smartwatch.

What Voxtral TTS can do

The model supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. It’s built on Mistral’s Ministral 3B, making it small enough for smartphones, laptops, and other edge devices.
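To get a feel for what "small enough for edge devices" means, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone. It assumes that "3B" means roughly 3 billion parameters and that the weights are stored at 16, 8, or 4 bits each. Neither the exact parameter count nor the precision Mistral ships is stated above, so treat the numbers as an illustration, not a spec.

```python
def weight_footprint_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GiB (no activations, no cache)."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumption: "3B" is taken as 3e9 parameters; the real count may differ.
PARAMS = 3e9

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_footprint_gib(PARAMS, bits):.1f} GiB")
# 16-bit weights: ~5.6 GiB
# 8-bit weights: ~2.8 GiB
# 4-bit weights: ~1.4 GiB
```

Under these assumptions, the weights would only drop into the range of a phone's or wearable's memory budget at lower precision.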

The most impressive specs:

  • 90 milliseconds time to first audio for a 500-character input
  • Voice cloning from less than 5 seconds of audio
  • Captures the accent, intonation, and speech flow of the original voice
  • Seamless language switching without losing voice characteristics, which is useful for dubbing and real-time translation
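Time to first audio is the delay between sending text and receiving the first playable chunk of sound. The sketch below shows one way to measure it for any streaming synthesizer. `fake_tts_stream` is a hypothetical stand-in that emits silent chunks after a simulated delay. It is not Mistral's API, and the 50 ms startup value is arbitrary.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_to_first_audio(
    stream: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, int]:
    """Return (seconds until the first non-empty chunk, total bytes received)."""
    start = clock()
    first = None
    total = 0
    for chunk in stream:
        if chunk and first is None:
            first = clock() - start
        total += len(chunk)
    if first is None:
        raise ValueError("stream produced no audio")
    return first, total


def fake_tts_stream(
    text: str, startup_s: float = 0.05, chunk_bytes: int = 320
) -> Iterator[bytes]:
    """Hypothetical engine: waits startup_s, then yields one silent chunk per 10 characters."""
    time.sleep(startup_s)  # simulated model startup latency
    for _ in range(0, len(text), 10):
        yield b"\x00" * chunk_bytes


ttfa, total = time_to_first_audio(fake_tts_stream("x" * 500))
print(f"time to first audio: {ttfa * 1000:.0f} ms, {total} bytes received")
```

Because the generator is lazy, the simulated startup delay falls inside the timed region. With a real engine that starts work as soon as the request is made, the clock would need to start before the request is sent.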

What it’s for

Mistral is positioning Voxtral TTS squarely at the enterprise market: voice agents for sales and customer service that sound like real humans. This puts Mistral in direct competition with ElevenLabs, Deepgram, and OpenAI.

The advantage: open source and customizable. Companies can adapt the model however they want, run it on their own servers, and keep costs under control.

The bigger picture

Voxtral TTS is part of a larger strategy. Earlier this year, Mistral released transcription models — one for batch processing, one for real-time use. With the new TTS model, the circle is complete: input (transcription) and output (speech synthesis) now come from a single provider.

Pierre Stock, VP of Science Operations at Mistral, laid out the vision clearly: an end-to-end platform for multimodal streams, with audio, text, and image as both input and output. That sounds like a complete agent stack.

My take

Mistral’s strength has always been packing big capabilities into small packages while staying open source. Voxtral TTS fits that pattern perfectly. A speech model that runs on a smartwatch and can clone voices from less than five seconds of audio is genuinely impressive.

For the European market, this is particularly relevant: nine languages including German right from launch, and the option to run everything on-premises. That’s exactly what GDPR-conscious companies want to hear.

