Google just released Gemini 3.1 Flash TTS, a new text-to-speech model with a pretty clever approach: you control the voice through the prompt itself. No complicated settings menus, no SSML tags. You just write how you want it to sound.
Audio Tags Instead of Sliders
The core feature is what Google calls audio tags. You embed natural language instructions directly into the text and tell the model whether to speak ‘enthusiastically’, ‘informatively’, or with ‘positive surprise’. This works across more than 70 languages.
What I find particularly impressive is the accent control. For English alone, there’s a whole range: American Valley, Southern, British RP, Brixton — and many more. Other languages have regional variants too. For developers building voice interfaces, this is a huge leap over the monotone TTS models we’ve been stuck with.
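To make the idea concrete, here is a minimal sketch of how a tagged script could be assembled before it is sent to the model. It only builds the text and makes no API call. The square-bracket tag syntax, the accent written as a leading tag, and the helper names `tag` and `build_script` are all assumptions for illustration; check Google's documentation for the exact format the model expects.

```python
from typing import Iterable, Optional, Tuple


def tag(instruction: str) -> str:
    """Wrap a natural-language direction in square brackets.

    The bracket syntax is an assumption, not confirmed by the article.
    """
    cleaned = " ".join(instruction.split())  # collapse stray whitespace
    if not cleaned:
        raise ValueError("instruction must not be empty")
    return f"[{cleaned}]"


def build_script(
    segments: Iterable[Tuple[Optional[str], str]],
    accent: Optional[str] = None,
) -> str:
    """Join (instruction, text) pairs into one prompt string.

    Each segment becomes one line; a segment whose instruction is None
    is emitted without a tag. An optional accent goes on the first line
    (hypothetical convention).
    """
    lines = []
    if accent:
        lines.append(tag(f"{accent} accent"))
    for instruction, text in segments:
        prefix = f"{tag(instruction)} " if instruction else ""
        lines.append(f"{prefix}{text.strip()}")
    return "\n".join(lines)


script = build_script(
    [
        ("enthusiastically", "Welcome back to the show!"),
        ("informatively", "Today we look at prompt-controlled speech."),
        ("positive surprise", "And it works in more than 70 languages!"),
    ],
    accent="British RP",
)
print(script)
```

The resulting string would be passed as the text input of a TTS request. Keeping the directions in the text like this means the same script can be reviewed, versioned, and edited like any other copy.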
Benchmark Results
On the Artificial Analysis TTS Leaderboard, the model scored an Elo of 1,211, which puts it in the leaderboard's ‘most attractive quadrant’: high quality at low cost. That combination matters for anyone deploying TTS in production, where per-request cost adds up quickly.
Availability
The model is available now via the Gemini API, Google AI Studio, and Vertex AI. All generated audio is watermarked with SynthID, Google’s watermarking technology for identifying AI-generated content, which is meant to curb deepfake misuse.
What This Means
Speech synthesis has long been a space where specialized providers like ElevenLabs set the standard. With Flash TTS, Google brings a model that’s significantly more flexible than traditional TTS APIs thanks to prompt-based control. For developers automating voice agents or podcasts, this is an exciting new tool to explore.