OpenAI just dropped three new audio models for the API that should make voice-powered apps a lot more interesting. And this time it’s not about ChatGPT features for end users — it’s developer tooling.
The three new models
GPT-Realtime-2 succeeds GPT-Realtime-1.5 and brings GPT-5-class reasoning into real-time conversations. It handles more complex requests and carries conversations forward more naturally. Billing is token-based.
GPT-Realtime-Translate translates speech live from 70+ input languages into 13 output languages — in real time, while the speaker is still talking. This isn’t delayed machine translation. It’s simultaneous.
GPT-Realtime-Whisper transcribes speech as a live stream. No waiting for the recording to end — text appears as you speak.
Why this matters
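Translate and Whisper are billed per minute, not per token. That makes cost planning significantly easier for developers: the bill scales with audio duration, which you can estimate up front, rather than with a token count that is harder to predict in advance. It is also potentially cheaper for long conversations.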
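To make the planning argument concrete, here is a minimal sketch of the two billing styles in Python. The rates are hypothetical placeholders, not OpenAI's actual prices, and the sketch assumes per-minute billing is prorated by the second, which the announcement does not confirm. The function names are made up for illustration.

```python
# HYPOTHETICAL placeholder rates for illustration only.
# These are NOT OpenAI's published prices.
PER_MINUTE_RATE_USD = 0.01      # assumed price per minute of audio
PER_1K_TOKENS_RATE_USD = 0.03   # assumed price per 1,000 tokens


def per_minute_cost(audio_seconds: float,
                    rate_per_minute: float = PER_MINUTE_RATE_USD) -> float:
    """Duration-based billing: cost depends only on audio length.

    Assumes proration by the second (an assumption, not a documented rule).
    """
    return (audio_seconds / 60.0) * rate_per_minute


def per_token_cost(tokens: int,
                   rate_per_1k: float = PER_1K_TOKENS_RATE_USD) -> float:
    """Token-based billing: cost depends on how many tokens were processed."""
    return (tokens / 1000.0) * rate_per_1k


if __name__ == "__main__":
    # A 45-minute talk: the per-minute cost is known before the event starts.
    talk_seconds = 45 * 60
    print(f"per-minute estimate: ${per_minute_cost(talk_seconds):.2f}")

    # With token billing, the same 45 minutes can cost different amounts
    # depending on how dense the conversation is (token counts are invented).
    for label, tokens in [("sparse talk", 20_000), ("dense talk", 60_000)]:
        print(f"per-token, {label}: ${per_token_cost(tokens):.2f}")
```

With duration-based billing, the only input is the length of the audio, so a budget can be set from the event schedule alone. With token-based billing, the estimate needs a guess about how much is actually said.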
The use cases span education, media, events, and creator platforms. If you’re building an app that works with voice today, you just got three building blocks that were either unavailable or required serious engineering effort before.
Safety guardrails
OpenAI built in protections against spam, fraud, and other forms of online abuse. Conversations can be automatically halted if they violate content guidelines.
The bigger picture
Voice as an interface is being contested from all sides right now. Google recently launched Gemini 3.1 Flash TTS with its own speech model covering 70 languages. OpenAI is now countering with a full real-time stack for developers.
For us as users, this means the apps we’ll see in the coming months should handle voice far more capably: simultaneous translation, live transcription, and natural conversations, all through a single API.
Sources: OpenAI: Advancing voice intelligence with new models in the API, TechCrunch: OpenAI launches new voice intelligence features in its API