Voice Models Catalog
Xantly exposes 33 voice models across 6 providers. Use the sttmodel / ttsmodel parameters on the voice endpoints to pick a specific model, or let Xantly auto-route based on language and latency budget
title: Voice Models Catalog description: All voice models available via the Xantly Voice API, grouped by type and provider.
Xantly exposes 33 voice models across 6 providers. Use the stt_model / tts_model parameters on the voice endpoints to pick a specific model, or let Xantly auto-route based on language and latency budget.
All models route through the same API key as chat — there is no separate voice subscription. Voice usage bills per-minute (provider pass-through) plus a per-minute platform fee. See voice-agents guide for the billing breakdown.
Speech-to-Text (STT) — 10 models
Use with POST /v1/voice/transcribe or as stt_model on POST /v1/voice/chat.
| Slug | Provider | Pricing Unit | Notes |
|---|---|---|---|
openai/whisper-1 | OpenAI | per minute | Classic Whisper, 50+ languages |
openai/gpt-4o-transcribe | OpenAI | per minute | Improved accuracy over Whisper |
openai/gpt-4o-mini-transcribe | OpenAI | per minute | Cost-optimized GPT-4o transcribe |
deepgram/nova-2 | Deepgram | per minute | Default for English, low latency |
deepgram/nova-3 | Deepgram | per minute | Latest Deepgram model, top accuracy |
deepgram/flux | Deepgram | per minute | Streaming-optimized |
groq/whisper-large-v3 | Groq | per hour | Whisper on Groq LPU (217x real-time) |
groq/whisper-large-v3-turbo | Groq | per hour | Faster Whisper variant (cheaper) |
elevenlabs/scribe_v2 | ElevenLabs | per hour | Scribe, long-form transcription |
Text-to-Speech (TTS) — 13 models
Use with POST /v1/voice/synthesize or as tts_model on POST /v1/voice/chat.
| Slug | Provider | Pricing Unit | Notes |
|---|---|---|---|
openai/tts-1 | OpenAI | per 1M characters | Standard OpenAI TTS |
openai/tts-1-hd | OpenAI | per 1M characters | Higher fidelity OpenAI TTS |
openai/gpt-4o-mini-tts | OpenAI | per mtok (mixed) | Default, cheap, good quality |
deepgram/aura-1 | Deepgram | per 1M characters | Deepgram Aura-1 voices |
deepgram/aura-2 | Deepgram | per 1M characters | Deepgram Aura-2 voices (latest) |
elevenlabs/eleven_flash_v2 | ElevenLabs | per 1M characters | Fast, natural, cost-effective |
elevenlabs/eleven_flash_v2_5 | ElevenLabs | per 1M characters | Improved Flash V2.5 |
elevenlabs/eleven_multilingual_v2 | ElevenLabs | per 1M characters | Multilingual, high quality |
elevenlabs/eleven_v3 | ElevenLabs | per 1M characters | Latest ElevenLabs flagship |
google/gemini-2.5-flash-preview-tts | per mtok (mixed) | Gemini 2.5 Flash TTS | |
google/gemini-2.5-pro-preview-tts | per mtok (mixed) | Gemini 2.5 Pro TTS |
Realtime (bidirectional audio) — 3 models
Use with GET /v1/voice/realtime (WebSocket upgrade).
| Slug | Provider | Notes |
|---|---|---|
openai/gpt-4o-realtime-preview | OpenAI | Full-duplex voice, tools, interruptions |
openai/gpt-4o-mini-realtime-preview | OpenAI | Cheaper realtime variant |
google/gemini-2.0-flash-live | Gemini Live API bidirectional |
Audio LLMs — 7 models
Text + audio input, text + audio output. Use via POST /v1/chat/completions with audio content parts, or via the voice endpoints.
| Slug | Provider (Route) | Notes |
|---|---|---|
openai/gpt-4o-audio-preview | OpenAI direct | Audio-aware GPT-4o |
openai/gpt-4o-mini-audio-preview | OpenAI direct | Cheaper audio-aware variant |
openrouter/openai/gpt-audio | OpenRouter → OpenAI | Routed via OpenRouter key pool |
openrouter/openai/gpt-audio-mini | OpenRouter → OpenAI | Routed via OpenRouter key pool |
openrouter/openai/gpt-4o-audio-preview | OpenRouter → OpenAI | Routed via OpenRouter key pool |
openrouter/mistralai/voxtral-small-24b | OpenRouter → Mistral | Voxtral for audio understanding |
openrouter/xiaomi/mimo-v2-omni | OpenRouter → Xiaomi | Omni audio-visual-text model |
Music Generation — 2 models
Text → audio (music). Use via POST /v1/chat/completions with the model slug.
| Slug | Provider | Notes |
|---|---|---|
openrouter/google/lyria-3-pro | Google (via OpenRouter) | Lyria 3 Pro music generation |
openrouter/google/lyria-3-clip | Google (via OpenRouter) | Lyria 3 Clip short music snippets |
Speech-to-Speech — 1 model
Audio → audio transformation (voice conversion).
| Slug | Provider | Notes |
|---|---|---|
elevenlabs/eleven_multilingual_sts_v2 | ElevenLabs | Voice conversion, multilingual |
Auto-routing defaults
If you omit stt_model / tts_model, Xantly auto-routes:
- STT: English →
deepgram/nova-2, other languages →openai/whisper-1 - TTS:
openai/gpt-4o-mini-ttsby default, or ElevenLabs/Deepgram/Groq whenprovideris set on the request
Key rotation and fallback
All models that go through OpenRouter, NVIDIA, or Groq use round-robin key rotation across your configured API key pool (e.g. GROQ_API_KEY_1..6). On a per-key 402 / 429, requests automatically waterfall to the next key in the pool. See the guides index for reliability details.
Usage example
# Transcribe with Groq Whisper (cheapest STT)
curl -X POST https://api.xantly.com/v1/voice/transcribe \
-H "Authorization: Bearer $XANTLY_API_KEY" \
-F "audio=@input.wav" \
-F "stt_model=groq/whisper-large-v3-turbo"
# Synthesize with ElevenLabs Flash (low-latency TTS)
curl -X POST https://api.xantly.com/v1/voice/synthesize \
-H "Authorization: Bearer $XANTLY_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"text": "Hello from Xantly",
"model": "elevenlabs/eleven_flash_v2_5",
"voice": "21m00Tcm4TlvDq8ikWAM",
"output_format": "pcm_16000"
}' --output hello.pcm
# Full pipeline with explicit STT + TTS picks
curl -X POST https://api.xantly.com/v1/voice/chat \
-H "Authorization: Bearer $XANTLY_API_KEY" \
-F "audio=@question.wav" \
-F "stt_model=groq/whisper-large-v3-turbo" \
-F "tts_model=deepgram/aura-2"Images
Generate images from text prompts using the image generation API. Requests are proxied to OpenAI's DALL-E API with automatic BYOK key resolution.
Voice Billing
Voice requests use the same API key, the same budget pool, and the same monthly invoice as chat. What differs is the pricing unit (minutes and characters, not tokens) and the addition of a per-minute