Voice Models Catalog

Xantly exposes 33 voice models across 6 providers. Use the sttmodel / ttsmodel parameters on the voice endpoints to pick a specific model, or let Xantly auto-route based on language and latency budget

title: Voice Models Catalog description: All voice models available via the Xantly Voice API, grouped by type and provider.

Xantly exposes 33 voice models across 6 providers. Use the stt_model / tts_model parameters on the voice endpoints to pick a specific model, or let Xantly auto-route based on language and latency budget.

All models route through the same API key as chat — there is no separate voice subscription. Voice usage bills per-minute (provider pass-through) plus a per-minute platform fee. See voice-agents guide for the billing breakdown.

Speech-to-Text (STT) — 10 models

Use with POST /v1/voice/transcribe or as stt_model on POST /v1/voice/chat.

Slug	Provider	Pricing Unit	Notes
`openai/whisper-1`	OpenAI	per minute	Classic Whisper, 50+ languages
`openai/gpt-4o-transcribe`	OpenAI	per minute	Improved accuracy over Whisper
`openai/gpt-4o-mini-transcribe`	OpenAI	per minute	Cost-optimized GPT-4o transcribe
`deepgram/nova-2`	Deepgram	per minute	Default for English, low latency
`deepgram/nova-3`	Deepgram	per minute	Latest Deepgram model, top accuracy
`deepgram/flux`	Deepgram	per minute	Streaming-optimized
`groq/whisper-large-v3`	Groq	per hour	Whisper on Groq LPU (217x real-time)
`groq/whisper-large-v3-turbo`	Groq	per hour	Faster Whisper variant (cheaper)
`elevenlabs/scribe_v2`	ElevenLabs	per hour	Scribe, long-form transcription

Text-to-Speech (TTS) — 13 models

Use with POST /v1/voice/synthesize or as tts_model on POST /v1/voice/chat.

Slug	Provider	Pricing Unit	Notes
`openai/tts-1`	OpenAI	per 1M characters	Standard OpenAI TTS
`openai/tts-1-hd`	OpenAI	per 1M characters	Higher fidelity OpenAI TTS
`openai/gpt-4o-mini-tts`	OpenAI	per mtok (mixed)	Default, cheap, good quality
`deepgram/aura-1`	Deepgram	per 1M characters	Deepgram Aura-1 voices
`deepgram/aura-2`	Deepgram	per 1M characters	Deepgram Aura-2 voices (latest)
`elevenlabs/eleven_flash_v2`	ElevenLabs	per 1M characters	Fast, natural, cost-effective
`elevenlabs/eleven_flash_v2_5`	ElevenLabs	per 1M characters	Improved Flash V2.5
`elevenlabs/eleven_multilingual_v2`	ElevenLabs	per 1M characters	Multilingual, high quality
`elevenlabs/eleven_v3`	ElevenLabs	per 1M characters	Latest ElevenLabs flagship
`google/gemini-2.5-flash-preview-tts`	Google	per mtok (mixed)	Gemini 2.5 Flash TTS
`google/gemini-2.5-pro-preview-tts`	Google	per mtok (mixed)	Gemini 2.5 Pro TTS

Realtime (bidirectional audio) — 3 models

Use with GET /v1/voice/realtime (WebSocket upgrade).

Slug	Provider	Notes
`openai/gpt-4o-realtime-preview`	OpenAI	Full-duplex voice, tools, interruptions
`openai/gpt-4o-mini-realtime-preview`	OpenAI	Cheaper realtime variant
`google/gemini-2.0-flash-live`	Google	Gemini Live API bidirectional

Audio LLMs — 7 models

Text + audio input, text + audio output. Use via POST /v1/chat/completions with audio content parts, or via the voice endpoints.

Slug	Provider (Route)	Notes
`openai/gpt-4o-audio-preview`	OpenAI direct	Audio-aware GPT-4o
`openai/gpt-4o-mini-audio-preview`	OpenAI direct	Cheaper audio-aware variant
`openrouter/openai/gpt-audio`	OpenRouter → OpenAI	Routed via OpenRouter key pool
`openrouter/openai/gpt-audio-mini`	OpenRouter → OpenAI	Routed via OpenRouter key pool
`openrouter/openai/gpt-4o-audio-preview`	OpenRouter → OpenAI	Routed via OpenRouter key pool
`openrouter/mistralai/voxtral-small-24b`	OpenRouter → Mistral	Voxtral for audio understanding
`openrouter/xiaomi/mimo-v2-omni`	OpenRouter → Xiaomi	Omni audio-visual-text model

Music Generation — 2 models

Text → audio (music). Use via POST /v1/chat/completions with the model slug.

Slug	Provider	Notes
`openrouter/google/lyria-3-pro`	Google (via OpenRouter)	Lyria 3 Pro music generation
`openrouter/google/lyria-3-clip`	Google (via OpenRouter)	Lyria 3 Clip short music snippets

Speech-to-Speech — 1 model

Audio → audio transformation (voice conversion).

Slug	Provider	Notes
`elevenlabs/eleven_multilingual_sts_v2`	ElevenLabs	Voice conversion, multilingual

Auto-routing defaults

If you omit stt_model / tts_model, Xantly auto-routes:

STT: English → deepgram/nova-2, other languages → openai/whisper-1
TTS: openai/gpt-4o-mini-tts by default, or ElevenLabs/Deepgram/Groq when provider is set on the request

Key rotation and fallback

All models that go through OpenRouter, NVIDIA, or Groq use round-robin key rotation across your configured API key pool (e.g. GROQ_API_KEY_1..6). On a per-key 402 / 429, requests automatically waterfall to the next key in the pool. See the guides index for reliability details.

Usage example

# Transcribe with Groq Whisper (cheapest STT)
curl -X POST https://api.xantly.com/v1/voice/transcribe \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -F "audio=@input.wav" \
  -F "stt_model=groq/whisper-large-v3-turbo"

# Synthesize with ElevenLabs Flash (low-latency TTS)
curl -X POST https://api.xantly.com/v1/voice/synthesize \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello from Xantly",
    "model": "elevenlabs/eleven_flash_v2_5",
    "voice": "21m00Tcm4TlvDq8ikWAM",
    "output_format": "pcm_16000"
  }' --output hello.pcm

# Full pipeline with explicit STT + TTS picks
curl -X POST https://api.xantly.com/v1/voice/chat \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -F "audio=@question.wav" \
  -F "stt_model=groq/whisper-large-v3-turbo" \
  -F "tts_model=deepgram/aura-2"

On this page