XantlyANTLY
API Reference

Voice Models Catalog

Xantly exposes 33 voice models across 6 providers. Use the sttmodel / ttsmodel parameters on the voice endpoints to pick a specific model, or let Xantly auto-route based on language and latency budget


title: Voice Models Catalog description: All voice models available via the Xantly Voice API, grouped by type and provider.

Xantly exposes 33 voice models across 6 providers. Use the stt_model / tts_model parameters on the voice endpoints to pick a specific model, or let Xantly auto-route based on language and latency budget.

All models route through the same API key as chat — there is no separate voice subscription. Voice usage bills per-minute (provider pass-through) plus a per-minute platform fee. See voice-agents guide for the billing breakdown.


Speech-to-Text (STT) — 10 models

Use with POST /v1/voice/transcribe or as stt_model on POST /v1/voice/chat.

SlugProviderPricing UnitNotes
openai/whisper-1OpenAIper minuteClassic Whisper, 50+ languages
openai/gpt-4o-transcribeOpenAIper minuteImproved accuracy over Whisper
openai/gpt-4o-mini-transcribeOpenAIper minuteCost-optimized GPT-4o transcribe
deepgram/nova-2Deepgramper minuteDefault for English, low latency
deepgram/nova-3Deepgramper minuteLatest Deepgram model, top accuracy
deepgram/fluxDeepgramper minuteStreaming-optimized
groq/whisper-large-v3Groqper hourWhisper on Groq LPU (217x real-time)
groq/whisper-large-v3-turboGroqper hourFaster Whisper variant (cheaper)
elevenlabs/scribe_v2ElevenLabsper hourScribe, long-form transcription

Text-to-Speech (TTS) — 13 models

Use with POST /v1/voice/synthesize or as tts_model on POST /v1/voice/chat.

SlugProviderPricing UnitNotes
openai/tts-1OpenAIper 1M charactersStandard OpenAI TTS
openai/tts-1-hdOpenAIper 1M charactersHigher fidelity OpenAI TTS
openai/gpt-4o-mini-ttsOpenAIper mtok (mixed)Default, cheap, good quality
deepgram/aura-1Deepgramper 1M charactersDeepgram Aura-1 voices
deepgram/aura-2Deepgramper 1M charactersDeepgram Aura-2 voices (latest)
elevenlabs/eleven_flash_v2ElevenLabsper 1M charactersFast, natural, cost-effective
elevenlabs/eleven_flash_v2_5ElevenLabsper 1M charactersImproved Flash V2.5
elevenlabs/eleven_multilingual_v2ElevenLabsper 1M charactersMultilingual, high quality
elevenlabs/eleven_v3ElevenLabsper 1M charactersLatest ElevenLabs flagship
google/gemini-2.5-flash-preview-ttsGoogleper mtok (mixed)Gemini 2.5 Flash TTS
google/gemini-2.5-pro-preview-ttsGoogleper mtok (mixed)Gemini 2.5 Pro TTS

Realtime (bidirectional audio) — 3 models

Use with GET /v1/voice/realtime (WebSocket upgrade).

SlugProviderNotes
openai/gpt-4o-realtime-previewOpenAIFull-duplex voice, tools, interruptions
openai/gpt-4o-mini-realtime-previewOpenAICheaper realtime variant
google/gemini-2.0-flash-liveGoogleGemini Live API bidirectional

Audio LLMs — 7 models

Text + audio input, text + audio output. Use via POST /v1/chat/completions with audio content parts, or via the voice endpoints.

SlugProvider (Route)Notes
openai/gpt-4o-audio-previewOpenAI directAudio-aware GPT-4o
openai/gpt-4o-mini-audio-previewOpenAI directCheaper audio-aware variant
openrouter/openai/gpt-audioOpenRouter → OpenAIRouted via OpenRouter key pool
openrouter/openai/gpt-audio-miniOpenRouter → OpenAIRouted via OpenRouter key pool
openrouter/openai/gpt-4o-audio-previewOpenRouter → OpenAIRouted via OpenRouter key pool
openrouter/mistralai/voxtral-small-24bOpenRouter → MistralVoxtral for audio understanding
openrouter/xiaomi/mimo-v2-omniOpenRouter → XiaomiOmni audio-visual-text model

Music Generation — 2 models

Text → audio (music). Use via POST /v1/chat/completions with the model slug.

SlugProviderNotes
openrouter/google/lyria-3-proGoogle (via OpenRouter)Lyria 3 Pro music generation
openrouter/google/lyria-3-clipGoogle (via OpenRouter)Lyria 3 Clip short music snippets

Speech-to-Speech — 1 model

Audio → audio transformation (voice conversion).

SlugProviderNotes
elevenlabs/eleven_multilingual_sts_v2ElevenLabsVoice conversion, multilingual

Auto-routing defaults

If you omit stt_model / tts_model, Xantly auto-routes:

  • STT: English → deepgram/nova-2, other languages → openai/whisper-1
  • TTS: openai/gpt-4o-mini-tts by default, or ElevenLabs/Deepgram/Groq when provider is set on the request

Key rotation and fallback

All models that go through OpenRouter, NVIDIA, or Groq use round-robin key rotation across your configured API key pool (e.g. GROQ_API_KEY_1..6). On a per-key 402 / 429, requests automatically waterfall to the next key in the pool. See the guides index for reliability details.

Usage example

# Transcribe with Groq Whisper (cheapest STT)
curl -X POST https://api.xantly.com/v1/voice/transcribe \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -F "audio=@input.wav" \
  -F "stt_model=groq/whisper-large-v3-turbo"

# Synthesize with ElevenLabs Flash (low-latency TTS)
curl -X POST https://api.xantly.com/v1/voice/synthesize \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello from Xantly",
    "model": "elevenlabs/eleven_flash_v2_5",
    "voice": "21m00Tcm4TlvDq8ikWAM",
    "output_format": "pcm_16000"
  }' --output hello.pcm

# Full pipeline with explicit STT + TTS picks
curl -X POST https://api.xantly.com/v1/voice/chat \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -F "audio=@question.wav" \
  -F "stt_model=groq/whisper-large-v3-turbo" \
  -F "tts_model=deepgram/aura-2"

On this page