Audio
Transcribe audio to text, translate audio to English, and generate speech from text. All endpoints proxy to OpenAI's audio APIs with automatic BYOK key resolution.
- POST `/v1/audio/transcriptions` — Speech-to-text (Whisper)
- POST `/v1/audio/translations` — Translate audio to English (Whisper)
- POST `/v1/audio/speech` — Text-to-speech
- Auth: `Authorization: Bearer <token>`
- Drop-in compatible with the OpenAI Audio API.
Transcriptions (Speech-to-Text)
Convert audio files to text using OpenAI's Whisper model.
Quick start
```bash
curl -sS https://api.xantly.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -F file="@recording.mp3" \
  -F model="whisper-1"
```

Request body (multipart/form-data)
| Field | Type | Required | Description |
|---|---|---|---|
| file | binary | Yes | Audio file (mp3, mp4, mpeg, mpga, m4a, wav, webm). Max 25 MB. |
| model | string | Yes | Currently "whisper-1". |
| language | string | No | ISO 639-1 language code (e.g. "en", "es"). Improves accuracy. |
| prompt | string | No | Guide the model's style or continue a previous segment. |
| response_format | string | No | "json" (default), "text", "srt", "verbose_json", "vtt". |
| temperature | number | No | Sampling temperature (0.0–1.0). |
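Before uploading, it can help to validate a file locally against the constraints in the table. A minimal sketch (the helper name is illustrative and not part of the API; the format list and size limit come from the table above):

```python
import os

# Accepted formats and size limit, per the request-body table.
SUPPORTED_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB

def validate_audio_file(path: str) -> None:
    """Raise ValueError if the endpoint would reject this file."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: .{ext}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("file exceeds the 25 MB limit")
```

Catching these two cases client-side avoids a round trip that would end in a 400.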
Response body
```json
{
  "text": "Hello, this is a transcription of the audio file."
}
```

Translations (Audio to English)
Translate audio from any supported language into English text using Whisper.
Quick start
```bash
curl -sS https://api.xantly.com/v1/audio/translations \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -F file="@german_audio.mp3" \
  -F model="whisper-1"
```

Request body (multipart/form-data)
| Field | Type | Required | Description |
|---|---|---|---|
| file | binary | Yes | Audio file (mp3, mp4, mpeg, mpga, m4a, wav, webm). Max 25 MB. |
| model | string | Yes | Currently "whisper-1". |
| prompt | string | No | Guide the model's style or continue a previous segment. |
| response_format | string | No | "json" (default), "text", "srt", "verbose_json", "vtt". |
| temperature | number | No | Sampling temperature (0.0–1.0). |
Response body
```json
{
  "text": "Hello, this is the translated text in English."
}
```

Note: Unlike transcriptions, translations always output English regardless of the source language. For same-language transcription, use `/v1/audio/transcriptions` instead.
Speech (Text-to-Speech)
Generate natural-sounding audio from text.
Quick start
```bash
curl -sS https://api.xantly.com/v1/audio/speech \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello world! This is a test of the text to speech API.",
    "voice": "alloy"
  }' \
  --output speech.mp3
```

Request body
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | "tts-1" or "tts-1-hd". |
| input | string | Yes | Text to convert to speech. Max 4096 characters. |
| voice | string | Yes | Voice to use: "alloy", "echo", "fable", "onyx", "nova", "shimmer". |
| response_format | string | No | "mp3" (default), "opus", "aac", "flac", "wav", "pcm". |
| speed | number | No | Speed multiplier (0.25–4.0). Default 1.0. |
Response body
Returns raw audio bytes with the appropriate Content-Type header (e.g. audio/mpeg for mp3).
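Two client-side details worth handling here are the 4096-character input limit and the Content-Type to expect for each format. A sketch (the chunking helper is illustrative; the MIME entries other than `audio/mpeg` are assumptions, since only the mp3 mapping is documented above):

```python
MAX_INPUT_CHARS = 4096  # per-request input limit from the table above

# Expected Content-Type per response_format. Only mp3 -> audio/mpeg is
# documented; the rest are assumed conventional values.
AUDIO_MIME = {
    "mp3": "audio/mpeg", "opus": "audio/opus", "aac": "audio/aac",
    "flac": "audio/flac", "wav": "audio/wav", "pcm": "audio/pcm",
}

def chunk_text(text: str, limit: int = MAX_INPUT_CHARS) -> list[str]:
    """Split long input into <=limit chunks, preferring sentence breaks."""
    chunks = []
    while len(text) > limit:
        cut = text.rfind(". ", 0, limit)
        cut = cut + 1 if cut != -1 else limit  # keep the period, or hard-split
        chunks.append(text[:cut].strip())
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```

Each chunk can then be sent as a separate `/v1/audio/speech` request and the audio segments concatenated or played in sequence.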
Code examples
Transcription
```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XANTLY_API_KEY"],
    base_url="https://api.xantly.com/v1",
)

with open("recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcript.text)
```

```javascript
import OpenAI from "openai";
import fs from "fs";

const client = new OpenAI({
  apiKey: process.env.XANTLY_API_KEY,
  baseURL: "https://api.xantly.com/v1",
});

const transcript = await client.audio.transcriptions.create({
  model: "whisper-1",
  file: fs.createReadStream("recording.mp3"),
});
console.log(transcript.text);
```

Speech generation
```python
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Welcome to the Xantly platform!",
)
with open("output.mp3", "wb") as f:
    f.write(response.content)
```

```javascript
const response = await client.audio.speech.create({
  model: "tts-1",
  voice: "alloy",
  input: "Welcome to the Xantly platform!",
});
const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync("output.mp3", buffer);
```

BYOK support
Audio endpoints automatically resolve your organization's BYOK OpenAI key; if no BYOK key is configured, the platform key is used. The upload endpoints (transcriptions and translations) use a 120-second timeout to accommodate large audio files.
Errors
| HTTP | error.type | Typical trigger |
|---|---|---|
| 400 | invalid_request_error | Invalid multipart data, unsupported audio format. |
| 401 | authentication_error | Missing or invalid Bearer token. |
| 500 | provider_error | No OpenAI API key configured, or upstream error. |
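Given the table above, one reasonable client-side policy is to retry only 5xx provider errors, since a 400 or 401 will fail identically on resend. A sketch (the helper is illustrative, not part of the API):

```python
def should_retry(status: int) -> bool:
    """Client-side retry policy for the audio endpoints.

    400 and 401 reflect a bad request or bad credentials and will fail
    again unchanged; 5xx provider errors may be transient upstream.
    """
    return status >= 500
```

Pair this with a short exponential backoff so transient upstream errors get a second chance without hammering the endpoint.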
Next steps
- Chat Completions — Main inference endpoint
- Models — List available models