Audio
Transcribe audio to text, translate audio to English, and generate speech from text. All endpoints proxy to OpenAI's audio APIs with automatic BYOK key resolution.
- POST `/v1/audio/transcriptions` — Speech-to-text (Whisper)
- POST `/v1/audio/translations` — Translate audio to English (Whisper)
- POST `/v1/audio/speech` — Text-to-speech
- Auth: `Authorization: Bearer <token>`
- Drop-in compatible with the OpenAI Audio API.
Transcriptions (Speech-to-Text)
Convert audio files to text using OpenAI's Whisper model.
Quick start
```bash
curl -sS https://api.xantly.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -F file="@recording.mp3" \
  -F model="whisper-1"
```

Request body (multipart/form-data)
| Field | Type | Required | Description |
|---|---|---|---|
| file | binary | Yes | Audio file (mp3, mp4, mpeg, mpga, m4a, wav, webm). Max 25 MB. |
| model | string | Yes | Currently "whisper-1". |
| language | string | No | ISO 639-1 language code (e.g. "en", "es"). Improves accuracy. |
| prompt | string | No | Guide the model's style or continue a previous segment. |
| response_format | string | No | "json" (default), "text", "srt", "verbose_json", "vtt". |
| temperature | number | No | Sampling temperature (0.0–1.0). |
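Before uploading, it can help to validate a file locally against the constraints in the table. A minimal sketch (the helper name is illustrative and not part of the API; the format list and size limit come from the table above):

```python
import os

# Accepted formats and size limit, per the request-body table.
SUPPORTED_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB

def validate_audio_file(path: str) -> None:
    """Raise ValueError if the endpoint would reject this file."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: .{ext}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("file exceeds the 25 MB limit")
```

Catching these two cases client-side avoids a round trip that would end in a 400.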
Response body
```json
{
  "text": "Hello, this is a transcription of the audio file."
}
```

Translations (Audio to English)
Translate audio from any supported language into English text using Whisper.
Quick start
```bash
curl -sS https://api.xantly.com/v1/audio/translations \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -F file="@german_audio.mp3" \
  -F model="whisper-1"
```

Request body (multipart/form-data)
| Field | Type | Required | Description |
|---|---|---|---|
| file | binary | Yes | Audio file (mp3, mp4, mpeg, mpga, m4a, wav, webm). Max 25 MB. |
| model | string | Yes | Currently "whisper-1". |
| prompt | string | No | Guide the model's style or continue a previous segment. |
| response_format | string | No | "json" (default), "text", "srt", "verbose_json", "vtt". |
| temperature | number | No | Sampling temperature (0.0–1.0). |
Response body
```json
{
  "text": "Hello, this is the translated text in English."
}
```

Note: Unlike transcriptions, translations always output English regardless of the source language. For same-language transcription, use `/v1/audio/transcriptions` instead.
Speech (Text-to-Speech)
Generate natural-sounding audio from text.
Quick start
```bash
curl -sS https://api.xantly.com/v1/audio/speech \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello world! This is a test of the text to speech API.",
    "voice": "alloy"
  }' \
  --output speech.mp3
```

Request body
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | "tts-1" or "tts-1-hd". |
| input | string | Yes | Text to convert to speech. Max 4096 characters. |
| voice | string | Yes | Voice to use: "alloy", "echo", "fable", "onyx", "nova", "shimmer". |
| response_format | string | No | "mp3" (default), "opus", "aac", "flac", "wav", "pcm". |
| speed | number | No | Speed multiplier (0.25–4.0). Default 1.0. |
Response body
Returns raw audio bytes with the appropriate Content-Type header (e.g. audio/mpeg for mp3).
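Two client-side details worth handling here are the 4096-character input limit and the Content-Type to expect for each format. A sketch (the chunking helper is illustrative; the MIME entries other than `audio/mpeg` are assumptions, since only the mp3 mapping is documented above):

```python
MAX_INPUT_CHARS = 4096  # per-request input limit from the table above

# Expected Content-Type per response_format. Only mp3 -> audio/mpeg is
# documented; the rest are assumed conventional values.
AUDIO_MIME = {
    "mp3": "audio/mpeg", "opus": "audio/opus", "aac": "audio/aac",
    "flac": "audio/flac", "wav": "audio/wav", "pcm": "audio/pcm",
}

def chunk_text(text: str, limit: int = MAX_INPUT_CHARS) -> list[str]:
    """Split long input into <=limit chunks, preferring sentence breaks."""
    chunks = []
    while len(text) > limit:
        cut = text.rfind(". ", 0, limit)
        cut = cut + 1 if cut != -1 else limit  # keep the period, or hard-split
        chunks.append(text[:cut].strip())
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```

Each chunk can then be sent as a separate `/v1/audio/speech` request and the audio segments concatenated or played in sequence.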
Code examples
Transcription
```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XANTLY_API_KEY"],
    base_url="https://api.xantly.com/v1",
)

with open("recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcript.text)
```

```javascript
import OpenAI from "openai";
import fs from "fs";

const client = new OpenAI({
  apiKey: process.env.XANTLY_API_KEY,
  baseURL: "https://api.xantly.com/v1",
});

const transcript = await client.audio.transcriptions.create({
  model: "whisper-1",
  file: fs.createReadStream("recording.mp3"),
});
console.log(transcript.text);
```

Speech generation
```python
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Welcome to the Xantly platform!",
)
with open("output.mp3", "wb") as f:
    f.write(response.content)
```

```javascript
const response = await client.audio.speech.create({
  model: "tts-1",
  voice: "alloy",
  input: "Welcome to the Xantly platform!",
});
const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync("output.mp3", buffer);
```

BYOK support
Audio endpoints automatically resolve your organization's BYOK OpenAI key; if no BYOK key is configured, the platform key is used. The upload endpoints (transcriptions and translations) use a 120-second timeout to accommodate large audio files.
Errors
| HTTP | error.type | Typical trigger |
|---|---|---|
| 400 | invalid_request_error | Invalid multipart data, unsupported audio format. |
| 401 | authentication_error | Missing or invalid Bearer token. |
| 500 | provider_error | No OpenAI API key configured, or upstream error. |
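Given the table above, one reasonable client-side policy is to retry only 5xx provider errors, since a 400 or 401 will fail identically on resend. A sketch (the helper is illustrative, not part of the API):

```python
def should_retry(status: int) -> bool:
    """Client-side retry policy for the audio endpoints.

    400 and 401 reflect a bad request or bad credentials and will fail
    again unchanged; 5xx provider errors may be transient upstream.
    """
    return status >= 500
```

Pair this with a short exponential backoff so transient upstream errors get a second chance without hammering the endpoint.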
Next steps
- Chat Completions — Main inference endpoint
- Models — List available models