Streaming Responses
Stream tokens from any model as they are generated using standard Server-Sent Events (SSE). Works with every OpenAI-compatible SDK — just set stream: true.
Enabling Streaming
Add "stream": true to your request body:
curl -sS https://api.xantly.com/v1/chat/completions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Write a short poem about distributed systems."}
    ]
  }'

The response is a text/event-stream with chat.completion.chunk events:
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1741400100,"model":"deepseek-chat","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1741400100,"model":"deepseek-chat","choices":[{"index":0,"delta":{"content":"Packets"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1741400100,"model":"deepseek-chat","choices":[{"index":0,"delta":{"content":" scatter"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1741400100,"model":"deepseek-chat","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]

SSE Format
Every stream event follows the standard SSE framing:
| Part | Description |
|---|---|
| data: <json> | A ChatCompletionChunk JSON object |
| data: [DONE] | Terminal sentinel — always the last event |
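As a sketch of the framing above, a minimal stdlib-only parser that pulls chunk objects out of raw data: lines (the helper name parse_sse_lines is illustrative, not part of any SDK):

```python
import json

def parse_sse_lines(lines):
    """Collect ChatCompletionChunk dicts from raw SSE lines, stopping at [DONE]."""
    chunks = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and SSE comment lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # terminal sentinel: nothing meaningful follows
        chunks.append(json.loads(payload))
    return chunks
```

In practice an SDK handles this framing for you; parsing by hand is mainly useful for debugging a raw stream.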
ChatCompletionChunk fields
| Field | Type | Description |
|---|---|---|
| id | string | Shared across all chunks in the same stream |
| object | string | Always "chat.completion.chunk" |
| created | integer | Unix timestamp of the stream start |
| model | string | Model that served the request |
| choices | array | One ChunkChoice per requested completion (the n parameter) |
| choices[].index | integer | Choice index (0-based) |
| choices[].delta | object | Incremental content — may have role, content, tool_calls, or be empty {} |
| choices[].finish_reason | string? | null until the final chunk; then "stop", "length", "tool_calls", etc. |
Streaming semantics: Non-voice stream mode is SSE-compatible but not guaranteed token-by-token. The gateway may batch tokens before flushing for efficiency on some provider paths.
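When delta carries tool_calls, each chunk holds an index-keyed fragment of a call: the id and function name typically arrive on the first fragment, and the JSON arguments string streams in pieces. A hedged sketch of reassembling fragments from parsed chunk dicts (merge_tool_calls is a hypothetical helper; field shapes follow the chunk format above):

```python
def merge_tool_calls(chunks):
    """Merge streamed tool_call fragments (keyed by index) into complete calls."""
    calls = {}
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            for frag in choice.get("delta", {}).get("tool_calls") or []:
                slot = calls.setdefault(frag["index"], {"id": "", "name": "", "arguments": ""})
                if frag.get("id"):
                    slot["id"] = frag["id"]  # id arrives once, on the first fragment
                fn = frag.get("function", {})
                if fn.get("name"):
                    slot["name"] += fn["name"]
                if fn.get("arguments"):
                    slot["arguments"] += fn["arguments"]  # JSON string streams in pieces
    return [calls[i] for i in sorted(calls)]
```

Only parse the merged arguments string as JSON after the stream finishes with finish_reason "tool_calls"; mid-stream it is usually an incomplete JSON fragment.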
Handling Stream Chunks
Python (openai SDK)
from openai import OpenAI
client = OpenAI(
    api_key="your-xantly-key",
    base_url="https://api.xantly.com/v1",
)
stream = client.chat.completions.create(
    model="auto",
    stream=True,
    messages=[{"role": "user", "content": "Explain async I/O in 3 bullets."}],
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()  # newline after stream ends

Node.js (openai SDK)
import OpenAI from 'openai';
const client = new OpenAI({
  apiKey: process.env.XANTLY_API_KEY,
  baseURL: 'https://api.xantly.com/v1',
});
const stream = await client.chat.completions.create({
  model: 'auto',
  stream: true,
  messages: [{ role: 'user', content: 'Explain async I/O in 3 bullets.' }],
});
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content ?? '';
  process.stdout.write(content);
}
console.log(); // newline

curl with manual parsing
curl -sS https://api.xantly.com/v1/chat/completions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","stream":true,"messages":[{"role":"user","content":"Hello"}]}' \
  | while IFS= read -r line; do
      if [[ "$line" == data:* ]]; then
        payload="${line#data: }"
        [[ "$payload" == "[DONE]" ]] && break
        echo "$payload" | python3 -c "
import sys, json
d = json.load(sys.stdin)
c = d['choices'][0]['delta'].get('content','')
print(c, end='', flush=True)
"
      fi
    done

Stream Options
Include usage in the final chunk
Set stream_options.include_usage to receive a terminal usage chunk before [DONE]:
{
  "model": "auto",
  "stream": true,
  "stream_options": {
    "include_usage": true
  },
  "messages": [...]
}

When supported by the provider, the last data: chunk before [DONE] will contain a usage field:
{
  "id": "chatcmpl-abc",
  "object": "chat.completion.chunk",
  "choices": [],
  "usage": {
    "prompt_tokens": 31,
    "completion_tokens": 87,
    "total_tokens": 118
  }
}
stream_options.include_usage is forwarded to providers where supported. A terminal usage chunk is not guaranteed on all provider paths — do not treat its absence as an error.
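Because the terminal usage chunk may or may not arrive, read it defensively. A minimal sketch over parsed chunk dicts (final_usage is an illustrative helper, not an SDK function):

```python
def final_usage(chunks):
    """Return the usage dict from the terminal usage chunk, or None if absent."""
    # Scan from the end: when present, usage rides on the last chunk before [DONE].
    for chunk in reversed(list(chunks)):
        if chunk.get("usage") is not None:
            return chunk["usage"]
    return None
```

Treat a None result as "usage unavailable for this stream", not as a failure.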
Error Handling in Streams
If an error occurs before the stream starts, you receive a standard 4xx or 5xx JSON response (not SSE). If an error occurs mid-stream, the stream may terminate early without a [DONE] event.
Pre-stream errors
HTTP 400
{
  "error": {
    "message": "temperature (2.5) must be between 0 and 2",
    "type": "invalid_request_error",
    "code": "validation_error"
  }
}

Detecting truncated streams
Always check finish_reason on the last chunk:
| finish_reason | Meaning |
|---|---|
| stop | Normal completion |
| length | Hit max_tokens limit — output may be truncated |
| tool_calls | Model wants to call a tool |
| content_filter | Output filtered by provider |
| null | Stream may have been cut short by an error |
If you reach [DONE] without a chunk carrying a non-null finish_reason, treat the output as potentially incomplete.
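Putting both failure modes together (a stream that ends without a non-null finish_reason, and a mid-stream disconnect), a defensive consumption loop over parsed chunk dicts might look like the sketch below. consume_chunks is an illustrative helper, and the exception types to catch depend on your HTTP client:

```python
def consume_chunks(chunk_iter):
    """Accumulate streamed content and track finish_reason; tolerate a
    mid-stream disconnect by keeping whatever text already arrived."""
    parts, finish_reason = [], None
    try:
        for chunk in chunk_iter:
            for choice in chunk.get("choices", []):
                delta = choice.get("delta", {})
                if delta.get("content"):
                    parts.append(delta["content"])
                if choice.get("finish_reason") is not None:
                    finish_reason = choice["finish_reason"]
    except (ConnectionError, TimeoutError):
        pass  # finish_reason stays None, so the caller treats output as incomplete
    return "".join(parts), finish_reason
```

A None finish_reason in the return value is the signal to retry or flag the response as truncated.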
SDK Examples
LangChain (Python)
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model="auto",
    openai_api_key="your-xantly-key",
    openai_api_base="https://api.xantly.com/v1",
    streaming=True,
)
for chunk in llm.stream("Summarize streaming protocols in 2 sentences."):
    print(chunk.content, end="", flush=True)

Vercel AI SDK (TypeScript)
import { createOpenAI } from '@ai-sdk/openai';
import { streamText } from 'ai';
const xantly = createOpenAI({
  apiKey: process.env.XANTLY_API_KEY!,
  baseURL: 'https://api.xantly.com/v1',
});
const result = await streamText({
  model: xantly('auto'),
  prompt: 'Summarize streaming protocols in 2 sentences.',
});
for await (const textPart of result.textStream) {
  process.stdout.write(textPart);
}

LiteLLM
import litellm
response = litellm.completion(
    model="openai/auto",
    api_base="https://api.xantly.com/v1",
    api_key="your-xantly-key",
    stream=True,
    messages=[{"role": "user", "content": "Hello, stream this."}],
)
for chunk in response:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)

Related
- Chat Completions API Reference — full parameter reference including stream_options
- Benchmark Results — streaming validated across 10 SDK clients