Chat Completions
Create completions with OpenAI-compatible request/response shapes plus optional Xantly orchestration controls.
- POST `/v1/chat/completions`
- HEAD `/v1/chat/completions` (returns `204 No Content`; useful as a lightweight probe)
- Auth: `Authorization: Bearer <token>`

Recommended rollout: start with standard fields only (`model`, `messages`, `temperature`, `max_tokens`), then add orchestration controls incrementally.
Quick start example
```bash
curl -sS https://api.xantly.com/v1/chat/completions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "Explain vector databases in 2 bullets."}
    ]
  }'
```

Request body
Standard parameters (OpenAI-compatible)
| Field | Type | Required | Default | Validation / behavior |
|---|---|---|---|---|
model | string | Yes | — | Use "auto" (recommended) or a valid catalog model slug/upstream name. Unknown models are rejected when catalog is loaded. |
messages | array<ChatMessage> | Yes | — | Must contain at least 1 message. |
stream | boolean | No | false | Enables SSE output. |
n | integer | No | 1 | Allowed range: 1..=8. |
max_tokens | integer | No | provider/model dependent | Alias supported: max_completion_tokens. |
temperature | number | No | provider/model dependent | Validated range: 0.0..=2.0. |
top_p | number | No | provider/model dependent | Passed through to provider. |
frequency_penalty | number | No | provider/model dependent | Validated range: -2.0..=2.0. |
presence_penalty | number | No | provider/model dependent | Validated range: -2.0..=2.0. |
stop | string \| array<string> | No | null | Passed through to provider. |
tools | array<object> | No | null | Tool definitions for function-calling workflows. |
tool_choice | string \| object | No | provider dependent | Passed through to provider. |
parallel_tool_calls | boolean | No | provider dependent | Passed through to provider. |
response_format | object | No | null | Supports {"type":"json_object"} and {"type":"json_schema", "json_schema": {...}}. |
seed | integer (u64) | No | null | Determinism hint. |
user | string | No | null | End-user identifier, forwarded to provider. |
logprobs | boolean | No | null | Forwarded to provider. |
top_logprobs | integer (u8) | No | null | Validated max: 20. |
stream_options.include_usage | boolean | No | false | Forwarded to provider where supported. |
reasoning_effort | string | No | null | Accepted values: low, medium, high (provider-dependent behavior). |
service_tier | string | No | null | "batch" is accepted and treated as a cost-oriented routing hint. |
metadata | object<string,string> | No | null | Free-form metadata map used for tracing/routing context. |
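The validation rules above can be mirrored client-side so that out-of-range values fail fast before a network round trip. The helper below is an illustrative sketch, not part of any official SDK; only the field names and documented ranges come from the table.

```python
# Client-side sketch of the documented validation rules for standard
# parameters. build_request is a hypothetical helper, not a Xantly SDK call.

def build_request(model, messages, **params):
    if not messages:
        raise ValueError("messages must contain at least 1 message")
    body = {"model": model, "messages": messages}

    # Documented ranges from the table above.
    ranges = {
        "temperature": (0.0, 2.0),
        "frequency_penalty": (-2.0, 2.0),
        "presence_penalty": (-2.0, 2.0),
        "n": (1, 8),
        "top_logprobs": (0, 20),
    }
    for key, value in params.items():
        if key in ranges and value is not None:
            lo, hi = ranges[key]
            if not (lo <= value <= hi):
                raise ValueError(f"{key} ({value}) must be between {lo} and {hi}")
        body[key] = value  # everything else passes through unchanged
    return body

req = build_request(
    "auto",
    [{"role": "user", "content": "Explain vector databases in 2 bullets."}],
    temperature=0.7,
    n=2,
)
```

Rejecting invalid values locally produces the same class of failure the gateway would return as a `400 validation_error`, just without the round trip.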
ChatMessage object
| Field | Type | Required | Notes |
|---|---|---|---|
role | string | Yes | Common values: system, user, assistant, tool. |
content | string \| array | Yes | String for text, array for multimodal content parts. |
name | string | No | Optional participant name. |
tool_calls | array<object> | No | Assistant tool call payloads. |
tool_call_id | string | No | Correlates tool output to a prior tool call. |
refusal | string | No | Optional refusal text. |
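The `tool_calls` / `tool_call_id` pairing is easiest to see in a full function-calling round trip. The sequence below is illustrative: the tool name `get_weather` and the call id are made up for the example.

```python
# Illustrative ChatMessage sequence for a function-calling round trip.
# The assistant emits a tool call; the tool result is correlated back
# to it via tool_call_id.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_abc123",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
        }],
    },
    # The tool message's tool_call_id must match the assistant's call id.
    {"role": "tool", "tool_call_id": "call_abc123", "content": '{"temp_c": 18}'},
]
```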
Proprietary parameters (Xantly)
These are optional. If omitted, the gateway uses defaults and automatic policy.
routing_hints (soft preferences)
| Field | Type | Value range / values | Backend status | Usage guidance |
|---|---|---|---|---|
mode | string | fast, balanced, quality, cost_optimized, free_models_only | Active | Coarse routing preset when explicit preference knobs are not set. |
preference_dial | number | Best used in 0.0..1.0 (values are clamped) | Active | Lower biases cost/speed; higher biases quality. |
prefer_latency | boolean | true/false | Active | true strongly biases low-latency execution. |
prefer_quality | boolean | true/false | Partial | Accepted; currently primarily suppresses mode preset behavior when explicit knobs are present. |
max_cost_per_token | number | positive float | Reserved / advisory | Accepted for forward compatibility. Do not rely on strict enforcement yet. |
max_latency_ms | integer | positive integer | Active | Sets latency budget; very low budgets bias faster lanes. |
max_tier | integer | 1, 2, 3 | Active | Tier guardrail hint. |
required_capabilities | array<string> | free-form strings | Reserved / advisory | Accepted for forward compatibility; not a strict hard filter in this handler path. |
task_complexity | string | trivial, standard, complex, expert | Active | Influences tier floor/ceiling behavior. |
chain_routing | string | sticky, mixed | Active | mixed disables sticky-route continuation behavior. |
allow_free_fallback | boolean | true/false | Active | Passed into metadata as an explicit fallback preference signal. |
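Since the table notes that `preference_dial` values outside `0.0..1.0` are clamped, a client can normalize them before sending for predictable behavior. The helper below is a hypothetical sketch of that normalization, not gateway code.

```python
# Sketch of client-side normalization for routing_hints. routing_hints()
# is a hypothetical helper; only the clamping behavior comes from the docs.

def routing_hints(mode="balanced", preference_dial=None, **extra):
    hints = {"mode": mode, **extra}
    if preference_dial is not None:
        # Mirror the documented clamp to the effective 0.0..1.0 range.
        hints["preference_dial"] = max(0.0, min(1.0, preference_dial))
    return hints

hints = routing_hints(mode="cost_optimized", preference_dial=1.4, max_latency_ms=800)
```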
routing_override (harder overrides)
| Field | Type | Values | Backend status | Usage guidance |
|---|---|---|---|---|
force_tier | string | T1, T2, T3 (also tier-1, tier1, etc.) | Active | Forces selected tier mapping. |
force_lane | string | smart, turbo | Active | Forces lane. |
force_model | string | model slug/upstream id | Active | Pins model after routing. |
force_provider | string | provider identifier | Reserved | Accepted in schema; not directly applied in this chat handler flow today. |
`xantly` orchestration block
| Field | Type | Values / range | Effective default | Backend status | Usage guidance |
|---|---|---|---|---|---|
workflow_type | string | single_turn, execution_task, multi_step_conversational, long_horizon_autonomous, voice_simple, voice_complex, creative | auto-classified | Active | Explicitly sets workflow class when recognized. |
chain_id | string | UUID recommended | null | Active | Signals chain continuation; invalid UUID is ignored by classifier. |
conversation_id | string | free-form id | null | Active | Used for memory/sticky context continuity. |
planning_mode | string | preact or planact | tenant/default heuristic | Active | Controls planner style where planning layer is active. |
max_chain_steps | integer (u16) | 1..65535 | workflow-dependent | Active (conditional) | Applies when workflow is long_horizon_autonomous. |
chain_timeout_secs | integer (u32) | 0..4294967295 | workflow-dependent | Active (conditional) | Applies when workflow is long_horizon_autonomous. |
chain_routing | string | e.g. sticky / mixed | null | Reserved | Accepted in payload; rely on routing_hints.chain_routing for current behavior. |
reliability_level | string | standard, high, critical | standard | Active | Influences reliability/verification activation. |
enable_memory | boolean | true/false | true | Active | Controls persistence into L1/L2 memory in this handler path. |
enable_speculation | boolean | true/false | router/tenant default | Active | Per-request override for speculation toggle. |
enable_hedging | boolean | true/false | router/tenant default | Active | Per-request override for hedging toggle. |
enable_cache | boolean | true/false | true | Active | Enables/disables cache lookup path for this request. |
cache_ttl_secs | integer | positive integer | null | Reserved | Accepted but not currently used to override cache TTL in this handler. |
output_verification | string | none, native, schema, cross_model | strategy auto-selection | Active | Per-request override for output verification strategy. |
compress_context | boolean | true/false | null | Reserved | Accepted for forward compatibility. |
redact_pii | boolean | true/false | false | Active | Sets request redaction signal (x-redact-pii=true metadata). |
voice_mode | string | typically "true" for voice path | null | Active | Enables voice-oriented handling when set. |
enable_tool_reranking | boolean | true/false | null | Reserved | Accepted for forward compatibility. |
intelligence_mode | string | proxy, cache, full | full (system default) | Active | Pipeline intelligence preset. See Intelligence Modes. |
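Because `max_chain_steps` and `chain_timeout_secs` only apply when the workflow is `long_horizon_autonomous`, a client can drop them for other workflow types to keep payloads honest. The helper below is a hypothetical sketch of that conditional assembly.

```python
# Hypothetical helper that assembles the xantly orchestration block and
# omits chain-limit fields unless the workflow is long_horizon_autonomous,
# mirroring the conditional behavior documented above.

def xantly_block(workflow_type, max_chain_steps=None, chain_timeout_secs=None, **opts):
    block = {"workflow_type": workflow_type, **opts}
    if workflow_type == "long_horizon_autonomous":
        if max_chain_steps is not None:
            block["max_chain_steps"] = max_chain_steps
        if chain_timeout_secs is not None:
            block["chain_timeout_secs"] = chain_timeout_secs
    return block

auto_block = xantly_block("long_horizon_autonomous", max_chain_steps=12,
                          reliability_level="high")
simple_block = xantly_block("single_turn", max_chain_steps=12)  # chain limit ignored
```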
Legacy request headers (still supported)
The gateway maps selected request headers into metadata for backward compatibility.
| Header | Mapped metadata key |
|---|---|
x-xantly-workflow | x-xantly-workflow |
x-xantly-voice | x-xantly-voice |
x-xantly-planning-mode | x-xantly-planning-mode |
x-xantly-preference | x-xantly-preference |
x-xantly-chain-routing | x-xantly-chain-routing |
x-xantly-lane | x-xantly-lane |
x-xantly-tier | x-xantly-tier |
x-xantly-run-id | run_id |
x-xantly-conversation-id | conversation_id |
x-intelligence-mode | x-intelligence-mode |
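The mapping above is mostly identity; only `x-xantly-run-id` and `x-xantly-conversation-id` are renamed. A client-side sketch of the same mapping (illustrative, assuming case-insensitive header lookup per HTTP semantics):

```python
# Header-to-metadata mapping copied from the table above. The
# headers_to_metadata helper is hypothetical, for illustration only.

HEADER_TO_METADATA = {
    "x-xantly-workflow": "x-xantly-workflow",
    "x-xantly-voice": "x-xantly-voice",
    "x-xantly-planning-mode": "x-xantly-planning-mode",
    "x-xantly-preference": "x-xantly-preference",
    "x-xantly-chain-routing": "x-xantly-chain-routing",
    "x-xantly-lane": "x-xantly-lane",
    "x-xantly-tier": "x-xantly-tier",
    "x-xantly-run-id": "run_id",
    "x-xantly-conversation-id": "conversation_id",
    "x-intelligence-mode": "x-intelligence-mode",
}

def headers_to_metadata(headers):
    # HTTP header names are case-insensitive; unknown headers are ignored.
    return {
        HEADER_TO_METADATA[k.lower()]: v
        for k, v in headers.items()
        if k.lower() in HEADER_TO_METADATA
    }

meta = headers_to_metadata({"X-Xantly-Run-Id": "run_42", "Accept": "application/json"})
```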
Response body
Non-stream (200 OK)
```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1741400000,
  "model": "deepseek-chat",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 31,
    "completion_tokens": 18,
    "total_tokens": 49
  },
  "xantly_metadata": {
    "request_id": "req_01abc...",
    "routing_decision": "Lane: Turbo (Source: BarpRouter)",
    "provider": "deepseek",
    "tier": "T2",
    "provider_tier": "T2",
    "latency_ms": 142,
    "requested_model": "auto",
    "decision_source": "BarpRouter",
    "task_family": "analysis",
    "cost_usd": 0.00032,
    "baseline_cost_usd": 0.00120,
    "savings_usd": 0.00088,
    "savings_pct": 73.3,
    "cost_attribution": "xantly",
    "estimated_cost_usd": 0.00032,
    "healer_report": null
  }
}
```

xantly_metadata fields
| Field | Type | Always present | Description |
|---|---|---|---|
request_id | string | Yes | Correlation ID for support and audit. |
routing_decision | string | Yes | Human-readable routing summary (lane + source). |
provider | string | Yes | Provider that served the request (e.g. "deepseek", "anthropic"). |
tier | string | Yes | Effective execution tier ("T1", "T2", "T3"). |
provider_tier | string? | No | Provider-level tier when different from effective tier. |
latency_ms | integer | Yes | End-to-end gateway latency in milliseconds. |
requested_model | string | Yes | The model string you sent (e.g. "auto", "gpt-4o"). Useful for debugging routing decisions when using "auto". |
decision_source | string? | No | Machine-readable routing engine source (e.g. "BarpRouter", "Pinned", "CacheHit"). |
task_family | string? | No | Task category detected by the smart router (e.g. "code", "writing", "analysis"). |
cost_usd | number? | No | Actual cost for this specific request in USD. Present when usage data is available. |
baseline_cost_usd | number? | No | GPT-4o reference cost for the same token counts (input $2.50/M, output $10/M). Lets you compute savings without a separate analytics call. |
savings_usd | number? | No | baseline_cost_usd - cost_usd. Positive value means Xantly saved money vs. GPT-4o. |
savings_pct | number? | No | Savings as a percentage of baseline (0–100+). |
cost_attribution | string | Yes | "xantly" when routed via Xantly-managed provider keys; "byok" when routed via your own API key. |
estimated_cost_usd | number? | No | Legacy alias for cost_usd. Kept for backward compatibility. |
healer_report | object? | No | Present when JSON repair was applied. Contains original, healed, stage, confidence, healing_time_ms. |
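The savings fields are derived directly from `cost_usd` and `baseline_cost_usd`, so they can be recomputed client-side. A quick sketch using the figures from the sample response above:

```python
# Recomputing savings from xantly_metadata, using the sample response's
# numbers: savings_usd = baseline_cost_usd - cost_usd.
meta = {"cost_usd": 0.00032, "baseline_cost_usd": 0.00120}

savings_usd = meta["baseline_cost_usd"] - meta["cost_usd"]
savings_pct = round(100 * savings_usd / meta["baseline_cost_usd"], 1)
# savings_usd -> 0.00088, savings_pct -> 73.3, matching the sample response
```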
Notes
- `choices[].logprobs` is part of the schema but may be omitted in normalized API responses.
- For `n > 1`, if upstream returns fewer choices, the gateway may replicate the first choice to satisfy the requested cardinality.
Stream (text/event-stream)
- All stream responses emit `chat.completion.chunk` events and terminate with `data: [DONE]`.
- Non-voice streaming currently emits a compact sequence (role chunk + content chunk + `[DONE]`) instead of token-by-token chunking.
- `stream_options.include_usage` is forwarded to providers where supported; do not assume a terminal usage chunk is always present.
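A consumer that accumulates `delta.content` and treats `[DONE]` as a sentinel handles both the compact sequence and token-by-token chunking. The sketch below parses a hand-written SSE payload; it is an illustration, not captured gateway output.

```python
# Minimal SSE consumer for the stream shape described above. The raw
# payload is hand-written for the example (role chunk + content chunk
# + [DONE] terminator).
import json

raw = (
    'data: {"object":"chat.completion.chunk","choices":[{"delta":{"role":"assistant"}}]}\n\n'
    'data: {"object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello!"}}]}\n\n'
    "data: [DONE]\n\n"
)

text = ""
for line in raw.splitlines():
    if not line.startswith("data: "):
        continue
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break  # terminal sentinel, not JSON
    chunk = json.loads(payload)
    # delta may carry role, content, or neither; only accumulate content.
    text += chunk["choices"][0]["delta"].get("content", "")
```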
Response headers
Common observability headers
| Header | Present when | Description |
|---|---|---|
x-xantly-request-id | standard non-stream path and semantic-cache responses | Request correlation id. |
x-xantly-cache-hit | standard non-stream path and cache responses | true/false. |
x-xantly-tier-used | non-stream and exact-cache responses | Effective execution tier. |
x-xantly-lane-used | non-stream and exact-cache responses | Effective lane (smart/turbo). |
x-xantly-provider | non-stream and exact-cache responses | Provider/model source label. |
x-xantly-speculation-accepted | non-stream | Currently emitted as a `0` placeholder. |
x-xantly-latency-breakdown | non-stream | JSON string with stage-level latency metrics. |
x-xantly-audit-id | non-stream | Correlation id for audit trail. |
Cache-path headers
| Header | Value |
|---|---|
x-xantly-cache-type | exact or semantic |
x-xantly-latency-ms | end-to-end latency (cache path) |
x-xantly-semantic-similarity | similarity score (semantic cache only) |
Usage / cost headers
These are included when usage is available:
- `x-xantly-input-tokens`
- `x-xantly-output-tokens`
- `x-xantly-cost-usd`
Errors
Error payloads use OpenAI-style shape:
```json
{
  "error": {
    "message": "temperature (2.5) must be between 0 and 2",
    "type": "invalid_request_error",
    "code": "validation_error",
    "param": null
  }
}
```

| HTTP | error.type | error.code (examples) | Typical trigger |
|---|---|---|---|
400 | invalid_request_error | validation_error | Invalid parameter range, unknown model, empty messages. |
401 | authentication_error | invalid_api_key, expired_api_key, revoked_api_key | Missing/invalid API key. |
403 | authorization_error | forbidden, insufficient_scopes, tenant_violation | Scope/policy denial. |
404 | not_found_error or governance_error | resource_not_found, tool_not_registered | Missing referenced resource or unregistered tool. |
422 | governance_error | tool_call_blocked | Governance blocked a tool call. |
429 | rate_limit_error | rate_limit_exceeded, upstream_rate_limit | Tenant or upstream rate limiting. |
502 | upstream_error | provider_error | Upstream provider error. |
503 | governance_error | circuit_breaker_open | Circuit breaker open. |
504 | upstream_error | provider_timeout | Upstream timeout. |
500 | internal_error | internal_error | Internal platform failure. |
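A reasonable retry policy over these status codes: retry 429 after the server-provided `retry-after`, retry 502/503/504 with exponential backoff, and treat the remaining 4xx/500 responses as non-retryable. The helper below is an illustrative sketch of that policy, not official client behavior.

```python
# Hypothetical retry policy over the error table above. Returns a delay
# in seconds, or None when the request should not be retried.

def retry_delay_secs(status, headers, attempt, base=0.5):
    if status == 429:
        # Honor retry-after when present on eligible rate-limit responses.
        ra = headers.get("retry-after")
        return float(ra) if ra is not None else base * 2 ** attempt
    if status in (502, 503, 504):
        return base * 2 ** attempt  # exponential backoff for upstream issues
    return None  # 400/401/403/404/422/500: fix the request or escalate
```

Logging `x-error-id` alongside each failed attempt makes support correlation straightforward.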
Error headers
- `x-error-id` is returned on error responses for support correlation.
- `retry-after` is returned on eligible rate-limit responses.
Edge cases and implementation notes
- Model validation is catalog-aware
  - `model` validation is strict when the model catalog is loaded.
  - During catalog cold-start/empty states, strict rejection may be relaxed.
- `top_logprobs` validation
  - The server enforces `top_logprobs <= 20`.
  - Provider-specific `logprobs` behavior may vary.
- Chain limit overrides are conditional
  - `xantly.max_chain_steps` and `xantly.chain_timeout_secs` apply only when the workflow resolves to `long_horizon_autonomous`.
- Streaming semantics
  - Non-voice stream mode is SSE-compatible but not guaranteed to be token-by-token.
- Reserved fields
  - Some accepted proprietary fields are currently advisory/reserved for compatibility (`force_provider`, `cache_ttl_secs`, `compress_context`, `enable_tool_reranking`, and certain `routing_hints` fields).
Practical rollout checklist
- Start with `model: "auto"` + standard fields.
- Enable `response_format` for structured outputs.
- Add one routing/orchestration control at a time.
- Track `x-xantly-*` response headers in observability.
- Add override fields only for test/debug traffic.
See also
- Benchmark Results — Every parameter on this page is individually validated. See the full 252-test scorecard including boundary and invalid-input cases.
- Streaming Responses — Complete guide to SSE format, chunk handling, and SDK examples.
- Rate Limits — RPM and TPM limits that apply to this endpoint.
- Billing & Quotas — Token quotas, budget caps, and cost visibility via `xantly_metadata`.