Chat Completions
Create completions with OpenAI-compatible request/response shapes plus optional Xantly orchestration controls.
- POST `/v1/chat/completions`
- HEAD `/v1/chat/completions` (returns `204 No Content`; useful as a lightweight probe)
- Auth: `Authorization: Bearer <token>`

Recommended rollout: start with standard fields only (`model`, `messages`, `temperature`, `max_tokens`), then add orchestration controls incrementally.
Quick start example
```bash
curl -sS https://api.xantly.com/v1/chat/completions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "Explain vector databases in 2 bullets."}
    ]
  }'
```

Request body
Standard parameters (OpenAI-compatible)
| Field | Type | Required | Default | Validation / behavior |
|---|---|---|---|---|
model | string | Yes | — | Use "auto" (recommended) or a valid catalog model slug/upstream name. Unknown models are rejected when catalog is loaded. |
messages | array<ChatMessage> | Yes | — | Must contain at least 1 message. |
stream | boolean | No | false | Enables SSE output. |
n | integer | No | 1 | Allowed range: 1..=8. |
max_tokens | integer | No | provider/model dependent | Alias supported: max_completion_tokens. |
temperature | number | No | provider/model dependent | Validated range: 0.0..=2.0. |
top_p | number | No | provider/model dependent | Passed through to provider. |
frequency_penalty | number | No | provider/model dependent | Validated range: -2.0..=2.0. |
presence_penalty | number | No | provider/model dependent | Validated range: -2.0..=2.0. |
stop | string \| array<string> | No | null | Passed through to provider. |
tools | array<object> | No | null | Tool definitions for function-calling workflows. |
tool_choice | string \| object | No | provider dependent | Passed through to provider. |
parallel_tool_calls | boolean | No | provider dependent | Passed through to provider. |
response_format | object | No | null | Supports {"type":"json_object"} and {"type":"json_schema", "json_schema": {...}}. |
seed | integer (u64) | No | null | Determinism hint. |
user | string | No | null | End-user identifier, forwarded to provider. |
logprobs | boolean | No | null | Forwarded to provider. |
top_logprobs | integer (u8) | No | null | Validated max: 20. |
stream_options.include_usage | boolean | No | false | Forwarded to provider where supported. |
reasoning_effort | string | No | null | Accepted values: low, medium, high (provider-dependent behavior). |
service_tier | string | No | null | "batch" is accepted and treated as a cost-oriented routing hint. |
metadata | object<string,string> | No | null | Free-form metadata map used for tracing/routing context. |
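The validation rules above can be mirrored client-side so that out-of-range values fail fast before a network round trip. The helper below is an illustrative sketch, not part of any official SDK; only the field names and documented ranges come from the table.

```python
# Client-side sketch of the documented validation rules for standard
# parameters. build_request is a hypothetical helper, not a Xantly SDK call.

def build_request(model, messages, **params):
    if not messages:
        raise ValueError("messages must contain at least 1 message")
    body = {"model": model, "messages": messages}

    # Documented ranges from the table above.
    ranges = {
        "temperature": (0.0, 2.0),
        "frequency_penalty": (-2.0, 2.0),
        "presence_penalty": (-2.0, 2.0),
        "n": (1, 8),
        "top_logprobs": (0, 20),
    }
    for key, value in params.items():
        if key in ranges and value is not None:
            lo, hi = ranges[key]
            if not (lo <= value <= hi):
                raise ValueError(f"{key} ({value}) must be between {lo} and {hi}")
        body[key] = value  # everything else passes through unchanged
    return body

req = build_request(
    "auto",
    [{"role": "user", "content": "Explain vector databases in 2 bullets."}],
    temperature=0.7,
    n=2,
)
```

Rejecting invalid values locally produces the same class of failure the gateway would return as a `400 validation_error`, just without the round trip.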
ChatMessage object
| Field | Type | Required | Notes |
|---|---|---|---|
role | string | Yes | Common values: system, user, assistant, tool. |
content | string \| array | Yes | String for text, array for multimodal content parts. |
name | string | No | Optional participant name. |
tool_calls | array<object> | No | Assistant tool call payloads. |
tool_call_id | string | No | Correlates tool output to a prior tool call. |
refusal | string | No | Optional refusal text. |
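The `tool_calls` / `tool_call_id` pairing is easiest to see in a full function-calling round trip. The sequence below is illustrative: the tool name `get_weather` and the call id are made up for the example.

```python
# Illustrative ChatMessage sequence for a function-calling round trip.
# The assistant emits a tool call; the tool result is correlated back
# to it via tool_call_id.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_abc123",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
        }],
    },
    # The tool message's tool_call_id must match the assistant's call id.
    {"role": "tool", "tool_call_id": "call_abc123", "content": '{"temp_c": 18}'},
]
```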
Proprietary parameters (Xantly)
These are optional. If omitted, the gateway uses defaults and automatic policy.
routing_hints (soft preferences)
| Field | Type | Value range / values | Backend status | Usage guidance |
|---|---|---|---|---|
mode | string | fast, balanced, quality, cost_optimized, free_models_only | Active | Coarse routing preset when explicit preference knobs are not set. |
preference_dial | number | Best used in 0.0..1.0 (values are clamped) | Active | Lower biases cost/speed; higher biases quality. |
prefer_latency | boolean | true/false | Active | true strongly biases low-latency execution. |
prefer_quality | boolean | true/false | Partial | Accepted; currently primarily suppresses mode preset behavior when explicit knobs are present. |
max_cost_per_token | number | positive float | Reserved / advisory | Accepted for forward compatibility. Do not rely on strict enforcement yet. |
max_latency_ms | integer | positive integer | Active | Sets latency budget; very low budgets bias faster lanes. |
max_tier | integer | 1, 2, 3 | Active | Tier guardrail hint. |
required_capabilities | array<string> | free-form strings | Reserved / advisory | Accepted for forward compatibility; not a strict hard filter in this handler path. |
task_complexity | string | trivial, standard, complex, expert | Active | Influences tier floor/ceiling behavior. |
chain_routing | string | sticky, mixed | Active | mixed disables sticky-route continuation behavior. |
allow_free_fallback | boolean | true/false | Active | Passed into metadata as an explicit fallback preference signal. |
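Since the table notes that `preference_dial` values outside `0.0..1.0` are clamped, a client can normalize them before sending for predictable behavior. The helper below is a hypothetical sketch of that normalization, not gateway code.

```python
# Sketch of client-side normalization for routing_hints. routing_hints()
# is a hypothetical helper; only the clamping behavior comes from the docs.

def routing_hints(mode="balanced", preference_dial=None, **extra):
    hints = {"mode": mode, **extra}
    if preference_dial is not None:
        # Mirror the documented clamp to the effective 0.0..1.0 range.
        hints["preference_dial"] = max(0.0, min(1.0, preference_dial))
    return hints

hints = routing_hints(mode="cost_optimized", preference_dial=1.4, max_latency_ms=800)
```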
routing_override (harder overrides)
| Field | Type | Values | Backend status | Usage guidance |
|---|---|---|---|---|
force_tier | string | T1, T2, T3 (also tier-1, tier1, etc.) | Active | Forces selected tier mapping. |
force_lane | string | smart, turbo | Active | Forces lane. |
force_model | string | model slug/upstream id | Active | Pins model after routing. |
force_provider | string | provider identifier | Reserved | Accepted in schema; not directly applied in this chat handler flow today. |
`xantly` orchestration block
| Field | Type | Values / range | Effective default | Backend status | Usage guidance |
|---|---|---|---|---|---|
workflow_type | string | single_turn, execution_task, multi_step_conversational, long_horizon_autonomous, voice_simple, voice_complex, creative | auto-classified | Active | Explicitly sets workflow class when recognized. |
chain_id | string | UUID recommended | null | Active | Signals chain continuation; invalid UUID is ignored by classifier. |
conversation_id | string | free-form id | null | Active | Used for memory/sticky context continuity. |
planning_mode | string | preact or planact | tenant/default heuristic | Active | Controls planner style where planning layer is active. |
max_chain_steps | integer (u16) | 1..65535 | workflow-dependent | Active (conditional) | Applies when workflow is long_horizon_autonomous. |
chain_timeout_secs | integer (u32) | 0..4294967295 | workflow-dependent | Active (conditional) | Applies when workflow is long_horizon_autonomous. |
chain_routing | string | e.g. sticky / mixed | null | Reserved | Accepted in payload; rely on routing_hints.chain_routing for current behavior. |
reliability_level | string | standard, high, critical | standard | Active | Influences reliability/verification activation. |
enable_memory | boolean | true/false | true | Active | Controls persistence into L1/L2 memory in this handler path. |
enable_speculation | boolean | true/false | router/tenant default | Active | Per-request override for speculation toggle. |
enable_hedging | boolean | true/false | router/tenant default | Active | Per-request override for hedging toggle. |
enable_cache | boolean | true/false | true | Active | Enables/disables cache lookup path for this request. |
cache_ttl_secs | integer | positive integer | null | Reserved | Accepted but not currently used to override cache TTL in this handler. |
output_verification | string | none, native, schema, cross_model | strategy auto-selection | Active | Per-request override for output verification strategy. |
compress_context | boolean | true/false | null | Reserved | Accepted for forward compatibility. |
redact_pii | boolean | true/false | false | Active | Sets request redaction signal (x-redact-pii=true metadata). |
voice_mode | string | typically "true" for voice path | null | Active | Enables voice-oriented handling when set. |
enable_tool_reranking | boolean | true/false | null | Reserved | Accepted for forward compatibility. |
intelligence_mode | string | proxy, cache, full | full (system default) | Active | Pipeline intelligence preset. See Intelligence Modes. |
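Because `max_chain_steps` and `chain_timeout_secs` only apply when the workflow is `long_horizon_autonomous`, a client can drop them for other workflow types to keep payloads honest. The helper below is a hypothetical sketch of that conditional assembly.

```python
# Hypothetical helper that assembles the xantly orchestration block and
# omits chain-limit fields unless the workflow is long_horizon_autonomous,
# mirroring the conditional behavior documented above.

def xantly_block(workflow_type, max_chain_steps=None, chain_timeout_secs=None, **opts):
    block = {"workflow_type": workflow_type, **opts}
    if workflow_type == "long_horizon_autonomous":
        if max_chain_steps is not None:
            block["max_chain_steps"] = max_chain_steps
        if chain_timeout_secs is not None:
            block["chain_timeout_secs"] = chain_timeout_secs
    return block

auto_block = xantly_block("long_horizon_autonomous", max_chain_steps=12,
                          reliability_level="high")
simple_block = xantly_block("single_turn", max_chain_steps=12)  # chain limit ignored
```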
Legacy request headers (still supported)
The gateway maps selected request headers into metadata for backward compatibility.
| Header | Mapped metadata key |
|---|---|
x-xantly-workflow | x-xantly-workflow |
x-xantly-voice | x-xantly-voice |
x-xantly-planning-mode | x-xantly-planning-mode |
x-xantly-preference | x-xantly-preference |
x-xantly-chain-routing | x-xantly-chain-routing |
x-xantly-lane | x-xantly-lane |
x-xantly-tier | x-xantly-tier |
x-xantly-run-id | run_id |
x-xantly-conversation-id | conversation_id |
x-intelligence-mode | x-intelligence-mode |
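The mapping above is mostly identity; only `x-xantly-run-id` and `x-xantly-conversation-id` are renamed. A client-side sketch of the same mapping (illustrative, assuming case-insensitive header lookup per HTTP semantics):

```python
# Header-to-metadata mapping copied from the table above. The
# headers_to_metadata helper is hypothetical, for illustration only.

HEADER_TO_METADATA = {
    "x-xantly-workflow": "x-xantly-workflow",
    "x-xantly-voice": "x-xantly-voice",
    "x-xantly-planning-mode": "x-xantly-planning-mode",
    "x-xantly-preference": "x-xantly-preference",
    "x-xantly-chain-routing": "x-xantly-chain-routing",
    "x-xantly-lane": "x-xantly-lane",
    "x-xantly-tier": "x-xantly-tier",
    "x-xantly-run-id": "run_id",
    "x-xantly-conversation-id": "conversation_id",
    "x-intelligence-mode": "x-intelligence-mode",
}

def headers_to_metadata(headers):
    # HTTP header names are case-insensitive; unknown headers are ignored.
    return {
        HEADER_TO_METADATA[k.lower()]: v
        for k, v in headers.items()
        if k.lower() in HEADER_TO_METADATA
    }

meta = headers_to_metadata({"X-Xantly-Run-Id": "run_42", "Accept": "application/json"})
```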
Response body
Non-stream (200 OK)
```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1741400000,
  "model": "deepseek-chat",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 31,
    "completion_tokens": 18,
    "total_tokens": 49
  },
  "xantly_metadata": {
    "request_id": "req_01abc...",
    "routing_decision": "Lane: Turbo (Source: BarpRouter)",
    "provider": "deepseek",
    "tier": "T2",
    "provider_tier": "T2",
    "latency_ms": 142,
    "requested_model": "auto",
    "decision_source": "BarpRouter",
    "task_family": "analysis",
    "cost_usd": 0.00032,
    "baseline_cost_usd": 0.00120,
    "savings_usd": 0.00088,
    "savings_pct": 73.3,
    "cost_attribution": "xantly",
    "estimated_cost_usd": 0.00032,
    "healer_report": null
  }
}
```

xantly_metadata fields
| Field | Type | Always present | Description |
|---|---|---|---|
request_id | string | Yes | Correlation ID for support and audit. |
routing_decision | string | Yes | Human-readable routing summary (lane + source). |
provider | string | Yes | Provider that served the request (e.g. "deepseek", "anthropic"). |
tier | string | Yes | Effective execution tier ("T1", "T2", "T3"). |
provider_tier | string? | No | Provider-level tier when different from effective tier. |
latency_ms | integer | Yes | End-to-end gateway latency in milliseconds. |
requested_model | string | Yes | The model string you sent (e.g. "auto", "gpt-4o"). Useful for debugging routing decisions when using "auto". |
decision_source | string? | No | Machine-readable routing engine source (e.g. "BarpRouter", "Pinned", "CacheHit"). |
task_family | string? | No | Task category detected by the smart router (e.g. "code", "writing", "analysis"). |
cost_usd | number? | No | Actual cost for this specific request in USD. Present when usage data is available. |
baseline_cost_usd | number? | No | GPT-4o reference cost for the same token counts (input $2.50/M, output $10/M). Lets you compute savings without a separate analytics call. |
savings_usd | number? | No | baseline_cost_usd - cost_usd. Positive value means Xantly saved money vs. GPT-4o. |
savings_pct | number? | No | Savings as a percentage of baseline (0–100+). |
cost_attribution | string | Yes | "xantly" when routed via Xantly-managed provider keys; "byok" when routed via your own API key. |
estimated_cost_usd | number? | No | Legacy alias for cost_usd. Kept for backward compatibility. |
healer_report | object? | No | Present when JSON repair was applied. Contains original, healed, stage, confidence, healing_time_ms. |
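The savings fields are derived directly from `cost_usd` and `baseline_cost_usd`, so they can be recomputed client-side. A quick sketch using the figures from the sample response above:

```python
# Recomputing savings from xantly_metadata, using the sample response's
# numbers: savings_usd = baseline_cost_usd - cost_usd.
meta = {"cost_usd": 0.00032, "baseline_cost_usd": 0.00120}

savings_usd = meta["baseline_cost_usd"] - meta["cost_usd"]
savings_pct = round(100 * savings_usd / meta["baseline_cost_usd"], 1)
# savings_usd -> 0.00088, savings_pct -> 73.3, matching the sample response
```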
Notes
- `choices[].logprobs` is part of the schema but may be omitted in normalized API responses.
- For `n > 1`, if upstream returns fewer choices, the gateway may replicate the first choice to satisfy the requested cardinality.
Stream (text/event-stream)
- All stream responses emit `chat.completion.chunk` events and terminate with `data: [DONE]`.
- Non-voice streaming currently emits a compact sequence (role chunk + content chunk + `[DONE]`) instead of token-by-token chunking.
- `stream_options.include_usage` is forwarded to providers where supported; do not assume a terminal usage chunk is always present.
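A consumer that accumulates `delta.content` and treats `[DONE]` as a sentinel handles both the compact sequence and token-by-token chunking. The sketch below parses a hand-written SSE payload; it is an illustration, not captured gateway output.

```python
# Minimal SSE consumer for the stream shape described above. The raw
# payload is hand-written for the example (role chunk + content chunk
# + [DONE] terminator).
import json

raw = (
    'data: {"object":"chat.completion.chunk","choices":[{"delta":{"role":"assistant"}}]}\n\n'
    'data: {"object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello!"}}]}\n\n'
    "data: [DONE]\n\n"
)

text = ""
for line in raw.splitlines():
    if not line.startswith("data: "):
        continue
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break  # terminal sentinel, not JSON
    chunk = json.loads(payload)
    # delta may carry role, content, or neither; only accumulate content.
    text += chunk["choices"][0]["delta"].get("content", "")
```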
Response headers
Common observability headers
| Header | Present when | Description |
|---|---|---|
x-xantly-request-id | standard non-stream path and semantic-cache responses | Request correlation id. |
x-xantly-cache-hit | standard non-stream path and cache responses | true/false. |
x-xantly-tier-used | non-stream and exact-cache responses | Effective execution tier. |
x-xantly-lane-used | non-stream and exact-cache responses | Effective lane (smart/turbo). |
x-xantly-provider | non-stream and exact-cache responses | Provider/model source label. |
x-xantly-speculation-accepted | non-stream | Currently emitted as a `0` placeholder. |
x-xantly-latency-breakdown | non-stream | JSON string with stage-level latency metrics. |
x-xantly-audit-id | non-stream | Correlation id for audit trail. |
Cache-path headers
| Header | Value |
|---|---|
x-xantly-cache-type | exact or semantic |
x-xantly-latency-ms | end-to-end latency (cache path) |
x-xantly-semantic-similarity | similarity score (semantic cache only) |
Usage / cost headers
These are included when usage is available:
- `x-xantly-input-tokens`
- `x-xantly-output-tokens`
- `x-xantly-cost-usd`
Errors
Error payloads use OpenAI-style shape:
```json
{
  "error": {
    "message": "temperature (2.5) must be between 0 and 2",
    "type": "invalid_request_error",
    "code": "validation_error",
    "param": null
  }
}
```

| HTTP | error.type | error.code (examples) | Typical trigger |
|---|---|---|---|
400 | invalid_request_error | validation_error | Invalid parameter range, unknown model, empty messages. |
401 | authentication_error | invalid_api_key, expired_api_key, revoked_api_key | Missing/invalid API key. |
403 | authorization_error | forbidden, insufficient_scopes, tenant_violation | Scope/policy denial. |
404 | not_found_error or governance_error | resource_not_found, tool_not_registered | Missing referenced resource or unregistered tool. |
422 | governance_error | tool_call_blocked | Governance blocked a tool call. |
429 | rate_limit_error | rate_limit_exceeded, upstream_rate_limit | Tenant or upstream rate limiting. |
502 | upstream_error | provider_error | Upstream provider error. |
503 | governance_error | circuit_breaker_open | Circuit breaker open. |
504 | upstream_error | provider_timeout | Upstream timeout. |
500 | internal_error | internal_error | Internal platform failure. |
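A reasonable retry policy over these status codes: retry 429 after the server-provided `retry-after`, retry 502/503/504 with exponential backoff, and treat the remaining 4xx/500 responses as non-retryable. The helper below is an illustrative sketch of that policy, not official client behavior.

```python
# Hypothetical retry policy over the error table above. Returns a delay
# in seconds, or None when the request should not be retried.

def retry_delay_secs(status, headers, attempt, base=0.5):
    if status == 429:
        # Honor retry-after when present on eligible rate-limit responses.
        ra = headers.get("retry-after")
        return float(ra) if ra is not None else base * 2 ** attempt
    if status in (502, 503, 504):
        return base * 2 ** attempt  # exponential backoff for upstream issues
    return None  # 400/401/403/404/422/500: fix the request or escalate
```

Logging `x-error-id` alongside each failed attempt makes support correlation straightforward.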
Error headers
- `x-error-id` is returned on error responses for support correlation.
- `retry-after` is returned on eligible rate-limit responses.
Edge cases and implementation notes
- Model validation is catalog-aware
  - `model` validation is strict when the model catalog is loaded.
  - During catalog cold-start/empty states, strict rejection may be relaxed.
- `top_logprobs` validation
  - The server enforces `top_logprobs <= 20`.
  - Provider-specific `logprobs` behavior may vary.
- Chain limit overrides are conditional
  - `xantly.max_chain_steps` and `xantly.chain_timeout_secs` apply only when the workflow resolves to `long_horizon_autonomous`.
- Streaming semantics
  - Non-voice stream mode is SSE-compatible but not guaranteed to be token-by-token.
- Reserved fields
  - Some accepted proprietary fields are currently advisory/reserved for compatibility (`force_provider`, `cache_ttl_secs`, `compress_context`, `enable_tool_reranking`, and certain `routing_hints` fields).
Practical rollout checklist
- Start with `model: "auto"` + standard fields.
- Enable `response_format` for structured outputs.
- Add one routing/orchestration control at a time.
- Track `x-xantly-*` response headers in observability.
- Add override fields only for test/debug traffic.
See also
- Benchmark Results — Every parameter on this page is individually validated. See the full 252-test scorecard including boundary and invalid-input cases.
- Streaming Responses — Complete guide to SSE format, chunk handling, and SDK examples.
- Rate Limits — RPM and TPM limits that apply to this endpoint.
- Billing & Quotas — Token quotas, budget caps, and cost visibility via `xantly_metadata`.