Guide: Cost-Optimized Routing

Use this guide to reduce spend while keeping output quality predictable.

Routing is automatic by default (model: "auto"). You only need routing controls when you want stricter cost/latency behavior.


Start with auto-routing

Use this as your baseline:

{
  "model": "auto",
  "messages": [{"role": "user", "content": "Summarize this customer feedback."}]
}

Then inspect response headers:

  • x-xantly-tier-used
  • x-xantly-lane-used
  • x-xantly-provider
  • x-xantly-cost-usd (when usage is available)
  • x-xantly-cache-hit
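To make those headers easier to log, a small helper can collect them from any response. This is a minimal sketch: the function name `routing_summary` is hypothetical, it assumes the headers arrive as a plain name-to-value mapping from whatever HTTP client you use, and it assumes x-xantly-cost-usd carries a decimal string.

```python
from typing import Mapping

# Header names as listed in this guide.
ROUTING_HEADERS = (
    "x-xantly-tier-used",
    "x-xantly-lane-used",
    "x-xantly-provider",
    "x-xantly-cost-usd",
    "x-xantly-cache-hit",
)


def routing_summary(headers: Mapping[str, str]) -> dict:
    """Collect the x-xantly-* routing headers, matching names case-insensitively."""
    lowered = {name.lower(): value for name, value in headers.items()}
    summary = {name: lowered.get(name) for name in ROUTING_HEADERS}
    # The cost header is only sent when usage is available, so it may be absent.
    # Assumption: when present, its value is a decimal string such as "0.00042".
    cost = summary["x-xantly-cost-usd"]
    summary["x-xantly-cost-usd"] = float(cost) if cost is not None else None
    return summary
```

Headers that are missing come back as None, so the same helper works on responses where cost was not reported.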

Routing controls, ordered by strength

1) Presets (routing_hints.mode)

Simple one-knob routing:

Mode              Typical intent
fast              Favor speed / lower latency
balanced          Let gateway balance speed, quality, and cost
quality           Favor stronger-quality paths
cost_optimized    Favor lower-cost paths
free_models_only  Prefer free-tier style behavior

"routing_hints": { "mode": "cost_optimized" }
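If you build request bodies in code, it can help to catch a mistyped preset before the request is sent. The sketch below is an illustration only: `with_mode` is a hypothetical client-side helper, not part of any Xantly SDK, and the set of modes is copied from the list above.

```python
import copy

# Preset names as listed in this guide.
PRESET_MODES = {"fast", "balanced", "quality", "cost_optimized", "free_models_only"}


def with_mode(body: dict, mode: str) -> dict:
    """Return a copy of the request body with routing_hints.mode set."""
    if mode not in PRESET_MODES:
        raise ValueError(f"unknown routing mode: {mode!r}")
    new_body = copy.deepcopy(body)  # leave the caller's body untouched
    new_body.setdefault("routing_hints", {})["mode"] = mode
    return new_body
```

Existing routing_hints fields on the body are kept; only mode is added or replaced.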

2) Fine-grained hints (routing_hints.*)

Use these when preset behavior is not enough.

Field                  Status                Notes
preference_dial        Active                Clamped to 0.0..1.0. Lower = cheaper/faster bias; higher = quality bias.
prefer_latency         Active                Strong low-latency bias.
max_latency_ms         Active                Very low values (for example <500) strongly bias fast routing.
task_complexity        Active                trivial, standard, complex, expert.
max_tier               Active (best-effort)  Tier guardrail hint.
chain_routing          Active                sticky or mixed; mixed disables sticky continuation behavior.
allow_free_fallback    Active                Adds explicit fallback preference signal.
prefer_quality         Partial               Accepted; currently mainly affects how presets are applied.
max_cost_per_token     Advisory              Accepted for compatibility; not a strict enforcement path.
required_capabilities  Advisory              Accepted for compatibility; not a strict hard filter in this handler path.

Example:

"routing_hints": {
  "mode": "balanced",
  "preference_dial": 0.15,
  "max_latency_ms": 700,
  "task_complexity": "standard"
}
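When hints are computed rather than hard-coded, a small builder keeps them inside the documented ranges. This is a sketch under stated assumptions: `build_hints` is a hypothetical helper, the clamp mirrors the documented 0.0..1.0 range on the client side, and rejecting a non-positive max_latency_ms is a choice made here, not documented gateway behavior.

```python
from typing import Optional

# Allowed task_complexity values as listed in this guide.
TASK_COMPLEXITIES = ("trivial", "standard", "complex", "expert")


def build_hints(preference_dial: Optional[float] = None,
                max_latency_ms: Optional[int] = None,
                task_complexity: Optional[str] = None) -> dict:
    """Build a routing_hints object containing only the fields that were set."""
    hints: dict = {}
    if preference_dial is not None:
        # Mirror the documented clamp so logged values match what is applied.
        hints["preference_dial"] = min(1.0, max(0.0, float(preference_dial)))
    if max_latency_ms is not None:
        # Assumption: a zero or negative budget is a caller bug, so fail early.
        if max_latency_ms <= 0:
            raise ValueError("max_latency_ms must be positive")
        hints["max_latency_ms"] = int(max_latency_ms)
    if task_complexity is not None:
        if task_complexity not in TASK_COMPLEXITIES:
            raise ValueError(f"unknown task_complexity: {task_complexity!r}")
        hints["task_complexity"] = task_complexity
    return hints
```

The result can be assigned directly to the routing_hints key of a request body.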

3) Hard overrides (routing_override.*)

Use only for controlled tests and debugging.

"routing_override": {
  "force_tier": "T3",
  "force_lane": "turbo",
  "force_model": "your-model-slug"
}

force_provider is accepted for forward compatibility, but is not directly enforced in the current chat handler path.
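One way to keep overrides out of production traffic is to gate them in client code. The sketch below assumes a hypothetical `environment` string supplied by your own deployment configuration; `with_override` is an illustrative helper, and the allowed keys are the ones named in this section.

```python
import copy

# Override keys named in this guide; force_provider is accepted but not enforced.
ALLOWED_OVERRIDE_KEYS = {"force_tier", "force_lane", "force_model", "force_provider"}


def with_override(body: dict, override: dict, environment: str) -> dict:
    """Return a copy of body with routing_override attached, outside production only."""
    # Assumption: the caller labels its production deployment as "production".
    if environment == "production":
        raise RuntimeError("routing_override is for tests and debugging, not production")
    unknown = set(override) - ALLOWED_OVERRIDE_KEYS
    if unknown:
        raise ValueError(f"unknown override keys: {sorted(unknown)}")
    new_body = copy.deepcopy(body)  # leave the caller's body untouched
    new_body["routing_override"] = dict(override)
    return new_body
```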


Cost levers that matter most

  1. Keep model: "auto" unless you truly need pinning.
  2. Enable caching for repeatable prompts (xantly.enable_cache: true, which is the default).
  3. Use service_tier: "batch" for cost-sensitive workloads.
  4. Use a low preference_dial plus a latency budget (max_latency_ms) for high-volume pipelines.
  5. Avoid unnecessarily high reliability modes for non-critical tasks.

Caching patterns

Exact and semantic caching can reduce cost significantly for repeated or similar traffic.

"xantly": {
  "enable_cache": true
}

Observe cache outcomes with headers:

  • x-xantly-cache-hit
  • x-xantly-cache-type (exact or semantic on cache-hit paths)
  • x-xantly-semantic-similarity (semantic hits)
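These headers can be rolled up into a hit rate across many responses. The sketch below is illustrative: `cache_stats` is a hypothetical helper, and it assumes that x-xantly-cache-hit carries the literal value "true" on a hit, which this guide does not specify. Adjust the comparison if your responses use a different value.

```python
from collections import Counter
from typing import Iterable, Mapping


def cache_stats(responses: Iterable[Mapping[str, str]]) -> dict:
    """Summarize cache outcomes from a sequence of response header mappings."""
    total = 0
    hits = 0
    by_type: Counter = Counter()
    for headers in responses:
        lowered = {name.lower(): value for name, value in headers.items()}
        total += 1
        # Assumption: a hit is signalled by the value "true" (case-insensitive).
        if lowered.get("x-xantly-cache-hit", "").strip().lower() == "true":
            hits += 1
            # x-xantly-cache-type is documented as exact or semantic on hit paths.
            by_type[lowered.get("x-xantly-cache-type", "unknown")] += 1
    return {
        "total": total,
        "hits": hits,
        "hit_rate": hits / total if total else 0.0,
        "by_type": dict(by_type),
    }
```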

Copy-paste templates

High-volume lightweight classification

{
  "model": "auto",
  "max_tokens": 32,
  "routing_hints": {
    "mode": "fast",
    "preference_dial": 0.1,
    "task_complexity": "trivial"
  },
  "messages": [
    {"role": "system", "content": "Return one label: positive, neutral, or negative."},
    {"role": "user", "content": "I love this feature."}
  ]
}

Cost-sensitive extraction with structured output

{
  "model": "auto",
  "service_tier": "batch",
  "response_format": {"type": "json_object"},
  "routing_hints": {
    "mode": "cost_optimized",
    "task_complexity": "standard"
  },
  "messages": [
    {"role": "user", "content": "Extract invoice_id, amount, and due_date from this text: ..."}
  ]
}

Validation checklist

  • Compare cost and latency before/after routing hints.
  • Verify routing intent via x-xantly-tier-used and x-xantly-lane-used.
  • Track cache hit rate over time.
  • Add overrides only in non-production experiments.
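For the before/after comparison, a relative change in the mean is often enough as a first check. This is a minimal sketch: `relative_change` is a hypothetical helper, and it assumes you have already collected per-request cost (for example from x-xantly-cost-usd) or latency samples for each configuration.

```python
from statistics import mean
from typing import Sequence


def relative_change(before: Sequence[float], after: Sequence[float]) -> float:
    """Relative change of the mean; -0.25 means the mean dropped by 25%."""
    if not before or not after:
        raise ValueError("both sample sets must be non-empty")
    baseline = mean(before)
    if baseline == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")
    return (mean(after) - baseline) / baseline
```

Means hide tail behavior, so for latency it is worth also comparing a high percentile once you have enough samples.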
