Platform Overview
Xantly is an AI infrastructure layer that sits between your application and LLM providers. One API call. Automatic routing. Built-in caching. Persistent memory. Zero lock-in.
How Xantly fits in your stack
You change one line of code — your base_url — and get intelligent routing, caching, memory, and cost optimization across every major LLM provider.
What happens when you make a request
Every call to /v1/chat/completions goes through six stages:
1. Authentication
Your API key is verified and mapped to your organization. Rate limits, budgets, and permissions are enforced.
2. Cache check
The gateway checks if an identical or semantically similar request has been answered recently. Cache hits return instantly with zero token cost. Response headers tell you what happened: x-xantly-cache-hit: true and x-xantly-cache-type: exact or semantic.
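The two cache layers can be sketched in miniature. This is a hypothetical illustration, not Xantly's actual implementation: exact matching hashes the normalized request, and a string-similarity score stands in for the real semantic layer (which uses embeddings):

```python
import hashlib
from difflib import SequenceMatcher

class GatewayCache:
    """Illustrative exact + semantic cache lookup (hypothetical sketch)."""

    def __init__(self, semantic_threshold=0.9):
        self.store = {}      # exact-match key -> response
        self.prompts = []    # (prompt, response) pairs for similarity scan
        self.semantic_threshold = semantic_threshold

    @staticmethod
    def _key(prompt: str) -> str:
        # Normalize whitespace and case so trivially different requests collide.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, prompt: str):
        key = self._key(prompt)
        if key in self.store:                             # exact hit
            return self.store[key], "exact"
        for cached_prompt, response in self.prompts:      # semantic scan
            score = SequenceMatcher(
                None, prompt.lower(), cached_prompt.lower()
            ).ratio()
            if score >= self.semantic_threshold:
                return response, "semantic"
        return None, None                                 # miss -> route to a provider

    def put(self, prompt: str, response: str):
        self.store[self._key(prompt)] = response
        self.prompts.append((prompt, response))

cache = GatewayCache()
cache.put("What is the capital of France?", "Paris")
print(cache.get("what is  the capital of France?"))   # exact hit after normalization
print(cache.get("What is the capital of France??"))   # near-duplicate -> semantic hit
```

A hit at either layer is what the `x-xantly-cache-type` header reports as `exact` or `semantic`.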
3. Intelligent routing
If the request isn't cached, the routing engine analyzes it — complexity, intent, required capabilities — and selects the optimal model. Simple factual queries go to fast, cheap models. Complex reasoning goes to frontier models. The routing model improves continuously from feedback.
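The shape of that decision can be shown with a toy heuristic. Xantly's real engine is a learned model, and the marker list and model names below are purely illustrative:

```python
# Hypothetical complexity signals -- the real router is a trained model.
REASONING_MARKERS = ("prove", "derive", "step by step", "analyze", "compare")

def select_model(prompt: str) -> str:
    """Map a request to a model tier by rough complexity signals."""
    text = prompt.lower()
    long_input = len(text.split()) > 200                    # long context -> stronger model
    needs_reasoning = any(m in text for m in REASONING_MARKERS)
    if needs_reasoning or long_input:
        return "claude-3-5-sonnet"                          # frontier tier
    return "gpt-4o-mini"                                    # fast, cheap tier

print(select_model("What year did the Berlin Wall fall?"))              # cheap tier
print(select_model("Prove that sqrt(2) is irrational, step by step."))  # frontier tier
```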
4. Provider call
The selected provider receives your request. The gateway handles retries, failover, and hedging automatically. If a provider is down, traffic waterfalls to the next best option.
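The waterfall behavior looks roughly like the sketch below. The provider callables, retry counts, and error handling here are all hypothetical stand-ins:

```python
import time

def call_with_waterfall(providers, request, max_attempts_per_provider=2):
    """Try providers in preference order; retry, then fall through on failure.
    Sketch only -- the provider chain and backoff policy are hypothetical."""
    last_error = None
    for name, call in providers:
        for _attempt in range(max_attempts_per_provider):
            try:
                return name, call(request)
            except Exception as err:        # timeout, 5xx, rate limit, ...
                last_error = err
                time.sleep(0)               # real gateway: jittered backoff
    raise RuntimeError(f"all providers failed: {last_error}")

# Simulated providers: the first is down, the second succeeds.
def broken(req):
    raise ConnectionError("provider outage")

def healthy(req):
    return f"answer to: {req}"

provider_chain = [("openai", broken), ("anthropic", healthy)]
print(call_with_waterfall(provider_chain, "Hello!"))
# -> ('anthropic', 'answer to: Hello!')
```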
5. Response delivery
Responses stream back to your application in real-time. Response headers include the model used, cost, latency, and routing metadata — full transparency.
6. Async learning
After the response is delivered (never blocking your request), the gateway:
- Stores the exchange in your organization's memory
- Extracts knowledge (entities, facts, patterns) for future context enrichment
- Sends feedback to the routing model so it improves over time
Provider ecosystem
Xantly routes across major LLM providers. You don't need API keys for each — Xantly manages provider relationships. Or bring your own keys (BYOK) for any provider.
| Provider | Models | Strengths |
|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini | General purpose, tool calling |
| Anthropic | Claude 3.5 Sonnet, Haiku | Long context, reasoning |
| Google | Gemini 1.5 Flash, Pro | Speed, multimodal |
| Groq | Llama 3.3 70B | Ultra-low latency |
| DeepSeek | DeepSeek V3 | Cost efficiency, code |
| Open-weight | Via OpenRouter | Maximum flexibility |
Use model: "auto" and let the gateway choose. Or specify a model directly — it works either way.
Key differentiators
Intelligent routing
Not all tasks need the most expensive model. Xantly's routing engine matches each request to the right model based on complexity, latency requirements, and cost constraints. The result: same quality output at 40-70% lower cost.
Multi-layer caching
Identical and similar requests are served from cache — zero tokens, zero cost, sub-millisecond latency. For agentic workflows with repetitive patterns, this eliminates redundant LLM calls entirely.
Persistent memory
Every conversation enriches an organization-level knowledge base. The gateway learns your domain, remembers context across sessions, and assembles relevant knowledge automatically. The longer you use Xantly, the smarter it gets.
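A minimal sketch of that loop, under assumptions: the `OrgMemory` class and keyword-overlap retrieval below are hypothetical, and Xantly's real memory does richer knowledge extraction:

```python
import re

def _terms(text: str) -> set:
    """Tokenize to lowercase word terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class OrgMemory:
    """Hypothetical org-level memory: store exchanges, then retrieve
    relevant context for new requests by keyword overlap."""

    def __init__(self):
        self.exchanges = []

    def store(self, prompt: str, response: str):
        self.exchanges.append((prompt, response))

    def relevant_context(self, prompt: str, top_k: int = 2):
        query = _terms(prompt)
        scored = [
            (len(query & _terms(p)), p, r) for p, r in self.exchanges
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(p, r) for score, p, r in scored[:top_k] if score > 0]

memory = OrgMemory()
memory.store("What is our refund policy?", "30 days, no questions asked.")
memory.store("Who owns the billing service?", "The payments team.")
print(memory.relevant_context("Summarize the refund policy for a customer"))
```

Retrieved exchanges would be assembled into the prompt before the provider call, which is how context persists across sessions.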
Zero integration effort
Xantly is fully compatible with the OpenAI SDK. Migration is one line:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xantly.com/v1",  # ← only change
    api_key="your-xantly-api-key"
)

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

Every OpenAI SDK feature works: streaming, tool calling, structured output, function calling, vision. No code changes beyond base_url.
Quality & reliability
Xantly is continuously validated through a multi-layer testing framework:
- 423 unit tests covering routing, caching, memory, normalization, and provider logic
- Deep Grading v3 — 17 end-to-end test suites (protocol compliance, agentic workflows, streaming, concurrency, RAG, structured output)
- Benchmark suites — reasoning quality, safety/jailbreak resistance, endpoint coverage
- Waterfall fallback with 36+ provider API keys ensures zero single-point-of-failure routing
- Mission Control telemetry on every request — full observability from request to response
What's next
- Intelligent Routing — How model selection works and how to control it
- Caching & Performance — How multi-layer caching saves cost and latency
- Memory & Context — How persistent memory enriches your AI workflows
- Quickstart — Send your first request in under 5 minutes
Frequently Asked Questions
What is an AI gateway?
An AI gateway is an infrastructure layer that sits between your application and LLM providers, handling routing, authentication, caching, and observability in a single unified API. Xantly acts as this gateway — your app sends one standard OpenAI-format request, and the gateway authenticates, checks cache, selects the optimal model across 15 providers, executes the call with automatic failover, and streams the response back with full cost and latency metadata.
How does Xantly reduce costs?
Xantly reduces costs through three reinforcing mechanisms: intelligent routing selects the cheapest model capable of handling each request (saving 40-70% vs. always using a frontier model), semantic caching serves repeat and near-repeat queries at zero token cost (62% hit rate, sub-5ms responses), and persistent memory reduces per-request token counts by curating only the relevant context. Combined, these deliver up to 80% cost reduction for production workloads.
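The compounding can be checked with back-of-envelope arithmetic. The cache hit rate is the figure quoted above; the assumption that routed misses cost half the frontier baseline (inside the 40-70% savings range) is illustrative:

```python
# Back-of-envelope blended cost, normalized so 1.0 = always frontier, no cache.
baseline_cost = 1.00
cache_hit_rate = 0.62       # cached requests cost zero tokens
routing_cost_ratio = 0.50   # assumed: routed misses cost half the baseline

blended = (1 - cache_hit_rate) * routing_cost_ratio * baseline_cost
savings = 1 - blended
print(f"blended cost per request: {blended:.2f}")   # 0.19
print(f"overall savings: {savings:.0%}")            # 81%
```

Memory's token savings push the number further still, which is where the "up to 80%" figure comes from.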
What happens if a provider goes down?
Xantly's waterfall fallback mechanism automatically retries with the next best model when a provider fails, times out, or returns an error. The gateway maintains 36+ provider API keys and performs multi-key rotation, so a single provider outage never blocks your requests. Failover decisions happen in under 2ms, and circuit breakers proactively route around degraded providers before they fail completely.
How much latency does Xantly add?
Xantly adds a median overhead of 12ms for routed requests — covering authentication, cache check, task classification, and model selection. Cache hits are even faster: exact matches return in under 5ms and semantic matches in under 20ms, both at zero token cost. The cache check itself adds less than 2ms of overhead on a miss, so there is no performance penalty for having caching enabled.
Architecture
How Xantly works internally — platform overview, intelligent routing, caching, and the memory cascade.
Intelligent Routing
Xantly's routing engine automatically selects the optimal model for each request based on task complexity, latency requirements, cost constraints, and historical performance. No configuration required.