Xantly
Architecture

Platform Overview

Xantly is an AI infrastructure layer that sits between your application and LLM providers. One API call. Automatic routing. Built-in caching. Persistent memory. Zero lock-in.


How Xantly fits in your stack

You change one line of code — your base_url — and get intelligent routing, caching, memory, and cost optimization across every major LLM provider.


What happens when you make a request

Every call to /v1/chat/completions goes through six stages:

1. Authentication

Your API key is verified and mapped to your organization. Rate limits, budgets, and permissions are enforced.
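One of the checks at this stage, rate limiting, can be sketched as a token bucket. This is an illustrative pattern, not Xantly's actual implementation; the capacity and refill rate below are made-up values.

```python
import time

class TokenBucket:
    """Illustrative per-key rate limiter of the kind an auth layer might
    enforce. Capacity and refill rate are example values only."""

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the gateway would return HTTP 429 here

bucket = TokenBucket(capacity=5, refill_per_sec=1.0)
results = [bucket.allow() for _ in range(6)]
# a burst of 6 calls: the first 5 pass, the 6th is rejected
```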

2. Cache check

The gateway checks if an identical or semantically similar request has been answered recently. Cache hits return instantly with zero token cost. Response headers tell you what happened: x-xantly-cache-hit: true and x-xantly-cache-type: exact or semantic.
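A small helper for interpreting these headers. The header names come from this page; the helper itself is a hypothetical sketch, not part of any SDK. (With the OpenAI Python SDK, response headers are reachable via `with_raw_response`.)

```python
def parse_cache_headers(headers: dict) -> dict:
    """Interpret Xantly's cache headers (x-xantly-cache-hit,
    x-xantly-cache-type). Hypothetical helper for illustration."""
    hit = headers.get("x-xantly-cache-hit", "false").lower() == "true"
    return {
        "cache_hit": hit,
        # "exact" or "semantic" on a hit, None otherwise
        "cache_type": headers.get("x-xantly-cache-type") if hit else None,
    }

info = parse_cache_headers(
    {"x-xantly-cache-hit": "true", "x-xantly-cache-type": "semantic"}
)
# info == {"cache_hit": True, "cache_type": "semantic"}
```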

3. Intelligent routing

If the request isn't cached, the routing engine analyzes it — complexity, intent, required capabilities — and selects the optimal model. Simple factual queries go to fast, cheap models. Complex reasoning goes to frontier models. The routing model improves continuously from feedback.
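The routing decision can be sketched as a complexity-based lookup. The model names appear in the provider table below, but the thresholds and tiers here are invented for illustration; Xantly's actual routing model is learned, not rule-based.

```python
def route(prompt: str, require_reasoning: bool = False) -> str:
    """Toy complexity-based router. Thresholds and tier assignments
    are illustrative, not Xantly's actual routing policy."""
    words = len(prompt.split())
    if require_reasoning or words > 200:
        return "claude-3-5-sonnet"   # frontier model for complex reasoning
    if words > 40:
        return "gpt-4o"              # mid-tier general purpose
    return "gpt-4o-mini"             # fast, cheap model for simple queries

model = route("What is the capital of France?")
# a short factual query lands on the cheap tier
```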

4. Provider call

The selected provider receives your request. The gateway handles retries, failover, and hedging automatically. If a provider is down, traffic waterfalls to the next best option.
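The waterfall behavior can be sketched as a priority-ordered fallthrough. The provider callables below are stand-ins for real API clients, and the error handling is simplified.

```python
def call_with_waterfall(request, providers):
    """Try providers in priority order; on any failure (timeout, 5xx,
    rate limit, ...) fall through to the next. Illustrative sketch."""
    errors = []
    for name, call in providers:
        try:
            return name, call(request)
        except Exception as exc:
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")

def down(req):
    raise TimeoutError("provider unreachable")

def healthy(req):
    return f"response to {req!r}"

winner, resp = call_with_waterfall("hi", [("openai", down), ("anthropic", healthy)])
# the first provider times out, so traffic falls through to the second
```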

5. Response delivery

Responses stream back to your application in real time. Response headers include the model used, cost, latency, and routing metadata — full transparency.

6. Async learning

After the response is delivered (never blocking your request), the gateway:

  • Stores the exchange in your organization's memory
  • Extracts knowledge (entities, facts, patterns) for future context enrichment
  • Sends feedback to the routing model so it improves over time
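
The "deliver first, learn later" pattern above can be sketched with a background worker: the request path enqueues the exchange and returns immediately, and a separate thread does the storing. All names here are illustrative stand-ins.

```python
import queue
import threading

# Exchanges land on a queue; a background worker drains it.
learn_queue: "queue.Queue[dict]" = queue.Queue()
memory: list[dict] = []

def worker():
    while True:
        exchange = learn_queue.get()
        if exchange is None:          # shutdown sentinel
            break
        memory.append(exchange)       # stand-in for memory store + extraction
        learn_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_request(prompt: str) -> str:
    response = f"answer to {prompt}"  # stand-in for the provider call
    # Enqueue is non-blocking: the response is delivered before learning runs.
    learn_queue.put({"prompt": prompt, "response": response})
    return response

reply = handle_request("hello")
learn_queue.join()                    # demo only: wait for the worker to catch up
```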

Provider ecosystem

Xantly routes across major LLM providers. You don't need API keys for each — Xantly manages provider relationships. Or bring your own keys (BYOK) for any provider.

Provider     | Models                    | Strengths
OpenAI       | GPT-4o, GPT-4o-mini       | General purpose, tool calling
Anthropic    | Claude 3.5 Sonnet, Haiku  | Long context, reasoning
Google       | Gemini 1.5 Flash, Pro     | Speed, multimodal
Groq         | Llama 3.3 70B             | Ultra-low latency
DeepSeek     | DeepSeek V3               | Cost efficiency, code
Open-weight  | Via OpenRouter            | Maximum flexibility

Use model: "auto" and let the gateway choose. Or specify a model directly — it works either way.


Key differentiators

Intelligent routing

Not all tasks need the most expensive model. Xantly's routing engine matches each request to the right model based on complexity, latency requirements, and cost constraints. The result: same quality output at 40-70% lower cost.

Multi-layer caching

Identical and similar requests are served from cache — zero tokens, zero cost, sub-millisecond latency. For agentic workflows with repetitive patterns, this eliminates redundant LLM calls entirely.
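The two cache layers can be sketched as a hash lookup backed by a similarity search. Real semantic caches compare embedding vectors; this toy version uses bag-of-words cosine similarity so it stays self-contained. Threshold and structure are illustrative.

```python
import hashlib
import math
from collections import Counter

class TwoLayerCache:
    """Toy exact + semantic cache. Real semantic caches use embedding
    models; this sketch substitutes bag-of-words cosine similarity."""

    def __init__(self, threshold: float = 0.8):
        self.exact: dict[str, str] = {}
        self.entries: list[tuple[Counter, str]] = []
        self.threshold = threshold

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode()).hexdigest()

    @staticmethod
    def _sim(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, prompt: str):
        # Layer 1: exact match on a hash of the raw prompt.
        if (hit := self.exact.get(self._key(prompt))) is not None:
            return "exact", hit
        # Layer 2: nearest cached prompt above the similarity threshold.
        vec = Counter(prompt.lower().split())
        for cached_vec, answer in self.entries:
            if self._sim(vec, cached_vec) >= self.threshold:
                return "semantic", answer
        return None

    def put(self, prompt: str, answer: str):
        self.exact[self._key(prompt)] = answer
        self.entries.append((Counter(prompt.lower().split()), answer))

cache = TwoLayerCache()
cache.put("What is the capital of France?", "Paris")
```

A re-ask with different casing misses the exact layer but hits the semantic one; an unrelated question misses both.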

Persistent memory

Every conversation enriches an organization-level knowledge base. The gateway learns your domain, remembers context across sessions, and assembles relevant knowledge automatically. The longer you use Xantly, the smarter it gets.
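The store-then-enrich loop can be sketched with a toy memory class. Real knowledge extraction is model-driven; the capitalized-word regex and substring relevance below are crude placeholders, and all names are invented for illustration.

```python
import re
from collections import defaultdict

class OrgMemory:
    """Toy organization-level memory: stores exchanges and counts
    capitalized terms as 'entities'. Placeholder for model-driven
    extraction, not Xantly's actual mechanism."""

    def __init__(self):
        self.exchanges: list[dict] = []
        self.entities: defaultdict[str, int] = defaultdict(int)

    def store(self, prompt: str, response: str):
        self.exchanges.append({"prompt": prompt, "response": response})
        # Crude entity extraction: any capitalized word.
        for term in re.findall(r"\b[A-Z][a-z]+\b", prompt + " " + response):
            self.entities[term] += 1

    def relevant_context(self, prompt: str, k: int = 3) -> list[str]:
        # Naive relevance: known entities mentioned in the new prompt.
        return [e for e in self.entities if e in prompt][:k]

mem = OrgMemory()
mem.store("Tell me about Acme Corp", "Acme is a widget maker in Ohio")
ctx = mem.relevant_context("What does Acme sell?")
# a later session about "Acme" gets the stored entity surfaced as context
```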


Zero integration effort

Xantly is fully compatible with the OpenAI SDK. Migration is one line:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.xantly.com/v1",  # ← only change
    api_key="your-xantly-api-key"
)

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}]
)

Every OpenAI SDK feature works: streaming, tool calling, structured output, function calling, vision. No code changes beyond base_url.


Quality & reliability

Xantly is continuously validated through a multi-layer testing framework:

  • 423 unit tests covering routing, caching, memory, normalization, and provider logic
  • Deep Grading v3 — 17 end-to-end test suites (protocol compliance, agentic workflows, streaming, concurrency, RAG, structured output)
  • Benchmark suites — reasoning quality, safety/jailbreak resistance, endpoint coverage
  • Waterfall fallback with 36+ provider API keys — no single point of failure in routing
  • Mission Control telemetry on every request — full observability from request to response

Frequently Asked Questions

What is an AI gateway?

An AI gateway is an infrastructure layer that sits between your application and LLM providers, handling routing, authentication, caching, and observability in a single unified API. Xantly acts as this gateway — your app sends one standard OpenAI-format request, and the gateway authenticates, checks cache, selects the optimal model across 15 providers, executes the call with automatic failover, and streams the response back with full cost and latency metadata.

How does Xantly reduce costs?

Xantly reduces costs through three reinforcing mechanisms: intelligent routing selects the cheapest model capable of handling each request (saving 40-70% vs. always using a frontier model), semantic caching serves repeat and near-repeat queries at zero token cost (62% hit rate, sub-5ms responses), and persistent memory reduces per-request token counts by curating only the relevant context. Combined, these deliver up to 80% cost reduction for production workloads.
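As a rough worked example of how the first two mechanisms compound: the hit rate (62%) and a mid-range routing saving (~50%, from the 40-70% band) come from the text above; the baseline unit cost is a made-up number, and memory's per-request token reduction is left out, so this is an illustration rather than a pricing claim.

```python
# Rough arithmetic on the figures quoted above (baseline cost is arbitrary).
baseline_cost = 1.00    # cost per request if you always used a frontier model
routing_factor = 0.50   # intelligent routing: ~50% of baseline (mid of 40-70%)
cache_hit_rate = 0.62   # cached requests cost zero tokens

# Only cache misses reach a provider, and those are routed to cheaper models.
effective_cost = (1 - cache_hit_rate) * baseline_cost * routing_factor
reduction = 1 - effective_cost / baseline_cost
# effective_cost ≈ 0.19, i.e. roughly an 80% reduction, before memory's
# context-trimming savings are even counted
```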

What happens if a provider goes down?

Xantly's waterfall fallback mechanism automatically retries with the next best model when a provider fails, times out, or returns an error. The gateway maintains 36+ provider API keys and performs multi-key rotation, so a single provider outage never blocks your requests. Failover decisions happen in under 2ms, and circuit breakers proactively route around degraded providers before they fail completely.

How much latency does Xantly add?

Xantly adds a median overhead of 12ms for routed requests — covering authentication, cache check, task classification, and model selection. Cache hits are even faster: exact matches return in under 5ms and semantic matches in under 20ms, both at zero token cost. The cache check itself adds less than 2ms of overhead on a miss, so there is no performance penalty for having caching enabled.
