Xantly
Architecture

Caching & Performance

Xantly's multi-layer caching infrastructure eliminates redundant LLM calls. Identical requests return instantly. Similar requests return from semantic cache. Agentic workflows with repetitive patterns see 40-70% cost reduction — automatically.


Why caching matters for AI

LLM calls are expensive and slow. A single GPT-4o request costs $5-15 per million tokens and takes 1-3 seconds. In production agentic workflows, the same questions get asked repeatedly — customer support bots answering FAQ variations, code assistants generating similar patterns, data pipelines running identical extractions.

Without caching, you pay full price every time. With Xantly's caching, repeat and near-repeat requests return in under 5ms at zero token cost.


Two types of cache hits

Exact match

When the same request (same messages, same parameters) is sent again, the cached response is returned immediately.

  • Latency: Under 5ms
  • Cost: Zero tokens consumed
  • Response header: x-xantly-cache-type: exact

This catches identical requests from retry loops, page refreshes, and duplicate webhook triggers.
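One way to picture exact matching: derive a deterministic key from the full request body, so byte-identical requests always map to the same cache entry. This is a minimal sketch — the function name and hashing scheme are illustrative, not Xantly's documented internals:

```python
import hashlib
import json

def exact_cache_key(request_body: dict) -> str:
    """Derive a deterministic key from the full request body.

    Canonical JSON (sorted keys, no whitespace) guarantees that two
    dicts with the same messages and parameters produce the same key,
    regardless of key ordering in the client code.
    """
    canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same request, different dict ordering -> same key -> exact cache hit
a = exact_cache_key({"model": "auto", "messages": [{"role": "user", "content": "Hi"}]})
b = exact_cache_key({"messages": [{"role": "user", "content": "Hi"}], "model": "auto"})
assert a == b
```

Any change to the messages or parameters produces a different key, which is why exact match only catches true duplicates such as retries and replayed webhooks.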

Semantic match

When a similar request is sent — same intent, slightly different wording — the gateway detects the similarity and returns the cached response.

  • Latency: Under 20ms
  • Cost: Zero tokens consumed
  • Response header: x-xantly-cache-type: semantic

This catches rephrased questions, agentic loops with minor prompt variations, and template-based queries where only small details change.

# These would be semantic cache hits for each other:
"What is the capital of France?"
"Tell me the capital city of France"
"Capital of France?"

Multi-layer cache architecture

The caching infrastructure uses multiple layers optimized for different access patterns. Each layer is checked in order — the fastest layers are checked first.

Layer | Latency | What it catches
In-process cache | < 1ms | Hot repeated requests within the same gateway instance
Distributed cache | < 5ms | Exact matches across all gateway instances
Semantic cache | < 20ms | Near-identical requests with different wording

If all cache layers miss, the request proceeds to the routing engine and LLM provider. After a successful response, the result is stored in all applicable cache layers for future hits.
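The lookup order above can be sketched as a simple chain: check the fastest layer first, fall through on a miss, and populate every layer after a successful response. The class below is an in-memory sketch only — the real gateway's storage backends (for example, a shared store for the distributed layer) are not specified here:

```python
class LayeredCache:
    """Minimal sketch of layered lookup: fastest layer first,
    populate all layers on a successful upstream response."""

    def __init__(self):
        # Insertion order defines check order (dicts preserve it in Python 3.7+).
        self.layers = {"in-process": {}, "distributed": {}, "semantic": {}}

    def get(self, key):
        """Return (layer_name, value) for the first layer that hits,
        or (None, None) if every layer misses."""
        for name, store in self.layers.items():
            if key in store:
                return name, store[key]
        return None, None

    def put(self, key, value):
        """Store the response in all applicable layers for future hits."""
        for store in self.layers.values():
            store[key] = value
```

Note one simplification: the semantic layer here is keyed exactly like the others, whereas the real semantic cache matches by similarity rather than key equality.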


Cross-conversation deduplication

In agentic workflows, the same question often appears across different conversation sessions. For example:

  • A customer support agent handles 100 similar password reset requests per day
  • A code assistant generates the same boilerplate across different projects
  • A data pipeline extracts the same fields from similar documents

Xantly automatically detects these cross-conversation patterns. Even without explicit conversation_id linking, the system normalizes request intents and deduplicates across sessions. This means your 100th password-reset question is served from cache — not sent to the LLM.
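A lexical approximation of that cross-session deduplication: collapse trivial variation (case, whitespace, punctuation) so the same question maps to one key regardless of which conversation it came from. This is illustrative only — the gateway's actual intent normalization is semantic, not just lexical:

```python
import hashlib
import re

def normalized_intent_key(question: str) -> str:
    """Map trivially-varying phrasings of one question to one dedup key.

    Lowercases, strips punctuation, and collapses whitespace, so the
    100th password-reset variant keys into the same cache slot as the first.
    """
    text = re.sub(r"[^a-z0-9 ]", "", question.lower())
    text = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Different sessions, different surface forms, one key:
assert normalized_intent_key("How do I reset my password?") == \
       normalized_intent_key("how do i reset my  password")
```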


Cache controls

Enabling/disabling cache

Cache is enabled by default. To disable for a specific request:

{
  "model": "auto",
  "messages": [...],
  "xantly": {
    "enable_cache": false
  }
}

Or use the header: X-Xantly-Memory-Skip-Cache: true
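Both opt-outs can be applied from a small client-side helper. The helper below is a sketch (the function name and endpoint handling are hypothetical; only the `xantly.enable_cache` field and the `X-Xantly-Memory-Skip-Cache` header come from this page):

```python
def build_request(messages: list, *, use_cache: bool = True) -> tuple:
    """Build the JSON body and headers for a chat request,
    applying both documented cache opt-outs when caching is disabled."""
    body = {"model": "auto", "messages": messages}
    headers = {"Content-Type": "application/json"}
    if not use_cache:
        body["xantly"] = {"enable_cache": False}
        headers["X-Xantly-Memory-Skip-Cache"] = "true"
    return body, headers

body, headers = build_request(
    [{"role": "user", "content": "What is the capital of France?"}],
    use_cache=False,
)
```

Either mechanism alone is sufficient; sending both is harmless and makes the intent explicit in logs on both sides.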

Checking cache status

Every response includes cache headers:

Header | Values | Description
x-xantly-cache-hit | true / false | Whether this response was served from cache
x-xantly-cache-type | exact / semantic | Type of cache match (only present on hits)
x-xantly-semantic-similarity | 0.0 - 1.0 | Similarity score for semantic matches
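A small helper can turn those headers into a structured summary for logging or metrics. The function name is illustrative; the header names are the ones documented above (lookups are lowercased since HTTP header names are case-insensitive):

```python
def cache_info(headers: dict) -> dict:
    """Summarize Xantly cache headers from a response into a dict."""
    h = {k.lower(): v for k, v in headers.items()}
    hit = h.get("x-xantly-cache-hit") == "true"
    info = {"hit": hit}
    if hit:
        # Only present on hits, per the header table above.
        info["type"] = h.get("x-xantly-cache-type")
        similarity = h.get("x-xantly-semantic-similarity")
        if similarity is not None:
            info["similarity"] = float(similarity)
    return info

cache_info({
    "x-xantly-cache-hit": "true",
    "x-xantly-cache-type": "semantic",
    "x-xantly-semantic-similarity": "0.93",
})
# {"hit": True, "type": "semantic", "similarity": 0.93}
```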

What's not cached

Some request types are never cached to ensure correctness:

  • Streaming responses (stream: true)
  • Requests with tool definitions (function calling)
  • Requests with structured output schemas (response_format.type: "json_schema")
  • Requests with explicit cache-skip headers

Cost impact

Cached responses consume zero tokens and incur zero cost. They don't count against your rate limits either.

For typical production workloads:

Workload type | Typical cache hit rate | Cost reduction
Customer support bots | 40-60% | 40-60%
Code generation assistants | 20-35% | 20-35%
Data extraction pipelines | 50-70% | 50-70%
FAQ / knowledge base queries | 60-80% | 60-80%

Cache hit rates improve over time as more responses are stored. The first few hundred requests build the cache; subsequent requests increasingly benefit.
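Because cache hits cost zero, savings scale linearly with the hit rate, so the cost impact is easy to estimate. A back-of-envelope helper (the volume and per-request cost below are hypothetical example numbers, not benchmarks):

```python
def monthly_savings(requests_per_month: int,
                    avg_cost_per_request: float,
                    cache_hit_rate: float) -> float:
    """Estimated monthly savings: cache hits cost zero, so savings
    are simply (volume) x (cost per request) x (hit rate)."""
    return requests_per_month * avg_cost_per_request * cache_hit_rate

# e.g. 1M requests/month at $0.01 each with a 50% hit rate:
# 1_000_000 * 0.01 * 0.5 -> roughly $5,000/month saved
monthly_savings(1_000_000, 0.01, 0.5)
```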


Performance characteristics

Metric | Value
In-process cache hit latency | < 1ms
Distributed cache hit latency | < 5ms
Semantic cache hit latency | < 20ms
Cache miss overhead | < 2ms (negligible)
Maximum cached response age | 1 hour (configurable)
Cache isolation | Per-organization (strict tenant isolation)

Cache misses add negligible overhead — the cache check itself is extremely fast. There is no performance penalty for having caching enabled.



Frequently Asked Questions

What is semantic caching?

Semantic caching matches requests by meaning rather than exact text — so "What is the capital of France?" and "Tell me the capital city of France" return the same cached response. Xantly's semantic cache achieves a 62% hit rate for typical production workloads, returning results in under 20ms at zero token cost. It uses cosine similarity with Jaccard safety thresholds to ensure only genuinely equivalent queries are matched.

How does caching reduce costs?

Cache hits consume zero tokens and incur zero cost — they don't count against your rate limits either. For production workloads with repetitive patterns (customer support bots, code assistants, data pipelines), caching delivers 40-80% cost reduction automatically. Cache hit rates improve over time as more responses are stored, and cross-conversation deduplication means your 100th similar question is served from cache, not sent to the LLM.

Can I disable caching?

Yes. You can disable caching per-request by setting xantly.enable_cache: false in the request body, or by sending the header X-Xantly-Memory-Skip-Cache: true. You can also use intelligence_mode: proxy to bypass the entire caching and memory pipeline for maximum raw speed with zero overhead. Response headers (x-xantly-cache-hit, x-xantly-cache-type) always tell you whether a response was served from cache.

What's the difference between exact and semantic cache?

Exact cache matches identical requests (same messages, same parameters) and returns in under 1ms from the in-process layer or under 5ms from the distributed layer — it catches retry loops, page refreshes, and duplicate webhook triggers. Semantic cache matches requests with the same intent but different wording and returns in under 20ms — it catches rephrased questions, FAQ variations, and template-based queries where only small details change.
