Xantly
Architecture

Caching & Performance

Xantly's multi-layer caching infrastructure eliminates redundant LLM calls. Identical requests return instantly. Similar requests return from semantic cache. Agentic workflows with repetitive patterns see 40-70% cost reduction — automatically.


Why caching matters for AI

LLM calls are expensive and slow. A single GPT-4o request costs $5-15 per million tokens and takes 1-3 seconds. In production agentic workflows, the same questions get asked repeatedly — customer support bots answering FAQ variations, code assistants generating similar patterns, data pipelines running identical extractions.

Without caching, you pay full price every time. With Xantly's caching, repeat and near-repeat requests return in under 5ms at zero token cost.


Two types of cache hits

Exact match

When the same request (same messages, same parameters) is sent again, the cached response is returned immediately.

  • Latency: Under 5ms
  • Cost: Zero tokens consumed
  • Response header: x-xantly-cache-type: exact

This catches identical requests from retry loops, page refreshes, and duplicate webhook triggers.
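One way to picture exact matching: derive a deterministic key from the full request body, so byte-identical requests always map to the same cache entry. This is a minimal sketch — the function name and hashing scheme are illustrative, not Xantly's documented internals:

```python
import hashlib
import json

def exact_cache_key(request_body: dict) -> str:
    """Derive a deterministic key from the full request body.

    Canonical JSON (sorted keys, no whitespace) guarantees that two
    dicts with the same messages and parameters produce the same key,
    regardless of key ordering in the client code.
    """
    canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same request, different dict ordering -> same key -> exact cache hit
a = exact_cache_key({"model": "auto", "messages": [{"role": "user", "content": "Hi"}]})
b = exact_cache_key({"messages": [{"role": "user", "content": "Hi"}], "model": "auto"})
assert a == b
```

Any change to the messages or parameters produces a different key, which is why exact match only catches true duplicates such as retries and replayed webhooks.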

Semantic match

When a similar request is sent — same intent, slightly different wording — the gateway detects the similarity and returns the cached response.

  • Latency: Under 20ms
  • Cost: Zero tokens consumed
  • Response header: x-xantly-cache-type: semantic

This catches rephrased questions, agentic loops with minor prompt variations, and template-based queries where only small details change.

# These would be semantic cache hits for each other:
"What is the capital of France?"
"Tell me the capital city of France"
"Capital of France?"

Multi-layer cache architecture

The caching infrastructure uses multiple layers optimized for different access patterns. Each layer is checked in order — the fastest layers are checked first.

Layer | Latency | What it catches
In-process cache | < 1ms | Hot repeated requests within the same gateway instance
Distributed cache | < 5ms | Exact matches across all gateway instances
Semantic cache | < 20ms | Near-identical requests with different wording

If all cache layers miss, the request proceeds to the routing engine and LLM provider. After a successful response, the result is stored in all applicable cache layers for future hits.
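The lookup order above can be sketched as a simple chain: check the fastest layer first, fall through on a miss, and populate every layer after a successful response. The class below is an in-memory sketch only — the real gateway's storage backends (for example, a shared store for the distributed layer) are not specified here:

```python
class LayeredCache:
    """Minimal sketch of layered lookup: fastest layer first,
    populate all layers on a successful upstream response."""

    def __init__(self):
        # Insertion order defines check order (dicts preserve it in Python 3.7+).
        self.layers = {"in-process": {}, "distributed": {}, "semantic": {}}

    def get(self, key):
        """Return (layer_name, value) for the first layer that hits,
        or (None, None) if every layer misses."""
        for name, store in self.layers.items():
            if key in store:
                return name, store[key]
        return None, None

    def put(self, key, value):
        """Store the response in all applicable layers for future hits."""
        for store in self.layers.values():
            store[key] = value
```

Note one simplification: the semantic layer here is keyed exactly like the others, whereas the real semantic cache matches by similarity rather than key equality.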


Cross-conversation deduplication

In agentic workflows, the same question often appears across different conversation sessions. For example:

  • A customer support agent handles 100 similar password reset requests per day
  • A code assistant generates the same boilerplate across different projects
  • A data pipeline extracts the same fields from similar documents

Xantly automatically detects these cross-conversation patterns. Even without explicit conversation_id linking, the system normalizes request intents and deduplicates across sessions. This means your 100th password-reset question is served from cache — not sent to the LLM.
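A lexical approximation of that cross-session deduplication: collapse trivial variation (case, whitespace, punctuation) so the same question maps to one key regardless of which conversation it came from. This is illustrative only — the gateway's actual intent normalization is semantic, not just lexical:

```python
import hashlib
import re

def normalized_intent_key(question: str) -> str:
    """Map trivially-varying phrasings of one question to one dedup key.

    Lowercases, strips punctuation, and collapses whitespace, so the
    100th password-reset variant keys into the same cache slot as the first.
    """
    text = re.sub(r"[^a-z0-9 ]", "", question.lower())
    text = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Different sessions, different surface forms, one key:
assert normalized_intent_key("How do I reset my password?") == \
       normalized_intent_key("how do i reset my  password")
```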


Cache controls

Enabling/disabling cache

Cache is enabled by default. To disable for a specific request:

{
  "model": "auto",
  "messages": [...],
  "xantly": {
    "enable_cache": false
  }
}

Or use the header: X-Xantly-Memory-Skip-Cache: true
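Both opt-outs can be applied from a small client-side helper. The helper below is a sketch (the function name and endpoint handling are hypothetical; only the `xantly.enable_cache` field and the `X-Xantly-Memory-Skip-Cache` header come from this page):

```python
def build_request(messages: list, *, use_cache: bool = True) -> tuple:
    """Build the JSON body and headers for a chat request,
    applying both documented cache opt-outs when caching is disabled."""
    body = {"model": "auto", "messages": messages}
    headers = {"Content-Type": "application/json"}
    if not use_cache:
        body["xantly"] = {"enable_cache": False}
        headers["X-Xantly-Memory-Skip-Cache"] = "true"
    return body, headers

body, headers = build_request(
    [{"role": "user", "content": "What is the capital of France?"}],
    use_cache=False,
)
```

Either mechanism alone is sufficient; sending both is harmless and makes the intent explicit in logs on both sides.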

Checking cache status

Every response includes cache headers:

Header | Values | Description
x-xantly-cache-hit | true / false | Whether this response was served from cache
x-xantly-cache-type | exact / semantic | Type of cache match (only present on hits)
x-xantly-semantic-similarity | 0.0 - 1.0 | Similarity score for semantic matches
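A small helper can turn those headers into a structured summary for logging or metrics. The function name is illustrative; the header names are the ones documented above (lookups are lowercased since HTTP header names are case-insensitive):

```python
def cache_info(headers: dict) -> dict:
    """Summarize Xantly cache headers from a response into a dict."""
    h = {k.lower(): v for k, v in headers.items()}
    hit = h.get("x-xantly-cache-hit") == "true"
    info = {"hit": hit}
    if hit:
        # Only present on hits, per the header table above.
        info["type"] = h.get("x-xantly-cache-type")
        similarity = h.get("x-xantly-semantic-similarity")
        if similarity is not None:
            info["similarity"] = float(similarity)
    return info

cache_info({
    "x-xantly-cache-hit": "true",
    "x-xantly-cache-type": "semantic",
    "x-xantly-semantic-similarity": "0.93",
})
# {"hit": True, "type": "semantic", "similarity": 0.93}
```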

What's not cached

Some request types are never cached to ensure correctness:

  • Streaming responses (stream: true)
  • Requests with tool definitions (function calling)
  • Requests with structured output schemas (response_format.type: "json_schema")
  • Requests with explicit cache-skip headers

Cost impact

Cached responses consume zero tokens and incur zero cost. They don't count against your rate limits either.

For typical production workloads:

Workload type | Typical cache hit rate | Cost reduction
Customer support bots | 40-60% | 40-60%
Code generation assistants | 20-35% | 20-35%
Data extraction pipelines | 50-70% | 50-70%
FAQ / knowledge base queries | 60-80% | 60-80%

Cache hit rates improve over time as more responses are stored. The first few hundred requests build the cache; subsequent requests increasingly benefit.
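Because cache hits cost zero, savings scale linearly with the hit rate, so the cost impact is easy to estimate. A back-of-envelope helper (the volume and per-request cost below are hypothetical example numbers, not benchmarks):

```python
def monthly_savings(requests_per_month: int,
                    avg_cost_per_request: float,
                    cache_hit_rate: float) -> float:
    """Estimated monthly savings: cache hits cost zero, so savings
    are simply (volume) x (cost per request) x (hit rate)."""
    return requests_per_month * avg_cost_per_request * cache_hit_rate

# e.g. 1M requests/month at $0.01 each with a 50% hit rate:
# 1_000_000 * 0.01 * 0.5 -> roughly $5,000/month saved
monthly_savings(1_000_000, 0.01, 0.5)
```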


Performance characteristics

Metric | Value
In-process cache hit latency | < 1ms
Distributed cache hit latency | < 5ms
Semantic cache hit latency | < 20ms
Cache miss overhead | < 2ms (negligible)
Maximum cached response age | 1 hour (configurable)
Cache isolation | Per-organization (strict tenant isolation)

Cache misses add negligible overhead — the cache check itself is extremely fast. There is no performance penalty for having caching enabled.



Frequently Asked Questions

What is semantic caching?

Semantic caching matches requests by meaning rather than exact text — so "What is the capital of France?" and "Tell me the capital city of France" return the same cached response. Xantly's semantic cache achieves a 62% hit rate for typical production workloads, returning results in under 20ms at zero token cost. It uses cosine similarity with Jaccard safety thresholds to ensure only genuinely equivalent queries are matched.

How does caching reduce costs?

Cache hits consume zero tokens and incur zero cost — they don't count against your rate limits either. For production workloads with repetitive patterns (customer support bots, code assistants, data pipelines), caching delivers 40-80% cost reduction automatically. Cache hit rates improve over time as more responses are stored, and cross-conversation deduplication means your 100th similar question is served from cache, not sent to the LLM.

Can I disable caching?

Yes. You can disable caching per-request by setting xantly.enable_cache: false in the request body, or by sending the header X-Xantly-Memory-Skip-Cache: true. You can also use intelligence_mode: proxy to bypass the entire caching and memory pipeline for maximum raw speed with zero overhead. Response headers (x-xantly-cache-hit, x-xantly-cache-type) always tell you whether a response was served from cache.

What's the difference between exact and semantic cache?

Exact cache matches identical requests (same messages, same parameters) and returns in under 1ms from the in-process layer or under 5ms from the distributed layer — it catches retry loops, page refreshes, and duplicate webhook triggers. Semantic cache matches requests with the same intent but different wording and returns in under 20ms — it catches rephrased questions, FAQ variations, and template-based queries where only small details change.
