Memory & Context
Xantly maintains persistent, per-organization memory that enriches every AI interaction. Conversations build knowledge over time. Context is assembled intelligently — not brute-force. The longer you use Xantly, the smarter your AI workflows become.
The problem with stateless LLMs
Every LLM call starts from scratch. The model has no memory of previous conversations, no knowledge of your domain, and no understanding of your users' patterns. This forces developers to:
- Send full conversation history every request — wasting tokens and money
- Re-explain context that the system should already know
- Build custom RAG pipelines to provide domain knowledge
- Miss optimization opportunities because each request is isolated
Xantly solves this at the infrastructure level. Memory is built-in, automatic, and zero-configuration.
How memory works
Automatic session detection
Xantly detects multi-turn conversations automatically. You don't need to pass a conversation_id — the system infers session continuity from the message history. When the same conversation continues, previous context is available instantly.
If you want explicit control, you can scope memory with:
```json
{
  "model": "auto",
  "messages": [...],
  "xantly": {
    "conversation_id": "support-ticket-12345"
  }
}
```

Intelligent context assembly
Instead of sending your entire conversation history to the LLM, Xantly selects the most relevant context and injects it. This means:
- Token savings: Only high-relevance context is included — not the full history
- Better responses: The model sees curated, relevant information instead of noise
- Automatic budget management: Context injection respects the model's context window limits
The context assembly engine considers:
- Recent conversation turns (what was just discussed)
- Semantic relevance (what's related to the current question)
- Known facts and entities (domain knowledge from your organization's history)
- Task patterns (solutions to similar problems from past conversations)
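One way to picture this kind of budgeted, relevance-weighted selection is the sketch below. This is illustrative only, not Xantly's actual implementation: the `assemble_context` helper, its fields, and the weights are all our assumptions.

```python
# Hypothetical sketch of context assembly: rank candidate memories by a
# weighted mix of recency and semantic relevance, then greedily pack the
# highest-scoring items under a token budget.
def assemble_context(candidates, budget_tokens, w_recency=0.4, w_relevance=0.6):
    ranked = sorted(
        candidates,
        key=lambda c: w_recency * c["recency"] + w_relevance * c["relevance"],
        reverse=True,
    )
    selected, used = [], 0
    for c in ranked:
        # Skip any item that would overflow the context budget.
        if used + c["tokens"] <= budget_tokens:
            selected.append(c["text"])
            used += c["tokens"]
    return selected
```

The greedy pass means a highly relevant but oversized item is skipped in favor of smaller items that still fit, which is what keeps assembled context compact relative to full history.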
Knowledge extraction
After every response, the system asynchronously extracts knowledge:
- Entities: People, products, organizations, concepts mentioned in conversations
- Facts: Relationships and statements that can be recalled later
- Patterns: Recurring task types and their solutions
This extraction never blocks your response — it happens in the background after the reply is delivered.
Knowledge persistence
Extracted knowledge persists across sessions. When a future request mentions a known entity or relates to a learned fact, that knowledge is automatically injected into context. This creates a compounding effect:
Day 1: Customer asks about Product X → LLM answers → system learns about Product X
Day 30: Customer asks about Product X → system already knows about Product X
→ smaller context needed → faster, cheaper, more accurate response

Memory controls
Scoping memory
```jsonc
{
  "xantly": {
    "conversation_id": "session-abc",  // Scope to a specific conversation
    "enable_memory": true              // Enable/disable memory (default: true)
  }
}
```

HTTP headers
| Header | Description |
|---|---|
| `X-Xantly-Memory-Enabled` | `true` / `false` — enable or disable memory for this request |
| `X-Xantly-Memory-Conversation-Id` | Scope memory to a specific conversation |
| `X-Xantly-Memory-Context-Budget` | Maximum tokens for injected context (default: 2048) |
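With the OpenAI SDK, these headers can be passed per request. A minimal sketch, where the `memory_headers` helper is our own convenience wrapper (the header names come from the table above):

```python
# Build Xantly memory headers for a single request.
def memory_headers(enabled=True, conversation_id=None, context_budget=None):
    headers = {"X-Xantly-Memory-Enabled": "true" if enabled else "false"}
    if conversation_id is not None:
        headers["X-Xantly-Memory-Conversation-Id"] = conversation_id
    if context_budget is not None:
        # Header values must be strings.
        headers["X-Xantly-Memory-Context-Budget"] = str(context_budget)
    return headers

# With the OpenAI SDK, pass the result via `extra_headers`:
#   client.chat.completions.create(
#       model="auto",
#       messages=messages,
#       extra_headers=memory_headers(conversation_id="support-ticket-12345"),
#   )
```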
Disabling memory
If you don't want memory for a specific request:
```json
{
  "xantly": { "enable_memory": false }
}
```

This is useful for one-off queries where conversation context would be irrelevant.
Memory API
In addition to automatic memory enrichment through the `/v1/chat/completions` endpoint, you can interact with the memory system directly.
Store knowledge explicitly
```bash
curl -X POST https://api.xantly.com/v1/memory/store \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Our refund policy allows returns within 30 days of purchase.",
    "metadata": { "category": "policy", "source": "handbook" }
  }'
```

Search memory
```bash
curl -X POST https://api.xantly.com/v1/memory/search \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is our refund policy?",
    "limit": 5
  }'
```

Check memory health
```bash
curl https://api.xantly.com/v1/memory/health \
  -H "Authorization: Bearer $XANTLY_API_KEY"
```

Returns a health score (0-100) with breakdowns for entity coverage, fact freshness, and retrieval relevance.
How memory improves routing
Memory signals feed back into the routing engine. When the system has high-confidence context for a request, it can:
- Route to cheaper models — strong context compensates for model capability, reducing cost
- Skip LLM calls entirely — for factual queries with high-confidence cached answers
- Reduce context window usage — assembled context is more compact than full history
This creates a virtuous cycle: more memory → better routing → lower cost → more value.
Pattern learning for agentic workflows
For workflows that handle repetitive tasks — customer support, data processing, code generation — the memory system identifies recurring patterns:
- Pattern detection: After seeing similar requests multiple times, the system recognizes the pattern
- Solution templates: Common resolution paths are learned and stored
- Faster resolution: Future similar requests benefit from learned patterns — sometimes without needing an LLM call at all
For example, a customer support workflow that handles 1,000 password reset queries per month will see the system learn the resolution pattern early on. Subsequent queries matching that pattern get faster, cheaper responses.
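The reuse logic can be pictured as a counting cache: serve a stored resolution only once the pattern is well established. This is a hypothetical sketch, not Xantly's API; `PatternCache`, its `threshold`, and the idea of keying on an exact pattern string are all our simplifications (a real system would match patterns semantically).

```python
from collections import defaultdict

class PatternCache:
    """Illustrative pattern-learning cache: answer from memory only after
    a pattern has been seen `threshold` times."""

    def __init__(self, threshold=3):
        self.counts = defaultdict(int)
        self.solutions = {}
        self.threshold = threshold

    def record(self, pattern_key, solution):
        # Each observation strengthens the pattern and updates its solution.
        self.counts[pattern_key] += 1
        self.solutions[pattern_key] = solution

    def lookup(self, pattern_key):
        # Below the threshold, fall through to a normal LLM call (None).
        if self.counts[pattern_key] >= self.threshold:
            return self.solutions.get(pattern_key)
        return None
```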
Privacy and isolation
- Strict tenant isolation: Your organization's memory is never shared with or accessible to other organizations
- Data stays yours: Memory is scoped to your API key / organization
- Opt-out available: Disable memory entirely per-request or per-organization
- No cross-contamination: Different `conversation_id` scopes are isolated from each other
Integration with agent frameworks
Memory works automatically with any framework that uses the OpenAI SDK:
```python
# LangChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://api.xantly.com/v1",
    api_key="your-xantly-key",
    default_headers={"X-Xantly-Memory-Enabled": "true"}
)
```

```python
# Direct OpenAI SDK
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xantly.com/v1",
    api_key="your-xantly-key"
)

# Memory is enabled by default — no extra configuration needed
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's our refund policy?"}]
)
```

No special SDKs, no additional dependencies, no migration effort.
What's next
- Platform Overview — How all the pieces fit together
- Intelligent Routing — How memory makes routing smarter
- Multi-Agent Orchestration — Build agent pipelines with built-in memory
- Chat Completions API — Full API reference with memory parameters