Memory & Context
Xantly maintains persistent, per-organization memory that enriches every AI interaction. Conversations build knowledge over time. Context is assembled intelligently — not brute-force. The longer you use Xantly, the smarter your AI workflows become.
The problem with stateless LLMs
Every LLM call starts from scratch. The model has no memory of previous conversations, no knowledge of your domain, and no understanding of your users' patterns. This forces developers to:
- Send full conversation history every request — wasting tokens and money
- Re-explain context that the system should already know
- Build custom RAG pipelines to provide domain knowledge
- Miss optimization opportunities because each request is isolated
Xantly solves this at the infrastructure level. Memory is built-in, automatic, and zero-configuration.
How memory works
Automatic session detection
Xantly detects multi-turn conversations automatically. You don't need to pass a conversation_id — the system infers session continuity from the message history. When the same conversation continues, previous context is available instantly.
If you want explicit control, you can scope memory with:
```json
{
  "model": "auto",
  "messages": [...],
  "xantly": {
    "conversation_id": "support-ticket-12345"
  }
}
```

Intelligent context assembly
Instead of sending your entire conversation history to the LLM, Xantly selects the most relevant context and injects it. This means:
- Token savings: Only high-relevance context is included — not the full history
- Better responses: The model sees curated, relevant information instead of noise
- Automatic budget management: Context injection respects the model's context window limits
The context assembly engine considers:
- Recent conversation turns (what was just discussed)
- Semantic relevance (what's related to the current question)
- Known facts and entities (domain knowledge from your organization's history)
- Task patterns (solutions to similar problems from past conversations)
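One way to picture this kind of budgeted, relevance-weighted selection is the sketch below. This is illustrative only, not Xantly's actual implementation: the `assemble_context` helper, its fields, and the weights are all our assumptions.

```python
# Hypothetical sketch of context assembly: rank candidate memories by a
# weighted mix of recency and semantic relevance, then greedily pack the
# highest-scoring items under a token budget.
def assemble_context(candidates, budget_tokens, w_recency=0.4, w_relevance=0.6):
    ranked = sorted(
        candidates,
        key=lambda c: w_recency * c["recency"] + w_relevance * c["relevance"],
        reverse=True,
    )
    selected, used = [], 0
    for c in ranked:
        # Skip any item that would overflow the context budget.
        if used + c["tokens"] <= budget_tokens:
            selected.append(c["text"])
            used += c["tokens"]
    return selected
```

The greedy pass means a highly relevant but oversized item is skipped in favor of smaller items that still fit, which is what keeps assembled context compact relative to full history.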
Knowledge extraction
After every response, the system asynchronously extracts knowledge:
- Entities: People, products, organizations, concepts mentioned in conversations
- Facts: Relationships and statements that can be recalled later
- Patterns: Recurring task types and their solutions
This extraction never blocks your response — it happens in the background after the reply is delivered.
Knowledge persistence
Extracted knowledge persists across sessions. When a future request mentions a known entity or relates to a learned fact, that knowledge is automatically injected into context. This creates a compounding effect:
Day 1: Customer asks about Product X → LLM answers → system learns about Product X
Day 30: Customer asks about Product X → system already knows about Product X
→ smaller context needed → faster, cheaper, more accurate response

Memory controls
Scoping memory
```jsonc
{
  "xantly": {
    "conversation_id": "session-abc",  // Scope to a specific conversation
    "enable_memory": true              // Enable/disable memory (default: true)
  }
}
```

HTTP headers
| Header | Description |
|---|---|
| `X-Xantly-Memory-Enabled` | `true` / `false` — enable or disable memory for this request |
| `X-Xantly-Memory-Conversation-Id` | Scope memory to a specific conversation |
| `X-Xantly-Memory-Context-Budget` | Maximum tokens for injected context (default: 2048) |
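With the OpenAI SDK, these headers can be passed per request. A minimal sketch, where the `memory_headers` helper is our own convenience wrapper (the header names come from the table above):

```python
# Build Xantly memory headers for a single request.
def memory_headers(enabled=True, conversation_id=None, context_budget=None):
    headers = {"X-Xantly-Memory-Enabled": "true" if enabled else "false"}
    if conversation_id is not None:
        headers["X-Xantly-Memory-Conversation-Id"] = conversation_id
    if context_budget is not None:
        # Header values must be strings.
        headers["X-Xantly-Memory-Context-Budget"] = str(context_budget)
    return headers

# With the OpenAI SDK, pass the result via `extra_headers`:
#   client.chat.completions.create(
#       model="auto",
#       messages=messages,
#       extra_headers=memory_headers(conversation_id="support-ticket-12345"),
#   )
```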
Disabling memory
If you don't want memory for a specific request:
```json
{
  "xantly": { "enable_memory": false }
}
```

This is useful for one-off queries where conversation context would be irrelevant.
Memory API
In addition to automatic memory enrichment through the `/v1/chat/completions` endpoint, you can interact with the memory system directly.
Store knowledge explicitly
```bash
curl -X POST https://api.xantly.com/v1/memory/store \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Our refund policy allows returns within 30 days of purchase.",
    "metadata": { "category": "policy", "source": "handbook" }
  }'
```

Search memory
```bash
curl -X POST https://api.xantly.com/v1/memory/search \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is our refund policy?",
    "limit": 5
  }'
```

Check memory health
```bash
curl https://api.xantly.com/v1/memory/health \
  -H "Authorization: Bearer $XANTLY_API_KEY"
```

Returns a health score (0-100) with breakdowns for entity coverage, fact freshness, and retrieval relevance.
How memory improves routing
Memory signals feed back into the routing engine. When the system has high-confidence context for a request, it can:
- Route to cheaper models — strong context compensates for model capability, reducing cost
- Skip LLM calls entirely — for factual queries with high-confidence cached answers
- Reduce context window usage — assembled context is more compact than full history
This creates a virtuous cycle: more memory → better routing → lower cost → more value.
Pattern learning for agentic workflows
For workflows that handle repetitive tasks — customer support, data processing, code generation — the memory system identifies recurring patterns:
- Pattern detection: After seeing similar requests multiple times, the system recognizes the pattern
- Solution templates: Common resolution paths are learned and stored
- Faster resolution: Future similar requests benefit from learned patterns — sometimes without needing an LLM call at all
For example, a customer support workflow that handles 1,000 password reset queries per month will see the system learn the resolution pattern early on. Subsequent queries matching that pattern get faster, cheaper responses.
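The reuse logic can be pictured as a counting cache: serve a stored resolution only once the pattern is well established. This is a hypothetical sketch, not Xantly's API; `PatternCache`, its `threshold`, and the idea of keying on an exact pattern string are all our simplifications (a real system would match patterns semantically).

```python
from collections import defaultdict

class PatternCache:
    """Illustrative pattern-learning cache: answer from memory only after
    a pattern has been seen `threshold` times."""

    def __init__(self, threshold=3):
        self.counts = defaultdict(int)
        self.solutions = {}
        self.threshold = threshold

    def record(self, pattern_key, solution):
        # Each observation strengthens the pattern and updates its solution.
        self.counts[pattern_key] += 1
        self.solutions[pattern_key] = solution

    def lookup(self, pattern_key):
        # Below the threshold, fall through to a normal LLM call (None).
        if self.counts[pattern_key] >= self.threshold:
            return self.solutions.get(pattern_key)
        return None
```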
Privacy and isolation
- Strict tenant isolation: Your organization's memory is never shared with or accessible to other organizations
- Data stays yours: Memory is scoped to your API key / organization
- Opt-out available: Disable memory entirely per-request or per-organization
- No cross-contamination: Different `conversation_id` scopes are isolated from each other
Integration with agent frameworks
Memory works automatically with any framework that uses the OpenAI SDK:
```python
# LangChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://api.xantly.com/v1",
    api_key="your-xantly-key",
    default_headers={"X-Xantly-Memory-Enabled": "true"}
)
```

```python
# Direct OpenAI SDK
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xantly.com/v1",
    api_key="your-xantly-key"
)

# Memory is enabled by default — no extra configuration needed
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's our refund policy?"}]
)
```

No special SDKs, no additional dependencies, no migration effort.
What's next
- Platform Overview — How all the pieces fit together
- Intelligent Routing — How memory makes routing smarter
- Multi-Agent Orchestration — Build agent pipelines with built-in memory
- Chat Completions API — Full API reference with memory parameters