Xantly
Reference

Glossary

Key terms and concepts used throughout the Xantly platform and documentation.

AI Gateway

AI Gateway — A unified API layer that sits between applications and multiple LLM providers, providing routing, caching, failover, and observability. Xantly is an AI gateway that routes to 10,000+ models across 15 providers.

Intelligent Routing

Intelligent Routing — The process of analyzing each incoming request across 15 parameters (including task type, context length, cost constraints, and latency targets) and selecting the optimal model automatically. Xantly routes with 12ms median overhead.

Semantic Caching

Semantic Caching — A caching layer that matches requests by meaning rather than exact string match, using embedding similarity to serve cached responses for paraphrased or equivalent queries. Xantly achieves a 62% semantic cache hit rate with sub-5ms response times.
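A minimal sketch of the idea: embed the incoming prompt, compare it against cached entries by cosine similarity, and serve a cached response above a threshold. The bag-of-words "embedding" and the threshold value are illustrative stand-ins, not Xantly's actual implementation.

```python
import math
from collections import Counter

SIM_THRESHOLD = 0.9  # illustrative threshold, not a documented Xantly value

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real gateway uses a neural embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self):
        self._entries = []  # list of (embedding, response)

    def get(self, prompt: str):
        q = embed(prompt)
        for emb, response in self._entries:
            if cosine(q, emb) >= SIM_THRESHOLD:
                return response  # served from cache, no provider call
        return None

    def put(self, prompt: str, response: str):
        self._entries.append((embed(prompt), response))
```

With a neural embedding model in place of the toy one, paraphrases like "capital city of France?" would also land above the threshold.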

Waterfall Fallback

Waterfall Fallback — An automatic failover mechanism that tries the next provider when the primary one fails, rotating through providers and API keys based on error classification (rate limit, billing exhausted, context window exceeded). Ensures 99.999% routing reliability.
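The fallback loop can be sketched as follows. The error kinds and the `ProviderError` class are hypothetical; the point is the classification step that decides whether to rotate to the next provider or surface the error immediately.

```python
# Error kinds that justify rotating to the next provider (illustrative set).
RETRYABLE = {"rate_limit", "billing_exhausted", "provider_down"}

class ProviderError(Exception):
    def __init__(self, kind: str):
        super().__init__(kind)
        self.kind = kind

def call_with_fallback(providers, request):
    """Try each provider in order; fall through only on retryable errors."""
    last = None
    for provider in providers:
        try:
            return provider(request)
        except ProviderError as err:
            if err.kind not in RETRYABLE:
                raise  # non-retryable (e.g. invalid request): surface immediately
            last = err
    raise last  # every provider in the waterfall failed
```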

BYOK (Bring Your Own Key)

BYOK (Bring Your Own Key) — A mode where developers supply their own provider API keys (OpenAI, Anthropic, Groq, NVIDIA) through Xantly, getting direct billing from the provider while still benefiting from Xantly routing, caching, and observability.

FastLane / Smart Lane

FastLane / Smart Lane — The low-latency voice processing path in Xantly's two-lane hybrid architecture. FastLane handles simple, latency-critical voice requests with sub-300ms response times, while Smart Lane (DelegationLane) handles complex multi-step voice chains.

Mission Control

Mission Control — Xantly's built-in observability dashboard for production LLM workloads. Provides request tracing, cost analytics, provider health monitoring, routing decision visualization, cache performance metrics, and anomaly detection.

BaRP (Bayesian adaptive Routing and Preference)

BaRP (Bayesian adaptive Routing and Preference) — Xantly's continuous learning system that uses Bayesian methods to improve routing decisions over time. BaRP collects feedback from every request outcome (latency, quality, cost) and adjusts model selection probability distributions.

Preference Dial

Preference Dial — A continuous parameter (0.0 to 1.0) that controls the tradeoff between speed/cost and quality when Xantly selects a model. Lower values favor faster and cheaper models; higher values favor higher-quality models. Voice defaults to 0.3 for sub-300ms latency.
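One way to picture the dial is as a blend weight between an efficiency score and a quality score per candidate model. The field names and scoring formula below are assumptions for illustration, not Xantly's actual selection math; only the 0.3 voice default comes from the definition above.

```python
def score_model(model: dict, dial: float) -> float:
    """Blend cost/speed against quality by the preference dial (0.0-1.0)."""
    efficiency = (model["speed"] + model["cheapness"]) / 2  # hypothetical fields
    return (1 - dial) * efficiency + dial * model["quality"]

def pick_model(models, dial=0.3):  # 0.3 is the documented voice default
    return max(models, key=lambda m: score_model(m, dial))
```

At low dial values the fast, cheap model wins; at high values the high-quality model wins.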

Intelligence Modes

Intelligence Modes — Three pipeline stages that control how much processing Xantly applies per request: Proxy (raw passthrough, ~0ms overhead), Cache (adds semantic caching for 40-60% cost savings), and Full (adds memory and personalization).

Provider Ecosystem

Provider Ecosystem — The set of LLM providers Xantly routes to, currently 15 providers including OpenAI, Anthropic, Google Gemini, Groq, NVIDIA, DeepSeek, and OpenRouter. Each provider supports multiple models, totaling 10,000+ available models.

Task Classification

Task Classification — Automatic analysis of each incoming request to determine its type (coding, creative writing, reasoning, summarization, conversation) and characteristics, used by the routing engine to select the most suitable model.
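As a rough intuition, classification maps prompt features to a task label that the router can act on. The keyword heuristic below is a deliberately simple stand-in; Xantly's actual classifier is not documented here.

```python
# Keyword hints per task type (illustrative, not Xantly's feature set).
TASK_HINTS = {
    "coding": ("def ", "function", "compile", "bug", "stack trace"),
    "summarization": ("summarize", "tl;dr", "key points"),
    "creative": ("write a poem", "story", "lyrics"),
}

def classify(prompt: str) -> str:
    text = prompt.lower()
    for task, hints in TASK_HINTS.items():
        if any(h in text for h in hints):
            return task
    return "conversation"  # default bucket when nothing matches
```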

Context Window Escalation

Context Window Escalation — Automatic fallback behavior where Xantly retries a request with a model that has a larger context window when the original model returns a context-too-long error. Handled transparently by the waterfall fallback system.
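The escalation path amounts to retrying down a catalog sorted by context window size. The model names and window sizes below are hypothetical examples, not Xantly's actual catalog.

```python
class ContextTooLong(Exception):
    pass

MODELS_BY_WINDOW = [  # hypothetical catalog, smallest window first
    ("small-8k", 8_000),
    ("mid-32k", 32_000),
    ("large-200k", 200_000),
]

def complete_with_escalation(call, prompt_tokens: int):
    """Retry with progressively larger context windows on ContextTooLong."""
    for model, window in MODELS_BY_WINDOW:
        try:
            return call(model, prompt_tokens)
        except ContextTooLong:
            continue  # escalate to the next larger window
    raise ContextTooLong("no model can fit this prompt")
```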

Routing Tiers (Speed, Value, Quality)

Routing Tiers — Predefined routing profiles that set the preference dial and model selection constraints. Speed tier prioritizes low-latency models, Value tier balances cost and quality, and Quality tier selects the highest-capability models regardless of cost.

Token Quota

Token Quota — The maximum number of input and output tokens an organization can consume per billing period. Quotas vary by plan (Free, Pro, Enterprise) and are enforced per-request with clear error responses when exceeded.

Rate Limiting

Rate Limiting — Per-organization limits on requests per minute (RPM) and tokens per minute (TPM) that protect both the platform and upstream providers from overload. Rate limit status is communicated through standard response headers.
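A token bucket is the classic mechanism behind RPM/TPM limits: the bucket refills at a steady rate and each request spends from it. This is a generic sketch of the technique, not a claim about Xantly's internal limiter.

```python
class TokenBucket:
    """Token-bucket limiter; run one bucket for RPM and one for TPM."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = capacity          # start full
        self.refill = refill_per_sec
        self.last = 0.0                 # timestamp of the last check

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should surface a rate-limit error/header
```

For a TPM bucket, `cost` would be the request's token count rather than 1.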

Multi-Agent Orchestration

Multi-Agent Orchestration — Building AI pipelines where multiple agents collaborate on complex tasks, using tool calls, shared persistent memory, automatic handoffs, and workflow type hints. Xantly supports chain-of-thought agent patterns across 10,000+ models.

Semantic Cache Hit Rate

Semantic Cache Hit Rate — The percentage of requests that match a previously cached response through semantic similarity (meaning-based matching rather than exact string matching). Xantly achieves a 62% semantic cache hit rate across typical production workloads.

Exact Cache Hit

Exact Cache Hit — A cache match where the incoming request is identical (byte-for-byte) to a previously cached request. Exact matches are served in sub-5ms with zero provider cost. Exact hits are checked before semantic matching.
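The check order described above (exact first, semantic second) can be sketched with a content-hash lookup in front of a slower semantic fallback. The key scheme and class names are illustrative assumptions.

```python
import hashlib

def exact_key(request_body: bytes) -> str:
    # Byte-for-byte identity via a content hash (illustrative key scheme).
    return hashlib.sha256(request_body).hexdigest()

class TwoStageCache:
    def __init__(self, semantic_lookup):
        self.exact = {}
        self.semantic_lookup = semantic_lookup  # slower, meaning-based fallback

    def get(self, request_body: bytes):
        hit = self.exact.get(exact_key(request_body))
        if hit is not None:
            return hit  # exact hit: checked first, cheapest path
        return self.semantic_lookup(request_body)

    def put(self, request_body: bytes, response):
        self.exact[exact_key(request_body)] = response
```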

Cross-Conversation Deduplication

Cross-Conversation Deduplication — A caching optimization where semantically identical requests from different users or conversations share the same cached response, avoiding redundant provider calls across an entire organization.
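The essential move is scoping the cache key to the organization rather than the user or conversation, so equivalent requests from anywhere in the org share one entry. The normalization below is a placeholder for the semantic matching layer; the class and key scheme are hypothetical.

```python
class SharedCache:
    """Org-scoped cache keyed by normalized prompt, ignoring user/conversation."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(org_id: str, prompt: str) -> tuple:
        # Whitespace/case normalization stands in for semantic matching here.
        return (org_id, " ".join(prompt.lower().split()))

    def get(self, org_id: str, prompt: str):
        return self._store.get(self._key(org_id, prompt))

    def put(self, org_id: str, prompt: str, response):
        self._store[self._key(org_id, prompt)] = response
```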
