# Embeddings
Create vector embeddings for text. Embeddings are dense numerical representations of text useful for semantic search, clustering, classification, and retrieval-augmented generation (RAG).
- Endpoint: `POST /v1/embeddings`
- Auth: `Authorization: Bearer <token>`
- Drop-in compatible with the OpenAI Embeddings API (same request/response shape).

## Quick start
```bash
curl -sS https://api.xantly.com/v1/embeddings \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-3-small",
    "input": "The quick brown fox jumps over the lazy dog."
  }'
```

## Request body
| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Embedding model slug (e.g. `text-embedding-3-small`, `text-embedding-ada-002`). Use `GET /v1/models` to list available embedding models. |
| `input` | string \| array&lt;string&gt; | Yes | Text to embed. Pass a string for a single input, or an array of strings for batch embedding. |
## Response body

```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0023064255, -0.009327292, 0.015797347, "..."]
    }
  ],
  "model": "text-embedding-3-small",
  "usage": {
    "prompt_tokens": 9,
    "total_tokens": 9
  }
}
```

| Field | Type | Description |
|---|---|---|
| `object` | string | Always `"list"`. |
| `data` | array | One embedding object per input string. |
| `data[].object` | string | Always `"embedding"`. |
| `data[].index` | integer | Position in the input array (0-indexed). |
| `data[].embedding` | array&lt;float&gt; | Dense vector representation. Dimensionality depends on the model (e.g. 1536 for `text-embedding-3-small`). |
| `model` | string | Model that generated the embeddings. |
| `usage.prompt_tokens` | integer | Tokens consumed by the input. |
| `usage.total_tokens` | integer | Same as `prompt_tokens` for embeddings. |
## Code examples

### Single input

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XANTLY_API_KEY"],
    base_url="https://api.xantly.com/v1",
)

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Semantic search is powerful for RAG applications.",
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")
```

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.XANTLY_API_KEY,
  baseURL: "https://api.xantly.com/v1",
});

const response = await client.embeddings.create({
  model: "text-embedding-3-small",
  input: "Semantic search is powerful for RAG applications.",
});

const vector = response.data[0].embedding;
console.log(`Dimensions: ${vector.length}`);
```

### Batch input
```python
texts = [
    "What is a vector database?",
    "How does semantic search work?",
    "Explain cosine similarity.",
]

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts,
)

for item in response.data:
    print(f"[{item.index}] {len(item.embedding)}-dim vector")
```

```javascript
const texts = [
  "What is a vector database?",
  "How does semantic search work?",
  "Explain cosine similarity.",
];

const response = await client.embeddings.create({
  model: "text-embedding-3-small",
  input: texts,
});

response.data.forEach(({ index, embedding }) => {
  console.log(`[${index}] ${embedding.length}-dim vector`);
});
```

## Automatic caching
Xantly automatically caches embedding responses using an exact-match in-process cache. Identical requests (same model + same input text) return instantly from cache with zero provider cost.
- Cache capacity: 5,000 entries
- TTL: 5 minutes
- Key: BLAKE3 hash of `model:input`
- Scope: per-instance (not shared across pods)

This is transparent — no configuration needed. Repeated calls to the same text within 5 minutes are free and sub-millisecond.
## Errors

| HTTP | error.type | error.code | Typical trigger |
|---|---|---|---|
| 400 | `invalid_request_error` | `validation_error` | Missing `model` or `input`, or empty `input`. |
| 401 | `authentication_error` | `invalid_api_key` | Missing or invalid Bearer token. |
| 402 | `billing_error` | `budget_exceeded` | Monthly token quota or budget exceeded — see Billing & Quotas. |
| 429 | `rate_limit_error` | `rate_limit_exceeded` | Rate limit exceeded — see Rate Limits. |
| 500 | `internal_error` | `internal_error` | No embedding provider available, or an upstream provider error. |
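Transient statuses (429 and 500) are worth retrying with backoff, while the 4xx errors need a request or account fix first. A minimal client-side sketch — the helper names here are our own, not part of the API:

```python
import time

# Statuses worth retrying: rate limits and server/provider errors.
RETRYABLE_STATUSES = {429, 500}


def should_retry(status: int) -> bool:
    # 400/401/402 indicate a problem a retry cannot fix.
    return status in RETRYABLE_STATUSES


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    # Exponential backoff: 0.5s, 1s, 2s, ... capped at 8s.
    return min(cap, base * (2 ** attempt))


# Example: a 429 response on the first attempt.
status = 429
if should_retry(status):
    time.sleep(backoff_delay(0))  # wait, then re-send the request
```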
## Next steps
- Models — List all available models including embedding models
- Rate Limits — Understand limits that apply to embedding calls
- Chat Completions — Main gateway endpoint for LLM inference