TL;DR: An embedding is a list of numbers that represents the meaning of a piece of content — a word, sentence, document, or image. Similar content gets similar numbers, so AI can find related things by comparing these number lists. They're the engine behind semantic search, RAG systems, recommendation features, and every other AI feature that involves matching content by meaning rather than exact words.

The Problem Embeddings Solve

Imagine you run a support chat for a construction software company. A customer types: "my project costs keep changing every time I refresh."

Your help center has an article titled "Why do budget figures update automatically?" That article answers exactly this question. But a keyword search won't find it — there's no overlap between "costs keep changing" and "budget figures update automatically."

This is the fundamental problem that keyword search has never solved well. Words are just symbols. The same idea can be expressed a hundred different ways. If your search system can only match the exact words in the query to the exact words in the document, it misses most of what people actually mean.

Embeddings fix this. Instead of comparing words, the system compares meaning. "Costs keep changing" and "budget figures update automatically" end up very close to each other in the embedding space — even though they share no words — because they mean the same thing.

GPS Coordinates for Meaning

Here's an analogy that makes embeddings click for most people.

Think about GPS coordinates. Every location on Earth has a pair of numbers — latitude and longitude. New York City is roughly (40.7, -74.0). London is roughly (51.5, -0.1). The numbers themselves don't look like cities — they're just coordinates. But you can use those coordinates to ask: which cities are close to each other? Paris and London are close. New York and Los Angeles are far apart. Tokyo and Sydney are far from both.

Embeddings do the same thing for meaning. Every piece of text gets a set of coordinates. Not two coordinates like GPS — typically hundreds or even thousands of them, because meaning has more dimensions than geography. But the core idea is identical: content with similar meanings ends up with coordinates that are close together, and content with different meanings ends up with coordinates that are far apart.

"Dog" and "puppy" end up close together. "Dog" and "spreadsheet" end up far apart. "I can't log in" and "access denied error" end up close together, even though they share no words.

The AI system doesn't need to understand English grammar or logic. It just needs to find the coordinates that are closest to your query. That's fast, straightforward math — and it works across every language and every content type that has an embedding model trained on it.

What Embeddings Actually Look Like

You'll never look at raw embeddings in your daily work — they're not human-readable. But it's worth knowing what they actually are so the concept doesn't feel like magic.

An embedding is an array of floating-point numbers. Here's a tiny slice of what an embedding for the sentence "the dog chased the ball" might look like:

What an Embedding Looks Like (First 10 of 1,536 Numbers)

[0.0023, -0.0187, 0.0412, -0.0091, 0.0734,
 -0.0298, 0.0156, -0.0443, 0.0821, 0.0037, ...]
# ... 1,526 more numbers follow

Each number captures some tiny aspect of the meaning. No single number means anything on its own. The meaning lives in the pattern of all of them together — the same way a photograph isn't one pixel, it's millions of pixels forming a pattern your eye reads as an image.

OpenAI's text-embedding-3-small model produces 1,536-dimensional embeddings. That means every piece of text you embed gets represented as a list of 1,536 numbers. When you compare two embeddings to see how similar they are, you're comparing two lists of 1,536 numbers using a calculation called cosine similarity — which basically measures whether the two lists are pointing in the same direction.
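To demystify the math, here is cosine similarity computed by hand in plain Python. The three-dimensional "embeddings" below are made-up toy vectors, not output from a real model, but the calculation is exactly what runs on real 1,536-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their lengths.
    # Result is near 1.0 when the vectors point the same way, near 0.0 when
    # unrelated, and can go as low as -1.0 for opposite directions.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions
dog   = [0.9, 0.8, 0.1]
puppy = [0.85, 0.75, 0.15]
sheet = [0.1, 0.2, 0.95]

print(cosine_similarity(dog, puppy))  # ~0.999, very similar direction
print(cosine_similarity(dog, sheet))  # ~0.29, very different direction
```

With hundreds of dimensions the arithmetic is the same, just longer, which is why this comparison stays fast even at scale.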

How You Actually Use Embeddings

Using embeddings in practice follows the same pattern regardless of what you're building. There are always two phases: storing embeddings ahead of time, and then searching them at query time.

Phase 1: Embed Your Content (Do This Once)

Take all the content you want to make searchable — documents, articles, product descriptions, support tickets, whatever — and run each one through an embedding model. Store the resulting number arrays alongside the original content. You can store these in a vector database, or even just a regular Postgres table if you're using the pgvector extension.

Embedding Documents Ahead of Time

from openai import OpenAI

client = OpenAI()

# Your documents — could be help articles, product descriptions, anything
documents = [
    "Why do budget figures update automatically?",
    "How to export your project timeline to PDF",
    "Setting up user permissions for your team",
    "Understanding the cost tracking dashboard",
]

# Embed each document — you only do this once (or when content changes)
def embed(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

document_embeddings = [embed(doc) for doc in documents]

# Now store documents + their embeddings in your database
# document_embeddings[0] is a list of 1,536 numbers for the first document

Phase 2: Embed the Query and Find the Closest Match

When a user searches for something, embed their query using the same model. Then find which stored document embeddings are closest to the query embedding. Those are your results.

Searching With Embeddings

import numpy as np

def cosine_similarity(a, b):
    # How similar two embeddings are: near 1.0 means very similar, near 0.0
    # means unrelated (mathematically the range extends down to -1.0)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# User types their search query
user_query = "my project costs keep changing every time I refresh"

# Embed the query using the same model
query_embedding = embed(user_query)

# Compare query embedding to every document embedding
scores = [
    (doc, cosine_similarity(query_embedding, doc_embedding))
    for doc, doc_embedding in zip(documents, document_embeddings)
]

# Sort by similarity score — highest first
scores.sort(key=lambda x: x[1], reverse=True)

print("Best match:", scores[0][0])
# → "Why do budget figures update automatically?"
# Score: ~0.87  (very similar)
# Even though query and document share zero words!

This is the core loop. Everything else — vector databases, RAG pipelines, recommendation engines — is built on top of this same idea: embed content, store it, embed the query, find the closest stored embeddings.

Where You See Embeddings Every Day

You've been using embedding-powered features for years without realizing it. The technology just didn't have this name in most consumer products until recently.

Semantic Search

When you search Google for "car won't start" and get results about dead batteries, bad alternators, and starter motors — even though none of those pages say "car won't start" — that's semantic understanding powered by embeddings. The search engine isn't matching your words. It's matching your meaning.

Recommendation Engines

When Netflix suggests "you might also like..." or Spotify generates Discover Weekly, embeddings are doing the work. Your watch history gets embedded. Each show gets embedded. The system finds shows whose embeddings are close to your history's embedding. "You watched three crime dramas set in Scandinavia" gets translated into a point in embedding space, and the system finds other points (shows) nearby.

RAG — Giving AI Access to Your Documents

RAG (Retrieval-Augmented Generation) is the pattern behind every "chat with your documents" or "AI that knows your knowledge base" feature. You embed all your documents and store them. When a user asks a question, you embed the question, find the most relevant document chunks, and then send those chunks plus the original question to an LLM. The LLM answers using your actual content instead of its training data. Embeddings are what makes the "retrieval" part of RAG work. You can read the full breakdown in the vector database article, which covers how those retrieved embeddings get stored and queried at scale.

Duplicate Detection

Support platforms use embeddings to detect when two tickets describe the same problem. Two tickets saying "the export button doesn't work" and "I click export and nothing happens" will have very similar embeddings, so the system can automatically group or deduplicate them — even though a keyword search would see them as unrelated.
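A minimal sketch of that grouping logic: each ticket joins the first existing group whose representative embedding clears a similarity threshold. The embeddings below are toy stand-ins for real model output, and the 0.95 threshold is an illustrative assumption you would tune in practice.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def group_duplicates(tickets, embeddings, threshold=0.95):
    # Greedy grouping: a ticket joins the first group whose representative
    # embedding is above the threshold; otherwise it starts a new group.
    groups = []  # list of (representative_embedding, [tickets])
    for ticket, emb in zip(tickets, embeddings):
        for rep_emb, members in groups:
            if cosine_similarity(emb, rep_emb) >= threshold:
                members.append(ticket)
                break
        else:
            groups.append((emb, [ticket]))
    return [members for _, members in groups]

# Toy embeddings; in practice these come from an embedding model
tickets = ["export button doesn't work", "click export, nothing happens", "billing question"]
embs = [[0.9, 0.1, 0.1], [0.88, 0.12, 0.09], [0.1, 0.9, 0.2]]
print(group_duplicates(tickets, embs))
# → [["export button doesn't work", "click export, nothing happens"], ["billing question"]]
```

Production systems are more careful (comparing against every group member, or clustering), but the core mechanism is this threshold check on embedding similarity.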

Embedding Models: Your Options

You don't train your own embedding model — you use one that's already been trained on massive amounts of text. Here are the main options for someone building their first AI-powered feature.

OpenAI text-embedding-3-small

The easiest starting point. You call OpenAI's API, pass your text, and get back a list of 1,536 numbers. It's fast, cheap (under $0.03 per million tokens), and produces high-quality embeddings for English text. If you're already using the OpenAI API for other features, this is the zero-friction choice. The text-embedding-3-large variant produces better embeddings for complex tasks at several times the cost — worth it for specialized search, overkill for most projects.

Sentence Transformers (Free, Local)

If you don't want to pay per API call, or you need to run embeddings locally without sending data to a third party, Hugging Face's sentence-transformers library is the standard choice. The all-MiniLM-L6-v2 model is fast and surprisingly good for general-purpose English search. You download it once, run it on your own machine or server, and it generates embeddings for free forever.

Using sentence-transformers Locally

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "The dog chased the ball across the yard.",
    "A puppy ran after a toy in the garden.",
    "Quarterly revenue increased by 14 percent."
]

embeddings = model.encode(sentences)

# embeddings[0] and embeddings[1] will be very similar (dogs + running + playing)
# embeddings[2] will be far from both (completely different topic)
print(embeddings[0].shape)  # (384,) — 384-dimensional vectors

Voyage AI and Cohere

These providers offer embedding APIs that specialize in particular domains. Voyage AI's models are particularly strong for code search (finding relevant code by natural language query). Cohere's Embed v3 has excellent multilingual support. If you're building for a specific domain or non-English content, they're worth evaluating against OpenAI's defaults.

Anthropic (No Embedding API — Yet)

Worth noting: Anthropic (the company behind Claude) does not currently offer a public embedding API. Claude is exceptional at understanding and reasoning over text, but if you need embeddings in your pipeline, you'll use OpenAI, Cohere, Voyage, or a local model for that specific step.

Embeddings and Vector Databases: How They Fit Together

Once you have more than a few thousand documents, searching through embeddings with a simple loop like the example above gets slow. A vector database solves this by indexing embeddings so similarity search runs in milliseconds even across millions of documents.

Think of it like this: a regular database is great at finding exact matches ("find all users where city = 'Austin'"). A vector database is great at finding approximate matches ("find the 10 documents whose meaning is closest to this query"). They're different tools built for different problems.
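To make the tradeoff concrete, here is the brute-force version of "find the 10 closest documents" in plain Python. This linear scan is what the earlier search example does, and it is exactly the per-query work a vector database's index (HNSW, IVF, and similar structures) is built to avoid.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_emb, doc_embs, k=2):
    # Brute force: score every document, keep the k best.
    # O(n) similarity computations per query; fine for thousands of
    # documents, too slow for millions, which is where a vector index helps.
    scored = ((cosine_similarity(query_emb, emb), i) for i, emb in enumerate(doc_embs))
    return heapq.nlargest(k, scored)

# Toy 2-dimensional embeddings for illustration
docs = [[1.0, 0.0], [0.9, 0.4], [0.0, 1.0]]
print(top_k([1.0, 0.1], docs, k=2))  # the two vectors closest in direction
```

A vector database trades a little accuracy (approximate nearest neighbors) for dramatically less work per query.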

Popular vector database options:

  • Pinecone — hosted, no infrastructure to manage, free tier for getting started
  • Weaviate — open source, can run locally or hosted, good developer experience
  • pgvector — a Postgres extension that adds vector search to a regular Postgres database — great if you're already using Postgres and don't want another service to manage
  • Chroma — lightweight, designed for local development and prototyping

The vector database deep-dive article covers the tradeoffs between these options in detail. The short version: start with pgvector if you have Postgres, or Chroma if you're prototyping locally. Graduate to Pinecone when you need scale without infrastructure headaches.

Embeddings in a RAG System: The Full Picture

RAG is the most common reason builders first need to understand embeddings. Here's how the whole system fits together so you can see where embeddings live.

RAG System: Step by Step

Step 1 — Index your documents (done once, or when docs change)
  └─ Split documents into chunks (paragraphs, sections)
  └─ Embed each chunk → get list of numbers
  └─ Store chunks + embeddings in vector database

Step 2 — User asks a question
  └─ Embed the question → get list of numbers
  └─ Search vector database for the closest chunk embeddings
  └─ Retrieve the top 3–5 most relevant chunks

Step 3 — Generate an answer
  └─ Send retrieved chunks + original question to Claude or GPT-4
  └─ LLM answers using your actual document content, not just training data
  └─ Return answer to user, with source links if desired

Embeddings power Step 1 and Step 2. The LLM only comes in at Step 3. This is why understanding embeddings matters even if you're primarily working with Claude or GPT-4 — you need embeddings to build the retrieval layer that makes those models genuinely useful over your own content.
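Step 3 is mostly prompt assembly. A minimal sketch of that glue code, with a common-pattern prompt wording (not a prescribed format) and made-up example chunks:

```python
def build_rag_prompt(question, retrieved_chunks):
    # Assemble retrieved context plus the user's question into one prompt
    # for the LLM. The instruction wording here is a typical pattern,
    # not an official template.
    context = "\n\n".join(
        f"[Source {i + 1}]\n{chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using only the context below. "
        "If the answer isn't in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical chunks returned by the vector search in Step 2
chunks = [
    "Budget figures refresh automatically when linked cost items change.",
    "You can disable auto-refresh under Project Settings.",
]
prompt = build_rag_prompt("Why do my project costs keep changing?", chunks)
print(prompt)
```

The resulting string is what you send to Claude or GPT-4; numbering the sources also makes it easy to ask the model to cite which chunk it used.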

To understand how tokens and context interact with the chunks you retrieve in Step 3, see the article on AI tokens and context limits and the explainer on context windows.

What AI Gets Wrong About Embeddings

If you ask an AI assistant to explain embeddings, it'll probably explain them correctly — but then give you advice that looks authoritative and is actually problematic for real projects. Here's what to watch out for.

"Just embed your whole document at once"

AI tools often suggest treating each document as a single unit to embed. In practice, this produces mediocre search results for anything longer than a few paragraphs. A 20-page technical manual embedded as a single vector will produce one averaged-out set of coordinates that doesn't represent any specific topic well. The right approach is to split documents into smaller chunks (often 200–500 words with some overlap) and embed each chunk separately. This way, a question about page 12 can actually surface the content from page 12, rather than a blurry representation of the whole document.
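A simple word-count chunker with overlap looks like this. The 300-word chunks and 50-word overlap are illustrative starting values, not rules, and real pipelines often split on paragraph or sentence boundaries instead of raw word counts.

```python
def chunk_words(text, chunk_size=300, overlap=50):
    # Split text into overlapping chunks of roughly chunk_size words.
    # The overlap keeps sentences that straddle a boundary searchable
    # from both neighboring chunks.
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = "word " * 700  # stand-in for a long document
chunks = chunk_words(doc, chunk_size=300, overlap=50)
print(len(chunks), [len(c.split()) for c in chunks])  # → 3 [300, 300, 200]
```

Each chunk then gets embedded and stored separately, so a query can match the specific passage rather than an averaged-out whole document.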

"Cosine similarity is all you need"

Cosine similarity is the standard way to compare embeddings, and it works well. But it only measures whether two vectors point in the same direction — it's blind to absolute scale. For most search tasks this doesn't matter. But if you're building a system where you need to detect exact duplicates, or where you care about the difference between "very similar" and "somewhat similar," you'll want to understand when cosine similarity can mislead you. Dot product similarity is sometimes more appropriate for dense retrieval use cases.
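The scale-blindness is easy to see with two toy vectors pointing the same direction, one twice as long as the other:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the magnitude

print(cosine(a, a), cosine(a, b))  # both ~1.0: cosine ignores magnitude
print(dot(a, a), dot(a, b))        # 5.0 vs 10.0: dot product does not
```

If your embedding model normalizes its vectors to unit length (many do), cosine similarity and dot product give identical rankings; when vectors carry meaningful magnitude, the two metrics can disagree.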

"Your embedding model doesn't matter much"

AI tools will often default to whatever embedding model is easiest to demo, then suggest you can swap models freely later. You can't — not without re-embedding everything. Different embedding models produce incompatible vector spaces. If you embed documents with OpenAI's model, you must also embed queries with OpenAI's model. Switching models means re-processing every document you've ever indexed. Choose your embedding model deliberately early, and factor that into your architecture.

"One chunk size fits all"

The right chunk size depends entirely on your use case. Chunking by 200 words works well for factual Q&A (you want precise snippets). Chunking by 1,000 words works better for topic summaries (you want broader context). AI tools rarely mention this tradeoff, and getting it wrong is one of the most common reasons a RAG system returns good-looking results that actually miss what the user needed.

"Semantic search replaces keyword search"

Semantic search handles meaning-based queries better than keyword search. But keyword search handles exact-match queries — product SKUs, names, error codes, URLs — better than semantic search. The best production search systems use both: semantic embeddings for meaning-based queries, keyword search for exact matches, then combine the scores. This is called hybrid search. Building a search system? Plan for both from the start.
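A minimal sketch of that score combination, under big simplifying assumptions: the keyword score is crude word overlap, the semantic scores are made-up stand-ins for cosine similarities, and the 0.7 weight is a tuning knob. Real systems often use rank-based fusion (such as reciprocal rank fusion) instead of blending raw scores.

```python
def keyword_score(query, doc):
    # Crude keyword overlap: fraction of query words present in the doc.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_score(semantic, keyword, alpha=0.7):
    # Weighted blend of the two signals; alpha is an assumed tuning value.
    return alpha * semantic + (1 - alpha) * keyword

docs = ["Why do budget figures update automatically?", "Error code E-1042 on export"]
semantic_scores = [0.45, 0.40]  # hypothetical: semantic search barely distinguishes them

query = "error code E-1042"
for doc, sem in zip(docs, semantic_scores):
    print(round(hybrid_score(sem, keyword_score(query, doc)), 3), doc)
```

Here the exact-match error code barely registers semantically, but the keyword component breaks the tie and ranks the right document first, which is exactly the failure mode hybrid search exists to fix.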

Frequently Asked Questions

What are embeddings in AI?

Embeddings are lists of numbers that represent the meaning of a piece of content — text, image, audio, or anything else with a trained model. Similar content gets similar numbers. The AI then finds related items by comparing these number lists — like finding nearby GPS coordinates on a map, but for meaning instead of geography.

Why do embeddings matter if I'm building apps?

Almost every AI feature involving search, matching, or recommendations uses embeddings under the hood. Semantic search, RAG systems (giving an AI access to your documents), "find similar items" features, and duplicate detection all depend on them. If you're building anything beyond a basic chatbot, you will need to understand this concept.

What is the difference between an embedding and a token?

Tokens are how an AI breaks text into chunks — roughly word-pieces. Embeddings are how the AI represents the meaning of text as a list of numbers. You can embed individual tokens, but you can also embed whole sentences or entire documents into a single vector. Tokens are about how text gets split; embeddings are about what that text means.

What is a vector and how does it relate to embeddings?

A vector is just a list of numbers. An embedding is a vector whose numbers represent the meaning of some content. The term "vector" shows up in "vector database" and "vector search" precisely because embeddings are the vectors being stored. Don't let the word intimidate you — it's still just a list of numbers.

How does semantic search work with embeddings?

You embed both the search query and all your stored content using the same model. Then you measure which stored embeddings are closest to the query embedding — using cosine similarity, which calculates how closely aligned two number lists are. The closest stored embeddings are the most semantically similar results, returned to the user. This is why "car won't start" can surface articles about dead batteries — the meanings are close even though the words are different.

What embedding model should I use?

For most projects starting out, OpenAI's text-embedding-3-small is the pragmatic choice — cheap, fast, and high quality for English text. For free local embeddings, all-MiniLM-L6-v2 from sentence-transformers is the community default. The most important rule: whatever model you embed your documents with, you must use that same model to embed search queries. Mixing models breaks everything.

Do I need a vector database to use embeddings?

Not for small projects. Under a few thousand documents, you can store embeddings in a regular database or even a Python list and search them with a loop. Once you reach tens of thousands of documents, a vector database — Pinecone, Weaviate, or pgvector in Postgres — handles similarity search efficiently. Start simple, upgrade when the need is real.

Can embeddings work with content other than text?

Yes. Specialized models exist for images, audio, video, and code. Multimodal models like CLIP can embed text and images into the same space — enabling text-to-image search, where "dog playing in snow" finds matching images even without keyword tags. The concept is identical regardless of content type: convert content to numbers that capture meaning, then compare numbers to find similar content.

What to Learn Next

Embeddings are the foundation. These articles cover what you build on top of them.