TL;DR: RAG (Retrieval Augmented Generation) is the technique that lets AI apps look up information from your own documents before answering questions. Instead of the AI guessing from its training data, it searches your knowledge base, grabs the relevant passages, and uses those to generate an accurate answer. It's the #1 architecture pattern for building AI apps that know about your specific data.
The Problem RAG Solves
Imagine you hire a brilliant new foreman. Incredibly experienced — knows construction codes, materials science, scheduling theory. But he's never set foot on your specific job site. You ask him: "What's the rebar spec for the east foundation pour we planned last Tuesday?"
He might give you a confident answer. It might even sound right. But it won't be your spec — it'll be his best guess based on similar projects he's seen before. That's a problem when someone pours concrete based on that answer.
That's exactly how AI models work out of the box. They're trained on enormous amounts of text from the internet, books, and code repositories — but that training ended at a cutoff date, and it definitely didn't include your documents, your codebase, or your company's internal knowledge. When you ask a plain AI model about your specific project, it has two choices: admit it doesn't know, or make something up that sounds plausible.
Models frequently choose option two. That's what AI people call hallucination — a confident, fluent answer that isn't true.
RAG is the solution the industry landed on. Instead of hoping the model happens to know your information, you build a system that looks it up first — exactly like a foreman who walks over to the filing cabinet, pulls the actual spec sheet, and reads the answer directly to you. The AI still writes the response, but it's working from your real documents, not from memory.
What RAG Actually Stands For
RAG stands for Retrieval Augmented Generation. Break it apart:
- Retrieval — searching your documents to find the relevant piece of information
- Augmented — adding that retrieved information to what you send the AI
- Generation — the AI writes a response using the retrieved information as its source
The term was coined in a 2020 research paper from Facebook AI, but the concept is simple enough that you've probably built a primitive version of it without realizing it — any time you copied text from a document into a Claude chat and asked a question about it, you were doing manual RAG. The system just automates that copy-paste at scale.
How RAG Works: The Three Steps
Every RAG system, no matter how fancy, does three things in sequence. Think of it as the construction equivalent of: find the right blueprint, get it in front of the worker, let them do their job.
Step 1: Retrieve — Find the Right Pages
When a user asks a question, the system searches your document library to find the most relevant pieces. This isn't keyword search — it's semantic search, which means it finds documents that mean the same thing as the question, even if they use different words.
Your policy document might say "termination of employment." A user might ask "what happens if I get fired?" A keyword search finds nothing. Semantic search finds the right section because it understands the meaning is the same.
To make semantic search work, your documents need to be stored as embeddings — numerical representations of meaning — in a vector database. Think of embeddings like GPS coordinates for meaning: related concepts end up numerically close to each other, so a search finds neighbors, not just exact matches.
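To make the GPS-coordinates analogy concrete, here's a toy comparison in plain Python. The four-number vectors are made up for illustration (real embedding models produce vectors with hundreds or thousands of dimensions), but the similarity math is exactly what a vector search runs:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: close to 1.0 means same direction (similar meaning),
    close to 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (invented numbers for illustration)
fired      = [0.9, 0.1, 0.0, 0.2]     # "what happens if I get fired?"
terminated = [0.85, 0.15, 0.05, 0.25] # "termination of employment"
lunch      = [0.0, 0.9, 0.8, 0.1]     # "cafeteria lunch menu"

print(cosine_similarity(fired, terminated))  # high: same meaning, different words
print(cosine_similarity(fired, lunch))       # low: unrelated
```

Notice that "fired" and "termination of employment" share zero words, yet their vectors sit close together. That's the whole trick behind semantic search.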
Step 2: Augment — Put the Right Pages In Front of the AI
Once the system retrieves the relevant document chunks, it stuffs them into the prompt it sends to the AI model. The prompt now looks something like:
What the AI Actually Receives
Use the following context to answer the question.
If the answer isn't in the context, say you don't know.
CONTEXT:
[Retrieved chunk 1: paragraph from your employee handbook]
[Retrieved chunk 2: relevant section from HR policy doc]
[Retrieved chunk 3: related FAQ entry]
QUESTION: What happens to my PTO balance if I leave the company?
Answer based only on the context provided.
The AI now has the actual source material in front of it. It's not guessing from training data — it's reading your documents and writing a response. This is the "augmented" part: the context window gets augmented with retrieved information. If you want to understand context windows better, see our guide on what context windows are and how they work.
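If you want to see the augmentation step as code, here's a minimal sketch. The function name and the sample chunks are invented for illustration; the template mirrors the prompt structure shown above:

```python
def build_rag_prompt(question, chunks):
    """Assemble the augmented prompt: instructions, retrieved context, question."""
    context = "\n\n".join(f"[Chunk {i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Use the following context to answer the question.\n"
        "If the answer isn't in the context, say you don't know.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {question}\n\n"
        "Answer based only on the context provided."
    )

prompt = build_rag_prompt(
    "What happens to my PTO balance if I leave the company?",
    [
        "Unused PTO is paid out at the employee's final base rate.",
        "PTO accrues at 1.5 days per month of service.",
    ],
)
print(prompt)
```

That string is the entire "augmented" part: the retrieved chunks simply get concatenated into the prompt before it's sent to the model.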
Step 3: Generate — The AI Writes the Answer
With the retrieved context in place, the AI generates a response. Because it has the actual source material, the answer reflects your documents rather than the model's training data. Good RAG systems also include citations — they tell the user which document and section the answer came from, so the user can verify it.
That's the whole loop. User asks question → system retrieves relevant chunks → chunks go into the prompt → AI answers using the chunks → user gets an accurate, grounded response.
The Full RAG Pipeline: Before the App Runs
There's a step that happens before any user asks any question: you have to build the knowledge base in the first place. This is called indexing, and it's the part most tutorials gloss over. Here's what actually happens:
1. Load Your Documents
You collect all the source material your AI should know about — PDFs, Word docs, Notion pages, web pages, Markdown files, database records. Tools like LangChain and LlamaIndex have hundreds of pre-built loaders for different file types so you don't have to parse them yourself.
2. Chunk the Documents
You can't store a 300-page manual as one blob and retrieve it as one result. You break each document into smaller pieces called chunks. A chunk might be a paragraph, a page, or a fixed run of characters (500–1,500 characters is a common starting range).
Chunking is genuinely one of the most important decisions in a RAG system. Chunks too small: the retrieved piece lacks enough context to be useful. Chunks too large: you hit context window limits and waste space on irrelevant content. Getting chunking right is where experienced RAG builders spend a lot of their time.
Think of it like a filing system. A filing cabinet full of complete 300-page manuals isn't useful when you need one spec. A cabinet organized by section, sub-section, and topic — that's what good chunking looks like.
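Here's what a naive character-based chunker with overlap looks like in plain Python. It's a simplified stand-in for what library text splitters do for you, using a synthetic document so the example is self-contained:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks. Each chunk starts `overlap`
    characters before the previous one ended, so text that straddles
    a boundary appears whole in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

doc = "".join(chr(65 + i % 26) for i in range(1200))  # stand-in for a real document
chunks = chunk_text(doc)
print(len(chunks))                        # 3 chunks
print([len(c) for c in chunks])           # [500, 500, 300]
print(chunks[0][-50:] == chunks[1][:50])  # True: adjacent chunks share the overlap
```

Real splitters improve on this by preferring to break at paragraph and sentence boundaries instead of arbitrary character positions, but the size-plus-overlap idea is the same.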
3. Embed the Chunks
Each chunk gets converted into a vector — a list of numbers representing the chunk's meaning — using an embedding model. OpenAI's text-embedding-3-small and Cohere's embedding models are common choices. The embedding model is separate from the chat model; its only job is turning text into numbers.
Two chunks about similar topics will have numerically similar vectors. That's how the retrieval step finds related content — it converts the user's question into a vector too, then finds the chunks whose vectors are closest.
4. Store in a Vector Database
The chunks (and their vectors) get stored in a vector database, which is built for fast nearest-neighbor search. When you query it with a vector, it returns the most similar stored vectors in milliseconds — even across millions of chunks.
Popular options include Pinecone (managed, easy to start), Weaviate (open source, feature-rich), Chroma (local, great for development), and Supabase with the pgvector extension (if you're already using Postgres). See our full breakdown of vector databases to understand which to pick.
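Under the hood, every option above does some version of the following. This toy store uses brute-force cosine similarity over made-up two-dimensional vectors; real vector databases get the same result with approximate indexes (HNSW, IVF) that stay fast across millions of chunks:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class TinyVectorStore:
    """Minimal in-memory vector store: brute-force nearest-neighbor search."""

    def __init__(self):
        self.entries = []  # list of (vector, chunk_text) pairs

    def add(self, vector, text):
        self.entries.append((vector, text))

    def query(self, vector, k=2):
        """Return the k chunk texts whose vectors are closest to the query."""
        ranked = sorted(self.entries, key=lambda e: cosine(vector, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]

store = TinyVectorStore()
store.add([1.0, 0.1], "PTO is paid out on departure.")
store.add([0.9, 0.2], "Unused vacation days roll over each January.")
store.add([0.1, 1.0], "The cafeteria opens at 8am.")

print(store.query([1.0, 0.0], k=2))  # the two PTO-related chunks, not the cafeteria one
```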
What RAG Looks Like in Code
Here's a stripped-down RAG implementation so you can see what the pieces actually look like. This uses LangChain with a local Chroma vector store — no paid services required for the prototype.
Part 1: Indexing Your Documents (Runs Once)
# pip install langchain langchain-community langchain-openai langchain-text-splitters chromadb unstructured
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Load all PDFs from a folder (DirectoryLoader uses the unstructured package to parse them)
loader = DirectoryLoader("./company-docs/", glob="**/*.pdf")
documents = loader.load()
# Split into chunks (500 chars each, 50 char overlap)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
chunks = splitter.split_documents(documents)
# Convert chunks to vectors and store them
embedding_model = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./my-knowledge-base",
)
# Recent Chroma versions persist to disk automatically when persist_directory is set
print(f"Indexed {len(chunks)} chunks from {len(documents)} documents")
Part 2: Answering Questions (Runs on Every Query)
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA
# Load the stored knowledge base
vectorstore = Chroma(
    persist_directory="./my-knowledge-base",
    embedding_function=OpenAIEmbeddings(),
)
# Set up the retriever (returns top 4 most relevant chunks)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# Connect retriever + LLM into a QA chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,  # include citations
)
# Answer a question
result = qa_chain.invoke({"query": "What is our policy on overtime pay?"})
print(result["result"]) # the answer
print(result["source_documents"]) # the chunks it used
That's a working RAG system. Not production-ready — you'd add error handling, a nicer interface, and better chunking — but it demonstrates every part of the pipeline. Your documents go in, questions come out, and the answers are grounded in your actual content.
RAG vs. Fine-Tuning: Which One Do You Need?
This is the most common question when people first learn about RAG. Here's the straight answer:
Use RAG When:
✓ Your data changes frequently (new docs, updated policies, live data)
✓ You need answers that are accurate and verifiable with citations
✓ Your knowledge base is large (thousands of documents)
✓ You want to ship quickly without expensive retraining
✓ You need to know WHERE the answer came from
✓ You're building a Q&A tool, chatbot, or knowledge assistant
Use Fine-Tuning When:
✓ You want the model to adopt a specific writing style or tone
✓ You're teaching the model a specialized skill (e.g., write in our brand voice)
✓ Your knowledge is static and won't change for months
✓ You need the model to follow a specific output format reliably
✓ Speed matters and you want to reduce prompt length
In practice: most production AI apps use RAG for factual accuracy and skip fine-tuning entirely. Fine-tuning is expensive, slow to iterate on, and requires significant labeled data. RAG is live the moment you add a document to the knowledge base.
The "fine-tune the model on your company data" pitch sounds good but rarely makes sense. Imagine hiring a worker, sending them to school to memorize your current project specs, then having to send them back every time anything changes. RAG is the filing cabinet that's always up to date — the worker just looks it up.
RAG vs. Just Using a Long Context Window
Models like Claude and Gemini now have massive context windows — up to a million tokens. So why not just dump all your documents into one giant prompt and skip RAG entirely?
For small document sets, that actually works. If you have 20 Markdown files totaling 50 pages, pasting them all into the context might be completely reasonable. But it breaks down fast:
- Cost: Every API call charges for every token in the context. Sending 500 pages every time someone asks "what's the office Wi-Fi password?" is wasteful.
- Scale: Most real knowledge bases are enormous — thousands of documents, millions of words. No context window holds that.
- Accuracy: Research shows that LLMs perform worse when the relevant information is buried deep in a massive context. Retrieval surfaces the right chunks and focuses the model's attention.
- Speed: Processing a huge context takes longer and costs more than processing a small retrieved set.
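A quick back-of-envelope calculation makes the cost point concrete. The prices, page counts, and token counts below are hypothetical round numbers for illustration, not any provider's actual rates:

```python
# Hypothetical round numbers: check your provider's current pricing.
price_per_token = 3.00 / 1_000_000  # assume $3 per million input tokens
tokens_per_page = 600               # rough average for dense text

full_dump_tokens = 500 * tokens_per_page  # paste all 500 pages on every query
rag_tokens = 4 * 400                      # 4 retrieved chunks of ~400 tokens each

print(f"Full dump per query: ${full_dump_tokens * price_per_token:.2f}")
print(f"RAG per query:       ${rag_tokens * price_per_token:.4f}")
```

Under these assumptions the full-dump approach costs a couple hundred times more per query, and that gap repeats on every single question.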
RAG and long contexts aren't competing — they work together. You use the context window to hold the retrieved chunks plus the conversation history. RAG decides which chunks are worth putting there.
Real RAG Use Cases You've Already Seen
RAG is everywhere. You've been using RAG-powered products without knowing the term.
Customer Support Chatbots
Every "chat with our help center" widget on a SaaS product is RAG. The chatbot retrieves relevant help articles based on your question, then generates a conversational response using those articles as context. Without RAG, it would either give generic answers or hallucinate product features that don't exist.
AI Coding Tools
Cursor, GitHub Copilot, and similar tools use RAG constantly. When you're editing a file and the tool references your other files, your component library, or your test suite — that's retrieval. The tool searches your codebase for relevant context, augments the prompt, and generates completions that fit your actual project structure rather than generic patterns.
"Chat With Your PDF" Tools
Every tool that lets you upload a document and ask questions about it — Notion AI, Adobe Acrobat AI, Claude's document upload, custom PDF chatbots — is RAG. The document gets chunked and embedded on upload. Your questions retrieve the relevant chunks. The model answers from those chunks.
Internal Knowledge Bases
Companies build RAG systems on top of their internal wikis, Confluence spaces, Slack archives, and Notion databases. Employees ask questions in natural language and get answers sourced directly from internal documentation — no more digging through outdated wikis to find the right page.
Legal and Compliance Tools
Legal teams use RAG to search case law, contracts, and regulatory filings. The model doesn't need to memorize all of securities law — it retrieves the relevant statute sections and generates a grounded response. The citation trail matters here too: lawyers need to show exactly what text the answer came from.
What AI Gets Wrong About RAG
If you ask an AI coding assistant to "build me a RAG system," you'll get working code quickly. But there are common mistakes the generated code will make that cost you later. Here's what to watch for:
Chunk Size Set to an Arbitrary Default
Most starter code uses a fixed chunk size like 1,000 characters. That might be fine, might be terrible — it depends entirely on your documents. Dense technical specifications need smaller chunks than narrative prose. Legal contracts need chunks that preserve clause boundaries, not arbitrary character counts. Default chunk sizes are a guess; your chunk size should be a decision.
No Overlap Between Chunks
When you split a document at a boundary, the sentence context that spans the split is lost. A chunk that ends mid-sentence gives the retriever an incomplete piece. Good chunking uses overlap — the end of one chunk repeats at the start of the next — so retrieved pieces are always coherent. AI-generated RAG code often skips this or uses an overlap so small it doesn't help.
Retrieval Without Re-Ranking
Basic retrieval returns the top-k chunks by vector similarity. But vector similarity isn't the same as "most useful for answering the question." A re-ranking step — using a smaller, faster model to score retrieved chunks for actual relevance — dramatically improves answer quality. Most starter RAG code doesn't include this step.
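Here's the shape of a re-ranking step, sketched in plain Python. In production the scorer would be a cross-encoder model; the word-overlap scorer below is a crude stand-in so the example runs with no dependencies, and the sample chunks are invented:

```python
def term_overlap(question, chunk):
    """Stand-in relevance scorer: fraction of question words found in the chunk.
    A real re-ranker would use a cross-encoder model here instead."""
    q_words = set(question.lower().split())
    return len(q_words & set(chunk.lower().split())) / len(q_words)

def rerank(question, chunks, scorer, keep=2):
    """Score every retrieved chunk against the question, keep only the best."""
    return sorted(chunks, key=lambda c: scorer(question, c), reverse=True)[:keep]

retrieved = [
    "The office closes at 6pm on Fridays.",
    "Overtime pay is 1.5x the base rate for hours over 40 per week.",
    "Overtime must be approved by a manager in advance.",
]
best = rerank("overtime pay policy", retrieved, term_overlap)
print(best[0])  # the chunk that actually answers the question rises to the top
```

The structure is the point: retrieval casts a wide net, then a second scoring pass decides what actually goes into the prompt.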
No Source Attribution
If the AI gives a wrong answer and you can't see what source it used, you have no way to know whether the problem is in your documents or in the model's generation. Always build in source citations. It's not just good UX — it's how you debug a RAG system.
Treating RAG as a One-Time Setup
Documents go stale. Policies change. Products get updated. AI-generated RAG setups often lack any mechanism for keeping the knowledge base current. Production RAG systems need a pipeline for detecting when source documents change and re-indexing the affected chunks. Build this from the start, not as an afterthought.
Ignoring Query Transformation
Users don't always phrase questions in a way that retrieves the best chunks. "What's the deal with the PTO thing?" might not retrieve the same chunks as "What is the company policy on paid time off?" Advanced RAG systems run a query transformation step — using the LLM to rephrase or expand the question — before hitting the vector database. Generated starter code almost never includes this.
Advanced RAG Patterns (When You're Ready)
Once you have a basic RAG system working, there's a whole ecosystem of techniques for improving it. You don't need these on day one, but knowing they exist helps you understand why production RAG systems look different from tutorials.
Hybrid Search
Combines vector (semantic) search with traditional keyword search. You get the synonym-matching power of semantics plus the precision of exact keyword matching. Weaviate and Pinecone both support hybrid search natively. This is often the first upgrade teams make after their initial RAG build.
HyDE (Hypothetical Document Embeddings)
Instead of embedding the user's question and searching for similar chunks, HyDE first asks the LLM to write a hypothetical answer, then embeds that hypothetical answer and uses it to search. Hypothetical answers are more similar in style to real document chunks than questions are — so the retrieval is more accurate.
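A sketch of the HyDE flow, with stand-in functions for the LLM, embedder, and store so it runs on its own. In a real system you'd pass your actual model calls; everything named fake_ here is invented for illustration:

```python
def hyde_search(question, generate, embed, store, k=3):
    """HyDE: generate a hypothetical answer, embed THAT, and search with it.
    Answer-shaped text lands closer to real document chunks in embedding
    space than question-shaped text does."""
    hypothetical = generate(f"Write a short passage that plausibly answers: {question}")
    return store.query(embed(hypothetical), k=k)

# Stand-ins so the sketch runs standalone; swap in your real LLM,
# embedding model, and vector store.
fake_generate = lambda prompt: "Unused PTO is paid out when an employee departs."
fake_embed = lambda text: [text.lower().count("pto"), text.lower().count("cafeteria")]

class FakeStore:
    chunks = [([2, 0], "Unused PTO is paid out at departure."),
              ([0, 2], "The cafeteria opens at 8am.")]
    def query(self, vec, k=3):
        ranked = sorted(self.chunks,
                        key=lambda c: sum(a * b for a, b in zip(vec, c[0])),
                        reverse=True)
        return [text for _, text in ranked[:k]]

print(hyde_search("what happens to my PTO?", fake_generate, fake_embed, FakeStore(), k=1))
```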
Multi-Query Retrieval
The LLM generates multiple versions of the user's question — different phrasings, angles, and specificity levels — then runs retrieval on all of them. The results get merged and deduplicated. Catches more relevant chunks than a single-query approach.
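Here's the merge-and-dedup logic in miniature, with a canned rephraser and retriever standing in for the real LLM and vector database (all the fake_ names and sample results are invented for illustration):

```python
def multi_query_retrieve(question, rephrase, search, n_variants=3, k=4):
    """Run retrieval over several phrasings of the question, then merge
    and deduplicate the results, preserving first-seen order."""
    queries = [question] + rephrase(question, n_variants)
    seen, merged = set(), []
    for q in queries:
        for chunk in search(q, k=k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged

# Stand-ins so the sketch runs without an LLM or vector DB:
fake_rephrase = lambda q, n: ["company policy on paid time off", "PTO rules"][:n]
fake_results = {
    "what's the deal with the PTO thing?": ["FAQ: PTO basics"],
    "company policy on paid time off": ["Handbook: PTO accrual", "FAQ: PTO basics"],
    "PTO rules": ["Policy: PTO payout"],
}
fake_search = lambda q, k=4: fake_results.get(q, [])[:k]

merged = multi_query_retrieve("what's the deal with the PTO thing?",
                              fake_rephrase, fake_search)
print(merged)  # union of all three searches, duplicates removed
```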
Agentic RAG
Instead of a fixed retrieve-then-generate pipeline, an agent decides whether to retrieve, what to search for, and whether to retrieve again after seeing the first results. If you're building something where the AI needs to look up multiple facts and combine them, agentic RAG produces better answers than the static pipeline. Tools like DSPy and frameworks built around MCP servers make agentic retrieval more manageable.
Metadata Filtering
Store metadata alongside each chunk — document date, author, department, product version — and filter at retrieval time. "Only retrieve from documents published after 2024" or "only search the engineering team's docs." This dramatically improves precision for large, heterogeneous knowledge bases.
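The filter-then-rank idea in miniature, using made-up chunks and metadata. Real vector databases do this natively and far more efficiently, but the result is the same: only chunks that pass the metadata filter compete on similarity.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def filtered_retrieve(store, query_vec, where, k=2):
    """Keep only chunks whose metadata matches `where`, then rank the
    survivors by vector similarity."""
    candidates = [e for e in store
                  if all(e["meta"].get(key) == val for key, val in where.items())]
    candidates.sort(key=lambda e: dot(query_vec, e["vector"]), reverse=True)
    return [e["text"] for e in candidates[:k]]

store = [
    {"text": "2023 PTO policy", "vector": [1.0, 0.0], "meta": {"year": 2023, "team": "hr"}},
    {"text": "2025 PTO policy", "vector": [0.9, 0.1], "meta": {"year": 2025, "team": "hr"}},
    {"text": "2025 deploy runbook", "vector": [0.1, 1.0], "meta": {"year": 2025, "team": "eng"}},
]

print(filtered_retrieve(store, [1.0, 0.0], where={"year": 2025}, k=1))
```

The 2023 policy is the closest vector to the query, but the filter removes it before ranking ever happens, which is exactly why metadata filtering improves precision.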
RAG and Prompting
A RAG system is only as good as the prompt that wraps the retrieved context. The retrieval step can surface perfect chunks and the generation step can still produce garbage if the prompt instructions are vague, contradictory, or missing constraints.
The most important prompt instruction in any RAG system: tell the model what to do when the context doesn't contain the answer. "If the answer isn't in the provided context, say you don't know and suggest who the user might contact" is far better than letting the model fall back on its training data.
For a deeper grounding in how to write instructions that actually work, see the AI Prompting Guide for Coders — the principles apply directly to the system prompt you write for your RAG app.
Tools for Building RAG Apps
The RAG ecosystem is mature and well-tooled. You don't need to build any of this from scratch.
Orchestration Frameworks
LangChain is the most widely used — it has loaders, splitters, retrievers, and chain types for every common RAG pattern. The documentation is extensive and there are thousands of community examples. LlamaIndex is the main alternative; it's more focused on data ingestion and indexing, with a cleaner API for document processing. Both are Python-first but have JavaScript versions.
Vector Databases
Chroma runs locally in Python — no account, no cost, great for development. Pinecone is managed, scales well, and has a generous free tier. Weaviate is open source and adds hybrid search and multi-modal support. Supabase with pgvector works if you're already using Postgres and don't want another service. See the full comparison in the vector database guide.
Embedding Models
OpenAI's text-embedding-3-small is cheap, fast, and works well for general English text. Cohere's Embed models support multilingual embedding, and Cohere also offers a re-ranking model. Sentence Transformers are open-source models you can run locally for zero API cost — good for prototyping on sensitive data.
All-in-One Options
If you want to skip the infrastructure entirely, services like Vertex AI Search (Google), Azure AI Search, and Amazon Kendra handle indexing, retrieval, and generation as a managed service. Higher cost, but much less to build and maintain.
Frequently Asked Questions
What is RAG in AI?
RAG stands for Retrieval Augmented Generation. It's how AI apps access information from your own documents or databases before generating an answer. Rather than relying on the model's training data, a RAG system searches your knowledge base, finds the most relevant passages, and puts them into the AI's context window before it responds. The result is answers grounded in your actual content rather than the model's best guess.
Why do AI apps use RAG instead of just training the model on your data?
Fine-tuning a model is expensive, slow, and has to be redone every time your data changes. RAG is immediate — add a document to your knowledge base and the AI can reference it on the next query, no retraining needed. RAG also keeps a clear audit trail: you can see exactly which document the answer came from. Fine-tuning buries knowledge inside model weights where it's hard to inspect or update.
Does RAG stop AI hallucination?
RAG dramatically reduces hallucination for questions within your document set, because the model is working from retrieved source text rather than generating from memory. But it's not a complete cure. If retrieval fails to surface the right chunk, or if the question is about something not in your documents, the model can still guess. Good RAG systems include source citations and instructions to say "I don't know" when the answer isn't in the context.
What is chunking in RAG?
Chunking is splitting your documents into smaller pieces before storing them. You can't retrieve a 200-page manual as a single search result — context windows have limits and big chunks are too noisy. Chunking breaks documents into sections (typically 300–1,500 characters each) so retrieval can surface the specific passage that answers the question. Chunk size and overlap are two of the most important tuning knobs in any RAG system.
What is the difference between RAG and fine-tuning?
Fine-tuning bakes knowledge into the model's weights — the model "memorizes" your data. RAG keeps knowledge external and looks it up at query time. Fine-tuning is better for teaching style, tone, and behavior. RAG is better for keeping factual answers accurate and up to date. For most apps that need to reference specific data, RAG is the right choice — fine-tuning is expensive and doesn't handle knowledge that changes.
Is RAG the same as giving the AI a long context window?
Related, but not the same. A long context window lets you paste more text into a single AI call. RAG selectively retrieves only the relevant pieces and puts those in the context. For small document sets, dumping everything into a long context can work fine. For large knowledge bases — thousands of documents — RAG is necessary because no context window is big enough, and targeted retrieval produces more accurate answers than hunting through a massive context.
How do I build a RAG app without machine learning knowledge?
You don't need ML knowledge for a basic RAG app. The hard parts — converting text to vectors, storing them, running similarity search — are handled by tools like LangChain, LlamaIndex, Chroma, and Pinecone. You connect those to a model API (Claude, OpenAI, Gemini) and write the application logic on top. Most vibe coders get a working RAG prototype running in an afternoon. The challenge is less "how do I build it" and more "how do I tune it to produce accurate results."
What tools do vibe coders use to build RAG apps?
Common stack: LangChain or LlamaIndex for orchestration (wiring retrieval to the AI call), Chroma locally for development or Pinecone for production, and OpenAI or Anthropic APIs for the embedding and generation models. Cursor and Claude are great for scaffolding the boilerplate. Many builders start with LangChain + Chroma locally, then switch to a managed vector database when they deploy.
What is a vector database and why does RAG need one?
A vector database stores your document chunks as numerical vectors representing meaning. When you query it with a question (also converted to a vector), it finds the chunks with the most similar meaning — even if they use completely different words. This semantic matching is what makes RAG accurate: it finds the right passage even when the user's phrasing doesn't match the document's exact wording. Standard SQL databases can't do this efficiently at scale.
What to Learn Next
RAG sits at the intersection of several concepts. These articles cover the pieces that make RAG tick.