TL;DR: RAG stands for Retrieval-Augmented Generation. It's how you make AI apps that actually know your stuff — your company docs, your codebase, your personal notes. Instead of hoping the AI was trained on your data (it wasn't), RAG fetches the relevant pieces of your data at the moment someone asks a question, then feeds those pieces to the AI so it can generate an accurate answer. Three steps: (1) split your documents into chunks, (2) convert those chunks into searchable vectors called embeddings, (3) when a question comes in, find the matching chunks and hand them to the AI. Tools like LangChain, LlamaIndex, Supabase pgvector, and Pinecone make this accessible to vibe coders — no ML degree required.

The Problem RAG Solves

Let's start with a scenario every vibe coder hits eventually.

You're building a customer support chatbot for a small business. The owner wants customers to ask questions and get accurate answers about their products, shipping policies, and return process. You ask Claude to build it. Claude builds a beautiful chat interface. The customer types "What's your return policy?" and the AI responds with... a completely made-up return policy.

It sounds plausible. It's well-written. It's also 100% fiction.

This isn't a bug. It's how AI models work. Claude, GPT, Gemini — they were trained on the general internet. They know what return policies typically look like. They don't know what this specific business's return policy says. So they do what large language models do: generate something that sounds right based on patterns they've seen.

You could paste the return policy into the system prompt. That works for one page. But what about the entire product catalog? The shipping FAQ? The troubleshooting guide? The employee handbook? You quickly run into the context window limit — the maximum amount of text you can feed to the AI at once. Even with models that accept 200K tokens, you can't stuff an entire company's documentation into every single API call. It would be slow, expensive, and wasteful.

RAG solves this elegantly. Instead of cramming everything into the prompt, RAG finds just the relevant pieces of your data for each question, and feeds only those pieces to the AI. Customer asks about returns? RAG finds the return policy document. Customer asks about a specific product? RAG finds that product's page. The AI gets exactly the context it needs — nothing more, nothing less.

Think of it like a reference librarian. You don't read the entire library before answering a question. You know where to look, you pull the right books off the shelf, you read the relevant sections, and then you give an answer. RAG is the reference librarian for your AI app.

How RAG Works — The Three-Step Process

RAG sounds complex because the name is academic. The actual process has three steps, and each one makes intuitive sense once you see it.

Step 1: Chunk Your Documents

First, you take all the documents you want your AI to know about and split them into smaller pieces called chunks. A chunk might be a paragraph, a section, or a fixed number of words — usually 200 to 1,000 tokens (roughly 150 to 750 words).

Why chunks? Because when someone asks a question, you don't want to feed the AI an entire 50-page manual. You want to feed it the two or three paragraphs that actually answer the question. Smaller pieces = more precise retrieval.

Think of it like a filing system. You wouldn't file your entire company handbook as one document in a filing cabinet. You'd organize it into sections — "Vacation Policy," "Expense Reporting," "Remote Work Guidelines" — so you can pull just the section you need.
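The prompt below will get your AI to build the full version. If you want to see the shape of it first, here's a minimal sketch — it approximates tokens with whitespace-split words (a real pipeline would use a tokenizer like tiktoken), and the folder name and chunk sizes are just illustrative:

```python
# Minimal chunking sketch. Word counts stand in for tokens; the "docs"
# folder and the 400/50 sizes are illustrative, not a recommendation.
from pathlib import Path

def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_size - overlap  # each chunk shares `overlap` words with the next
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
    return chunks

chunks = []
for path in Path("docs").glob("*.md"):
    for i, chunk in enumerate(chunk_text(path.read_text())):
        chunks.append({"source": path.name, "chunk_index": i, "text": chunk})
```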

Prompt to Use with Your AI

"I have a folder of markdown documents (about 50 files, mostly company policies and product docs). I need to split them into chunks for a RAG system. Write a Python script that reads all .md files from a folder, splits them into chunks of roughly 500 tokens with 50 tokens of overlap between chunks, and saves each chunk as a JSON object with the original filename, chunk index, and text content."

Step 2: Turn Chunks Into Embeddings (Vectors)

Here's where the "magic" happens — and I'm putting magic in quotes because it's actually straightforward once you understand what's going on.

Each chunk of text gets converted into a list of numbers called an embedding (also called a vector). A typical embedding is a list of 1,536 numbers. These numbers represent the meaning of the text — not the exact words, but what it's actually about.

You don't calculate these numbers yourself. You send the text to an embedding model (like OpenAI's text-embedding-3-small or Cohere's embed-v4) through an API call, and it returns the vector. One API call per chunk. That's it.

Why numbers? Because computers are incredibly fast at comparing numbers. If "What's your return policy?" gets turned into a vector, and your return policy document gets turned into a vector, those two vectors will be close together in mathematical space — even though the words are completely different. The embedding model understands that the question and the document are about the same topic.
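To make that concrete, here's a hedged sketch that embeds a question and a document chunk with OpenAI's embeddings endpoint and compares them. The cosine-similarity math is written out only to show what "close together" means — in a real app the vector database does this part for you:

```python
# Sketch: embed a question and a chunk, then measure how close they are.
# Assumes OPENAI_API_KEY is set; the model name follows the article.
import math
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding  # 1,536 numbers for this model

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

question = embed("How do I send stuff back?")
policy = embed("Items can be returned within 30 days for a full refund.")
print(cosine_similarity(question, policy))  # higher = closer in meaning
```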

This is called semantic search. Instead of searching for exact word matches (like a traditional search engine), you're searching for meaning matches. "How do I send stuff back?" would match your return policy just as well as "What is the return policy?" — because they mean the same thing, even though they share almost no words.

All these vectors get stored in a vector database — a special database designed to answer the question "Which of my stored vectors is most similar to this new vector?" really fast. That's the entire job of a vector database.
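You rarely compute similarity yourself — the vector database handles it. A tiny sketch with ChromaDB (covered in the tools section below); it uses Chroma's default local embedding model so there are no API keys in this toy example:

```python
# Sketch: store two chunks in ChromaDB, then ask which is closest in meaning.
# Chroma's built-in local embedding model is used here for simplicity; in a
# real build you'd plug in an API-based embedding model instead.
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")

collection.add(
    ids=["returns-1", "shipping-1"],
    documents=[
        "Items can be returned within 30 days for a full refund.",
        "Standard shipping takes 3-5 business days.",
    ],
)

results = collection.query(query_texts=["How do I send stuff back?"], n_results=1)
print(results["documents"][0])  # the return-policy chunk, not the shipping one
```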

The Construction Analogy

Think of embeddings like GPS coordinates for ideas. Every document chunk gets a "location" in meaning-space. When someone asks a question, their question also gets a location. Then you just find which document chunks are closest to the question. It's like saying "I need supplies for plumbing work" — you don't search for the word "plumbing" in a store directory. You go to the section of the store where plumbing stuff is located. Embeddings give every piece of text a location based on what it means.

Step 3: Retrieve and Generate

Now someone asks your AI app a question. Here's what happens in real time:

  1. The question becomes a vector. The same embedding model converts the user's question into a vector — that same list of numbers.
  2. Find the closest chunks. The vector database compares the question's vector against all stored chunk vectors and returns the top 3–10 most similar chunks. This takes milliseconds.
  3. Build the prompt. Your app takes those retrieved chunks and packages them into a prompt: "Here is relevant context from our documents: [chunks]. Based on this context, answer the following question: [user's question]."
  4. The AI generates an answer. The language model (Claude, GPT, whatever you're using) reads the context and the question, then generates an answer grounded in your actual data.

That's it. That's RAG. Retrieve the relevant context, then let the AI generate an answer based on that context. Retrieval-Augmented Generation.
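Here's a hedged sketch of the generate half of that loop — packaging retrieved chunks into a prompt and sending it to Claude. It assumes you already have the top chunks back from your vector database (as in the ChromaDB sketch above), and the model name is a placeholder:

```python
# Sketch: retrieve-then-generate. Assumes `retrieved_chunks` came back from
# your vector database and ANTHROPIC_API_KEY is set. The model name is a
# placeholder - use whichever Claude model you have access to.
import anthropic

client = anthropic.Anthropic()

def answer(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n---\n\n".join(retrieved_chunks)
    prompt = (
        f"Here is relevant context from our documents:\n\n{context}\n\n"
        f"Based only on this context, answer the following question:\n{question}"
    )
    message = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=500,
        system=(
            "Answer only from the provided context. If the context does not "
            "contain the answer, say you don't have enough information."
        ),
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text
```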

Prompt to Use with Your AI

"Build me a simple RAG system using Python, OpenAI's embedding API, and Supabase pgvector. I want to: (1) upload markdown files and split them into chunks, (2) generate embeddings and store them in Supabase, (3) have a chat endpoint where users can ask questions and get answers based on my documents. Use the text-embedding-3-small model for embeddings and Claude for generation. Include the SQL to set up the Supabase table with pgvector."

Real-World Examples: What People Actually Build with RAG

RAG isn't theoretical. Vibe coders are building with it right now. Here are the most common patterns:

Customer Support Bots

The classic use case. You feed your product docs, FAQ, and support articles into a RAG system. Customers ask questions in natural language. The bot retrieves the relevant documentation and generates a helpful, accurate response — with citations pointing to the original docs. Companies like Intercom, Zendesk, and Crisp have all added RAG-powered features to their support platforms.

Codebase Q&A

This one hits home for vibe coders. You chunk your entire codebase — every file, every README, every comment — and build a RAG system on top of it. Then you can ask questions like "How does authentication work in this app?" or "Where is the payment processing logic?" and get answers that reference your actual code. Tools like Cursor and Claude Code are essentially doing a form of RAG when they read your codebase to answer questions.

Personal Knowledge Bases

Feed your notes, bookmarks, research papers, and saved articles into a RAG system. Build a personal AI assistant that can answer questions based on everything you've ever saved. "What was that article about progressive web apps I read last month?" Your RAG system finds it and summarizes the key points.

Internal Company Tools

HR policies, onboarding docs, engineering runbooks, meeting notes, project specs — companies are building internal RAG tools so employees can ask "What's the process for requesting PTO?" or "What did we decide in the Q3 planning meeting?" and get accurate answers pulled from actual company documents.

Legal and Compliance

Law firms and compliance teams use RAG to search through thousands of contracts, regulations, and case files. Instead of manually searching through PDFs, they ask questions in plain English and get answers with citations to specific clauses and documents.

What Breaks: The RAG Problems Nobody Warns You About

RAG is powerful. RAG is also fragile in ways that aren't obvious until you've built one and watched it give wrong answers with complete confidence. Here's what goes wrong and how to fix it.

Chunking: Too Small or Too Large

Too small (50–100 tokens): The chunks lose context. A sentence about your return policy might get separated from the sentence that specifies the 30-day window. The AI retrieves the first sentence but not the second, and generates an incomplete answer.

Too large (2,000+ tokens): The chunks are so big that the vector doesn't clearly represent any single topic. A chunk about both shipping AND returns gets retrieved for both topics, diluting the relevance. The AI gets noisy context and produces vague answers.

The fix: Start with 300–500 token chunks with 50–100 tokens of overlap between chunks. The overlap ensures that context at chunk boundaries isn't lost. Test with real questions from real users and adjust based on answer quality.
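If you'd rather not hand-roll the splitting, LangChain's text splitters expose exactly these two knobs. A sketch — the import path reflects recent LangChain releases, the token-counting variant needs tiktoken installed, and the file path is illustrative:

```python
# Sketch: tuning chunk size and overlap with LangChain's splitter.
# from_tiktoken_encoder counts tokens rather than characters (requires tiktoken).
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=400,    # tokens per chunk - start in the 300-500 range
    chunk_overlap=75,  # tokens shared between neighboring chunks
)
chunks = splitter.split_text(open("docs/return-policy.md").read())
```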

Wrong Embedding Model

Not all embedding models are created equal. Older models (like the original text-embedding-ada-002) are noticeably less accurate than current ones — and no longer even cheaper. The newer large models (like text-embedding-3-large or Cohere's embed-v4) understand meaning better but cost more per embed.

The fix: Use the best embedding model you can afford. For most vibe coders, OpenAI's text-embedding-3-small hits the sweet spot of cost and quality. If accuracy is critical (medical, legal, financial), upgrade to text-embedding-3-large or a specialized model.

Hallucination Even WITH Context

This is the one that surprises people. "But I gave the AI the right context — why did it still make stuff up?"

It happens. The AI might combine information from two different chunks incorrectly. It might "fill in" details that aren't in the context because the question asks for something the retrieved documents don't fully cover. Or the retrieved chunks might be almost relevant but not quite, giving the AI enough material to construct a plausible but wrong answer.

The fix: Add explicit instructions in your system prompt: "Only answer based on the provided context. If the context doesn't contain enough information to fully answer the question, say 'I don't have enough information about that in my documents.'" Also, always include source citations — show users which documents the answer came from so they can verify.

Prompt to Use with Your AI

"Here's my RAG system prompt. I'm getting hallucinated answers even when relevant context is provided. Rewrite this system prompt to be stricter about only answering from the provided context. Include instructions to: (1) cite specific document sources for each claim, (2) explicitly say when the context doesn't contain enough information, (3) never combine information from different documents without flagging it. Here's my current prompt: [paste your prompt]"

The "Needle in a Haystack" Problem

Your RAG system has 10,000 chunks. The answer to the user's question is in exactly one of them. The embedding similarity search returns the top 5 chunks — but that one critical chunk ranks #7 and doesn't make the cut. The AI generates an answer from the wrong chunks.

The fix: Increase the number of retrieved chunks (try top 10 or top 20). Use a re-ranking model — a second pass that takes the top 20 results from vector search and re-ranks them using a more sophisticated (and slower) model to find the truly relevant ones. Cohere's re-ranker and cross-encoder models are popular choices. Also consider hybrid search: combine vector search with traditional keyword search. Sometimes the exact keyword match finds what semantic search misses.
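A hedged sketch of the re-ranking pass with Cohere's Python SDK — `vector_search` is a hypothetical stand-in for your existing retrieval call, and the rerank model name is the one Cohere documented at the time of writing, so check the current docs:

```python
# Sketch: second-pass re-ranking. Pull a generous number of candidates from
# vector search, then let the re-ranker pick the best few.
import cohere

co = cohere.Client(api_key="YOUR_COHERE_API_KEY")

candidate_chunks = vector_search(question, top_k=20)  # hypothetical helper for your vector DB

reranked = co.rerank(
    model="rerank-english-v3.0",  # check Cohere's docs for the current model name
    query=question,
    documents=candidate_chunks,
    top_n=5,
)
best_chunks = [candidate_chunks[r.index] for r in reranked.results]
```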

Cost at Scale

Embedding 100 documents is nearly free. Embedding 100,000 documents starts adding up. And every user query requires an embedding API call (for the question) plus a language model API call (for the answer). At scale, this gets expensive.

The fix: Use cheaper embedding models for initial indexing. Cache frequent queries so identical questions don't trigger new API calls. Use pgvector inside a Postgres database you already pay for (such as Supabase) instead of a dedicated managed vector service to cut per-query costs. And consider using a smaller, faster model for simple questions, reserving the expensive model for complex ones.
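Caching is the cheapest win. A minimal in-memory sketch — `answer_with_rag` is a hypothetical stand-in for your full pipeline, and a production app would persist the cache in Redis or a database table instead of a dict:

```python
# Sketch: cache answers keyed by a hash of the normalized question so repeat
# queries skip the embedding and LLM calls entirely.
import hashlib

_answer_cache: dict[str, str] = {}

def cached_answer(question: str) -> str:
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    if key not in _answer_cache:
        _answer_cache[key] = answer_with_rag(question)  # hypothetical pipeline call
    return _answer_cache[key]
```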

Stale Data

Your documents change. Policies get updated. Products get discontinued. Code gets refactored. But your RAG system still has the old chunks with the old embeddings. The AI confidently answers based on outdated information.

The fix: Build a re-indexing pipeline. When documents change, re-chunk and re-embed the updated sections. Most vector databases support upsert operations — update existing vectors without rebuilding the entire index. Set up a scheduled job that checks for document changes and updates the vector store automatically.
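A sketch of one way to re-index only what changed, assuming a Chroma collection and the `chunk_text` helper from the chunking sketch earlier — a per-file content hash decides whether that file's chunks get re-embedded and upserted:

```python
# Sketch: incremental re-indexing. Files whose content hash is unchanged are
# skipped; changed files have their chunks upserted (replace-or-insert by id).
# Assumes an existing Chroma `collection` and the chunk_text helper above.
import hashlib
import json
from pathlib import Path

state_file = Path("index_state.json")
seen = json.loads(state_file.read_text()) if state_file.exists() else {}

for path in Path("docs").glob("*.md"):
    text = path.read_text()
    digest = hashlib.sha256(text.encode()).hexdigest()
    if seen.get(path.name) == digest:
        continue  # unchanged since the last run
    pieces = chunk_text(text)
    collection.upsert(
        ids=[f"{path.name}-{i}" for i in range(len(pieces))],
        documents=pieces,
    )
    seen[path.name] = digest

state_file.write_text(json.dumps(seen))
```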

Tools Vibe Coders Actually Use

You don't need to build RAG from scratch. These tools handle the hard parts so you can focus on your app.

Vector Databases

Supabase pgvector

PostgreSQL extension that adds vector search to your existing Supabase database. Best for vibe coders who already use Supabase — you get a vector database and a regular database in one place. Free tier available. SQL-based, so you can query vectors alongside your regular data.

Pinecone

Purpose-built managed vector database. Fast, scalable, and the most popular dedicated option. Free tier with 100K vectors. Best for production apps with large datasets. Downside: it's only a vector database — you'll need a separate database for everything else.

ChromaDB

Open-source, runs locally, dead simple to set up. pip install chromadb and you're running. Best for prototyping and personal projects. Stores everything on your machine — no cloud, no API keys, no cost. Scales poorly for production but perfect for learning and small apps.

Weaviate

Open-source vector database with built-in vectorization — it can generate embeddings for you, so you skip the separate embedding API step. Cloud-hosted or self-hosted. Good middle ground between Pinecone's ease and ChromaDB's openness.

RAG Frameworks

LangChain

The most popular framework for building LLM-powered apps. Handles document loading, chunking, embedding, vector store integration, and the retrieve-then-generate pipeline. Available for Python and JavaScript. Huge community, tons of examples. Can feel over-engineered for simple use cases, but it's the industry standard.

LlamaIndex

Specifically designed for RAG. While LangChain is a general-purpose LLM framework, LlamaIndex focuses on connecting LLMs with your data. Simpler for RAG-specific use cases. Handles complex document structures (PDFs, spreadsheets, databases) better than most alternatives. Python and TypeScript.

Prompt to Use with Your AI

"I want to build a RAG-powered Q&A bot for my company's internal docs. We use Supabase for our database. Build it with: Python backend, Supabase pgvector for the vector store, OpenAI text-embedding-3-small for embeddings, and Claude for answering questions. Include: a script to ingest markdown files, a FastAPI endpoint for asking questions, and a simple React frontend with a chat interface. Use LangChain for the RAG pipeline. The system prompt should instruct the AI to cite which document each answer comes from."

RAG vs. the Alternatives

RAG isn't the only way to make AI know your data. Here's how it compares to the alternatives — and when each approach makes sense.

RAG vs. Long Context Windows

Modern models like Claude and Gemini support context windows of 200K+ tokens. Why not just paste all your documents into the prompt?

You can — for small datasets. If your total documentation fits within the context window, this is actually simpler than RAG. The AI sees everything and can answer any question.

The problems: cost (you pay per token, so sending 200K tokens on every query gets expensive fast), speed (more tokens = slower responses), and accuracy (models actually perform worse at finding specific information buried in very long contexts — the "lost in the middle" problem). RAG is more cost-effective and often more accurate because it sends only the relevant context.

RAG vs. Fine-Tuning

Fine-tuning trains the AI model itself on your data. Instead of feeding context at query time, you bake the knowledge into the model's weights.

Fine-tuning is better for: teaching the AI a specific style (write like our brand voice), a specific format (always output in our JSON schema), or domain-specific vocabulary (medical terminology, legal jargon).

RAG is better for: factual recall from specific documents, data that changes frequently, and situations where you need citations back to source documents. For most vibe coders building apps that need to answer questions about specific data, RAG is the right choice.

RAG + Fine-Tuning (The Advanced Play)

The most sophisticated systems use both. Fine-tune the model to understand your domain and follow your format. Use RAG to provide the specific facts and data points. This is overkill for most projects but worth knowing about as you scale.

Building Your First RAG App — The Vibe Coder Way

Here's the fastest path from "I have documents" to "I have a working RAG app" for vibe coders:

  1. Pick your vector database: If you're prototyping, use ChromaDB (local, free, zero setup). If you're building for production and already use Supabase, use pgvector. If you want a managed service, use Pinecone.
  2. Pick your framework: LlamaIndex if your app is primarily RAG. LangChain if RAG is one part of a larger LLM application. Both work — LlamaIndex is simpler for pure RAG.
  3. Pick your embedding model: text-embedding-3-small from OpenAI for the best cost/quality balance. Free alternatives exist if you run them locally.
  4. Pick your LLM: Claude Sonnet for the best balance of quality and speed. GPT-4o if you prefer OpenAI. Use OpenRouter to easily switch between models without changing your code.
  5. Tell your AI to build it: Use the prompt below. Seriously — this is the vibe coder way. You describe what you want, the AI builds it, you test it, you iterate.

Prompt to Use with Your AI — The Full Build

"Build me a complete RAG application with these specs:

Backend: Python with FastAPI
Vector DB: ChromaDB (local, for prototyping)
Embedding model: OpenAI text-embedding-3-small
LLM: Claude via Anthropic API
Framework: LlamaIndex

Features I need:
1. An ingestion script that reads all .md and .txt files from a /docs folder, chunks them (500 tokens, 50 token overlap), generates embeddings, and stores them in ChromaDB
2. A /query endpoint that takes a question, retrieves the top 5 relevant chunks, sends them to Claude with instructions to only answer from context and cite sources, and returns the answer
3. A simple HTML chat interface
4. Include .env for API keys
5. Include a README with setup instructions

Make the system prompt strict about not hallucinating — if the answer isn't in the retrieved context, say so."

What AI Gets Wrong About RAG

If you ask AI tools to explain RAG, they'll often give you explanations that are technically correct but practically misleading. Here's what to watch for:

"You Need a Deep Understanding of Vector Mathematics"

You don't. You need to understand what embeddings do (convert text to searchable number lists) — not how they do it. You don't need to understand cosine similarity formulas any more than you need to understand TCP/IP to browse the web. The tools handle the math.

"RAG Eliminates Hallucination"

It reduces hallucination dramatically. It doesn't eliminate it. Always include source citations and "I don't know" fallbacks. Any explanation that says RAG "solves" hallucination is oversimplifying.

"Just Use LangChain for Everything"

LangChain is powerful but complex. For a simple RAG app, you can build the entire pipeline in 50 lines of Python without any framework. Frameworks add value when your app grows — don't start with one unless you need it. LlamaIndex is simpler if RAG is your only use case.

"You Need an Expensive Vector Database"

For prototyping and small apps (under 10,000 documents), ChromaDB running locally is free and fast. Supabase pgvector's free tier handles most hobby projects. You only need Pinecone or a dedicated service when you're dealing with millions of vectors and need low-latency queries at scale.

How to Debug RAG with AI

When your RAG app gives wrong answers, the problem is in one of three places. Here's how to find it:

1. Bad Retrieval (Wrong Chunks)

The AI got irrelevant or incomplete context. How to check: Log the retrieved chunks for every query. Read them yourself. Are they actually relevant to the question? If not, the problem is in your chunking strategy or embedding model.
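A few lines of logging are usually enough to make retrieval problems visible. A sketch that assumes the ChromaDB-style query result shown earlier (ids, documents, and distances come back as lists per query):

```python
# Sketch: append every query, its retrieved chunk ids, distances, and text to
# a JSONL file so you can read them back when an answer looks wrong.
import json
import time

def log_retrieval(question: str, results: dict) -> None:
    record = {
        "timestamp": time.time(),
        "question": question,
        "chunk_ids": results["ids"][0],
        "distances": results["distances"][0],
        "chunks": results["documents"][0],
    }
    with open("retrieval_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
```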

Prompt to Use with Your AI

"My RAG app is returning irrelevant results for user queries. Here are 5 example queries and the chunks that were retrieved for each: [paste examples]. Analyze the retrieval quality and suggest improvements to my chunking strategy, embedding model choice, or retrieval parameters."

2. Bad Generation (AI Misinterpreted the Chunks)

The right chunks were retrieved, but the AI still gave a wrong answer. How to check: Look at the full prompt that was sent to the LLM (context + question). Could a human answer correctly from that context? If yes, the problem is in your system prompt — the AI needs clearer instructions about how to use the context.

3. Bad Data (Garbage In, Garbage Out)

Your source documents are incomplete, outdated, contradictory, or poorly formatted. No amount of RAG engineering fixes bad source data. How to check: When the AI gives a wrong answer, trace it back to the source document. Is the source document actually correct and complete?

What to Learn Next

RAG touches several concepts that are worth understanding as you build more sophisticated AI apps:

  • What Is OpenRouter? — Route your RAG app's LLM calls through OpenRouter to easily switch between AI models without changing code. Use cheap models for simple queries, expensive models for complex ones.
  • Supabase vs Firebase — Supabase includes pgvector for vector search, making it a natural choice for RAG apps that also need a traditional database. Understand when to choose it over Firebase.
  • What Is an API? — Every piece of the RAG pipeline communicates through APIs — the embedding model, the vector database, the LLM. Understanding APIs makes debugging RAG much easier.

Next Step

Start small. Take 5–10 markdown documents — your own notes, a project README, some documentation you reference frequently — and build a RAG app with ChromaDB and LlamaIndex. Ask it questions. See what it gets right and what it gets wrong. Debug the wrong answers using the three-step process above. You'll learn more from 30 minutes of hands-on debugging than from reading another 10 articles about RAG theory. And when you're ready for production, swap ChromaDB for Supabase pgvector and you're live.

FAQ

What is RAG and how does it work?

RAG (Retrieval-Augmented Generation) is a technique that lets AI apps answer questions using YOUR specific data instead of just their general training. It works in three steps: your documents get split into chunks, those chunks get converted into searchable vectors (embeddings), and when someone asks a question, the system finds the most relevant chunks and feeds them to the AI along with the question. The AI then generates an answer based on your actual data — not guesses.

Do I need to be an AI expert to build a RAG app?

No. Modern tools like LangChain, LlamaIndex, and Supabase pgvector handle the complex parts for you. You don't need to understand the math behind embeddings or vector similarity search. You need to know what to tell your AI coding tool to build, how to structure your documents, and how to test whether the answers are accurate. Think of it like using a database — you don't need to understand B-trees to run a SQL query.

How much does it cost to run a RAG app?

Costs vary widely depending on scale. For a small personal knowledge base (a few hundred documents), you can run RAG almost free using Supabase's free tier for pgvector and a cheap embedding model. For production apps with thousands of documents and many users, expect $50–500/month for vector database hosting plus embedding and LLM API costs. The biggest cost drivers are the number of documents you embed and how many queries users make.

What's the difference between RAG and fine-tuning?

RAG retrieves your data at query time and feeds it to a general-purpose AI model. Fine-tuning permanently changes the AI model's weights by training it on your data. RAG is cheaper, faster to set up, easier to update (just add new documents), and doesn't require ML expertise. Fine-tuning produces a specialized model that "knows" your data natively but costs more, takes longer, and requires retraining when data changes. For most vibe coders, RAG is the right choice.

Can a RAG app still hallucinate?

Yes. RAG dramatically reduces hallucination by giving the AI real source material, but it doesn't eliminate it completely. The AI can still misinterpret the retrieved context, combine information from different chunks incorrectly, or fill gaps with made-up details. The fix: always include source citations in your RAG app's output so users can verify answers, and instruct the AI to say "I don't have information about that" when the retrieved context doesn't contain a clear answer.