Large language models are remarkable at reasoning, writing, and synthesizing information. They are also, by design, frozen in time. A model trained on data through a certain date knows nothing about what happened after that date. More importantly for most business applications, it knows nothing about your organization’s internal documents, your product catalog, your customer data, your support tickets, or any of the private information that makes an AI assistant genuinely useful for your specific context.
The naive solution is to paste all your relevant documents into the prompt. This works at small scale and breaks at larger scale for two reasons. Context windows, while growing, still have limits. And sending thousands of documents in every prompt is expensive and slow even when it is technically possible.
The more fundamental solution is retrieval. Instead of giving the model everything that might be relevant, you retrieve only the pieces that are most relevant to the specific question being asked, include those in the prompt, and let the model reason over that focused context.
This is retrieval-augmented generation, commonly called RAG. It is the architectural pattern that has become the standard approach for grounding language model outputs in specific, up-to-date, private, or domain-specific knowledge. Understanding how it works is essential for anyone building AI applications today.
The Core Problem RAG Solves
To understand why RAG exists, it helps to be specific about the limitations it addresses.
Knowledge cutoff. Every language model has a training cutoff date. Events, publications, product releases, and organizational changes that occurred after that date are unknown to the model. For applications where currency matters, a model answering questions from stale training data produces wrong answers confidently. RAG allows you to connect a language model to current information by retrieving it at query time rather than baking it into training.
Private and proprietary knowledge. Language models are trained largely on publicly available data. Your internal documentation, proprietary research, customer records, and organizational knowledge were not part of that training, and no amount of clever prompting can make a model recall facts it never saw. RAG provides a mechanism for making this private knowledge available to the model at inference time without retraining.
Hallucination reduction. Language models sometimes generate plausible-sounding but factually incorrect information, particularly on specific factual questions. When a model is given retrieved context that contains the correct answer, it is far more likely to use that answer and far less likely to fabricate one. RAG reduces hallucination by grounding the model’s responses in retrieved source material.
Auditability and citation. When a model answers from training data alone, it is difficult to trace where the answer came from or verify it. When a model answers from retrieved context, the source documents are known and can be surfaced to the user as citations. This auditability is critical for enterprise applications where users need to verify responses.
The RAG Pipeline Overview
A RAG pipeline has two distinct phases that operate at different times.
The indexing phase happens once, or periodically when the knowledge base changes. It prepares your documents for retrieval by converting them into a searchable vector index. This phase is where chunking, embedding, and index construction happen.
The retrieval and generation phase happens at query time, once for each user question. It retrieves the most relevant chunks from the index, constructs a prompt containing the retrieved context and the user’s question, and sends that prompt to the language model to generate a response.
Understanding the separation between these phases is important because they have different performance characteristics and different failure modes. Problems with retrieval quality most often originate in indexing phase decisions, particularly chunking strategy and embedding model choice. Problems with answer quality can originate in either phase.
Phase One: Indexing
Document Loading
The first step of indexing is loading your source documents into a format the pipeline can process. Documents come in many formats: PDFs, Word documents, HTML pages, Markdown files, plain text, database records, and API responses. Each format requires different handling to extract clean text.
```python
from langchain_community.document_loaders import (
    PyPDFLoader,
    WebBaseLoader,
    TextLoader,
)

# Load a PDF
pdf_loader = PyPDFLoader("company_handbook.pdf")
pdf_docs = pdf_loader.load()

# Load a web page
web_loader = WebBaseLoader("https://docs.example.com/api")
web_docs = web_loader.load()

# Load a text file
text_loader = TextLoader("product_faq.txt")
text_docs = text_loader.load()

# Combine all documents
all_documents = pdf_docs + web_docs + text_docs
```
Document loading is more nuanced than it appears. PDFs with complex layouts, tables, or scanned images require specialized handling. HTML pages contain navigation, headers, and footer content that should be stripped before indexing. Database records need to be formatted as coherent text before embedding. The quality of the text extracted at this stage directly affects retrieval quality downstream.
Chunking
Chunking is the process of splitting documents into smaller pieces that will be individually embedded and stored in the vector index. It is one of the most consequential decisions in a RAG pipeline and one that beginners consistently underinvest in.
The reason chunking is necessary is that embedding models have token limits, typically 512 to 8192 tokens depending on the model, and embedding an entire document as a single vector loses the fine-grained information needed for precise retrieval. A single vector representing a fifty-page document cannot encode enough detail to retrieve the specific paragraph answering a specific question. Smaller chunks produce more precise embeddings and more relevant retrieval results.
The tension in chunking is between too small and too large. Chunks that are too small may lack the surrounding context needed to make them meaningful or useful as standalone pieces. Chunks that are too large lose precision in the embedding and may contain irrelevant information alongside relevant information.
Fixed-size chunking splits documents into chunks of a fixed number of tokens or characters, with an overlap between consecutive chunks to avoid cutting information at boundaries.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # Measured by length_function, so characters here
    chunk_overlap=64,     # Characters shared between consecutive chunks
    length_function=len,  # Swap in a token counter to size chunks in tokens
    separators=["\n\n", "\n", ".", " ", ""],
)

chunks = splitter.split_documents(all_documents)
print(f"Created {len(chunks)} chunks from {len(all_documents)} documents")
```
The RecursiveCharacterTextSplitter tries to split on natural boundaries first (paragraphs, then lines, then sentences, then spaces) before resorting to arbitrary character splits. This produces more semantically coherent chunks than splitting purely on character count.
Semantic chunking uses embedding similarity to split documents at points where the semantic content shifts, rather than at fixed character counts. Consecutive sentences are compared and a split is inserted when the similarity between adjacent sentences drops below a threshold, indicating a topic transition.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

semantic_splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)

semantic_chunks = semantic_splitter.split_documents(all_documents)
```
Semantic chunking produces more coherent chunks but is slower and more expensive because it requires embedding sentences during the chunking step itself. For knowledge bases where retrieval quality is critical, the investment is often worthwhile.
Document structure-aware chunking respects the inherent structure of the source document. For Markdown documents, split on heading boundaries. For code documentation, keep function definitions and their docstrings together. For legal documents, preserve section boundaries. This requires document-type-specific logic but produces chunks that are coherent in the way human readers expect.
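As a rough illustration of structure-aware chunking for Markdown, the sketch below splits a document at heading lines and records each section's heading so it can be stored as metadata. The function name `split_markdown_by_heading` and the sample text are made up for this example, and the sketch assumes ATX-style headings (`#` through `######`); it does not handle `#` lines inside fenced code blocks.

```python
import re


def split_markdown_by_heading(markdown_text: str) -> list[dict]:
    """Split Markdown into one chunk per heading section (illustrative sketch)."""
    chunks = []
    current_heading = None  # Text before the first heading has no heading
    current_lines = []

    def flush():
        # Store the section collected so far, skipping empty sections
        body = "\n".join(current_lines).strip()
        if body:
            chunks.append({"heading": current_heading, "text": body})

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()  # A new heading closes the previous section
            current_heading = match.group(2).strip()
            current_lines = [line]
        else:
            current_lines.append(line)
    flush()  # Close the final section
    return chunks


sample = "Intro text\n# Setup\nInstall it.\n## Options\nUse --fast.\n"
sections = split_markdown_by_heading(sample)
for section in sections:
    print(section["heading"], "->", repr(section["text"]))
```

Keeping the heading line inside the chunk text gives the embedding model a hint about the topic, and storing it separately makes it available for metadata filtering later.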
Chunk overlap is an important parameter regardless of chunking strategy. When a chunk boundary falls in the middle of a sentence or concept, a small overlap between adjacent chunks ensures that the content near the boundary appears in both chunks, reducing the chance that important information is split across two chunks without appearing in either one completely.
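To make the effect of overlap concrete, here is a minimal character-based sliding window. It is a simplified stand-in for what a real splitter does, not the LangChain implementation; the function name `chunk_with_overlap` and the toy input are assumptions for this example.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The window already reached the end of the text
    return chunks


pieces = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=1)
print(pieces)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, so content at a boundary appears in both neighbors. Larger overlaps reduce boundary losses at the cost of storing and embedding more duplicated text.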
Embedding
Once documents are chunked, each chunk is converted into a vector using an embedding model. This vector numerically represents the semantic content of the chunk and is what enables similarity-based retrieval.
```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create vector store and embed all chunks
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
print(f"Indexed {len(chunks)} chunks into vector store")
```
The choice of embedding model affects retrieval quality significantly. Key considerations are: the model’s performance on semantic similarity benchmarks for your content type; the dimensionality of its output vectors (higher dimensions can capture more information but cost more in storage and compute); the maximum token length it can embed (depending on the model and client, longer chunks are truncated or rejected); and whether it is optimized for asymmetric search, where short queries retrieve longer passages.
Popular embedding model choices are OpenAI’s text-embedding-3-small (1536 dimensions, good balance of quality and cost), text-embedding-3-large (3072 dimensions, higher quality at higher cost), Cohere’s embed-english-v3.0 (1024 dimensions, strong performance), and open source options like sentence-transformers/all-MiniLM-L6-v2 (384 dimensions, fast and free to run locally).
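Whatever model produces the vectors, similarity-based retrieval usually comes down to comparing the query vector with each chunk vector, most often by cosine similarity. The sketch below shows that comparison with tiny hand-written vectors; real embeddings have hundreds or thousands of dimensions, and a vector store performs this search far more efficiently. The function names and the toy vectors are assumptions for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int):
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scored = [
        (idx, cosine_similarity(query_vec, vec))
        for idx, vec in enumerate(chunk_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" (made up for this example)
query_vec = [1.0, 0.0]
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k(query_vec, chunk_vecs, k=2)
print(best)  # Chunk 0 first (same direction), then chunk 2
```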
Metadata Enrichment
Before chunks are stored in the vector index, attaching metadata to each chunk dramatically improves retrieval quality and enables filtered search.
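```python
from langchain_core.documents import Document


# Hypothetical helpers: replace these placeholders with your own logic.
def extract_section(chunk):
    """Return the section a chunk belongs to (placeholder)."""
    return chunk.metadata.get("section", "unknown")


def classify_content(chunk):
    """Return a content type such as policy, guide, or faq (placeholder)."""
    return "unknown"


enriched_chunks = []
for i, chunk in enumerate(chunks):
    enriched_chunk = Document(
        page_content=chunk.page_content,
        metadata={
            # Preserve original source metadata
            **chunk.metadata,
            # Add derived metadata
            "chunk_index": i,
            "chunk_length": len(chunk.page_content),
            "document_section": extract_section(chunk),
            "content_type": classify_content(chunk),
            # Example values; in practice derive these per document
            "created_date": "2024-01-15",
            "department": "engineering",
        },
    )
    enriched_chunks.append(enriched_chunk)

# Index enriched_chunks (rather than chunks) so the metadata is stored with each vector
```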
Metadata enables filtered retrieval. Instead of searching all chunks, you can restrict the search to chunks from a specific document, department, date range, or content type. A question about engineering processes retrieves only from engineering documentation. A question about recent policy changes retrieves only from documents updated in the past month.
Phase Two: Retrieval and Generation
Retrieval
At query time, the user’s question is embedded using the same model used to embed the chunks, and the vector store is queried for the most similar chunks.
```python
# Basic similarity search
query = "What is the company's remote work policy?"
retrieved_docs = vectorstore.similarity_search(
    query=query,
    k=5,  # Return top 5 most similar chunks
)

# Similarity search with relevance scores
retrieved_docs_with_scores = vectorstore.similarity_search_with_relevance_scores(
    query=query,
    k=5,
)
for doc, score in retrieved_docs_with_scores:
    print(f"Score: {score:.3f} | Source: {doc.metadata.get('source', 'unknown')}")
    print(f"Content: {doc.page_content[:200]}...")
    print()
```
Filtered retrieval restricts the search to a subset of the index based on metadata:
```python
# Search only within HR department documents
retrieved_docs = vectorstore.similarity_search(
    query=query,
    k=5,
    filter={"department": "hr"},
)
```
Maximal marginal relevance (MMR) retrieval improves result diversity. Standard similarity search returns the top k most similar chunks, which may be redundant if they are all very similar to each other. MMR balances similarity to the query against dissimilarity from already-selected results, producing a more diverse and informative set of retrieved chunks.
```python
# MMR retrieval for diverse results
retrieved_docs = vectorstore.max_marginal_relevance_search(
    query=query,
    k=5,
    fetch_k=20,       # Candidate pool size
    lambda_mult=0.5,  # Balance between relevance and diversity (0=max diversity, 1=max relevance)
)
```
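To show what the MMR selection step does with the candidate pool, here is a small from-scratch sketch. It is an illustration of the general idea, not the vector store's implementation: at each step it picks the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected. The function names and toy vectors are assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidate_vecs, k, lambda_mult=0.5):
    """Greedily pick k candidate indices, trading relevance against redundancy."""
    remaining = list(range(len(candidate_vecs)))
    selected = []
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for idx in remaining:
            relevance = cosine(query_vec, candidate_vecs[idx])
            # Highest similarity to any already-selected candidate
            redundancy = max(
                (cosine(candidate_vecs[idx], candidate_vecs[s]) for s in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 covers a different direction
query_vec = [1.0, 1.0, 0.0]
candidates = [[1.0, 0.2, 0.0], [1.0, 0.1, 0.0], [0.0, 1.0, 0.5]]

diverse = mmr_select(query_vec, candidates, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query_vec, candidates, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain top-k by similarity, which keeps both near-duplicates. At `0.5` the second pick shifts to the less similar but non-redundant candidate.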
Hybrid Search
Pure vector similarity search works well for semantic queries but can miss exact keyword matches that matter. A query for a specific product model number or a person’s name is better served by keyword search than by semantic search. Hybrid search combines both approaches and typically outperforms either alone.
```python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# BM25 keyword retriever
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Vector similarity retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Ensemble combining both
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6],  # Weight vector search more heavily
)

hybrid_results = ensemble_retriever.invoke(query)
```
The weights between keyword and vector search are tunable and the optimal balance depends on your content and query patterns. Queries that tend to be conceptual benefit from more vector search weight. Queries that tend to reference specific entities or terms benefit from more keyword search weight.
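One common way to merge a keyword ranking with a vector ranking is weighted reciprocal rank fusion (RRF): each document earns `weight / (rank + c)` from every list it appears in, and the totals decide the final order. The sketch below illustrates the idea independently of any library; the function name, the document IDs, and the constant `c=60` (a conventional default) are assumptions.

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked higher (smaller rank) contribute more
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


# Made-up rankings: the keyword search finds the exact SKU match first,
# while the vector search prefers a conceptually related document.
keyword_ranking = ["doc_sku", "doc_a", "doc_b"]
vector_ranking = ["doc_a", "doc_c", "doc_sku"]

fused = weighted_rrf([keyword_ranking, vector_ranking], weights=[0.4, 0.6])
print(fused)
```

Because RRF works on ranks rather than raw scores, it avoids having to put BM25 scores and cosine similarities on a common scale. Documents that appear in both lists naturally rise to the top.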
Reranking
The initial retrieval step retrieves candidates quickly using approximate nearest neighbor search. Reranking takes those candidates and scores them with a more accurate but slower model to reorder them before passing to the language model.
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Retrieve more candidates than needed
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# Rerank and keep top results
compressor = CohereRerank(
    model="rerank-english-v3.0",
    top_n=5,
)
reranking_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)

reranked_docs = reranking_retriever.invoke(query)
```
Cross-encoder rerankers like Cohere Rerank and Jina Reranker evaluate the query and each candidate chunk together, producing a more accurate relevance score than the embedding similarity used for initial retrieval. The pattern of retrieving twenty candidates and reranking to keep five often improves the quality of the final context sent to the language model, at the cost of one extra model call per query.
Query Transformation
User queries are often poorly formed for retrieval. A user who asks a vague question, a question that relies on conversation history for context, or a multi-part question may get poor retrieval results because the raw query does not match the relevant chunks well.
Query transformation techniques rewrite or expand the query before retrieval to improve results.
Hypothetical Document Embeddings (HyDE) asks the language model to generate a hypothetical answer to the question and then uses that hypothetical answer as the retrieval query. Since the hypothetical answer is written in the same style as the documents in the index, it often retrieves more relevant chunks than the original question.
```python
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

hyde_prompt = ChatPromptTemplate.from_template("""
Generate a hypothetical document that would answer the following question.
Write it as if it were an excerpt from an official company document.
Question: {question}
Hypothetical document excerpt:
""")

hyde_chain = hyde_prompt | llm
hypothetical_doc = hyde_chain.invoke({"question": query})

# Search with the generated text instead of the original question
retrieved_docs = vectorstore.similarity_search(
    hypothetical_doc.content,
    k=5,
)
```
Multi-query retrieval generates multiple variations of the original question and retrieves for each, then deduplicates the results. This improves recall when the original query is ambiguous or when relevant chunks might use different terminology.
```python
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    llm=llm,
)

# Generates multiple query variations internally
multi_query_results = multi_query_retriever.invoke(query)
```
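The retrieve-then-deduplicate step behind multi-query retrieval can be sketched without any library. In the example below, `search` is a hypothetical callable standing in for a real retriever, and the query variations and chunk IDs are made up; in practice the variations would come from a language model.

```python
from typing import Callable


def retrieve_for_variations(
    queries: list[str],
    search: Callable[[str], list[str]],
) -> list[str]:
    """Run every query variation and return the unique results in first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for q in queries:
        for chunk_id in search(q):
            if chunk_id not in seen:  # Skip chunks another variation already found
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


# Hypothetical stand-in index mapping each variation to the chunks it retrieves
fake_index = {
    "remote work policy": ["c1", "c2"],
    "work from home rules": ["c2", "c3"],
    "telecommuting guidelines": ["c3", "c1"],
}


def fake_search(q: str) -> list[str]:
    return fake_index.get(q, [])


merged_chunks = retrieve_for_variations(list(fake_index), fake_search)
print(merged_chunks)  # ['c1', 'c2', 'c3']
```

Six raw results collapse to three unique chunks. The union is larger than any single query's result set, which is where the recall improvement comes from.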
Prompt Construction
Once relevant chunks are retrieved, they are assembled into a prompt for the language model. The structure of this prompt significantly affects the quality of the generated response.
```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser


def format_context(docs):
    """Join retrieved chunks into one string, labelling each with its source."""
    formatted = []
    for i, doc in enumerate(docs):
        source = doc.metadata.get("source", "Unknown")
        formatted.append(
            f"[Source {i+1}: {source}]\n{doc.page_content}"
        )
    return "\n\n---\n\n".join(formatted)


rag_prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant answering questions based on the provided context.
Answer the question using only the information in the context below.
If the context does not contain enough information to answer the question,
say so clearly rather than guessing or using outside knowledge.
Always cite the source document for key claims.
Context:
{context}
Question: {question}
Answer:
""")

# The chain input is a dict, so both branches pull out the question string.
# (Passing the whole dict through would put {"question": ...} into the prompt.)
rag_chain = (
    {
        "context": lambda x: format_context(
            vectorstore.similarity_search(x["question"], k=5)
        ),
        "question": lambda x: x["question"],
    }
    | rag_prompt
    | llm
    | StrOutputParser()
)

response = rag_chain.invoke({"question": query})
print(response)
```
Four principles guide RAG prompt construction: state explicitly that the model should answer from the provided context; instruct the model to acknowledge when the context is insufficient rather than hallucinating; include source attribution in the context so the model can cite sources in its response; and keep the instruction section concise to leave as much of the context window as possible for retrieved chunks.
Putting the Full Pipeline Together
Here is a complete, minimal RAG pipeline that illustrates all the components working together:
```python
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# ===== INDEXING PHASE =====

# 1. Load documents
loader = TextLoader("knowledge_base.txt")
documents = loader.load()

# 2. Chunk documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
)
chunks = splitter.split_documents(documents)

# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# ===== RETRIEVAL AND GENERATION PHASE =====

# 4. Define prompt
prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the following context.
If you cannot find the answer in the context, say so clearly.
Context:
{context}
Question: {question}
Answer:
""")

# 5. Initialize language model
llm = ChatOpenAI(model="gpt-4o", temperature=0)


# 6. Assemble RAG chain
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


# The chain input is the question string itself, so RunnablePassthrough
# forwards it unchanged while the retriever uses it to fetch context.
rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)

# 7. Query
question = "What is the refund policy for digital products?"
answer = rag_chain.invoke(question)
print(answer)
```
RAG vs Fine-Tuning
A common question when considering RAG is whether fine-tuning the language model would be a better approach. The two techniques address different problems and are often complementary rather than competing.
Fine-tuning modifies the model’s weights by training it on domain-specific data. It is the right approach when you want the model to learn a specific style, tone, or format of output, or when you want the model to internalize a specific way of reasoning about a domain. It is not an efficient way to inject factual knowledge because facts memorized in weights are difficult to update and models still hallucinate even on content they were fine-tuned on.
RAG keeps the model’s weights unchanged and provides knowledge through the context window at inference time. It is the right approach when you need the model to have access to specific, up-to-date, or private factual information. The knowledge is as current as your index, which you control. Retrieval is auditable because you can show which chunks were retrieved. Adding new information requires only indexing the new documents, not retraining.
The practical guidance for most teams is to start with RAG. It is faster to implement, requires no training infrastructure, produces auditable results, and handles the knowledge injection problem better than fine-tuning. Fine-tuning becomes relevant when RAG is working but the model’s behavior or output style needs to be adjusted.
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Knowledge currency | As current as the index | Frozen at training time |
| Implementation speed | Fast | Slow, requires training |
| Auditability | High, sources visible | Low, knowledge in weights |
| Best for | Factual knowledge injection | Style and behavior adjustment |
| Cost | Inference cost + retrieval | Training cost + inference cost |
| Knowledge updates | Re-index documents | Retrain the model |
| Hallucination risk | Lower with good retrieval | Remains significant |
Advanced RAG Techniques
Contextual Retrieval
Anthropic published a technique called contextual retrieval that prepends a short chunk-specific context to each chunk before embedding it. This context explains where the chunk sits within the larger document, which helps the embedding model produce more informative vectors for chunks that are ambiguous or rely on surrounding context for meaning.
```python
from anthropic import Anthropic

client = Anthropic()

# Prompt template; the placeholders are filled in per chunk
CONTEXT_PROMPT = """Here is a document:
<document>
{document_text}
</document>
Here is a chunk from that document:
<chunk>
{chunk_text}
</chunk>
Provide a short 2-3 sentence context that situates this chunk
within the broader document. Focus on what makes this chunk
retrievable for relevant queries. Reply with only the context,
no preamble.
"""


def add_context_to_chunk(document_text, chunk_text):
    """Return the chunk with a generated situating context prepended."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(
                document_text=document_text,
                chunk_text=chunk_text,
            ),
        }],
    )
    context = response.content[0].text
    # Embed the context together with the chunk text
    return f"{context}\n\n{chunk_text}"
```
Parent Document Retrieval
Parent document retrieval indexes small child chunks for precise retrieval but returns the larger parent chunk as context to the language model. The small chunk gives precise retrieval. The large parent gives the model sufficient surrounding context to generate a good answer.
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small chunks for retrieval precision
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Larger chunks as context for the LLM
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)

# Holds the parent chunks; in-memory, so it is lost when the process exits
store = InMemoryStore()

# Note: in practice, point this at a dedicated (empty) collection so the
# child chunks are not mixed with chunks indexed earlier.
parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

parent_retriever.add_documents(documents)
results = parent_retriever.invoke(query)
```
Self-Query Retrieval
Self-query retrieval uses a language model to parse a natural language query into both a semantic search query and a structured metadata filter. A question like “what did the engineering team publish about Kubernetes in the last six months” is automatically decomposed into a semantic query about Kubernetes and a metadata filter requiring the department to equal engineering and the creation date to fall within the last six months.
```python
# Requires the lark package, which the query constructor uses for parsing
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

# Describe the metadata fields the LLM is allowed to filter on
metadata_field_info = [
    AttributeInfo(
        name="department",
        description="The department that produced the document",
        type="string",
    ),
    AttributeInfo(
        name="created_date",
        description="Date the document was created, in YYYY-MM-DD format",
        type="string",
    ),
    AttributeInfo(
        name="content_type",
        description="Type of content: policy, guide, faq, or announcement",
        type="string",
    ),
]

self_query_retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents="Company internal documents and policies",
    metadata_field_info=metadata_field_info,
    verbose=True,
)

results = self_query_retriever.invoke(
    "What engineering policies were updated this year?"
)
```
RAG Pipeline Cheat Sheet
| Component | What It Does | Key Decision |
|---|---|---|
| Document loader | Extracts text from source files | Handle format-specific quirks cleanly |
| Chunker | Splits documents into indexable pieces | Chunk size and overlap strategy |
| Embedding model | Converts chunks to vectors | Model quality vs cost tradeoff |
| Vector store | Stores and searches vectors | Scale, filtering, and hosting needs |
| Retriever | Finds relevant chunks at query time | k, MMR, hybrid, or reranking |
| Query transformer | Rewrites query for better retrieval | HyDE or multi-query for complex queries |
| Reranker | Reorders retrieved results by accuracy | Cross-encoder for quality-critical use cases |
| Prompt template | Structures context and question for LLM | Instruction clarity and source attribution |
| Language model | Generates response from context | Model capability vs cost |
| Output parser | Extracts structured output if needed | Depends on application requirements |
Common RAG Failure Modes
Retrieval missing relevant chunks is the most common problem and usually originates in chunking or embedding decisions. Chunks that are too large dilute the embedding signal. Chunks that cut important context at boundaries miss the relevant content. Embedding models that are not well-suited to the content type produce vectors that do not cluster semantically related content correctly. Diagnosis: manually inspect retrieved chunks for representative queries. If the right chunk is not in the top k results, the problem is retrieval, not generation.
Retrieved chunks lacking context causes the language model to generate incomplete or incorrect answers even when the right chunk is retrieved. A chunk that says “the exception to this rule applies when the account is flagged” is not useful without knowing what rule is being discussed. Parent document retrieval and contextual retrieval directly address this failure mode.
Answer ignoring retrieved context happens when the language model uses its parametric knowledge instead of the provided context. This is often a prompt issue. Being more explicit in the prompt that the model must answer from the context and must acknowledge uncertainty when the context is insufficient reduces this failure.
Retrieval of irrelevant chunks pollutes the context window with noise that confuses the language model. This is often a threshold or k-value issue. Retrieving too many chunks includes ones that are only weakly related. Adding relevance score thresholds to filter out low-scoring results, or reducing k, often helps.
Latency too high for production is an operational issue. Embedding the query, performing vector search, optionally reranking, and calling the language model are all sequential steps that add latency. Caching embeddings for common queries, using faster embedding models, reducing k, and choosing a vector database optimized for low-latency search are the primary levers.
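For the irrelevant-chunk failure mode described above, a relevance threshold can be applied to scored results before they reach the prompt. The sketch below assumes results arrive as `(chunk, score)` pairs with higher scores meaning more relevant, as in the scored search shown earlier; the function name, the `0.75` cutoff, and the sample scores are illustrative assumptions that would need tuning against real queries.

```python
def filter_by_relevance(scored_chunks, min_score=0.75, max_chunks=5):
    """Keep at most max_chunks results whose score meets the threshold."""
    kept = [(chunk, score) for chunk, score in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # Best matches first
    return kept[:max_chunks]


# Made-up scores for illustration
candidates = [
    ("chunk A", 0.91),
    ("chunk B", 0.62),
    ("chunk C", 0.80),
    ("chunk D", 0.41),
]

context_chunks = filter_by_relevance(candidates, min_score=0.75, max_chunks=5)
print(context_chunks)  # [('chunk A', 0.91), ('chunk C', 0.8)]
```

If nothing clears the threshold, the function returns an empty list, which is a useful signal: the application can tell the user that no relevant material was found instead of sending weak context to the model.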
FAQs
What is RAG in simple terms?
RAG stands for retrieval-augmented generation. It is a technique for giving a language model access to specific knowledge by retrieving relevant documents at query time and including them in the prompt. Instead of relying on the model’s training data alone, a RAG system searches a knowledge base for information relevant to the user’s question and provides that information as context for the model to reason over. This allows language models to answer questions about private, proprietary, or recent information they were never trained on.
What is the difference between RAG and fine-tuning?
Fine-tuning modifies a model’s weights by training it on domain-specific data. It is best for teaching a model a specific style, format, or reasoning approach. RAG keeps the model’s weights unchanged and provides knowledge through retrieved context at inference time. It is best for giving a model access to specific factual knowledge that needs to be current, auditable, or frequently updated. Most production AI applications use RAG for knowledge injection and consider fine-tuning only when the model’s behavior or output style needs adjustment beyond what prompting achieves.
What is chunking in RAG and why does it matter?
Chunking is the process of splitting source documents into smaller pieces before embedding and indexing them. It matters because embedding an entire document as a single vector loses the precision needed to retrieve specific relevant passages. Smaller chunks produce more precise embeddings and more accurate retrieval. The key tradeoffs are chunk size (smaller is more precise but loses surrounding context), chunk overlap (overlap reduces the chance of cutting relevant content at boundaries), and chunking strategy (fixed-size vs semantic vs structure-aware).
How do you evaluate a RAG pipeline?
RAG evaluation measures two separate things: retrieval quality and generation quality. Retrieval quality measures whether the relevant chunks were retrieved. Metrics include recall at k (what share of the relevant chunks appeared in the top k results), precision at k (what fraction of the top k results were relevant), and mean reciprocal rank (how high the first relevant result ranks, averaged over queries). Generation quality measures whether the language model produced a correct, grounded answer from the retrieved context. Metrics include faithfulness (is every claim in the answer supported by the context), answer relevance (does the answer address the question), and context relevance (how relevant were the retrieved chunks to the question). Frameworks like RAGAS automate many of these evaluations.
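The retrieval metrics are simple enough to compute by hand once you have, for each test query, the ranked chunk IDs that were retrieved and the set of IDs a human judged relevant. The sketch below is a minimal version under that assumption; the function names and the sample IDs are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Two made-up evaluation queries
run_1 = (["c7", "c2", "c9", "c4", "c1"], {"c2", "c4", "c8"})
run_2 = (["c3", "c5"], {"c3"})

print(recall_at_k(*run_1, k=5))     # 2 of 3 relevant chunks found
print(precision_at_k(*run_1, k=5))  # 2 of 5 results relevant
print(mean_reciprocal_rank([run_1, run_2]))
```

A small hand-labeled set of queries evaluated this way is often enough to compare chunking strategies or embedding models before investing in automated generation-quality metrics.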
What vector database should I use for a RAG application?
For prototyping and small-scale applications, Chroma is the easiest to get started with and requires minimal configuration. For production applications, the right choice depends on your scale and deployment requirements. Pinecone is fully managed and requires no infrastructure work. Qdrant and Weaviate are open source with strong performance and self-hosting options. pgvector is the lowest-friction option for teams already using PostgreSQL who need moderate-scale vector search without adopting a new database system. Start with Chroma to build and test your pipeline and migrate to a production-grade option when you have a clear picture of your scale and operational requirements.