What Is RAG?
Retrieval-Augmented Generation (RAG) solves a fundamental LLM limitation: models only know what they were trained on. If you ask gpt-4o about your internal documentation, last week’s earnings call, or a niche technical topic, it will hallucinate or say it doesn’t know.
RAG fixes this by retrieving relevant context at query time and injecting it into the prompt. The pipeline works like this:
- Ingest — split documents into chunks, embed them, store in a vector DB
- Query — embed the user’s question, find similar chunks, stuff them into the prompt
- Generate — the LLM answers using retrieved context, not memory
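The three stages above can be sketched in plain Python. This is a toy: word overlap stands in for real embedding similarity, and all function names here are illustrative — the actual pipeline below uses OpenAI embeddings and Pinecone.

```python
def ingest(documents, chunk_size=20):
    """Split each document into fixed-size word chunks (the 'store')."""
    store = []
    for doc in documents:
        words = doc.split()
        for i in range(0, len(words), chunk_size):
            store.append(" ".join(words[i:i + chunk_size]))
    return store

def retrieve(store, question, k=2):
    """Score chunks by words shared with the question; return the top k.
    A real system compares embedding vectors instead of word sets."""
    q_words = set(question.lower().split())
    scored = sorted(store,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(context_chunks, question):
    """Stuff retrieved chunks into the prompt the LLM will see."""
    context = "\n\n".join(context_chunks)
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {question}"

store = ingest(["Refunds are issued within 30 days of purchase. Contact support."])
prompt = build_prompt(retrieve(store, "What is the refund policy?"),
                      "What is the refund policy?")
```

The LLM never sees the whole knowledge base — only the handful of chunks the retriever ranked highest for this particular question.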
Pinecone is a popular managed vector database. It scales to billions of vectors with millisecond query latency, so you don’t need to run your own search infrastructure.
Prerequisites
- Completed Introduction to LangChain
- Pinecone account (free tier available at pinecone.io)
- OpenAI API key
Disclosure: This article contains affiliate links. We may earn a commission at no extra cost to you.
Installation
pip install langchain langchain-openai langchain-pinecone pinecone python-dotenv
Add to your .env:
OPENAI_API_KEY=sk-your-openai-key
PINECONE_API_KEY=your-pinecone-key
PINECONE_INDEX_NAME=agentscookbook-demo
Step 1: Create a Pinecone Index
import os

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# text-embedding-3-small produces 1536-dim vectors
if "agentscookbook-demo" not in pc.list_indexes().names():
    pc.create_index(
        name="agentscookbook-demo",
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
Run this once. After that, the index persists in Pinecone’s cloud — no local state needed.
Step 2: Ingest Documents
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
load_dotenv()
# Load your documents
loader = TextLoader("docs/knowledge-base.txt")
documents = loader.load()
# Split into chunks (~500 tokens each, 50-token overlap)
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
# Embed and upload to Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = PineconeVectorStore.from_documents(
    chunks,
    embeddings,
    index_name=os.environ["PINECONE_INDEX_NAME"],
)
print(f"Uploaded {len(chunks)} chunks to Pinecone")
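The chunk_size and chunk_overlap parameters control how the splitter slices text. As a rough illustration of what overlap buys you — a naive character-level sliding window, not RecursiveCharacterTextSplitter’s actual recursive separator logic:

```python
def sliding_chunks(text, chunk_size=500, overlap=50):
    """Naive character-window splitter: each chunk starts
    (chunk_size - overlap) characters after the previous one,
    so consecutive chunks share `overlap` characters of context."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

chunks = sliding_chunks("a" * 1000, chunk_size=500, overlap=50)
```

The shared characters mean a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is why some overlap almost always improves retrieval over hard cuts.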
Step 3: Build the RAG Chain
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# Connect to the existing index
vectorstore = PineconeVectorStore(
    index_name=os.environ["PINECONE_INDEX_NAME"],
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# RAG prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful assistant. Answer the question using ONLY
the context provided below. If the answer isn't in the context, say so.

Context:
{context}"""),
    ("human", "{question}"),
])
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
# Build the RAG chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
# Query it
answer = rag_chain.invoke("What is our refund policy?")
print(answer)
Evaluating RAG Quality
Three metrics matter in production:
| Metric | What it measures | Target |
|---|---|---|
| Faithfulness | Does the answer match the context? | > 0.9 |
| Answer relevancy | Is the answer relevant to the question? | > 0.8 |
| Context recall | Were the right chunks retrieved? | > 0.7 |
Use LangSmith to trace every retrieval and generation step automatically.
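Context recall, for example, can be approximated directly when you have a labeled eval set — for each question, the IDs of the chunks a correct answer needs. This is a hand-rolled sketch with hypothetical chunk IDs; libraries such as Ragas compute all three metrics, using LLM judges where ground truth is unavailable.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Fraction of ground-truth relevant chunks that made it into
    the retrieved set. 1.0 means retrieval missed nothing."""
    if not relevant_ids:
        return 1.0
    hits = len(set(retrieved_ids) & set(relevant_ids))
    return hits / len(set(relevant_ids))

# One eval example: the retriever returned 4 chunks,
# and 2 of the 3 relevant ones were among them
score = context_recall(["c1", "c7", "c2", "c9"], ["c1", "c2", "c5"])
```

Low recall points at the retrieval side (chunking, embedding model, k), whereas low faithfulness with good recall points at the prompt or the generation model.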
Frequently Asked Questions
How many documents can Pinecone handle?
Pinecone’s free Starter plan supports 100,000 vectors. The paid tiers scale to billions. For a typical knowledge base of 1,000 documents (~50 chunks each at 500 tokens per chunk), that’s roughly 50,000 vectors — comfortably within the free tier.
What chunk size should I use?
Start with 500 tokens / 50-token overlap — it works well for most text. Increase chunk size for longer documents with dense technical content; decrease for short Q&A pairs. Always measure retrieval quality on a representative sample before tuning.
Can I use a local embedding model instead of OpenAI?
Yes. Replace OpenAIEmbeddings with HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5") from the langchain-huggingface package. You’ll need pip install langchain-huggingface sentence-transformers. The embedding dimension changes (384 for bge-small), so create a new Pinecone index with dimension=384.
Next Steps
- LangChain vs LlamaIndex — Compare the two leading RAG frameworks
- What Is AutoGPT — Explore autonomous agents that can run RAG pipelines themselves