What Is RAG?
Retrieval-Augmented Generation (RAG) solves a fundamental LLM limitation: models only know what they were trained on. If you ask gpt-4o about your internal documentation, last week’s earnings call, or a niche technical topic, it will hallucinate or say it doesn’t know.
RAG fixes this by retrieving relevant context at query time and injecting it into the prompt. The pipeline works like this:
- Ingest — split documents into chunks, embed them, store in a vector DB
- Query — embed the user’s question, find similar chunks, stuff them into the prompt
- Generate — the LLM answers using retrieved context, not memory
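The three stages above can be sketched in plain Python. This is a toy: word overlap stands in for real embedding similarity, and all function names here are illustrative — the actual pipeline below uses OpenAI embeddings and Pinecone.

```python
def ingest(documents, chunk_size=20):
    """Split each document into fixed-size word chunks (the 'store')."""
    store = []
    for doc in documents:
        words = doc.split()
        for i in range(0, len(words), chunk_size):
            store.append(" ".join(words[i:i + chunk_size]))
    return store

def retrieve(store, question, k=2):
    """Score chunks by words shared with the question; return the top k.
    A real system compares embedding vectors instead of word sets."""
    q_words = set(question.lower().split())
    scored = sorted(store,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(context_chunks, question):
    """Stuff retrieved chunks into the prompt the LLM will see."""
    context = "\n\n".join(context_chunks)
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {question}"

store = ingest(["Refunds are issued within 30 days of purchase. Contact support."])
prompt = build_prompt(retrieve(store, "What is the refund policy?"),
                      "What is the refund policy?")
```

The LLM never sees the whole knowledge base — only the handful of chunks the retriever ranked highest for this particular question.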
Pinecone is a popular managed vector database. It scales to billions of vectors with millisecond query latency, so you don’t need to run your own search infrastructure.
Prerequisites
- Completed Introduction to LangChain
- Pinecone account (free tier available at pinecone.io)
- OpenAI API key
Disclosure: This article contains affiliate links. We may earn a commission at no extra cost to you.
Installation
pip install langchain langchain-openai langchain-pinecone pinecone python-dotenv
Add to your .env:
OPENAI_API_KEY=sk-your-openai-key
PINECONE_API_KEY=your-pinecone-key
PINECONE_INDEX_NAME=agentscookbook-demo
Step 1: Create a Pinecone Index
import os

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# text-embedding-3-small produces 1536-dim vectors
if "agentscookbook-demo" not in pc.list_indexes().names():
    pc.create_index(
        name="agentscookbook-demo",
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
Run this once. After that, the index persists in Pinecone’s cloud — no local state needed.
Step 2: Ingest Documents
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
load_dotenv()
# Load your documents
loader = TextLoader("docs/knowledge-base.txt")
documents = loader.load()
# Split into chunks (~500 tokens each, 50-token overlap)
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
# Embed and upload to Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = PineconeVectorStore.from_documents(
    chunks,
    embeddings,
    index_name=os.environ["PINECONE_INDEX_NAME"],
)
print(f"Uploaded {len(chunks)} chunks to Pinecone")
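The chunk_size and chunk_overlap parameters control how the splitter slices text. As a rough illustration of what overlap buys you — a naive character-level sliding window, not RecursiveCharacterTextSplitter’s actual recursive separator logic:

```python
def sliding_chunks(text, chunk_size=500, overlap=50):
    """Naive character-window splitter: each chunk starts
    (chunk_size - overlap) characters after the previous one,
    so consecutive chunks share `overlap` characters of context."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

chunks = sliding_chunks("a" * 1000, chunk_size=500, overlap=50)
```

The shared characters mean a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is why some overlap almost always improves retrieval over hard cuts.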
Step 3: Build the RAG Chain
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# Connect to the existing index
vectorstore = PineconeVectorStore(
    index_name=os.environ["PINECONE_INDEX_NAME"],
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# RAG prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful assistant. Answer the question using ONLY
the context provided below. If the answer isn't in the context, say so.

Context:
{context}"""),
    ("human", "{question}"),
])
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
# Build the RAG chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
# Query it
answer = rag_chain.invoke("What is our refund policy?")
print(answer)
Evaluating RAG Quality
Three metrics matter in production:
| Metric | What it measures | Target |
|---|---|---|
| Faithfulness | Does the answer match the context? | > 0.9 |
| Answer relevancy | Is the answer relevant to the question? | > 0.8 |
| Context recall | Were the right chunks retrieved? | > 0.7 |
Use LangSmith to trace every retrieval and generation step automatically.
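Context recall, for example, can be approximated directly when you have a labeled eval set — for each question, the IDs of the chunks a correct answer needs. This is a hand-rolled sketch with hypothetical chunk IDs; libraries such as Ragas compute all three metrics, using LLM judges where ground truth is unavailable.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Fraction of ground-truth relevant chunks that made it into
    the retrieved set. 1.0 means retrieval missed nothing."""
    if not relevant_ids:
        return 1.0
    hits = len(set(retrieved_ids) & set(relevant_ids))
    return hits / len(set(relevant_ids))

# One eval example: the retriever returned 4 chunks,
# and 2 of the 3 relevant ones were among them
score = context_recall(["c1", "c7", "c2", "c9"], ["c1", "c2", "c5"])
```

Low recall points at the retrieval side (chunking, embedding model, k), whereas low faithfulness with good recall points at the prompt or the generation model.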
Frequently Asked Questions
How many documents can Pinecone handle?
Pinecone’s free Starter plan supports 100,000 vectors. The paid tiers scale to billions. For a typical knowledge base of 1,000 documents (~50 chunks each at 500 tokens per chunk), that’s roughly 50,000 vectors — comfortably within the free tier.
What chunk size should I use?
Start with 500 tokens / 50-token overlap — it works well for most text. Increase chunk size for longer documents with dense technical content; decrease for short Q&A pairs. Always measure retrieval quality on a representative sample before tuning.
Can I use a local embedding model instead of OpenAI?
Yes. Replace OpenAIEmbeddings with HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5") from the langchain-huggingface package. You’ll need pip install langchain-huggingface sentence-transformers. The embedding dimension changes (384 for bge-small), so create a new Pinecone index with dimension=384.
Next Steps
- LangChain vs LlamaIndex — Compare the two leading RAG frameworks
- What Is AutoGPT — Explore autonomous agents that can run RAG pipelines themselves