Day 8 — How RAG Works: Build a Document Q&A Bot
By the end of this post you will have a working program that answers questions about any PDF. Not a half-working tutorial — actual code you can run, break, fix, and show in an interview.
Time required: 45 minutes
Cost: Less than ₹1 in API calls
What you need: Laptop with Python installed, OpenAI API key
Before You Start — Check These First
Open your terminal and run:
python --version
You should see Python 3.8 or higher.
If you see an error:
- Windows: Download from python.org → install → restart terminal
- Mac: Run brew install python3
- Linux: Run sudo apt install python3 python3-pip
What We Are Building
A program where you:
- Give it any PDF (placement brochure, textbook, HR policy)
- Ask questions in plain English
- Get accurate answers from the actual document
Products such as Perplexity, ChatGPT file upload, and Notion AI are built around the same retrieve-then-answer idea.
What is RAG? (5 minute explanation)
The problem: A model like GPT-4o-mini was trained on public data up to a cutoff date. It has no idea what is in your college placement brochure.
If you ask: "What is the CGPA cutoff for TCS at my college?" It will either say it does not know, or make something up.
The solution: Find the relevant parts of the document yourself, then include them in the question.
WITHOUT RAG:
You → "What is the TCS cutoff?" → AI → "I don't know" ❌
WITH RAG:
You → "What is the TCS cutoff?"
System finds: "TCS requires minimum 6.5 CGPA, no backlogs"
System asks AI: "Based on this text: [TCS requires 6.5 CGPA...],
what is the TCS cutoff?"
AI → "TCS requires a minimum CGPA of 6.5 with no active backlogs" ✅
RAG = Retrieval Augmented Generation
- Retrieve the relevant document sections
- Augment the prompt with those sections
- Generate the answer
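The three steps can be sketched in a few lines of plain Python. This is a toy illustration, not the program built later in this post: retrieve() ranks sections by shared words instead of embeddings, the SECTIONS list is made-up sample text, and the generate step is left out so nothing here calls an API.

```python
import re

# Made-up sample sections standing in for a real document (assumption).
SECTIONS = [
    "TCS requires minimum 6.5 CGPA, no backlogs",
    "Hostel mess timings: 7 am to 9 pm",
    "Infosys interviews happen in Seminar Hall 2",
]


def words(text: str) -> set:
    """Lowercase a string and return its set of words."""
    return set(re.findall(r"[a-z0-9.]+", text.lower()))


def retrieve(question: str, sections: list, n: int = 1) -> list:
    """Retrieve: rank sections by how many words they share with the question."""
    ranked = sorted(
        sections,
        key=lambda s: len(words(s) & words(question)),
        reverse=True,
    )
    return ranked[:n]


def augment(question: str, context: list) -> str:
    """Augment: put the retrieved text in front of the question."""
    joined = "\n".join(context)
    return f"Based on this text:\n{joined}\n\nQuestion: {question}"


question = "What is the TCS cutoff?"
prompt = augment(question, retrieve(question, SECTIONS))
print(prompt)  # Generate: this prompt is what would be sent to the model
```

Word overlap breaks down as soon as the question uses different words from the document, which is the gap embeddings are meant to close.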
How It Works in 3 Steps
Step 1 — Chunk and embed the document
The PDF gets split into small pieces (chunks) of ~400 words each. Each chunk gets converted into a vector — a list of 1536 numbers that captures the meaning of that text.
"TCS requires 60% marks"
→ [0.23, -0.87, 0.45, 0.12, ...] (1536 numbers)
"minimum percentage for TCS"
→ [0.21, -0.85, 0.47, 0.14, ...] (very similar numbers!)
"today's weather in Hyderabad"
→ [0.91, 0.34, -0.23, 0.67, ...] (very different numbers)
Similar meaning = similar numbers. This is called a vector embedding.
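To see what "similar numbers" means in practice, here is a small sketch that compares the three example vectors using cosine similarity. It uses only the first four numbers shown above, which are illustrative values and not real embeddings, so the scores only demonstrate the idea.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Only the first four numbers of each illustrative vector above.
tcs_marks = [0.23, -0.87, 0.45, 0.12]       # "TCS requires 60% marks"
tcs_percentage = [0.21, -0.85, 0.47, 0.14]  # "minimum percentage for TCS"
weather = [0.91, 0.34, -0.23, 0.67]         # "today's weather in Hyderabad"

close = cosine_similarity(tcs_marks, tcs_percentage)
far = cosine_similarity(tcs_marks, weather)
print(f"similar meaning:   {close:.3f}")
print(f"different meaning: {far:.3f}")
```

The two TCS vectors score close to 1, while the weather vector scores far lower.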
Step 2 — Search for relevant chunks
When you ask a question, it also gets converted to a vector. ChromaDB finds the 3 document chunks with the most similar vectors.
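A vector database does this search efficiently over a large number of vectors. For a handful of vectors, the same idea is just a sort by similarity. The sketch below assumes made-up 3-number vectors, and top_k() is a hypothetical helper written for illustration, not a ChromaDB function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query, chunks, k=3):
    """Return the k (text, score) pairs whose vectors are closest to the query."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors: real embeddings have 1536 numbers, not 3.
chunks = {
    "TCS eligibility": [0.9, 0.1, 0.0],
    "TCS salary": [0.7, 0.6, 0.1],
    "Infosys eligibility": [0.6, 0.1, 0.7],
    "Canteen menu": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]  # pretend this is the embedded question

for text, score in top_k(query, chunks):
    print(f"{score:.2f}  {text}")
```

The canteen chunk is left out because three other chunks point in a direction closer to the query.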
Step 3 — Generate with context
Those 3 chunks + your question go to GPT-4o-mini:
"Based on this context: [chunk 1] [chunk 2] [chunk 3]
Answer: What is the TCS cutoff?"
The AI answers from the actual document instead of guessing.
Project Setup — Step by Step
Step 1: Create the Project Folder
mkdir rag-document-qa
cd rag-document-qa
Step 2: Create a Virtual Environment
python -m venv venv
Activate it:
# Windows
venv\Scripts\activate
# Mac or Linux
source venv/bin/activate
You should see (venv) at the start of your terminal line. This means it is active.
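If you are not sure whether the environment is really active, Python can tell you. A small sketch, assuming the environment was created with python -m venv as above; the file name check_venv.py is only a suggestion.

```python
import sys


def in_virtualenv() -> bool:
    """True when Python is running from a virtual environment."""
    # Inside a venv, sys.prefix points at the venv folder while
    # sys.base_prefix still points at the system installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
print("Python is running from:", sys.prefix)
```

Save it as check_venv.py and run python check_venv.py. If it prints False, activate the environment again before installing anything.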
Step 3: Install Dependencies
pip install openai chromadb pypdf python-dotenv
Wait for installation to complete. You will see:
Successfully installed openai-x.x chromadb-x.x pypdf-x.x python-dotenv-x.x
Step 4: Get Your OpenAI API Key
- Go to platform.openai.com
- Sign up or log in
- Click your name → API Keys → Create new secret key
- Copy the key. It starts with sk-
Cost for this project: About ₹0.50 for 100 questions.
Step 5: Create Your Files
Your final project structure will be:
rag-document-qa/
├── .env ← your API key (never share or commit this)
├── .gitignore ← tells git to ignore .env and venv
├── requirements.txt ← list of dependencies
├── rag_bot.py ← the main program
└── test.pdf ← a PDF to test with
Create .env:
On Mac/Linux:
touch .env
On Windows:
type nul > .env
Open .env in any text editor and add:
OPENAI_API_KEY=sk-your-actual-key-here
Create .gitignore:
# Mac/Linux
touch .gitignore
# Windows
type nul > .gitignore
Add this content to .gitignore:
.env
venv/
__pycache__/
*.pyc
chroma_db/
Create requirements.txt:
# Mac/Linux
touch requirements.txt
# Windows
type nul > requirements.txt
Add this content:
openai>=1.0.0
chromadb>=0.4.0
pypdf>=3.0.0
python-dotenv>=1.0.0
The Main Program
Create rag_bot.py and paste the complete code below:
# rag_bot.py
# RAG Document Q&A System
# Day 8 — AI Survival Kit for Engineers
import os
import sys
from openai import OpenAI
import chromadb
from pypdf import PdfReader
from dotenv import load_dotenv
# Load API key from .env file
load_dotenv()
# Check that API key exists
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    print("ERROR: OPENAI_API_KEY not found in .env file")
    print("Create a .env file and add: OPENAI_API_KEY=sk-your-key")
    sys.exit(1)
# Set up OpenAI
client = OpenAI(api_key=api_key)
# Set up ChromaDB (runs in memory, so it resets when the program restarts)
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)
# ── PART 1: Process the PDF ──────────────────────────────────────────────────
def read_pdf(path: str) -> str:
    """Extract all text from a PDF file"""
    if not os.path.exists(path):
        print(f"ERROR: File not found: {path}")
        sys.exit(1)
    print(f"Reading {path}...")
    reader = PdfReader(path)
    text = ""
    for i, page in enumerate(reader.pages):
        page_text = page.extract_text()
        if page_text:
            text += f"\n[Page {i+1}]\n{page_text}"
    if not text.strip():
        print("ERROR: No text found. Your PDF might be a scanned image.")
        sys.exit(1)
    print(f"Extracted text from {len(reader.pages)} pages")
    return text
def split_into_chunks(text: str, chunk_size: int = 400, overlap: int = 50) -> list:
    """
    Split text into overlapping chunks.
    chunk_size = words per chunk (400 is a good default)
    overlap = words shared between chunks (prevents losing info at edges)
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for i in range(0, len(words), step):
        chunk = " ".join(words[i : i + chunk_size])
        if len(chunk.strip()) > 50:  # skip very short pieces (under 50 characters)
            chunks.append(chunk)
    return chunks
def get_embedding(text: str) -> list:
    """Convert text into a vector (list of 1536 numbers)"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
def load_pdf(path: str):
    """Full pipeline: PDF → chunks → embeddings → ChromaDB"""
    text = read_pdf(path)
    chunks = split_into_chunks(text)
    print(f"Split into {len(chunks)} chunks")
    print("Creating embeddings (may take 30 seconds for large PDFs)...")
    for i, chunk in enumerate(chunks):
        if i % 10 == 0:
            print(f" chunk {i+1}/{len(chunks)}...")
        collection.add(
            documents=[chunk],
            embeddings=[get_embedding(chunk)],
            ids=[f"chunk_{i}"],
            metadatas=[{"index": i}]
        )
    print(f"Done! Loaded {len(chunks)} chunks.\n")
# ── PART 2: Answer Questions ─────────────────────────────────────────────────
def find_chunks(question: str, n: int = 3) -> list:
    """Find the n most relevant chunks for a question"""
    results = collection.query(
        query_embeddings=[get_embedding(question)],
        n_results=n,
        include=["documents", "distances"]
    )
    chunks = []
    for i in range(len(results["documents"][0])):
        chunks.append({
            "text": results["documents"][0][i],
            # cosine distance = 1 - cosine similarity, so flip it back
            "similarity": round(1 - results["distances"][0][i], 2)
        })
    return chunks
def answer(question: str) -> str:
    """RAG pipeline: question → find chunks → prompt → answer"""
    chunks = find_chunks(question, n=3)
    context = "\n\n---\n\n".join([c["text"] for c in chunks])
    # Show similarity scores so student can see what was retrieved
    print(f"\nFound {len(chunks)} relevant sections:")
    for i, c in enumerate(chunks):
        print(f" Section {i+1}: similarity {c['similarity']}")
    prompt = f"""Answer the question using ONLY the context below.
If the answer is not in the context, say: "I could not find this in the document."
Do not make up any information.
Context:
{context}
Question: {question}
Answer:"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=500
    )
    return response.choices[0].message.content
# ── MAIN ─────────────────────────────────────────────────────────────────────
def main():
    print("=" * 50)
    print("RAG Document Q&A Bot")
    print("Day 8 — AI Survival Kit")
    print("=" * 50)
    print()
    path = input("PDF file path: ").strip().strip('"').strip("'")
    load_pdf(path)
    print("Type your questions. Type 'quit' to exit.")
    print("-" * 50)
    while True:
        q = input("\nQuestion: ").strip()
        if not q:
            continue
        if q.lower() in ["quit", "exit", "q"]:
            break
        print("\n" + answer(q))


if __name__ == "__main__":
    main()
Get a Test PDF
Option 1: Use your placement brochure
Download it from your college website and save it in the project folder.
Option 2: Create one yourself
Create a file called placement_info.txt with this content:
TCS Placement Information 2025
Eligibility Criteria:
Minimum 60% marks in 10th, 12th and graduation.
No active backlogs at the time of selection.
2023, 2024, 2025 and 2026 batch students are eligible.
B.Tech, B.E, M.Tech and MCA graduates are eligible.
Salary:
TCS Digital: 7 LPA
TCS Smart: 3.36 LPA
Selection Process:
Step 1: Online test (90 minutes) covering aptitude, verbal and reasoning.
Step 2: Technical interview.
Step 3: HR interview.
Infosys Placement Information 2025
Eligibility:
Minimum 65% aggregate marks.
2024 and 2025 batch graduates preferred.
No more than 2 years gap in education.
Salary: 3.6 LPA for Systems Engineer role.
Selection Process:
Step 1: Online assessment.
Step 2: Technical round.
Step 3: HR round.
Go to smallpdf.com → click Word to PDF → paste the text → download as placement_info.pdf → put it in your project folder.
Run the Program
Make sure your virtual environment is active (you see (venv) in terminal):
python rag_bot.py
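Expected output (chunk counts and similarity scores depend on your PDF; the short sample text above is a little over 100 words, so it fits in a single 400-word chunk and you will see 1 chunk rather than 5):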
==================================================
RAG Document Q&A Bot
Day 8 — AI Survival Kit
==================================================
PDF file path: placement_info.pdf
Reading placement_info.pdf...
Extracted text from 1 pages
Split into 5 chunks
Creating embeddings (may take 30 seconds for large PDFs)...
chunk 1/5...
Done! Loaded 5 chunks.
Type your questions. Type 'quit' to exit.
--------------------------------------------------
Question: What is the eligibility for TCS?
Found 3 relevant sections:
Section 1: similarity 0.87
Section 2: similarity 0.72
Section 3: similarity 0.58
To be eligible for TCS placement you need:
- Minimum 60% marks in 10th, 12th and graduation
- No active backlogs at time of selection
- Be from the 2023, 2024, 2025 or 2026 batch
- Be a B.Tech, B.E, M.Tech or MCA graduate
Common Errors and Fixes
ModuleNotFoundError: No module named 'openai'
Your virtual environment is not active.
# Mac/Linux
source venv/bin/activate
# Windows
venv\Scripts\activate
# Then install again
pip install openai chromadb pypdf python-dotenv
AuthenticationError: Incorrect API key
Open .env and check:
- No spaces around =: write OPENAI_API_KEY=sk-abc123 not OPENAI_API_KEY = sk-abc123
- The key starts with sk-
- You copied the complete key
ERROR: No text found. Your PDF might be a scanned image.
Your PDF is probably a scanned image (a photo of a document), so it has no text layer to extract. Use a different PDF or use the test content above.
openai.RateLimitError
You hit the API rate limit. Wait 60 seconds and try again.
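Instead of waiting by hand, the call can be retried automatically with a growing delay. The sketch below shows one possible shape; with_retry() is a hypothetical helper, not part of the openai library, and the demo uses a fake function so it runs without an API key.

```python
import time


def with_retry(call, retries=3, base_delay=2.0, retry_on=(Exception,), sleep=time.sleep):
    """Run call(); on a retryable error wait base_delay, then 2x, 4x... and try again."""
    for attempt in range(retries + 1):
        try:
            return call()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * (2 ** attempt))


# Demo with a fake call that fails twice, then succeeds (no API involved).
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


waits = []
result = with_retry(flaky_call, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

In rag_bot.py this could wrap the embedding call, for example with_retry(lambda: get_embedding(chunk), retry_on=(openai.RateLimitError,)), which needs import openai at the top of the file.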
Break It — Learn by Experimenting
Do not skip this section. Breaking and fixing teaches you 10x more than reading.
Experiment 1 — Reduce Retrieved Chunks
Find this line in rag_bot.py:
chunks = find_chunks(question, n=3)
Change n=3 to n=1:
chunks = find_chunks(question, n=1)
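Ask the same question about TCS eligibility. Use a PDF long enough to produce several chunks; the short sample text fits in one chunk, so changing n makes no difference there.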
What changes? The answer is less complete. It may miss some criteria.
Why? With only 1 chunk retrieved, you give the AI less context to work with.
What you learned: More retrieved chunks = richer context = better answers. But too many = expensive and confusing. 3 to 5 is usually the right balance.
Change it back to n=3.
Experiment 2 — Change Chunk Size
Find this line:
chunks = split_into_chunks(text)
Change it to:
chunks = split_into_chunks(text, chunk_size=50, overlap=10)
Restart the program and ask your question.
What changes? More chunks, but each one is very short. Answers feel incomplete.
Why? 50-word chunks are too small to contain complete thoughts. The retrieved pieces do not give the AI enough to work with.
What you learned: Chunk size is one of the most important settings in RAG. Too small = fragmented. Too large = irrelevant information mixed in. 300 to 500 words is a reasonable starting range for documents like these.
Change it back to the default.
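To see the effect of these two settings without spending any API calls, the chunking logic can be run on its own. The sketch below copies split_into_chunks() from rag_bot.py and feeds it a synthetic 1000-word document (an assumption made purely for counting).

```python
def split_into_chunks(text, chunk_size=400, overlap=50):
    """Same logic as in rag_bot.py: fixed-size word windows that overlap."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(words), step):
        chunk = " ".join(words[i : i + chunk_size])
        if len(chunk.strip()) > 50:
            chunks.append(chunk)
    return chunks


# Synthetic 1000-word "document": w0 w1 w2 ... w999
text = " ".join(f"w{i}" for i in range(1000))

default_chunks = split_into_chunks(text)
small_chunks = split_into_chunks(text, chunk_size=50, overlap=10)
print(len(default_chunks), "chunks at 400 words")
print(len(small_chunks), "chunks at 50 words")

# The last 50 words of one chunk are the first 50 words of the next.
tail = default_chunks[0].split()[-50:]
head = default_chunks[1].split()[:50]
print("overlap matches:", tail == head)
```

Each chunk costs one embedding call, so smaller chunks also mean more API requests for the same document.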
Experiment 3 — Ask Something Not in the Document
Ask: "What is the IIT JEE syllabus for 2025?"
What happens? The program says it could not find this in the document. It does NOT make something up.
Why? The prompt says: "If the answer is not in the context, say: I could not find this in the document. Do not make up any information."
What you learned: The instructions in the prompt strongly shape the AI's behaviour. Try removing that instruction and asking again. The AI becomes much more likely to make things up. That is hallucination.
Experiment 4 — See the Raw Embeddings
Add one print line inside get_embedding():
def get_embedding(text: str) -> list:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    embedding = response.data[0].embedding
    print(f" Embedding: {len(embedding)} numbers, first 3: {embedding[:3]}")  # ADD THIS
    return embedding
Run the program and load the PDF.
What do you see? Every chunk gets turned into 1536 numbers. The question becomes 1536 numbers too. ChromaDB compares them all to find the closest match.
What you learned: Embeddings are just lists of numbers. The intelligence is in how similar meanings produce similar numbers. This is the foundation of modern semantic search.
Remove the print line when done.
The Challenge
Goal: Show exactly which part of the document each answer came from.
After displaying the answer, print the first 150 characters of the most relevant chunk as the source.
Expected output:
Answer: TCS requires a minimum of 60% marks throughout...
Source: "TCS Placement Information 2025. Eligibility Criteria: Minimum 60% marks in 10th, 12th and..."
Hint: The chunks list inside the answer() function has chunk["text"] for each retrieved chunk. chunks[0] is the most relevant one.
Try it yourself first. If you get stuck, the solution is in the GitHub repository.
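If you want something to compare your attempt against, here is one possible shape for the helper. It is a sketch, not the repository's solution: format_source() is a hypothetical name, and sample_chunks is made-up data in the same format that find_chunks() returns.

```python
def format_source(chunks: list, limit: int = 150) -> str:
    """Build a 'Source: ...' line from the most relevant retrieved chunk."""
    if not chunks:
        return "Source: (no chunks retrieved)"
    text = " ".join(chunks[0]["text"].split())  # collapse newlines and extra spaces
    preview = text[:limit]
    if len(text) > limit:
        preview += "..."
    return f'Source: "{preview}"'


# Stand-in for the list that find_chunks() returns (made-up values).
sample_chunks = [
    {"text": "TCS Placement Information 2025\nEligibility Criteria: Minimum 60% marks", "similarity": 0.87},
    {"text": "Infosys Placement Information 2025", "similarity": 0.58},
]
print(format_source(sample_chunks))
print(format_source(sample_chunks, limit=13))
```

To use it, answer() would need to return the chunks along with the answer text, or print the source line itself before returning.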
Add a Web Interface (Optional)
Install Streamlit:
pip install streamlit
Create app.py:
import streamlit as st
import tempfile
from rag_bot import load_pdf, answer

st.set_page_config(page_title="Document Q&A", page_icon="📚")
st.title("📚 Document Q&A")
st.caption("Upload any PDF and ask questions about it")

uploaded = st.file_uploader("Upload PDF", type="pdf")
if uploaded:
    # Save to temp file
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(uploaded.getbuffer())
        tmp_path = tmp.name

    if st.button("Load Document"):
        with st.spinner("Processing..."):
            load_pdf(tmp_path)
        st.success("Ready! Ask your questions below.")

    question = st.text_input("Your question:")
    if question:
        with st.spinner("Thinking..."):
            result = answer(question)
        st.write("**Answer:**")
        st.write(result)
Run it:
streamlit run app.py
Opens in your browser at http://localhost:8501.
What to Push to GitHub
# Initialise git
git init
# Add everything except .env and venv (.gitignore excludes them)
git add .
git commit -m "Day 8: RAG document Q&A bot"
# Name the branch main (older git versions default to master)
git branch -M main
# Create repo on github.com and push
git remote add origin https://github.com/yourusername/rag-document-qa.git
git push -u origin main
Your repo should have:
rag-document-qa/
├── .gitignore ✅ (includes .env and venv)
├── README.md ← write 3 lines explaining what it does
├── requirements.txt ✅
├── rag_bot.py ✅
└── app.py ✅ (if you built the web UI)
NOT in GitHub:
.env ← never push this
venv/ ← never push this
What to Write in Your Portfolio
Project title: RAG Document Q&A System
Description (copy this): Built a Retrieval Augmented Generation system from scratch using Python. The system reads any PDF, splits it into 400-word chunks with 50-word overlap, embeds each chunk using OpenAI text-embedding-3-small, and stores vectors in ChromaDB. When a question is asked, it finds the 3 most semantically similar chunks using cosine similarity and feeds them as context to GPT-4o-mini. The result is document-grounded answers with far less hallucination than asking the model alone.
Tech stack: Python, OpenAI API, ChromaDB, pypdf, Streamlit
How to Explain This in an Interview
"I built a RAG system — Retrieval Augmented Generation. The user uploads any PDF and the system splits it into 400-word overlapping chunks. Each chunk gets converted to a 1536-dimensional vector using OpenAI embeddings and stored in ChromaDB. When a question comes in, it also gets embedded and ChromaDB uses cosine similarity to find the 3 most relevant chunks. Those chunks become the context in a prompt to GPT-4o-mini which generates an answer grounded in the actual document. I tested it with college placement brochures — it can answer specific eligibility questions accurately when the AI alone would hallucinate."
If asked follow-ups:
- "Why overlapping chunks?" → So information at chunk boundaries is not lost. A sentence spanning two chunks appears in both.
- "What is cosine similarity?" → It measures the angle between two vectors. Similar meaning = similar direction = high score.
- "Why ChromaDB and not a regular database?" → Regular databases match exact words. ChromaDB matches meaning, so similar questions find the right chunks even if the words differ.
Key Terms
| Term | Meaning |
|---|---|
| RAG | Retrieve relevant chunks → add to prompt → generate answer |
| Embedding | Text converted to a list of numbers capturing its meaning |
| Vector database | Stores embeddings, finds similar ones quickly |
| Semantic search | Finds similar meaning, not just matching keywords |
| Chunking | Splitting documents into smaller searchable pieces |
| Cosine similarity | Measures how closely two vectors point in the same direction (-1 to 1, higher = more similar) |
| Hallucination | AI making up information. RAG reduces this by grounding answers in the document |
Tomorrow: Day 9 — How AI Agents work. We will build an agent that searches job boards, filters results, ranks by your profile, and drafts outreach messages — all without you clicking anything.
Day 8 of 15 — AI Survival Kit for Engineers