Day 8 — How RAG Works: Build a Document Q&A Bot

By the end of this post you will have a working program that answers questions about any PDF. Not a half-working tutorial — actual code you can run, break, fix, and show in an interview.

Time required: 45 minutes
Cost: Less than ₹1 in API calls
What you need: Laptop with Python installed, OpenAI API key

Before You Start — Check These First

Open your terminal and run:

python --version

You should see Python 3.8 or higher.

If you see an error:

Windows: Download from python.org → install → restart terminal
Mac: Run brew install python3
Linux: Run sudo apt install python3 python3-pip

What We Are Building

A program where you:

Give it any PDF (placement brochure, textbook, HR policy)
Ask questions in plain English
Get accurate answers from the actual document

The same basic retrieve-then-generate idea underlies products like Perplexity, ChatGPT file upload, and Notion AI, which add many refinements on top.

What is RAG? (5 minute explanation)

The problem: A model like GPT-4o-mini was trained on public data up to a fixed cutoff date. It has never seen your college placement brochure.

If you ask "What is the CGPA cutoff for TCS at my college?", it will either say it does not know or make something up.

The solution: Find the relevant parts of the document yourself, then include them in the question.

WITHOUT RAG:
You → "What is the TCS cutoff?" → AI → "I don't know" ❌

WITH RAG:
You → "What is the TCS cutoff?"
System finds: "TCS requires minimum 6.5 CGPA, no backlogs"
System asks AI: "Based on this text: [TCS requires 6.5 CGPA...],
                what is the TCS cutoff?"
AI → "TCS requires a minimum CGPA of 6.5 with no active backlogs" ✅

RAG = Retrieval Augmented Generation

Retrieve the relevant document sections
Augment the prompt with those sections
Generate the answer

How It Works in 3 Steps

Step 1 — Chunk and embed the document

The PDF gets split into small pieces (chunks) of ~400 words each. Each chunk gets converted into a vector — a list of 1536 numbers that captures the meaning of that text.

"TCS requires 60% marks"
→ [0.23, -0.87, 0.45, 0.12, ...] (1536 numbers)

"minimum percentage for TCS"  
→ [0.21, -0.85, 0.47, 0.14, ...] (very similar numbers!)

"today's weather in Hyderabad"
→ [0.91, 0.34, -0.23, 0.67, ...] (very different numbers)

Similar meaning = similar numbers. This is called a vector embedding.
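
To make "similar numbers" concrete, here is a minimal sketch that compares the three example vectors above using cosine similarity. It uses only the first four numbers shown for each sentence (real embeddings have 1536), and cosine_similarity is an illustrative helper written for this post, not a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Only the first 4 of 1536 numbers, copied from the illustration above
tcs_marks   = [0.23, -0.87, 0.45, 0.12]   # "TCS requires 60% marks"
tcs_minimum = [0.21, -0.85, 0.47, 0.14]   # "minimum percentage for TCS"
weather     = [0.91, 0.34, -0.23, 0.67]   # "today's weather in Hyderabad"

print(cosine_similarity(tcs_marks, tcs_minimum))  # very close to 1
print(cosine_similarity(tcs_marks, weather))      # near 0
```

The two TCS sentences score almost 1, while the weather sentence scores near 0. That gap is what the search step relies on.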

Step 2 — Search for relevant chunks

When you ask a question, it also gets converted to a vector. ChromaDB finds the 3 document chunks with the most similar vectors.
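
Conceptually, the search step scores every stored chunk against the question vector and keeps the best few. The sketch below does that by brute force over a tiny in-memory list. ChromaDB uses an approximate index (HNSW) so it does not have to compare against every vector, so treat this as an illustration of the idea rather than of its internals. The 3-number vectors and the top_k helper are made up for the example.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, store, k=3):
    """Return the k (text, score) pairs most similar to query_vec, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy "embeddings" (real ones come from the embedding model)
store = [
    ("TCS requires 60% marks",       [0.9, 0.1, 0.0]),
    ("Infosys salary is 3.6 LPA",    [0.5, 0.8, 0.1]),
    ("Canteen timings are 9 to 5",   [0.0, 0.1, 0.9]),
    ("TCS needs no active backlogs", [0.8, 0.3, 0.1]),
]
question_vec = [1.0, 0.2, 0.0]   # pretend embedding of "What is the TCS cutoff?"

for text, score in top_k(question_vec, store, k=2):
    print(f"{score:.2f}  {text}")
```

With these toy vectors, the two TCS chunks come out on top and the canteen chunk is ranked last.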

Step 3 — Generate with context

Those 3 chunks + your question go to GPT-4o-mini:

"Based on this context: [chunk 1] [chunk 2] [chunk 3]
Answer: What is the TCS cutoff?"

The AI answers from the actual document, which greatly reduces (but does not fully eliminate) made-up answers.
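
Augmenting the prompt is plain string assembly. Here is a minimal sketch, assuming the chunks have already been retrieved. build_prompt is an illustrative helper name, and the wording follows the prompt used in rag_bot.py further down.

```python
def build_prompt(question, chunks):
    """Join retrieved chunks into one context block and wrap it around the question."""
    context = "\n\n---\n\n".join(chunks)
    return (
        "Answer the question using ONLY the context below.\n"
        'If the answer is not in the context, say: "I could not find this in the document."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )

prompt = build_prompt(
    "What is the TCS cutoff?",
    ["TCS requires minimum 6.5 CGPA, no backlogs", "Infosys requires 65% aggregate marks"],
)
print(prompt)
```

Printing the prompt while you experiment is a quick way to see exactly what the model was given.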

Project Setup — Step by Step

Step 1: Create the Project Folder

mkdir rag-document-qa
cd rag-document-qa

Step 2: Create a Virtual Environment

python -m venv venv

Activate it:

# Windows
venv\Scripts\activate

# Mac or Linux
source venv/bin/activate

You should see (venv) at the start of your terminal line. This means it is active.
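
If you are not sure whether the environment is really active, you can also ask Python itself. This is a small sketch, and in_virtualenv is just an illustrative name. It relies on the fact that inside a venv, sys.prefix points at the venv folder while sys.base_prefix points at the original installation.

```python
import sys

def in_virtualenv():
    """True when Python is running from a venv (its prefix differs from the base install)."""
    return sys.prefix != sys.base_prefix

print("Virtual environment active:", in_virtualenv())
print("Interpreter:", sys.executable)
```

Save it as a scratch file (for example check_env.py) and run it with python. The interpreter path should point inside your venv folder.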

Step 3: Install Dependencies

pip install openai chromadb pypdf python-dotenv

Wait for installation to complete. You will see:

Successfully installed openai-x.x chromadb-x.x pypdf-x.x python-dotenv-x.x
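
To double-check the install from Python, the sketch below asks whether each package can be imported. missing_modules is an illustrative helper. Note that the import names differ slightly from the pip names: python-dotenv is imported as dotenv.

```python
import importlib.util

def missing_modules(names):
    """Return the module names from `names` that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names for the four packages installed above
required = ["openai", "chromadb", "pypdf", "dotenv"]
missing = missing_modules(required)
print("All set!" if not missing else f"Missing: {', '.join(missing)}")
```

If anything is listed as missing, activate the virtual environment and run the pip install command again.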

Step 4: Get Your OpenAI API Key

Go to platform.openai.com
Sign up or log in
Click your name → API Keys → Create new secret key
Copy the key. It starts with sk-

Cost for this project: About ₹0.50 for 100 questions.

Step 5: Create Your Files

Your final project structure will be:

rag-document-qa/
├── .env              ← your API key (never share or commit this)
├── .gitignore        ← tells git to ignore .env and venv
├── requirements.txt  ← list of dependencies
├── rag_bot.py        ← the main program
└── test.pdf          ← a PDF to test with

Create .env:

On Mac/Linux:

touch .env

On Windows:

type nul > .env

Open .env in any text editor and add:

OPENAI_API_KEY=sk-your-actual-key-here

Create .gitignore:

# Mac/Linux
touch .gitignore

# Windows  
type nul > .gitignore

Add this content to .gitignore:

.env
venv/
__pycache__/
*.pyc
chroma_db/

Create requirements.txt:

# Mac/Linux
touch requirements.txt

# Windows
type nul > requirements.txt

Add this content:

openai>=1.0.0
chromadb>=0.4.0
pypdf>=3.0.0
python-dotenv>=1.0.0

The Main Program

Create rag_bot.py and paste the complete code below:

# rag_bot.py
# RAG Document Q&A System
# Day 8 — AI Survival Kit for Engineers

import os
import sys
from openai import OpenAI
import chromadb
from pypdf import PdfReader
from dotenv import load_dotenv

# Load API key from .env file
load_dotenv()

# Check that API key exists
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    print("ERROR: OPENAI_API_KEY not found in .env file")
    print("Create a .env file and add: OPENAI_API_KEY=sk-your-key")
    sys.exit(1)

# Set up OpenAI
client = OpenAI(api_key=api_key)

# Set up ChromaDB (runs in memory — resets when program restarts)
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)


# ── PART 1: Process the PDF ──────────────────────────────────────────────────

def read_pdf(path: str) -> str:
    """Extract all text from a PDF file"""

    if not os.path.exists(path):
        print(f"ERROR: File not found: {path}")
        sys.exit(1)

    print(f"Reading {path}...")
    reader = PdfReader(path)
    text = ""

    for i, page in enumerate(reader.pages):
        page_text = page.extract_text()
        if page_text:
            text += f"\n[Page {i+1}]\n{page_text}"

    if not text.strip():
        print("ERROR: No text found. Your PDF might be a scanned image.")
        sys.exit(1)

    print(f"Extracted text from {len(reader.pages)} pages")
    return text


def split_into_chunks(text: str, chunk_size: int = 400, overlap: int = 50) -> list:
    """
    Split text into overlapping chunks.

    chunk_size = words per chunk (400 is a good default)
    overlap    = words shared between chunks (prevents losing info at edges)
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")

    for i in range(0, len(words), step):
        chunk = " ".join(words[i : i + chunk_size])
        if len(chunk.strip()) > 50:   # skip very short pieces
            chunks.append(chunk)

    return chunks


def get_embedding(text: str) -> list:
    """Convert text into a vector (list of 1536 numbers)"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding


def load_pdf(path: str):
    """Full pipeline: PDF → chunks → embeddings → ChromaDB"""

    text   = read_pdf(path)
    chunks = split_into_chunks(text)
    print(f"Split into {len(chunks)} chunks")
    print("Creating embeddings (may take 30 seconds for large PDFs)...")

    for i, chunk in enumerate(chunks):
        if i % 10 == 0:
            print(f"  chunk {i+1}/{len(chunks)}...")

        collection.add(
            documents=[chunk],
            embeddings=[get_embedding(chunk)],
            ids=[f"chunk_{i}"],
            metadatas=[{"index": i}]
        )

    print(f"Done! Loaded {len(chunks)} chunks.\n")


# ── PART 2: Answer Questions ─────────────────────────────────────────────────

def find_chunks(question: str, n: int = 3) -> list:
    """Find the n most relevant chunks for a question"""

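    # Never request more results than there are stored chunks (a short PDF may yield only 1)
    results = collection.query(
        query_embeddings=[get_embedding(question)],
        n_results=min(n, collection.count()),
        include=["documents", "distances"]
    )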

    chunks = []
    for i in range(len(results["documents"][0])):
        chunks.append({
            "text":       results["documents"][0][i],
            "similarity": round(1 - results["distances"][0][i], 2)
        })
    return chunks


def answer(question: str) -> str:
    """RAG pipeline: question → find chunks → prompt → answer"""

    chunks  = find_chunks(question, n=3)
    context = "\n\n---\n\n".join([c["text"] for c in chunks])

    # Show similarity scores so student can see what was retrieved
    print(f"\nFound {len(chunks)} relevant sections:")
    for i, c in enumerate(chunks):
        print(f"  Section {i+1}: similarity {c['similarity']}")

    prompt = f"""Answer the question using ONLY the context below.
If the answer is not in the context, say: "I could not find this in the document."
Do not make up any information.

Context:
{context}

Question: {question}

Answer:"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=500
    )
    return response.choices[0].message.content


# ── MAIN ─────────────────────────────────────────────────────────────────────

def main():
    print("=" * 50)
    print("RAG Document Q&A Bot")
    print("Day 8 — AI Survival Kit")
    print("=" * 50)
    print()

    path = input("PDF file path: ").strip().strip('"').strip("'")
    load_pdf(path)

    print("Type your questions. Type 'quit' to exit.")
    print("-" * 50)

    while True:
        q = input("\nQuestion: ").strip()
        if not q:
            continue
        if q.lower() in ["quit", "exit", "q"]:
            break
        print("\n" + answer(q))


if __name__ == "__main__":
    main()

Get a Test PDF

Option 1 — Use your placement brochure

Download it from your college website and save it in the project folder.

Option 2 — Create one yourself

Create a file called placement_info.txt with this content:

TCS Placement Information 2025

Eligibility Criteria:
Minimum 60% marks in 10th, 12th and graduation.
No active backlogs at the time of selection.
2023, 2024, 2025 and 2026 batch students are eligible.
B.Tech, B.E, M.Tech and MCA graduates are eligible.

Salary:
TCS Digital: 7 LPA
TCS Smart: 3.36 LPA

Selection Process:
Step 1: Online test (90 minutes) covering aptitude, verbal and reasoning.
Step 2: Technical interview.
Step 3: HR interview.

Infosys Placement Information 2025

Eligibility:
Minimum 65% aggregate marks.
2024 and 2025 batch graduates preferred.
No more than 2 years gap in education.

Salary: 3.6 LPA for Systems Engineer role.

Selection Process:
Step 1: Online assessment.
Step 2: Technical round.
Step 3: HR round.

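Convert it to a PDF: open placement_info.txt in Word, Google Docs or LibreOffice and use Export / Save as PDF (or upload the file to an online converter such as smallpdf.com). Save the result as placement_info.pdf in your project folder.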

Run the Program

Make sure your virtual environment is active (you see (venv) in terminal):

python rag_bot.py

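Expected output (illustrative; your chunk count and similarity scores will differ. The short sample text above is only about 115 words, so with the default 400-word chunks it loads as a single chunk and only one section can be retrieved):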

==================================================
RAG Document Q&A Bot
Day 8 — AI Survival Kit
==================================================

PDF file path: placement_info.pdf
Reading placement_info.pdf...
Extracted text from 1 pages
Split into 5 chunks
Creating embeddings (may take 30 seconds for large PDFs)...
  chunk 1/5...
Done! Loaded 5 chunks.

Type your questions. Type 'quit' to exit.
--------------------------------------------------

Question: What is the eligibility for TCS?

Found 3 relevant sections:
  Section 1: similarity 0.87
  Section 2: similarity 0.72
  Section 3: similarity 0.58

To be eligible for TCS placement you need:
- Minimum 60% marks in 10th, 12th and graduation
- No active backlogs at time of selection
- Be from the 2023, 2024, 2025 or 2026 batch
- Be a B.Tech, B.E, M.Tech or MCA graduate

Common Errors and Fixes

ModuleNotFoundError: No module named 'openai'

Your virtual environment is probably not active, or the packages were installed outside it.

# Mac/Linux
source venv/bin/activate

# Windows
venv\Scripts\activate

# Then install again
pip install openai chromadb pypdf python-dotenv

AuthenticationError: Incorrect API key

Open .env and check:

No spaces around =: write OPENAI_API_KEY=sk-abc123 not OPENAI_API_KEY = sk-abc123
The key starts with sk-
You copied the complete key
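
The three checks above can be turned into a tiny script. This is a sketch, and check_env_line is an illustrative helper, not part of python-dotenv. Some loaders are more forgiving about spacing, but following the checklist avoids surprises.

```python
def check_env_line(line):
    """Return a list of problems found in one OPENAI_API_KEY line of a .env file."""
    line = line.rstrip("\n")
    name, sep, value = line.partition("=")
    if not sep:
        return ["no '=' found"]
    problems = []
    if name != name.strip() or value != value.strip():
        problems.append("remove spaces around '='")
    if name.strip() != "OPENAI_API_KEY":
        problems.append("variable name should be OPENAI_API_KEY")
    if not value.strip().strip("\"'").startswith("sk-"):
        problems.append("key should start with sk-")
    return problems

print(check_env_line("OPENAI_API_KEY=sk-abc123"))     # []
print(check_env_line("OPENAI_API_KEY = sk-abc123"))   # ["remove spaces around '='"]
```

It cannot tell whether the key is complete or still valid. Only the API can confirm that, so if the line passes and the error persists, create a fresh key.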

ERROR: No text found. Your PDF might be a scanned image.

The PDF has no extractable text layer, usually because it is a scanned image (a photo of a document). Use a different PDF or use the test content above.

openai.RateLimitError

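You hit the API rate limit, or your account has no credit left (the same error is raised when the quota is exhausted). Wait 60 seconds and try again. If it keeps failing, check the billing page on platform.openai.com.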
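
Instead of waiting by hand, you can retry automatically with a growing delay (exponential backoff). The sketch below is a generic helper. with_retries and its parameters are illustrative names, and the demo uses a stand-in exception so it runs without an API key. In rag_bot.py you would pass openai.RateLimitError as the error to retry on.

```python
import time

def with_retries(fn, retry_on, attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fn(); on a retry_on exception wait base_delay, then 2x, 4x... and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise                      # out of attempts: surface the error
            sleep(base_delay * (2 ** attempt))

# Demo with a stand-in error that fails twice, then succeeds
class FakeRateLimit(Exception):
    pass

calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeRateLimit()
    return "ok"

waits = []                                  # record delays instead of really sleeping
print(with_retries(flaky, FakeRateLimit, sleep=waits.append))  # ok
print(waits)                                                   # [2.0, 4.0]
```

In the real program the call could look like with_retries(lambda: get_embedding(chunk), openai.RateLimitError), after adding import openai at the top. Retrying will not help if the cause is an empty balance.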

Break It — Learn by Experimenting

Do not skip this section. Breaking and fixing teaches you 10x more than reading.

Experiment 1 — Reduce Retrieved Chunks

Find this line in rag_bot.py:

chunks = find_chunks(question, n=3)

Change n=3 to n=1:

chunks = find_chunks(question, n=1)

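Ask the same question about TCS eligibility. Use a PDF long enough to produce several chunks; the short sample text fits in one chunk, so changing n makes no difference there.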

What changes? The answer is less complete. It may miss some criteria.

Why? With only 1 chunk retrieved, you give the AI less context to work with.

What you learned: More retrieved chunks = richer context = better answers. But too many = expensive and confusing. 3 to 5 is usually the right balance.

Change it back to n=3.

Experiment 2 — Change Chunk Size

Find this line:

chunks = split_into_chunks(text)

Change it to:

chunks = split_into_chunks(text, chunk_size=50, overlap=10)

Restart the program and ask your question.

What changes? More chunks, but each one is very short. Answers feel incomplete.

Why? 50-word chunks are too small to contain complete thoughts. The retrieved pieces do not give the AI enough to work with.

What you learned: Chunk size is one of the most important settings in RAG. Too small = fragmented. Too large = irrelevant information mixed in. 300 to 500 words is a reasonable starting point; the best value depends on your documents.

Change it back to the default.
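
You can see the effect of chunk size on the number of chunks without calling the API at all. The sketch below copies the word-based splitting from rag_bot.py so it runs on its own, and uses a stand-in document of 115 made-up "words" (roughly the size of the sample placement text).

```python
def split_into_chunks(text, chunk_size=400, overlap=50):
    """Same word-based splitting as rag_bot.py, copied here so the example runs alone."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(words), step):
        chunk = " ".join(words[i : i + chunk_size])
        if len(chunk.strip()) > 50:   # skip very short pieces
            chunks.append(chunk)
    return chunks

# Stand-in document: 115 numbered placeholder words
text = " ".join(f"word{n}" for n in range(115))

for size, overlap in [(400, 50), (50, 10)]:
    chunks = split_into_chunks(text, size, overlap)
    print(f"chunk_size={size}, overlap={overlap}: {len(chunks)} chunk(s)")
```

With the defaults the whole text fits in 1 chunk. With chunk_size=50 and overlap=10 it becomes 3 chunks, because each new chunk starts 40 words after the previous one.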

Experiment 3 — Ask Something Not in the Document

Ask: "What is the IIT JEE syllabus for 2025?"

What happens? The program should say it could not find this in the document, instead of making something up.

Why? The prompt says: "If the answer is not in the context, say: I could not find this in the document. Do not make up any information."

What you learned: The instructions in the prompt strongly shape the AI's behaviour, although they are not a guarantee. Try removing that instruction and asking again. You will likely see the AI answer from its own training data or make things up. Confident answers with no support in the source are called hallucination.

Experiment 4 — See the Raw Embeddings

Add one print line inside get_embedding():

def get_embedding(text: str) -> list:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    embedding = response.data[0].embedding
    print(f"  Embedding: {len(embedding)} numbers, first 3: {embedding[:3]}")  # ADD THIS
    return embedding

Run the program and load the PDF.

What do you see? Every chunk gets turned into 1536 numbers. The question becomes 1536 numbers too. ChromaDB compares them all to find the closest match.

What you learned: Embeddings are just lists of numbers. The intelligence is in how similar meanings produce similar numbers. This is the foundation of modern semantic search.

Remove the print line when done.

The Challenge

Goal: Show exactly which part of the document each answer came from.

After displaying the answer, print the first 150 characters of the most relevant chunk as the source.

Expected output:

Answer: TCS requires a minimum of 60% marks throughout...

Source: "TCS Placement Information 2025. Eligibility Criteria: Minimum 60% marks in 10th, 12th and..."

Hint: The chunks list inside the answer() function has chunk["text"] for each retrieved chunk. chunks[0] is the most relevant one.

Try it yourself first. If you get stuck, the solution is in the GitHub repository.
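
If you want a nudge without the full solution, one possible building block is a small formatting helper. This is only a sketch; format_source is a hypothetical name, and wiring it into answer() is still up to you.

```python
def format_source(chunk_text, limit=150):
    """Collapse whitespace and return the first `limit` characters as a quoted snippet."""
    flat = " ".join(chunk_text.split())
    snippet = flat[:limit]
    if len(flat) > limit:
        snippet += "..."
    return f'Source: "{snippet}"'

sample = (
    "TCS Placement Information 2025\n\nEligibility Criteria:\n"
    "Minimum 60% marks in 10th, 12th and graduation."
)
print(format_source(sample, limit=40))
```

Collapsing the whitespace first keeps the snippet on one line even when the PDF text contains line breaks.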

Add a Web Interface (Optional)

Install Streamlit:

pip install streamlit

Create app.py:

import streamlit as st
import tempfile
import os
from rag_bot import load_pdf, answer

st.set_page_config(page_title="Document Q&A", page_icon="📚")
st.title("📚 Document Q&A")
st.caption("Upload any PDF and ask questions about it")

uploaded = st.file_uploader("Upload PDF", type="pdf")

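if uploaded:
    if st.button("Load Document"):
        # Save the upload only when loading; Streamlit reruns this script on every
        # interaction, so saving outside the button would create a new temp file each time
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
            tmp.write(uploaded.getbuffer())
            tmp_path = tmp.name

        with st.spinner("Processing..."):
            load_pdf(tmp_path)
        os.remove(tmp_path)                  # chunks are stored in ChromaDB, temp file no longer needed
        st.session_state["loaded"] = True    # remember across reruns that a document is ready
        st.success("Ready! Ask your questions below.")

    # Only accept questions once a document has been loaded
    if st.session_state.get("loaded"):
        question = st.text_input("Your question:")
        if question:
            with st.spinner("Thinking..."):
                result = answer(question)
            st.write("**Answer:**")
            st.write(result)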

Run it:

streamlit run app.py

Opens in your browser at http://localhost:8501.

What to Push to GitHub

# Initialise git
git init

# Add everything (.gitignore keeps .env and venv out)
git add .
git commit -m "Day 8: RAG document Q&A bot"

# Name the branch main (older git versions default to master)
git branch -M main

# Create repo on github.com and push
git remote add origin https://github.com/yourusername/rag-document-qa.git
git push -u origin main

Your repo should have:

rag-document-qa/
├── .gitignore        ✅ (includes .env and venv)
├── README.md         ← write 3 lines explaining what it does
├── requirements.txt  ✅
├── rag_bot.py        ✅
└── app.py            ✅ (if you built the web UI)

NOT in GitHub:

.env              ← never push this
venv/             ← never push this

What to Write in Your Portfolio

Project title: RAG Document Q&A System

Description (copy this): Built a Retrieval Augmented Generation system from scratch using Python. The system reads any PDF, splits it into 400-word chunks with 50-word overlap, embeds each chunk using OpenAI text-embedding-3-small, and stores vectors in ChromaDB. When a question is asked, it finds the 3 most semantically similar chunks using cosine similarity and feeds them as context to GPT-4o-mini. The result is document-grounded answers with far fewer hallucinations than the model produces on its own.

Tech stack: Python, OpenAI API, ChromaDB, pypdf, Streamlit

How to Explain This in an Interview

"I built a RAG system — Retrieval Augmented Generation. The user uploads any PDF and the system splits it into 400-word overlapping chunks. Each chunk gets converted to a 1536-dimensional vector using OpenAI embeddings and stored in ChromaDB. When a question comes in, it also gets embedded and ChromaDB uses cosine similarity to find the 3 most relevant chunks. Those chunks become the context in a prompt to GPT-4o-mini which generates an answer grounded in the actual document. I tested it with college placement brochures — it can answer specific eligibility questions accurately when the AI alone would hallucinate."

If asked follow-ups:

"Why overlapping chunks?" → So information at chunk boundaries is not lost. A sentence that crosses a boundary (and is shorter than the overlap) still appears whole in the next chunk.
"What is cosine similarity?" → It measures the angle between two vectors. Similar meaning = similar direction = high score.
"Why ChromaDB and not a regular database?" → Regular databases match exact words. ChromaDB matches meaning, so similar questions find the right chunks even if the words differ.

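To back up the overlap answer with something concrete, here is a small sketch using numbered placeholder words. chunk_words and contains are illustrative helpers. The three-word "sentence" w9 w10 w11 crosses the end of the first 10-word chunk.

```python
def chunk_words(words, chunk_size, overlap):
    """Split a word list into chunks; each chunk repeats the last `overlap` words of the previous one."""
    step = chunk_size - overlap
    return [words[i : i + chunk_size] for i in range(0, len(words), step)]

def contains(chunk, phrase):
    """True if `phrase` appears as consecutive words inside `chunk`."""
    return any(chunk[i : i + len(phrase)] == phrase for i in range(len(chunk)))

words = [f"w{n}" for n in range(1, 21)]   # w1 ... w20
sentence = ["w9", "w10", "w11"]           # crosses the first chunk's end at w10

no_overlap   = chunk_words(words, chunk_size=10, overlap=0)
with_overlap = chunk_words(words, chunk_size=10, overlap=4)

print([contains(c, sentence) for c in no_overlap])    # [False, False]
print([contains(c, sentence) for c in with_overlap])  # [False, True, False, False]
```

Without overlap, no chunk holds the whole sentence. With a 4-word overlap, the second chunk starts at w7 and contains it in full.
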
Key Terms

Term	Meaning
RAG	Retrieve relevant chunks → add to prompt → generate answer
Embedding	Text converted to a list of numbers capturing its meaning
Vector database	Stores embeddings, finds similar ones quickly
Semantic search	Finds similar meaning, not just matching keywords
Chunking	Splitting documents into smaller searchable pieces
Cosine similarity	Measures how closely two vectors point in the same direction (-1 to 1; close to 1 means similar)
Hallucination	AI making up information; RAG reduces this by grounding answers in the document

Tomorrow: Day 9 — How AI Agents work. We will build an agent that searches job boards, filters results, ranks by your profile, and drafts outreach messages — all without you clicking anything.

Day 8 of 15 — AI Survival Kit for Engineers