LLM GPT Claude Groq DeepSeek AI transformers fresher 2026

Day 19 — How LLMs Work: GPT vs Claude vs Groq vs DeepSeek Explained

How large language models actually work explained simply. Plus an honest comparison of GPT-4o, Claude, Groq, DeepSeek — when to use each, pricing, and which to build with.

17 May 2026 9 min read

Day 19 — How LLMs Work: GPT vs Claude vs Groq vs DeepSeek

You have been using AI tools throughout this course. Today you learn how they actually work.

Not the deep academic version — the practical understanding that lets you use these tools more effectively, explain them in interviews, and make smart decisions about which to use for what.


What an LLM Actually Is

LLM stands for Large Language Model. The "large" refers to both the amount of training data and the number of parameters (the numbers inside the model that encode its knowledge).

The core task an LLM is trained to do is deceptively simple: predict the next word.

Given "The capital of France is", what word comes next? "Paris." Given "def calculate_fibonacci(n):", what comes next? The function body.

That is literally it. Everything else — answering questions, writing code, explaining concepts, having conversations — emerges from doing this one task extremely well across an enormous amount of text.


How Training Works (Simplified)

Step 1: Pre-training

The model is shown hundreds of billions of words from the internet, books, code repositories, and scientific papers. For each piece of text, it repeatedly:

  1. Sees the first N words
  2. Predicts the next word
  3. Gets told what the actual next word was
  4. Adjusts its internal numbers slightly to be less wrong next time

This process runs across thousands of GPUs for months. GPT-4's training reportedly cost over $100 million in compute. This is why only well-funded companies can train frontier models from scratch.

After pre-training, the model can complete text. It knows a lot about the world. But it does not know how to follow instructions.

Step 2: Instruction Fine-tuning (SFT)

The pre-trained model is fine-tuned on examples of good conversations. Human trainers write examples of questions and ideal responses. The model learns to be helpful and to follow the format of "user asks, assistant answers."

Step 3: RLHF (Reinforcement Learning from Human Feedback)

Human raters compare pairs of model responses and indicate which is better. The model learns to generate responses that humans prefer — accurate, helpful, safe, well-formatted.

This is what transforms a text predictor into a conversational assistant.


The Transformer Architecture (What You Need to Know)

All modern LLMs are built on the Transformer architecture, introduced by Google in 2017.

The key innovation is the attention mechanism. When processing a sentence, each word pays attention to every other word to understand context.

In "The bank can guarantee deposits will eventually cover future tuition costs" — does "bank" mean a financial institution or a riverbank? Attention looks at the surrounding words ("deposits", "costs") and determines it is financial.

Context window: How many words the model can "see" at once. GPT-4 Turbo has a 128,000 token context (roughly 100,000 words). Claude has up to 200,000 tokens. This is why you can paste an entire document and ask questions about it.

Parameters: The numbers inside the model. GPT-3 had 175 billion parameters. GPT-4 is estimated at over 1 trillion. More parameters generally means more capability but also more compute required for inference.

Tokens: Models do not process words — they process tokens, which are roughly 3/4 of a word on average. "unhappy" might be one token. "unprecedented" might be two. This matters for pricing (you pay per token) and context length.


GPT-4o (OpenAI)

What it is: OpenAI's flagship model as of 2025-2026. Multimodal — handles text, images, audio.

Strengths:

  • Best general-purpose capability across diverse tasks
  • Strong at code generation and debugging
  • Widely tested and well understood
  • Excellent function calling / tool use (important for agents)
  • Large ecosystem of integrations

Weaknesses:

  • Most expensive of the major models
  • Slower than Groq for inference
  • Closed source — you cannot run it yourself

Pricing (approximate): $5 per million input tokens, $15 per million output tokens for GPT-4o. GPT-4o-mini is $0.15/$0.60 — much cheaper for simpler tasks.

When to use it:

  • Complex reasoning tasks
  • Code generation for difficult problems
  • Multimodal tasks (image + text)
  • When you need the most reliable capability

For building: OpenAI has the best documentation, the largest community, and the most examples. For your first AI project, GPT-4o-mini is the right choice — cheap enough to experiment freely, capable enough to build real things.


Claude (Anthropic)

What it is: Anthropic's model family. Claude 3.5 Sonnet and Claude 3 Opus are the frontier models.

Strengths:

  • Best reasoning and analysis of long documents
  • Superior at following nuanced instructions
  • Largest context window (200,000 tokens — can process entire codebases)
  • Excellent at writing that sounds human (not "AI-flavoured")
  • Stronger safety properties — less likely to hallucinate confidently
  • MCP support (Day 11) — built natively

Weaknesses:

  • Not available in all countries/regions without API access
  • Function calling is slightly less mature than OpenAI
  • Smaller ecosystem than OpenAI

Pricing: Claude 3.5 Sonnet is $3/$15 per million tokens (input/output). Claude Haiku (smaller, faster) is $0.25/$1.25.

When to use it:

  • Long document analysis (paste an entire report and ask questions)
  • Complex writing tasks
  • Tasks where following detailed instructions matters
  • Building with MCP (Claude Desktop natively supports it)

For building: Excellent choice, especially if you built the MCP server from Day 11. The Claude API is well-designed and the documentation is clear.


Groq

What it is: Not an AI company in the traditional sense — Groq builds custom hardware (LPUs — Language Processing Units) designed specifically for fast LLM inference. They run open-source models (Llama, Mixtral, Gemma) on this hardware.

Strengths:

  • Dramatically faster than OpenAI or Anthropic — 500+ tokens per second vs ~50 for GPT-4
  • Very cheap — often 10-20x cheaper than GPT-4
  • Free tier is generous (great for learning and prototyping)
  • Streaming responses appear almost instantly

Weaknesses:

  • Not Groq's own model — they run other companies' models
  • Capability ceiling is below GPT-4o for complex tasks
  • Less reliable for very complex reasoning

Pricing: Free tier: 14,400 requests/day (enough to build and test). Paid: from $0.05 per million tokens (for Llama 3 8B) — extremely cheap.

When to use it:

  • Real-time applications where speed matters (chatbots, live suggestions)
  • High-volume applications where cost matters
  • Prototyping and learning (free tier is generous)
  • Applications that need streaming responses to feel fast

For building: Groq is the best choice for Day 20's project. Free, fast, works with the OpenAI SDK (just change the base URL). You can build a production-quality chat app on the free tier.


DeepSeek

What it is: A Chinese AI company that released models comparable to GPT-4 at a fraction of the training cost. DeepSeek-V3 and DeepSeek-R1 made significant waves in early 2025.

Strengths:

  • DeepSeek-R1 has exceptional reasoning — comparable to OpenAI o1 for math and coding
  • Extremely cheap to run via API
  • Open weights — you can run DeepSeek locally on a powerful machine
  • Strong at code and mathematical reasoning specifically

Weaknesses:

  • Data privacy concerns — data goes to Chinese servers
  • May have content restrictions on certain topics
  • Not appropriate for applications involving sensitive data
  • Less integration support than OpenAI

Pricing: DeepSeek-V3: $0.27 per million input tokens, $1.10 per million output. Significantly cheaper than GPT-4o.

When to use it:

  • Math and coding tasks where reasoning quality matters
  • Cost-sensitive applications
  • When you want to run a model locally (self-hosted)
  • Research and experimentation

For building: Good for personal projects. For production Indian applications, consider data residency rules under DPDPA before using.


Quick Comparison Table

Model Speed Cost Reasoning Context Best For
GPT-4o Medium $$$ Excellent 128K General, code, multimodal
GPT-4o-mini Fast $ Good 128K Budget, high volume
Claude 3.5 Sonnet Medium $$ Excellent 200K Long docs, writing, MCP
Claude Haiku Fast $ Good 200K Fast, cheap tasks
Groq (Llama 3) Very Fast Free/$ Good 8K Realtime, streaming
DeepSeek-V3 Medium $ Very Good 64K Math, code, cost savings
DeepSeek-R1 Slow $$ Exceptional 64K Complex reasoning

How to Choose for Your Project

Building a placement chatbot for students? Groq — it is fast, free, and good enough for Q&A.

Building a document analysis tool? Claude — best at long context and following complex instructions.

Building a coding assistant? GPT-4o or DeepSeek-R1 for quality, Groq for speed.

Prototyping and learning? Groq free tier. No billing setup needed.

Production application at scale? GPT-4o-mini or Claude Haiku — balance of quality and cost.

Sensitive Indian user data? OpenAI or Anthropic — clearer data processing agreements, GDPR/data protection compliance.


Interview Questions on LLMs

"What is a transformer?" A neural network architecture where each token can attend to every other token in the sequence. The attention mechanism allows the model to understand context — the same word can mean different things in different contexts.

"What is the difference between GPT and BERT?" GPT is autoregressive — it generates text from left to right and is trained to predict the next token. BERT is bidirectional — it can see the full sentence and is trained to predict masked tokens. GPT is better for generation; BERT is better for classification.

"What is hallucination and how do you prevent it?" Hallucination is when the model generates text that sounds confident but is factually wrong. Prevention: RAG (grounding answers in real documents), prompt instructions ("only answer from the provided context"), and verification systems.

"What is RAG?" You built this on Day 9. You know this better than most interviewers asking the question.

"What is fine-tuning vs prompting?" You covered this on Day 12.


The Practical Reality

You do not need to understand transformer mathematics to build useful AI applications. You need to understand:

  1. What these models can and cannot do (they predict text, they do not "know" things)
  2. How to prompt them effectively (clear instructions, examples, constraints)
  3. Which model fits which use case (speed, cost, capability)
  4. How to build reliable systems around them (RAG, validation, error handling)

The engineer who understands these four things and has built real applications is more valuable than the engineer who can derive backpropagation but has never shipped anything.


Tomorrow: Day 20 — You will build a complete streaming chat application using Groq's free API. A real URL, real AI responses, deployed on Vercel. Takes 2 hours.


Day 19 of the AI Survival Kit — Career Roadmaps series

Ready to stand out?

Your portfolio is 60 seconds away.

Upload your resume. AI builds your portfolio. Share it everywhere.

Build Free Portfolio

Free forever · No credit card · 60 seconds