I know it is a bit late to tinker with this topic, but here we go. For the last year and a half, I have been using Claude and Gemini in both my personal and professional projects. From the first glance I knew it was an application of machine learning, just vomiting a series of words based on some probabilistic distribution. But it actually feels like magic. That’s why all the GenAI applications are marked by a sparkle and use purple/magenta colors. So, finally, with this three-part series, I am trying to understand (from a high-level view) how an LLM works, unlearn all I know about prompt engineering, RAG, and model evals, and hopefully come up with an optimal way to make GenAI deterministic in prod while being cost conscious.
The internals
LLM stands for Large Language Model. So, we can assume it is a machine learning model that does something with language. And for that it needs to be trained on a large dataset of text.
Transformers
Pre-training
Application
Tokenization
As we discussed, transformers learn patterns during pre-training and then use them to predict the next token in a sequence. But storing natural language text as raw characters would be too expensive, and it would be painfully slow to run any kind of inference on top of that. So we convert the text into a sequence of integers. This allows us to store the text much more efficiently and also to process it much faster.
Byte Pair Encoding
Most LLMs use BPE or variants (WordPiece, SentencePiece). BPE learns subword units from training data.
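To build an intuition for how BPE learns those subword units, here is a minimal toy sketch of the merge loop (illustrative only, not how production tokenizers are actually trained): repeatedly count the most frequent adjacent pair of symbols in a tiny corpus and merge it into a new symbol.

from collections import Counter

def bpe_merges(words, num_merges=5):
    # Start with each word as a tuple of characters
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge the best pair wherever it occurs
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

print(bpe_merges(["low", "lower", "lowest", "newest", "widest"]))
# Frequent pairs such as ('l', 'o') and then ('lo', 'w') get merged into subword units

In practice you never run this loop yourself; libraries like tiktoken ship the learned vocabularies for OpenAI models, so you can see how a given model actually splits text: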
import tiktoken

model_name = "gpt-4"
encoding = tiktoken.encoding_for_model(model_name)
text = "Cause the light was on."
tokens = encoding.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")
for token in tokens:
    print(f"{token}: '{encoding.decode([token])}'")
decoded = encoding.decode(tokens)
print(f"Decoded: {decoded}")

Output:
Text: Cause the light was on.
Tokens: [62012, 279, 3177, 574, 389, 13]
Token count: 6
62012: 'Cause'
279: ' the'
3177: ' light'
574: ' was'
389: ' on'
13: '.'
Decoded: Cause the light was on.

Why Subword Tokenization?
- Word-level: Vocabulary explosion, can’t handle rare words
- Character-level: Sequences too long, loses semantic meaning
- Subword: Balance between vocabulary size and sequence length
# Compare tokenization
texts = [
    "run running runner",
    "antiestablishmentarianism",
    "😀 🎉 🚀"
]
for text in texts:
    tokens = encoding.encode(text)
    print(f"\nText: {text}")
    print(f"Tokens ({len(tokens)}): {[encoding.decode([t]) for t in tokens]}")

Token Limits Are Real
Context windows are measured in tokens, not characters. This matters:
def estimate_tokens(text, model="gpt-4"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Example: code is token-heavy
code = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""
prose = "A simple recursive function to calculate Fibonacci numbers."

print(f"Code tokens: {estimate_tokens(code)}")   # ~40 tokens
print(f"Prose tokens: {estimate_tokens(prose)}") # ~10 tokens

Code, JSON, and special characters consume more tokens than natural language.
Transformer Architecture: The Engine
The transformer is the computational graph that processes token sequences.
Input Representation
import torch
import torch.nn as nn
import torch.nn.functional as F  # used later for softmax / cross-entropy

class InputEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.scale = d_model ** 0.5

    def forward(self, x):
        # Scale embeddings by sqrt(d_model) as per paper
        return self.embedding(x) * self.scale

# Example
vocab_size = 50257  # GPT-2 vocab
d_model = 768       # Hidden dimension
embedding = InputEmbedding(vocab_size, d_model)
tokens = torch.tensor([[15496, 995]])  # "Hello world"
embedded = embedding(tokens)
print(f"Shape: {embedded.shape}")  # [batch_size, seq_len, d_model]

Positional Encoding
Transformers have no inherent position awareness. We inject it:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Create constant PE matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        # Compute div_term for sinusoidal encoding
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() *
            (-torch.log(torch.tensor(10000.0)) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # x: [batch_size, seq_len, d_model]
        return x + self.pe[:, :x.size(1)]

Modern models often use learned positional embeddings:
class LearnedPositionalEmbedding(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        self.pos_embedding = nn.Embedding(max_len, d_model)

    def forward(self, x):
        batch_size, seq_len = x.shape[:2]
        positions = torch.arange(seq_len, device=x.device)
        return x + self.pos_embedding(positions)

Multi-Head Self-Attention: The Core Innovation
Self-attention computes relationships between all positions in a sequence.
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Q, K, V projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        # Output projection
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def split_heads(self, x):
        batch_size, seq_len, d_model = x.shape
        # [batch, seq_len, d_model] -> [batch, num_heads, seq_len, d_k]
        return x.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape
        # Project and split heads
        Q = self.split_heads(self.W_q(x))
        K = self.split_heads(self.W_k(x))
        V = self.split_heads(self.W_v(x))
        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attention_weights = torch.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)
        # Apply attention to values
        attention_output = torch.matmul(attention_weights, V)
        # Concatenate heads
        attention_output = attention_output.transpose(1, 2).contiguous()
        attention_output = attention_output.view(batch_size, seq_len, self.d_model)
        # Final projection
        return self.W_o(attention_output), attention_weights

Why multiple heads? Each head learns different relationships:
- Syntactic dependencies (subject-verb)
- Semantic relationships (synonyms, antonyms)
- Positional patterns (previous word, next word)
- Long-range dependencies (pronouns to antecedents)
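To see that each head really does produce its own attention pattern, we can run the MultiHeadAttention module defined above on a random sequence and inspect the returned weights. This is only a shape-and-intuition check with untrained, random projections.

import torch

mha = MultiHeadAttention(d_model=768, num_heads=12)
mha.eval()  # disable dropout so each attention row sums to exactly 1

x = torch.randn(1, 10, 768)  # a batch of 1 sequence with 10 token embeddings
out, weights = mha(x)

print(out.shape)      # torch.Size([1, 10, 768])
print(weights.shape)  # torch.Size([1, 12, 10, 10]) -> one 10x10 attention map per head
print(weights[0, 0].sum(dim=-1))  # each query's weights sum to 1 (softmax over keys)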
Feed-Forward Networks
After attention, each position goes through a position-wise FFN:
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = nn.GELU()  # Modern models use GELU

    def forward(self, x):
        # FFN(x) = Linear2(Dropout(GELU(Linear1(x))))
        return self.linear2(self.dropout(self.activation(self.linear1(x))))

Typical dimensions:
- d_model = 768 (GPT-2 small)
- d_ff = 3072 (4x expansion)
- d_model = 12288 (GPT-4, estimated)
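As a quick sanity check on the 4x expansion, the two linear layers of a single FFN at GPT-2 small size already account for a few million parameters (pure arithmetic on the dimensions above):

d_model, d_ff = 768, 3072
ffn_params = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # weights + biases of both linears
print(f"{ffn_params:,}")  # 4,722,432 -> ~4.7M per FFN, times 12 layers ≈ 57M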
Layer Normalization and Residuals
Critical for training deep networks:
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-norm architecture (modern approach)
        # Attention with residual
        normed = self.norm1(x)
        attention_output, _ = self.attention(normed, mask)
        x = x + self.dropout1(attention_output)
        # FFN with residual
        normed = self.norm2(x)
        ffn_output = self.feed_forward(normed)
        x = x + self.dropout2(ffn_output)
        return x

Pre-norm vs Post-norm:
- Post-norm: LayerNorm(x + Sublayer(x)) (original paper)
- Pre-norm: x + Sublayer(LayerNorm(x)) (modern, more stable)
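For contrast, here is what the same block looks like with the original post-norm ordering. This is a minimal sketch reusing the MultiHeadAttention and FeedForward modules defined above, just to make explicit where LayerNorm sits; it is not taken from any particular model’s code.

class PostNormTransformerBlock(nn.Module):
    # Post-norm ordering from the original paper: LayerNorm(x + Sublayer(x))
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Sublayer first, then add the residual, then normalize
        attn_out, _ = self.attention(x, mask)
        x = self.norm1(x + self.dropout(attn_out))
        ffn_out = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ffn_out))
        return x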
Complete GPT-Style Model
class GPTModel(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, d_ff, num_layers, max_len, dropout=0.1):
        super().__init__()
        self.token_embedding = InputEmbedding(vocab_size, d_model)
        self.pos_embedding = LearnedPositionalEmbedding(max_len, d_model)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie weights (embedding and output share weights)
        self.lm_head.weight = self.token_embedding.embedding.weight
        self.dropout = nn.Dropout(dropout)

    def forward(self, tokens, mask=None):
        # Embed tokens and add positions
        x = self.token_embedding(tokens)
        x = self.pos_embedding(x)
        x = self.dropout(x)
        # Apply transformer blocks
        for block in self.blocks:
            x = block(x, mask)
        # Final layer norm
        x = self.ln_f(x)
        # Project to vocabulary
        logits = self.lm_head(x)
        return logits

# Example: GPT-2 Small dimensions
model = GPTModel(
    vocab_size=50257,
    d_model=768,
    num_heads=12,
    d_ff=3072,
    num_layers=12,
    max_len=1024,
    dropout=0.1
)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# Output: ~124M parameters (GPT-2 small scale)

Training: Next Token Prediction
LLMs are trained to predict the next token given all previous tokens.
def train_step(model, batch, optimizer):
    # batch: [batch_size, seq_len]
    input_tokens = batch[:, :-1]   # All except last
    target_tokens = batch[:, 1:]   # All except first
    # Forward pass
    logits = model(input_tokens)
    # Compute loss
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_tokens.reshape(-1),
        ignore_index=-100  # Ignore padding
    )
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping (prevents exploding gradients)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()

Loss Function Explained
Cross-entropy loss for language modeling:
# Manual cross-entropy calculation
def manual_cross_entropy(logits, targets):
    # logits: [batch*seq_len, vocab_size]
    # targets: [batch*seq_len]
    # Get probabilities
    probs = F.softmax(logits, dim=-1)
    # Get target probabilities
    target_probs = probs[range(len(targets)), targets]
    # Negative log likelihood
    loss = -torch.log(target_probs).mean()
    return loss

The model learns to assign high probability to the actual next token.
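A quick numeric illustration of that statement: when the model puts 90% of its probability mass on the correct next token the per-token loss is tiny, and at 1% it blows up (just evaluating -log p).

import math

for p in (0.9, 0.5, 0.1, 0.01):
    print(f"p(correct token) = {p:<5} -> loss = {-math.log(p):.2f}")
# 0.9 -> 0.11, 0.5 -> 0.69, 0.1 -> 2.30, 0.01 -> 4.61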
Inference: Text Generation
Autoregressive Generation
@torch.no_grad()
def generate(model, prompt_tokens, max_new_tokens=50, temperature=1.0):
    # EOS_TOKEN_ID and MAX_CONTEXT_LENGTH are assumed to be defined for your tokenizer/model
    model.eval()
    tokens = prompt_tokens.clone()
    for _ in range(max_new_tokens):
        # Get predictions for last position
        logits = model(tokens)
        next_token_logits = logits[:, -1, :] / temperature
        # Sample next token
        probs = F.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        # Append to sequence
        tokens = torch.cat([tokens, next_token], dim=1)
        # Check for EOS token
        if next_token.item() == EOS_TOKEN_ID:
            break
        # Truncate if exceeds max context
        if tokens.size(1) > MAX_CONTEXT_LENGTH:
            tokens = tokens[:, -MAX_CONTEXT_LENGTH:]
    return tokens

Sampling Strategies
Temperature sampling:
def temperature_sample(logits, temperature=1.0):
    if temperature == 0:
        # Greedy decoding; keepdim so the shape matches multinomial's [batch, 1]
        return torch.argmax(logits, dim=-1, keepdim=True)
    logits = logits / temperature
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

Top-k sampling:
def top_k_sample(logits, k=50, temperature=1.0):
    logits = logits / temperature
    top_k_logits, top_k_indices = torch.topk(logits, k)
    probs = F.softmax(top_k_logits, dim=-1)
    sampled = torch.multinomial(probs, num_samples=1)
    return top_k_indices.gather(-1, sampled)

Top-p (nucleus) sampling:
def top_p_sample(logits, p=0.9, temperature=1.0):
    logits = logits / temperature
    probs = F.softmax(logits, dim=-1)
    # Sort probabilities
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
    # Remove tokens with cumulative probability above threshold
    sorted_indices_to_remove = cumulative_probs > p
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = 0
    indices_to_remove = sorted_indices_to_remove.scatter(
        -1, sorted_indices, sorted_indices_to_remove
    )
    logits[indices_to_remove] = -float('inf')
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

KV-Cache Optimization
Naive generation recomputes attention for all previous tokens every step. KV-caching stores past keys and values:
class GPTWithCache(nn.Module):
    # Schematic sketch: assumes each block can accept and return its own (K, V) cache
    def forward(self, tokens, past_kv_cache=None):
        if past_kv_cache is None:
            # First forward pass - compute everything
            x = self.embed(tokens)
            kv_cache = []
            for block in self.blocks:
                x, kv = block(x, use_cache=True)
                kv_cache.append(kv)
            return x, kv_cache
        else:
            # Use cached K,V for previous tokens
            # Only compute for new token
            x = self.embed(tokens[:, -1:])
            new_kv_cache = []
            for block, past_kv in zip(self.blocks, past_kv_cache):
                x, kv = block(x, past_kv=past_kv, use_cache=True)
                new_kv_cache.append(kv)
            return x, new_kv_cache

def generate_with_cache(model, prompt_tokens, max_new_tokens=50):
    tokens = prompt_tokens
    past_kv = None
    for _ in range(max_new_tokens):
        logits, past_kv = model(tokens, past_kv_cache=past_kv)
        next_token = sample(logits[:, -1, :])  # sample() = any of the strategies above
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens

Speedup: per generated token, attention cost drops from O(n²) (recomputing the full prefix) to O(n) (one new query over cached keys and values).
Scaling Laws
Model performance scales predictably with three factors: model size (parameters), dataset size (training tokens), and compute.
Chinchilla scaling laws:
Loss ∝ (Compute)^(-α)
Optimal: N_tokens ≈ 20 × N_params
For 1T tokens: ~50B parameters
For 10T tokens: ~500B parameters

GPT-4 implications:
- Estimated 1.7T parameters
- Trained on ~13T tokens (rumored)
- Cost: ~$100M in compute
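To make the Chinchilla rule of thumb concrete, here is a back-of-envelope helper (tokens ≈ 20 × parameters); the numbers are rough by construction.

def chinchilla_optimal_params(num_tokens):
    # Chinchilla heuristic: train on roughly 20 tokens per parameter
    return num_tokens / 20

for tokens in (1e12, 10e12):
    print(f"{tokens:.0e} tokens -> ~{chinchilla_optimal_params(tokens) / 1e9:.0f}B parameters")
# 1e+12 tokens -> ~50B parameters
# 1e+13 tokens -> ~500B parameters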
What’s Next?
You now understand the technical foundation: tokenization converts text to integers, transformers process sequences through attention and feed-forward layers, training optimizes next-token prediction, and inference generates text autoregressively with various sampling strategies.
But knowing how LLMs work doesn’t mean you can make them work reliably. In “How to Make LLMs Work”, we’ll cover:
- Prompt engineering that actually works
- When to fine-tune vs use RAG
- Building reliable chains and agents
- Handling failures and fallbacks
- Testing and evaluation strategies
The practical implementation guide awaits in Part 2.