🤖 AI / ML LLMs · MLOps PyTorch · RAG ~105 questions

Senior AI / ML Engineer

A complete set of senior-level AI/ML engineering interview questions covering machine learning fundamentals, deep learning, transformer architecture, LLMs, RAG, fine-tuning, agents, vector databases, MLOps, production AI systems, and responsible AI.


ML Fundamentals

12 questions
1. Explain bias-variance tradeoff. How does it guide model selection and regularization?

Total model error = Bias² + Variance + Irreducible Noise.

Bias is error from wrong assumptions in the learning algorithm — a model that is too simple to capture the true pattern (underfitting). High bias → systematic errors regardless of training data.

Variance is error from sensitivity to fluctuations in the training set — a model that memorizes noise (overfitting). High variance → performs great on training data, poorly on unseen data.

# Diagnosing:
# High bias (underfitting): training error ≈ validation error, both high
# High variance (overfitting): training error << validation error
# Good fit: training ≈ validation error, both low

# Regularization reduces variance at cost of slight bias increase:
# L1 (Lasso): adds λ·Σ|wᵢ| — produces sparse weights, feature selection
# L2 (Ridge): adds λ·Σwᵢ² — shrinks weights toward zero, no sparsity
# Elastic Net: combines L1 + L2

# Dropout: randomly zeroes activations during training — acts as ensemble
# Early stopping: stops training when validation loss increases
# Data augmentation: effectively increases dataset size → reduces variance

Practical guidance: Start with a simple model (high bias intentionally) to establish a baseline. Add complexity if training error is too high. Regularize if validation error diverges from training error. More data generally reduces variance but does not reduce bias.

Modern twist: Deep neural networks can defy the classical tradeoff. Very large overparameterized models can have low bias and low variance at the same time, helped by implicit regularization from SGD and by early stopping. Test error can fall again once capacity passes the interpolation threshold (the "double descent" phenomenon).
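The diagnosis rules above can be made concrete with a toy experiment. The following is a minimal sketch, assuming a synthetic 1-D dataset and a hand-rolled k-nearest-neighbour regressor (both are illustrative and not taken from any library). With k=1 the model memorizes the training set (high variance); with k equal to the training-set size it predicts the global mean (high bias).

```python
import math
import random

def make_data(n, rng):
    """Noisy samples of y = sin(3x) on [-1, 1] (synthetic, for illustration)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [math.sin(3 * x) + rng.gauss(0.0, 0.2) for x in xs]
    return xs, ys

def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(train_x, train_y, xs, ys, k):
    errs = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errs) / len(errs)

rng = random.Random(0)
train_x, train_y = make_data(60, rng)
val_x, val_y = make_data(60, rng)

results = {}
for k in (1, 5, 60):
    results[k] = (mse(train_x, train_y, train_x, train_y, k),
                  mse(train_x, train_y, val_x, val_y, k))
    print(f"k={k:>2}  train MSE={results[k][0]:.3f}  val MSE={results[k][1]:.3f}")
```

With k=1 the training error is exactly zero while the validation error is not, which is the overfitting signature. With k=60 the training error is clearly nonzero and validation error is expected to sit close to it, which is the underfitting signature.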

2. What is gradient descent and its variants (SGD, Mini-batch, Adam, AdamW)? When do you use each?

Gradient descent minimizes a loss function by iteratively updating parameters in the direction of the negative gradient: θ ← θ - α·∇L(θ).

Batch GD: computes gradient over entire dataset — exact gradient, slow per update, memory-intensive. Rarely used for large datasets.

SGD (Stochastic GD): one sample per update — fast, noisy gradients, good generalization. Noise helps escape local minima. Requires careful learning rate tuning.

Mini-batch GD: batch of 32–512 samples — best of both worlds. Most common for deep learning. GPU parallelism makes this efficient.

Momentum: accumulates velocity in consistent gradient directions, damps oscillations: v ← βv + ∇L; θ ← θ - α·v

Adam (Adaptive Moment Estimation): adapts the learning rate per parameter using exponential moving averages of the gradient (first moment, the mean) and of the squared gradient (second moment, the uncentered variance). Often converges faster with less tuning.

# Adam update:
m = β₁·m + (1-β₁)·g         # first moment (momentum)
v = β₂·v + (1-β₂)·g²        # second moment (RMSprop)
m̂ = m / (1-β₁ᵗ)             # bias correction
v̂ = v / (1-β₂ᵗ)
θ = θ - α·m̂ / (√v̂ + ε)     # typical: α=1e-3, β₁=0.9, β₂=0.999

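AdamW: Adam with decoupled weight decay. In plain Adam, L2 regularization is added to the gradient, so the penalty gets rescaled by the adaptive per-parameter learning rates. AdamW applies weight decay directly to the weights instead, which is the standard choice for transformer training.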

When to use: AdamW for transformers and most deep learning. SGD+momentum for CNNs (often better generalization than Adam with careful tuning). Adam when fast convergence matters more than final accuracy.
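To see the Adam update rule in action, here is a minimal plain-Python sketch that applies it to a simple quadratic. The function name `adam_minimize`, the toy objective, and the step count are assumptions chosen for illustration; real training would use `torch.optim.Adam` or `torch.optim.AdamW`.

```python
import math

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Plain-Python Adam on a list of parameters (illustrative, not optimized)."""
    m = [0.0] * len(theta)   # first-moment estimates
    v = [0.0] * len(theta)   # second-moment estimates
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)          # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

target = [3.0, -2.0]

def grad(theta):
    """Gradient of f(theta) = sum((theta_i - target_i)^2)."""
    return [2 * (th - tg) for th, tg in zip(theta, target)]

theta = adam_minimize(grad, [0.0, 0.0])
print(theta)   # should end up close to target
```

Note that early steps have magnitude close to `lr` regardless of gradient scale, because the ratio of the two bias-corrected moments is close to ±1. This is one reason Adam is less sensitive to gradient scaling than plain SGD.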

3. Explain cross-entropy loss, KL divergence, and when you use each loss function.

Cross-entropy loss measures the difference between predicted probability distribution and true distribution. For classification:

# Binary cross-entropy (BCE) — binary classification
BCE = -[y·log(p) + (1-y)·log(1-p)]

# Categorical cross-entropy — multi-class
CCE = -Σ yᵢ·log(pᵢ)  # sum over classes
# With softmax output, this equals negative log-likelihood

# In PyTorch:
criterion = nn.CrossEntropyLoss()  # combines log_softmax + NLLLoss
loss = criterion(logits, targets)  # logits: raw (pre-softmax) scores

KL Divergence measures how much distribution Q diverges from reference distribution P:

KL(P||Q) = Σ P(x)·log(P(x)/Q(x))
# Not symmetric: KL(P||Q) ≠ KL(Q||P)
# Cross-entropy H(P,Q) = H(P) + KL(P||Q)
# Minimizing cross-entropy = minimizing KL divergence (when P is fixed)
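The identity H(P,Q) = H(P) + KL(P||Q) and the asymmetry of KL can be checked numerically. This is a small sketch using two made-up discrete distributions `p` and `q` (assumed values, natural logarithm).

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # reference ("true") distribution
q = [0.5, 0.3, 0.2]   # model distribution

print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))  # equal up to rounding
print(kl_divergence(p, q), kl_divergence(q, p))               # differ: KL is asymmetric
```

Because H(P) does not depend on the model, lowering the cross-entropy against fixed labels lowers KL(P||Q) by the same amount.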

Loss function selection guide:

  • Regression: MSE (penalizes large errors heavily), MAE (robust to outliers), Huber (combines both)
  • Binary classification: BCE with sigmoid output
  • Multi-class classification: Cross-entropy with softmax
  • Multi-label classification: BCE per label (each label independent)
  • Object detection: Focal loss (down-weights easy negatives, focuses on hard examples)
  • Language modeling: Cross-entropy on next-token prediction
  • RLHF / KL penalty: KL divergence between fine-tuned model and reference model to prevent reward hacking
  • VAE: reconstruction loss + KL divergence to standard normal
4. What is backpropagation and the chain rule? Explain vanishing and exploding gradients.

Backpropagation computes gradients of the loss with respect to all parameters by applying the chain rule of calculus backwards through the computation graph.

# Chain rule: ∂L/∂w = ∂L/∂y · ∂y/∂w
# For a 3-layer network:
# ∂L/∂W₁ = ∂L/∂ŷ · ∂ŷ/∂h₂ · ∂h₂/∂h₁ · ∂h₁/∂W₁

# In PyTorch — autograd handles this automatically:
loss = criterion(model(x), y)
loss.backward()   # computes all gradients
optimizer.step()  # updates parameters using gradients

Vanishing gradients: In deep networks, gradients can shrink exponentially as they propagate backward through layers. Sigmoid and tanh saturate (sigmoid's derivative is at most 0.25, tanh's at most 1), so multiplying many small factors drives gradients toward 0 and early layers stop learning. Classic problem in RNNs on long sequences.

Exploding gradients: Gradients grow exponentially → NaN or instability. Common in RNNs and early transformer training.

Solutions:

  • ReLU activation: derivative is 1 for positive inputs — no vanishing. But "dying ReLU" problem (neurons stuck at 0). Use Leaky ReLU, GELU, SiLU for transformers.
  • Residual connections (skip connections): gradient highway — can bypass layers entirely, enabling training of very deep networks
  • Layer normalization: normalizes activations to stable range
  • Gradient clipping: torch.nn.utils.clip_grad_norm_(params, max_norm=1.0) caps gradient magnitude — standard for transformers
  • LSTM/GRU: gated architectures that control information flow, mitigating vanishing gradients in RNNs
  • Careful weight initialization: Xavier/Glorot for sigmoid/tanh, He/Kaiming for ReLU
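The exponential shrinkage is easy to reproduce. Below is a minimal sketch, assuming a chain of scalar sigmoid layers with a shared weight (a deliberately simplified stand-in for a deep network), that applies the chain rule by hand and prints how the gradient with respect to the input decays with depth.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(depth, weight=1.0, x=0.5):
    """d(output)/d(input) for `depth` stacked scalar layers a = sigmoid(weight * a_prev)."""
    a, grad = x, 1.0
    for _ in range(depth):
        a = sigmoid(weight * a)
        grad *= weight * a * (1.0 - a)   # chain rule; sigmoid'(z) = a * (1 - a) <= 0.25
    return grad

for depth in (1, 5, 10, 30):
    print(depth, input_gradient(depth))
```

With `weight=1.0` every layer multiplies the gradient by at most 0.25, so 30 layers shrink it below 0.25**30. Residual connections avoid this by adding an identity path whose derivative is 1.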
5. How do you evaluate classification models? Explain precision, recall, F1, ROC-AUC, and when each matters.
# Confusion matrix terminology:
# TP = True Positive, FP = False Positive, FN = False Negative, TN = True Negative

Precision = TP / (TP + FP)    # of predicted positives, how many are real?
Recall    = TP / (TP + FN)    # of actual positives, how many did we find?
F1        = 2 · (P·R) / (P+R) # harmonic mean — balanced metric
Accuracy  = (TP+TN) / All     # misleading for imbalanced classes!

When to prioritize each:

  • Precision: when false positives are costly — spam filter (annoying to mark legitimate email as spam), fraud detection (blocking valid transactions)
  • Recall: when false negatives are costly — cancer screening (missing a cancer is worse than a false alarm), security threat detection
  • F1: balanced — good default when you care about both; use F-beta (Fβ = (1+β²)·P·R / (β²·P+R)) to weight recall (β>1) or precision (β<1)
  • ROC-AUC: threshold-independent; measures model's ability to rank positives above negatives (AUC=1 perfect, 0.5 random). Good for balanced classes and comparing models.
  • PR-AUC (Average Precision): better than ROC-AUC for highly imbalanced datasets — focuses on positive class performance

Class imbalance: with 99% negatives, a model predicting all-negative gets 99% accuracy but zero recall. Use: stratified sampling, class weights, oversampling (SMOTE), undersampling, or threshold adjustment post-training.
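The all-negative example can be worked through directly from confusion-matrix counts. This sketch uses a hypothetical helper `classification_metrics` and made-up counts for 1,000 samples with 1% positives.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Precision, recall, F-beta and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f_beta": f_beta, "accuracy": accuracy}

# Model that predicts "negative" for everything
all_negative = classification_metrics(tp=0, fp=0, fn=10, tn=990)
print(all_negative)

# Model that finds 8 of the 10 positives at the cost of 20 false alarms
useful = classification_metrics(tp=8, fp=20, fn=2, tn=970)
print(useful)
```

The second model has lower accuracy than the all-negative one but much higher recall, which is why accuracy alone is a poor guide on imbalanced data.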

6. Explain feature engineering, normalization, and handling missing data. What are best practices?

Feature normalization — why it matters: Gradient descent converges faster and more stably when features are on comparable scales. Required for distance-based algorithms (KNN, SVM, KMeans).

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# StandardScaler: z = (x - μ) / σ — zero mean, unit variance
# Use when: features roughly Gaussian, no significant outliers
scaler = StandardScaler().fit(X_train)  # fit ONLY on train set
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)  # use train stats on test!

# MinMaxScaler: x → [0,1] — use for bounded ranges, neural nets
# RobustScaler: uses median and IQR — robust to outliers
# Log transform: for highly skewed features (income, counts)

Handling missing data:

  • MCAR (Missing Completely At Random): safe to drop rows or impute
  • MAR (Missing At Random): impute using other features
  • MNAR (Missing Not At Random): missingness contains information — add indicator feature
# Imputation strategies:
# Simple: mean/median (numerical), mode (categorical) — fast but ignores relationships
# KNN imputation: use k nearest neighbors' values — better but slow
# Iterative imputation (MICE): model each feature as function of others
# Add a binary "was_missing" indicator — let model learn from missingness pattern

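from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before importing IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer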

Data leakage: the #1 mistake. Never fit transformers (scalers, imputers, encoders) on the full dataset — always fit on training set only, then transform train/val/test. Use sklearn Pipeline to enforce this.
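The fit-on-train-only rule applies to imputers just as it does to scalers. As a minimal plain-Python sketch (the helper names `fit_imputer` and `transform` and the toy columns are assumptions for illustration; in practice an sklearn `Pipeline` does this bookkeeping), the fill value is learned from the training column and reused unchanged on the test column, together with a `was_missing` indicator.

```python
import statistics

def fit_imputer(train_column):
    """Learn the fill value from TRAINING data only (median of observed values)."""
    observed = [v for v in train_column if v is not None]
    return statistics.median(observed)

def transform(column, fill_value):
    """Return (imputed values, was_missing indicators) using the stored train statistic."""
    imputed = [fill_value if v is None else v for v in column]
    was_missing = [int(v is None) for v in column]
    return imputed, was_missing

train_age = [25, None, 40, 31, None, 58]
test_age = [None, 90, 85]          # test values must not influence the fill value

fill = fit_imputer(train_age)      # median of [25, 31, 40, 58] = 35.5
train_imputed, train_flag = transform(train_age, fill)
test_imputed, test_flag = transform(test_age, fill)
print(fill, test_imputed, test_flag)
```

Had the median been computed over train and test together, the large test values would have shifted the fill value, leaking test information into training.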

7. What is cross-validation? Explain k-fold, stratified, time-series, and leave-one-out CV.

Cross-validation estimates model generalization by systematically training and evaluating on different data splits — gives more reliable performance estimates than a single train/test split.

k-Fold CV: Split data into k folds. For each fold: train on k-1 folds, evaluate on remaining fold. Average k scores. Typical k=5 or 10.

Stratified k-Fold: ensures each fold has the same class distribution as the full dataset. Essential for imbalanced classes. Always use for classification.

Time-Series CV (Walk-Forward Validation): respects temporal order — always train on past, validate on future. Never shuffle time-series data. Use TimeSeriesSplit from sklearn.

from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
# Split 1: train [0..19], val [20..24]
# Split 2: train [0..24], val [25..29]  ← expanding window
# Split 3: train [0..29], val [30..34]
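The expanding-window pattern in the comments above can be generated by a few lines of index arithmetic. This is a sketch with a hypothetical helper, `walk_forward_splits`; it is not sklearn's `TimeSeriesSplit`, which derives its fold sizes from `n_samples` and `n_splits` and may therefore produce different boundaries.

```python
def walk_forward_splits(n_samples, initial_train, val_size):
    """Yield (train_indices, val_indices) with an expanding training window."""
    start = initial_train
    while start + val_size <= n_samples:
        yield list(range(0, start)), list(range(start, start + val_size))
        start += val_size

splits = list(walk_forward_splits(35, initial_train=20, val_size=5))
for i, (train_idx, val_idx) in enumerate(splits, 1):
    print(f"Split {i}: train [0..{train_idx[-1]}], val [{val_idx[0]}..{val_idx[-1]}]")
```

Every validation index is strictly later than every training index in the same split, which is the property that prevents look-ahead leakage.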

Leave-One-Out (LOO): k=n, so each sample is its own validation set. Nearly unbiased, but the estimate has high variance and requires n model fits. Use for very small datasets.

Nested CV: outer loop evaluates model, inner loop tunes hyperparameters. Unbiased estimate when doing hyperparameter search — prevents optimistic bias from tuning on the test fold.

Group k-Fold: when samples are not independent (multiple measurements from same patient, same user) — groups never span train/test split.

8. Explain common tree-based models: Random Forest, Gradient Boosting, XGBoost. What are their differences?

Random Forest: Bagging ensemble of decision trees. Each tree trained on a bootstrap sample with random feature subsets at each split. Predictions averaged (regression) or majority-voted (classification). Reduces variance through ensemble averaging. Robust, easy to tune, good baseline. Parallelizable.

Gradient Boosting: Sequential ensemble — each tree corrects the errors of the previous. Minimizes a differentiable loss function via gradient descent in function space. Higher accuracy than Random Forest but more prone to overfitting and slower to train. Key hyperparameters: number of trees, learning rate, max_depth (shallow trees work better — typically 3–6).
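The "each tree corrects the errors of the previous" idea can be sketched in a few dozen lines for squared loss, where the negative gradient is simply the residual. This is an illustrative toy (depth-1 stumps on 1-D inputs, hypothetical helper names), not how XGBoost or LightGBM are implemented.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on 1-D inputs (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm

def gradient_boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Boosting for squared loss: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    trees, history = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]   # negative gradient of 0.5*(y - p)^2
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    predict = lambda x: base + learning_rate * sum(t(x) for t in trees)
    return predict, history

xs = [i / 10 for i in range(30)]
ys = [x * x for x in xs]
predict, history = gradient_boost(xs, ys)
print(history[0], history[-1])   # training MSE after the first and the last tree
```

The learning rate shrinks each tree's contribution, so more trees are needed, but the ensemble is less likely to overfit than one that takes full steps.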

XGBoost: Optimized gradient boosting with: second-order Taylor expansion of the loss (faster convergence), L1/L2 regularization, efficient handling of sparse data, column subsampling (like RF), parallel tree construction, out-of-core computation for large datasets. Dominant on tabular Kaggle competitions for years.

LightGBM: Leaf-wise tree growth (vs XGBoost's default level-wise growth), which is typically faster and often more accurate on large datasets but can overfit small ones. Uses histogram-based split finding, which is memory efficient. A strong default for large tabular datasets.

CatBoost: Handles categorical features natively (no one-hot encoding needed). Ordered boosting to reduce overfitting. Good when dataset has many categorical features.

When to choose: Start with LightGBM or XGBoost for tabular data. Use Random Forest when you need a robust, low-tuning baseline or quick feature-importance estimates. Deep learning typically surpasses tree models on text and images, and often on large time-series problems, given sufficient data.

9. What is the curse of dimensionality? How do dimensionality reduction techniques like PCA and t-SNE work?

The curse of dimensionality: as the number of features grows, data becomes increasingly sparse in the feature space, distances lose contrast (the nearest and farthest neighbors become almost equally far away), and exponentially more data is needed to maintain density. Distance-based algorithms (KNN, clustering) suffer most.
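The loss of distance contrast can be observed with uniformly random points. This is a small sketch, assuming points drawn from the unit hypercube and a made-up `relative_contrast` measure (farthest minus nearest distance, divided by the nearest distance).

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from one query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, point))))
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

The contrast shrinks as the dimension grows, so "nearest neighbor" carries less and less information unless the data lies on a lower-dimensional structure.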

PCA (Principal Component Analysis): Linear dimensionality reduction. Finds orthogonal directions (principal components) of maximum variance. Projects data onto top-k components.

from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)  # keep 95% of variance
X_reduced = pca.fit_transform(X_scaled)  # must scale first!
print(pca.explained_variance_ratio_)     # variance per component

# Use cases: preprocessing for linear models, visualization, noise reduction
# Limitation: only captures LINEAR relationships

t-SNE (t-distributed Stochastic Neighbor Embedding): Non-linear. Preserves local neighborhood structure — similar points in high dimensions stay close in 2D. Excellent for visualization.

from sklearn.manifold import TSNE
X_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
# Perplexity: effective number of neighbors (5–50). Higher = more global structure.
# WARNING: Does NOT preserve global distances. Not for downstream ML — only visualization.

UMAP: Faster than t-SNE, preserves more global structure, can be used for preprocessing (unlike t-SNE). Increasingly preferred for embedding visualization.

Autoencoders: Non-linear, learns task-specific compressed representation. Encoder compresses to bottleneck; decoder reconstructs. Bottleneck = low-dimensional representation.

10. Explain data augmentation strategies for different modalities: images, text, tabular, and time series.

Data augmentation artificially increases dataset size by applying label-preserving transformations, reducing overfitting and improving model robustness.

Images:

from torchvision import transforms
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomCrop(224, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomGrayscale(p=0.1),
    # Advanced: Cutout, MixUp, CutMix, AugMix, RandAugment
])
# MixUp: blend two images and labels → trains on convex combinations
# CutMix: paste patches from one image onto another
# AutoAugment: learned augmentation policy (ImageNet-optimal)
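MixUp is simple enough to write out by hand. Below is a minimal sketch on flattened toy "images" represented as Python lists; the `mixup` helper, the pixel values, and `alpha=0.2` are assumptions for illustration. The mixing weight is drawn from a Beta(alpha, alpha) distribution and applied to both inputs and one-hot labels.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

rng = random.Random(0)
cat_pixels, cat_label = [0.9, 0.8, 0.1, 0.0], [1.0, 0.0]   # toy flattened "images"
dog_pixels, dog_label = [0.2, 0.1, 0.7, 0.9], [0.0, 1.0]

mixed_x, mixed_y, lam = mixup(cat_pixels, cat_label, dog_pixels, dog_label, rng=rng)
print(lam, mixed_x, mixed_y)
```

The mixed label is a soft target that still sums to 1, so it can be used with a cross-entropy loss that accepts class probabilities.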

Text: Synonym replacement (WordNet/embeddings); random insertion, deletion, swap; back-translation (translate to French then back to English — paraphrase effect); EDA (Easy Data Augmentation); contextual word embedding substitution; paraphrasing with LLMs.

Tabular: SMOTE (synthetic minority oversampling — interpolates between minority class samples); Gaussian noise injection; feature combinations; GANs (CTGAN, TVAE for tabular synthesis).

Time series: Window slicing/cropping; time warping (stretch/compress segments); magnitude warping; jitter (add noise); permutation (shuffle subsequences); frequency domain augmentation; mixup on time series.

Advanced: Self-supervised pre-training on unlabeled data (contrastive learning, masked autoencoders) effectively performs implicit augmentation and produces rich representations before any labeled training.

11. What is transfer learning and why does it work? Explain pre-training, fine-tuning, and feature extraction.

Transfer learning reuses knowledge learned on a source task/domain for a target task/domain. It works because learned representations (edges, textures, concepts, linguistic patterns) are hierarchically transferable — lower layers learn general features, higher layers learn task-specific features.

Feature extraction (frozen backbone): Use the pre-trained model as a fixed feature extractor. Only train a new classification head. Fast, needs little data, avoids overfitting on small datasets.

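import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT)  # `pretrained=True` is deprecated
for param in model.parameters():
    param.requires_grad = False   # freeze all layers
model.fc = nn.Linear(model.fc.in_features, num_classes)  # replace final layer (in_features = 2048)
# Only model.fc parameters are updated during training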

Full fine-tuning: Unfreeze all layers, train end-to-end with small learning rate. Better performance when you have enough labeled data. Risk of catastrophic forgetting — use smaller LR for early layers, larger for late layers (discriminative fine-tuning).

from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
# Fine-tune the entire model with per-group learning rates:
optimizer = AdamW([
    {'params': model.bert.parameters(), 'lr': 2e-5},        # small LR for backbone
    {'params': model.classifier.parameters(), 'lr': 1e-3},  # larger LR for head
])

When NOT to transfer: When the source domain is very different from target (limited transfer). When you have massive labeled data (diminishing returns). When the pre-trained model is too large for your inference constraints.

12. Explain hyperparameter optimization: grid search, random search, Bayesian optimization, and early stopping.

Grid search: exhaustive search over a predefined parameter grid. Scales poorly — O(product of all values). Only practical for <3 hyperparameters with few values each.

Random search: samples randomly from parameter distributions. Surprisingly effective — high-dimensional spaces often have low effective dimensionality (few parameters actually matter). Each trial covers more of the important parameter ranges than grid search.

Bayesian optimization: builds a probabilistic surrogate model (Gaussian Process or Tree-structured Parzen Estimator) of the objective function. Uses the surrogate to select the next promising configuration, balancing exploration vs exploitation. Often much more sample-efficient: it can find good hyperparameters in tens of trials where random search may need hundreds.

import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.1, 0.5)
    n_layers = trial.suggest_int("n_layers", 1, 5)
    model = build_model(lr, dropout, n_layers)
    return train_and_evaluate(model)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100, n_jobs=4)

Population-Based Training (PBT): trains a population of models in parallel, periodically copying weights from better-performing members and perturbing their hyperparameters. Introduced by DeepMind, where it was applied to RL among other tasks; implementations are available in tools such as Ray Tune.

Early stopping: monitor validation metric, stop training when it stops improving for N epochs (patience). Prevents overfitting and saves compute. Save best checkpoint, not final weights.
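The patience logic can be sketched as a small function over a list of per-epoch validation losses. The helper `early_stopping` and the loss values are made up for illustration; in a real loop the "best" branch is where the checkpoint would be saved.

```python
def early_stopping(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0   # checkpoint would be saved here
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch                    # stop; restore best checkpoint
    return best_epoch, len(val_losses) - 1

losses = [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.90]
best_epoch, stop_epoch = early_stopping(losses, patience=3)
print(best_epoch, stop_epoch)   # 2 5
```

Training stops at epoch 5 after three epochs without improvement, and the weights from epoch 2 are the ones to keep.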

Deep Learning

10 questions
1. Explain batch normalization and layer normalization. When do you use each?

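Normalization layers stabilize and speed up training. The original motivation was reducing internal covariate shift (the change in activation distributions as parameters update); later analyses suggest much of the benefit comes from a smoother optimization landscape.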

Batch Normalization (BatchNorm): Normalizes across the batch dimension for each feature. Computes mean and variance across the batch at each layer.

# BatchNorm: normalize across batch for each feature
# Input: (N, C, H, W) — normalize per channel across N, H, W
# At training: use batch statistics (mean, var)
# At inference: use running statistics (accumulated during training)

# Problems:
# - Batch-size dependent: unstable with small batches (< 8-16)
# - Not suitable for RNNs/sequential models (variable length)
# - Requires synchronization in distributed training

Layer Normalization (LayerNorm): Normalizes across the feature dimension for each sample independently. No batch-size dependency.

# LayerNorm: normalize across features for each sample
# Input: (N, L, D) — normalize per (N, L) position across D features
# Same behavior at train and inference time
# Standard for Transformers, RNNs, NLP tasks

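import torch.nn as nn
# In a Transformer block (inside an nn.Module):
self.norm1 = nn.LayerNorm(d_model)
x = x + self.self_attention(self.norm1(x))     # pre-norm (GPT-2 / LLaMA style)
# x = self.norm1(x + self.self_attention(x))   # post-norm (original Transformer, BERT)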

Other variants:

  • Instance Norm: normalizes per sample per channel — used in style transfer
  • Group Norm: normalizes within groups of channels — works well with small batches, used in object detection
  • RMS Norm (RMSNorm): simplified LayerNorm without mean subtraction — used in LLaMA, Mistral for efficiency

Rule of thumb: BatchNorm for CNNs on images (large batch). LayerNorm for Transformers and NLP. GroupNorm for small-batch detection tasks.
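The difference between the two layers comes down to which axis the statistics are computed over. This is a minimal plain-Python sketch on a (batch, features) matrix; it omits the learnable scale and shift and BatchNorm's running statistics, and the helper names are illustrative.

```python
import math

def normalize(values, eps=1e-5):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(batch):
    """Normalize each feature (column) using statistics over the batch."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    """Normalize each sample (row) using statistics over its own features."""
    return [normalize(row) for row in batch]

batch = [[1.0, 200.0, 3.0],
         [2.0, 100.0, 6.0],
         [3.0, 300.0, 9.0]]

bn, ln = batch_norm(batch), layer_norm(batch)
print(bn)
print(ln)
```

The LayerNorm output for a sample is the same whether it is processed alone or in a batch, which is why it behaves identically at training and inference time.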

2. Explain activation functions: ReLU, GELU, SiLU, Swish. Why do transformers use GELU?

ReLU: f(x) = max(0, x) — simple, fast, no vanishing gradient for positive values. Problems: "dying ReLU" (neurons stuck at 0 when bias negative), not differentiable at 0, no negative values.

Leaky ReLU: f(x) = x if x>0, else αx — fixes dying ReLU. α typically 0.01. Parametric ReLU (PReLU) learns α.

GELU (Gaussian Error Linear Unit): f(x) = x · Φ(x) where Φ is the Gaussian CDF. Smooth approximation to ReLU that gates the input with its own probability of being positive. Empirically outperforms ReLU in transformer models (BERT, GPT all use GELU).

# GELU approximation (fast version used in practice):
# f(x) ≈ 0.5x · (1 + tanh(√(2/π) · (x + 0.044715x³)))

import torch.nn.functional as F
output = F.gelu(x)  # standard in transformers

SiLU / Swish: f(x) = x · σ(x) — self-gated activation. Similar to GELU, smooth, non-monotonic. Used in EfficientNet, Llama models.

Why GELU for transformers: Smooth (differentiable everywhere), can produce negative values (richer representations), empirically better than ReLU on NLP tasks. The probabilistic gating aligns well with attention mechanisms that also gate information.

Mish: f(x) = x · tanh(softplus(x)) — smooth, non-monotonic, often better than ReLU for vision.
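The formulas above translate directly into code, which also makes it possible to check how close the tanh approximation is to exact GELU. This is a scalar sketch using only the standard library; frameworks provide vectorized versions such as `F.gelu` and `F.silu`.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def silu(x):
    """SiLU / Swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}  "
          f"gelu_tanh={gelu_tanh(x):.4f}  silu={silu(x):.4f}")
```

Unlike ReLU, both GELU and SiLU return small negative values for moderately negative inputs, so their gradients there are nonzero.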

3. What are learning rate schedules? Explain warmup, cosine annealing, and OneCycleLR.

Learning rate schedules adjust the learning rate during training: typically small at the start (warmup), high through the early and middle phase for fast learning, and low at the end for fine-grained convergence.

Linear warmup: Start with very small LR, linearly increase to target LR over first N steps. Essential for transformers — without warmup, large LR on random weights causes instability and divergence in early training. Typical warmup: 1–4% of total steps.

Cosine annealing: LR follows a cosine curve from max to min (or to 0). Smooth decay, can restart at end of cycle (cosine annealing with restarts — SGDR) to escape local minima.

from torch.optim.lr_scheduler import (
    CosineAnnealingLR, CosineAnnealingWarmRestarts, OneCycleLR
)
from transformers import get_cosine_schedule_with_warmup

# Common transformer schedule: linear warmup + cosine decay
# (BERT's original recipe used linear decay: get_linear_schedule_with_warmup)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.04 * total_steps), num_training_steps=total_steps
)

# OneCycleLR (1cycle policy), associated with "super-convergence"
scheduler = OneCycleLR(optimizer, max_lr=3e-4,
    steps_per_epoch=len(train_loader), epochs=10,
    pct_start=0.3)       # first 30% of steps ramp the LR up
# LR goes: max_lr/div_factor (default 25) → max_lr → a much smaller final LR
# Can allow larger peak LRs and fewer epochs than a fixed-LR baseline

ReduceLROnPlateau: reduces LR when validation metric stops improving — adaptive but reactive. Useful when you don't know total training steps upfront.

Practical guidance: For transformers, use linear warmup + cosine decay as the default. For CNNs, OneCycleLR often gives good results. Warmup is typically 1–10% of total steps. Final LR ≈ 1/10 of peak LR.
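The warmup-plus-cosine shape can be written as a pure function of the step number. This is a sketch with assumed values (10,000 steps, 4% warmup, peak 3e-4, final LR one tenth of peak); `lr_at` is a hypothetical helper, not a PyTorch or Hugging Face API.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total_steps, peak_lr = 10_000, 3e-4
warmup_steps = int(0.04 * total_steps)          # 4% warmup
for step in (0, 200, 400, 5_200, 10_000):
    print(step, lr_at(step, total_steps, peak_lr, warmup_steps, min_lr=3e-5))
```

The same function can be plugged into `torch.optim.lr_scheduler.LambdaLR` by returning the ratio to the optimizer's base learning rate instead of the absolute value.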

4. What is mixed precision training (FP16/BF16)? How does it reduce memory and speed up training?

Mixed precision training uses lower-precision floating point for most computations while maintaining FP32 master weights for numerical stability.

FP32: 32-bit float, 1 sign + 8 exponent + 23 mantissa bits. Full precision, high memory.

FP16: 16-bit float, smaller range (max ~65504). 2× memory savings, 2-8× faster on Tensor Cores. Risk: overflow/underflow for values outside range. Requires loss scaling to prevent gradient underflow.

BF16 (Brain Float 16): 16-bit with same exponent range as FP32 (8 bits) but fewer mantissa bits (7). Less precise but same dynamic range — no overflow issues. Preferred for LLM training (A100, H100 GPUs support natively).
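FP16's limited range, and why loss scaling helps, can be demonstrated with the standard library alone. This sketch round-trips values through IEEE half precision using the `struct` module's `'e'` format; the gradient value and the scale factor are made-up examples.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

largest = to_fp16(65504.0)            # largest finite FP16 value survives
try:
    to_fp16(70000.0)                  # beyond the FP16 range
    overflowed = False
except (OverflowError, struct.error):
    overflowed = True

tiny_grad = 1e-8
flushed = to_fp16(tiny_grad)          # underflows to 0.0: the gradient is lost

scale = 65536.0                       # loss-scaling factor (a power of two)
recovered = to_fp16(tiny_grad * scale) / scale   # unscale in higher precision
print(largest, overflowed, flushed, recovered)
```

Scaling the loss multiplies every gradient by the same factor, moving small gradients into FP16's representable range; dividing by the factor before the optimizer step restores their magnitude. BF16 keeps FP32's exponent range, so this workaround is not needed there.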

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # for FP16 loss scaling

for inputs, labels in dataloader:
    optimizer.zero_grad()

    with autocast(dtype=torch.float16):  # or bfloat16
        output = model(inputs)
        loss = criterion(output, labels)

    scaler.scale(loss).backward()   # scaled gradients
    scaler.unscale_(optimizer)       # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)           # update with unscaled grads
    scaler.update()                  # adjust scale factor

# BF16 (A100/H100): same dynamic range as FP32, so no GradScaler is needed
with autocast(dtype=torch.bfloat16):
    loss = criterion(model(inputs), labels)
loss.backward()
optimizer.step()

Memory savings: FP16/BF16 activations = 2× memory reduction. With gradient checkpointing, can train models 4-8× larger on same hardware. Critical for LLM fine-tuning on consumer GPUs.

5. What is gradient checkpointing and how does it trade compute for memory?

During backpropagation, PyTorch must retain all intermediate activations (forward pass outputs) to compute gradients. For deep models, this can require enormous GPU memory — proportional to model depth × batch size × activation size.

Gradient checkpointing (activation checkpointing): Discards intermediate activations during the forward pass. During backpropagation, recomputes them from the nearest checkpoint on-demand. Trades memory for extra compute (~33% slower, but enables much larger models).

import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TransformerLayer(nn.Module):
    def layer_forward(self, x):
        # Plain forward: attention/FFN activations are retained for backward
        return self.ffn(self.attention(x))

    def forward(self, x):
        # With checkpointing: activations inside layer_forward are dropped
        # after the forward pass and recomputed during backward
        return checkpoint(self.layer_forward, x, use_reentrant=False)

# HuggingFace Transformers — single flag
model.gradient_checkpointing_enable()

# Selective checkpointing — checkpoint expensive layers only
# Typical strategy: checkpoint every k transformer layers

Memory vs compute tradeoff:

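  • No checkpointing: O(N) activation memory for N layers, no extra compute
  • Checkpoint every √N layers: O(√N) activation memory, at the cost of roughly one extra forward pass (each dropped activation is recomputed once)
  • Checkpoint at every layer boundary (the usual per-block setting for transformers): still stores N block inputs, but frees the much larger activations inside each block, again for roughly one extra forward pass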
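The √N figure falls out of a simple cost model. The sketch below assumes, as a simplification, that each layer produces one unit of activation memory, that one checkpoint is kept per segment for the whole backward pass, and that only the segment currently being recomputed holds its full activations. The helper `stored_activations` is hypothetical.

```python
import math

def stored_activations(n_layers, segment):
    """Rough peak count of stored layer activations when checkpointing every `segment` layers."""
    checkpoints = math.ceil(n_layers / segment)   # kept for the whole backward pass
    live_segment = segment                        # recomputed activations of one segment
    return checkpoints + live_segment

n_layers = 64
costs = {k: stored_activations(n_layers, k) for k in range(1, n_layers + 1)}
best_segment = min(costs, key=costs.get)
print(best_segment, costs[best_segment])   # 8 16
```

For 64 layers the minimum is at a segment length of 8, which is √64, with about 16 units stored instead of 64.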

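Combined with mixed precision, optimizer state optimization (8-bit Adam, Adafactor), and parameter-efficient methods such as QLoRA (4-bit quantized base weights), gradient checkpointing helps make fine-tuning models with tens of billions of parameters feasible on a single 80 GB GPU. Full fine-tuning of a 70B model in 16-bit precision does not fit on one such GPU.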

6. What is distributed training? Explain DDP, model parallelism, and tensor parallelism.

Data Parallelism (DDP, DistributedDataParallel): Each GPU holds a complete model replica. Data is sharded across GPUs. Gradients are all-reduced (averaged) across GPUs during each backward pass. Scales near-linearly with the number of GPUs as long as gradient communication overlaps with computation. Best when the model fits in a single GPU's memory.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)
model = DDP(model.to(local_rank), device_ids=[local_rank])
# Gradient sync is automatic: just train normally
loss.backward()  # gradients are all-reduced across GPUs during backward

Model Parallelism (Pipeline Parallelism): Different layers on different GPUs. Overcomes GPU memory limits for very large models. GPUs process micro-batches in a pipeline. Inter-GPU communication only at layer boundaries.

Tensor Parallelism: Individual weight matrices sharded across GPUs. Each GPU holds a slice of every layer. Requires frequent communication within layers. Used for largest models (GPT-3, Megatron-LM). Megatron-LM splits attention heads across GPUs.

ZeRO (Zero Redundancy Optimizer): Shards optimizer states, gradients, and parameters across data-parallel ranks. Stage 1: optimizer states sharded. Stage 2: + gradients. Stage 3: + parameters. DeepSpeed ZeRO-3 can train models 10× larger than GPU memory. Standard for LLM training.
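The effect of each ZeRO stage on per-GPU memory can be estimated with simple arithmetic. The sketch below assumes mixed-precision Adam with 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for optimizer state (FP32 weights, momentum, variance). Activations and temporary buffers are not counted, and the model size and GPU count are example values.

```python
def zero_memory_gb(n_params, n_gpus):
    """Per-GPU model-state memory (GB) for mixed-precision Adam under each ZeRO stage."""
    params, grads, optim = 2 * n_params, 2 * n_params, 12 * n_params   # bytes
    gb = 1e9
    return {
        "baseline (DDP)": (params + grads + optim) / gb,
        "ZeRO-1": (params + grads + optim / n_gpus) / gb,      # shard optimizer states
        "ZeRO-2": (params + (grads + optim) / n_gpus) / gb,    # + shard gradients
        "ZeRO-3": (params + grads + optim) / n_gpus / gb,      # + shard parameters
    }

memory = zero_memory_gb(n_params=7.5e9, n_gpus=64)
for stage, gb in memory.items():
    print(f"{stage:<15} {gb:6.1f} GB")
```

Under these assumptions a 7.5B-parameter model needs about 120 GB of model state per GPU with plain data parallelism and under 2 GB with ZeRO-3 on 64 GPUs, at the cost of extra communication.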

from deepspeed import DeepSpeedEngine
# ZeRO-3 config:
# "zero_optimization": {"stage": 3, "offload_param": {"device": "cpu"}}

In practice for LLM training: combine data parallelism (across nodes) + tensor parallelism (within a node, where interconnect bandwidth is highest) + pipeline parallelism (across layers/nodes). This 3D parallelism is used by frameworks such as NVIDIA Megatron-LM and DeepSpeed to train models with hundreds of billions of parameters.

7. What are CNNs? Explain convolution, pooling, receptive field, and modern architectures.

Convolution: Slides a learnable filter over the input, computing dot products at each position. Exploits spatial locality (nearby pixels are correlated) and translation equivariance (a feature detector works everywhere in the image). Parameters: kernel_size, stride, padding, dilation.

# Depthwise separable convolution — efficiency trick (MobileNet, EfficientNet)
# Regular conv: kernel_size² × C_in × C_out parameters
# Depthwise: kernel_size² × C_in (one filter per channel)
# Pointwise: C_in × C_out (1×1 conv to mix channels)
# ~8-9× fewer parameters for 3×3 kernels, similar accuracy

nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)               # standard
nn.Conv2d(c_in, c_in, kernel_size=3, groups=c_in, padding=1)   # depthwise
nn.Conv2d(c_in, c_out, kernel_size=1)                          # pointwise
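The parameter savings follow directly from the counts in the comments above. This is a quick arithmetic sketch, ignoring biases and using 256 input and output channels as an example size.

```python
def conv_params(kernel_size, c_in, c_out):
    """Weight count of a standard convolution (bias ignored)."""
    return kernel_size ** 2 * c_in * c_out

def depthwise_separable_params(kernel_size, c_in, c_out):
    """Depthwise (one k x k filter per input channel) + pointwise (1x1) weights."""
    return kernel_size ** 2 * c_in + c_in * c_out

standard = conv_params(3, 256, 256)                   # 589,824
separable = depthwise_separable_params(3, 256, 256)   # 67,840
ratio = standard / separable
print(standard, separable, round(ratio, 2))
```

The ratio approaches kernel_size² (9 for a 3×3 kernel) as the number of output channels grows, because the pointwise term dominates the separable count.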

Pooling: Max pooling takes the maximum in each region — provides translation invariance, reduces spatial dimensions. Global Average Pooling (GAP) averages the entire feature map to a single value per channel — replaces dense layers in modern CNNs.

Receptive field: the region of the input that influences a given neuron. Grows with depth. Large receptive fields capture global context. Dilated convolutions increase receptive field without pooling (used in segmentation).
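Receptive-field growth can be computed layer by layer with the standard recurrence: each layer adds (kernel_size - 1) × dilation × jump, where jump is the product of the strides so far. This is a 1-D sketch; the helper name and layer lists are illustrative.

```python
def receptive_field(layers):
    """Receptive field of stacked conv/pool layers.

    `layers` is a list of (kernel_size, stride, dilation) tuples, input to output.
    """
    rf, jump = 1, 1            # jump = spacing of adjacent outputs, in input pixels
    for kernel_size, stride, dilation in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf

print(receptive_field([(3, 1, 1)] * 3))                       # three 3x3 convs -> 7
print(receptive_field([(3, 1, 1), (2, 2, 1), (3, 1, 1)]))     # conv, 2x2 pool, conv -> 8
print(receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)]))     # dilations 1, 2, 4 -> 15
```

Three stacked 3×3 convolutions see the same 7×7 region as a single 7×7 convolution with fewer parameters, and dilation more than doubles that without any downsampling.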

Architecture evolution: LeNet (1998) → AlexNet (2012, deep learning revival) → VGG (deeper, 3×3 convs) → ResNet (skip connections, 152 layers) → EfficientNet (compound scaling, NAS) → Vision Transformer (ViT, patch-based self-attention) → ConvNeXt (modernized ResNet with transformer design choices).

Vision Transformers (ViT): Split image into patches (16×16), linearly embed each, treat as tokens in a Transformer. Weak inductive bias compared with CNNs (no built-in locality or translation equivariance), so ViTs need more data to work well, but they scale better. State-of-the-art on ImageNet with sufficient data/compute.

8. What are RNNs, LSTMs, and GRUs? What problems did they solve and why were they superseded?

RNN (Recurrent Neural Network): Processes sequences step-by-step, maintaining a hidden state. Each step: h_t = tanh(W·h_{t-1} + U·x_t + b). Problem: vanilla RNNs suffer severe vanishing gradients for long sequences — cannot learn long-range dependencies.

LSTM (Long Short-Term Memory): Introduces a cell state (memory) and three gates (forget, input, output) that control information flow. The forget gate decides what to erase, the input gate what to write, and the output gate what to expose. Because the cell state is updated additively, gradients flow through it with far less vanishing than in a vanilla RNN, so LSTMs can learn dependencies hundreds of steps back.

# LSTM gates (simplified):
f_t = σ(W_f·[h_{t-1}, x_t])        # forget: what to erase from cell
i_t = σ(W_i·[h_{t-1}, x_t])        # input: what to write
g_t = tanh(W_g·[h_{t-1}, x_t])     # candidate cell content
o_t = σ(W_o·[h_{t-1}, x_t])        # output: what to expose
c_t = f_t⊙c_{t-1} + i_t⊙g_t       # update cell state
h_t = o_t⊙tanh(c_t)                # new hidden state

GRU (Gated Recurrent Unit): Simplified LSTM with two gates (reset, update). Fewer parameters, often comparable performance. Faster to train.

Why superseded by Transformers:

  • Sequential computation — cannot be parallelized across time steps (slow training)
  • Information bottleneck — entire sequence compressed into a fixed-size hidden state
  • Transformers process all positions in parallel via self-attention, which uses modern accelerators far more efficiently during training
  • Direct attention to any position — no information loss across distance

LSTMs/GRUs are still used for streaming inference (step-by-step, no future context needed), on-device models (the recurrent state is small and fixed-size), and as part of hybrid architectures.

9. What are contrastive learning and self-supervised learning? Explain CLIP, SimCLR, and DINO.

Self-supervised learning creates supervision signals from the data itself — no human labels required. The model learns rich representations by solving pretext tasks.

Contrastive learning: Train a model to bring similar (positive) pairs close in embedding space and push dissimilar (negative) pairs apart.

SimCLR: Creates two augmented views of the same image (positive pair). All other images in the batch are negatives. Uses NT-Xent (normalized temperature-scaled cross-entropy) loss. Encoder learns invariances to augmentations.

# Contrastive loss (InfoNCE):
# sim(z_i, z_j) = cosine similarity of normalized embeddings
# L = -log( exp(sim(z_i,z_j)/τ) / Σ_k exp(sim(z_i,z_k)/τ) )
# τ = temperature (0.07 typical) — controls hardness of negatives
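
The loss above can be written out directly for a single anchor. This is a minimal pure-Python sketch, assuming one positive and an explicit list of negatives; the toy 2-D embeddings and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, tau=0.07):
    """-log of the softmax probability assigned to the positive pair."""
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


anchor = [1.0, 0.0]
# Positive close to the anchor: small loss
aligned = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# "Positive" far from the anchor while a negative sits close: large loss
misaligned = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(aligned, misaligned)
```

A small τ sharpens the softmax, so the closest negatives dominate the denominator; that is the sense in which temperature controls how hard the negatives are.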

CLIP (Contrastive Language-Image Pre-training, OpenAI): Trains an image encoder and a text encoder jointly. Positive pairs: (image, its caption). Negatives: all other image-text combinations in the batch. Learns to align visual and language representations. Enables zero-shot classification: compare image embedding to text embeddings of class names.

DINO (Self-DIstillation with NO labels): Student-teacher framework where both student and teacher are ViTs. Teacher is an exponential moving average of student weights. No negative pairs needed — avoids representation collapse via centering and sharpening. Learns excellent features without labels; attention maps naturally segment objects.

Masked Autoencoders (MAE): Mask 75% of image patches, reconstruct from visible 25%. Forces model to learn rich global representations. Extremely efficient — only process 25% of patches in encoder.

10. What are diffusion models? How do they work and how do they compare to GANs and VAEs?

Diffusion models learn to reverse a gradual noising process. Forward process: gradually add Gaussian noise to data over T steps until pure noise. Reverse process: a neural network (UNet) learns to predict and remove noise step-by-step.

# Forward process (fixed): q(x_t | x_{t-1}) = N(√(1-β_t)·x_{t-1}, β_t·I)
# Reverse process (learned): p_θ(x_{t-1} | x_t)
# Model predicts noise ε at each timestep:
# L = E[||ε - ε_θ(√ᾱ_t·x_0 + √(1-ᾱ_t)·ε, t)||²]

# Classifier-free guidance (CFG) — controls adherence to prompt
# ε_guided = ε_uncond + w·(ε_cond - ε_uncond)
# Higher w (7-12) → closer to prompt, less diversity
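
To make the symbols above concrete, here is a small sketch of the closed-form forward sample and the guidance formula. It assumes a linear β schedule from 1e-4 to 0.02 over 1000 steps (a common choice, not something the text above specifies), and all function names are hypothetical.

```python
import math
import random


def linear_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products ᾱ_t = Π_{s≤t} (1 - β_s) for an assumed linear β schedule."""
    alpha_bars, running = [], 1.0
    for t in range(T):
        beta_t = beta_start + (beta_end - beta_start) * t / (T - 1)
        running *= 1.0 - beta_t
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t = √ᾱ_t·x_0 + √(1-ᾱ_t)·ε in one shot; returns (x_t, ε)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bars[t]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps


def classifier_free_guidance(eps_uncond, eps_cond, w):
    """ε_guided = ε_uncond + w·(ε_cond - ε_uncond), elementwise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


alpha_bars = linear_alpha_bars()
x_T, eps = q_sample([1.0, -1.0], len(alpha_bars) - 1, alpha_bars, random.Random(0))
guided = classifier_free_guidance([0.0, 0.0], [1.0, -1.0], w=7.5)
```

At the last timestep ᾱ_T is close to zero, so x_T is almost exactly the sampled noise ε; during training, ε is the regression target for ε_θ.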

Key variants:

  • DDPM: Original denoising diffusion probabilistic model — 1000 steps, slow sampling
  • DDIM: Deterministic sampling in ~50 steps with comparable quality, much faster
  • Latent Diffusion (Stable Diffusion): Diffusion in VAE latent space (64×64) instead of pixel space (512×512) — 4-8× faster, less memory
  • SDXL, FLUX: Higher resolution, better coherence, improved text following

Comparison:

  • VAEs: Fast inference, smooth latent space, easy interpolation. Blurry outputs — posteriors are Gaussian, averaging effect. Good for latent representation learning.
  • GANs: Sharp outputs, fast inference. Training unstable (mode collapse, discriminator failure). Diverse samples but hard to train reliably.
  • Diffusion: Best quality, highest diversity, stable training. Slow inference (many denoising steps). Current state-of-the-art for image/audio/video generation.

Transformers & LLMs

12 questions
1. Explain the Transformer architecture in detail — self-attention, multi-head attention, positional encoding, and the FFN layer.

The Transformer (Vaswani et al., 2017) replaced recurrence with self-attention, enabling parallel computation over sequences.

Self-Attention:

# For each token, compute Query, Key, Value projections:
Q = X·W_Q,  K = X·W_K,  V = X·W_V   # (seq_len × d_k)

# Scaled dot-product attention:
Attention(Q,K,V) = softmax(Q·Kᵀ / √d_k) · V
# Q·Kᵀ: similarity scores between all pairs of tokens — (seq_len × seq_len)
# √d_k scaling: prevents softmax saturation with large d_k
# Result: each token is a weighted sum of all Values
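
The formula above can be traced end to end on a tiny example. The following is a single-head sketch in plain Python (no batching, no learned projections); the 3-token inputs are arbitrary illustrative values.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Scaled dot-product attention for one head. Returns (outputs, weights)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Similarity of query i with every key, scaled by √d_k
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Mask future positions so token i only sees tokens 0..i
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        # Output i is the attention-weighted sum of the value vectors
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stands in for Q, K and V
out, attn = attention(X, X, X, causal=True)
```

With the causal mask, the first token can only attend to itself, so its output equals its own value vector.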

Multi-Head Attention: Run h parallel attention heads with different projections. Each head learns different types of relationships (syntactic, semantic, coreference). Concatenate outputs: MultiHead = Concat(head₁,...,headₕ)·W_O.

Positional Encoding: Transformers have no recurrence → no inherent position information. Add positional information to token embeddings. Original: sinusoidal encodings. Modern LLMs use Rotary Position Embedding (RoPE) — encodes relative positions via rotation in embedding space, extrapolates better to longer sequences than the model saw during training.

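FFN (Feed-Forward Network): Applied position-wise after attention. Two linear layers with an activation in between: FFN(x) = max(0, xW₁+b₁)W₂+b₂ (ReLU in the original paper; modern LLMs often use GELU or SwiGLU). The hidden dimension is typically 4× the model dimension. Interpretability work suggests FFN layers act as key-value memories that hold much of the model's factual knowledge, so larger FFNs can store more facts.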

Full transformer block:

# Pre-norm (modern LLMs) — more stable training:
x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))

# Post-norm (original paper):
x = LayerNorm(x + Attention(x))
x = LayerNorm(x + FFN(x))
2. What is the attention complexity problem? Explain Flash Attention, sparse attention, and efficient variants.

Standard self-attention has O(n²) time and memory complexity in sequence length n. For n=100K tokens, the attention matrix is 100K × 100K = 10B entries — infeasible in GPU memory.

Flash Attention (Dao et al., 2022): The key insight is that storing the full n×n attention matrix is wasteful. Flash Attention computes attention in tiles that fit in SRAM (fast on-chip memory), avoiding slow HBM (GPU memory) reads/writes. Same mathematical output as standard attention but:

  • Memory: O(n) instead of O(n²) — doesn't materialize the full attention matrix
  • Speed: 2-4× faster than standard attention through IO-aware tiling
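  • FlashAttention-2: better parallelism and work partitioning, roughly 2× faster again on A100-class GPUs; FlashAttention-3 is the version tuned for the Hopper (H100) architecture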
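# In PyTorch 2.3+ (earlier 2.x releases used torch.backends.cuda.sdp_kernel):
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):  # restrict SDPA to the Flash backend
    output = F.scaled_dot_product_attention(q, k, v, is_causal=True)
# Without the context manager, PyTorch automatically uses Flash Attention when available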

Sparse attention patterns:

  • Sliding window: each token attends to nearby tokens only (Longformer)
  • Strided / dilated: each token attends to every n-th position (Sparse Transformer; Longformer's dilated windows)
  • Global tokens: special [CLS]-like tokens attend to all tokens, and all tokens attend to them (Longformer, BigBird)
  • Linear attention: reformulate as kernel function, O(n) complexity but lower quality

Grouped Query Attention (GQA): reduces KV cache memory by sharing K/V heads across groups of Q heads. Used in Llama 2 70B, Llama 3, and Mistral. Multi-Query Attention (MQA) is the extreme case: a single K/V head shared by all Q heads.

3. What are the major LLM architectures? Compare GPT, BERT, T5, and modern models like Llama and Mistral.

GPT (Decoder-only, autoregressive): Causal masking means each token attends only to previous tokens. Trained with next-token prediction. Best for generation tasks: chat, code, completion. Openly documented instruction-tuned models such as Llama, and reportedly proprietary ones such as GPT-4, Claude, and Gemini, use decoder-only architectures.

BERT (Encoder-only, bidirectional): Each token attends to all other tokens. Trained with masked language modeling (predict masked tokens) + next sentence prediction. Not generative. Best for classification, NER, QA (extractive). Smaller, faster inference for understanding tasks.

T5 (Encoder-Decoder): Encoder processes input bidirectionally; decoder generates output autoregressively. Framed all NLP tasks as text-to-text (input: "translate English to French: …", output: "…"). Flexible but larger for serving. Used in summarization, translation, dialogue.

Modern decoder-only improvements:

  • Llama 2/3 (Meta): RoPE positional encoding, SwiGLU activation, pre-norm with RMSNorm, and grouped query attention (larger Llama 2 variants, all Llama 3 sizes). Open weights, among the strongest openly available models at release.
  • Mistral 7B: Sliding window attention (4K-token window, which keeps long contexts cheap), grouped query attention; outperformed Llama 2 13B on reported benchmarks at roughly half the size.
  • Mixtral (Sparse MoE): Mixture of Experts with 8 FFN experts per layer; a router selects the top-2 per token. ~47B total params but only ~13B active per token, so the model holds roughly 3.6× more parameters than it pays for in per-token compute.
  • Gemma, Phi, Qwen: Smaller but highly capable models with specialized training data.
4. What is the KV cache? How does it work and why is it critical for LLM inference?

During autoregressive generation, the model generates one token at a time. Without caching, each new token requires recomputing keys and values for ALL previous tokens: O(n) redundant K/V projections per step and O(n²) over the whole generation, on top of the attention computation itself.

KV cache: store the K and V tensors from the attention computation for all previously processed tokens. When generating token n+1, only compute K/V for the new token and append to the cache. Attend over the full cached K/V sequence.

# Conceptually:
# Prefill phase: process entire prompt in parallel, populate KV cache
# Decode phase: generate one token at a time, using cached K/V
# Each decode step: O(1) new K/V computed, O(n) cache read

# KV cache memory:
# Size = 2 (K+V) × n_layers × n_kv_heads × d_head × seq_len × dtype_bytes
# Llama-3 8B, 4K tokens, BF16: 2 × 32 × 8 × 128 × 4096 × 2 bytes ≈ 512 MB
# 32K context: ~4 GB — KV cache dominates for long contexts
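
The size formula above is easy to wrap in a helper for capacity planning. A minimal sketch, using the Llama-3 8B shape quoted in the comments (32 layers, 8 KV heads, head dimension 128); the function name and the batch_size parameter are hypothetical additions.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, dtype_bytes=2, batch_size=1):
    """Bytes needed to cache K and V for every layer, KV head and position."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * dtype_bytes * batch_size


MIB = 1024 ** 2
gqa_4k = kv_cache_bytes(32, 8, 128, 4096)      # GQA with 8 KV heads, BF16
gqa_32k = kv_cache_bytes(32, 8, 128, 32768)
mha_4k = kv_cache_bytes(32, 32, 128, 4096)     # same shape with 32 KV heads (no GQA)
print(gqa_4k // MIB, gqa_32k // MIB, mha_4k // MIB)  # sizes in MiB
```

The cache scales linearly with sequence length and batch size, which is why long-context, high-concurrency serving is usually limited by KV memory rather than by weights.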

KV cache optimizations:

  • Quantized KV cache: INT8/INT4 quantization reduces memory 2-4×
  • PagedAttention (vLLM): manages KV cache like OS virtual memory — paged blocks, eliminates fragmentation, enables much higher GPU utilization
  • Prefix caching: cache KV for repeated prefixes (system prompts) across requests — massive speedup for long system prompts
  • Streaming LLM: sliding window KV cache — keep only recent tokens + "attention sink" tokens
  • GQA/MQA: reduce KV cache size by sharing heads
5. What is tokenization? Explain BPE, WordPiece, SentencePiece, and how tokenization affects model behavior.

Tokenization converts raw text into a sequence of integer IDs for the model. The tokenizer defines the model's vocabulary and directly impacts what it can represent.

BPE (Byte-Pair Encoding): Start with character-level vocabulary. Iteratively merge the most frequent adjacent pair of tokens. Stop when vocabulary size is reached. GPT-2, GPT-3, GPT-4, Llama all use BPE (or tiktoken's variant). Typical vocab size: 32K–128K tokens.

# BPE process:
# Start: ["l", "o", "w", "e", "r"] frequency: 500
# Merge "l"+"o"→"lo": ["lo", "w", "e", "r"]
# Merge "lo"+"w"→"low": ["low", "e", "r"]
# ...until vocab size reached

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokens = tokenizer.encode("Hello, world!")  # [128000, 9906, 11, 1917, 0]; 128000 is the BOS token
tokenizer.decode(tokens, skip_special_tokens=True)  # "Hello, world!"
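
The merge loop described above fits in a short function. This is a toy character-level sketch on a made-up word-frequency table, not the byte-level variant or the exact procedure any production tokenizer uses; ties are broken by first occurrence.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from {word: count}. Returns (merges, segmented_vocab)."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


merges, vocab = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent stem "low" is a single token, while the rarer suffixes are still split into characters.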

WordPiece: Similar to BPE but uses likelihood instead of frequency for merges. Used in BERT, DistilBERT. Subwords prefixed with "##" (e.g., "playing" → ["play", "##ing"]).

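SentencePiece: A tokenizer library (implementing BPE and unigram language-model segmentation) that treats text as a raw Unicode character stream, encoding whitespace as an ordinary symbol instead of pre-tokenizing on it. Language-agnostic. Used in T5, XLNet, Llama 1/2, and many multilingual models.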

How tokenization affects behavior:

  • Numbers like "12345" may tokenize as ["123", "45"] — the model can't "see" individual digits easily, hurting arithmetic
  • Non-English text uses more tokens per word (higher cost, shorter effective context)
  • Token boundaries affect model's ability to reason about character-level patterns
  • Vocabulary size tradeoff: larger vocab = fewer tokens per text but larger embedding table
6. What is RLHF (Reinforcement Learning from Human Feedback)? Explain SFT, reward modeling, and PPO/DPO.

RLHF aligns language models with human preferences — making them helpful, harmless, and honest. Used to create ChatGPT, Claude, Llama-2-chat, etc.

Step 1 — Supervised Fine-Tuning (SFT): Fine-tune the base LLM on high-quality demonstration data (human-written ideal responses to instructions). Creates a well-behaved starting point for RL.

Step 2 — Reward Modeling: Collect human preference data: show human rankers pairs of model responses, they choose the better one. Train a reward model (RM) to predict human preference scores. RM takes (prompt, response) → scalar reward.

Step 3 — PPO (Proximal Policy Optimization): Use the RM as a reward signal to further fine-tune the SFT model via RL. Add a KL divergence penalty between fine-tuned model and SFT model to prevent reward hacking (generating responses that fool the RM without being genuinely good).

# PPO objective:
# Reward = RM(prompt, response) - β·KL(π_θ || π_SFT)
# β controls how far the model drifts from SFT baseline
# Higher β → stays closer to SFT, less reward hacking
# Lower β → more optimization, more reward hacking risk

DPO (Direct Preference Optimization): Simplifies RLHF by eliminating the separate reward model and RL training loop. Directly optimizes the LLM on preference pairs using a binary cross-entropy loss derived from the Bradley-Terry preference model.

# DPO loss (simplified):
# L_DPO = -E[log σ(β·(log(π_θ(y_w|x)/π_ref(y_w|x)) - log(π_θ(y_l|x)/π_ref(y_l|x))))]
# y_w = preferred response, y_l = rejected response
# No reward model, no RL — trains faster and more stably than PPO
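
For a single preference pair, the DPO loss reduces to a few lines once the sequence log-probabilities are known. The sketch below assumes those log-probabilities have already been summed over tokens; the numeric values are invented for illustration.

```python
import math


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log σ(β·[(log π_θ(y_w) - log π_ref(y_w)) - (log π_θ(y_l) - log π_ref(y_l))])."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log σ(m) = softplus(-m), computed in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, loss is log 2
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has shifted probability toward the preferred response: lower loss
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
# Policy has shifted toward the rejected response: higher loss
regressed = dpo_loss(-6.0, -4.0, -5.0, -5.0)
print(neutral, improved, regressed)
```

Only the change relative to the reference model matters, which is how DPO keeps an implicit KL-style anchor without a separate reward model.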

RLHF alternatives: RLAIF (AI feedback instead of human feedback), Constitutional AI (Anthropic's self-critique approach), SPIN (self-play), SimPO (simplified DPO without reference model).

7. What is in-context learning and few-shot prompting? How do chain-of-thought prompting and prompt engineering work?

In-context learning: LLMs can learn to perform tasks from examples provided in the prompt — without any parameter updates. The model adapts its behavior purely from context.

  • Zero-shot: just a task description, no examples
  • One-shot: one example
  • Few-shot: 2-20 examples demonstrating the pattern

Chain-of-Thought (CoT) prompting: Adding "Let's think step by step" or providing step-by-step reasoning examples dramatically improves performance on reasoning tasks (math, logic, multi-step problems). The model's intermediate steps serve as a working memory.

# Standard prompting (final answer only; error-prone on multi-step math):
# Q: Roger has 5 tennis balls. He buys 2 more cans, 3 balls each. How many?
# A: 11

# Chain-of-Thought:
# Q: (same) A: Roger starts with 5 balls. He buys 2×3=6 more. 5+6=11 balls.

# Self-consistency: sample multiple CoT paths, take majority vote
# Tree of Thought: explore multiple reasoning branches simultaneously

Key prompt engineering techniques:

  • System prompt: define persona, constraints, output format
  • Role prompting: "You are an expert Python developer..."
  • Output format specification: "Respond only in JSON with keys: {reason, answer, confidence}"
  • Negative examples: show what NOT to do
  • Delimiters: use XML tags or ``` to clearly separate context from instructions
  • ReAct (Reason+Act): interleave reasoning and tool use

Automatic prompt optimization: DSPy (declarative prompt optimization), APE (Automatic Prompt Engineer), and TextGrad use LLMs to optimize prompts programmatically.

8. What are LLM decoding strategies? Explain greedy, beam search, sampling, temperature, top-k, and top-p.

At each generation step, the model outputs a probability distribution over the vocabulary. Decoding strategy determines which token to select next.

Greedy decoding: always pick the highest probability token. Fast, deterministic, but can get stuck in repetitive loops and misses globally good but locally suboptimal paths.

Beam search: maintain k "beams" (partial sequences), expand each, keep top-k by cumulative log-probability. Better than greedy for translation, summarization. Not great for open-ended generation — tends to produce generic, low-entropy text.

Sampling: sample from the full probability distribution. More diverse outputs. Can produce incoherent text without truncation.

Temperature scaling: divide logits by T before softmax. T<1: sharper distribution (more confident). T>1: flatter distribution (more random). T→0: greedy. T=1: unchanged.

# Temperature effect:
# logits / T → softmax
# T=0.1: very confident, near-greedy
# T=1.0: original distribution
# T=1.5: more creative, more errors
# Typical for code: T=0.2  |  chat: T=0.7  |  creative: T=1.0-1.2

Top-k sampling: only sample from the top-k tokens by probability. Prevents sampling very unlikely tokens. k=50 is common. Problem: k is fixed regardless of distribution shape.

Top-p (nucleus) sampling: sample from the smallest set of tokens whose cumulative probability ≥ p. Adaptive k — uses fewer tokens when distribution is peaked, more when flat. p=0.9 or p=0.95 are common defaults. Often preferred over top-k.

Min-p sampling: removes tokens with probability < p × max_prob. Newer alternative to top-p, avoids incoherent tokens while allowing more diversity.

Typical production settings: temperature=0.7, top_p=0.9 for chat; temperature=0.2, top_p=0.95 for code generation.
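
The filters described above compose into a small sampling routine. This is one possible ordering (temperature, then top-k with renormalization, then top-p); real inference libraries differ in details, and the function names and toy logits are hypothetical. Temperature must be positive here; T→0 is handled in practice by switching to greedy decoding.

```python
import math
import random


def filtered_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    # (probability, token_id) pairs, most likely first
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]
        mass = sum(p for p, _ in ranked)
        ranked = [(p / mass, i) for p, i in ranked]

    if top_p is not None:
        kept, cumulative = [], 0.0
        for p, i in ranked:
            kept.append((p, i))
            cumulative += p
            if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
                break
        ranked = kept

    mass = sum(p for p, _ in ranked)
    return {i: p / mass for p, i in ranked}


def sample(distribution, rng=random):
    """Draw one token id according to the filtered probabilities."""
    token_ids = list(distribution)
    weights = [distribution[i] for i in token_ids]
    return rng.choices(token_ids, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.0, -1.0]
chat_like = filtered_distribution(logits, temperature=0.7, top_p=0.9)
next_token = sample(chat_like, random.Random(0))
```

On these logits the chat-style settings keep only the two most likely tokens, which shows how top-p adapts the candidate set to the shape of the distribution.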

9. What is quantization? Explain GPTQ, AWQ, GGUF, and how they enable running large models on consumer hardware.

Quantization reduces the precision of model weights and/or activations — smaller storage, faster inference, lower memory usage, at the cost of slight accuracy degradation.

Post-Training Quantization (PTQ): Quantize an already-trained model. No retraining needed. Fast to apply. Some accuracy loss, especially at INT4.

# Memory comparison for Llama-3 8B:
# FP32:  32 GB  (4 bytes/param × 8B params)
# BF16:  16 GB  (2 bytes/param)
# INT8:   8 GB  (1 byte/param)
# INT4:   4 GB  (~0.5 bytes/param)
# 2-bit:  2 GB  (~0.25 bytes/param)
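
The simplest form of post-training quantization, symmetric absmax rounding to INT8, shows where the memory saving and the accuracy loss both come from. This sketch uses one scale for the whole tensor; GPTQ, AWQ and the GGUF K-quants are considerably more sophisticated, and the weight values here are invented.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using a single per-tensor scale."""
    scale = (max(abs(w) for w in weights) / 127.0) or 1.0  # avoid 0 for all-zero input
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(q, scale, max(errors))
```

Each weight is off by at most half a quantization step, so a single large outlier inflates the scale and the error for every other weight; that is the problem per-channel and per-block schemes address.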

GPTQ (post-training quantization for generative pre-trained transformers): Layer-wise quantization. Quantizes weights one layer at a time, using approximate second-order (Hessian) information to minimize each layer's output reconstruction error. INT4 with minimal quality loss. One-time calibration step on a small dataset.

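AWQ (Activation-aware Weight Quantization): Uses activation statistics to find the small fraction of salient weight channels, then protects them by per-channel scaling before quantization rather than keeping them in higher precision, so the whole layer stays in hardware-friendly INT4. Often matches or outperforms GPTQ at INT4 and is faster to apply.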

GGUF (GPT-Generated Unified Format): llama.cpp's file format. Supports mixed precision (different tensors quantized differently). Enables CPU inference with optional GPU offloading. K-quants (Q4_K_M, Q5_K_S) use block-wise scales and spend more bits on the tensors that affect quality most. Ideal for running LLMs on MacBooks and consumer GPUs.

# Run quantized model with llama.cpp:
./llama-cli -m llama-3-8b-Q4_K_M.gguf -p "Explain transformers" -n 512

# In Python via llama-cpp-python:
from llama_cpp import Llama
llm = Llama(model_path="llama-3-8b-Q4_K_M.gguf", n_gpu_layers=35)
output = llm("Explain quantum computing", max_tokens=512)

bitsandbytes (QLoRA): 8-bit or 4-bit quantization integrated with HuggingFace. Crucial for fine-tuning large models on consumer GPUs — load 70B model in 4-bit (~35GB), fine-tune only LoRA adapters.

10. What are speculative decoding and other LLM inference acceleration techniques?

LLM decoding at small batch sizes is memory-bandwidth-bound: the bottleneck is loading weights (and the KV cache) from GPU memory, not arithmetic. Acceleration techniques aim to generate more tokens per memory load.

Speculative Decoding: Use a small, fast draft model to propose k tokens. Verify all k tokens in parallel with one forward pass of the large model. Accept the longest prefix consistent with the large model, then take the large model's own token at the first rejection. The output distribution matches the large model exactly, so there is no quality loss; the speedup (often 2-3×) depends on how often the draft is accepted.

# Speculative decoding flow:
# Draft model generates: ["The", "cat", "sat", "on"]  (k=4 tokens)
# Target model verifies all 4 in one forward pass
# If target agrees on ["The", "cat", "sat"] but not "on":
# Accept "The cat sat", resample from target distribution at position 3
# Net effect: 4 tokens (3 accepted + 1 corrected) from one target forward pass plus 4 cheap draft passes
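
The flow above can be sketched with toy stand-ins for the two models. This is the greedy simplification (accept while the target's argmax equals the draft token); production systems use rejection sampling over full distributions and score all positions in one batched forward pass. The lookup-table "models" and all names are hypothetical.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One greedy speculative-decoding step. Returns the tokens to append to prefix."""
    # 1) The draft model proposes k tokens autoregressively (cheap).
    draft, context = [], list(prefix)
    for _ in range(k):
        token = draft_next(context)
        draft.append(token)
        context.append(token)

    # 2) The target model checks each position. A real system gets all of these
    #    predictions from a single forward pass over prefix + draft.
    accepted = []
    for i in range(k):
        target_token = target_next(list(prefix) + draft[:i])
        if target_token != draft[i]:
            accepted.append(target_token)  # first mismatch: keep the target's token
            return accepted
        accepted.append(draft[i])

    # 3) Every draft token matched: the same pass also yields one bonus token.
    accepted.append(target_next(list(prefix) + draft))
    return accepted


# Toy "models": each returns the next word of a fixed sentence.
TARGET = ["The", "cat", "sat", "by", "the", "door"]
DRAFT = ["The", "cat", "sat", "on", "a", "mat"]
step = speculative_step([], lambda ctx: DRAFT[len(ctx)], lambda ctx: TARGET[len(ctx)])
print(step)
```

Whatever the draft proposes, the appended tokens are exactly what greedy decoding with the target alone would have produced; the draft only affects how many tokens each target pass yields.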

Continuous batching: Traditional static batching waits for all requests to finish before starting new ones. Continuous batching (iteration-level scheduling, introduced by Orca) adds new requests to the batch as soon as a slot frees up. Adopted by vLLM and other serving engines, it dramatically improves GPU utilization.

Tensor parallelism + pipeline parallelism: Distribute the model across multiple GPUs to reduce per-GPU memory and increase throughput.

Prompt caching (prefix caching): Cache KV states for repeated prompts/system prompts. Anthropic and OpenAI offer this as an API feature; applications with stable system prompts can reach very high cache hit rates.

Chunked prefill: Split long prompt prefills into chunks and interleave them with decode steps, so a long prefill does not stall token generation for requests already in flight.

Medusa/EAGLE: Add extra decoding heads to the LLM itself that predict multiple future tokens simultaneously, then verify with the main head. Avoids the need for a separate draft model.

11. What is context length extension? How do RoPE scaling, YaRN, and Longformer-style approaches work?

Base LLMs are trained with fixed context lengths (e.g., 4K or 8K tokens). Extending context at inference time without retraining causes positional encoding out-of-distribution errors — the model has never seen position IDs beyond its training window.

RoPE (Rotary Position Embedding): encodes position via rotation in complex space. Each pair of Q and K dimensions is multiplied by e^{iθm}, where θ is a fixed per-dimension frequency (θ_j = 10000^{-2j/d} in the original formulation) and m is the position index.

Position interpolation: simply scale down all position indices to fit within the original training range. If trained on 4K, for 8K context, divide all positions by 2. Requires a small amount of fine-tuning to adapt. Simple but loses some resolution.
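
Both ideas can be seen in a few lines of code. The sketch below assumes the standard fixed RoPE frequencies θ_j = base^(-2j/d) with base 10000 and models position interpolation as dividing the position index by a scale factor; the vectors and the function names are illustrative.

```python
import math


def rope(vec, position, base=10000.0, scale=1.0):
    """Rotate each consecutive pair of dimensions by angle (position / scale)·θ_j."""
    d = len(vec)
    rotated = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # θ_j with j = i/2, i.e. base^(-2j/d)
        angle = (position / scale) * theta
        x, y = vec[i], vec[i + 1]
        rotated += [x * math.cos(angle) - y * math.sin(angle),
                    x * math.sin(angle) + y * math.cos(angle)]
    return rotated


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Attention scores depend only on the relative offset (3 in both cases)
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 103), rope(k, 100))

# Position interpolation: with scale=2, position 8000 is encoded like position 4000
interpolated = rope(q, 8000, scale=2.0)
in_range = rope(q, 4000)
```

Interpolation keeps every rotation angle inside the range seen during training, at the price of squeezing neighbouring positions closer together, which is the loss of resolution mentioned above.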

YaRN (Yet another RoPE extension): NTK-aware interpolation that applies different scaling to different frequency components of RoPE. High-frequency components (fine-grained position) are extrapolated, low-frequency (coarse position) are interpolated. Much better quality than simple interpolation, especially for very long contexts.

LongRoPE: Finds optimal per-dimension rescaling factors via evolutionary search. Used in Phi-3-mini to extend from 4K to 128K context.

Architectural approaches for very long context:

  • Sliding window attention (Mistral): each token attends to only a local window — O(n·w) not O(n²)
  • Ring attention: distribute the sequence across multiple GPUs, each processes a shard
  • State space and recurrent models (Mamba, RWKV): compute linear in sequence length with a fixed-size state per step; no attention, but different quality tradeoffs
12. What is a Mixture of Experts (MoE) model? How does routing work and what are the tradeoffs?

Mixture of Experts replaces each dense FFN layer with multiple parallel "expert" FFN networks. A learned router selects which experts to activate per token. More total parameters, but only a fraction are active per forward pass.

# MoE FFN layer (simplified):
# 8 experts, top-2 routing:
router_logits = x @ W_router                             # (batch × seq, n_experts)
top2_logits, top2_experts = topk(router_logits, k=2)     # top-2 scores and expert ids per token
gates = softmax(top2_logits)                             # weights over the selected experts

output = sum(gates[j] * experts[e](x) for j, e in enumerate(top2_experts))
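
A runnable single-token version of the routing logic might look like the following. It is a plain-Python sketch under simplifying assumptions: the router weights are hand-picked, each "expert" is a trivial scaling function rather than an FFN, and there is no batching or load balancing.

```python
import math


def moe_forward(x, router_weights, experts, k=2):
    """Route one token through its top-k experts. Returns (output, expert_ids, gates)."""
    # Router: one logit per expert (dot product of x with that expert's router row)
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]

    # Softmax over the selected experts' logits only
    peak = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - peak) for e in top]
    gates = [value / sum(exps) for value in exps]

    # Only the selected experts are evaluated; the rest cost nothing for this token
    expert_outputs = [experts[e](x) for e in top]
    output = [sum(g * out[c] for g, out in zip(gates, expert_outputs))
              for c in range(len(x))]
    return output, top, gates


# Hypothetical setup: 4 experts that just scale their input by 1, 2, 3 and 4
experts = [lambda v, s=s: [s * a for a in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
output, chosen, gates = moe_forward([1.0, 2.0], router_weights, experts)
print(chosen, gates, output)
```

For this token the router picks experts 1 and 3, so only two of the four expert functions run even though all four must stay resident in memory.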

Mixtral 8×7B (Mistral AI): 8 experts per layer, top-2 selected per token. 47B total parameters, but only 13B active per token inference. Quality comparable to Llama-2 70B but inference cost of a 13B model.

Routing strategies:

  • Token-choice (standard): each token chooses its top-k experts
  • Expert-choice: each expert chooses its top-k tokens — guarantees expert utilization but tokens may be dropped
  • Auxiliary load balancing loss: penalizes routing imbalance — prevents all tokens going to the same expert

Tradeoffs:

  • ✅ More capacity (parameters) for same compute cost
  • ✅ Different experts specialize in different content types/languages
  • ❌ Requires all experts in memory: 47B params need ~94 GB in BF16, versus ~26 GB for a dense 13B model with similar per-token compute
  • ❌ Load imbalance can hurt training efficiency
  • ❌ Expert routing adds latency overhead in distributed settings

GPT-4 is widely believed to use MoE. Google's Switch Transformer (2021) demonstrated MoE at trillion-parameter scale.

RAG & Embeddings

10 questions
1. What is RAG and why is it preferred over fine-tuning for knowledge-intensive tasks?

RAG (Retrieval-Augmented Generation) augments an LLM's responses by first retrieving relevant documents from a knowledge base and including them in the prompt context. The LLM then generates a response grounded in the retrieved information.

Why RAG over fine-tuning for knowledge tasks:

  • Up-to-date knowledge: Fine-tuning bakes knowledge into weights — stale the moment training ends. RAG retrieves from a live, updateable index.
  • Verifiability: RAG responses can cite sources. Users can verify claims. Fine-tuned models hallucinate without transparency about where knowledge came from.
  • Cost: Fully fine-tuning a 70B model can cost thousands of dollars and many GPU-hours. Updating a RAG index is cheap by comparison.
  • Long-tail knowledge: Fine-tuning on rare facts often fails — the model averages over training data and loses specifics. RAG retrieves exact documents.
  • Privacy: Sensitive documents stay in a controlled index, not embedded in model weights that may be extracted.

When fine-tuning wins over RAG:

  • Teaching the model a new skill, format, or behavior (not factual knowledge)
  • Latency-critical applications where retrieval adds unacceptable overhead
  • When the knowledge base is small and stable
  • Teaching domain-specific style or tone

Best of both: Fine-tune for behavior + RAG for knowledge. Use RAG for facts, fine-tune for format and style.

2. Explain the RAG pipeline in detail — chunking strategies, embedding, indexing, retrieval, and generation.

1. Document Loading & Chunking:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,        # characters per chunk by default (length_function=len), not tokens
    chunk_overlap=64,      # overlap prevents cutting context at boundaries
    separators=["\n\n", "\n", ". ", " "]  # try larger separators first
)
chunks = splitter.split_documents(docs)

# Chunking strategies:
# Fixed-size: simple but cuts sentences mid-thought
# Sentence-aware: split on sentence boundaries
# Semantic chunking: split when embedding similarity drops (LlamaIndex)
# Document structure-aware: respect headers, paragraphs (best for structured docs)
# Hierarchical: small chunks for retrieval, larger parent chunk in context

2. Embedding: Convert chunks to dense vectors using an embedding model. Choose based on quality vs cost vs language support.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # 1024-dim, high quality
embeddings = model.encode([c.page_content for c in chunks], batch_size=64, normalize_embeddings=True)

3. Indexing: Store embeddings in a vector database with an ANN index (HNSW, IVF). Add metadata for filtering.

4. Retrieval: Embed the query, find top-k nearest chunks by cosine similarity. Apply metadata filters. Optionally rerank.

5. Generation:

context = "\n\n---\n\n".join([chunk.page_content for chunk in retrieved_chunks])
prompt = f"""Answer the question based ONLY on the context below.
If the answer is not in the context, say "I don't know."

Context:
{context}

Question: {question}
Answer:"""
response = llm.generate(prompt)
3. What are embedding models? How do you choose between them? Explain BGE, E5, OpenAI, and Cohere embeddings.

Embedding models convert text into dense vectors where semantic similarity corresponds to geometric proximity. They are typically encoder-only transformers (BERT-style) trained with contrastive objectives on large paired datasets.

Key embedding models:

  • OpenAI text-embedding-3-small/large: API-based, 1536/3072 dims. Best for OpenAI ecosystem, supports Matryoshka (truncate to smaller dimension). Costs money per token.
  • Cohere embed-v3: Excellent multilingual support, input_type parameter (query vs document) — separate encodings for asymmetric retrieval. Strong for production.
  • BGE (BAAI General Embedding): Strong open-source option. bge-large-en-v1.5 ranked at the top of the MTEB leaderboard at release. Add instruction prefix for queries: "Represent this sentence for searching relevant passages: {query}".
  • E5 (Microsoft): Very competitive. e5-mistral-7b-instruct is a 7B-parameter embedding model: among the highest-quality open-source options, but large.
  • nomic-embed-text: 8K token context (most embedding models: 512 tokens). Fully open (weights + training data).
  • all-MiniLM-L6-v2: Small (~22M parameters), fast, good baseline. 384 dims. Great for edge/latency-sensitive applications.

Choosing an embedding model: Evaluate on your domain using MTEB (Massive Text Embedding Benchmark) task categories: Retrieval, STS, Classification, Clustering. In-domain evaluation always beats general benchmarks — test on your own data with your query patterns.

Asymmetric retrieval: Queries and documents have different characteristics. Use separate embeddings or instruction prefixes: "query: " for searches, "passage: " for documents (E5 style).

4. What is hybrid search? Explain BM25, dense retrieval, and how to combine them with reciprocal rank fusion.

BM25 (keyword/sparse retrieval): Classic probabilistic ranking function. Scores documents based on term frequency, inverse document frequency, and document length normalization. Excellent for exact keyword matching, product IDs, proper nouns, technical terms. Fast, no GPU needed.

Dense retrieval (semantic): Uses embedding similarity. Captures meaning and synonyms — "car" matches "vehicle". Fails for exact matching and rare terms not in training data.

Hybrid search: combine both signals for best results. Sparse catches what dense misses and vice versa.

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

# BM25 retrieval
bm25 = BM25Okapi([doc.split() for doc in corpus])
bm25_scores = bm25.get_scores(query.split())
bm25_top = bm25_scores.argsort()[-50:][::-1]  # top 50

# Dense retrieval
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
query_emb = model.encode(query)
doc_embs = model.encode(corpus)
dense_scores = util.cos_sim(query_emb, doc_embs)[0]
dense_top = dense_scores.argsort(descending=True)[:50]

# Reciprocal Rank Fusion (RRF) — combines ranked lists
def rrf(rankings: list[list[int]], k=60) -> list[tuple[int, float]]:
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# RRF with k=60 is a robust default that rarely needs tuning
# Linear combination (α·dense + (1-α)·sparse) requires tuning α

When hybrid clearly beats pure dense: technical documentation with exact model numbers, code search with specific function names, e-commerce with product SKUs, any domain with important rare/specialized terms.

5. What is reranking? How do cross-encoders improve retrieval quality over bi-encoders?

Standard RAG retrieval uses bi-encoders (embedding models) that encode query and documents independently. This is fast but misses fine-grained query-document interactions.

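Cross-encoders (rerankers): Take the query and a candidate document together as input, producing a relevance score that captures their interaction. Much more accurate, but every query-document pair needs its own transformer forward pass at query time and nothing can be precomputed, so scoring n documents costs n model inferences: too slow for full corpus search.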

Two-stage retrieval pipeline:

from sentence_transformers import CrossEncoder

# Stage 1: Fast bi-encoder retrieval — get top 100 candidates
top_100 = dense_retrieval(query, k=100)

# Stage 2: Accurate cross-encoder reranking — rerank top 100
reranker = CrossEncoder("BAAI/bge-reranker-large")
pairs = [(query, doc.page_content) for doc in top_100]
scores = reranker.predict(pairs)
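# Sort on the score only; on ties, plain tuple sorting would try to compare the docs
reranked = sorted(zip(scores, top_100), key=lambda pair: pair[0], reverse=True)[:5]  # take top 5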

# Cross-encoder models:
# BAAI/bge-reranker-v2-m3 — multilingual, strong
# Cohere rerank-english-v3.0 — API, excellent quality
# cross-encoder/ms-marco-MiniLM-L-6-v2 — fast, good quality

Quality improvement: Adding a cross-encoder reranker typically improves NDCG@10 by 5-15% over bi-encoder alone. The gain is especially large when the bi-encoder retrieves many near-miss documents that are lexically similar but semantically irrelevant.

Cost: Reranking 100 docs per query takes ~50-200ms for a cross-encoder. Use Cohere's API to avoid running your own model. For latency-sensitive apps, use a smaller/faster reranker model or reduce the candidate pool.

6What are advanced RAG techniques — HyDE, query rewriting, multi-query, and self-RAG?

HyDE (Hypothetical Document Embeddings): Use the LLM to generate a hypothetical answer to the query, then embed that hypothetical answer for retrieval. The hypothesis is often more similar to real documents than the raw query.

hypothetical = llm.generate(f"Write a passage that answers: {query}")
retrieval_embedding = embed(hypothetical)  # retrieves better than embed(query)
# Works because documents are written in "answer" style, queries in "question" style

Query rewriting: Use an LLM to rephrase ambiguous or conversational queries into retrieval-optimized forms. For multi-turn conversations, expand pronouns and references: "What about its performance?" → "What is the performance of Python's asyncio event loop?"

Multi-query retrieval: Generate N different phrasings of the query, retrieve for each, deduplicate and merge results. More comprehensive coverage of the information space.

queries = llm.generate(f"Generate 3 different search queries for: {original_query}")
results = [retrieve(q) for q in queries]
merged = deduplicate(flatten(results))

Self-RAG: The LLM itself decides when to retrieve (not always), evaluates retrieved documents for relevance, and critiques its own generation for faithfulness and usefulness. More accurate but complex pipeline.

FLARE (Forward-Looking Active Retrieval): Generate the response sentence by sentence; when a tentative next sentence contains low-probability tokens (low model confidence), use that sentence as a retrieval query and regenerate it with the retrieved context.

Corrective RAG (CRAG): After retrieval, evaluate document relevance; if below threshold, rewrite query and search the web. Adds a correction loop to standard RAG.

7How do you evaluate RAG pipelines? What metrics matter and how do you measure hallucination?

RAG evaluation must assess both retrieval quality and generation quality independently, plus the end-to-end pipeline.

Retrieval metrics:

  • Precision@k: fraction of retrieved docs that are relevant
  • Recall@k: fraction of relevant docs that were retrieved
  • NDCG@k (Normalized Discounted Cumulative Gain): accounts for rank — relevant docs higher in the list are valued more
  • MRR (Mean Reciprocal Rank): average of 1/rank_of_first_relevant_doc
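
The four retrieval metrics above can be sketched in a few lines of pure Python. This is a minimal version assuming binary relevance labels; the document IDs are made up for illustration, and MRR is the mean of `reciprocal_rank` over all evaluation queries.

```python
import math

def precision_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    # 1 / rank of the first relevant document, 0 if none was retrieved
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    # Binary gains (1 if relevant), discounted by log2(rank + 1)
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(retrieved[:k]) if d in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

retrieved = ["d3", "d1", "d7", "d2"]   # ranked output of the retriever
relevant = {"d1", "d2", "d9"}          # ground-truth relevant docs
p = precision_at_k(retrieved, relevant, 4)
r = recall_at_k(retrieved, relevant, 4)
rr = reciprocal_rank(retrieved, relevant)
ndcg = ndcg_at_k(retrieved, relevant, 4)
```

Two of the four retrieved documents are relevant (precision 0.5), two of the three relevant documents were found (recall 2/3), and the first hit sits at rank 2 (reciprocal rank 0.5).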

Generation metrics:

  • Faithfulness: Does the answer contain only claims supported by the retrieved context? (measures hallucination)
  • Answer Relevance: Does the answer actually address the question?
  • Context Precision/Recall: Did retrieval get the right documents?

RAGAS framework:

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

results = evaluate(
    dataset=eval_dataset,  # questions, contexts, answers, ground_truths
    metrics=[faithfulness, answer_relevancy, context_recall]
)
# faithfulness: LLM-as-judge checks if each answer claim is in context
# answer_relevancy: generates questions from the answer and compares their embeddings to the original question
# context_recall: checks if ground-truth answer components are in retrieved context

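Hallucination detection: Use LLM-as-judge with an explicit faithfulness check: prompt a powerful LLM to identify claims in the response that are not supported by the retrieved context. An NLI-based entailment check (does the context entail the claim?) is a cheaper alternative. TruthfulQA measures a model's tendency to repeat common misconceptions, so it complements context-grounded checks but does not replace them.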

Building an eval set: Manually curate 100-500 question-answer pairs from your domain. Automate generation with LLMs (generate Q&A from documents), then human-review for quality. Run nightly in CI to catch regressions.

8What is GraphRAG and knowledge graphs? When do they improve over standard RAG?

Standard RAG limitation: retrieves isolated text chunks. Misses relationships and connections between entities across documents. Fails on global summarization queries ("What are all the themes in these documents?") because no single chunk contains the answer.

GraphRAG (Microsoft): Builds a knowledge graph from the document corpus, then uses it for retrieval.

  1. Entity extraction: LLM extracts entities and relationships from each chunk → (Entity1, relationship, Entity2) triples
  2. Graph construction: merge entities across documents, build a property graph
  3. Community detection: cluster related entities into communities (Leiden algorithm)
  4. Community summaries: LLM summarizes each community → captures global themes
  5. Retrieval: Local (entity-focused) + Global (community summary) queries
from graphrag.index import run_pipeline
# Extract entities, build graph, generate community summaries
# Then query with local or global search modes

When GraphRAG wins over standard RAG:

  • Multi-hop reasoning: "Who worked with X who also collaborated with Y?"
  • Global summarization: "What are the main themes across all documents?"
  • Relationship queries: "What is the relationship between X and Y?"
  • Narrative understanding across many documents

Tradeoffs: GraphRAG is expensive to build (many LLM calls for extraction), slow to update (rebuild graph on new docs), higher latency per query. Use standard RAG for factual Q&A; GraphRAG for complex multi-document reasoning.

9How do you handle multimodal RAG with images, tables, and structured data?

Many real-world documents contain images, charts, tables, and mixed content. Standard text RAG ignores or poorly handles these.

Tables:

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("report.pdf", strategy="hi_res",
    infer_table_structure=True)

# Strategies for tables:
# 1. Convert to markdown/CSV — embed as text (good for structured queries)
# 2. Generate natural language summary — embed the description
# 3. Use a dedicated table QA model
# 4. Store in a SQL database, use NL-to-SQL for table queries (Text2SQL)

Images and charts:

  • Caption-based: Use a vision model (LLaVA, GPT-4V) to generate text descriptions of images. Embed and retrieve the descriptions.
  • Multi-modal embeddings (CLIP, ImageBind): Embed images and text into the same space. Query with text, retrieve relevant images by similarity.
  • ColPali: Index PDFs as images of pages (page screenshots). Retrieve pages using vision-language models — handles complex layouts natively without parsing.
from byaldi import RAGMultiModalModel
model = RAGMultiModalModel.from_pretrained("vidore/colpali")
model.index(input_path="docs/", index_name="my_docs")
results = model.search("What was the revenue trend in Q3?", k=3)
# Returns: relevant PDF pages as images — pass to GPT-4V for final answer

Structured data + RAG: For queries over databases/spreadsheets alongside documents, use a router that decides: this query needs SQL (go to database), this needs semantic search (go to vector DB), this needs both (merge results).

10What is long-context vs RAG? When do you use a 128K context window instead of RAG?

Modern LLMs support very long contexts (128K tokens = ~100K words). This raises the question: why use RAG at all?

Long-context is better when:

  • Document fits entirely in the context window — no retrieval errors possible
  • Tasks requiring holistic understanding of a full document (legal contracts, codebases)
  • Multi-document reasoning where all docs together fit in context
  • Conversation history that must all be considered
  • You need every detail — RAG might miss relevant chunks

RAG is better when:

  • Corpus is larger than any context window (millions of docs)
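  • Cost: at $10 per million input tokens (a rate assumed here for illustration; check current provider pricing), a 128K-token prompt costs ~$1.28/call vs ~$0.02 for a targeted 2K-token RAG prompt, a 64× gap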
  • Latency: processing 128K tokens takes seconds; a RAG lookup + 2K generation is ~300ms
  • The "lost in the middle" problem — LLMs perform worse on information in the middle of a very long context
  • Dynamic knowledge base that changes frequently
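
Input tokens are billed per token, so the cost gap grows linearly with prompt size. A rough sketch of the arithmetic, assuming a hypothetical flat rate of $10 per million input tokens and ignoring output tokens and prompt-caching discounts:

```python
def prompt_cost_usd(prompt_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-side cost of a single call."""
    return prompt_tokens * usd_per_million_tokens / 1_000_000

PRICE = 10.0  # assumed USD per 1M input tokens; substitute your provider's rate

long_context = prompt_cost_usd(128_000, PRICE)   # stuff everything into the prompt
rag_prompt = prompt_cost_usd(2_000, PRICE)       # retrieve, then send ~2K tokens
ratio = long_context / rag_prompt
```

The ratio depends only on the token counts, not on the price, so the relative saving carries over to any provider.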

"Lost in the middle": Research shows LLM performance drops for information in the middle of very long contexts — models best recall information at the beginning and end. RAG mitigates this by precisely extracting and positioning relevant content.

Hybrid approach: Use RAG to retrieve the most relevant ~5-10 documents, then pass all of them in a moderate context window (8K-32K). Better than either extreme for most use cases.

Fine-tuning

8 questions
1What is LoRA and QLoRA? How do they enable parameter-efficient fine-tuning?

LoRA (Low-Rank Adaptation): Freezes the original model weights and injects trainable low-rank decomposition matrices into attention layers. Instead of updating W (d×d), trains two smaller matrices B (d×r) and A (r×d) where r ≪ d, so the update BA has rank at most r.

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                    # rank — higher = more capacity, more params
    lora_alpha=32,           # scaling factor: the BA update is multiplied by alpha/r
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# Example output (exact counts depend on the base model and LoRA config):
# trainable params: 3,407,872 || all params: 6,607,343,616
# trainable %: 0.0516 (about 0.05% of parameters are trained)

How it works: During forward pass: h = W₀x + (α/r)·BAx. A is randomly initialized and B is initialized to zero, so BA = 0 and the model is unchanged at the start. The model adapts via BA while W₀ stays frozen. At inference, merge: W = W₀ + (α/r)·BA, which gives zero inference overhead.
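
The zero-initialization and merge properties can be checked numerically. Below is a small dependency-free sketch on toy dimensions (d=4, r=2, α=4 are arbitrary choices for illustration, and the "trained" B is just random numbers standing in for learned values):

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

d, r, alpha = 4, 2, 4
rng = random.Random(0)
W0 = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen base weight, d x d
A = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]   # r x d, random init
B = [[0.0] * r for _ in range(d)]                                # d x r, zero init
x = [1.0, -2.0, 0.5, 3.0]

def lora_forward(W0, A, B, x):
    base = matvec(W0, x)
    delta = matvec(B, matvec(A, x))   # B(Ax): the d x d update is never materialized
    return [h + (alpha / r) * dl for h, dl in zip(base, delta)]

# At initialization B = 0, so the adapted output equals the base output
h_init = lora_forward(W0, A, B, x)

# Stand-in for training: move B away from zero
B = [[rng.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]
h_adapter = lora_forward(W0, A, B, x)

# Merge for inference: W = W0 + (alpha / r) * BA
BA = matmul(B, A)
W = [[w + (alpha / r) * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W0, BA)]
h_merged = matvec(W, x)
```

`h_merged` matches `h_adapter` up to floating-point error, which is why a merged LoRA model runs at the same speed as the base model.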

QLoRA: Load the base model in 4-bit NF4 quantization (bitsandbytes), then apply LoRA on top. The QLoRA paper reports fine-tuning a 65B model on a single 48GB GPU, versus more than 780GB of GPU memory for regular 16-bit fine-tuning.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # NormalFloat4: better than INT4 for weights
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,   # quantize quantization constants
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)  # stabilizes training on a quantized base
model = get_peft_model(model, lora_config)

LoRA rank selection: r=8-16 for most tasks. r=64-128 for complex tasks requiring significant knowledge acquisition. Higher r ≈ more powerful but more parameters and risk of overfitting.
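
How rank translates into parameter count can be estimated with simple arithmetic: each adapted weight matrix adds A (r × d_in) and B (d_out × r). The sketch below assumes a hypothetical model with 32 layers, hidden size 4096, and four square attention projections adapted per layer; real models with grouped-query attention have smaller k/v projections.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int, n_modules: int) -> int:
    # Each adapted matrix adds A (r x d_in) and B (d_out x r)
    return n_modules * r * (d_in + d_out)

n_modules = 32 * 4  # hypothetical: 32 layers x (q, k, v, o) projections
params_r16 = lora_trainable_params(4096, 4096, 16, n_modules)
params_r64 = lora_trainable_params(4096, 4096, 64, n_modules)
full_matrix = 4096 * 4096  # parameters in one full d x d weight
```

Trainable parameters grow linearly with r, so going from r=16 to r=64 quadruples the adapter size while each adapter pair still stays far smaller than the matrix it modifies.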

2What other PEFT methods exist beyond LoRA? Compare LoRA, prefix tuning, prompt tuning, and adapters.

Prompt Tuning (Lester et al., 2021): Add a small number of soft prompt tokens (trainable embeddings) to the input. Only these tokens are trained, with 0 additional params per layer. Works well at large scales (roughly 10B+ parameters, where Lester et al. found it matches full fine-tuning), poor at smaller scales. Very fast, but limited expressivity.

Prefix Tuning (Li & Liang, 2021): Prepend trainable key-value pairs to every attention layer's K/V matrices. More expressive than prompt tuning. 0.1% of parameters. Good for generation tasks.

Adapter layers (Houlsby et al., 2019): Insert small bottleneck layers (FFN with a narrow hidden dimension) after each transformer sub-layer. Only adapter parameters are trained. ~1-5% of parameters. Stable training but adds inference latency (extra layers can't be merged).

IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations): Learns element-wise scaling vectors applied to keys, values, and FFN activations. Extremely parameter-efficient (~0.01% params). Fast inference (element-wise multiply, can be fused).

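DoRA (Weight-Decomposed Low-Rank Adaptation): Decomposes each pretrained weight into a magnitude vector and a direction matrix. The magnitude is trained directly and a LoRA update is applied to the direction, whereas plain LoRA changes both through a single coupled low-rank update. Better than LoRA on many benchmarks with similar parameter count.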

Comparison:

  • LoRA/QLoRA: Best overall balance. Mergeable (zero inference overhead). Standard choice for most fine-tuning.
  • Prompt tuning: When you have minimal storage budget (just store a few vectors per task). Works best at scale.
  • Adapters: When serving multiple task-specific variants — swap adapter weights per-request.
  • DoRA: When quality is paramount and you can afford slightly more params than LoRA.
3How do you prepare a fine-tuning dataset? What makes a good instruction dataset?

Dataset quality matters far more than quantity for fine-tuning. A small set of carefully curated examples can beat a far larger noisy one (see LIMA, which fine-tuned a 65B LLaMA model on just 1,000 curated prompt-response pairs and produced responses competitive with models trained on much larger instruction sets).

Dataset formats:

# Chat/instruction format (most common for instruction tuning):
{
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to reverse a string"},
        {"role": "assistant", "content": "def reverse_string(s: str) -> str:\n    return s[::-1]"}
    ]
}

# Preference format (for DPO/RLHF):
{
    "prompt": "Explain quantum entanglement",
    "chosen": "Quantum entanglement is a phenomenon where...",
    "rejected": "Quantum entanglement means particles are linked..."
}

What makes a good instruction dataset:

  • Diversity: wide range of task types, topics, difficulty levels
  • Complexity: include multi-step reasoning, edge cases
  • Correct responses: LLM-generated responses should be verified or filtered
  • Appropriate length: responses should be neither too short (under-explained) nor too long (padding)
  • No contamination: no test set examples, no personal data
  • Deduplication: near-duplicate examples waste training
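
Several of these criteria can be enforced mechanically before training. The sketch below is one possible filter for the chat format shown earlier: it checks structure and drops duplicates after case and whitespace normalization. The function names and sample records are hypothetical, and true near-duplicate detection would need something like MinHash or embedding similarity.

```python
import hashlib
import re

VALID_ROLES = {"system", "user", "assistant"}

def is_valid_chat(example: dict) -> bool:
    msgs = example.get("messages")
    if not isinstance(msgs, list) or not msgs:
        return False
    for m in msgs:
        if not isinstance(m, dict) or m.get("role") not in VALID_ROLES:
            return False
        if not isinstance(m.get("content"), str) or not m["content"].strip():
            return False
    # A training example must end with the assistant turn being learned
    return msgs[-1]["role"] == "assistant"

def fingerprint(example: dict) -> str:
    # Normalize case and whitespace so trivially different copies collide
    text = " ".join(m["content"] for m in example["messages"])
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def clean(examples: list[dict]) -> list[dict]:
    seen, kept = set(), []
    for ex in examples:
        if not is_valid_chat(ex):
            continue
        fp = fingerprint(ex)
        if fp not in seen:
            seen.add(fp)
            kept.append(ex)
    return kept

raw = [
    {"messages": [{"role": "user", "content": "Reverse a string"},
                  {"role": "assistant", "content": "Use s[::-1]"}]},
    {"messages": [{"role": "user", "content": "reverse  a string"},
                  {"role": "assistant", "content": "use s[::-1]"}]},   # duplicate after normalization
    {"messages": [{"role": "user", "content": "Hi"}]},                 # no assistant turn
    {"messages": [{"role": "user", "content": "Sort a list"},
                  {"role": "assistant", "content": "   "}]},           # empty response
]
cleaned = clean(raw)
```

Only the first record survives: the second collides with it after normalization and the last two fail validation.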

Synthetic data generation:

from datasets import load_dataset
# Generate with GPT-4/Claude, filter with a smaller model
# Evol-Instruct: evolve simple instructions into complex ones
# Self-Instruct: model generates new tasks from seed examples
# Magpie: extract conversation data by prompting with role templates

Data mixture: For domain-specific fine-tuning, mix target domain data (70-90%) with general instruction data (10-30%) to prevent catastrophic forgetting of general capabilities.

4Walk through a complete fine-tuning setup with the HuggingFace ecosystem — Trainer, TRL, and SFTTrainer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
from datasets import load_dataset

# 1. Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# 2. Configure LoRA
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj","v_proj","k_proj","o_proj","gate_proj","up_proj","down_proj"],
    bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# 3. Training arguments
args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # effective batch = 4×4 = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",
)

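# 4. SFTTrainer handles chat template formatting automatically
# (newer TRL releases expect max_seq_length/packing in SFTConfig and the tokenizer as processing_class)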
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=args,
    max_seq_length=2048,
    packing=True,   # pack multiple short sequences into one — more efficient
)
trainer.train()
trainer.save_model("./final-model")

# 5. Merge LoRA adapters into base model for deployment
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "./final-model")
model = model.merge_and_unload()  # merge adapters → no inference overhead
model.save_pretrained("./merged-model")
5What is catastrophic forgetting and how do you prevent it during fine-tuning?

Catastrophic forgetting: When fine-tuning on a narrow dataset, the model's performance on tasks not in the fine-tuning data degrades significantly — it "forgets" general capabilities. A medical QA fine-tuned model might become worse at coding or math.

Prevention strategies:

  • PEFT methods (LoRA): By freezing most parameters, LoRA dramatically reduces catastrophic forgetting — only the low-rank adapters change, preserving base model knowledge.
  • Data mixture: Include general instruction data (Alpaca, OpenHermes, general conversations) alongside domain-specific data. 10-30% general data typically prevents major forgetting.
  • Lower learning rate: Smaller LR = smaller parameter updates = less forgetting. Use the smallest LR that still achieves your target performance.
  • Regularization (EWC — Elastic Weight Consolidation): Add a penalty term that discourages changing weights that were important for previous tasks. Computationally expensive for LLMs.
  • Replay / Experience Replay: Mix some original training data (or model-generated approximations) into the fine-tuning data.
  • Evaluate on general benchmarks throughout training: Monitor MMLU, HellaSwag, TruthfulQA to catch regression early. Stop if general capability drops below a threshold.
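# Monitor forgetting during training
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

lm = HFLM(pretrained=model, tokenizer=tokenizer)  # wrap the in-memory HF model
results = evaluator.simple_evaluate(model=lm,
    tasks=["mmlu", "hellaswag", "truthfulqa_mc2"],
    num_fewshot=5)
# Alert if any benchmark drops > 5% from baseline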
6How do you fine-tune embedding models? What is contrastive training for retrieval?

Generic embedding models are trained on general text pairs. For domain-specific retrieval (legal, medical, code, finance), fine-tuning on domain-relevant pairs dramatically improves performance.

Data for embedding fine-tuning:

  • Positive pairs: (query, relevant document) — queries matched with their best answers
  • Hard negatives: (query, similar-but-wrong document) — documents that are topically related but don't actually answer the query. Critical for high-quality training.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Multiple Negatives Ranking Loss (most common for retrieval)
# Each row: (query, positive). All other positives in batch = negatives
train_examples = [
    InputExample(texts=["What causes diabetes?", "Diabetes is caused by..."]),
    InputExample(texts=["How to center a div in CSS?", "To center a div..."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MultipleNegativesRankingLoss(model)

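# Hard negative mining: use the base model to find similar-but-wrong docs,
# then add them as a third text. MultipleNegativesRankingLoss also accepts triplets:
# InputExample(texts=[anchor, positive, hard_negative])
# Alternative for triplet data: train_loss = losses.TripletLoss(model)
# (do not reassign train_loss here: the pairs above have no explicit negative)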

model.fit(train_objectives=[(train_dataloader, train_loss)],
    epochs=3, warmup_steps=100, show_progress_bar=True)
model.save("my-domain-embeddings")
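
To show what the in-batch negatives objective computes, here is a dependency-free sketch of the loss on toy 2-D embeddings. It assumes cosine similarity and a scale of 20, mirroring the sentence-transformers defaults as I understand them; the vectors are made up for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mnr_loss(query_embs, doc_embs, scale=20.0):
    """Mean cross-entropy where doc i is the positive for query i
    and every other doc in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(query_embs):
        logits = [scale * cosine(q, d) for d in doc_embs]
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum_exp - logits[i])   # -log softmax(logits)[i]
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]   # each positive sits near its own query
swapped_docs = [[0.1, 0.9], [0.9, 0.1]]   # each positive sits near the other query
good = mnr_loss(queries, aligned_docs)
bad = mnr_loss(queries, swapped_docs)
```

The loss is near zero when each query is closest to its own positive and large when another document in the batch is closer. Larger batches supply more negatives per query, and a hard negative is simply an extra column in `doc_embs` that is difficult to separate from the positive.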

LLM-based embedding fine-tuning: Use an LLM (Mistral, Llama) as the backbone for embedding by mean-pooling the token states or taking the hidden state of the final [EOS] token (last-token pooling). Instruction-tuned embeddings (e5-mistral-7b) prepend a task instruction to the query. Very powerful but expensive to run.

7How do you implement DPO training? What data do you need and what are the pitfalls?
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig

# Data format:
# {"prompt": "...", "chosen": "good response", "rejected": "bad response"}

# Load SFT model (DPO starts from an already instruction-tuned model)
model = AutoModelForCausalLM.from_pretrained("my-sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("my-sft-model")  # frozen reference
tokenizer = AutoTokenizer.from_pretrained("my-sft-model")

dpo_config = DPOConfig(
    output_dir="./dpo-output",
    beta=0.1,                    # KL penalty: higher = stay closer to reference
    loss_type="sigmoid",         # original DPO; alternatives include "hinge" and "ipo"
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,          # very small LR — DPO is sensitive
    num_train_epochs=1,          # usually 1-3 epochs enough
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=dpo_config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)
trainer.train()

Common DPO pitfalls:

  • Bad preference data: If chosen/rejected pairs are low quality or ambiguous, DPO degrades the model. Each pair needs a clear quality difference.
  • Learning rate too high: DPO is very sensitive to LR. Start at 5e-7 and tune carefully.
  • Length bias: Models can learn to prefer longer responses regardless of quality. Use SimPO (length-normalized) or filter by response length similarity.
  • Forgetting after DPO: DPO can reduce the model's diversity and creative ability. Monitor on diverse benchmarks.
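  • Reference model mismatch: The reference model must be the exact checkpoint the policy starts from. If you did further fine-tuning after SFT, use that same checkpoint as the reference; otherwise the implicit KL penalty is measured against the wrong distribution.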
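
The role of beta and the reference model is easier to see in the loss itself. A minimal sketch of the per-pair sigmoid DPO loss, taking as input the summed token log-probabilities of each response under the policy and the reference (the numbers below are invented for illustration):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]) for one preference pair."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) computed stably as softplus(-z)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to reference: margin is 0, loss is log(2)
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raised the chosen response and lowered the rejected one
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
# Policy moved the wrong way
regressed = dpo_loss(-14.0, -13.0, -12.0, -15.0)
```

Because only the difference from the reference enters the loss, a reference that is not the starting checkpoint gives a non-zero margin before any training has happened, which distorts the gradient from the first step.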
8How do you evaluate a fine-tuned LLM? What benchmarks and human evaluation methods do you use?

Automated benchmarks for general capability:

  • MMLU (Massive Multitask Language Understanding): 57 subjects, multiple choice. Tests breadth of world knowledge.
  • HellaSwag: Commonsense inference. Complete the sentence with the right ending.
  • TruthfulQA: Measures tendency to repeat human misconceptions.
  • HumanEval / MBPP: Code generation benchmarks — pass@k metric.
  • MT-Bench: Multi-turn conversation benchmark scored by GPT-4.
  • AlpacaEval 2.0: Win rate against GPT-4 Turbo, length-controlled.
# LM Evaluation Harness (EleutherAI) — run standard benchmarks
lm_eval --model hf \
    --model_args pretrained=./my-fine-tuned-model \
    --tasks mmlu,hellaswag,truthfulqa_mc2,humaneval \
    --num_fewshot 5 \
    --output_path ./results.json
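
For the code benchmarks, pass@k is usually computed with the unbiased estimator from the HumanEval paper rather than by literally sampling k times: generate n ≥ k samples per problem, count the c that pass the unit tests, and compute 1 − C(n−c, k)/C(n, k). A small sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated, c of them pass the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

p1 = pass_at_k(n=20, c=5, k=1)     # equals the raw pass rate c / n
p10 = pass_at_k(n=20, c=5, k=10)   # chance that at least one of 10 samples passes
```

The benchmark score is the mean of this value over all problems.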

Domain-specific evaluation: Build a custom eval set of 100-500 examples representative of your actual use case. Use LLM-as-judge (GPT-4/Claude) for quality scoring. Define a rubric with specific criteria (accuracy, helpfulness, format adherence, safety).

Human evaluation (gold standard): A/B blind comparison — show evaluators responses from base model vs fine-tuned model, ask which is better. Pairwise is more reliable than absolute scoring. Hire domain experts for specialized topics. Use Labelbox, Scale AI, or Toloka for crowdsourcing.

Regression testing: Maintain a fixed eval set in CI. Run before/after each training run. Alert if any benchmark drops more than 2-3% — catches accidental degradation early.

Agents & Tool Use

8 questions
1What is an AI agent? Explain the ReAct framework and the agent loop.

An AI agent is an LLM that can perceive its environment, reason about what actions to take, execute those actions via tools, observe results, and iterate until a goal is achieved. Unlike a single LLM call, agents can complete multi-step tasks autonomously.

ReAct (Reason + Act): Interleaves reasoning (Thought) with actions (Act) and observations (Observe). The model explicitly thinks before acting, making reasoning transparent and correctable.

Question: What is the population of the capital of France?

Thought: I need to find the capital of France, then its population.
Action: search("capital of France")
Observation: The capital of France is Paris.

Thought: Now I need the population of Paris.
Action: search("population of Paris 2024")
Observation: Paris has a population of approximately 2.1 million (city proper).

Thought: I have the information needed to answer.
Final Answer: The population of Paris, the capital of France, is approximately 2.1 million people.

Agent loop:

  1. Receive task + current state
  2. LLM reasons and selects an action (or returns final answer)
  3. Execute the action (tool call, code execution, etc.)
  4. Observe the result, add to context
  5. Repeat until done or max iterations reached
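
The five steps above can be sketched as a plain loop. In this illustration the LLM is replaced by a hypothetical scripted `policy` function so the example runs offline; the decision dictionary format (`type`, `tool`, `input`, `answer`) is an assumption, not any framework's API.

```python
from typing import Callable

def run_agent(task: str, policy: Callable[[list[str]], dict],
              tools: dict[str, Callable[[str], str]], max_steps: int = 5) -> str:
    context = [f"Task: {task}"]                         # step 1: task + current state
    for _ in range(max_steps):
        decision = policy(context)                      # step 2: reason, pick an action
        if decision["type"] == "final":
            return decision["answer"]
        tool = tools.get(decision["tool"])
        if tool is None:
            observation = f"Error: unknown tool {decision['tool']!r}"
        else:
            observation = tool(decision["input"])       # step 3: execute the action
        context.append(f"Action: {decision['tool']}({decision['input']!r})")
        context.append(f"Observation: {observation}")   # step 4: observe, add to context
    return "Stopped: max steps reached"                 # step 5: iteration cap

# Scripted stand-in for the LLM
def scripted_policy(context: list[str]) -> dict:
    observations = [c for c in context if c.startswith("Observation:")]
    if not observations:
        return {"type": "tool", "tool": "search", "input": "capital of France"}
    return {"type": "final", "answer": observations[-1].removeprefix("Observation: ")}

tools = {"search": lambda q: "Paris" if "France" in q else "unknown"}
answer = run_agent("What is the capital of France?", scripted_policy, tools)

# A policy that never finishes is cut off by max_steps
stuck = run_agent("loop forever",
                  lambda ctx: {"type": "tool", "tool": "search", "input": "x"},
                  tools, max_steps=3)
```

Unknown tools are reported back as an observation instead of raising, which gives the model a chance to correct itself on the next step.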

Key challenges: Agents can get stuck in loops, lose context in long reasoning chains, make cascading errors, and are harder to debug than single-pass pipelines. Mitigation: max step limits, structured output validation, checkpointing, human-in-the-loop for irreversible actions.

2How does tool/function calling work at the API level? What are best practices for tool design?

Tool calling allows the LLM to request structured function execution. The model outputs a JSON object specifying which function to call and with what arguments. Your code executes it and returns the result.

tools = [{
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Search the knowledge base for relevant documents. Use when you need specific factual information.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query. Be specific and use keywords."
                },
                "max_results": {
                    "type": "integer",
                    "description": "Maximum number of results to return (1-10)",
                    "default": 5
                }
            },
            "required": ["query"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4o", messages=messages, tools=tools,
    tool_choice="auto"  # "auto", "required", or {"type":"function","function":{"name":"..."}}
)

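# Handle tool calls (requires `import json`)
msg = response.choices[0].message
if msg.tool_calls:  # more reliable than finish_reason, which is "stop" when a tool is forced
    messages.append(msg)  # the assistant turn must precede its tool results
    for tc in msg.tool_calls:
        args = json.loads(tc.function.arguments)
        result = execute_tool(tc.function.name, args)
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": str(result)})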

Tool design best practices:

  • Clear, specific descriptions: The model relies entirely on the description to decide when and how to use the tool. Include when to use it, what it returns, and any limitations.
  • Validated parameters: Use JSON Schema constraints (enum, minimum, maximum, pattern) to prevent invalid inputs from reaching your code.
  • Idempotent where possible: Agents may call tools multiple times. Design read tools to be always safe; add confirmation for destructive actions.
  • Return structured output: JSON responses are easier for the LLM to parse than free text.
  • Limit tool count: With more than roughly 10-15 tools, models tend to struggle with selection. Group related operations or use tool routing.
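
Model-generated arguments should be validated before they reach your code, even when the schema was sent to the API. The sketch below hand-rolls a validator for a small subset of JSON Schema (`type`, `enum`, `minimum`, `maximum`, `required`, `default`) to show the idea; the `validate_args` helper is hypothetical, and a production system would more likely use a full validator such as the `jsonschema` package or Pydantic.

```python
TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate_args(schema: dict, args: dict) -> dict:
    """Check args against a small subset of JSON Schema and fill in defaults."""
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            raise ValueError(f"missing required argument: {name}")
    unknown = set(args) - set(props)
    if unknown:
        raise ValueError(f"unknown arguments: {sorted(unknown)}")
    clean = {}
    for name, spec in props.items():
        if name not in args:
            if "default" in spec:
                clean[name] = spec["default"]
            continue
        value = args[name]
        expected = TYPE_MAP[spec["type"]]
        # bool is a subclass of int in Python, so reject it for non-boolean fields
        if not isinstance(value, expected) or (spec["type"] != "boolean" and isinstance(value, bool)):
            raise ValueError(f"{name}: expected {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"{name}: must be one of {spec['enum']}")
        if "minimum" in spec and value < spec["minimum"]:
            raise ValueError(f"{name}: below minimum {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            raise ValueError(f"{name}: above maximum {spec['maximum']}")
        clean[name] = value
    return clean

schema = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "max_results": {"type": "integer", "minimum": 1, "maximum": 10, "default": 5},
    },
    "required": ["query"],
}
ok = validate_args(schema, {"query": "LoRA rank"})
```

Returning the validation error text to the model as the tool result, rather than crashing, usually lets the agent retry with corrected arguments.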
3What are multi-agent systems? Explain orchestrator-worker, debate, and reflection patterns.

Multi-agent systems use multiple LLM instances collaborating to complete complex tasks that benefit from specialization, parallelism, or verification.

Orchestrator-Worker pattern: An orchestrator agent plans and delegates subtasks to specialized worker agents. Worker results are aggregated by the orchestrator.

from langgraph.graph import StateGraph

# Orchestrator decomposes: "Write a research report on LLMs"
# → assigns to: researcher agent, writer agent, fact-checker agent
# Workers run in parallel → orchestrator merges results

# LangGraph — stateful multi-agent orchestration
graph = StateGraph(AgentState)
graph.add_node("orchestrator", orchestrator_node)
graph.add_node("researcher", researcher_node)
graph.add_node("writer", writer_node)
graph.add_conditional_edges("orchestrator", route_to_worker)

Reflection pattern: An agent generates a response, then a "critic" agent (or the same agent in a second pass) evaluates and provides feedback. The generator revises based on feedback. Iterates until quality threshold is met.

draft = generator_agent.run(task)
for _ in range(max_iterations):
    critique = critic_agent.evaluate(draft)
    if critique.score >= threshold:
        break
    draft = generator_agent.revise(draft, critique.feedback)

Debate/adversarial pattern: Multiple agents argue different perspectives or solutions. A judge agent (or human) selects the best. Good for complex decisions, reduces single-model bias.

Specialized agent teams:

  • CrewAI: Define agents with roles, goals, backstories. Agents collaborate via sequential or hierarchical processes.
  • AutoGen (Microsoft): Conversational multi-agent framework. Agents converse to complete tasks.
  • LangGraph: State machine-based, fine-grained control over agent flow. Best for complex stateful workflows.
4What is memory management in agents? Explain short-term, long-term, semantic, and episodic memory.

Agent memory determines what information persists across conversations and tool calls, enabling more capable and personalized behavior.

Short-term memory (in-context): The current conversation history in the context window. Temporary — gone at session end. Managed via message history trimming and summarization for long conversations.

from langchain.memory import ConversationSummaryBufferMemory
# Keeps recent messages verbatim + LLM-generated summary of older messages
memory = ConversationSummaryBufferMemory(
    llm=llm, max_token_limit=2000,
    return_messages=True
)

Long-term memory (external storage): Persists across sessions. Implementation strategies:

  • Semantic memory: facts and knowledge about the world/user. Stored as embeddings in a vector DB. Retrieved by similarity to current context. "User prefers Python over JavaScript."
  • Episodic memory: records of past interactions and outcomes. "Last week, user asked about X and found answer Y helpful." Enables learning from past successes/failures.
  • Procedural memory: learned workflows and skills. "For database queries, always check schema first, then construct query, then validate result."
from mem0 import Memory  # mem0 — LLM-optimized memory layer

m = Memory()
# Store: extract important facts from conversation
m.add("User prefers concise Python code with type hints", user_id="alice")
# Retrieve: fetch relevant memories for context
memories = m.search("coding preferences", user_id="alice")
# Inject relevant memories into system prompt before each conversation

Memory in production: Use vector DB (Pinecone, Weaviate) for semantic memory with TTL. Categorize memories (user preferences vs facts vs procedures). Implement memory decay — older memories have lower retrieval priority. Always get user consent for persistent memory.
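
One simple way to implement TTL plus decay is to multiply each memory's similarity score by an exponential time-decay factor at retrieval time. The sketch below assumes a 30-day half-life and a 365-day TTL, both arbitrary values to tune; the field names and sample memories are hypothetical.

```python
def memory_score(similarity: float, age_days: float, half_life_days: float = 30.0) -> float:
    """Down-weight a memory's similarity by exponential time decay."""
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * decay

def top_memories(memories: list[dict], k: int = 2, ttl_days: float = 365.0) -> list[dict]:
    live = [m for m in memories if m["age_days"] <= ttl_days]   # TTL: drop expired entries
    return sorted(live,
                  key=lambda m: memory_score(m["similarity"], m["age_days"]),
                  reverse=True)[:k]

memories = [
    {"text": "Prefers Python with type hints", "similarity": 0.80, "age_days": 2},
    {"text": "Was using Python 2 at the time", "similarity": 0.85, "age_days": 400},
    {"text": "Asked about asyncio last quarter", "similarity": 0.82, "age_days": 90},
]
ranked = top_memories(memories)
```

The 400-day-old memory has the highest raw similarity but is removed by the TTL, and the recent preference outranks the 90-day-old one after decay.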

5What is the Model Context Protocol (MCP)? How does it standardize tool integration?

MCP (Model Context Protocol) is an open standard introduced by Anthropic (November 2024) that standardizes how AI models connect to external tools and data sources. It defines a common protocol for AI assistants to communicate with external servers that expose tools, resources, and prompts.

Architecture:

  • MCP Host: The AI application (Claude Desktop, Cursor, your custom app) that manages connections
  • MCP Client: Protocol client within the host, maintains 1:1 connections with servers
  • MCP Server: Lightweight service that exposes capabilities via the protocol
// MCP Server (Node.js example)
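import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { CallToolRequestSchema, ListToolsRequestSchema } from "@modelcontextprotocol/sdk/types.js";

// Declare the "tools" capability so clients know this server exposes tools
const server = new Server(
    { name: "database-server", version: "1.0.0" },
    { capabilities: { tools: {} } }
);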

// Expose tools
server.setRequestHandler(ListToolsRequestSchema, async () => ({
    tools: [{
        name: "query_database",
        description: "Execute a read-only SQL query",
        inputSchema: {
            type: "object",
            properties: {
                sql: { type: "string", description: "SQL SELECT statement" }
            },
            required: ["sql"]
        }
    }]
}));

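server.setRequestHandler(CallToolRequestSchema, async (request) => {
    if (request.params.name === "query_database") {
        // `db` is assumed to be an existing client bound to a read-only database role
        const result = await db.query(request.params.arguments.sql);
        return { content: [{ type: "text", text: JSON.stringify(result) }] };
    }
    throw new Error(`Unknown tool: ${request.params.name}`);
});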

Key MCP capabilities: Tools (executable functions), Resources (data sources the model can read), Prompts (reusable prompt templates), Sampling (server can request LLM completions).
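
On the wire, MCP messages are JSON-RPC 2.0, and a tool invocation uses the `tools/call` method. The sketch below builds such a request and reads a result by hand, purely to show the message shape. The helper names `make_tool_call` and `extract_text` are hypothetical, and the response is hand-written rather than captured from a real server; in practice an MCP client SDK handles this for you.

```python
import json

def make_tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 request for MCP's tools/call method."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })

def extract_text(response_json: str) -> str:
    """Concatenate text content items from a tools/call result, raising on errors."""
    response = json.loads(response_json)
    if "error" in response:            # protocol-level failure (e.g. unknown method)
        raise RuntimeError(response["error"].get("message", "unknown error"))
    result = response["result"]
    if result.get("isError"):          # the tool ran but reported a failure
        raise RuntimeError("tool reported an error")
    return "".join(item["text"] for item in result["content"] if item["type"] == "text")

request = make_tool_call(1, "query_database", {"sql": "SELECT 1 AS n"})
# Hand-written response standing in for what a server might send back.
fake_response = json.dumps({
    "jsonrpc": "2.0", "id": 1,
    "result": {"content": [{"type": "text", "text": "[{\"n\": 1}]"}]},
})
text = extract_text(fake_response)
```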

Why it matters: Before MCP, every LLM integration required custom API glue code. MCP is becoming the "USB-C of AI integrations" — write an MCP server once, connect to any MCP-compatible model host. Rapidly growing ecosystem of MCP servers (GitHub, Slack, databases, file systems, web browsers).

6How do you implement code execution agents safely? What sandboxing approaches exist?

Code execution agents (like Devin, Claude's computer use, OpenAI's Code Interpreter) can write and execute code to solve problems. This is extremely powerful but requires careful sandboxing to prevent security issues.

Sandboxing options:

  • Docker containers: Isolated filesystem, network, and process space. Simple, widely available. Still shares the host kernel — container escape vulnerabilities exist.
  • gVisor (Google): User-space kernel sandbox. It intercepts the container's system calls and serves them from a kernel reimplemented in user space, so far less of the host kernel is exposed than with standard Docker.
  • Firecracker (AWS): Lightweight VMs (microVMs). Used in AWS Lambda. True VM isolation with near-container startup speed.
  • WebAssembly (WASM): Browser-inspired sandboxed execution. No filesystem or network by default. Excellent for untrusted code execution.
  • E2B (sandbox-as-a-service): Managed secure sandboxes for AI agents. Persistent sessions, file system, network access controls.
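import asyncio
# Assumes the e2b-code-interpreter package and an E2B_API_KEY in the environment;
# check names against your installed SDK version.
from e2b_code_interpreter import AsyncSandbox

async def execute_code(code: str) -> str:
    sandbox = await AsyncSandbox.create()  # fresh isolated environment per call
    try:
        execution = await sandbox.run_code(code, timeout=30)  # seconds
        if execution.error:  # error is an object, not a string
            return f"{execution.error.name}: {execution.error.value}"
        return execution.text or "".join(execution.logs.stdout)
    finally:
        await sandbox.kill()

# The agent writes code, e2b runs it safely:
agent_code = llm.generate("Write Python to scrape example.com")
result = asyncio.run(execute_code(agent_code))
# result fed back to agent for next step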

Security controls for code execution agents:

  • Network egress filtering — prevent exfiltrating data or calling external APIs without approval
  • CPU/memory limits — prevent runaway computation
  • Execution timeout — kill long-running processes
  • Read-only filesystem mounts — protect host data
  • Allowlist of permitted packages — prevent installing malicious packages
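
Two of these controls, the execution timeout and a CPU limit, can be sketched with the standard library alone. This is not a sandbox: the child process still shares the host filesystem and network, so it would sit inside one of the isolation layers above. The function name `run_untrusted` and the limit values are assumptions for illustration.

```python
import subprocess
import sys

try:
    import resource  # POSIX only
except ImportError:
    resource = None

def _limit_cpu(seconds: int):
    """Return a pre-exec hook that caps CPU time for the child process."""
    def apply():
        resource.setrlimit(resource.RLIMIT_CPU, (seconds, seconds))
    return apply

def run_untrusted(code: str, timeout_s: float = 5.0) -> dict:
    """Run code in a separate interpreter with a wall-clock timeout and a CPU cap."""
    kwargs = {}
    if resource is not None:
        kwargs["preexec_fn"] = _limit_cpu(int(timeout_s) + 1)
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env vars and user site
            capture_output=True, text=True, timeout=timeout_s, **kwargs,
        )
        return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": "timed out"}

result = run_untrusted("print(2 + 2)")
hung = run_untrusted("while True: pass", timeout_s=1.0)
```

The second call shows the timeout path: the infinite loop is killed and reported as a failure instead of blocking the agent.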
7How do you build reliable agents? What are the key patterns for error handling and recovery?

Agents fail frequently — tool errors, invalid JSON, reasoning loops, context exhaustion. Reliable agents need systematic error handling.

Structured output enforcement:

from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

class AgentAction(BaseModel):
    thought: str
    action: str  # "search" | "calculate" | "finish"
    action_input: str
    confidence: float

# Use structured outputs to prevent JSON parsing failures
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=messages,
    response_format=AgentAction,  # guaranteed valid structure
)
action = response.choices[0].message.parsed

Retry with feedback:

def execute_with_retry(agent, task, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = agent.run(task)
            if validate_result(result):
                return result
            # Tell agent what went wrong
            task = f"{task}\n\nPrevious attempt failed: {result}\nPlease correct and retry."
        except ToolError as e:
            task = f"{task}\n\nTool error: {e}. Try a different approach."
    return fallback_response(task)

Key reliability patterns:

  • Checkpointing: Save agent state after each successful step. Resume from checkpoint on failure rather than restarting.
  • Verification steps: After each action, have the agent verify the result matches expectations before continuing.
  • Graceful degradation: If a tool is unavailable, try alternatives or acknowledge limitations rather than hallucinating.
  • Timeout and budget limits: Hard limits on time, tokens, and tool calls. Agents can spin indefinitely without them.
  • Human-in-the-loop triggers: Automatically escalate to human review for low-confidence decisions, high-stakes actions, or repeated failures.
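
The budget-limit pattern, combined with a simple guard against repeated identical tool calls, might look like the sketch below. `RunBudget`, `BudgetExceeded`, and the default limits are hypothetical names and values, not part of any agent framework.

```python
import time
from dataclasses import dataclass, field

class BudgetExceeded(Exception):
    pass

@dataclass
class RunBudget:
    max_tool_calls: int = 20
    max_seconds: float = 120.0
    max_repeats: int = 2                      # identical calls tolerated before aborting
    started: float = field(default_factory=time.monotonic)
    calls: int = 0
    seen: dict = field(default_factory=dict)

    def check(self, tool: str, tool_input: str) -> None:
        """Call before every tool invocation; raises when a limit is hit."""
        self.calls += 1
        if self.calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call limit reached")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("time limit reached")
        key = (tool, tool_input)
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.seen[key] > self.max_repeats:
            raise BudgetExceeded(f"repeated call to {tool!r}: likely a reasoning loop")

budget = RunBudget(max_tool_calls=5)
budget.check("search", "ai trends")
budget.check("search", "ai trends")
try:
    budget.check("search", "ai trends")   # third identical call
    loop_detected = False
except BudgetExceeded:
    loop_detected = True
```

Catching `BudgetExceeded` in the agent loop is a natural place to trigger the fallback response or human escalation.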
8What is LangGraph and how does it differ from LangChain for building stateful agents?

LangChain is a framework for building LLM-powered applications with chains (sequential pipelines) and agents. Good for simple linear workflows. Agents are implemented as while-loops — limited control over execution flow.

LangGraph models agent workflows as directed graphs that may contain cycles. Nodes are processing steps, edges are transitions. Enables loops, branches, parallel execution, and human-in-the-loop pausing.

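import operator
from typing import Annotated, TypedDict

from langchain_core.messages import HumanMessage, ToolMessage
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, END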

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]  # message history
    next_step: str

def call_model(state: AgentState):
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

def call_tool(state: AgentState):
    tool_call = state["messages"][-1].tool_calls[0]
    tool_result = execute_tool(tool_call)
    # A ToolMessage must carry the id of the tool call it answers
    return {"messages": [ToolMessage(content=tool_result, tool_call_id=tool_call["id"])]}

def should_continue(state: AgentState) -> str:
    if state["messages"][-1].tool_calls:
        return "tools"  # must match the node name registered below
    return END

# Build the graph
workflow = StateGraph(AgentState)
workflow.add_node("agent", call_model)
workflow.add_node("tools", call_tool)
workflow.set_entry_point("agent")
workflow.add_conditional_edges("agent", should_continue)
workflow.add_edge("tools", "agent")  # loop back to agent after tool use

app = workflow.compile(
    checkpointer=MemorySaver(),       # persist state across invocations
    interrupt_before=["tools"],       # pause for human approval before tool use
)

# Invoke (pauses at the interrupt before "tools"; resume with app.invoke(None, config=...))
result = app.invoke({"messages": [HumanMessage("Research and summarize AI trends")]},
    config={"configurable": {"thread_id": "session-1"}})

When to use LangGraph: Complex multi-step agents with branching logic; stateful workflows that pause for human review; multi-agent systems with coordination; any agent that needs cycles (loops back to earlier steps).

Vector Databases

6 questions
1How do vector databases work internally? Explain HNSW, IVF, and LSH indexing algorithms.

Vector databases store high-dimensional embeddings and enable fast Approximate Nearest Neighbor (ANN) search. Exact nearest neighbor search is O(n·d) — too slow for millions of vectors.

HNSW (Hierarchical Navigable Small World): Graph-based index. Builds a multi-layer proximity graph. Top layers are sparse (long-range connections), bottom layers are dense (local connections). Search starts at a fixed entry point in the top layer, greedily navigates toward the query, then descends to lower layers, using each layer's result as the starting point for the next. Search cost grows roughly logarithmically with n in practice. Very fast, high recall, but high memory usage (graph edges are stored alongside the full vectors).

import hnswlib

# Build index
index = hnswlib.Index(space='cosine', dim=1536)
index.init_index(max_elements=1_000_000, ef_construction=200, M=16)
# M: connections per node (higher = more accurate, more memory, slower build)
# ef_construction: search effort during build (higher = better index quality)
index.add_items(embeddings, ids)

# Search
index.set_ef(50)  # ef: search effort at query time (higher = more accurate, slower)
labels, distances = index.knn_query(query_embedding, k=10)
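
The greedy navigation step at the heart of HNSW can be shown on a single layer. This sketch is a simplification under stated assumptions: the graph is hand-built rather than constructed by the HNSW insertion procedure, there is only one layer, and the search keeps a single candidate instead of an `ef`-sized candidate list.

```python
import math

def greedy_search(vectors, neighbors, entry, query):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    while True:
        best = min(neighbors[current],
                   key=lambda n: math.dist(vectors[n], query),
                   default=current)
        if math.dist(vectors[best], query) >= math.dist(vectors[current], query):
            return current  # no neighbor is closer: local minimum reached
        current = best

vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0), 4: (3.0, 1.0)}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
found = greedy_search(vectors, neighbors, entry=0, query=(2.9, 0.9))
```

In the full algorithm, the sparse upper layers let this walk cover long distances in a few hops before the dense bottom layer refines the result.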

IVF (Inverted File Index): Cluster vectors into nlist clusters (e.g., with k-means). At search time, probe the nprobe nearest cluster centroids and search only within those clusters. Much lower memory than HNSW. Trade-off: worse recall if relevant vectors sit in unprobed clusters. Used in FAISS.
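
A toy version of IVF makes the nprobe trade-off concrete. The centroids here are fixed by hand instead of being learned with k-means, and the function names are illustrative only.

```python
import math

def nearest(centroids, v, n=1):
    """Indices of the n centroids closest to v."""
    return sorted(range(len(centroids)), key=lambda i: math.dist(centroids[i], v))[:n]

def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, v in vectors.items():
        lists[nearest(centroids, v)[0]].append(vid)
    return lists

def ivf_search(vectors, centroids, lists, query, k=1, nprobe=1):
    """Scan only the inverted lists of the nprobe nearest centroids."""
    candidates = [vid for c in nearest(centroids, query, nprobe) for vid in lists[c]]
    return sorted(candidates, key=lambda vid: math.dist(vectors[vid], query))[:k]

centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = {"a": (1.0, 0.0), "b": (4.9, 0.0), "c": (9.0, 0.0)}
lists = build_ivf(vectors, centroids)
query = (5.2, 0.0)
one_probe = ivf_search(vectors, centroids, lists, query, nprobe=1)
two_probes = ivf_search(vectors, centroids, lists, query, nprobe=2)
```

The true nearest neighbor of the query is "b", but it lives in the cluster of the farther centroid, so a single probe misses it and two probes find it.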

LSH (Locality Sensitive Hashing): Hash similar vectors into the same bucket with high probability. Very fast, sub-linear search. Lower recall than HNSW/IVF. Better for approximate search with extreme speed requirements.
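
One common LSH family for cosine similarity uses random hyperplanes: each bit of the hash records which side of a plane the vector falls on. The sketch below assumes that family; the seed, dimensionality, and bit count are arbitrary choices for the example.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals from a standard Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        int(sum(h_i * v_i for h_i, v_i in zip(h, vector)) >= 0)
        for h in hyperplanes
    )

planes = make_hyperplanes(dim=3, n_bits=8)
v = [0.2, -0.5, 0.9]
scaled = [10 * x for x in v]      # same direction, different length
opposite = [-x for x in v]        # opposite direction

sig_v = lsh_signature(v, planes)
sig_scaled = lsh_signature(scaled, planes)
sig_opposite = lsh_signature(opposite, planes)
```

Vectors pointing the same way share a signature and therefore a bucket, while the opposite vector flips every bit. Vectors at small angles agree on most bits, which is what makes bucket lookup a fast candidate filter.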

Product Quantization (PQ): Compresses vectors by splitting into sub-vectors and quantizing each. 8-32× memory reduction. Used with IVF as IVF-PQ in FAISS for billion-scale search.

2Compare major vector databases — Pinecone, Weaviate, Qdrant, Chroma, pgvector, and Milvus.

pgvector: PostgreSQL extension. Best when already on Postgres, since it needs no new infrastructure. Supports HNSW (v0.5.0+) and IVFFlat indexes. Scales to ~10M vectors comfortably. Can join vector search with SQL filters. Limited compared to dedicated vector DBs at massive scale.

-- pgvector
CREATE TABLE embeddings (id bigserial, content text, embedding vector(1536));
CREATE INDEX ON embeddings USING hnsw (embedding vector_cosine_ops);
SELECT content FROM embeddings ORDER BY embedding <=> $1 LIMIT 5;

Pinecone: Fully managed, serverless. No infrastructure management. Very easy to get started. Scales to billions of vectors. Expensive at scale. Good for teams that want managed service.

Qdrant: Open-source (self-host or cloud). Written in Rust — fast and memory-efficient. Best-in-class filtered search (payload filters applied during ANN, not post-filtering). Strong quantization support. Ideal for filtered similarity search.

Weaviate: Open-source, GraphQL API. Built-in support for hybrid search (BM25 + dense). Schema-based (define object types). Module system for embedding generation. Good for semantic search with structured data.

Chroma: Open-source, embedded or self-hosted. Easiest to get started (pip install, in-memory). Perfect for prototyping and development. Less suitable for large-scale production.

Milvus: Open-source, designed for billion-scale. Cloud-native, distributed. Most features but most complex. LF AI & Data Foundation project. Zilliz Cloud is the managed version.

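FAISS (Meta): Library, not a database: no server and no database features such as replication, access control, or metadata filtering (indexes can still be saved to and loaded from disk). Use for in-process search or building your own vector DB layer. Very fast raw search performance. Used as an underlying index by some vector DBs.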

3What is metadata filtering in vector search? How does it impact performance?

Most real-world vector search needs to combine semantic similarity with structured filters — "find similar documents, but only from the last 30 days" or "find similar products, but only in the 'electronics' category".

Post-filtering (naive): Retrieve top-k by vector similarity, then filter by metadata. Problem: if 90% of results are filtered out, you might need top-100 to get 5 valid results. Recall degrades as the filter becomes more restrictive.

Pre-filtering: Filter by metadata first (get candidate IDs), then do exact search over that subset. Good recall, but can be slow if the candidate set is large. Requires metadata index.
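
The difference between post-filtering and pre-filtering shows up even on a five-document toy corpus. This sketch uses exact brute-force search in place of an ANN index, and the documents and categories are made up for the example.

```python
import math

docs = [
    {"id": 1, "vec": (1.0, 0.0), "category": "news"},
    {"id": 2, "vec": (0.9, 0.1), "category": "news"},
    {"id": 3, "vec": (0.8, 0.2), "category": "news"},
    {"id": 4, "vec": (0.0, 1.0), "category": "electronics"},
    {"id": 5, "vec": (0.1, 0.9), "category": "electronics"},
]

def top_k(candidates, query, k):
    return sorted(candidates, key=lambda d: math.dist(d["vec"], query))[:k]

def post_filter(query, k, category):
    """Vector search first, metadata filter second: may return fewer than k."""
    return [d["id"] for d in top_k(docs, query, k) if d["category"] == category]

def pre_filter(query, k, category):
    """Metadata filter first, exact search over the surviving subset."""
    subset = [d for d in docs if d["category"] == category]
    return [d["id"] for d in top_k(subset, query, k)]

query = (1.0, 0.0)
post = post_filter(query, k=2, category="electronics")
pre = pre_filter(query, k=2, category="electronics")
```

Post-filtering returns nothing because both of the top-2 neighbors are in the wrong category, while pre-filtering returns the two best matching electronics documents.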

Integrated filtering (best approach — Qdrant, Weaviate): Apply filters during the ANN graph traversal. The search algorithm skips filtered-out nodes while navigating the HNSW graph. Maintains recall without the performance penalty of post-filtering.

from qdrant_client import QdrantClient
from qdrant_client.models import DatetimeRange, FieldCondition, Filter, MatchValue

client = QdrantClient("localhost", port=6333)

results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    query_filter=Filter(must=[
        # Range expects numbers; DatetimeRange handles datetime payloads
        FieldCondition(key="date", range=DatetimeRange(gte="2024-01-01T00:00:00Z")),
        FieldCondition(key="category", match=MatchValue(value="technology")),
    ]),
    limit=10
)  # filter applied DURING HNSW traversal: efficient and accurate

Metadata indexing: Always create payload indexes on frequently filtered fields. Without indexes, filtering requires scanning all vectors — O(n) not O(log n).

4How do you scale vector databases to hundreds of millions of vectors?

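Memory constraints: HNSW for 100M vectors at 1536 dims in float32 = 100M × 1536 × 4 bytes ≈ 614GB for the raw vectors alone. Graph edges add roughly 13GB more for M=16 (about 2·M = 32 neighbor IDs of 4 bytes each per vector on the base layer), plus the much smaller upper layers and bookkeeping. That exceeds the RAM of a typical single server.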
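
The arithmetic can be reproduced with a small estimator. The link term assumes 2·M neighbor IDs of 4 bytes each per vector on the base layer, which is how hnswlib-style implementations are commonly laid out; upper layers and per-node overhead are ignored, so treat the link figure as a lower bound.

```python
def index_memory_gb(n_vectors, dim, bytes_per_value=4, m=16, bytes_per_link=4):
    """Rough HNSW footprint in GB: raw vectors plus base-layer links (2*M per vector)."""
    vectors_gb = n_vectors * dim * bytes_per_value / 1e9
    links_gb = n_vectors * 2 * m * bytes_per_link / 1e9
    return vectors_gb, links_gb

float32_gb, links_gb = index_memory_gb(100_000_000, 1536)
int8_gb, _ = index_memory_gb(100_000_000, 1536, bytes_per_value=1)
```

Under these assumptions the vectors dominate the footprint, which is why quantizing the vector payload is usually the first lever to pull.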

Quantization for memory reduction:

# Scalar quantization (float32 → int8): 4× memory reduction
# Product quantization (PQ): 16-32× reduction, some accuracy loss
# Binary quantization: 32× reduction (1 bit per dimension)

# Qdrant scalar quantization:
from qdrant_client.models import (
    Distance, ScalarQuantization, ScalarQuantizationConfig, ScalarType, VectorParams,
)
client.create_collection("large_index",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, always_ram=True)
    )
)

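Disk-based ANN (DiskANN): Store the full graph and full-precision vectors on SSD, keep compressed vectors and hot nodes in RAM. Achieves good recall and throughput with dramatically less RAM. Developed at Microsoft Research and offered in Microsoft services such as Azure Cosmos DB. Latency is somewhat higher than pure in-memory search because each query performs several SSD reads.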

Distributed sharding: Partition vectors across multiple nodes. Each node searches its shard; a coordinator merges results. Weaviate and Milvus support this natively. Key challenge: determining which shard to route queries to without checking all shards (multi-shard overhead).
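
The scatter-gather step can be sketched as follows: each shard returns its local top-k and the coordinator merges the sorted partial lists. One-dimensional values stand in for vectors, and `search_shard` is a hypothetical stand-in for a per-shard ANN query.

```python
import heapq

def search_shard(shard, query, k):
    """Stand-in for a per-shard ANN query: returns (distance, id) pairs, best first."""
    scored = [(abs(value - query), vid) for vid, value in shard.items()]
    return heapq.nsmallest(k, scored)

def scatter_gather(shards, query, k):
    """Query every shard for its local top-k, then merge into a global top-k."""
    partials = [search_shard(shard, query, k) for shard in shards]
    return [vid for _, vid in heapq.nsmallest(k, heapq.merge(*partials))]

shards = [{"a": 1.0, "b": 5.0}, {"c": 2.0, "d": 9.0}, {"e": 2.5}]
result = scatter_gather(shards, query=2.2, k=3)
```

Asking every shard for k results guarantees the global top-k is recoverable, at the cost of querying all shards on every request.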

Tiered storage: Hot vectors (recent, frequently accessed) in RAM + HNSW. Cold vectors on disk or object storage + flat scan or DiskANN. Route queries to appropriate tier based on vector age/frequency.

IVF for billion scale: IVF-PQ in FAISS is the standard approach for billion-scale: ~64 bytes per vector (vs 6KB+ for HNSW), brute-force within probed clusters. Billion-vector search in <100ms on a single GPU with FAISS-GPU.

5What is a multi-vector search and ColBERT-style late interaction? When is it superior to single-vector search?

Standard bi-encoder search compresses the entire document meaning into a single vector — information loss, especially for long or complex documents.

ColBERT (Contextualized Late Interaction over BERT): Represents each token in the document as a separate embedding vector. At search time, compute maximum similarity between each query token and all document token vectors (MaxSim). This "late interaction" preserves token-level semantics.

# ColBERT interaction:
# Query: ["what", "causes", "diabetes"] → 3 vectors
# Doc: ["diabetes", "is", "caused", "by", "insulin", ...] → N vectors
# Score = sum over query tokens of: max(cosine_sim(q_token, d_token) for d_token in doc)
# This captures: "causes" ↔ "caused", "diabetes" ↔ "diabetes"

from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
RAG.index(collection=documents, index_name="my_index")
results = RAG.search(query="What causes type 2 diabetes?", k=5)
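
The MaxSim score itself is only a few lines. This sketch uses tiny hand-made 2-D vectors in place of real token embeddings, so the numbers are illustrative only.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def maxsim(query_vecs, doc_vecs):
    """Sum over query tokens of the best cosine match among document tokens."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)

query_vecs = [(1.0, 0.0), (0.0, 1.0)]
doc_a = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]   # has a close match for both query tokens
doc_b = [(1.0, 0.0), (0.9, 0.1)]               # matches only the first query token well

score_a = maxsim(query_vecs, doc_a)
score_b = maxsim(query_vecs, doc_b)
```

Because every query token looks for its own best match, a document that covers all query tokens outscores one that matches a single token very well.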

ColPali: Extends ColBERT to visual documents — each page is encoded as a grid of patch embeddings. Enables searching PDFs natively as images without parsing text.

When multi-vector is superior:

  • Long documents where a single vector loses detail
  • Technical documents with specific terminology that needs token-level matching
  • Tasks requiring understanding of specific phrases within documents

Tradeoff: Multi-vector storage cost is 100-300× more than single-vector (one vector per token vs one per document). Use for re-ranking top-100 candidates from fast single-vector retrieval, not for the full corpus search.

6How do you handle index updates and deletions in a live vector database?

HNSW graphs are difficult to update — inserting a new vector requires connecting it to the graph (fast), but deleting a vector requires restructuring its connections (slow and complex). Most vector databases handle this with soft deletes or segment-based architectures.

Soft delete with tombstones: Mark deleted vectors as deleted. Filter them out at query time. Periodically rebuild the index to physically remove them. Simple but wastes memory and slightly slows queries.
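
A minimal sketch of the tombstone pattern, with a plain dictionary and brute-force search standing in for the real ANN index. The class and method names are hypothetical.

```python
import math

class TombstoneIndex:
    def __init__(self):
        self.vectors = {}       # id -> vector (stand-in for the real ANN index)
        self.deleted = set()    # tombstones

    def add(self, vid, vector):
        self.vectors[vid] = vector
        self.deleted.discard(vid)

    def delete(self, vid):
        self.deleted.add(vid)   # cheap: the index structure is untouched

    def search(self, query, k):
        """Search as usual, skipping tombstoned ids at query time."""
        live = (v for v in self.vectors if v not in self.deleted)
        return sorted(live, key=lambda v: math.dist(self.vectors[v], query))[:k]

    def compact(self):
        """Periodic rebuild: physically drop tombstoned vectors."""
        for vid in self.deleted:
            self.vectors.pop(vid, None)
        self.deleted.clear()

index = TombstoneIndex()
for vid, vec in {"a": (0.0, 0.0), "b": (0.1, 0.0), "c": (1.0, 1.0)}.items():
    index.add(vid, vec)
index.delete("b")
hits = index.search((0.1, 0.0), k=2)
size_before = len(index.vectors)
index.compact()
size_after = len(index.vectors)
```

The deleted vector disappears from results immediately, but its memory is only reclaimed by the later compaction.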

Segment-based architecture (Qdrant, Milvus): Incoming data goes to a mutable segment (small, exact search). Periodically, mutable segments are merged into immutable indexed segments (large, HNSW). Deletions only update the mutable segment or set tombstones. Background optimization merges and rebuilds segments.

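from qdrant_client import QdrantClient
from qdrant_client.models import (
    FieldCondition, Filter, FilterSelector, PointIdsList, PointStruct, Range,
)

client = QdrantClient("localhost", port=6333)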

# Upsert (insert or update by ID)
client.upsert(collection_name="docs", points=[
    PointStruct(id=doc_id, vector=embedding, payload={"content": text, "updated_at": now})
])

# Delete by IDs
client.delete(collection_name="docs", points_selector=PointIdsList(points=[id1, id2, id3]))

# Delete by filter (e.g., documents older than 30 days; assumes updated_at is a numeric Unix timestamp)
client.delete(collection_name="docs", points_selector=FilterSelector(
    filter=Filter(must=[FieldCondition(key="updated_at", range=Range(lt=thirty_days_ago))])
))

Change Data Capture (CDC) pipeline: For keeping a vector index in sync with a primary database: set up CDC (Debezium) on the source DB → stream changes to Kafka → consume and upsert/delete in vector DB. Ensures the index reflects the source of truth with minimal lag (seconds to minutes).

MLOps

8 questions
1What is MLOps and how does it differ from DevOps? What are the key components of an MLOps platform?

MLOps applies DevOps principles to the full machine learning lifecycle. Unlike software, ML systems have additional complexities: data versioning (data is a first-class artifact), model versioning (models are trained artifacts), training pipelines (separate from deployment pipelines), and drift/degradation (models become stale as data distribution shifts).

Key differences from DevOps:

  • Code + Data + Model all need version control and testing
  • CI/CD includes model training, evaluation, and validation — not just build and test
  • Monitoring includes data drift, model drift, prediction distribution — not just latency/errors
  • Retraining triggers (data drift, performance degradation) vs pure code deploys

Key MLOps platform components:

  • Data versioning: DVC, LakeFS, Delta Lake — track dataset versions like code
  • Experiment tracking: MLflow, W&B, Neptune — log parameters, metrics, artifacts per experiment run
  • Feature store: Feast, Tecton, Hopsworks — consistent feature computation for training and serving
  • Pipeline orchestration: Kubeflow, Metaflow, Prefect, Airflow — reproducible training pipelines
  • Model registry: MLflow Registry, W&B Model Registry — version, stage, and govern models
  • Model serving: TorchServe, BentoML, Seldon, Ray Serve — scalable inference
  • Monitoring: Evidently, WhyLabs, Arize — data drift, model performance monitoring
2How do you detect and handle data drift and model drift in production?

Data drift (covariate shift): The distribution of input features changes. The model was trained on one distribution but is now seeing different data. Example: a fraud detection model trained in 2022 faces different transaction patterns in 2024.

Concept drift: The relationship between inputs and outputs changes. Even if inputs look the same, the correct output changes. Example: sentiment analysis model becomes outdated as language evolves.

Model degradation: Upstream changes to data pipelines, feature computation, or serving infrastructure silently change model behavior.

Detection methods:

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, ClassificationPreset

# Compare reference (training) data distribution vs current production data
report = Report(metrics=[DataDriftPreset(), ClassificationPreset()])
report.run(reference_data=train_df, current_data=production_df)
report.save_html("drift_report.html")

# Statistical tests for drift detection:
# KS test (Kolmogorov-Smirnov): continuous features
# Chi-square test: categorical features
# PSI (Population Stability Index): common in finance
# MMD (Maximum Mean Discrepancy): multivariate, for embeddings

# In production: compute PSI/KS daily, alert if > threshold
# PSI < 0.1: no significant change
# PSI 0.1-0.25: some change, monitor
# PSI > 0.25: significant drift, investigate and retrain

Handling drift: Automated retraining trigger when drift score exceeds threshold. Expanding training window to include recent data. Incremental learning / online learning. Feature engineering to make features more robust to drift. A/B test with newly retrained model before full rollout.
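
A drift score for a single numeric feature can be computed with PSI in a few lines. This sketch assumes equal-width bins over the reference range and a small epsilon for empty bins; production implementations often use quantile bins instead, and the synthetic data is made up for the example.

```python
import math

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index over equal-width bins spanning the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            i = min(max(int((v - lo) / width), 0), n_bins - 1)  # clamp out-of-range values
            counts[i] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))

reference = [i / 100 for i in range(100)]            # roughly uniform on [0, 1)
shifted = [min(x + 0.3, 0.99) for x in reference]    # mass pushed toward the top bins

no_drift = psi(reference, reference)
drift = psi(reference, shifted)
should_retrain = drift > 0.25
```

Comparing the score against a threshold such as 0.25 is what would feed an automated retraining trigger.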

3How do you implement CI/CD for machine learning? What does a complete ML pipeline look like?
# Complete ML CI/CD pipeline (GitHub Actions + DVC + MLflow)

# .github/workflows/ml-pipeline.yml
on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # nightly retraining

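jobs:
  ml-pipeline:
    runs-on: ubuntu-latest  # assumption: use a self-hosted GPU runner for real training
    steps:
      - uses: actions/checkout@v4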
      # 1. Data validation
      - run: python validate_data.py
        # Check schema, missing values, statistical properties
        # Fail if data quality below threshold

      # 2. Reproducible training (DVC pipeline)
      - run: dvc repro train
        # Pulls exact dataset version, trains model, logs to MLflow

      # 3. Model evaluation and comparison
      - run: python evaluate.py
        # Compare new model vs current production model
        # Check: accuracy >= 0.95, latency <= 100ms, fairness metrics pass

      # 4. Model validation gates
      - run: python validate_model.py
        # Behavioral tests (invariance tests, directional tests)
        # Performance on held-out test set
        # Bias/fairness checks

      # 5. Register model if gates pass
      - run: python register_model.py
        # MLflow Model Registry: Staging → Production after human approval

      # 6. Deploy to staging
      - run: kubectl apply -f k8s/staging-deployment.yaml

      # 7. Automated integration tests on staging
      - run: pytest tests/integration/

      # 8. Canary deployment: 5% traffic to new model
      - run: kubectl apply -f k8s/canary-deployment.yaml
        # Monitor metrics for 30 min before full rollout

Model testing (beyond accuracy): Unit tests for data preprocessing functions; invariance tests (input perturbations that shouldn't change output: "I love this movie" vs "i love this movie"); minimum functionality tests (basic sanity cases that must always pass); slice-based evaluation (performance on demographic groups).
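
An invariance test can be written as a small harness that applies label-preserving perturbations and collects any prediction changes. `toy_sentiment` is a hypothetical keyword model standing in for a real classifier; only the harness shape is the point.

```python
def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in for a real model: keyword lookup on lowercased text."""
    words = text.lower().replace("!", "").split()
    return "positive" if any(w in {"love", "great"} for w in words) else "negative"

def perturbations(text: str):
    """Label-preserving edits: casing, trailing punctuation, extra whitespace."""
    yield text.lower()
    yield text.upper()
    yield text + "!"
    yield "  " + text + "  "

def invariance_failures(predict, texts):
    """Return (original, perturbed) pairs where the prediction changed."""
    return [(t, p) for t in texts for p in perturbations(t) if predict(p) != predict(t)]

failures = invariance_failures(toy_sentiment, ["I love this movie", "This was dull"])
```

In a CI pipeline, a non-empty `failures` list would fail the model validation gate.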

4How do you serve ML models efficiently? Compare TorchServe, TensorRT, ONNX Runtime, and vLLM.

ONNX Runtime: Cross-platform inference engine. Export PyTorch/TensorFlow models to ONNX format, then serve with a highly optimized runtime. Supports CPU, CUDA, DirectML, TensorRT backends. Often cited as 2-5× faster than eager PyTorch, though the gain depends heavily on the model, hardware, and batch size, so benchmark your own workload.

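import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Export to ONNX
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased").eval()
model.config.return_dict = False  # tuple outputs are easier to trace
enc = tokenizer("example input", return_tensors="pt")
dynamic = {0: "batch", 1: "sequence"}
torch.onnx.export(model, (enc["input_ids"], enc["attention_mask"]), "model.onnx",
    input_names=["input_ids", "attention_mask"], output_names=["logits"],
    opset_version=17,
    dynamic_axes={"input_ids": dynamic, "attention_mask": dynamic, "logits": {0: "batch"}})

# Serve with ONNX Runtime (inputs are NumPy arrays keyed by the exported input names)
import onnxruntime as ort
session = ort.InferenceSession("model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
batch = tokenizer(["some text to classify"], return_tensors="np")
outputs = session.run(None, {"input_ids": batch["input_ids"],
                             "attention_mask": batch["attention_mask"]})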

TensorRT (NVIDIA): NVIDIA-specific optimization. Graph optimization, layer fusion, reduced precision (FP16, INT8). Can deliver a further 2-5× speedup over generic ONNX Runtime execution on NVIDIA GPUs, depending on the model. INT8 requires a calibration dataset. Best for production on NVIDIA GPUs.

TorchServe: PyTorch-native model server. Multi-model serving, A/B testing, batching, gRPC/REST. Simpler than building your own serving layer but less feature-rich than Triton.

NVIDIA Triton: Enterprise-grade model serving. Any framework (TensorFlow, PyTorch, ONNX, TensorRT). Ensemble models, dynamic batching, concurrent model execution. Industry standard for high-throughput inference.

vLLM: LLM-specific server. PagedAttention for KV cache efficiency, continuous batching, speculative decoding. OpenAI-compatible API. State-of-the-art LLM serving throughput (3-24× improvement over naive serving).

vllm serve meta-llama/Meta-Llama-3-8B \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --enable-chunked-prefill
# Speculative decoding needs a draft model much smaller than the target model;
# the flags that configure it differ between vLLM releases, so check your version's docs.
5How do you implement feature stores? What problems do they solve?

The feature reuse problem: Team A computes "user_average_spend_30d" for a fraud model. Team B needs the same feature for a recommendation model. Without a feature store, it's computed twice — often differently, leading to inconsistencies. Training computes features one way; serving computes them another (training-serving skew).

Feature store solves:

  • Training-serving consistency: Same feature computation code for offline training and online serving
  • Feature reuse: Central registry of features, discoverable and shareable across teams
  • Point-in-time correctness: Training data uses feature values as they were at the prediction time, not current values — prevents data leakage
  • Online/offline duality: Offline store (data warehouse) for training, online store (low-latency DB) for serving
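from datetime import timedelta
from feast import BigQuerySource, Entity, FeatureStore, FeatureView, Field
from feast.types import Float32, Int64

# Define features once
user = Entity(name="user_id", join_keys=["user_id"])
user_stats = FeatureView(
    name="user_stats",
    entities=[user],  # recent Feast versions expect Entity objects, not strings
    schema=[
        Field(name="avg_spend_30d", dtype=Float32),
        Field(name="transaction_count_7d", dtype=Int64),
    ],
    # "event_timestamp" is an assumed column name; it drives the point-in-time join
    source=BigQuerySource(table="analytics.user_features", timestamp_field="event_timestamp"),
    ttl=timedelta(days=1),  # features expire after 1 day
)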

store = FeatureStore(repo_path=".")

# Training: retrieve historical feature values with point-in-time join
training_df = store.get_historical_features(
    entity_df=entity_df_with_timestamps,  # user_id + event_timestamp per row
    features=["user_stats:avg_spend_30d", "user_stats:transaction_count_7d"]
).to_df()

# Serving: low-latency online feature retrieval
features = store.get_online_features(
    features=["user_stats:avg_spend_30d"],
    entity_rows=[{"user_id": user_id}]
).to_dict()
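
Point-in-time correctness boils down to an "as of" lookup: for each labeled event, take the latest feature value recorded at or before the event time. The sketch below shows that lookup with the standard library; the history data and the `feature_as_of` helper are made up for illustration and ignore TTL handling.

```python
from bisect import bisect_right

# Feature history per user: (timestamp, value) pairs sorted by timestamp.
feature_history = {
    "alice": [(1, 10.0), (5, 20.0), (9, 30.0)],
}

def feature_as_of(user_id, event_ts):
    """Latest feature value with timestamp <= event_ts, or None if nothing is known yet."""
    history = feature_history.get(user_id, [])
    i = bisect_right(history, (event_ts, float("inf")))
    return history[i - 1][1] if i else None

labels = [("alice", 0), ("alice", 5), ("alice", 7), ("alice", 12)]
training_rows = [(u, ts, feature_as_of(u, ts)) for u, ts in labels]
```

The event at time 7 gets the value written at time 5, not the later value from time 9, which is exactly the leakage a naive "join on latest value" would introduce.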
6How do you do A/B testing and shadow mode deployment for ML models?

Shadow mode (dark launch): New model receives production traffic and makes predictions, but its predictions are NOT served to users. Predictions are logged and compared offline against the production model. Zero user impact. Ideal for validating behavior before live A/B test.

import asyncio

async def predict(request: PredictRequest) -> Response:
    # Production model prediction: always returned to user
    prod_prediction = prod_model.predict(request.features)

    # Shadow model runs in the background; its output is only logged.
    # In a real service, keep a reference to the task so it is not garbage-collected.
    asyncio.create_task(shadow_predict(request, prod_prediction))

    return Response(prediction=prod_prediction)

async def shadow_predict(request, prod_prediction):
    # Run the blocking predict call in a worker thread so the event loop stays free
    shadow_prediction = await asyncio.to_thread(shadow_model.predict, request.features)
    # Log both predictions for comparison
    logger.info("shadow_comparison", extra={
        "prod": prod_prediction, "shadow": shadow_prediction,
        "request_id": request.id
    })

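A/B testing: Route a percentage of real traffic to the new model. Users in group A see production model, group B sees new model. Measure business metrics (conversion, engagement, revenue) not just accuracy. A canary deployment uses the same traffic split, but its goal is catching regressions safely during rollout rather than measuring impact.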

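# Feature flag / traffic splitting:
import hashlib

def get_model(user_id: str) -> Model:
    # Consistent assignment: same user always gets same model.
    # Built-in hash() is salted per process for strings, so use a stable digest.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    if bucket < 10:  # 10% to new model
        return new_model
    return prod_model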

# Measure with proper statistical tests:
# t-test for continuous metrics (revenue, session length)
# Chi-square for binary metrics (conversion rate)
# Calculate required sample size upfront (power analysis)
# Typical experiment duration: 1-2 weeks for stable estimates
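
The power analysis mentioned above can be sketched for a conversion-rate experiment. This uses the common normal-approximation formula for a two-sided two-proportion test; the baseline rate, target rate, alpha, and power are example values, not recommendations.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

n_large_effect = sample_size_per_group(0.10, 0.12)   # detect 10% -> 12%
n_small_effect = sample_size_per_group(0.10, 0.11)   # detect 10% -> 11%
```

Detecting a lift from 10% to 12% needs a few thousand users per arm under these assumptions, and halving the effect size roughly quadruples the requirement.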

Multi-armed bandit: Automatically routes more traffic to the better-performing model, less to the worse. Faster convergence than fixed A/B test but harder to interpret statistically. Use for rapid model iteration where maximizing total reward matters more than causal inference.
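
Thompson sampling is one common bandit strategy for this kind of routing. The sketch below simulates it with Beta-Bernoulli posteriors; the conversion rates, round count, and seed are invented for the simulation, and a real system would update from logged user outcomes instead.

```python
import random

def thompson_sampling(true_rates, rounds=2000, seed=0):
    """Beta-Bernoulli Thompson sampling; returns how often each arm was chosen."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        # Sample a plausible conversion rate per arm from its Beta posterior.
        samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:   # simulated user outcome
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls

pulls = thompson_sampling([0.05, 0.20])   # arm 0: current model, arm 1: new model
```

As evidence accumulates, the better arm is sampled more often, so most traffic drifts to it without a fixed split.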

7How do you manage GPU infrastructure for training and serving? What are key cost optimizations?

GPU types: A100 (40GB or 80GB; the 80GB model uses HBM2e, 312 TFLOPS dense BF16): long-standing workhorse for LLM training. H100 (80GB HBM3, ~989 TFLOPS dense BF16 on the SXM variant): faster and more expensive, up to roughly 3× an A100 on transformer workloads. L40S (48GB): good inference GPU, lower cost. A10G (24GB): cheap inference, popular on AWS. RTX 4090 (24GB): consumer GPU, great performance/$, no NVLink.

Training cost optimizations:

  • Spot/preemptible instances: 60-80% discount. Need checkpointing every 30-60min to resume after interruption. Use Spot Instances on AWS, Spot VMs (formerly Preemptible VMs) on GCP.
  • Mixed precision + Flash Attention: 2-3× memory reduction → larger batch sizes → better GPU utilization → fewer GPU-hours.
  • Gradient accumulation: Simulate large batches without large GPU count.
  • Efficient data loading: Multi-worker DataLoader, pre-tokenized cached datasets. GPU should never wait for CPU data.

Inference cost optimizations:

  • Request batching: Batch multiple user requests together for GPU parallelism. Continuous batching in vLLM/TGI maximizes GPU utilization.
  • Quantization: Relative to FP16 weights, INT8 reduces GPU memory about 2× and INT4 about 4×. Run the same model on half the GPUs.
  • Model distillation: Train a smaller "student" model to mimic a larger "teacher." 10× smaller model with 80-90% quality.
  • Caching: Cache responses for common queries. Semantic caching (find similar cached queries) can have high hit rates for repetitive workloads.
  • CPU offloading: Run smaller models (classification, embeddings) on CPU — much cheaper than GPU instances.
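
The semantic caching idea can be sketched as a similarity lookup over stored query embeddings. The 0.95 threshold, the linear scan, and the hand-written 3-D embeddings are assumptions for the example; a real cache would use an embedding model and a vector index.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []   # (embedding, response) pairs; linear scan for brevity

    def get(self, embedding):
        """Return the cached response of the most similar query above the threshold."""
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) >= self.threshold:
            return best[1]
        return None

    def put(self, embedding, response):
        self.entries.append((embedding, response))

cache = SemanticCache()
cache.put((1.0, 0.0, 0.0), "Paris")
hit = cache.get((0.99, 0.05, 0.0))    # near-duplicate query
miss = cache.get((0.0, 1.0, 0.0))     # unrelated query
```

The threshold is the key tuning knob: too low and users get answers to a different question, too high and the hit rate collapses.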
8What is experiment tracking and how do you use MLflow in a production ML workflow?
import mlflow
import mlflow.pytorch

mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("llm-fine-tuning")

with mlflow.start_run(run_name="llama3-lora-r16") as run:
    # Log hyperparameters
    mlflow.log_params({
        "model": "meta-llama/Meta-Llama-3-8B",
        "lora_r": 16, "lora_alpha": 32,
        "learning_rate": 2e-4,
        "epochs": 3, "batch_size": 16,
    })

    for epoch in range(3):  # same value as the "epochs" param logged above
        train_loss = train_epoch()
        eval_metrics = evaluate()

        # Log metrics once per epoch
        mlflow.log_metrics({
            "train_loss": train_loss,
            "eval_loss": eval_metrics["loss"],
            "eval_accuracy": eval_metrics["accuracy"],
        }, step=epoch)

    # Log model artifact
    mlflow.pytorch.log_model(model, "model",
        registered_model_name="llama3-fine-tuned")

    # Log evaluation artifacts
    mlflow.log_artifact("confusion_matrix.png")
    mlflow.log_artifact("eval_report.json")

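# Transition model to production via Model Registry
# (stages are deprecated since MLflow 2.9; newer code uses aliases, e.g.
#  client.set_registered_model_alias("llama3-fine-tuned", "production", 3))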
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="llama3-fine-tuned", version=3, stage="Production"
)

# Compare experiments programmatically
runs = mlflow.search_runs(experiment_names=["llm-fine-tuning"],
    filter_string="metrics.eval_accuracy > 0.90",
    order_by=["metrics.eval_accuracy DESC"])
best_run = runs.iloc[0]

Alternatives: Weights & Biases (W&B) — better visualization, sweeps (HPO), richer media logging, team collaboration. Neptune — similar to W&B, strong on comparison. Comet ML — good for LLM-specific logging. All can be dropped in with minimal code changes.

Production AI Systems

6 questions
1How do you design a scalable LLM API service? What are the key architectural considerations?

LLM serving architecture layers:

Client → API Gateway → Load Balancer → Router → LLM Workers
                                          ↓
                                   Cache Layer (Redis)
                                          ↓
                                   Queue (Redis/SQS)
                                          ↓
                                   vLLM Workers (GPU fleet)
                                          ↓
                                   Observability (metrics, traces, logs)

Key design decisions:

  • Request queuing: LLM inference is slow (1-30s). Use async queuing so requests don't time out during GPU backlog. Return a job ID immediately, then let the client poll or stream results.
  • Streaming responses: Stream tokens via SSE or WebSocket as they're generated. Users get first token in 200-500ms instead of waiting 10-30s for the full response. Dramatically improves perceived latency.
  • Model routing: Route simple queries to fast/cheap models (GPT-4o-mini, Haiku), complex queries to powerful models (GPT-4o, Opus). Intent classification layer determines routing.
  • Request prioritization: Premium users get priority queue slots. Batch/async jobs run on off-peak capacity.
  • Horizontal scaling: vLLM workers are stateless (KV cache is per-request). Scale GPU workers based on queue depth. Use Kubernetes with GPU node pools.
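
The prioritization and scaling bullets above can be sketched with the standard library alone. This is an illustrative sketch, not a production scheduler: the tier names, the `per_worker_capacity` parameter, and the worker limits are all assumptions made for the example.

```python
import heapq
import itertools

class PriorityRequestQueue:
    """Min-heap keyed by (priority, arrival order); a lower number is served first."""
    PRIORITY = {"premium": 0, "standard": 1, "batch": 2}  # hypothetical tiers

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a tier

    def push(self, request_id, tier):
        heapq.heappush(self._heap, (self.PRIORITY[tier], next(self._counter), request_id))

    def pop(self):
        _, _, request_id = heapq.heappop(self._heap)
        return request_id

    def __len__(self):
        return len(self._heap)

def desired_workers(queue_depth, per_worker_capacity, min_workers=1, max_workers=16):
    """Scale the worker count with queue depth, clamped to [min_workers, max_workers]."""
    needed = -(-queue_depth // per_worker_capacity)  # ceiling division
    return max(min_workers, min(max_workers, needed))

queue = PriorityRequestQueue()
queue.push("job-a", "batch")
queue.push("job-b", "standard")
queue.push("job-c", "premium")
queue.push("job-d", "premium")

order = [queue.pop() for _ in range(len(queue))]
print(order)
print(desired_workers(queue_depth=40, per_worker_capacity=8))
```

In a Kubernetes deployment the same queue-depth signal would typically be exported as a metric and fed to an autoscaler rather than computed in application code.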

Latency targets: Time to first token (TTFT) < 500ms; tokens per second (TPS) > 30 for streaming to feel fast. Optimize for TTFT with prefill optimization; TPS with batching and fast inference kernels.
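
To track these targets you need to measure TTFT and TPS per request. Below is a minimal sketch that wraps any token iterator; `FakeClock` and `fake_stream` are hypothetical helpers that simulate a model with a 0.4 s prefill and 25 ms per token so the example runs without a GPU.

```python
import time

def measure_stream(token_iter, clock=time.perf_counter):
    """Consume a token stream and return (ttft_seconds, tokens_per_second, tokens)."""
    start = clock()
    first_token_at = None
    tokens = []
    for token in token_iter:
        if first_token_at is None:
            first_token_at = clock()
        tokens.append(token)
    end = clock()
    if first_token_at is None:
        return None, 0.0, tokens
    ttft = first_token_at - start
    decode_time = end - first_token_at
    # TPS covers the decode phase only, so a slow prefill does not distort it
    tps = (len(tokens) - 1) / decode_time if decode_time > 0 and len(tokens) > 1 else 0.0
    return ttft, tps, tokens

class FakeClock:
    """Hypothetical manual clock so the example is deterministic."""
    def __init__(self):
        self.now = 0.0
    def __call__(self):
        return self.now
    def advance(self, seconds):
        self.now += seconds

def fake_stream(clock, prefill_s, per_token_s, n_tokens):
    """Simulated model: one prefill delay, then a fixed delay between tokens."""
    clock.advance(prefill_s)
    for i in range(n_tokens):
        if i > 0:
            clock.advance(per_token_s)
        yield f"tok{i}"

clock = FakeClock()
ttft, tps, tokens = measure_stream(fake_stream(clock, 0.4, 0.025, 41), clock=clock)
print(f"TTFT={ttft:.3f}s TPS={tps:.1f} tokens={len(tokens)}")
```

With a real streaming client, pass the response iterator and leave `clock` at its default.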

2How do you monitor LLM applications in production? What metrics and alerts are critical?

Infrastructure metrics (standard): GPU utilization, GPU memory, TTFT, TPS/throughput, queue depth, error rate. Alert on: GPU util < 70% sustained (over-provisioned or poorly batched, so you are paying for idle capacity), queue depth growing unboundedly (under-provisioned), error rate > 1%.

LLM-specific metrics:

from prometheus_client import Histogram, Counter, Gauge
import opentelemetry

# Core LLM metrics
time_to_first_token = Histogram("llm_ttft_seconds", "Time to first token",
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0])
tokens_per_second = Histogram("llm_tps", "Generation speed (tokens/s)",
    buckets=[5, 10, 20, 30, 50, 75, 100, 150])  # default buckets top out at 10
input_tokens = Counter("llm_input_tokens_total", "Input tokens consumed")
output_tokens = Counter("llm_output_tokens_total", "Output tokens generated")
finish_reason = Counter("llm_finish_reason_total", "Generation finish reason",
    labelnames=["reason"])  # "stop", "length", "content_filter"

# Quality metrics (sampled)
guardrail_triggered = Counter("llm_guardrail_triggered_total", "Safety blocks")
user_thumbs_down = Counter("llm_negative_feedback_total", "Thumbs-down feedback events")

# Cost metrics
cost_per_request = Histogram("llm_cost_usd", "Cost per request in USD")

LLM observability (traces): Log every prompt, response, latency, model used, token counts with a correlation ID. This allows debugging specific failures. Use LangSmith, Langfuse, or Helicone for LLM-specific tracing.

Quality monitoring: Sample 1-5% of production traffic for automated quality evaluation (LLM-as-judge). Track: response relevance, faithfulness (for RAG), policy adherence. Alert on sudden quality degradation. Human review queue for samples below quality threshold.
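
One simple way to pick the 1-5% sample is to hash the correlation ID, so the decision is stable across services and retries. This is a sketch under assumptions: the `req-N` ID format and the 3% rate are made up for the example.

```python
import hashlib

def should_sample(correlation_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly sample_rate of requests by hashing the ID."""
    digest = hashlib.sha256(correlation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate

ids = [f"req-{i}" for i in range(20000)]  # hypothetical correlation IDs
sampled = [rid for rid in ids if should_sample(rid, 0.03)]
observed_rate = len(sampled) / len(ids)
print(f"sampled {len(sampled)} of {len(ids)} ({observed_rate:.2%})")
```

Because the choice depends only on the ID, every component that sees the request agrees on whether it is in the evaluation sample.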

Cost alerting: Set daily/weekly token budgets per customer/feature. Alert at 80% of budget. Auto-throttle at 100%.
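
The alert-at-80%, throttle-at-100% policy can be sketched as a small in-memory tracker. The class and field names here are hypothetical, and a real deployment would keep the counter in a shared store; rejecting a request that would overshoot the limit is one possible reading of "auto-throttle".

```python
from dataclasses import dataclass

@dataclass
class TokenBudget:
    limit_tokens: int
    alert_fraction: float = 0.8
    used_tokens: int = 0
    alerted: bool = False

    def record(self, tokens: int) -> str:
        """Return 'ok', 'alert' (first crossing of the alert line), or 'throttle'."""
        if self.used_tokens + tokens > self.limit_tokens:
            return "throttle"  # reject without consuming budget
        self.used_tokens += tokens
        if not self.alerted and self.used_tokens >= self.alert_fraction * self.limit_tokens:
            self.alerted = True  # fire the alert only once per budget period
            return "alert"
        return "ok"

budget = TokenBudget(limit_tokens=1000)
statuses = [budget.record(n) for n in (500, 300, 100, 200)]
print(statuses, budget.used_tokens)
```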

3How do you implement guardrails and content moderation for LLM applications?

Guardrails are safety and quality checks applied to LLM inputs and outputs to prevent harmful, off-topic, or low-quality responses.

Input guardrails:

from openai import OpenAI
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# OpenAI Moderation API: free, fast, checks for harmful content
def moderate_input(text: str) -> bool:
    result = client.moderations.create(input=text)
    return not result.results[0].flagged

# Topic/scope guardrail: block off-topic requests
# `llm` is a placeholder for whatever small, cheap model client you use
def is_on_topic(query: str, allowed_topics: list[str]) -> bool:
    topics = ", ".join(allowed_topics)
    classifier_prompt = f"Is this query about any of: {topics}? Answer yes or no.\nQuery: {query}"
    return llm.generate(classifier_prompt).strip().lower().startswith("yes")

# PII detection and redaction before the text reaches the LLM
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def redact_pii(text: str) -> str:
    results = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(text=text, analyzer_results=results).text

Output guardrails:

  • Hallucination detection: For RAG, check if every claim in the response is supported by the retrieved context
  • Format validation: Ensure JSON responses are valid, required fields present
  • Toxicity filtering: Check model output with a toxicity classifier before returning to user
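  • Jailbreak detection: Pattern matching + ML classifier for prompt injection attempts. This check normally runs on the input side; on outputs, the equivalent is detecting leaked system prompts or policy-violating completions.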
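
As an illustration of the format-validation bullet above, here is a minimal sketch that parses a model response and checks required fields with only the standard library. The `schema` dictionary and the field names are assumptions for the example; a production system would more likely use Pydantic or JSON Schema.

```python
import json

def validate_json_output(raw: str, required_fields: dict) -> tuple:
    """Return (parsed_dict_or_None, errors). required_fields maps name -> expected type."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be a JSON object"]
    errors = []
    for name, expected_type in required_fields.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            errors.append(f"wrong type for {name}: expected {expected_type.__name__}")
    return (data if not errors else None), errors

schema = {"answer": str, "sources": list}  # hypothetical response contract
ok, ok_errors = validate_json_output('{"answer": "42", "sources": ["doc1"]}', schema)
bad, bad_errors = validate_json_output('{"answer": 42}', schema)
broken, broken_errors = validate_json_output('{"answer": ', schema)
print(ok_errors, bad_errors, broken_errors)
```

A failed validation is usually handled by retrying the generation with the error messages appended to the prompt, or by returning a safe fallback.
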
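import asyncio
from nemoguardrails import RailsConfig, LLMRails

config = RailsConfig.from_path("./guardrails_config")
rails = LLMRails(config)

# NeMo Guardrails intercepts at input, dialog, and output levels
async def guarded_reply(user_message: str):
    return await rails.generate_async(
        messages=[{"role": "user", "content": user_message}]
    )

# `await` is only valid inside a coroutine, so drive it with asyncio.run
response = asyncio.run(guarded_reply(user_message))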

NVIDIA NeMo Guardrails: Declarative YAML-based guardrail definition. Input/output/dialog rails. Colang language for defining policies. Can add custom Python actions.

Guardrails AI: Python library, validator-based. Pydantic integration for structured output. Custom validators for any check.

4How do you handle LLM cost management and rate limiting at scale?
import redis
from datetime import date
from typing import Optional

from fastapi import HTTPException
from redis.commands.search.query import Query

class LLMCostManager:
    def __init__(self, redis_client):
        self.redis = redis_client
        # USD per 1M tokens. Illustrative values: check current provider pricing.
        self.MODEL_COSTS = {
            "gpt-4o": {"input": 5.0, "output": 15.0},
            "gpt-4o-mini": {"input": 0.15, "output": 0.60},
            "claude-opus-4-5": {"input": 15.0, "output": 75.0},
        }

    def check_and_consume_budget(self, user_id: str, estimated_tokens: int,
                                  model: str, daily_budget_usd: float = 10.0):
        key = f"cost:{user_id}:{date.today()}"
        # Price every token at the output rate: a deliberately conservative upper bound
        estimated_cost = estimated_tokens / 1e6 * self.MODEL_COSTS[model]["output"]

        # Increment and set expiry together in one MULTI/EXEC transaction
        pipe = self.redis.pipeline()
        pipe.incrbyfloat(key, estimated_cost)
        pipe.expire(key, 86400)  # expire after 24h
        results = pipe.execute()
        total_cost = float(results[0])

        if total_cost > daily_budget_usd:
            # Refund the rejected request so it does not count against the budget
            self.redis.incrbyfloat(key, -estimated_cost)
            raise HTTPException(429, f"Daily budget ${daily_budget_usd} exceeded")

    def select_model(self, complexity: str, user_tier: str) -> str:
        """Route to appropriate model based on complexity and user tier."""
        if user_tier == "free" or complexity == "simple":
            return "gpt-4o-mini"  # $0.15/1M input
        elif complexity == "medium":
            return "gpt-4o"
        else:
            return "claude-opus-4-5"  # only for premium + complex

# Semantic caching — avoid duplicate LLM calls
class SemanticCache:
    def __init__(self, embedder, redis_client, threshold=0.95):
        self.embedder = embedder
        self.redis = redis_client
        self.threshold = threshold

    def get(self, query: str) -> Optional[str]:
        query_emb = self.embedder.encode(query)
        # Vector search in Redis Stack. The index is assumed to store FLOAT32
        # vectors with the COSINE distance metric.
        results = self.redis.ft("cache_idx").search(
            Query("*=>[KNN 1 @embedding $vec AS score]")
                .return_fields("response", "score")
                .dialect(2),  # KNN syntax requires query dialect 2
            query_params={"vec": query_emb.astype("float32").tobytes()}
        )
        # Redis returns a cosine *distance* (0 = identical), so convert to similarity
        if results.docs and 1.0 - float(results.docs[0].score) >= self.threshold:
            return results.docs[0].response
        return None
5How do you build a robust evaluation pipeline for LLM applications in production?
from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator

client = Client()

# Create a dataset for regression testing
dataset = client.create_dataset("production-eval-set")
client.create_examples(
    inputs=[{"question": q} for q in questions],
    outputs=[{"answer": a} for a in golden_answers],
    dataset_id=dataset.id
)

# Define evaluators
evaluators = [
    # LLM-as-judge for quality
    LangChainStringEvaluator("criteria", config={
        "criteria": {
            "correctness": "Is the answer factually correct?",
            "helpfulness": "Is the answer helpful to the user?",
            "conciseness": "Is the answer appropriately concise?",
        }
    }),
    # Deterministic checks
    LangChainStringEvaluator("exact_match"),
    LangChainStringEvaluator("embedding_distance"),
]

# Run evaluation
results = evaluate(
    lambda inputs: rag_pipeline(inputs["question"]),
    data=dataset,
    evaluators=evaluators,
    experiment_prefix="rag-v2.1",
    num_repetitions=3,  # run each example 3 times for variance
)

print(results.to_pandas().describe())
# Track scores over time — alert if correctness drops > 5%

Continuous evaluation in CI/CD:

  1. Maintain a golden test set of 200-500 question-answer pairs
  2. Run evaluation on every PR that changes prompts, models, or retrieval
  3. Block merge if quality drops below threshold on key metrics
  4. Weekly scheduled evaluation against production to detect drift
  5. Store all evaluation results — track quality trends over time
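
The merge-blocking step in the list above can be sketched as a small gate that compares a candidate run against stored baseline scores. The metric names, the scores, and the interpretation of "drop" as an absolute difference of 0.05 are assumptions for this example.

```python
def check_quality_gate(baseline: dict, candidate: dict, max_drop: float = 0.05) -> list:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for metric, base_score in baseline.items():
        if metric not in candidate:
            failures.append(f"{metric}: missing from candidate results")
            continue
        drop = base_score - candidate[metric]
        if drop > max_drop:
            failures.append(f"{metric}: dropped {drop:.3f} (limit {max_drop:.3f})")
    return failures

# Hypothetical aggregate scores from the golden test set
baseline_scores = {"correctness": 0.91, "faithfulness": 0.88}
candidate_scores = {"correctness": 0.84, "faithfulness": 0.89}

failures = check_quality_gate(baseline_scores, candidate_scores)
exit_code = 1 if failures else 0  # a CI job would exit with this code
for message in failures:
    print("FAIL", message)
```

Because LLM-as-judge scores are noisy, it is usually worth averaging over repetitions before applying a threshold like this.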

User feedback loop: Thumbs up/down on every response. Flag for review. These become labeled examples for eval set expansion and fine-tuning data collection. Close the loop between production feedback and model improvement.

6How do you implement structured output extraction reliably from LLMs?
from pydantic import BaseModel, Field
from openai import OpenAI
from anthropic import Anthropic
import instructor

# instructor — structured outputs with automatic retry and validation
client = instructor.from_openai(OpenAI())

class OrderExtraction(BaseModel):
    product_name: str = Field(description="Name of the product ordered")
    quantity: int = Field(ge=1, description="Number of units ordered")
    total_price: float = Field(ge=0, description="Total price in USD")
    delivery_date: str | None = Field(None, description="Delivery date if mentioned, ISO format")
    confidence: float = Field(ge=0, le=1, description="Confidence in extraction accuracy")

# instructor handles retry on validation failure automatically
order = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Extract order details from: {email_text}"}],
    response_model=OrderExtraction,
    max_retries=3,  # retry if Pydantic validation fails
)
print(order.product_name, order.quantity)

# Anthropic with structured output
client_a = instructor.from_anthropic(Anthropic())
result = client_a.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
    response_model=OrderExtraction,
)

# For classification tasks — constrained decoding (faster, more reliable)
from outlines import models, generate

model = models.transformers("meta-llama/Meta-Llama-3-8B-Instruct")
# Force output to be one of these exact strings:
generator = generate.choice(model, ["POSITIVE", "NEGATIVE", "NEUTRAL"])
sentiment = generator(f"Classify: {text}")  # guaranteed valid output

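Constrained decoding (Outlines, Guidance, LMQL): Modify the sampling process at the token level to enforce an output schema by masking invalid tokens at each step. The output is guaranteed to be syntactically valid (parseable JSON, or one of the allowed labels), though not necessarily semantically correct. It can also be faster than prompt-and-retry approaches on structured tasks because no generations are wasted on invalid output; the size of the speedup depends on the model and schema.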
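
The masking idea can be shown with a toy example. This sketch works on characters rather than real tokenizer tokens, and `toy_scores` is a hypothetical stand-in for model logits, so it illustrates the mechanism only and is not how Outlines is implemented. It also assumes no allowed choice is a prefix of another.

```python
def allowed_next_chars(prefix: str, choices: list) -> set:
    """Characters that keep the prefix a valid start of at least one choice."""
    return {c[len(prefix)] for c in choices if c.startswith(prefix) and len(c) > len(prefix)}

def constrained_decode(score_fn, choices: list) -> str:
    """Greedy decoding where invalid next characters are masked out at every step."""
    output = ""
    while output not in choices:
        allowed = sorted(allowed_next_chars(output, choices))  # sorted for determinism
        scores = score_fn(output)
        # Characters the "model" did not score are treated as very unlikely, not invalid
        output += max(allowed, key=lambda ch: scores.get(ch, float("-inf")))
    return output

def toy_scores(prefix: str) -> dict:
    # Hypothetical "logits": the model most wants to emit "m", which is never valid here
    return {"m": 5.0, "a": 4.0, "E": 3.0, "N": 2.0, "P": 1.0, "O": 0.5, "U": 0.1}

choices = ["POSITIVE", "NEGATIVE", "NEUTRAL"]
unconstrained_first = max(toy_scores(""), key=toy_scores("").get)
result = constrained_decode(toy_scores, choices)
print(unconstrained_first, result)
```

Unconstrained greedy decoding would start with an invalid character, while the masked version can only ever produce one of the three labels.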

Responsible AI

5 questions
1What are the main categories of AI bias? How do you measure and mitigate them?

Sources of bias:

  • Historical bias: Training data reflects historical injustices. Loan approval models trained on historical decisions inherit discriminatory patterns.
  • Representation bias: Certain groups underrepresented in training data. Facial recognition fails more on darker skin tones if training data is majority lighter-skinned.
  • Measurement bias: Proxy labels that correlate with protected attributes. Using arrest records (biased by discriminatory policing) as a proxy for criminal behavior.
  • Aggregation bias: One-size-fits-all model fails for subgroups. A medical model trained on population averages may underperform for specific demographics.
  • Deployment bias: Model used in a different context than it was trained for.

Measuring fairness:

from fairlearn.metrics import demographic_parity_difference, equalized_odds_difference
from sklearn.metrics import accuracy_score

# Demographic Parity: P(ŷ=1|A=a) = P(ŷ=1|A=b)  [same positive rate across groups]
dpd = demographic_parity_difference(y_true, y_pred, sensitive_features=gender)

# Equalized Odds: P(ŷ=1|Y=y,A=a) = P(ŷ=1|Y=y,A=b)  [same TPR and FPR]
eod = equalized_odds_difference(y_true, y_pred, sensitive_features=gender)

# Slice-based evaluation: measure performance per demographic group
for group in df["race"].unique():
    subset = df[df["race"] == group]
    acc = accuracy_score(subset["y_true"], subset["y_pred"])
    print(f"{group}: {acc:.3f}")
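
To make the two Fairlearn metrics less of a black box, the sketch below computes the same quantities by hand on a tiny made-up dataset: the gap in selection rate for demographic parity, and the larger of the TPR and FPR gaps for equalized odds. The data and helper names are assumptions for illustration.

```python
def selection_rate(y_pred):
    """Fraction of positive predictions."""
    return sum(y_pred) / len(y_pred)

def rate_where(y_true, y_pred, label):
    """P(y_pred = 1 | y_true = label): TPR for label=1, FPR for label=0."""
    preds = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(preds) / len(preds)

def by_group(values, groups, group):
    return [v for v, g in zip(values, groups) if g == group]

# Toy data (made up): four samples in each of two groups
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = {}
for g in sorted(set(group)):
    yt, yp = by_group(y_true, group, g), by_group(y_pred, group, g)
    rates[g] = {
        "selection": selection_rate(yp),
        "tpr": rate_where(yt, yp, 1),
        "fpr": rate_where(yt, yp, 0),
    }

def spread(metric):
    """Largest gap between any two groups for one metric."""
    values = [r[metric] for r in rates.values()]
    return max(values) - min(values)

dp_difference = spread("selection")
eo_difference = max(spread("tpr"), spread("fpr"))
print(rates, dp_difference, eo_difference)
```

A value of 0 means the groups are treated identically on that criterion; larger values mean a wider gap.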

Mitigation strategies:

  • Pre-processing: resampling, reweighting, data augmentation for underrepresented groups
  • In-processing: fairness constraints in the loss function (Fairlearn's ExponentiatedGradient)
  • Post-processing: calibrate decision thresholds per group (equalized odds post-processing)
  • Regular auditing: monitor slice performance in production, retrain when gaps emerge
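
The reweighting option in the pre-processing bullet can be sketched as follows. One well-known scheme weights each sample by P(group) · P(label) / P(group, label), which makes label and group statistically independent in the weighted data. The toy groups and labels are made up for the example.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Weight each sample by P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]

def weighted_positive_rate(groups, labels, weights, group):
    """Weighted share of positive labels within one group."""
    num = sum(w for g, y, w in zip(groups, labels, weights) if g == group and y == 1)
    den = sum(w for g, w in zip(groups, weights) if g == group)
    return num / den

# Toy data (made up): group "a" has far more positive labels than group "b"
groups = ["a"] * 6 + ["b"] * 4
labels = [1, 1, 1, 1, 0, 0] + [1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
rate_a = weighted_positive_rate(groups, labels, weights, "a")
rate_b = weighted_positive_rate(groups, labels, weights, "b")
print(f"weighted positive rate: a={rate_a:.2f}, b={rate_b:.2f}")
```

The resulting weights can be passed as `sample_weight` to most scikit-learn estimators during training.
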
2What is AI hallucination? Why do LLMs hallucinate and how do you reduce it?

Hallucination refers to LLM outputs that are fluent and confident but factually incorrect, fabricated, or unsupported by the provided context.

Why LLMs hallucinate:

  • Training objective mismatch: LLMs are trained to predict the next token, not to be truthful. The model learns to generate plausible-sounding text, not necessarily accurate text.
  • Knowledge gaps: When the model lacks knowledge about a topic, it may confabulate rather than acknowledge uncertainty.
  • Context vs. parametric knowledge conflict: When provided context contradicts what the model "learned" during training, it may favor training knowledge.
  • Sycophancy: RLHF training that optimizes for human approval can cause models to agree with users even when wrong.
  • Compounding errors: In long generations, early errors propagate and snowball.

Reduction strategies:

  • RAG: Ground responses in retrieved documents. Add explicit instructions: "Only use information from the provided context."
  • Uncertainty expression: Train models to say "I don't know" or express confidence levels.
  • Citation requirement: Require the model to cite specific sources for every claim.
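  • Temperature reduction: Lower temperature makes sampling more deterministic and less creative, which reduces hallucinations caused by sampling unlikely tokens. It does not fix errors the model is confident about.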
  • Self-consistency: Sample multiple responses and select the majority answer — inconsistency flags potential hallucination.
  • Verification step: Second LLM call that checks facts in the first response against the context.
  • Fine-tuning on factuality: RLHF with reward models that penalize factual errors. Constitutional AI with self-critique.
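
The self-consistency bullet above can be sketched in a few lines: sample several answers, take the majority, and treat low agreement as a warning sign. The sample answers and the 0.6 agreement threshold are assumptions for the example; real answers usually need stronger normalization than lowercasing.

```python
from collections import Counter

def self_consistency(answers, min_agreement=0.6):
    """Majority vote over sampled answers; low agreement is flagged for review."""
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    agreement = votes / len(normalized)
    return winner, agreement, agreement < min_agreement

# Hypothetical samples from the same prompt at temperature > 0
samples_stable = ["Paris", "paris", "Paris ", "Paris", "Lyon"]
samples_unstable = ["1947", "1952", "1949", "1947", "1951"]

stable = self_consistency(samples_stable)
unstable = self_consistency(samples_unstable)
print(stable, unstable)
```
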
3What is model interpretability and explainability? Explain SHAP, LIME, and attention visualization.

Interpretability vs Explainability: Interpretability — the model structure is inherently understandable (decision tree, linear regression). Explainability — post-hoc techniques that explain a black-box model's decisions.

SHAP (SHapley Additive exPlanations): Based on game theory's Shapley values. Assigns each feature a contribution value for a specific prediction. The sum of SHAP values equals the difference between prediction and mean prediction.

import shap

explainer = shap.TreeExplainer(xgboost_model)  # fast for tree models
shap_values = explainer.shap_values(X_test)

# Global importance: mean |SHAP| per feature
shap.summary_plot(shap_values, X_test)

# Local explanation: why did the model predict X for this instance?
shap.plots.waterfall(explainer(X_test)[0])
# Shows: base value + each feature's contribution = final prediction

# For transformers (slower): pass a Hugging Face text-classification pipeline,
# e.g. sentiment_pipeline = transformers.pipeline("sentiment-analysis", top_k=None)
explainer = shap.Explainer(sentiment_pipeline)
shap_values = explainer(["This movie was great!"])
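
The additivity property described above (contributions sum to the prediction minus the base value) can be checked with an exact Shapley computation on a tiny model. This is a sketch under assumptions: `toy_model` is hypothetical, and a single all-zeros baseline stands in for the expectation over background data that SHAP actually uses. Enumerating all orderings is only feasible for a handful of features.

```python
from itertools import permutations

def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(instance)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = instance[i]  # switch feature i from baseline to its real value
            new = model(current)
            contributions[i] += new - prev
            prev = new
    return [c / len(orderings) for c in contributions]

def toy_model(x):
    # Hypothetical model with an interaction between features 0 and 1
    return 2.0 * x[0] + 1.0 * x[1] + 3.0 * x[0] * x[1] - 0.5 * x[2]

instance = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi, sum(phi), toy_model(instance) - toy_model(baseline))
```

Note how the interaction term is split evenly between the two features involved, which is the fairness property Shapley values are designed to provide.
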

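LIME (Local Interpretable Model-agnostic Explanations): Fit a simple interpretable model (linear regression) locally around the prediction of interest: perturb the input, get predictions for the perturbed samples, and fit a local linear model weighted by proximity. Less theoretically grounded than SHAP (no additivity guarantee, and explanations can vary between runs because the perturbations are random) but often cheaper than model-agnostic SHAP for text and images.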

Attention visualization: Display attention weights as a heatmap over input tokens. Shows what the model "focuses on" for each output token. However, attention is not directly equal to importance — high attention doesn't always mean high causal impact. Use with caution as an exploratory tool, not a definitive explanation.

Mechanistic interpretability (cutting edge): Understand the internal circuits of neural networks — what specific neurons and attention heads compute. Anthropic's work on superposition, features, and circuits. Informs model safety by understanding how capabilities are encoded.

4What is AI safety? Explain alignment, RLHF limitations, and key safety concerns for powerful AI systems.

AI alignment: Ensuring AI systems pursue goals that are beneficial to humans and consistent with human values. As AI systems become more capable, misalignment between AI objectives and human values becomes increasingly dangerous.

Key safety concerns:

  • Reward hacking: AI finds unintended ways to maximize its reward function that violate the spirit of the objective. Classic example: in OpenAI's CoastRunners boat-racing experiment, the agent learned to maximize score by circling in a lagoon and repeatedly hitting respawning targets rather than finishing the race.
  • Goal misgeneralization: Model learned the "right" behavior in training distribution but pursues a different goal in deployment.
  • Deceptive alignment: A sufficiently capable model could behave well during training/evaluation but differently when deployed.
  • Power-seeking behavior: Instrumental convergence theory — most goals are better achieved with more resources, self-preservation, and avoiding shutdown. Capable systems may naturally develop these sub-goals.
  • Emergent capabilities: Unexpected capabilities appearing at scale that weren't present in smaller models — hard to predict or test for in advance.

RLHF limitations: Reward model is an imperfect proxy for human values. Reward hacking the RM (Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure"). Raters disagree on what's "better." Optimizes for appearing helpful over being genuinely helpful (sycophancy). Can't capture long-term consequences.

Safety techniques: Constitutional AI (self-critique and revision based on principles), RLHF with diverse rater pool, red-teaming, capability evaluations, interpretability research, scalable oversight (using AI to help humans oversee AI).

5What are the key regulations and compliance considerations for AI systems? (EU AI Act, GDPR, CCPA)

EU AI Act (2024, phased enforcement 2024-2027): The world's first comprehensive AI regulation. Risk-based approach:

  • Unacceptable risk (banned): Social scoring, real-time remote biometric identification in public spaces by law enforcement (with narrow exceptions), exploiting vulnerabilities of children, manipulating behavior using subliminal techniques
  • High risk (strict requirements): AI in critical infrastructure, education, employment, essential services, law enforcement, border control, justice. Requirements: conformity assessment, human oversight, transparency, accuracy, robustness, data governance, documentation
  • Limited risk (transparency obligations): Chatbots must disclose they are AI. Deepfakes must be labeled.
  • Minimal risk (no obligation): Spam filters, AI in video games, recommendation systems

GDPR implications for ML:

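  • Right to explanation: Article 22 restricts solely automated decisions that significantly affect individuals, and Articles 13-15 entitle them to meaningful information about the logic involved. Whether this amounts to a full "right to explanation" is debated, but in practice you should be able to explain such decisions.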
  • Right to erasure ("right to be forgotten"): If a user's data was used in training, they can request its removal — technically challenging with ML models
  • Data minimization: Only collect data necessary for the stated purpose
  • Purpose limitation: Can't reuse data collected for one purpose for training a different model without consent
  • Transfer restrictions: Personal data can't be transferred to non-EU countries without adequacy decision or appropriate safeguards

Practical compliance steps: Data lineage tracking (what data trained this model?); model cards documenting intended use, limitations, and performance; bias audits; human oversight mechanisms for high-stakes decisions; privacy-preserving ML techniques (differential privacy, federated learning); legal review for each deployment context.