PyTorch Cheat Sheet
PyTorch has become one of the most popular deep learning frameworks, and questions about it come up in nearly every data science interview. Use this cheat sheet to quickly recall important concepts, functions, and best practices for developing, training, and debugging machine learning models.
Design Philosophy of PyTorch
Core Philosophy
PyTorch prioritizes ease of experimentation over static graph optimization. It is built for developer productivity, flexibility, and debuggability.
Reference:
PyTorch Official Docs, PyTorch 2.x Overview
Key Design Principles
- Eager execution
- Define-by-run dynamic computation graph
- Python-first API with a high-performance C++ backend
- Clear Python stack traces for easier debugging
Interview Perspective
- PyTorch trades some static graph optimization for developer productivity; PyTorch 2.x (torch.compile) partially closes that performance gap, as the sketch below shows.
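A minimal sketch of eager execution next to torch.compile (assumes PyTorch >= 2.0; the function here is illustrative):
import torch
def f(x):
    if x.sum() > 0:            # plain Python control flow, evaluated eagerly
        return torch.sin(x)
    return torch.cos(x)
x = torch.randn(4)
print(f(x))                    # runs line by line; easy to step through and debug
compiled_f = torch.compile(f)  # PyTorch 2.x: capture + optimize, same eager-style API
print(compiled_f(x))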
Why It Is Used
Overview
PyTorch is a deep learning framework used extensively in both research and production because of its flexibility, efficiency, and robust ecosystem.
Reference:
PyTorch Ecosystem
Important Features
- Dynamic computation graph
- NumPy-like, Pythonic syntax
- Powerful GPU acceleration via CUDA
- Extensive use in research and production systems
NumPy vs Tensor (PyTorch)
Comparison
| Aspect | NumPy Array | Tensor (PyTorch) |
|---|---|---|
| Primary use | Numerical computing | Machine learning & deep learning |
| Library | NumPy | PyTorch |
| GPU support | CPU only | CPU & GPU |
| Automatic differentiation | Not supported | Supported (autograd) |
| Performance for ML | Good | Optimized for large-scale ML |
| Mutability | Mutable | Usually mutable |
| Data type flexibility | Single dtype per array | Single dtype per tensor |
| Broadcasting | Yes | Yes |
| Matrix operations | Via NumPy / SciPy | Built-in & optimized |
| Parallelism | Limited | Advanced (GPU / TPU) |
| Serialization | .npy, .npz | Framework-specific formats |
| Integration | Scientific Python stack | Deep learning ecosystems |
| Conversion | Convertible to tensors via torch.from_numpy | Easily convertible to/from NumPy (.numpy()) |
| Device awareness | No device concept | Tracks CPU/GPU device |
| Training models | Not suitable | Core functionality |
Framework Selection Guide
When to Choose PyTorch
- Research and experimentation
- Quick prototyping
- Custom or novel architectures
- Learning deep learning concepts
- NLP and advanced model development
When to Select TensorFlow
- Large-scale production systems
- Embedded or mobile deployment
- TPU-intensive workloads
- Enterprise ML pipelines
PyTorch Core Modules
Modules
- torch → tensors and mathematical operations
- torch.nn → neural network layers and modules
- torch.optim → optimization algorithms
- torch.autograd → automatic differentiation
- torch.utils.data → datasets and dataloaders
Typical PyTorch Workflow
End-to-End Flow
- Load data
- Define model
- Define loss function
- Define optimizer
- Training loop
- Evaluation
Tensors
Tensor Components
- Storage: 1D contiguous memory buffer
- Sizes (shape)
- Strides: Steps (in elements) to move across dimensions
- Offset: Starting position in storage
- dtype
- device
Example: Accessing Storage & Strides
x = torch.randn(3, 4)
x.storage()
x.stride()
Creating Tensors
- torch.tensor([1, 2, 3])
- torch.zeros(3, 4)
- torch.ones(2, 2)
- torch.empty(5, 5)
- torch.eye(3)
- torch.arange(0, 10, 2)
- torch.linspace(0, 1, 5)
Random Tensors
- torch.rand(3, 3)
- torch.randn(3, 3) # Normal distribution
From NumPy
- import numpy as np
- a = np.array([1, 2, 3])
- t = torch.from_numpy(a)
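Worth remembering: torch.from_numpy shares memory with the source array (CPU only). A minimal sketch:
import numpy as np
import torch
a = np.array([1, 2, 3])
t = torch.from_numpy(a)   # no copy: t and a share the same buffer
a[0] = 99
print(t)                  # tensor([99,  2,  3])
back = t.numpy()          # converts back to NumPy, again without copying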
Dynamic vs Static Graphs
| Aspect | Dynamic Graph | Static Graph |
|---|---|---|
| Graph creation | Built at runtime | Built before execution |
| Flexibility | Very high | Limited |
| Control flow | Native Python (if, for) | Special graph constructs |
| Debugging | Easy (step-by-step) | Harder |
| Code style | Imperative | Declarative |
| Performance | Slight overhead | Highly optimized |
| Memory optimization | Limited | Advanced |
| Reusability | Less reusable | Easily saved & reused |
| Production deployment | Weaker historically | Strong |
| Learning curve | Easy | Steep |
Tensor Properties
Attributes
- t.shape
- t.dtype
- t.device
- t.requires_grad
Data Types
- torch.float32
- torch.float64
- torch.int64
- torch.bool
Type Casting
- t.float()
- t.long()
- t.to(torch.float32)
Tensor Operations
Arithmetic
- a + b
- a - b
- a * b
- a / b
- torch.add(a, b)
Matrix Operations
- torch.matmul(a, b)
- a @ b
- a.T
- torch.mm(a, b)
Reductions
- torch.sum(a)
- torch.mean(a)
- torch.max(a)
- torch.min(a)
Broadcasting Rules
- Dimensions align from right
- Size must match or be 1
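The two rules above in a short sketch:
import torch
a = torch.ones(3, 4)                  # shape (3, 4)
b = torch.tensor([1., 2., 3., 4.])    # shape (4,) -> aligned from the right as (1, 4)
print((a + b).shape)                  # torch.Size([3, 4])
c = torch.ones(3, 1)                  # size 1 stretches to match 4
print((a + c).shape)                  # torch.Size([3, 4])
# a + torch.ones(3, 5)                # RuntimeError: sizes 4 and 5 do not match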
Indexing & Slicing
Basic Indexing
- t[0]
- t[:, 1]
- t[1:3]
- t[t > 0]
Advanced Indexing
- indices = torch.tensor([0, 2])
- t[indices]
Reshaping
- t.view(2, 3)
- t.reshape(2, 3)
- t.squeeze()
- t.unsqueeze(1)
Strides Explained (FAANG favorite)
Example
- x = torch.randn(3, 4) # shape (3,4), strides (4,1)
- Move across columns: 1 step
- Move across rows: 4 steps
- Transpose: y = x.t() # shape (4,3), strides (1,4)
- No data duplication
Crucial Info
Many tensor operations return views rather than copies.
View vs Copy
View (Memory Sharing)
- view
- reshape (returns a view when possible)
- transpose
- squeeze / unsqueeze
Copy (New Memory)
- clone
FAANG Trap Question
view() fails for non-contiguous tensors. Use x.contiguous().view(-1) after transposing.
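A minimal sketch of the trap and the fix:
import torch
x = torch.randn(3, 4)
y = x.t()                          # view with strides (1, 4)
print(y.is_contiguous())           # False
try:
    y.view(-1)                     # view() needs contiguous memory
except RuntimeError as e:
    print("view failed:", e)
flat = y.contiguous().view(-1)     # copy to row-major order, then view
flat2 = y.reshape(-1)              # reshape copies only when it must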
Memory Continuity
Key Points
- Contiguous tensor: memory arranged in row-major order
- Necessary for many CUDA kernels
- Check with x.is_contiguous()
- Call .contiguous() after transpose/permute before heavy operations
Device Management (CPU / GPU)
Basic Usage
- device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
- t = t.to(device)
Common Pitfall
❌ Mixing CPU and GPU tensors
✅ Always move model & data to the same device
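A minimal sketch of the correct pattern (layer sizes are illustrative):
import torch
import torch.nn as nn
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = nn.Linear(10, 2).to(device)     # parameters moved to `device`
x = torch.randn(32, 10).to(device)      # data moved to the same device
out = model(x)                          # OK: everything lives on one device
# model(torch.randn(32, 10))            # RuntimeError if model is on CUDA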
Autograd Overview
What Autograd Is
- Dynamic automatic differentiation engine
- Records tensor operations during forward pass
- Constructs a Directed Acyclic Graph (DAG)
- reverse-mode autodiff executed via .backward()
- Non-leaf tensors have .grad_fn, leaf tensors accumulate .grad
Interview Perspective
Understand dynamic graph construction and backward mechanics for FAANG interviews.
Leaf vs Non-Leaf Tensors
Leaf Tensor
- Created directly by user
- requires_grad=True
- .grad populated after backward
- Example: x = torch.randn(3, requires_grad=True)
Non-Leaf Tensor
- Result of an operation
- Has grad_fn
- .grad NOT retained by default
- Example: y = x * 2
- To retain gradient: y.retain_grad()
Interview Trap
Why is y.grad None after backward? Because non-leaf tensors do not retain gradients by default.
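A minimal sketch of the trap:
import torch
x = torch.randn(3, requires_grad=True)  # leaf
y = x * 2                               # non-leaf: has grad_fn
y.retain_grad()                         # opt in to keeping y.grad
z = y.sum()
z.backward()
print(x.grad)      # tensor([2., 2., 2.]) - leaf grads are populated
print(y.grad_fn)   # <MulBackward0 ...>
print(y.grad)      # tensor([1., 1., 1.]) - only because of retain_grad()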
Computation Graph Internals
Key Points
- Each operation creates a Function node
- Example: x → MulBackward → MeanBackward → z
- Graph is freed after .backward() by default
- Use retain_graph=True to keep the graph
Backward Mechanics
Default Gradient
- z.backward() # equivalent to z.backward(torch.ones_like(z))
Non-Scalar Backward
- y.backward(gradient=torch.ones_like(y))
FAANG Expectation
Understand vector-Jacobian product (VJP) intuition: backward() computes vᵀJ, where v is the gradient argument (see the sketch below).
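A minimal sketch of a non-scalar backward as a vector-Jacobian product:
import torch
x = torch.randn(3, requires_grad=True)
y = x ** 2                        # non-scalar output
v = torch.ones_like(y)
y.backward(gradient=v)            # vector-Jacobian product with v = ones
print(x.grad)                     # 2 * x
x.grad = None
(x ** 2).sum().backward()         # scalar loss: gradient defaults to 1
print(x.grad)                     # same result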
Gradient Accumulation
Key Points
- Gradients accumulate by default
- Example: two backward() calls without zeroing in between → gradients are summed (doubled)
- Must clear gradients: optimizer.zero_grad()
Interview Insight
Accumulating over several micro-batches simulates a larger effective batch size (see the sketch below).
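A gradient-accumulation sketch (assumes model, criterion, optimizer, and dataloader as in the canonical training loop later in this sheet):
accum_steps = 4                                   # effective batch = 4x the loader batch
optimizer.zero_grad()
for i, (x, y) in enumerate(dataloader):
    loss = criterion(model(x), y) / accum_steps   # scale so accumulated grads average
    loss.backward()                               # grads add up in .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()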
Version Counters & In-Place Ops
Key Points
- Every tensor has a version counter
- In-place modification increases version
- Autograd checks version consistency
- Example: modifying a saved tensor in place (e.g. y.add_(1) after z = y * y) makes z.backward() raise a RuntimeError, as the sketch below shows
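A minimal sketch of the failure mode:
import torch
x = torch.randn(3, requires_grad=True)
y = x * 2
z = y * y          # autograd saves y (version 0) for the backward pass
y.add_(1)          # in-place op bumps y's version counter
try:
    z.sum().backward()
except RuntimeError as e:
    print(e)       # "... has been modified by an inplace operation ..."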
Detach vs No-Grad vs Inference Mode
.detach()
- Returns a tensor sharing storage
- Stops gradient tracking
- Example: y = x.detach()
torch.no_grad()
- Context manager to temporarily disable graph construction
- Example: with torch.no_grad(): y = model(x)
torch.inference_mode()
- Stronger than no_grad
- Disables version counters
- Faster inference
FAANG Rule
Use inference_mode for production inference.
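The three options side by side (a minimal sketch; layer sizes are illustrative):
import torch
import torch.nn as nn
model = nn.Linear(10, 2)
x = torch.randn(4, 10)
y = model(x).detach()             # shares storage, cut from the graph
with torch.no_grad():             # no graph is built inside the block
    y_eval = model(x)
with torch.inference_mode():      # stricter: also skips version counting
    y_infer = model(x)
print(y.requires_grad, y_eval.requires_grad, y_infer.requires_grad)  # False False False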
requires_grad Propagation Rules
Key Points
- x.requires_grad = True → y = x*2 inherits requires_grad
- Exceptions: integer tensors cannot require gradients
- Operations with constants inherit from tensors
Saved Tensors & Memory
Key Points
- Autograd saves intermediate tensors during forward pass
- Needed for backward computation
- Memory-heavy ops: attention, large matmuls, conv layers
Optimization
Use torch.utils.checkpoint to trade compute for memory.
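A minimal checkpointing sketch (assumes a recent PyTorch that accepts use_reentrant; sizes are illustrative):
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
block = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
x = torch.randn(8, 512, requires_grad=True)
out = checkpoint(block, x, use_reentrant=False)  # activations recomputed during backward
out.sum().backward()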
Custom Autograd Functions
Why Custom Functions?
- Custom CUDA ops
- Memory optimization
- Numerical stability
Implementation Example
class MyFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 2
    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        return grad_output * 2 * x
y = MyFunc.apply(x)
Rules
No in-place ops; backward must return gradients for each input
Higher-Order Gradients
Key Points
- Enable graph during backward: loss.backward(create_graph=True)
- Used in meta-learning, gradient penalties, second-order optimizers
- ⚠️ Memory explosion risk
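A gradient-penalty style sketch of second-order gradients:
import torch
x = torch.randn(5, requires_grad=True)
loss = (x ** 3).sum()
(grad,) = torch.autograd.grad(loss, x, create_graph=True)  # grad itself is differentiable
penalty = grad.pow(2).sum()       # scalar function of the gradient
penalty.backward()
print(x.grad)                     # second-order quantity: 36 * x**3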
Hooks (Advanced & Dangerous)
Tensor Hooks
- x.register_hook(lambda grad: grad * 0.5)
Module Hooks
- Forward hooks
- Backward hooks
Use Cases
Debugging, gradient clipping, visualization
FAANG Warning
Hooks can break distributed training if misused.
Common Silent Bugs (Interview Gold)
Examples
- Using .data: x.data *= 2 # breaks autograd silently
- In-place ReLU: nn.ReLU(inplace=True) # can break autograd if the input is needed for backward
- Forgetting model.train() / model.eval()
- Mixing numpy and torch tensors
Numerical Stability
Tips
- Use logsumexp
- Avoid softmax followed by log; use log_softmax instead
- Use fused losses like CrossEntropyLoss
Mental Model FAANG Wants
Visualize
- Graph creation during forward pass
- Gradient flow during backward pass
- Memory saved per operation
- Where synchronization happens
Computation Graph
Key Points
- DAG of tensor operations
- Built dynamically
- Freed after .backward() unless retain_graph=True
- Disable gradients with: with torch.no_grad(): y = model(x)
What nn.Module Really Is
Core Features
- Parameter registration
- Submodule tracking
- Mode switching (train / eval)
- State serialization
- Hook infrastructure
- Device & dtype propagation
FAANG Insight
Everything in PyTorch training builds on this abstraction.
Parameter Registration (CRITICAL)
How Parameters Are Registered
- Attributes assigned as nn.Parameter are registered
- Stored in self._parameters
- Accessible via model.parameters(), state_dict(), and optimizer
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.randn(10))
What Is NOT Registered
- Plain tensors: self.w = torch.randn(10)
- Python lists: self.layers = [nn.Linear(10, 10)]
- Use nn.ModuleList, nn.ModuleDict, nn.ParameterList for collections
Interview Trap
Why doesn't my model learn even though the forward pass runs fine? Parameters that are not registered never appear in model.parameters(), so the optimizer never updates them.
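A minimal sketch of the trap and the fix (layer sizes are illustrative):
import torch.nn as nn
class Broken(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(10, 10) for _ in range(3)]   # plain list: NOT registered
class Fixed(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(10, 10) for _ in range(3))
print(len(list(Broken().parameters())))   # 0 -> the optimizer never sees them
print(len(list(Fixed().parameters())))    # 6 (weight + bias per layer)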
Buffers (Often Overlooked)
Key Points
- Tensors not optimized but saved in state_dict
- Move automatically with .to(device)
- Used in BatchNorm, running statistics, masks
self.register_buffer('running_mean', torch.zeros(10))
FAANG Insight
If it must be saved & moved but not trained → buffer.
Forward Pass Mechanics
__call__ vs forward
- output = model(x) calls __call__
- __call__ does: pre-forward hooks, calls forward, post-forward hooks, handles autograd
- Never call forward() directly
model.train() vs model.eval()
Affects Only Certain Layers
- Dropout (disabled in eval mode)
- BatchNorm (uses running statistics in eval mode)
- Note: LayerNorm behaves the same in train and eval (it keeps no running statistics)
Does NOT
- Disable gradient tracking
- Change requires_grad
FAANG Bug
Validation accuracy fluctuates wildly → forgot eval()
Proper Eval Context
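A minimal sketch (assumes model and val_loader are defined as in the training loop later in this sheet):
model.eval()                      # switch Dropout / BatchNorm to eval behavior
with torch.no_grad():             # skip graph construction during validation
    for x, y in val_loader:
        pred = model(x)
        # accumulate metrics here
model.train()                     # switch back before resuming training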
State Dict Internals
Contents
- Parameters
- Buffers
- Names reflect module hierarchy
model.state_dict().keys()
Saving Best Practice
torch.save(model.state_dict(), path); load with model.load_state_dict(torch.load(path))
FAANG Rule
Never pickle entire model for production.
Weight Initialization (Deep Dive)
Default Initialization
- Linear → Kaiming uniform (a=√5)
- Conv → Kaiming uniform (a=√5)
Custom Initialization
Interview Question
Why does bad initialization kill deep networks? (vanishing or exploding activations and gradients)
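A common custom-initialization pattern (a sketch; the Xavier choice here is illustrative):
import torch.nn as nn
def init_weights(m):
    if isinstance(m, nn.Linear):              # target only Linear layers
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))
model.apply(init_weights)                     # .apply() recurses over all submodules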
Module Hooks (Advanced)
Forward & Backward Hooks
def hook(module, inp, out):
    ...
handle = module.register_forward_hook(hook)
register_backward_hook is deprecated; use register_full_backward_hook instead
Used For
- Feature extraction
- Debugging gradients
- Visualization
FAANG Warning
Hooks + DDP = dangerous if misused.
Parameter Sharing
Key Points
- self.linear = nn.Linear(10, 10); self.linear2 = self.linear
- Same weights
- One set of gradients
Common Use Cases
- RNNs
- Siamese networks
- Tied embeddings
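A tied-embeddings sketch (sizes are illustrative): assigning one Parameter to two modules means a single weight and a single gradient.
import torch.nn as nn
class TiedLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.Linear(d_model, vocab_size, bias=False)
        self.decoder.weight = self.embed.weight          # shared Parameter
    def forward(self, tokens):
        return self.decoder(self.embed(tokens))
model = TiedLM()
print(sum(p.numel() for p in model.parameters()))        # 64000: the shared weight is counted once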
Freezing & Unfreezing
Freeze Parameters
FAANG Gotcha
Optimizer still holds frozen params unless recreated or filtered.
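A sketch of the clean pattern: freeze, then build the optimizer over trainable parameters only (layer sizes are illustrative):
import torch
import torch.nn as nn
model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 2))
for param in model[0].parameters():           # freeze the first layer
    param.requires_grad = False
optimizer = torch.optim.Adam(                 # optimizer never holds the frozen params
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)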
Clean Model Design Patterns
Pattern 1: Feature Extractor + Head
- self.backbone
- self.head
Pattern 2: Config-Driven Modules
- nn.ModuleDict
Pattern 3: Reusable Blocks
- class Block(nn.Module): ...
Debugging Models (Infra Style)
Key Checks
- Print named_parameters()
- Check gradient norms
- Verify device placement
- Assert shapes
for n, p in model.named_parameters():
    if p.grad is not None:        # grads exist only after backward()
        print(n, p.grad.norm())
Memory Leaks in Modules
Common Causes
- Storing tensors from forward pass
- Keeping graph references
- Hooks not removed
Remove Hook to Avoid Leaks
handle.remove()
PyTorch FAANG Interview Cheatsheet
FAANG Interview Checklist
Reference:
Official PyTorch Documentation, Real Python
- Parameter vs buffer
- Why ModuleList matters
- How __call__ works
- Why eval() matters
- Why pickling models is bad
Common Layers
- Linear: nn.Linear(in_features, out_features)
- Convolution: nn.Conv2d(in_channels, out_channels, kernel_size)
- Pooling: nn.MaxPool2d(2), nn.AvgPool2d(2)
- Normalization: nn.BatchNorm1d(num_features), nn.LayerNorm(shape)
Activation Functions
- nn.ReLU()
- nn.LeakyReLU(0.01)
- nn.Sigmoid()
- nn.Tanh()
- nn.Softmax(dim=1)
Loss Functions
- Regression: nn.MSELoss(), nn.L1Loss()
- Classification: nn.CrossEntropyLoss() (combines LogSoftmax and NLLLoss, so pass raw logits), nn.BCELoss(), nn.BCEWithLogitsLoss()
Optimizers
- lr
- weight_decay
- momentum
import torch.optim as optim
optim.SGD(model.parameters(), lr=0.01)
optim.Adam(model.parameters(), lr=0.001)
optim.AdamW(model.parameters())
optim.RMSprop(model.parameters())
Training Loop (Canonical)
for epoch in range(epochs):
    model.train()
    for x, y in dataloader:
        optimizer.zero_grad()
        pred = model(x)
        loss = criterion(pred, y)
        loss.backward()
        optimizer.step()
    # Evaluation
    model.eval()
    with torch.no_grad():
        ...
Dataset & DataLoader
from torch.utils.data import Dataset, DataLoader
class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
dataset = MyDataset(X, y)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
DataLoader Performance
- Key Parameters: batch_size, shuffle, num_workers, pin_memory
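A performance-oriented configuration sketch (values are illustrative; assumes the dataset from the block above and GPU training for pin_memory):
dataloader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,              # parallel worker processes for loading
    pin_memory=True,            # faster host-to-GPU copies
    persistent_workers=True,    # keep workers alive across epochs
)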
Model Saving & Loading
torch.save(model.state_dict(), 'model.pth')
model.load_state_dict(torch.load('model.pth'))
model.eval()
# Full model save (Not Recommended)
torch.save(model, 'model.pth')
Weight Initialization
- nn.init.xavier_uniform_(layer.weight)
- nn.init.kaiming_normal_(layer.weight)
Regularization
- Dropout: nn.Dropout(p=0.5)
- Weight Decay: optim.Adam(model.parameters(), weight_decay=1e-4)
- Early Stopping: Monitor validation loss, stop when loss increases
Learning Rate Scheduling
- ReduceLROnPlateau
- CosineAnnealingLR
from torch.optim.lr_scheduler import StepLR
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
scheduler.step()
Transfer Learning
- Freeze Layers: for param in model.parameters(): param.requires_grad = False
- Replace Head: model.fc = nn.Linear(512, num_classes)
CNN Interview Essentials
- Conv Output Size: O = ⌊(W − K + 2P) / S⌋ + 1
- Key Concepts: Receptive field, Stride, Padding, Channels vs spatial dims
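A quick worked example of the formula (values chosen for illustration): W=32, K=3, P=1, S=2 gives O = ⌊(32 − 3 + 2)/2⌋ + 1 = 16.
import torch
import torch.nn as nn
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)            # torch.Size([1, 8, 16, 16])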
RNN / LSTM Basics
- nn.RNN(input_size, hidden_size)
- nn.LSTM(input_size, hidden_size)
- nn.GRU(input_size, hidden_size)
- Shapes: (seq_len, batch, features) by default; (batch, seq_len, features) with batch_first=True
Transformers (High Level)
- Core Blocks: Embedding, Multi-head self-attention, Feedforward, Residual + LayerNorm
- nn.TransformerEncoderLayer(d_model, nhead)
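A minimal encoder sketch (dimensions are illustrative; batch_first=True keeps the (batch, seq_len, features) layout):
import torch
import torch.nn as nn
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)
x = torch.randn(2, 10, 512)     # (batch, seq_len, d_model)
print(encoder(x).shape)         # torch.Size([2, 10, 512])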
Mixed Precision Training
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
for x, y in dataloader:
    optimizer.zero_grad()
    with autocast():                    # ops run in lower precision where safe
        pred = model(x)
        loss = criterion(pred, y)
    scaler.scale(loss).backward()       # scale the loss to avoid gradient underflow
    scaler.step(optimizer)
    scaler.update()
Debugging Tips
- Common Errors: Shape mismatch, Forgetting .zero_grad(), Using .item() inside graph, Wrong loss for task
- Debug Tools: print(t.shape), torch.isnan(t).any()
Performance Optimization
- Use torch.no_grad() for inference
- Use .contiguous()
- Avoid Python loops
- Batch operations
PyTorch vs NumPy (Interview)
| Feature | NumPy | PyTorch |
|---|---|---|
| GPU | ❌ | ✅ |
| Autograd | ❌ | ✅ |
| DL | ❌ | ✅ |
