PyTorch Cheat Sheet
PyTorch has become one of the most popular deep learning frameworks, and questions about it come up in nearly every data science interview. Use this cheat sheet to quickly recall important concepts, functions, and best practices for developing, training, and debugging machine learning models.
Design Philosophy of PyTorch
Core Philosophy
PyTorch prioritizes ease of experimentation over static graph optimization. It is built for developer productivity, flexibility, and debuggability.
Reference:
PyTorch Official Docs, PyTorch 2.x Overview
Key Design Principles
- Eager execution
- Define-by-run dynamic computation graph
- Python-first API with a high-performance C++ backend
- Clear Python stack traces for easier debugging
Interview Perspective
- PyTorch trades some static graph optimization for developer productivity; PyTorch 2.x (torch.compile) partially closes that performance gap, as the sketch below shows.
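A minimal sketch of eager execution next to torch.compile (assumes PyTorch >= 2.0; the function here is illustrative):
import torch
def f(x):
    if x.sum() > 0:            # plain Python control flow, evaluated eagerly
        return torch.sin(x)
    return torch.cos(x)
x = torch.randn(4)
print(f(x))                    # runs line by line; easy to step through and debug
compiled_f = torch.compile(f)  # PyTorch 2.x: capture + optimize, same eager-style API
print(compiled_f(x))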
Why It Is Used
Overview
PyTorch is a deep learning framework used extensively in both research and production because of its flexibility, efficiency, and robust ecosystem.
Reference:
PyTorch Ecosystem
Important Features
- Dynamic computation graph
- NumPy-like, Pythonic syntax
- Powerful GPU acceleration via CUDA
- Extensive use in research and production systems
NumPy vs Tensor (PyTorch)
Comparison
| Aspect | NumPy Array | Tensor (PyTorch) |
|---|---|---|
| Primary use | Numerical computing | Machine learning & deep learning |
| Library | NumPy | PyTorch |
| GPU support | CPU only | CPU & GPU |
| Automatic differentiation | Not supported | Supported (autograd) |
| Performance for ML | Good | Optimized for large-scale ML |
| Mutability | Mutable | Usually mutable |
| Data type flexibility | Single dtype per array | Single dtype per tensor |
| Broadcasting | Yes | Yes |
| Matrix operations | Via NumPy / SciPy | Built-in & optimized |
| Parallelism | Limited | Advanced (GPU / TPU) |
| Serialization | .npy, .npz | Framework-specific formats |
| Integration | Scientific Python stack | Deep learning ecosystems |
| Conversion | Convertible to tensors via torch.from_numpy | Easily convertible to/from NumPy (.numpy()) |
| Device awareness | No device concept | Tracks CPU/GPU device |
| Training models | Not suitable | Core functionality |
Framework Selection Guide
When to Choose PyTorch
- Research and experimentation
- Quick prototyping
- Custom or novel architectures
- Learning deep learning concepts
- NLP and advanced model development
When to Select TensorFlow
- Large-scale production systems
- Embedded or mobile deployment
- TPU-intensive workloads
- Enterprise ML pipelines
PyTorch Core Modules
Modules
- torch → tensors and mathematical operations
- torch.nn → neural network layers and modules
- torch.optim → optimization algorithms
- torch.autograd → automatic differentiation
- torch.utils.data → datasets and dataloaders
Typical PyTorch Workflow
End-to-End Flow
- Load data
- Define model
- Define loss function
- Define optimizer
- Training loop
- Evaluation
Tensors
Tensor Components
- Storage: 1D contiguous memory buffer
- Sizes (shape)
- Strides: Steps (in elements) to move across dimensions
- Offset: Starting position in storage
- dtype
- device
Example: Accessing Storage & Strides
x = torch.randn(3, 4)
x.storage()
x.stride()
Creating Tensors
- torch.tensor([1, 2, 3])
- torch.zeros(3, 4)
- torch.ones(2, 2)
- torch.empty(5, 5)
- torch.eye(3)
- torch.arange(0, 10, 2)
- torch.linspace(0, 1, 5)
Random Tensors
- torch.rand(3, 3)
- torch.randn(3, 3) # Normal distribution
From NumPy
- import numpy as np
- a = np.array([1, 2, 3])
- t = torch.from_numpy(a)
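Worth remembering: torch.from_numpy shares memory with the source array (CPU only). A minimal sketch:
import numpy as np
import torch
a = np.array([1, 2, 3])
t = torch.from_numpy(a)   # no copy: t and a share the same buffer
a[0] = 99
print(t)                  # tensor([99,  2,  3])
back = t.numpy()          # converts back to NumPy, again without copying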
Dynamic vs Static Graphs
| Aspect | Dynamic Graph | Static Graph |
|---|---|---|
| Graph creation | Built at runtime | Built before execution |
| Flexibility | Very high | Limited |
| Control flow | Native Python (if, for) | Special graph constructs |
| Debugging | Easy (step-by-step) | Harder |
| Code style | Imperative | Declarative |
| Performance | Slight overhead | Highly optimized |
| Memory optimization | Limited | Advanced |
| Reusability | Less reusable | Easily saved & reused |
| Production deployment | Weaker historically | Strong |
| Learning curve | Easy | Steep |
Tensor Properties
Attributes
- t.shape
- t.dtype
- t.device
- t.requires_grad
Data Types
- torch.float32
- torch.float64
- torch.int64
- torch.bool
Type Casting
- t.float()
- t.long()
- t.to(torch.float32)
Tensor Operations
Arithmetic
- a + b
- a - b
- a * b
- a / b
- torch.add(a, b)
Matrix Operations
- torch.matmul(a, b)
- a @ b
- a.T
- torch.mm(a, b)
Reductions
- torch.sum(a)
- torch.mean(a)
- torch.max(a)
- torch.min(a)
Broadcasting Rules
- Dimensions align from right
- Size must match or be 1
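The two rules above in a short sketch:
import torch
a = torch.ones(3, 4)                  # shape (3, 4)
b = torch.tensor([1., 2., 3., 4.])    # shape (4,) -> aligned from the right as (1, 4)
print((a + b).shape)                  # torch.Size([3, 4])
c = torch.ones(3, 1)                  # size 1 stretches to match 4
print((a + c).shape)                  # torch.Size([3, 4])
# a + torch.ones(3, 5)                # RuntimeError: sizes 4 and 5 do not match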
Indexing & Slicing
Basic Indexing
- t[0]
- t[:, 1]
- t[1:3]
- t[t > 0]
Advanced Indexing
- indices = torch.tensor([0, 2])
- t[indices]
Reshaping
- t.view(2, 3)
- t.reshape(2, 3)
- t.squeeze()
- t.unsqueeze(1)
Strides Explained (FAANG favorite)
Example
- x = torch.randn(3, 4) # shape (3,4), strides (4,1)
- Move across columns: 1 step
- Move across rows: 4 steps
- Transpose: y = x.t() # shape (4,3), strides (1,4)
- No data duplication
Crucial Info
Many tensor operations return views rather than copies.
View vs Copy
View (Memory Sharing)
- view
- reshape (returns a view when possible)
- transpose
- squeeze / unsqueeze
Copy (New Memory)
- clone
FAANG Trap Question
view() fails for non-contiguous tensors. Use x.contiguous().view(-1) after transposing.
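A minimal sketch of the trap and the fix:
import torch
x = torch.randn(3, 4)
y = x.t()                          # view with strides (1, 4)
print(y.is_contiguous())           # False
try:
    y.view(-1)                     # view() needs contiguous memory
except RuntimeError as e:
    print("view failed:", e)
flat = y.contiguous().view(-1)     # copy to row-major order, then view
flat2 = y.reshape(-1)              # reshape copies only when it must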
Memory Continuity
Key Points
- Contiguous tensor: memory arranged in row-major order
- Necessary for many CUDA kernels
- Check with x.is_contiguous()
- Call .contiguous() after transpose/permute before heavy operations
Device Management (CPU / GPU)
Basic Usage
- device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
- t = t.to(device)
Common Pitfall
❌ Mixing CPU and GPU tensors
✅ Always move model & data to the same device
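A minimal sketch of the correct pattern (layer sizes are illustrative):
import torch
import torch.nn as nn
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = nn.Linear(10, 2).to(device)     # parameters moved to `device`
x = torch.randn(32, 10).to(device)      # data moved to the same device
out = model(x)                          # OK: everything lives on one device
# model(torch.randn(32, 10))            # RuntimeError if model is on CUDA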
Autograd Overview
What Autograd Is
- Dynamic automatic differentiation engine
- Records tensor operations during forward pass
- Constructs a Directed Acyclic Graph (DAG)
- reverse-mode autodiff executed via .backward()
- Non-leaf tensors have .grad_fn, leaf tensors accumulate .grad
Interview Perspective
Understand dynamic graph construction and backward mechanics for FAANG interviews.
Leaf vs Non-Leaf Tensors
Leaf Tensor
- Created directly by user
- requires_grad=True
- .grad populated after backward
- Example: x = torch.randn(3, requires_grad=True)
Non-Leaf Tensor
- Result of an operation
- Has grad_fn
- .grad NOT retained by default
- Example: y = x * 2
- To retain gradient: y.retain_grad()
Interview Trap
Why is y.grad None after backward? Because non-leaf tensors do not retain gradients by default.
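A minimal sketch of the trap:
import torch
x = torch.randn(3, requires_grad=True)  # leaf
y = x * 2                               # non-leaf: has grad_fn
y.retain_grad()                         # opt in to keeping y.grad
z = y.sum()
z.backward()
print(x.grad)      # tensor([2., 2., 2.]) - leaf grads are populated
print(y.grad_fn)   # <MulBackward0 ...>
print(y.grad)      # tensor([1., 1., 1.]) - only because of retain_grad()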
Computation Graph Internals
Key Points
- Each operation creates a Function node
- Example: x → MulBackward → MeanBackward → z
- Graph is freed after .backward() by default
- Use retain_graph=True to keep the graph
Backward Mechanics
Default Gradient
- z.backward() # equivalent to z.backward(torch.ones_like(z))
Non-Scalar Backward
- y.backward(gradient=torch.ones_like(y))
FAANG Expectation
Understand vector-Jacobian product (VJP) intuition: backward() computes vᵀJ, where v is the gradient argument (see the sketch below).
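A minimal sketch of a non-scalar backward as a vector-Jacobian product:
import torch
x = torch.randn(3, requires_grad=True)
y = x ** 2                        # non-scalar output
v = torch.ones_like(y)
y.backward(gradient=v)            # vector-Jacobian product with v = ones
print(x.grad)                     # 2 * x
x.grad = None
(x ** 2).sum().backward()         # scalar loss: gradient defaults to 1
print(x.grad)                     # same result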
Gradient Accumulation
Key Points
- Gradients accumulate by default
- Example: two backward() calls without zeroing in between → gradients are summed (doubled)
- Must clear gradients: optimizer.zero_grad()
Interview Insight
Accumulating over several micro-batches simulates a larger effective batch size (see the sketch below).
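A gradient-accumulation sketch (assumes model, criterion, optimizer, and dataloader as in the canonical training loop later in this sheet):
accum_steps = 4                                   # effective batch = 4x the loader batch
optimizer.zero_grad()
for i, (x, y) in enumerate(dataloader):
    loss = criterion(model(x), y) / accum_steps   # scale so accumulated grads average
    loss.backward()                               # grads add up in .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()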
Version Counters & In-Place Ops
Key Points
- Every tensor has a version counter
- In-place modification increases version
- Autograd checks version consistency
- Example: modifying a saved tensor in place (e.g. y.add_(1) after z = y * y) makes z.backward() raise a RuntimeError, as the sketch below shows
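A minimal sketch of the failure mode:
import torch
x = torch.randn(3, requires_grad=True)
y = x * 2
z = y * y          # autograd saves y (version 0) for the backward pass
y.add_(1)          # in-place op bumps y's version counter
try:
    z.sum().backward()
except RuntimeError as e:
    print(e)       # "... has been modified by an inplace operation ..."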
Detach vs No-Grad vs Inference Mode
.detach()
- Returns a tensor sharing storage
- Stops gradient tracking
- Example: y = x.detach()
torch.no_grad()
- Context manager to temporarily disable graph construction
- Example: with torch.no_grad(): y = model(x)
torch.inference_mode()
- Stronger than no_grad
- Disables version counters
- Faster inference
FAANG Rule
Use inference_mode for production inference.
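The three options side by side (a minimal sketch; layer sizes are illustrative):
import torch
import torch.nn as nn
model = nn.Linear(10, 2)
x = torch.randn(4, 10)
y = model(x).detach()             # shares storage, cut from the graph
with torch.no_grad():             # no graph is built inside the block
    y_eval = model(x)
with torch.inference_mode():      # stricter: also skips version counting
    y_infer = model(x)
print(y.requires_grad, y_eval.requires_grad, y_infer.requires_grad)  # False False False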
requires_grad Propagation Rules
Key Points
- x.requires_grad = True → y = x*2 inherits requires_grad
- Exceptions: integer tensors cannot require gradients
- Operations with constants inherit from tensors
Saved Tensors & Memory
Key Points
- Autograd saves intermediate tensors during forward pass
- Needed for backward computation
- Memory-heavy ops: attention, large matmuls, conv layers
Optimization
Use torch.utils.checkpoint to trade compute for memory.
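A minimal checkpointing sketch (assumes a recent PyTorch that accepts use_reentrant; sizes are illustrative):
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
block = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
x = torch.randn(8, 512, requires_grad=True)
out = checkpoint(block, x, use_reentrant=False)  # activations recomputed during backward
out.sum().backward()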
Custom Autograd Functions
Why Custom Functions?
- Custom CUDA ops
- Memory optimization
- Numerical stability
Implementation Example
class MyFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 2
    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        return grad_output * 2 * x
y = MyFunc.apply(x)
Rules
No in-place ops; backward must return gradients for each input
Higher-Order Gradients
Key Points
- Enable graph during backward: loss.backward(create_graph=True)
- Used in meta-learning, gradient penalties, second-order optimizers
- ⚠️ Memory explosion risk
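A gradient-penalty style sketch of second-order gradients:
import torch
x = torch.randn(5, requires_grad=True)
loss = (x ** 3).sum()
(grad,) = torch.autograd.grad(loss, x, create_graph=True)  # grad itself is differentiable
penalty = grad.pow(2).sum()       # scalar function of the gradient
penalty.backward()
print(x.grad)                     # second-order quantity: 36 * x**3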
Hooks (Advanced & Dangerous)
Tensor Hooks
- x.register_hook(lambda grad: grad * 0.5)
Module Hooks
- Forward hooks
- Backward hooks
Use Cases
Debugging, gradient clipping, visualization
FAANG Warning
Hooks can break distributed training if misused.
Common Silent Bugs (Interview Gold)
Examples
- Using .data: x.data *= 2 # breaks autograd silently
- In-place ReLU: nn.ReLU(inplace=True) # can break autograd if the input is needed for backward
- Forgetting model.train() / model.eval()
- Mixing numpy and torch tensors
Numerical Stability
Tips
- Use logsumexp
- Avoid softmax followed by log; use log_softmax instead
- Use fused losses like CrossEntropyLoss
Mental Model FAANG Wants
Visualize
- Graph creation during forward pass
- Gradient flow during backward pass
- Memory saved per operation
- Where synchronization happens
Computation Graph
Key Points
- DAG of tensor operations
- Built dynamically
- Freed after .backward() unless retain_graph=True
- Disable gradients with: with torch.no_grad(): y = model(x)
What nn.Module Really Is
Core Features
- Parameter registration
- Submodule tracking
- Mode switching (train / eval)
- State serialization
- Hook infrastructure
- Device & dtype propagation
FAANG Insight
Everything in PyTorch training builds on this abstraction.
Parameter Registration (CRITICAL)
How Parameters Are Registered
- Attributes assigned as nn.Parameter are registered
- Stored in self._parameters
- Accessible via model.parameters(), state_dict(), and optimizer
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.randn(10))
What Is NOT Registered
- Plain tensors: self.w = torch.randn(10)
- Python lists: self.layers = [nn.Linear(10, 10)]
- Use nn.ModuleList, nn.ModuleDict, nn.ParameterList for collections
Interview Trap
Why doesn't my model learn even though the forward pass runs fine? Parameters that are not registered never appear in model.parameters(), so the optimizer never updates them.
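A minimal sketch of the trap and the fix (layer sizes are illustrative):
import torch.nn as nn
class Broken(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(10, 10) for _ in range(3)]   # plain list: NOT registered
class Fixed(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(10, 10) for _ in range(3))
print(len(list(Broken().parameters())))   # 0 -> the optimizer never sees them
print(len(list(Fixed().parameters())))    # 6 (weight + bias per layer)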
Buffers (Often Overlooked)
Key Points
- Tensors not optimized but saved in state_dict
- Move automatically with .to(device)
- Used in BatchNorm, running statistics, masks
self.register_buffer('running_mean', torch.zeros(10))
FAANG Insight
If it must be saved & moved but not trained → buffer.
Forward Pass Mechanics
__call__ vs forward
- output = model(x) calls __call__
- __call__ does: pre-forward hooks, calls forward, post-forward hooks, handles autograd
- Never call forward() directly
model.train() vs model.eval()
Affects Only Certain Layers
- Dropout (disabled in eval mode)
- BatchNorm (uses running statistics in eval mode)
- Note: LayerNorm behaves the same in train and eval (it keeps no running statistics)
Does NOT
- Disable gradient tracking
- Change requires_grad
FAANG Bug
Validation accuracy fluctuates wildly → forgot eval()
Proper Eval Context
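A minimal sketch (assumes model and val_loader are defined as in the training loop later in this sheet):
model.eval()                      # switch Dropout / BatchNorm to eval behavior
with torch.no_grad():             # skip graph construction during validation
    for x, y in val_loader:
        pred = model(x)
        # accumulate metrics here
model.train()                     # switch back before resuming training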
State Dict Internals
Contents
- Parameters
- Buffers
- Names reflect module hierarchy
model.state_dict().keys()
Saving Best Practice
torch.save(model.state_dict(), path); load with model.load_state_dict(torch.load(path))
FAANG Rule
Never pickle entire model for production.
Weight Initialization (Deep Dive)
Default Initialization
- Linear → Kaiming uniform (a=√5)
- Conv → Kaiming uniform (a=√5)
Custom Initialization
Interview Question
Why does bad initialization kill deep networks? (vanishing or exploding activations and gradients)
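A common custom-initialization pattern (a sketch; the Xavier choice here is illustrative):
import torch.nn as nn
def init_weights(m):
    if isinstance(m, nn.Linear):              # target only Linear layers
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))
model.apply(init_weights)                     # .apply() recurses over all submodules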
Module Hooks (Advanced)
Forward & Backward Hooks
def hook(module, inp, out):
    ...
handle = module.register_forward_hook(hook)
register_backward_hook is deprecated; use register_full_backward_hook instead
Used For
- Feature extraction
- Debugging gradients
- Visualization
FAANG Warning
Hooks + DDP = dangerous if misused.
Parameter Sharing
Key Points
- self.linear = nn.Linear(10, 10); self.linear2 = self.linear
- Same weights
- One set of gradients
Common Use Cases
- RNNs
- Siamese networks
- Tied embeddings
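A tied-embeddings sketch (sizes are illustrative): assigning one Parameter to two modules means a single weight and a single gradient.
import torch.nn as nn
class TiedLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.Linear(d_model, vocab_size, bias=False)
        self.decoder.weight = self.embed.weight          # shared Parameter
    def forward(self, tokens):
        return self.decoder(self.embed(tokens))
model = TiedLM()
print(sum(p.numel() for p in model.parameters()))        # 64000: the shared weight is counted once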
Freezing & Unfreezing
Freeze Parameters
FAANG Gotcha
Optimizer still holds frozen params unless recreated or filtered.
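A sketch of the clean pattern: freeze, then build the optimizer over trainable parameters only (layer sizes are illustrative):
import torch
import torch.nn as nn
model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 2))
for param in model[0].parameters():           # freeze the first layer
    param.requires_grad = False
optimizer = torch.optim.Adam(                 # optimizer never holds the frozen params
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)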
Clean Model Design Patterns
Pattern 1: Feature Extractor + Head
- self.backbone
- self.head
Pattern 2: Config-Driven Modules
- nn.ModuleDict
Pattern 3: Reusable Blocks
- class Block(nn.Module): ...
Debugging Models (Infra Style)
Key Checks
- Print named_parameters()
- Check gradient norms
- Verify device placement
- Assert shapes
for n, p in model.named_parameters():
    if p.grad is not None:        # grads exist only after backward()
        print(n, p.grad.norm())
Memory Leaks in Modules
Common Causes
- Storing tensors from forward pass
- Keeping graph references
- Hooks not removed
Remove Hook to Avoid Leaks
handle.remove()
PyTorch FAANG Interview Cheatsheet
FAANG Interview Checklist
Reference:
Official PyTorch Documentation, Real Python
- Parameter vs buffer
- Why ModuleList matters
- How __call__ works
- Why eval() matters
- Why pickling models is bad
Common Layers
- Linear: nn.Linear(in_features, out_features)
- Convolution: nn.Conv2d(in_channels, out_channels, kernel_size)
- Pooling: nn.MaxPool2d(2), nn.AvgPool2d(2)
- Normalization: nn.BatchNorm1d(num_features), nn.LayerNorm(shape)
Activation Functions
- nn.ReLU()
- nn.LeakyReLU(0.01)
- nn.Sigmoid()
- nn.Tanh()
- nn.Softmax(dim=1)
Loss Functions
- Regression: nn.MSELoss(), nn.L1Loss()
- Classification: nn.CrossEntropyLoss() (combines LogSoftmax and NLLLoss, so pass raw logits), nn.BCELoss(), nn.BCEWithLogitsLoss()
Optimizers
- lr
- weight_decay
- momentum
import torch.optim as optim
optim.SGD(model.parameters(), lr=0.01)
optim.Adam(model.parameters(), lr=0.001)
optim.AdamW(model.parameters())
optim.RMSprop(model.parameters())
Training Loop (Canonical)
for epoch in range(epochs):
    model.train()
    for x, y in dataloader:
        optimizer.zero_grad()
        pred = model(x)
        loss = criterion(pred, y)
        loss.backward()
        optimizer.step()
    # Evaluation
    model.eval()
    with torch.no_grad():
        ...
Dataset & DataLoader
from torch.utils.data import Dataset, DataLoader
class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
dataset = MyDataset(X, y)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
DataLoader Performance
- Key Parameters: batch_size, shuffle, num_workers, pin_memory
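A performance-oriented configuration sketch (values are illustrative; assumes the dataset from the block above and GPU training for pin_memory):
dataloader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,              # parallel worker processes for loading
    pin_memory=True,            # faster host-to-GPU copies
    persistent_workers=True,    # keep workers alive across epochs
)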
Model Saving & Loading
torch.save(model.state_dict(), 'model.pth')
model.load_state_dict(torch.load('model.pth'))
model.eval()
# Full model save (Not Recommended)
torch.save(model, 'model.pth')
Weight Initialization
- nn.init.xavier_uniform_(layer.weight)
- nn.init.kaiming_normal_(layer.weight)
Regularization
- Dropout: nn.Dropout(p=0.5)
- Weight Decay: optim.Adam(model.parameters(), weight_decay=1e-4)
- Early Stopping: Monitor validation loss, stop when loss increases
Learning Rate Scheduling
- ReduceLROnPlateau
- CosineAnnealingLR
from torch.optim.lr_scheduler import StepLR
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
scheduler.step()
Transfer Learning
- Freeze Layers: for param in model.parameters(): param.requires_grad = False
- Replace Head: model.fc = nn.Linear(512, num_classes)
CNN Interview Essentials
- Conv Output Size: O = ⌊(W − K + 2P) / S⌋ + 1
- Key Concepts: Receptive field, Stride, Padding, Channels vs spatial dims
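A quick worked example of the formula (values chosen for illustration): W=32, K=3, P=1, S=2 gives O = ⌊(32 − 3 + 2)/2⌋ + 1 = 16.
import torch
import torch.nn as nn
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)            # torch.Size([1, 8, 16, 16])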
RNN / LSTM Basics
- nn.RNN(input_size, hidden_size)
- nn.LSTM(input_size, hidden_size)
- nn.GRU(input_size, hidden_size)
- Shapes: (seq_len, batch, features) by default; (batch, seq_len, features) with batch_first=True
Transformers (High Level)
- Core Blocks: Embedding, Multi-head self-attention, Feedforward, Residual + LayerNorm
- nn.TransformerEncoderLayer(d_model, nhead)
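A minimal encoder sketch (dimensions are illustrative; batch_first=True keeps the (batch, seq_len, features) layout):
import torch
import torch.nn as nn
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)
x = torch.randn(2, 10, 512)     # (batch, seq_len, d_model)
print(encoder(x).shape)         # torch.Size([2, 10, 512])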
Mixed Precision Training
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
for x, y in dataloader:
    optimizer.zero_grad()
    with autocast():                    # ops run in lower precision where safe
        pred = model(x)
        loss = criterion(pred, y)
    scaler.scale(loss).backward()       # scale the loss to avoid gradient underflow
    scaler.step(optimizer)
    scaler.update()
Debugging Tips
- Common Errors: Shape mismatch, Forgetting .zero_grad(), Using .item() inside graph, Wrong loss for task
- Debug Tools: print(t.shape), torch.isnan(t).any()
Performance Optimization
- Use torch.no_grad() for inference
- Use .contiguous()
- Avoid Python loops
- Batch operations
PyTorch vs NumPy (Interview)
| Feature | NumPy | PyTorch |
|---|---|---|
| GPU | ❌ | ✅ |
| Autograd | ❌ | ✅ |
| DL | ❌ | ✅ |
