PyTorch Cheat Sheet

PyTorch has become a popular deep learning framework, and questions about it come up in nearly every data science interview. Use this cheat sheet to quickly recall important concepts, functions, and best practices while developing, training, and debugging machine learning models.

Design Philosophy of PyTorch

Core Philosophy

PyTorch prioritizes ease of experimentation over static graph optimization. It is built for developer productivity, flexibility, and debuggability.

Key Design Principles

  • Eager execution
  • Define-by-run dynamic computation graph
  • Python-first API with a high-performance C++ backend
  • Clear Python stack traces for easier debugging

Interview Perspective

  • PyTorch trades static graph optimization for developer productivity; PyTorch 2.x (torch.compile) partially closes this gap.

Why It is used

Overview

Because of its adaptability, efficiency, and robust ecosystem, PyTorch is a deep learning framework that is extensively used in both research and production.

Reference: PyTorch Ecosystem

Important Features

  • Dynamic computation graph
  • NumPy-like, Pythonic syntax
  • Powerful GPU acceleration via CUDA
  • Extensive use in research and production systems

NumPy vs Tensor (PyTorch)

Comparison

| Aspect | NumPy Array | Tensor (PyTorch) |
| --- | --- | --- |
| Primary use | Numerical computing | Machine learning & deep learning |
| Library | NumPy | PyTorch / TensorFlow |
| GPU support | CPU only | CPU & GPU |
| Automatic differentiation | Not supported | Supported (autograd) |
| Performance for ML | Good | Optimized for large-scale ML |
| Mutability | Mutable | Usually mutable |
| Data type flexibility | Single dtype per array | Single dtype per tensor |
| Broadcasting | Yes | Yes |
| Matrix operations | Via NumPy / SciPy | Built-in & optimized |
| Parallelism | Limited | Advanced (GPU / TPU) |
| Serialization | .npy, .npz | Framework-specific formats |
| Integration | Scientific Python stack | Deep learning ecosystems |
| Conversion | N/A | Easily convertible to/from NumPy |
| Device awareness | No device concept | Tracks CPU/GPU device |
| Training models | Not suitable | Core functionality |
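
A quick illustration of the practical differences (conversion, device placement, autograd); the CUDA branch is only taken if a GPU is available:

    import numpy as np
    import torch

    a = np.ones((2, 2))
    t = torch.from_numpy(a)        # zero-copy: shares memory with the NumPy array

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    t = t.to(device)               # NumPy has no notion of devices
    t.requires_grad_()             # autograd tracking, not available in NumPy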

Framework Selection Guide

When to Choose PyTorch

  • Research and experimentation
  • Quick prototyping
  • Custom or novel architectures
  • Learning deep learning concepts
  • NLP and advanced model development

When to Select TensorFlow

  • Large-scale production systems
  • Embedded or mobile deployment
  • TPU-intensive workloads
  • Enterprise ML pipelines

PyTorch Core Modules

Modules

  • torch → tensors and mathematical operations
  • torch.nn → neural network layers and modules
  • torch.optim → optimization algorithms
  • torch.autograd → automatic differentiation
  • torch.utils.data → datasets and dataloaders

Typical PyTorch Workflow

End-to-End Flow

  • Load data
  • Define model
  • Define loss function
  • Define optimizer
  • Training loop
  • Evaluation

Tensors

Tensor Components

  • Storage: 1D contiguous memory buffer
  • Sizes (shape)
  • Strides: Steps (in elements) to move across dimensions
  • Offset: Starting position in storage
  • dtype
  • device

Example: Accessing Storage & Strides

    import torch

    x = torch.randn(3, 4)
    x.storage()   # underlying 1D memory buffer (12 elements)
    x.stride()    # (4, 1): step 4 elements for the next row, 1 for the next column

Creating Tensors

  • torch.tensor([1, 2, 3])
  • torch.zeros(3, 4)
  • torch.ones(2, 2)
  • torch.empty(5, 5)      # uninitialized memory
  • torch.eye(3)           # identity matrix
  • torch.arange(0, 10, 2)
  • torch.linspace(0, 1, 5)

Random Tensors

  • torch.rand(3, 3)       # uniform on [0, 1)
  • torch.randn(3, 3)      # standard normal distribution

From NumPy

    import numpy as np

    a = np.array([1, 2, 3])
    t = torch.from_numpy(a)   # shares memory with the NumPy array

Dynamic vs Static Graphs

| Aspect | Dynamic Graph | Static Graph |
| --- | --- | --- |
| Graph creation | Built at runtime | Built before execution |
| Flexibility | Very high | Limited |
| Control flow | Native Python (if, for) | Special graph constructs |
| Debugging | Easy (step-by-step) | Harder |
| Code style | Imperative | Declarative |
| Performance | Slight overhead | Highly optimized |
| Memory optimization | Limited | Advanced |
| Reusability | Less reusable | Easily saved & reused |
| Production deployment | Historically weaker | Strong |
| Learning curve | Easy | Steep |
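
Because PyTorch builds the graph define-by-run, ordinary Python control flow works inside forward. A minimal sketch (DynamicNet is an illustrative name, not a library class):

    import torch
    import torch.nn as nn

    class DynamicNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(8, 8)

        def forward(self, x):
            # Depth is chosen at runtime; the graph is rebuilt on every call
            for _ in range(torch.randint(1, 4, (1,)).item()):
                x = torch.relu(self.layer(x))
            return x

    out = DynamicNet()(torch.randn(2, 8))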

Tensor Properties

Attributes

  • t.shape
  • t.dtype
  • t.device
  • t.requires_grad

Data Types

  • torch.float32
  • torch.float64
  • torch.int64
  • torch.bool

Type Casting

  • t.float()
  • t.long()
  • t.to(torch.float32)

Tensor Operations

Arithmetic

  • a + b
  • a - b
  • a * b
  • a / b
  • torch.add(a, b)

Matrix Operations

  • torch.matmul(a, b)
  • a @ b
  • a.T
  • torch.mm(a, b)

Reductions

  • torch.sum(a)
  • torch.mean(a)
  • torch.max(a)
  • torch.min(a)

Broadcasting Rules

  • Dimensions align from the right
  • Sizes must match or be 1
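
A quick sketch illustrating both rules:

    import torch

    a = torch.ones(3, 4)      # shape (3, 4)
    b = torch.arange(4.0)     # shape (4,): aligned from the right, stretched to (3, 4)
    c = torch.ones(3, 1)      # shape (3, 1): size-1 dim stretched to (3, 4)

    print((a + b).shape)      # torch.Size([3, 4])
    print((a + c).shape)      # torch.Size([3, 4])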

Indexing & Slicing

Basic Indexing

  • t[0]
  • t[:, 1]
  • t[1:3]
  • t[t > 0]

Advanced Indexing

  • indices = torch.tensor([0, 2])
  • t[indices]

Reshaping

  • t.view(2, 3)
  • t.reshape(2, 3)
  • t.squeeze()
  • t.unsqueeze(1)

Strides Explained (FAANG favorite)

Example

  • x = torch.randn(3, 4)   # shape (3, 4), strides (4, 1)
  • Move across columns: 1 step
  • Move across rows: 4 steps
  • Transpose: y = x.t()    # shape (4, 3), strides (1, 4)
  • No data duplication

Crucial Info

Many tensor operations return views rather than copies.

View vs Copy

View (Memory Sharing)

  • view
  • reshape (may return a view or a copy)
  • transpose
  • squeeze / unsqueeze

Copy (New Memory)

  • clone

FAANG Trap Question

view() fails for non-contiguous tensors. Use x.contiguous().view(-1) after transposing.
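
A minimal sketch of the trap and the fix:

    import torch

    x = torch.randn(3, 4)
    y = x.t()                       # view with strides (1, 4): not contiguous
    print(y.is_contiguous())        # False
    # y.view(-1)                    # RuntimeError (incompatible size/stride)
    z = y.contiguous().view(-1)     # copy into row-major memory first, then view works
    z = y.reshape(-1)               # reshape falls back to a copy automatically when needed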

Memory Contiguity

Key Points

  • Contiguous tensor: memory arranged in row-major order
  • Necessary for many CUDA kernels
  • Check with x.is_contiguous()
  • Call .contiguous() after transpose/permute before heavy operations

Device Management (CPU / GPU)

Basic Usage

  • device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  • t = t.to(device)

Common Pitfall

  ❌ Mixing CPU and GPU tensors
  ✅ Always move the model and the data to the same device

Autograd Overview

What Autograd Is

  • Dynamic automatic differentiation engine
  • Records tensor operations during the forward pass
  • Constructs a directed acyclic graph (DAG)
  • Reverse-mode autodiff executed via .backward()
  • Non-leaf tensors have .grad_fn; leaf tensors accumulate .grad

Interview Perspective

Understand dynamic graph construction and backward mechanics for FAANG interviews.
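
A minimal end-to-end autograd example:

    import torch

    x = torch.randn(3, requires_grad=True)   # leaf tensor
    y = (x * 2).sum()                         # forward pass records the ops in a DAG
    y.backward()                              # reverse-mode pass through the DAG
    print(x.grad)                             # tensor([2., 2., 2.])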

Leaf vs Non-Leaf Tensors

Leaf Tensor

  • Created directly by the user
  • requires_grad=True
  • .grad populated after backward
  • Example: x = torch.randn(3, requires_grad=True)

Non-Leaf Tensor

  • Result of an operation
  • Has grad_fn
  • .grad NOT retained by default
  • Example: y = x * 2
  • To retain its gradient: y.retain_grad()

Interview Trap

Why is y.grad None after backward? Because non-leaf tensors do not retain gradients by default.
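
A short sketch of the leaf / non-leaf distinction and retain_grad():

    import torch

    x = torch.randn(3, requires_grad=True)   # leaf
    y = x * 2                                 # non-leaf, has grad_fn
    y.retain_grad()                           # opt in to keeping y.grad
    y.sum().backward()

    print(x.grad)                             # populated: leaf tensor
    print(y.grad)                             # tensor([1., 1., 1.]); None without retain_grad()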

Computation Graph Internals

Key Points

  • Each operation creates a Function node
  • Example: x → MulBackward → MeanBackward → z
  • The graph is freed after .backward() by default
  • Use retain_graph=True to keep the graph

Backward Mechanics

Default Gradient

  • z.backward()   # for scalar z, equivalent to z.backward(torch.ones_like(z))

Non-Scalar Backward

  • y.backward(gradient=torch.ones_like(y))

FAANG Expectation

Understand the vector-Jacobian product (VJP) intuition behind reverse-mode autodiff.
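
A sketch of a non-scalar backward call; the gradient argument is the vector v in the vector-Jacobian product:

    import torch

    x = torch.randn(3, requires_grad=True)
    y = x * 2                                  # non-scalar output

    v = torch.ones_like(y)                     # the "vector" in the VJP
    y.backward(gradient=v)                     # computes v^T @ dy/dx without materializing the Jacobian
    print(x.grad)                              # tensor([2., 2., 2.])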

Gradient Accumulation

Key Points

  • Gradients accumulate in .grad by default
  • Example: calling loss.backward() twice doubles the gradients
  • Clear them each step with optimizer.zero_grad()

Interview Insight

Deliberate accumulation over several mini-batches simulates a larger effective batch size (see the sketch below).
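
A sketch of deliberate gradient accumulation; model, criterion, optimizer, and dataloader are assumed to be defined as in the training-loop section:

    accum_steps = 4                                   # effective batch = 4 x loader batch size

    optimizer.zero_grad()
    for i, (x, y) in enumerate(dataloader):
        loss = criterion(model(x), y) / accum_steps   # scale so the summed grads match one big batch
        loss.backward()                               # grads keep accumulating in .grad
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()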

Version Counters & In-Place Ops

Key Points

  • Every tensor has a version counter
  • In-place modification increments the version
  • Autograd checks version consistency during backward
  • Example error: modifying a tensor in place (e.g. x.add_(1)) after it has been saved for backward makes the later backward() raise a RuntimeError (see the sketch below)
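
A minimal sketch of the version-counter error (the exact error message varies by version):

    import torch

    x = torch.randn(3, requires_grad=True)
    y = x * 2
    z = y ** 2            # pow saves y for its backward
    y.add_(1)             # in-place op bumps y's version counter
    z.sum().backward()    # RuntimeError: a variable needed for gradient computation was modified in-place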

Detach vs No-Grad vs Inference Mode

.detach()

  • Returns a tensor sharing the same storage
  • Stops gradient tracking
  • Example: y = x.detach()

torch.no_grad()

  • Context manager that temporarily disables graph construction
  • Example: with torch.no_grad(): y = model(x)

torch.inference_mode()

  • Stronger than no_grad
  • Disables version counters
  • Faster inference

FAANG Rule

Use inference_mode for production inference.

requires_grad Propagation Rules

Key Points

  • If x.requires_grad is True, then y = x * 2 inherits requires_grad
  • Exception: integer tensors cannot require gradients
  • Operations mixing tensors and Python constants inherit requires_grad from the tensor operands

Saved Tensors & Memory

Key Points

  • Autograd saves intermediate tensors during the forward pass
  • They are needed for the backward computation
  • Memory-heavy ops: attention, large matmuls, conv layers

Optimization

Use torch.utils.checkpoint to trade compute for memory (see the sketch below).
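
A sketch of gradient checkpointing; block is a placeholder submodule, and use_reentrant=False is the setting recommended on recent PyTorch versions:

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.block = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))

        def forward(self, x):
            # Activations inside `block` are not stored; they are recomputed during backward
            return checkpoint(self.block, x, use_reentrant=False)

    out = Net()(torch.randn(4, 128, requires_grad=True))
    out.sum().backward()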

Custom Autograd Functions

Why Custom Functions?

  • Custom CUDA ops
  • Memory optimization
  • Numerical stability

Implementation Example

    import torch

    class MyFunc(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)     # stash inputs needed for the backward pass
            return x ** 2

        @staticmethod
        def backward(ctx, grad_output):
            x, = ctx.saved_tensors
            return grad_output * 2 * x   # chain rule: d(x^2)/dx = 2x

    x = torch.randn(3, requires_grad=True)
    y = MyFunc.apply(x)

Rules

Avoid in-place ops inside forward; backward must return one gradient for each input of forward.

Higher-Order Gradients

Key Points

  • Keep the graph during backward: loss.backward(create_graph=True)
  • Used in meta-learning, gradient penalties, second-order optimizers
  • ⚠️ Memory explosion risk
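
A small second-order example using torch.autograd.grad with create_graph=True:

    import torch

    x = torch.randn(3, requires_grad=True)
    y = (x ** 3).sum()

    (g,) = torch.autograd.grad(y, x, create_graph=True)   # dy/dx = 3 * x**2, still differentiable
    (g2,) = torch.autograd.grad(g.sum(), x)                # derivative of the gradient = 6 * x
    print(g2)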

Hooks (Advanced & Dangerous)

Tensor Hooks

  • x.register_hook(lambda grad: grad * 0.5)

Module Hooks

  • Forward hooks
  • Backward hooks

Use Cases

Debugging, gradient clipping, visualization

FAANG Warning

Hooks can break distributed training if misused.

Common Silent Bugs (Interview Gold)

Examples

  • Using .data: x.data *= 2   # silently bypasses autograd
  • In-place ReLU: nn.ReLU(inplace=True) can clash with autograd when the input is needed for backward
  • Forgetting model.train() / model.eval()
  • Mixing NumPy arrays and torch tensors

Numerical Stability

Tips

  • Use logsumexp
  • Avoid log(softmax(x)); prefer log_softmax
  • Use fused losses like CrossEntropyLoss
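
A quick sketch of why the fused op matters with extreme logits:

    import torch
    import torch.nn.functional as F

    logits = torch.tensor([[100.0, -100.0]])

    naive = torch.log(torch.softmax(logits, dim=1))   # second entry underflows to 0, so log gives -inf
    stable = F.log_softmax(logits, dim=1)             # stays finite: roughly [[0., -200.]]
    print(naive, stable)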

Mental Model FAANG Wants

Visualize

  • Graph creation during the forward pass
  • Gradient flow during the backward pass
  • Memory saved per operation
  • Where synchronization happens

Computation Graph

Key Points

  • DAG of tensor operations
  • Built dynamically
  • Freed after .backward() unless retain_graph=True
  • Disable gradients with: with torch.no_grad(): y = model(x)

What nn.Module Really Is

Core Features

  • Parameter registration
  • Submodule tracking
  • Mode switching (train / eval)
  • State serialization
  • Hook infrastructure
  • Device & dtype propagation

FAANG Insight

Everything in PyTorch training builds on this abstraction.

Parameter Registration (CRITICAL)

How Parameters Are Registered

  • Attributes assigned as nn.Parameter are registered
  • Stored in self._parameters
  • Accessible via model.parameters(), state_dict(), and the optimizer

    class Model(nn.Module):
        def __init__(self):
            super().__init__()
            self.w = nn.Parameter(torch.randn(10))

What Is NOT Registered

  • Plain tensors: self.w = torch.randn(10)
  • Python lists: self.layers = [nn.Linear(10, 10)]
  • Use nn.ModuleList, nn.ModuleDict, nn.ParameterList for collections

Interview Trap

Why doesn't my model train? Often because parameters were never registered (plain tensor attribute or Python list), so the optimizer never updates them; see the sketch below.
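
A sketch of the failure mode; Broken and Fixed are illustrative names:

    import torch
    import torch.nn as nn

    class Broken(nn.Module):
        def __init__(self):
            super().__init__()
            self.w = torch.randn(10)                          # plain tensor: NOT registered
            self.layers = [nn.Linear(10, 10)]                 # Python list: NOT registered

    class Fixed(nn.Module):
        def __init__(self):
            super().__init__()
            self.w = nn.Parameter(torch.randn(10))            # registered parameter
            self.layers = nn.ModuleList([nn.Linear(10, 10)])  # registered submodule

    print(len(list(Broken().parameters())))   # 0 -> the optimizer would update nothing
    print(len(list(Fixed().parameters())))    # 3 (w plus the Linear weight and bias)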

Buffers (Often Overlooked)

Key Points

  • Tensors that are not optimized but are saved in the state_dict
  • Move automatically with .to(device)
  • Used in BatchNorm running statistics, masks

    self.register_buffer('running_mean', torch.zeros(10))

FAANG Insight

If it must be saved & moved but not trained → buffer.

Forward Pass Mechanics

__call__ vs forward

  • output = model(x) invokes __call__
  • __call__ runs pre-forward hooks, calls forward, runs post-forward hooks, and handles autograd bookkeeping
  • Never call forward() directly

model.train() vs model.eval()

Affects Only Certain Layers

  • Dropout (turned off in eval mode)
  • BatchNorm (uses running statistics in eval mode)
  • Note: LayerNorm behaves the same in both modes

Does NOT

  • Disable gradient tracking
  • Change requires_grad

FAANG Bug

Validation accuracy fluctuates wildly → you forgot eval()

Proper Eval Context
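
A minimal evaluation context, assuming model and val_loader are defined:

    model.eval()                          # switch Dropout / BatchNorm to inference behavior
    with torch.no_grad():                 # or torch.inference_mode() for extra speed
        for x, y in val_loader:
            pred = model(x)
            # ... compute metrics ...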

State Dict Internals

Contents

  • Parameters
  • Buffers
  • Key names reflect the module hierarchy

    model.state_dict().keys()

Saving Best Practice

    torch.save(model.state_dict(), path)
    model.load_state_dict(torch.load(path))

FAANG Rule

Never pickle the entire model for production.

Weight Initialization (Deep Dive)

Default Initialization

  • nn.Linear → Kaiming uniform (a=√5)
  • nn.Conv2d → also Kaiming uniform (a=√5), not Kaiming normal

Custom Initialization
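
A common pattern is to walk the module tree with model.apply; a sketch assuming a model with Linear layers:

    import torch.nn as nn

    def init_weights(m):
        # Xavier for Linear weights, zeros for biases; extend with other layer types as needed
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

    model.apply(init_weights)   # recursively applies init_weights to every submodule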

Interview Question

Why does bad initialization kill deep networks? (Activations and gradients vanish or explode as depth grows.)

Module Hooks (Advanced)

Forward & Backward Hooks

    def hook(module, inp, out):
        ...

    handle = module.register_forward_hook(hook)

Note: register_backward_hook is deprecated; use register_full_backward_hook instead.

Used For

  • Feature extraction
  • Debugging gradients
  • Visualization

FAANG Warning

Hooks + DDP = dangerous if misused.

Parameter Sharing

Key Points

  • self.linear = nn.Linear(10, 10); self.linear2 = self.linear
  • Same weights
  • One set of gradients

Common Use Cases

  • RNNs
  • Siamese networks
  • Tied embeddings

Freezing & Unfreezing

Freeze Parameters
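
A sketch of freezing a backbone and training only the head; model.head is a placeholder attribute name:

    import torch

    for param in model.parameters():
        param.requires_grad = False           # freeze everything
    for param in model.head.parameters():
        param.requires_grad = True            # unfreeze just the head

    # Give the optimizer only the trainable parameters (see the gotcha below)
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-3
    )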

FAANG Gotcha

The optimizer still holds frozen params unless it is recreated or given a filtered parameter list.

Clean Model Design Patterns

Pattern 1: Feature Extractor + Head

  • self.backbone
  • self.head

Pattern 2: Config-Driven Modules

  • nn.ModuleDict

Pattern 3: Reusable Blocks

  • class Block(nn.Module): ...

Debugging Models (Infra Style)

Key Checks

  • Print named_parameters()
  • Check gradient norms
  • Verify device placement
  • Assert shapes

    for n, p in model.named_parameters():
        if p.grad is not None:        # grads are None before the first backward (or for frozen params)
            print(n, p.grad.norm())

Memory Leaks in Modules

Common Causes

  • Storing tensors from the forward pass on the module
  • Keeping graph references (e.g. accumulating loss tensors instead of loss.item())
  • Hooks not removed

Remove Hooks to Avoid Leaks

    handle.remove()

PyTorch FAANG Interview Cheatsheet

FAANG Interview Checklist

  • Parameter vs buffer
  • Why ModuleList matters
  • How __call__ works
  • Why eval() matters
  • Why pickling models is bad

Common Layers

  • Linear: nn.Linear(in_features, out_features)
  • Convolution: nn.Conv2d(in_channels, out_channels, kernel_size)
  • Pooling: nn.MaxPool2d(2), nn.AvgPool2d(2)
  • Normalization: nn.BatchNorm1d(num_features), nn.LayerNorm(shape)

Activation Functions

  • nn.ReLU()
  • nn.LeakyReLU(0.01)
  • nn.Sigmoid()
  • nn.Tanh()
  • nn.Softmax(dim=1)

Loss Functions

  • Regression: nn.MSELoss(), nn.L1Loss()
  • Classification: nn.CrossEntropyLoss() (combines LogSoftmax + NLLLoss, expects raw logits), nn.BCELoss(), nn.BCEWithLogitsLoss()

Optimizers

Key hyperparameters:

  • lr
  • weight_decay
  • momentum

    import torch.optim as optim

    optim.SGD(model.parameters(), lr=0.01)
    optim.Adam(model.parameters(), lr=0.001)
    optim.AdamW(model.parameters())
    optim.RMSprop(model.parameters())

Training Loop (Canonical)

    for epoch in range(epochs):
        model.train()
        for x, y in dataloader:
            optimizer.zero_grad()
            pred = model(x)
            loss = criterion(pred, y)
            loss.backward()
            optimizer.step()

    # Evaluation
    model.eval()
    with torch.no_grad():
        ...

Dataset & DataLoader

    from torch.utils.data import Dataset, DataLoader

    class MyDataset(Dataset):
        def __init__(self, X, y):
            self.X = X
            self.y = y

        def __len__(self):
            return len(self.X)

        def __getitem__(self, idx):
            return self.X[idx], self.y[idx]

    dataset = MyDataset(X, y)
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

DataLoader Performance

  • Key parameters: batch_size, shuffle, num_workers, pin_memory (see the sketch below)
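
A typical configuration sketch, reusing the dataset from the previous section:

    from torch.utils.data import DataLoader

    dataloader = DataLoader(
        dataset,
        batch_size=64,
        shuffle=True,
        num_workers=4,       # worker processes that load/augment batches in parallel
        pin_memory=True,     # page-locked host memory for faster copies to the GPU
    )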

Model Saving & Loading

    torch.save(model.state_dict(), 'model.pth')
    model.load_state_dict(torch.load('model.pth'))
    model.eval()

    # Full model save (not recommended)
    torch.save(model, 'model.pth')

Weight Initialization

  • nn.init.xavier_uniform_(layer.weight)
  • nn.init.kaiming_normal_(layer.weight)

Regularization

  • Dropout: nn.Dropout(p=0.5)
  • Weight decay: optim.Adam(model.parameters(), weight_decay=1e-4)
  • Early stopping: monitor the validation loss and stop when it starts increasing

Learning Rate Scheduling

  • ReduceLROnPlateau
  • CosineAnnealingLR

    from torch.optim.lr_scheduler import StepLR

    scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
    scheduler.step()   # typically called once per epoch, after optimizer.step()

Transfer Learning

  • Freeze layers: for param in model.parameters(): param.requires_grad = False
  • Replace the head: model.fc = nn.Linear(512, num_classes)

CNN Interview Essentials

  • Conv output size: O = (W − K + 2P) / S + 1 (floor division)
  • Key concepts: receptive field, stride, padding, channels vs spatial dims
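
A worked example of the output-size formula, checked against nn.Conv2d:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 32, 32)                                    # W = 32
    same = nn.Conv2d(3, 8, kernel_size=3, padding=1, stride=1)(x)    # (32 - 3 + 2) / 1 + 1 = 32
    half = nn.Conv2d(3, 8, kernel_size=3, padding=1, stride=2)(x)    # floor((32 - 3 + 2) / 2) + 1 = 16
    print(same.shape, half.shape)   # [1, 8, 32, 32] and [1, 8, 16, 16]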

RNN / LSTM Basics

  • nn.RNN(input_size, hidden_size)
  • nn.LSTM(input_size, hidden_size)
  • nn.GRU(input_size, hidden_size)
  • Shapes: (batch, seq_len, features) with batch_first=True; the default layout is (seq_len, batch, features)

Transformers (High Level)

  • Core blocks: embedding, multi-head self-attention, feedforward, residual + LayerNorm
  • nn.TransformerEncoderLayer(d_model, nhead)

Mixed Precision Training

    from torch.cuda.amp import autocast, GradScaler

    scaler = GradScaler()

    optimizer.zero_grad()
    with autocast():
        pred = model(x)                  # forward pass runs in float16 where safe
        loss = criterion(pred, y)
    scaler.scale(loss).backward()        # scale the loss to avoid float16 underflow
    scaler.step(optimizer)
    scaler.update()

Debugging Tips

  • Common errors: shape mismatch, forgetting .zero_grad(), calling .item() on intermediate values that should stay in the graph, wrong loss for the task
  • Debug tools: print(t.shape), torch.isnan(t).any()

Performance Optimization

  • Use torch.no_grad() for inference
  • Use .contiguous()
  • Avoid Python loops
  • Batch operations

PyTorch vs NumPy (Interview)

| Feature | NumPy | PyTorch |
| --- | --- | --- |
| GPU | No | Yes |
| Autograd | No | Yes |
| Deep learning | Not designed for it | Core use case |