26  Hardware Factors for Deep Learning and LLM Performance

The hardware requirements for inference and training are quite different. Let me break down the key factors:

26.1 GPU/Accelerator Specifications

Memory (VRAM)

  • Training: Needs to hold model parameters, gradients, optimizer states, and batch data
  • Inference: Only needs model parameters and smaller batch sizes

Training Memory Requirements:
┌──────────────────────────────────┐
│         Total VRAM Needed        │
├──────────────────────────────────┤
│  Model Parameters (FP16/32)      │
│  + Gradients (same size)         │
│  + Optimizer States (2-4x params)│
│  + Activations (batch dependent) │
│  + Temporary buffers             │
└──────────────────────────────────┘

Inference Memory:
┌──────────────────────────────────┐
│    Much Smaller VRAM Needed      │
├──────────────────────────────────┤
│  Model Parameters                │
│  + Small batch activations       │
│  + KV cache (for transformers)   │
└──────────────────────────────────┘
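
As a rough back-of-the-envelope check, the components above can be estimated from the parameter count alone. The sketch below assumes FP16/BF16 weights and gradients with FP32 Adam states (roughly 8 bytes per parameter); activation memory depends on batch size and sequence length, so it is left as an input rather than guessed.

def estimate_training_vram_gb(n_params_billion, bytes_per_param=2,
                              optimizer_bytes_per_param=8, activation_gb=0.0):
    """Rough training VRAM estimate in GB (sketch, not a guarantee)."""
    n = n_params_billion * 1e9
    weights   = n * bytes_per_param            # FP16/BF16 parameters
    gradients = n * bytes_per_param            # same size as the parameters
    optimizer = n * optimizer_bytes_per_param  # FP32 Adam moments (~2-4x params)
    return (weights + gradients + optimizer) / 1e9 + activation_gb

def estimate_inference_vram_gb(n_params_billion, bytes_per_param=2, kv_cache_gb=0.0):
    """Inference needs only the weights plus KV cache and small activations."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9 + kv_cache_gb

# A hypothetical 7B-parameter model:
print(estimate_training_vram_gb(7))   # ~84 GB before activations and buffers
print(estimate_inference_vram_gb(7))  # ~14 GB in FP16, before KV cache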

Compute Power (TFLOPS)

  • Throughput is quoted separately for FP32, FP16, and INT8 operations
  • Tensor Cores (NVIDIA) / Matrix Units (Apple) provide massive speedups
  • Training typically uses FP16/BF16 mixed precision
  • Inference can use INT8/INT4 quantization for speed
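
To make the precision options concrete, here is a minimal PyTorch sketch of mixed-precision training with torch.autocast and a GradScaler. It assumes a CUDA GPU is available; the model, batch, and loss are placeholders.

import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(1024, 1024).to(device)                  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()
x = torch.randn(32, 1024, device=device)
target = torch.randn(32, 1024, device=device)

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    # Matrix multiplies run in FP16 on Tensor Cores; reductions stay in FP32
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()   # loss scaling avoids FP16 gradient underflow
    scaler.step(optimizer)
    scaler.update()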

26.2 Memory Hierarchy and Bandwidth

Speed Hierarchy (fastest to slowest):
     
┌─────────────┐
│  Registers  │ < 1 cycle
├─────────────┤
│  L1 Cache   │ ~4 cycles
├─────────────┤
│  L2 Cache   │ ~12 cycles  
├─────────────┤
│   VRAM      │ ~200 cycles
├─────────────┤
│  System RAM │ ~300+ cycles from the CPU; far slower from the GPU over PCIe
├─────────────┤
│  NVMe SSD   │ ~10,000+ cycles
└─────────────┘

Critical Bandwidth Metrics:

  • HBM (High Bandwidth Memory) in modern GPUs: ~2-3.4 TB/s (A100 ~2 TB/s, H100 ~3.35 TB/s)
  • PCIe 4.0 x16: ~32 GB/s per direction (~64 GB/s bidirectional)
  • System RAM (dual-channel DDR5): ~80-100 GB/s
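
These bandwidth numbers matter because autoregressive LLM decoding is usually memory-bandwidth-bound: each generated token streams essentially all of the weights from memory once. A rough ceiling on single-stream decode speed is therefore bandwidth divided by model size, as in the sketch below (the figures reuse the bandwidths above for a hypothetical 14 GB model, i.e. ~7B parameters in FP16).

def max_decode_tokens_per_sec(bandwidth_gb_s, model_size_gb):
    """Upper bound on single-stream decode speed when weight reads dominate.

    Ignores KV-cache traffic, kernel overhead, and batching; real throughput
    is lower at batch size 1 and higher per GPU with larger batches.
    """
    return bandwidth_gb_s / model_size_gb

model_gb = 14  # ~7B parameters in FP16
print(max_decode_tokens_per_sec(3350, model_gb))  # HBM3:         ~240 tokens/s ceiling
print(max_decode_tokens_per_sec(100,  model_gb))  # DDR5 RAM:     ~7 tokens/s ceiling
print(max_decode_tokens_per_sec(64,   model_gb))  # PCIe 4.0 x16: ~4-5 tokens/s ceiling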

26.3 CPU Considerations

While GPUs do the heavy lifting, CPUs matter for:

  • Data preprocessing and loading
  • Coordinating work across multiple GPUs
  • Running the training loop logic

Key specs:

  • Core count (for parallel data loading)
  • Clock speed (for sequential operations)
  • Cache size (for frequently accessed data)
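
One place those cores pay off is parallel data loading. The sketch below uses PyTorch's DataLoader with a stand-in dataset to keep preprocessing off the GPU's critical path.

import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; a real one would decode images or tokenize text on the CPU
dataset = TensorDataset(torch.randn(1_000, 3, 64, 64),
                        torch.randint(0, 10, (1_000,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=min(8, os.cpu_count() or 1),  # CPU worker processes for preprocessing
    pin_memory=True,          # page-locked host memory speeds up PCIe transfers
    prefetch_factor=4,        # each worker keeps batches ready ahead of the GPU
    persistent_workers=True,  # avoid respawning workers every epoch
)

for images, labels in loader:
    if torch.cuda.is_available():
        images = images.cuda(non_blocking=True)  # overlaps the copy with compute
    # ... forward/backward pass would go here ...
    break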

26.4 Storage I/O

Training needs fast storage to feed the data pipeline:

Dataset Loading Pipeline:
                                  
Storage → RAM → CPU Processing → GPU
  ↑                                ↓
  └──────── Prefetch Next ─────────┘

  • NVMe SSDs are essential for large datasets
  • Sequential read speeds: 5-7 GB/s (PCIe 4.0 NVMe)
  • Random read IOPS matters for shuffled data
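
A quick sanity check is to time a large sequential read and compare it against the numbers above; the sketch below uses a placeholder file path.

import time

def sequential_read_gb_s(path, chunk_mb=64):
    """Measure sequential read throughput in GB/s (rough sketch).

    The OS page cache can inflate results on repeat runs; use a file
    larger than RAM (or drop caches) for an honest number.
    """
    chunk = chunk_mb * 1024 * 1024
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            total_bytes += len(data)
    return total_bytes / (time.perf_counter() - start) / 1e9

# print(sequential_read_gb_s("/data/shards/train-00000.tar"))  # placeholder path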

Inference typically loads the model once, so storage is less critical.

26.5 Multi-GPU Scaling

Training Parallelism Types:

Data Parallel:
GPU0: Model Copy + Batch 0
GPU1: Model Copy + Batch 1
GPU2: Model Copy + Batch 2
→ Sync gradients after each step

Model Parallel:
GPU0: Layers 1-10
GPU1: Layers 11-20  
GPU2: Layers 21-30
→ Pass activations between GPUs
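
The data-parallel pattern maps directly onto PyTorch's DistributedDataParallel. The sketch below assumes a launch via torchrun --nproc_per_node=<num_gpus> and uses a toy model with random batches.

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # torchrun sets RANK/WORLD_SIZE
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(1024, 1024).cuda(local_rank),   # full copy on every GPU
                device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(10):
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")  # per-rank batch
        loss = model(x).pow(2).mean()                            # toy loss
        optimizer.zero_grad(set_to_none=True)
        loss.backward()        # gradients are all-reduced across GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()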

Interconnect Bandwidth:

  • NVLink (NVIDIA): 600-900 GB/s
  • Infinity Fabric (AMD): 400-800 GB/s
  • PCIe 5.0 x16: ~128 GB/s bidirectional

26.6 Specific Hardware Examples

High-end Training:

  • NVIDIA H100: 80GB HBM3, 3.35 TB/s bandwidth
  • 8x H100 cluster with NVLink for large models

Inference Optimization:

  • NVIDIA L4: 24GB, optimized for INT8
  • Apple M3 Max: Unified memory architecture, good for medium models
  • Quantized models on consumer GPUs (RTX 4090: 24GB)
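
For the consumer-GPU case, checking whether a quantized model fits is simple arithmetic. The sketch below assumes the weights dominate and reserves a fixed margin for KV cache and activations.

def fits_in_vram(n_params_billion, bits_per_weight, vram_gb, margin_gb=2.0):
    """Rough fit check: weight bytes plus a fixed margin for KV cache/activations."""
    weight_gb = n_params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weight_gb + margin_gb <= vram_gb, round(weight_gb, 1)

# Hypothetical models on a 24 GB card (e.g., RTX 4090):
print(fits_in_vram(70, 16, 24))  # (False, 140.0) -- FP16 70B is far too large
print(fits_in_vram(70, 4, 24))   # (False, 35.0)  -- even INT4 70B does not fit
print(fits_in_vram(13, 4, 24))   # (True, 6.5)    -- INT4 13B fits comfortably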

26.7 Practical Bottleneck Identification

Training Bottlenecks:
┌────────────────┐
│ GPU Utilization│ < 90% → CPU/IO bottleneck
├────────────────┤
│ VRAM Usage     │ = 100% → Reduce batch size
├────────────────┤
│ PCIe Traffic   │ High → Data loading issue
├────────────────┤
│ Disk I/O Wait  │ High → Need faster storage
└────────────────┘
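
A simple way to attribute a step's wall time between the data pipeline and GPU compute is to time the two phases separately. In the sketch below, loader, model, loss_fn, and optimizer are placeholders.

import time
import torch

def profile_steps(loader, model, loss_fn, optimizer, device="cuda", n_steps=50):
    """Split wall time per step into 'waiting for data' vs 'GPU compute' (sketch).

    Assumes the loader yields at least n_steps batches of (inputs, targets).
    """
    data_time, compute_time = 0.0, 0.0
    it = iter(loader)
    for _ in range(n_steps):
        t0 = time.perf_counter()
        x, y = next(it)                                   # time spent on CPU/storage
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        t1 = time.perf_counter()

        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        if device == "cuda":
            torch.cuda.synchronize()                      # flush async GPU work
        t2 = time.perf_counter()

        data_time += t1 - t0
        compute_time += t2 - t1

    # If data_time dominates, the GPU is starved: add DataLoader workers,
    # faster storage, or caching rather than a bigger GPU.
    print(f"data wait: {data_time:.2f}s   gpu compute: {compute_time:.2f}s")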