26  Hardware Factors for Deep Learning and LLM Performance

The hardware requirements for inference and training are quite different. Let me break down the key factors:

26.1 GPU/Accelerator Specifications

Memory (VRAM)

  • Training: Needs to hold model parameters, gradients, optimizer states, and batch data
  • Inference: Only needs model parameters and smaller batch sizes

Training Memory Requirements:
┌──────────────────────────────────┐
│         Total VRAM Needed        │
├──────────────────────────────────┤
│  Model Parameters (FP16/32)      │
│  + Gradients (same size)         │
│  + Optimizer States (2-4x params)│
│  + Activations (batch dependent) │
│  + Temporary buffers             │
└──────────────────────────────────┘

Inference Memory:
┌──────────────────────────────────┐
│    Much Smaller VRAM Needed      │
├──────────────────────────────────┤
│  Model Parameters                │
│  + Small batch activations       │
│  + KV cache (for transformers)   │
└──────────────────────────────────┘
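
As a rough back-of-the-envelope check, the components above can be estimated from the parameter count alone. The sketch below assumes FP16/BF16 weights and gradients with FP32 Adam states (roughly 8 bytes per parameter); activation memory depends on batch size and sequence length, so it is left as an input rather than guessed.

def estimate_training_vram_gb(n_params_billion, bytes_per_param=2,
                              optimizer_bytes_per_param=8, activation_gb=0.0):
    """Rough training VRAM estimate in GB (sketch, not a guarantee)."""
    n = n_params_billion * 1e9
    weights   = n * bytes_per_param            # FP16/BF16 parameters
    gradients = n * bytes_per_param            # same size as the parameters
    optimizer = n * optimizer_bytes_per_param  # FP32 Adam moments (~2-4x params)
    return (weights + gradients + optimizer) / 1e9 + activation_gb

def estimate_inference_vram_gb(n_params_billion, bytes_per_param=2, kv_cache_gb=0.0):
    """Inference needs only the weights plus KV cache and small activations."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9 + kv_cache_gb

# A hypothetical 7B-parameter model:
print(estimate_training_vram_gb(7))   # ~84 GB before activations and buffers
print(estimate_inference_vram_gb(7))  # ~14 GB in FP16, before KV cache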

Compute Power (TFLOPS)

  • Throughput is quoted separately for FP32, FP16, and INT8 operations
  • Tensor Cores (NVIDIA) / Matrix Units (Apple) provide massive speedups
  • Training typically uses FP16/BF16 mixed precision
  • Inference can use INT8/INT4 quantization for speed
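
To make the precision options concrete, here is a minimal PyTorch sketch of mixed-precision training with torch.autocast and a GradScaler. It assumes a CUDA GPU is available; the model, batch, and loss are placeholders.

import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(1024, 1024).to(device)                  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()
x = torch.randn(32, 1024, device=device)
target = torch.randn(32, 1024, device=device)

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    # Matrix multiplies run in FP16 on Tensor Cores; reductions stay in FP32
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()   # loss scaling avoids FP16 gradient underflow
    scaler.step(optimizer)
    scaler.update()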

26.2 Memory Hierarchy and Bandwidth

Speed Hierarchy (fastest to slowest):
     
┌─────────────┐
│  Registers  │ < 1 cycle
├─────────────┤
│  L1 Cache   │ ~4 cycles
├─────────────┤
│  L2 Cache   │ ~12 cycles  
├─────────────┤
│   VRAM      │ ~200 cycles
├─────────────┤
│  System RAM │ ~300+ cycles from the CPU; far slower from the GPU over PCIe
├─────────────┤
│  NVMe SSD   │ ~10,000+ cycles
└─────────────┘

Critical Bandwidth Metrics:

  • HBM (High Bandwidth Memory) in modern GPUs: ~2-3.4 TB/s (A100 ~2 TB/s, H100 ~3.35 TB/s)
  • PCIe 4.0 x16: ~32 GB/s per direction (~64 GB/s bidirectional)
  • System RAM (dual-channel DDR5): ~80-100 GB/s
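
These bandwidth numbers matter because autoregressive LLM decoding is usually memory-bandwidth-bound: each generated token streams essentially all of the weights from memory once. A rough ceiling on single-stream decode speed is therefore bandwidth divided by model size, as in the sketch below (the figures reuse the bandwidths above for a hypothetical 14 GB model, i.e. ~7B parameters in FP16).

def max_decode_tokens_per_sec(bandwidth_gb_s, model_size_gb):
    """Upper bound on single-stream decode speed when weight reads dominate.

    Ignores KV-cache traffic, kernel overhead, and batching; real throughput
    is lower at batch size 1 and higher per GPU with larger batches.
    """
    return bandwidth_gb_s / model_size_gb

model_gb = 14  # ~7B parameters in FP16
print(max_decode_tokens_per_sec(3350, model_gb))  # HBM3:         ~240 tokens/s ceiling
print(max_decode_tokens_per_sec(100,  model_gb))  # DDR5 RAM:     ~7 tokens/s ceiling
print(max_decode_tokens_per_sec(64,   model_gb))  # PCIe 4.0 x16: ~4-5 tokens/s ceiling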

26.3 CPU Considerations

While GPUs do the heavy lifting, CPUs matter for:

  • Data preprocessing and loading
  • Coordinating work across multiple GPUs
  • Running the training loop logic

Key specs:

  • Core count (for parallel data loading)
  • Clock speed (for sequential operations)
  • Cache size (for frequently accessed data)
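
One place those cores pay off is parallel data loading. The sketch below uses PyTorch's DataLoader with a stand-in dataset to keep preprocessing off the GPU's critical path.

import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; a real one would decode images or tokenize text on the CPU
dataset = TensorDataset(torch.randn(1_000, 3, 64, 64),
                        torch.randint(0, 10, (1_000,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=min(8, os.cpu_count() or 1),  # CPU worker processes for preprocessing
    pin_memory=True,          # page-locked host memory speeds up PCIe transfers
    prefetch_factor=4,        # each worker keeps batches ready ahead of the GPU
    persistent_workers=True,  # avoid respawning workers every epoch
)

for images, labels in loader:
    if torch.cuda.is_available():
        images = images.cuda(non_blocking=True)  # overlaps the copy with compute
    # ... forward/backward pass would go here ...
    break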

26.4 Storage I/O

Training needs fast storage to feed the data pipeline:

Dataset Loading Pipeline:
                                  
Storage → RAM → CPU Processing → GPU
  ↑                                ↓
  └──────── Prefetch Next ─────────┘

  • NVMe SSDs are essential for large datasets
  • Sequential read speeds: 5-7 GB/s (PCIe 4.0 NVMe)
  • Random read IOPS matters for shuffled data
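
A quick sanity check is to time a large sequential read and compare it against the numbers above; the sketch below uses a placeholder file path.

import time

def sequential_read_gb_s(path, chunk_mb=64):
    """Measure sequential read throughput in GB/s (rough sketch).

    The OS page cache can inflate results on repeat runs; use a file
    larger than RAM (or drop caches) for an honest number.
    """
    chunk = chunk_mb * 1024 * 1024
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            total_bytes += len(data)
    return total_bytes / (time.perf_counter() - start) / 1e9

# print(sequential_read_gb_s("/data/shards/train-00000.tar"))  # placeholder path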

Inference typically loads the model once, so storage is less critical.

26.5 Multi-GPU Scaling

Training Parallelism Types:

Data Parallel:
GPU0: Model Copy + Batch 0
GPU1: Model Copy + Batch 1
GPU2: Model Copy + Batch 2
→ Sync gradients after each step

Model Parallel:
GPU0: Layers 1-10
GPU1: Layers 11-20  
GPU2: Layers 21-30
→ Pass activations between GPUs
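
The data-parallel pattern maps directly onto PyTorch's DistributedDataParallel. The sketch below assumes a launch via torchrun --nproc_per_node=<num_gpus> and uses a toy model with random batches.

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # torchrun sets RANK/WORLD_SIZE
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(1024, 1024).cuda(local_rank),   # full copy on every GPU
                device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(10):
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")  # per-rank batch
        loss = model(x).pow(2).mean()                            # toy loss
        optimizer.zero_grad(set_to_none=True)
        loss.backward()        # gradients are all-reduced across GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()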

Interconnect Bandwidth:

  • NVLink (NVIDIA): 600-900 GB/s
  • Infinity Fabric (AMD): 400-800 GB/s
  • PCIe 5.0 x16: ~128 GB/s bidirectional

26.6 Specific Hardware Examples

High-end Training:

  • NVIDIA H100: 80GB HBM3, 3.35 TB/s bandwidth
  • 8x H100 cluster with NVLink for large models

Inference Optimization:

  • NVIDIA L4: 24GB, optimized for INT8
  • Apple M3 Max: Unified memory architecture, good for medium models
  • Quantized models on consumer GPUs (RTX 4090: 24GB)
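
For the consumer-GPU case, checking whether a quantized model fits is simple arithmetic. The sketch below assumes the weights dominate and reserves a fixed margin for KV cache and activations.

def fits_in_vram(n_params_billion, bits_per_weight, vram_gb, margin_gb=2.0):
    """Rough fit check: weight bytes plus a fixed margin for KV cache/activations."""
    weight_gb = n_params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weight_gb + margin_gb <= vram_gb, round(weight_gb, 1)

# Hypothetical models on a 24 GB card (e.g., RTX 4090):
print(fits_in_vram(70, 16, 24))  # (False, 140.0) -- FP16 70B is far too large
print(fits_in_vram(70, 4, 24))   # (False, 35.0)  -- even INT4 70B does not fit
print(fits_in_vram(13, 4, 24))   # (True, 6.5)    -- INT4 13B fits comfortably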

26.7 Practical Bottleneck Identification

Training Bottlenecks:
┌────────────────┐
│ GPU Utilization│ < 90% → CPU/IO bottleneck
├────────────────┤
│ VRAM Usage     │ = 100% → Reduce batch size
├────────────────┤
│ PCIe Traffic   │ High → Data loading issue
├────────────────┤
│ Disk I/O Wait  │ High → Need faster storage
└────────────────┘
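
A simple way to attribute a step's wall time between the data pipeline and GPU compute is to time the two phases separately. In the sketch below, loader, model, loss_fn, and optimizer are placeholders.

import time
import torch

def profile_steps(loader, model, loss_fn, optimizer, device="cuda", n_steps=50):
    """Split wall time per step into 'waiting for data' vs 'GPU compute' (sketch).

    Assumes the loader yields at least n_steps batches of (inputs, targets).
    """
    data_time, compute_time = 0.0, 0.0
    it = iter(loader)
    for _ in range(n_steps):
        t0 = time.perf_counter()
        x, y = next(it)                                   # time spent on CPU/storage
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        t1 = time.perf_counter()

        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        if device == "cuda":
            torch.cuda.synchronize()                      # flush async GPU work
        t2 = time.perf_counter()

        data_time += t1 - t0
        compute_time += t2 - t1

    # If data_time dominates, the GPU is starved: add DataLoader workers,
    # faster storage, or caching rather than a bigger GPU.
    print(f"data wait: {data_time:.2f}s   gpu compute: {compute_time:.2f}s")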