26 Hardware Factors for Deep Learning and LLM Performance
The hardware requirements for inference and training are quite different. Let me break down the key factors:
26.1 GPU/Accelerator Specifications
Memory (VRAM)
- Training: Needs to hold model parameters, gradients, optimizer states, and batch data
- Inference: Only needs model parameters and smaller batch sizes
Training Memory Requirements:
┌──────────────────────────────────┐
│ Total VRAM Needed │
├──────────────────────────────────┤
│ Model Parameters (FP16/32) │
│ + Gradients (same size) │
│ + Optimizer States (2-4x params)│
│ + Activations (batch dependent) │
│ + Temporary buffers │
└──────────────────────────────────┘
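As a rough illustration, here is a minimal back-of-the-envelope estimator (a sketch assuming FP16 weights and gradients with FP32 Adam states, i.e. roughly 16 bytes per parameter before activations; other optimizers and precisions change the multipliers):

```python
def estimate_training_vram_gb(n_params: float, activations_gb: float = 0.0) -> float:
    """Rough training VRAM estimate: FP16 params/grads + FP32 Adam states."""
    bytes_params = n_params * 2      # FP16 weights
    bytes_grads = n_params * 2       # FP16 gradients
    bytes_optim = n_params * 12      # FP32 master weights + two Adam moments (4 B each)
    return (bytes_params + bytes_grads + bytes_optim) / 1e9 + activations_gb

# Example: a 7B-parameter model needs ~112 GB before activations and buffers
print(estimate_training_vram_gb(7e9))
```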
Inference Memory:
┌──────────────────────────────────┐
│ Much Smaller VRAM Needed │
├──────────────────────────────────┤
│ Model Parameters │
│ + Small batch activations │
│ + KV cache (for transformers) │
└──────────────────────────────────┘
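For inference, the KV cache often dominates at long context lengths. A quick estimator (a sketch assuming FP16 cache entries and standard multi-head attention; grouped-query attention shrinks this by the head-count ratio):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> float:
    """K and V cache size for a decoder-only transformer, in GB."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size  # 2 = K and V
    return elems * bytes_per_elem / 1e9

# Example: a Llama-2-7B-like shape (32 layers, 32 KV heads, head_dim 128) at 4k context
print(kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1))  # ~2.1 GB
```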
Compute Power (TFLOPS)
- Peak throughput is quoted separately for FP32, FP16/BF16, and INT8 operations
- Tensor Cores (NVIDIA) / Matrix Units (Apple) provide massive speedups
- Training typically uses FP16/BF16 mixed precision
- Inference can use INT8/INT4 quantization for speed
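As a concrete sketch of mixed-precision training in PyTorch (toy model and random data stand in for a real workload; requires a CUDA GPU):

```python
import torch
from torch import nn

model = nn.Linear(512, 10).cuda()                    # toy stand-in for a real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                 # scales the loss to avoid FP16 underflow

for _ in range(10):                                  # toy loop with random data
    x = torch.randn(32, 512, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                    # backward pass on the scaled loss
    scaler.step(optimizer)                           # unscales gradients, then steps
    scaler.update()
```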
26.2 Memory Hierarchy and Bandwidth
Speed Hierarchy (fastest to slowest):
┌─────────────┐
│ Registers │ < 1 cycle
├─────────────┤
│ L1 Cache │ ~4 cycles
├─────────────┤
│ L2 Cache │ ~12 cycles
├─────────────┤
│ VRAM │ ~200 cycles
├─────────────┤
│ System RAM │ ~300+ cycles (thousands when accessed from the GPU over PCIe)
├─────────────┤
│ NVMe SSD │ ~10,000+ cycles
└─────────────┘
Critical Bandwidth Metrics:
- HBM (High Bandwidth Memory) in modern GPUs: 2-3 TB/s
- PCIe 4.0 x16: ~32 GB/s per direction (~64 GB/s aggregate bidirectional)
- System RAM DDR5: ~80-100 GB/s (dual-channel)
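These numbers matter because single-stream LLM decoding is usually memory-bandwidth bound: each generated token has to stream the full set of weights. A rough upper-bound estimate under that assumption (ignoring KV cache traffic and batching):

```python
def max_tokens_per_sec(n_params: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s * 1e9 / (n_params * bytes_per_param)

# Example: a 70B model in 4-bit (~0.5 bytes/param)
print(max_tokens_per_sec(70e9, 0.5, 3350))  # ~96 tok/s on ~3.35 TB/s HBM3
print(max_tokens_per_sec(70e9, 0.5, 100))   # ~3 tok/s on ~100 GB/s DDR5
```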
26.3 CPU Considerations
While GPUs do the heavy lifting, CPUs matter for:
- Data preprocessing and loading
- Managing multiple GPU coordination
- Running the training loop logic
Key specs:
- Core count (for parallel data loading)
- Clock speed (for sequential operations)
- Cache size (for frequently accessed data)
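In practice this usually means parallelizing the input pipeline across CPU cores. A minimal PyTorch DataLoader sketch (the TensorDataset is a toy stand-in; num_workers should roughly match available cores):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 512), torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=8,            # CPU worker processes for parallel loading/preprocessing
    pin_memory=True,          # page-locked host memory speeds up host-to-GPU copies
    prefetch_factor=2,        # batches each worker prepares ahead of the GPU
    persistent_workers=True,  # keep workers alive between epochs
)

for x, y in loader:
    x = x.cuda(non_blocking=True)  # copy can overlap with compute when memory is pinned
    y = y.cuda(non_blocking=True)
    break                          # forward/backward would go here
```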
26.4 Storage I/O
Training needs fast storage to keep the data pipeline fed:
Dataset Loading Pipeline:
Storage → RAM → CPU Processing → GPU
   ↑                              ↓
   └─────── Prefetch Next ────────┘
- NVMe SSDs essential for large datasets
- Sequential read speeds: 5-7 GB/s (PCIe 4.0 NVMe)
- Random read IOPS matters for shuffled data
Inference typically loads the model once, so storage is less critical.
26.5 Multi-GPU Scaling
Training Parallelism Types:
Data Parallel:
GPU0: Model Copy + Batch 0
GPU1: Model Copy + Batch 1
GPU2: Model Copy + Batch 2
→ Sync gradients after each step
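In PyTorch this corresponds to DistributedDataParallel: one process per GPU, each with a full model copy, and gradients all-reduced during backward. A compressed sketch (the filename train_dp.py and the toy model are placeholders; launch with torchrun, which sets the rank environment variables):

```python
# Run with:  torchrun --nproc_per_node=3 train_dp.py
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                 # NCCL backend for GPU all-reduce
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(nn.Linear(512, 10).cuda(), device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(32, 512, device="cuda")         # each rank would load its own batch shard
y = torch.randint(0, 10, (32,), device="cuda")
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()                                 # gradients are all-reduced across GPUs here
optimizer.step()
dist.destroy_process_group()
```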
Model Parallel:
GPU0: Layers 1-10
GPU1: Layers 11-20
GPU2: Layers 21-30
→ Pass activations between GPUs
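A minimal illustration of splitting layers across devices (naive model parallelism with no pipelining; assumes two visible GPUs):

```python
import torch
from torch import nn

class TwoStageModel(nn.Module):
    """First block of layers on cuda:0, second block on cuda:1."""
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        return self.stage1(x.to("cuda:1"))  # activations cross the GPU interconnect here

model = TwoStageModel()
print(model(torch.randn(32, 512)).shape)    # torch.Size([32, 10])
```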
Interconnect Bandwidth:
- NVLink (NVIDIA): 600-900 GB/s
- Infinity Fabric (AMD): 400-800 GB/s
- PCIe 5.0 x16: ~64 GB/s per direction (~128 GB/s aggregate)
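Interconnect speed matters because data-parallel training all-reduces every gradient each step; a ring all-reduce moves roughly 2x the gradient bytes per GPU. A quick estimate:

```python
def allreduce_gb_per_step(n_params: float, bytes_per_grad: int = 2) -> float:
    """Approximate bytes each GPU sends+receives per step in a ring all-reduce."""
    return 2 * n_params * bytes_per_grad / 1e9

# Example: 7B parameters with FP16 gradients -> ~28 GB of traffic per GPU per step,
# i.e. tens of milliseconds over NVLink but on the order of a second over PCIe 4.0.
print(allreduce_gb_per_step(7e9))
```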
26.6 Specific Hardware Examples
High-end Training:
- NVIDIA H100: 80GB HBM3, 3.35 TB/s bandwidth
- 8x H100 cluster with NVLink for large models
Inference Optimization:
- NVIDIA L4: 24GB, optimized for INT8
- Apple M3 Max: Unified memory architecture (up to 128GB shared between CPU and GPU), good for medium-sized models
- Quantized models on consumer GPUs (RTX 4090: 24GB)
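A quick feasibility check for whether a quantized model's weights fit a given card (weights only; KV cache and runtime overhead, typically a few extra GB, come on top):

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory for a quantized model, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    size = weights_gb(n, 4)  # 4-bit quantization
    print(f"{name} @ 4-bit: {size:4.1f} GB -> fits in 24 GB: {size <= 24}")
```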
26.7 Practical Bottleneck Identification
Training Bottlenecks:
┌────────────────┐
│ GPU Utilization│ < 90% → CPU/IO bottleneck
├────────────────┤
│ VRAM Usage │ = 100% → Reduce batch size
├────────────────┤
│ PCIe Traffic │ High → Data loading issue
├────────────────┤
│ Disk I/O Wait │ High → Need faster storage
└────────────────┘
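One way to watch these signals during a run is to poll NVML alongside PyTorch's allocator stats (a sketch assuming an NVIDIA GPU and the nvidia-ml-py package, which provides the pynvml module):

```python
import time
import pynvml   # pip install nvidia-ml-py
import torch

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(5):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu / .memory in percent
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # .used / .total in bytes
    print(f"GPU util {util.gpu:3d}%  "
          f"VRAM {mem.used / 1e9:5.1f}/{mem.total / 1e9:5.1f} GB  "
          f"torch allocated {torch.cuda.memory_allocated() / 1e9:5.1f} GB")
    time.sleep(1)  # sustained utilization well below ~90% points to a CPU or I/O bottleneck
```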