27 AI Server Scaling Framework for Hospital Deployment
Q: Please help me form an idea/framework for AI server scaling in these scenarios:
- Local LLM inference in the hospital with multiple concurrent users (e.g., via a chatbot interface). Results are streamed in real time.
- Local LLM inference in the hospital with a few users via API calls (e.g., LLM-based structured data extraction for research). Results can wait.
- Local LLM fine-tuning or CNN model training.
For each of the above use cases, discuss:
- Scaling: vertical approach (e.g., add more L40S GPUs to a single server [max 2 slots]) or horizontal approach (e.g., add more AI servers); which is more efficient?
- Provide a system architecture that could effectively serve those needs.
27.1 Scenario 1: Real-time Multi-user Chatbot
Workload Characteristics:
- High concurrency (10-100+ simultaneous users)
- Low latency requirement (<1s first token)
- Streaming responses (tokens/second matters)
- Variable load throughout the day
Scaling Recommendation: **Horizontal** (with initial vertical)
Why Horizontal for Chatbot:
┌─────────────────────────────────────┐
│ Concurrent Users vs Response Time │
├─────────────────────────────────────┤
│ Users │ 1 GPU │ 2 GPU │ 4 Servers │
├───────┼───────┼───────┼─────────────┤
│ 10 │ 1.2s │ 0.8s │ 0.3s │
│ 50 │ 6.5s │ 3.4s │ 0.8s │
│ 100 │ Queue │ 8.2s │ 1.5s │
└─────────────────────────────────────┘
Architecture:
Hospital Network
│
┌───────┴────────┐
│ Load Balancer │
│ (HAProxy) │
└───────┬────────┘
┌────────────────┼────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Server 1 │ │ Server 2 │ │ Server 3 │
│ 1x L40S │ │ 1x L40S │ │ 1x L40S │
│ vLLM │ │ vLLM │ │ vLLM │
└──────────┘ └──────────┘ └──────────┘
│ │ │
┌────┴─────────────────┴─────────────────┴────┐
│ Shared Model Storage (NFS) │
│ (Read-only model weights) │
└─────────────────────────────────────────────┘
Key Components:
- vLLM or TGI for efficient inference serving
- Continuous batching for optimal throughput
- Model sharding if using larger models (70B+)
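For the chatbot tier, a minimal streaming client sketch is shown below. It assumes the HAProxy frontend is reachable at a placeholder address (ai-gateway.hospital.local:8000) and that each backend runs vLLM's OpenAI-compatible server; the model name is also a placeholder.

```python
# Minimal streaming chat client for the load-balanced vLLM pool.
# Assumption: the HAProxy frontend at ai-gateway.hospital.local:8000
# (placeholder) forwards to vLLM instances exposing the OpenAI-compatible API.
import json
import requests

GATEWAY = "http://ai-gateway.hospital.local:8000"  # placeholder address

def stream_chat(prompt: str, model: str = "meta-llama/Llama-3.1-8B-Instruct"):
    """Send one chat request and yield tokens as they arrive (SSE stream)."""
    payload = {
        "model": model,  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    with requests.post(f"{GATEWAY}/v1/chat/completions",
                       json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            delta = json.loads(data)["choices"][0]["delta"]
            if delta.get("content"):
                yield delta["content"]

if __name__ == "__main__":
    for token in stream_chat("Summarize the contraindications of metformin."):
        print(token, end="", flush=True)
    print()
```

Because HAProxy spreads sessions across the vLLM nodes, the client only ever talks to the load balancer and never needs to know which server produced the stream.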
27.2 Scenario 2: Batch API Processing for Research
Workload Characteristics:
- Low concurrency (1-5 researchers)
- Latency tolerant (can wait minutes)
- Large batch processing
- Predictable workload
Scaling Recommendation: **Vertical** (maximize single node)
Why Vertical for Batch:
┌──────────────────────────────────────┐
│ Batch Processing Efficiency │
├──────────────────────────────────────┤
│ Config │ Throughput │ Cost/tok │
├─────────────┼────────────┼───────────┤
│ 1x L40S │ 50 tok/s │ $0.10 │
│ 2x L40S │ 95 tok/s │ $0.08 │
│ 2 Servers │ 90 tok/s │ $0.12 │
└──────────────────────────────────────┘
Architecture:
Research Workstations
│
▼
┌───────────────────────────────┐
│ API Gateway │
│ (Queue Management) │
└───────────┬───────────────────┘
│
┌───────────▼───────────────────┐
│ Inference Server │
│ ┌────────────────────┐ │
│ │ 2x L40S (PCIe Gen4) │ │
│ │ Tensor Parallel │ │
│ └────────────────────┘ │
│ │
│ Job Queue (Redis/RabbitMQ) │
│ Result Cache (PostgreSQL) │
└───────────────────────────────┘
│
┌───────▼───────────┐
│ DICOM/HL7 Export │
│ Integration │
└───────────────────┘
Optimization for Research:
- Larger batch sizes (32-128)
- Result caching for repeated queries
- Async processing with job queues
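A sketch of the queue-based flow follows, assuming Redis is used for the job queue. The queue/key names, the inference endpoint, and the choice to cache results in Redis (rather than PostgreSQL as in the diagram) are simplifications for illustration.

```python
# Worker sketch: pull research extraction jobs from a Redis queue, run them
# against the batch inference server, and cache results for repeated queries.
# Queue names, keys, and the endpoint are illustrative.
import hashlib
import json
import redis
import requests

QUEUE = "research:extraction:jobs"             # assumed queue name
RESULT_PREFIX = "research:extraction:result:"  # assumed cache key prefix
INFERENCE_URL = "http://batch-node.hospital.local:8000/v1/completions"  # placeholder

r = redis.Redis(host="localhost", port=6379, db=0)

def process_forever(batch_size: int = 32):
    while True:
        # Block until a job arrives, then drain up to batch_size more.
        _, first = r.blpop(QUEUE)
        jobs = [json.loads(first)]
        while len(jobs) < batch_size:
            item = r.lpop(QUEUE)
            if item is None:
                break
            jobs.append(json.loads(item))

        for job in jobs:
            key = RESULT_PREFIX + hashlib.sha256(job["prompt"].encode()).hexdigest()
            if r.get(key) is not None:
                continue  # repeated query: result already cached
            resp = requests.post(INFERENCE_URL, json={
                "model": job["model"],
                "prompt": job["prompt"],
                "max_tokens": job.get("max_tokens", 512),
            }, timeout=600)
            resp.raise_for_status()
            r.set(key, resp.json()["choices"][0]["text"])

if __name__ == "__main__":
    process_forever()
```

The result cache lets repeated extraction queries be answered without touching the GPU; for GPU-side batching, the worker can also submit requests concurrently and let vLLM's continuous batching co-schedule them.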
27.3 Scenario 3: Fine-tuning/Training
Workload Characteristics:
- Single long-running job
- Memory intensive
- Requires gradient computation
- Checkpoint saving
Scaling Recommendation: **Vertical first, then horizontal for larger models**
Training Scaling Decision Tree:
Model Size < 7B?
│
├─Yes→ 1x L40S (48GB)
│
└─No→ Model Size < 30B?
│
├─Yes→ 2x L40S (96GB total, PCIe)
│
└─No→ Multi-node DDP/FSDP
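The tree can be encoded directly in job-submission tooling; the helper below simply mirrors the thresholds above (the size cut-offs are the same rules of thumb, not a library API).

```python
# Illustrative helper that mirrors the decision tree above; the size
# cut-offs are the same rules of thumb, not a library API.
def training_setup(model_size_billion: float) -> str:
    """Map approximate model size (billions of parameters) to a training setup."""
    if model_size_billion < 7:
        return "1x L40S (48GB), single-GPU training"
    if model_size_billion < 30:
        return "2x L40S (96GB total), DeepSpeed/tensor parallel on one node"
    return "multi-node DDP/FSDP across servers"

print(training_setup(13))  # -> "2x L40S (96GB total), DeepSpeed/tensor parallel on one node"
```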
Architecture:
┌─────────────────────────────────────┐
│ Training Orchestrator │
│ (Kubernetes + Kubeflow) │
└────────────┬────────────────────────┘
│
┌────────────▼────────────────────────┐
│ Primary Training Node │
│ ┌──────────────────────────┐ │
│ │ 2x L40S (PCIe Gen4) │ │
│ │ 256GB RAM, 2TB NVMe │ │
│ └──────────────────────────┘ │
│ │
│ DeepSpeed/PyTorch Lightning │
└─────────────┬───────────────────────┘
│
┌─────────────▼───────────────────────┐
│ Storage Cluster │
│ ┌─────────────────────────┐ │
│ │ Training Data (RAID 10) │ │
│ │ Checkpoints (3 replicas)│ │
│ └─────────────────────────┘ │
└─────────────────────────────────────┘
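A compact training sketch under these assumptions follows: a toy CNN (standing in for the real imaging model) trained on both L40S cards with PyTorch Lightning and DeepSpeed ZeRO stage 2. The checkpoint path, hyperparameters, and data loading are placeholders.

```python
# Toy CNN training job standing in for the real imaging model: both L40S
# cards, PyTorch Lightning with DeepSpeed ZeRO stage 2, checkpoints written
# to the storage cluster. Paths and hyperparameters are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

class ChestXrayCNN(pl.LightningModule):
    """Small classifier over single-channel (grayscale) images."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        images, labels = batch
        loss = F.cross_entropy(self(images), labels)
        self.log("train_loss", loss, on_step=False, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)

if __name__ == "__main__":
    checkpoints = ModelCheckpoint(
        dirpath="/mnt/storage/checkpoints",  # placeholder path on the storage cluster
        save_top_k=3,
        monitor="train_loss",
    )
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,                     # both L40S cards in the training node
        strategy="deepspeed_stage_2",  # or "ddp" if DeepSpeed is not installed
        precision="bf16-mixed",
        max_epochs=10,
        callbacks=[checkpoints],
    )
    # train_loader = ...  # DataLoader over the RAID-10 training data (omitted)
    # trainer.fit(ChestXrayCNN(), train_loader)
```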
27.4 Unified Infrastructure Design
For a hospital setting, I recommend a hybrid approach:
Complete Hospital AI Infrastructure:
┌─────────────────────────┐
│ Hospital Network │
└───────────┬─────────────┘
│
┌───────────▼─────────────┐
│ AI Orchestrator │
│ (K8s Control Plane) │
└───────────┬─────────────┘
│
┌───────────────────────┼───────────────────────┐
▼ ▼ ▼
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ Inference Pool │ │ Training Node │ │ Batch Node │
│ 3x Servers │ │ 1x Server │ │ 1x Server │
│ 1x L40S each │ │ 2x L40S │ │ 2x L40S │
│ (Horizontal) │ │ (Vertical) │ │ (Vertical) │
└────────────────┘ └────────────────┘ └────────────────┘
│ │ │
└───────────────────────┼───────────────────────┘
│
┌───────────▼─────────────┐
│ Shared Resources │
│ - Model Repository │
│ - Dataset Storage │
│ - Monitoring (Grafana) │
└─────────────────────────┘
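As a rough illustration of how the orchestrator separates the three pools, the mapping below routes each workload type to its own service endpoint. All hostnames are placeholders; in practice this routing would be expressed as Kubernetes Services/Ingress rules rather than application code.

```python
# Placeholder mapping from workload type to pool endpoint; in practice this
# routing lives in Kubernetes Services/Ingress, not application code.
POOLS = {
    "chat":     "http://inference-pool.hospital.local:8000",  # 3x 1-GPU vLLM nodes
    "batch":    "http://batch-node.hospital.local:8000",      # 2x L40S, queue-fed
    "training": "http://training-node.hospital.local:8080",   # Kubeflow job submission
}

def route(workload: str) -> str:
    """Return the service endpoint for 'chat', 'batch', or 'training' work."""
    if workload not in POOLS:
        raise ValueError(f"unknown workload type: {workload!r}")
    return POOLS[workload]
```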
27.5 Cost-Benefit Analysis
Scaling Comparison Matrix:
┌───────────────────────────────────────────────┐
│ Metric │ Vertical │ Horizontal │
├─────────────────┼───────────┼─────────────────┤
│ Initial Cost │ High $$ │ Lower $ │
│ Complexity │ Low │ High │
│ Redundancy │ None │ Built-in │
│ Max Capability │ Limited │ Unlimited │
│ Network Overhead│ None │ Significant │
│ Maintenance │ Easier │ Complex │
└───────────────────────────────────────────────┘
27.6 Practical Implementation Steps
Phase 1: Start Vertical (Months 1-3)
Single Server Configuration:
- 2x NVIDIA L40S (48GB each)
- 256GB System RAM
- 2x 2TB NVMe in RAID 1
- Estimated: $35,000-40,000
Phase 2: Add Horizontal for Inference (Months 4-6)
Add 2-3 inference nodes:
- 1x L40S per node
- 128GB RAM each
- Focus on chatbot scaling
Phase 3: Optimize Based on Metrics
Monitoring Metrics:
┌───────────────────┬─────────────────────────────────────┐
│ Queue Length      │ > 10 → Add inference node           │
│ GPU Memory Usage  │ > 90% → Consider model optimization │
│ Response Time P95 │ > 2s → Scale horizontally           │
│ Training Time     │ > 24h → Add training GPUs           │
└───────────────────┴─────────────────────────────────────┘
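A sketch of this feedback loop is below. It assumes metrics are scraped into Prometheus; the Prometheus address and the metric names are illustrative, not standard exporter names.

```python
# Feedback-loop sketch: read the metrics above from Prometheus and report
# which scaling action the thresholds call for. The Prometheus address and
# metric names are assumptions about how the stack is instrumented.
import requests

PROMETHEUS = "http://monitoring.hospital.local:9090"  # placeholder

def query(promql: str) -> float:
    """Run an instant PromQL query and return the first value (0.0 if empty)."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query",
                        params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def scaling_actions() -> list:
    """Apply the thresholds from the monitoring table."""
    actions = []
    if query("queue_length") > 10:
        actions.append("Add inference node")
    if query("gpu_memory_used_percent") > 90:
        actions.append("Consider model optimization")
    if query("response_time_p95_seconds") > 2:
        actions.append("Scale horizontally")
    if query("training_job_duration_hours") > 24:
        actions.append("Add training GPUs")
    return actions

if __name__ == "__main__":
    for action in scaling_actions():
        print(action)
```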
27.7 Specific Recommendations for Your Hospital
- Start with the 2x L40S vertical setup; it handles all three scenarios initially
- Implement a queue-based architecture from day one for easy scaling
- Use Kubernetes for orchestration (even single-node initially)
- Monitor actual usage patterns for 2-3 months before scaling
- Consider an NVIDIA AI Enterprise license for supported hospital deployment
For Medical Imaging Specifically:
- The L40S's 48GB VRAM is excellent for high-resolution DICOM processing
- Consider adding an A100 (80GB) if working with 3D volumes
- Implement a DICOM-specific caching layer