27  AI Server Scaling Framework for Hospital Deployment

Note

Q: Please help me form an idea/framework for AI server scaling in these scenarios:

  1. Local LLM inference in the hospital with multiple concurrent users (e.g., via a chatbot interface). The results are streamed in real time.
  2. Local LLM inference in the hospital with a few users, accessed via API calls (e.g., LLM-based structured data extraction for research). The results can wait.
  3. Local LLM fine-tuning or CNN model training.

In each of the above use cases, discuss:

  1. Scaling: a vertical approach (e.g., add more L40S GPUs in a single server [max 2 slots]) or a horizontal approach (e.g., add more AI servers): which is more efficient?
  2. Provide a system architecture that could effectively serve those needs.

27.1 Scenario 1: Real-time Multi-user Chatbot

Workload Characteristics:

  • High concurrency (10-100+ simultaneous users)
  • Low latency requirement (<1s first token)
  • Streaming responses (tokens/second matters)
  • Variable load throughout the day

Scaling Recommendation: Horizontal (with initial vertical)

Why Horizontal for Chatbot (illustrative figures):
┌─────────────────────────────────────┐
│  Concurrent Users vs Response Time  │
├─────────────────────────────────────┤
│ Users │ 1 GPU │ 2 GPU │ 4 Servers   │
├───────┼───────┼───────┼─────────────┤
│  10   │  1.2s │  0.8s │   0.3s      │
│  50   │  6.5s │  3.4s │   0.8s      │
│  100  │ Queue │  8.2s │   1.5s      │
└─────────────────────────────────────┘

Architecture:

                    Hospital Network
                           │
                   ┌───────┴────────┐
                   │  Load Balancer │
                   │   (HAProxy)    │
                   └───────┬────────┘
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
    ┌──────────┐     ┌──────────┐     ┌──────────┐
    │ Server 1 │     │ Server 2 │     │ Server 3 │
    │  1x L40S │     │  1x L40S │     │  1x L40S │
    │  vLLM    │     │  vLLM    │     │  vLLM    │
    └──────────┘     └──────────┘     └──────────┘
         │                 │                 │
    ┌────┴─────────────────┴─────────────────┴────┐
    │         Shared Model Storage (NFS)          │
    │         (Read-only model weights)           │
    └─────────────────────────────────────────────┘

Key Components:

  • vLLM or TGI for efficient inference serving (see the sketch below)
  • Continuous batching for optimal throughput
  • Model sharding if using larger models (70B+)
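
A minimal client-side sketch of the streaming path, assuming each node runs vLLM's OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server) behind the HAProxy frontend; the host llm.hospital.local and the model name are placeholders:

    # Streaming chat completion through the load balancer (sketch).
    from openai import OpenAI

    # Hypothetical HAProxy frontend; vLLM serves an OpenAI-compatible API.
    client = OpenAI(base_url="http://llm.hospital.local:8000/v1", api_key="EMPTY")

    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
        messages=[{"role": "user", "content": "Summarize this discharge note."}],
        stream=True,  # tokens arrive as vLLM generates them
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # forward to the chat UI in real time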

27.2 Scenario 2: Batch API Processing for Research

Workload Characteristics:

  • Low concurrency (1-5 researchers)
  • Latency tolerant (results can wait minutes)
  • Large batch processing
  • Predictable workload

Scaling Recommendation: Vertical (maximize single node)

Why Vertical for Batch (illustrative figures):
┌──────────────────────────────────────┐
│     Batch Processing Efficiency      │
├──────────────────────────────────────┤
│ Config      │ Throughput │ Cost/tok  │
├─────────────┼────────────┼───────────┤
│ 1x L40S     │  50 tok/s  │   $0.10   │
│ 2x L40S     │  95 tok/s  │   $0.08   │
│ 2 Servers   │  90 tok/s  │   $0.12   │
└──────────────────────────────────────┘

Architecture:

    Research Workstations
            │
            ▼
    ┌───────────────────────────────┐
    │      API Gateway              │
    │   (Queue Management)          │
    └───────────┬───────────────────┘
                │
    ┌───────────▼───────────────────┐
    │   Inference Server            │
    │   ┌────────────────────┐      │
    │   │ 2x L40S (PCIe)     │      │
    │   │ Tensor Parallel    │      │
    │   └────────────────────┘      │
    │                               │
    │   Job Queue (Redis/RabbitMQ)  │
    │   Result Cache (PostgreSQL)   │
    └───────────────────────────────┘
            │
    ┌───────▼───────────┐
    │  DICOM/HL7 Export │
    │   Integration     │
    └───────────────────┘

Optimization for Research:

  • Larger batch sizes (32-128)
  • Result caching for repeated queries
  • Async processing with job queues (see the sketch below)
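
A minimal sketch of the queue pattern above, assuming a Redis instance on the inference server; the host, queue names, and the llm_extract callable are placeholders:

    # Async structured-extraction jobs over a Redis queue (sketch).
    import json
    import uuid
    import redis

    r = redis.Redis(host="inference.hospital.local")  # hypothetical host

    def submit(document_text: str) -> str:
        """Researcher side: enqueue a job and return its id for later pickup."""
        job_id = str(uuid.uuid4())
        r.lpush("extraction:jobs", json.dumps({"id": job_id, "text": document_text}))
        return job_id

    def worker(llm_extract):
        """Server side: pop jobs, run the LLM, cache results for retrieval."""
        while True:
            _, raw = r.brpop("extraction:jobs")   # blocks until a job arrives
            job = json.loads(raw)
            result = llm_extract(job["text"])     # batched LLM call goes here
            r.set(f"extraction:result:{job['id']}", json.dumps(result))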

27.3 Scenario 3: Fine-tuning/Training

Workload Characteristics:

  • Single long-running job
  • Memory intensive
  • Requires gradient computation
  • Checkpoint saving

Scaling Recommendation: Vertical first, then horizontal for larger models

Training Scaling Decision Tree:

Model Size < 7B?
    │
    ├─Yes→ 1x L40S (48GB)
    │
    └─No→ Model Size < 30B?
           │
           ├─Yes→ 2x L40S (96GB, PCIe)
           │
           └─No→ Multi-node DDP/FSDP
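
The same decision tree as a small helper, using the thresholds above (they assume full fine-tuning; parameter-efficient methods such as LoRA shift the cutoffs upward):

    def training_hardware(model_size_b: float) -> str:
        """Map model size in billions of parameters to a hardware tier."""
        if model_size_b < 7:
            return "1x L40S (48GB)"
        if model_size_b < 30:
            return "2x L40S (96GB, PCIe)"
        return "Multi-node DDP/FSDP"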

Architecture:

    ┌─────────────────────────────────────┐
    │      Training Orchestrator          │
    │     (Kubernetes + Kubeflow)         │
    └────────────┬────────────────────────┘
                 │
    ┌────────────▼────────────────────────┐
    │     Primary Training Node           │
    │  ┌──────────────────────────┐       │
    │  │ 2x L40S (PCIe, no NVLink)│       │
    │  │ 256GB RAM, 2TB NVMe      │       │
    │  └──────────────────────────┘       │
    │                                     │
    │  DeepSpeed/PyTorch Lightning        │
    └─────────────┬───────────────────────┘
                  │
    ┌─────────────▼───────────────────────┐
    │        Storage Cluster              │
    │   ┌─────────────────────────┐       │
    │   │ Training Data (RAID 10) │       │
    │   │ Checkpoints (3 replicas)│       │
    │   └─────────────────────────┘       │
    └─────────────────────────────────────┘
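
A minimal sketch of the single-node, two-GPU loop using plain PyTorch DDP (DeepSpeed and PyTorch Lightning wrap the same primitives); build_model, build_dataset, and the checkpoint path are placeholders. Launch with torchrun --nproc_per_node=2 train.py:

    # train.py: DDP training skeleton with checkpointing (sketch).
    import torch
    import torch.distributed as dist
    import torch.nn.functional as F
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler

    def main():
        dist.init_process_group("nccl")            # torchrun starts one process per GPU
        rank = dist.get_rank()
        torch.cuda.set_device(rank)

        model = DDP(build_model().cuda(rank), device_ids=[rank])  # placeholder model
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

        dataset = build_dataset()                  # placeholder dataset
        sampler = DistributedSampler(dataset)      # shards data across the two GPUs
        loader = DataLoader(dataset, batch_size=16, sampler=sampler)

        for epoch in range(10):
            sampler.set_epoch(epoch)               # reshuffle shards each epoch
            for x, y in loader:
                loss = F.cross_entropy(model(x.cuda(rank)), y.cuda(rank))
                opt.zero_grad()
                loss.backward()                    # DDP all-reduces gradients here
                opt.step()
            if rank == 0:                          # checkpoint from a single rank
                torch.save(model.module.state_dict(), f"/data/ckpt/epoch{epoch}.pt")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()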

27.4 Unified Infrastructure Design

For a hospital setting, I recommend a hybrid approach:

Complete Hospital AI Infrastructure:

                    ┌─────────────────────────┐
                    │   Hospital Network      │
                    └───────────┬─────────────┘
                                │
                    ┌───────────▼─────────────┐
                    │   AI Orchestrator       │
                    │  (K8s Control Plane)    │
                    └───────────┬─────────────┘
                                │
        ┌───────────────────────┼───────────────────────┐
        ▼                       ▼                       ▼
┌────────────────┐   ┌────────────────┐   ┌────────────────┐
│ Inference Pool │   │  Training Node │   │   Batch Node   │
│  3x Servers    │   │   1x Server    │   │   1x Server    │
│  1x L40S each  │   │   2x L40S      │   │   2x L40S      │
│  (Horizontal)  │   │   (Vertical)   │   │   (Vertical)   │
└────────────────┘   └────────────────┘   └────────────────┘
        │                       │                       │
        └───────────────────────┼───────────────────────┘
                                │
                    ┌───────────▼─────────────┐
                    │   Shared Resources      │
                    │  - Model Repository     │
                    │  - Dataset Storage      │
                    │  - Monitoring (Grafana) │
                    └─────────────────────────┘
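
One way to carve the cluster into these pools is node labels that workloads then select on; a sketch using the official kubernetes Python client, with hypothetical node names and label key:

    # Label GPU nodes into inference/training/batch pools (sketch).
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    pools = {  # hypothetical node names
        "gpu-node-1": "inference",
        "gpu-node-2": "inference",
        "gpu-node-3": "inference",
        "gpu-node-4": "training",
        "gpu-node-5": "batch",
    }
    for node, pool in pools.items():
        v1.patch_node(node, {"metadata": {"labels": {"hospital.ai/pool": pool}}})
    # Pods then pin to a pool via nodeSelector, e.g. hospital.ai/pool: inference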

27.5 Cost-Benefit Analysis

Scaling Comparison Matrix:
┌───────────────────────────────────────────────┐
│ Metric          │ Vertical  │ Horizontal      │
├─────────────────┼───────────┼─────────────────┤
│ Initial Cost    │ High $$   │ Lower $         │
│ Complexity      │ Low       │ High            │
│ Redundancy      │ None      │ Built-in        │
│ Max Capability  │ Limited   │ Scales out      │
│ Network Overhead│ None      │ Significant     │
│ Maintenance     │ Easier    │ Complex         │
└───────────────────────────────────────────────┘

27.6 Practical Implementation Steps

Phase 1: Start Vertical (Months 1-3)

Single Server Configuration:
- 2x NVIDIA L40S (48GB each)
- 256GB System RAM
- 2x 2TB NVMe in RAID 1
- Estimated: $35,000-40,000

Phase 2: Add Horizontal for Inference (Months 4-6)

Add 2-3 inference nodes:
- 1x L40S per node
- 128GB RAM each
- Focus on chatbot scaling

Phase 3: Optimize Based on Metrics

Monitoring Metrics:
┌──────────────────────────┬──────────────────────────────────┐
│ Condition                │ Action                           │
├──────────────────────────┼──────────────────────────────────┤
│ Queue length > 10        │ Add inference node               │
│ GPU memory usage > 90%   │ Consider model optimization      │
│ Response time P95 > 2s   │ Scale horizontally               │
│ Training time > 24h      │ Add training GPUs                │
└──────────────────────────┴──────────────────────────────────┘
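
These checks can run on a schedule; a sketch polling a Prometheus server over its HTTP query API (the URL and metric names are placeholders for whatever the monitoring stack actually exports):

    # Evaluate the scaling rules against Prometheus (sketch).
    import requests

    PROM = "http://monitor.hospital.local:9090"    # hypothetical endpoint

    def query(expr: str) -> float:
        """Return the first scalar result of a PromQL expression, or 0."""
        resp = requests.get(f"{PROM}/api/v1/query", params={"query": expr})
        result = resp.json()["data"]["result"]
        return float(result[0]["value"][1]) if result else 0.0

    RULES = [  # (placeholder metric, threshold, action from the table above)
        ("vllm_queue_length", 10, "add inference node"),
        ("gpu_memory_used_ratio", 0.90, "consider model optimization"),
        ("response_time_p95_seconds", 2.0, "scale horizontally"),
        ("training_job_duration_hours", 24, "add training GPUs"),
    ]
    for expr, limit, action in RULES:
        value = query(expr)
        if value > limit:
            print(f"{expr} = {value:.2f} exceeds {limit}: {action}")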

27.7 Specific Recommendations for Your Hospital

  1. Start with 2x L40S vertical setup - handles all three scenarios initially
  2. Implement queue-based architecture from day one for easy scaling
  3. Use Kubernetes for orchestration (even single-node initially)
  4. Monitor actual usage patterns for 2-3 months before scaling
  5. Consider NVIDIA AI Enterprise license for hospital support

For Medical Imaging Specifically:

  • The L40S’s 48GB VRAM is excellent for high-res DICOM processing
  • Consider adding an A100 (80GB) if working with 3D volumes
  • Implement a DICOM-specific caching layer (see the sketch below)
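
A minimal sketch of such a caching layer, assuming pydicom for decoding; the cache directory is a placeholder:

    # Cache decoded DICOM pixel arrays so repeat reads skip decompression (sketch).
    from pathlib import Path
    import hashlib
    import numpy as np
    import pydicom

    CACHE = Path("/data/dicom_cache")              # placeholder location
    CACHE.mkdir(parents=True, exist_ok=True)

    def load_pixels(dicom_path: str) -> np.ndarray:
        """Return the pixel array, decoding at most once per unique file."""
        key = hashlib.sha256(Path(dicom_path).read_bytes()).hexdigest()
        cached = CACHE / f"{key}.npy"
        if cached.exists():
            return np.load(cached)                 # cache hit: no DICOM decode
        array = pydicom.dcmread(dicom_path).pixel_array
        np.save(cached, array)
        return array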