Design an API gateway routing across multiple LLMs

via Glassdoor

Problem

Design an API gateway that routes inference requests across multiple LLM backends based on load and availability.

Functional requirements

Accept inference requests and route each to a healthy model backend.
Support fallback/failover to an alternate model when the primary is slow or down.
Enforce per-client rate limits and log every request and response.
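
Per-client rate limiting is often done with a token bucket. The sketch below is one possible shape, not a prescribed design: `RateLimiter`, `TokenBucket`, the `capacity` and `refill_per_sec` parameters, and the injectable `clock` are all illustrative names. It keeps state in process memory, so a gateway running several replicas would need a shared store instead.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    tokens: float       # tokens currently available
    updated_at: float   # clock reading at the last refill


class RateLimiter:
    """In-memory per-client token bucket (hypothetical helper)."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity              # maximum burst size
        self.refill_per_sec = refill_per_sec  # sustained request rate
        self.clock = clock                    # injectable for testing
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_id: str, cost: float = 1.0) -> bool:
        now = self.clock()
        bucket = self.buckets.get(client_id)
        if bucket is None:
            # New clients start with a full bucket.
            bucket = TokenBucket(tokens=self.capacity, updated_at=now)
            self.buckets[client_id] = bucket
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - bucket.updated_at
        bucket.tokens = min(self.capacity,
                            bucket.tokens + elapsed * self.refill_per_sec)
        bucket.updated_at = now
        if bucket.tokens >= cost:
            bucket.tokens -= cost
            return True
        return False
```

For LLM traffic, `cost` could be set to an estimated token count rather than 1, so that large prompts consume more of a client's budget than small ones.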

Non-functional requirements

Key components

Gateway/router service, backend registry + health checker, load balancer, rate limiter, request queue, metrics/tracing pipeline.
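
To make the relationship between the registry, health checker, and load balancer concrete, here is a minimal sketch under stated assumptions. `Backend`, `Registry`, `pick_backend`, and the backend names are hypothetical. The health checker is assumed to call `mark` after each probe, and the router is assumed to keep `in_flight` up to date as requests start and finish.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Backend:
    name: str
    healthy: bool = True
    in_flight: int = 0   # requests currently being served


class Registry:
    """Tracks known backends and their last observed health."""

    def __init__(self) -> None:
        self._backends: Dict[str, Backend] = {}

    def register(self, backend: Backend) -> None:
        self._backends[backend.name] = backend

    def mark(self, name: str, healthy: bool) -> None:
        # Called by the health checker after each probe.
        self._backends[name].healthy = healthy

    def healthy(self, names: List[str]) -> List[Backend]:
        return [self._backends[n] for n in names
                if n in self._backends and self._backends[n].healthy]


def pick_backend(registry: Registry, primary: List[str],
                 fallback: List[str]) -> Optional[Backend]:
    """Least-loaded healthy backend from the primary pool, else the fallback pool."""
    for pool in (primary, fallback):
        candidates = registry.healthy(pool)
        if candidates:
            return min(candidates, key=lambda b: b.in_flight)
    return None  # nothing healthy: the caller can queue or reject the request
```

A usage example: with `model-a-1` and `model-a-2` registered as the primary pool and `model-b-1` as the fallback, `pick_backend` returns the less busy of the two primaries, and switches to `model-b-1` only once both primaries are marked unhealthy.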

Deep dives / trade-offs

Routing strategy: round-robin vs least-loaded vs latency/cost-aware; sticky routing for streaming.
Timeout and retry handling for slow inference; circuit breakers to shed load from unhealthy backends.
Streaming responses (SSE/WebSocket) and how that interacts with retries.
Observability: per-model latency, error rate, token throughput, and queue depth.
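
The interaction between circuit breakers, retries, and streaming can be sketched as follows. This is an illustration under assumptions, not a reference implementation: `CircuitBreaker`, `stream_with_failover`, the thresholds, and the `call` callback (which yields response chunks for a named backend) are all hypothetical. The key idea is that a request can be retried on another backend only while nothing has been streamed to the client yet.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List, Optional


class CircuitBreaker:
    """Opens after consecutive failures; allows probes after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_sec: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_sec = reset_after_sec
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Open: let traffic through again once the cool-down has elapsed.
        # Simplification: this does not cap the number of concurrent probes.
        return self.clock() - self.opened_at >= self.reset_after_sec

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed probe


def stream_with_failover(backends: List[str],
                         breakers: Dict[str, CircuitBreaker],
                         call: Callable[[str], Iterable[str]]) -> Iterator[str]:
    """Yield chunks from the first usable backend, in preference order."""
    for name in backends:
        breaker = breakers[name]
        if not breaker.allow_request():
            continue  # shed load from a backend whose breaker is open
        sent_any = False
        try:
            for chunk in call(name):
                sent_any = True
                yield chunk
            breaker.record_success()
            return
        except Exception:
            breaker.record_failure()
            if sent_any:
                # Mid-stream failure: the client already holds partial output,
                # so a transparent retry would duplicate or garble the response.
                raise
    raise RuntimeError("no backend available")
```

A real gateway would also bound each attempt with a timeout and count timeouts as failures. That is omitted here to keep the sketch short.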
