Design an API gateway routing across multiple LLMs
via Glassdoor
Problem
Design an API gateway that routes inference requests across multiple LLM backends based on load and availability.
Functional requirements
- Accept inference requests and route each to a healthy model backend.
- Support fallback/failover to an alternate model when the primary is slow or down.
- Enforce per-client rate limits and log each request and response.
Non-functional requirements
- Low added latency, high availability, horizontal scalability.
- Graceful degradation under spiky load and slow model inference.
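Graceful degradation under spiky load is often approached by bounding the request queue and rejecting work early once it is full, instead of letting latency grow without limit. The following is a minimal sketch of that idea using the standard library's `queue.Queue`; the `AdmissionQueue` name and the `max_depth` parameter are assumptions for illustration.

```python
import queue


class AdmissionQueue:
    """Hypothetical bounded queue that sheds load instead of blocking."""

    def __init__(self, max_depth: int) -> None:
        self._queue: queue.Queue = queue.Queue(maxsize=max_depth)

    def try_enqueue(self, request: object) -> bool:
        try:
            # put_nowait raises queue.Full instead of waiting for space.
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            # The caller would typically answer with an overload error here.
            return False

    def depth(self) -> int:
        return self._queue.qsize()


admission = AdmissionQueue(max_depth=2)
accepted = [admission.try_enqueue(f"req-{i}") for i in range(3)]
```

The third request is rejected because the queue already holds two items. The right `max_depth` depends on how long a request can reasonably wait given typical inference latency, so it is a tuning parameter rather than a fixed value.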
Key components
- Gateway/router service, backend registry + health checker, load balancer, rate limiter, request queue, metrics/tracing pipeline.
Deep dives / trade-offs
- Routing strategy: round-robin vs least-loaded vs latency/cost-aware; sticky routing for streaming.
- Timeout and retry handling for slow inference; circuit breakers to shed load from unhealthy backends.
- Streaming responses (SSE/WebSocket) and how that interacts with retries.
- Observability: per-model latency, error rate, token throughput, and queue depth.
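The circuit-breaker and failover points above can be sketched together. This is a simplified illustration under stated assumptions: the `CircuitBreaker` class, its `failure_threshold` and `cooldown_s` parameters, and the `call_with_failover` helper are hypothetical names, and the `send` callable stands in for whatever client actually calls a model backend. Time is passed in explicitly to keep the example deterministic.

```python
from typing import Callable


class CircuitBreaker:
    """Opens after consecutive failures; allows trial calls after a cooldown."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None  # None means the breaker is closed

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        # Open: permit trial calls once the cooldown has elapsed (half-open).
        # Simplification: this does not limit half-open to a single trial.
        return now - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open, or re-open after a failed trial


def call_with_failover(
    backends: list[str],
    breakers: dict[str, CircuitBreaker],
    send: Callable[[str], str],
    now: float,
) -> tuple[str, str]:
    """Try backends in preference order, skipping any whose breaker is open."""
    for name in backends:
        breaker = breakers[name]
        if not breaker.allow(now):
            continue
        try:
            result = send(name)
        except Exception:  # a real gateway would catch specific timeout/connection errors
            breaker.record_failure(now)
            continue
        breaker.record_success()
        return name, result
    raise RuntimeError("no backend available")


calls: list[str] = []


def fake_send(name: str) -> str:
    # Stand-in for a model call: the primary always times out.
    calls.append(name)
    if name == "primary":
        raise TimeoutError("primary too slow")
    return "ok"


breakers = {"primary": CircuitBreaker(2, 10.0), "fallback": CircuitBreaker(2, 10.0)}
order = ["primary", "fallback"]
served_by = [call_with_failover(order, breakers, fake_send, now=t)[0] for t in (0.0, 1.0, 2.0)]
```

After two failures the primary's breaker opens, so the third request goes straight to the fallback without waiting on another timeout. For streaming responses, a retry like this is generally only safe before any tokens have been sent to the client; once output has started, failing over would require the client to tolerate a restarted or duplicated stream.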