
Design an API gateway routing across multiple LLMs

via Glassdoor

Problem

Design an API gateway that routes inference requests across multiple LLM backends based on load and availability.

Functional requirements

  • Accept inference requests and route each to a healthy model backend.
  • Support fallback/failover to an alternate model when the primary is slow or down.
  • Enforce per-client rate limits and log requests and responses.

Non-functional requirements

  • Low added latency, high availability, horizontal scalability.
  • Graceful degradation under spiky load and slow model inference.
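
One way to degrade gracefully under spiky load is to bound the request queue and shed work that can no longer be served in time. The following is a minimal sketch under assumptions: `BoundedQueue`, `max_depth`, and the per-request `deadline` are hypothetical names, and a real gateway would probably return an HTTP 429 or 503 where this returns `False`.

```python
from collections import deque


class BoundedQueue:
    """Hypothetical FIFO queue that rejects new work when full and
    drops requests whose deadline has already passed."""

    def __init__(self, max_depth: int):
        self.max_depth = max_depth
        self.items = deque()  # holds (request, deadline) pairs

    def offer(self, request, deadline: float) -> bool:
        # Shed load at the door instead of letting latency grow unbounded.
        if len(self.items) >= self.max_depth:
            return False
        self.items.append((request, deadline))
        return True

    def poll(self, now: float):
        # Skip requests that expired while waiting; the client has
        # most likely timed out, so running inference would be wasted.
        while self.items:
            request, deadline = self.items.popleft()
            if deadline > now:
                return request
        return None
```

Rejecting early keeps queueing delay bounded by `max_depth`, at the cost of visible errors during spikes.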

Key components

  • Gateway/router service, backend registry + health checker, load balancer, rate limiter, request queue, metrics/tracing pipeline.
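
To make the interaction between the backend registry, health checker, and load balancer concrete, here is a small sketch. Everything named here (`Backend`, `Registry`, `mark`, `pick`, the `in_flight` counter) is hypothetical; it assumes the health checker calls `mark` and the router tracks in-flight requests itself.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    model: str
    healthy: bool = True
    in_flight: int = 0  # requests currently being served by this backend


class Registry:
    """Hypothetical registry: stores backends and picks one per request."""

    def __init__(self, backends):
        self.backends = {b.name: b for b in backends}

    def mark(self, name: str, healthy: bool) -> None:
        # Called by the health checker after each probe.
        self.backends[name].healthy = healthy

    def pick(self, preferred_models):
        # Walk models in preference order; within a model, choose the
        # healthy backend with the fewest in-flight requests.
        for model in preferred_models:
            candidates = [
                b for b in self.backends.values()
                if b.model == model and b.healthy
            ]
            if candidates:
                return min(candidates, key=lambda b: b.in_flight)
        return None  # nothing healthy: caller should queue or reject
```

The preference list doubles as the fallback chain: if no backend for the primary model is healthy, the next model in the list is tried.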

Deep dives / trade-offs

  • Routing strategy: round-robin vs least-loaded vs latency/cost-aware; sticky routing for streaming.
  • Timeout and retry handling for slow inference; circuit breakers to shed load from unhealthy backends.
  • Streaming responses (SSE/WebSocket) and how that interacts with retries.
  • Observability: per-model latency, error rate, token throughput, and queue depth.
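
The circuit-breaker point above can be sketched as a small state machine per backend. This is a simplified illustration, not a production design: `CircuitBreaker`, `failure_threshold`, and `reset_timeout` are assumed names, failures are counted consecutively rather than over a sliding window, and the half-open state does not limit the number of concurrent probes.

```python
class CircuitBreaker:
    """Hypothetical breaker with closed, open, and half-open states."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0       # consecutive failures while closed
        self.opened_at = None   # None means the breaker is closed

    def state(self, now: float) -> str:
        if self.opened_at is None:
            return "closed"
        if now - self.opened_at >= self.reset_timeout:
            return "half_open"  # cool-down elapsed: allow probe traffic
        return "open"

    def allow_request(self, now: float) -> bool:
        return self.state(now) != "open"

    def record_success(self, now: float) -> None:
        # Any success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: float) -> None:
        if self.state(now) == "half_open":
            self.opened_at = now  # probe failed: restart the cool-down
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now
```

For slow inference, a timeout would typically be recorded as a failure. With streaming responses, a retry on another backend is only safe before the first chunk has been sent to the client, so the router would need to track that separately.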