Design a document-ingestion pipeline for a vector database

via Glassdoor

Problem

Design a system to ingest and process millions of documents into a vector database for retrieval.

Functional requirements

  • Accept documents, split them into chunks, generate embeddings, and store vectors with metadata.
  • Support re-ingestion/updates and deletion of documents.
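The chunking step in the first requirement could be sketched as a sliding window over the text. This is a minimal illustration, not a recommended strategy: the function name `chunk_text`, the character-based windows, and the default `size`/`overlap` values are all assumptions, and a real pipeline would more likely split on tokens or sentence boundaries.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters; consecutive windows share `overlap` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

Larger overlap keeps more context across chunk boundaries at the cost of more vectors to embed and store, which is the trade-off an interviewer is likely to probe.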

Non-functional requirements

  • High throughput over millions of documents; fault-tolerant and resumable.
  • Idempotent processing so retries don't create duplicates.
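One way to get idempotent retries is to derive each chunk's ID deterministically and write with upsert semantics, so replaying the same document overwrites rows instead of adding new ones. The sketch below assumes a store that supports upsert and delete by ID; `InMemoryVectorStore`, `chunk_id`, and `ingest` are hypothetical names used only for illustration, not the API of any particular vector database.

```python
import hashlib


def chunk_id(doc_id: str, content_hash: str, index: int) -> str:
    """Deterministic key: the same document version and chunk position always give the same ID."""
    return hashlib.sha256(f"{doc_id}:{content_hash}:{index}".encode()).hexdigest()


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector DB with upsert and delete-by-document."""

    def __init__(self):
        self.rows = {}  # chunk ID -> (vector, metadata)

    def upsert(self, key, vector, metadata):
        self.rows[key] = (vector, metadata)  # overwrites on retry, never duplicates

    def delete_document(self, doc_id, keep=frozenset()):
        """Remove every chunk of doc_id except the IDs listed in keep."""
        stale = [k for k, (_, meta) in self.rows.items()
                 if meta["doc_id"] == doc_id and k not in keep]
        for k in stale:
            del self.rows[k]


def ingest(store, doc_id, chunks, embed):
    """Upsert all chunks of one document version, then drop chunks left over from older versions."""
    content_hash = hashlib.sha256("\x00".join(chunks).encode()).hexdigest()
    new_ids = set()
    for i, chunk in enumerate(chunks):
        key = chunk_id(doc_id, content_hash, i)
        store.upsert(key, embed(chunk), {"doc_id": doc_id, "index": i})
        new_ids.add(key)
    store.delete_document(doc_id, keep=new_ids)
```

Running `ingest` twice with the same input leaves the store unchanged, re-ingesting an edited document replaces its old chunks, and `delete_document(doc_id)` covers the deletion requirement.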

Key components

  • Ingestion API, durable message queue (e.g. Kafka), stateless worker pool, chunker, embedding service, vector DB, dead-letter queue, metadata store.
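To show how the queue, worker pool, and dead-letter queue fit together, here is a single-process sketch. It is only an analogy: a bounded `queue.Queue` stands in for the durable queue (it is not durable), threads stand in for stateless workers, and a plain list stands in for the dead-letter queue. The function name `run_pipeline` and its parameters are assumptions made for this example.

```python
import queue
import threading


def run_pipeline(docs, process, num_workers=4, max_in_flight=8, max_attempts=3):
    """Feed docs through a bounded queue to a worker pool; return (results, dead_letters).

    Assumes no document is None, because None is used as the shutdown sentinel.
    """
    work = queue.Queue(maxsize=max_in_flight)  # bounded: put() blocks when workers fall behind
    done, dead_letters = [], []
    lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work for this worker
                return
            for attempt in range(1, max_attempts + 1):
                try:
                    result = process(item)
                except Exception as exc:
                    if attempt == max_attempts:  # retries exhausted: park the item
                        with lock:
                            dead_letters.append((item, repr(exc)))
                else:
                    with lock:
                        done.append(result)
                    break

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for doc in docs:
        work.put(doc)   # blocks while the queue is full
    for _ in threads:
        work.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return done, dead_letters
```

In a real deployment the producer and workers would be separate services, and the queue's retained log and consumer offsets would provide the durability and resumability this sketch lacks.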

Deep dives / trade-offs

  • Backpressure between ingestion and workers; batching embedding calls to respect provider rate limits.
  • Idempotency and retrying failed batches (dedupe keys, checkpoints/offsets).
  • Chunking strategy (size/overlap) and its effect on retrieval quality.
  • Monitoring: lag, throughput, failure rate, and cost per document.
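The batching and retry points above can be sketched as follows. This is a simplified illustration under stated assumptions: `embed_batch` is a hypothetical callable standing in for the embedding provider's client, pacing is a fixed delay between batches rather than a proper token bucket, and every exception is treated as retryable, which a real client would not do.

```python
import time


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def embed_all(chunks, embed_batch, batch_size=64, max_requests_per_sec=5.0,
              max_retries=3, sleep=time.sleep):
    """Embed chunks in batches, pacing requests and retrying failed batches with backoff."""
    min_interval = 1.0 / max_requests_per_sec
    vectors = []
    for batch in batched(chunks, batch_size):
        for attempt in range(max_retries + 1):
            try:
                vectors.extend(embed_batch(batch))
                break
            except Exception:
                if attempt == max_retries:
                    raise  # let the caller route the batch to the dead-letter queue
                sleep(min_interval * 2 ** attempt)  # exponential backoff before retrying
        sleep(min_interval)  # fixed pacing between batches to stay under the rate limit
    return vectors
```

Because a failed batch is retried as a whole, this pairs naturally with deterministic chunk IDs and upserts: a batch that partially succeeded before failing can be replayed without creating duplicates.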