Design a document-ingestion pipeline for a vector database

via Glassdoor

Problem

Design a system that ingests and processes millions of documents into a vector database for retrieval.

Functional requirements

Accept documents, split them into chunks, generate embeddings, and store vectors with metadata.
Support re-ingestion/updates and deletion of documents.
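
To make the update and delete requirements concrete, here is a minimal sketch assuming an in-memory store keyed by (doc_id, chunk_index). The names `InMemoryVectorStore`, `toy_embed`, and `split_fixed` are hypothetical and not taken from any real vector database; the embedding is a placeholder, not a real model. Re-ingestion is modeled as delete-then-write so that a shorter new version of a document cannot leave stale chunks behind.

```python
from dataclasses import dataclass, field


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a real embedding model."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def split_fixed(text: str, size: int) -> list[str]:
    """Naive fixed-size splitter; real chunkers usually respect token or sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


@dataclass
class InMemoryVectorStore:
    # (doc_id, chunk_index) -> (vector, metadata)
    rows: dict = field(default_factory=dict)

    def delete_document(self, doc_id: str) -> int:
        """Remove every chunk of a document; return how many were removed."""
        keys = [k for k in self.rows if k[0] == doc_id]
        for k in keys:
            del self.rows[k]
        return len(keys)

    def ingest(self, doc_id: str, text: str, size: int = 20) -> int:
        """Ingest or re-ingest a document; return the number of chunks written."""
        self.delete_document(doc_id)  # drop the previous version first
        chunks = split_fixed(text, size)
        for i, chunk in enumerate(chunks):
            metadata = {"doc_id": doc_id, "chunk": i, "text": chunk}
            self.rows[(doc_id, i)] = (toy_embed(chunk), metadata)
        return len(chunks)
```

In a real system the delete and the write would need to be atomic from a reader's point of view, for example by writing the new version under a version tag and switching over afterwards; that is a design choice this sketch does not cover.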

Non-functional requirements

Key components

Ingestion API, durable message queue (e.g. Kafka), stateless worker pool, chunker, embedding service, vector DB, dead-letter queue, metadata store.
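
As a rough illustration of how these components fit together, the sketch below wires a producer, a bounded queue, one worker, and a dead-letter list using only the Python standard library. It is an assumption-laden toy: `queue.Queue` is in-memory and not durable, so it only stands in for something like Kafka, and `run_pipeline`, `process`, and `MAX_ATTEMPTS` are hypothetical names chosen for this example.

```python
import queue
import threading

MAX_ATTEMPTS = 3  # assumed retry budget before a document is dead-lettered


def run_pipeline(docs, process, queue_size=2):
    """Push docs through a bounded queue to one worker; return (stored, dead_letters)."""
    work = queue.Queue(maxsize=queue_size)  # stand-in for the durable message queue
    stored, dead_letters = [], []

    def worker():
        while True:
            doc = work.get()
            if doc is None:  # sentinel: the producer has finished
                break
            for attempt in range(1, MAX_ATTEMPTS + 1):
                try:
                    stored.append(process(doc))  # chunk + embed + store in a real worker
                    break
                except Exception as exc:
                    if attempt == MAX_ATTEMPTS:
                        # Park the poison message instead of blocking the queue
                        dead_letters.append((doc, repr(exc)))

    thread = threading.Thread(target=worker)
    thread.start()
    for doc in docs:
        work.put(doc)  # blocks while the queue is full, which is the backpressure
    work.put(None)
    thread.join()
    return stored, dead_letters
```

The bounded queue is what slows the ingestion side down when workers fall behind. With a durable log such as Kafka, the equivalent signal is usually consumer lag rather than a blocking `put`.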

Deep dives / trade-offs

Backpressure between ingestion and workers; batching embedding calls to respect provider rate limits.
Idempotency and retrying failed batches (dedupe keys, checkpoints/offsets).
Chunking strategy (size/overlap) and its effect on retrieval quality.
Monitoring: lag, throughput, failure rate, and cost per document.
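
Three of the points above (overlapping chunks, dedupe keys, and batched embedding calls) can be sketched together. This is an illustration under stated assumptions, not a reference implementation: chunking is character-based rather than token-based, `seen` is a plain dict standing in for the metadata store, and `embed_batch` is a hypothetical callable representing the embedding provider. Rate limiting itself (sleeping or using a token bucket between batches) is left out.

```python
import hashlib


def chunk_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # this window already reaches the end
            break
    return chunks


def dedupe_key(doc_id: str, chunk: str) -> str:
    """Deterministic key, so a retried batch maps onto the same records."""
    return hashlib.sha256(f"{doc_id}\x00{chunk}".encode("utf-8")).hexdigest()


def embed_new_chunks(doc_id, chunks, seen, embed_batch, batch_size=2):
    """Embed only chunks not already in `seen`; return the number of provider calls."""
    pending = {}
    for chunk in chunks:
        key = dedupe_key(doc_id, chunk)
        if key not in seen:
            pending.setdefault(key, chunk)
    items = list(pending.items())
    calls = 0
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        vectors = embed_batch([chunk for _, chunk in batch])
        calls += 1
        for (key, _), vector in zip(batch, vectors):
            seen[key] = vector  # recorded only after the batch call succeeded
    return calls
```

Because keys are recorded only after a batch succeeds, replaying the same document after a crash re-embeds just the chunks that never completed. Larger overlap tends to help retrieval across chunk boundaries at the price of more vectors, and therefore higher cost per document.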
