Design a document-ingestion pipeline for a vector database
via Glassdoor
Problem
Design a system to ingest and process millions of documents into a vector database for retrieval.
Functional requirements
- Accept documents, split them into chunks, generate embeddings, and store vectors with metadata.
- Support re-ingestion/updates and deletion of documents.
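The chunking step in the first requirement could be sketched as a sliding character window. This is only an illustration: the function name `chunk_text` and the default `size` and `overlap` values are assumptions, and a real pipeline would more likely split on tokens or sentence boundaries.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters.

    Each window shares its first `overlap` characters with the end of the
    previous window, so content near a boundary appears in two chunks.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once a window reaches the end, to avoid emitting a tail
        # that is fully contained in the previous chunk.
        if start + size >= len(text):
            break
    return chunks
```

For example, `chunk_text("abcdefghij", size=4, overlap=2)` yields four chunks, each starting two characters after the previous one.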
Non-functional requirements
- High throughput over millions of documents; fault-tolerant and resumable.
- Idempotent processing so retries don't create duplicates.
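One common way to get idempotent writes is to derive each vector's ID deterministically from the document ID and chunk position, then upsert. The sketch below assumes that approach; `InMemoryVectorStore`, `chunk_id`, and `ingest` are hypothetical names standing in for a real vector DB client, and `embed` is any function that maps text to a vector.

```python
import hashlib


def chunk_id(doc_id: str, index: int) -> str:
    """Deterministic ID: the same document and position always map to the same row."""
    return hashlib.sha256(f"{doc_id}:{index}".encode("utf-8")).hexdigest()


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector DB that supports upsert and delete."""

    def __init__(self) -> None:
        self.rows: dict[str, tuple[list[float], dict]] = {}

    def upsert(self, row_id: str, vector: list[float], metadata: dict) -> None:
        # Overwrites any existing row with the same ID, so retries are harmless.
        self.rows[row_id] = (vector, metadata)

    def delete_document(self, doc_id: str, from_index: int = 0) -> None:
        # Remove this document's chunks at position `from_index` and above.
        stale = [rid for rid, (_, meta) in self.rows.items()
                 if meta["doc_id"] == doc_id and meta["index"] >= from_index]
        for rid in stale:
            del self.rows[rid]


def ingest(store: InMemoryVectorStore, doc_id: str, chunks: list[str], embed) -> None:
    """Upsert every chunk, then drop leftovers from a longer previous version."""
    for i, chunk in enumerate(chunks):
        store.upsert(chunk_id(doc_id, i), embed(chunk),
                     {"doc_id": doc_id, "index": i, "text": chunk})
    store.delete_document(doc_id, from_index=len(chunks))
```

Running `ingest` twice with the same input leaves the store unchanged, and re-ingesting a shorter version of a document removes the chunks that no longer exist.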
Key components
- Ingestion API, durable message queue (e.g. Kafka), stateless worker pool, chunker, embedding service, vector DB, dead-letter queue, metadata store.
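How a stateless worker might interact with the queue and the dead-letter queue can be sketched with Python's standard `queue` module in place of a real broker such as Kafka. The message shape (`payload`, `attempts`), the `run_worker` name, and the `max_attempts` default are all assumptions made for illustration.

```python
import queue


def run_worker(tasks: queue.Queue, dead_letters: queue.Queue,
               handle, max_attempts: int = 3) -> int:
    """Drain `tasks`, retrying failures and parking poison messages.

    Returns the number of messages handled successfully.
    """
    processed = 0
    while True:
        try:
            message = tasks.get_nowait()
        except queue.Empty:
            return processed
        attempts = message.get("attempts", 0) + 1
        try:
            handle(message["payload"])
            processed += 1
        except Exception as exc:
            if attempts >= max_attempts:
                # Give up: record the failure for later inspection or replay.
                dead_letters.put({"payload": message["payload"],
                                  "error": repr(exc)})
            else:
                # Requeue with the attempt count carried on the message,
                # so the worker itself keeps no state between messages.
                tasks.put({"payload": message["payload"],
                           "attempts": attempts})
```

Because the attempt count travels with the message, any worker in the pool can pick up a retry.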
Deep dives / trade-offs
- Backpressure between ingestion and workers; batching embedding calls to respect provider rate limits.
- Idempotency and retrying failed batches (dedupe keys, checkpoints/offsets).
- Chunking strategy (size/overlap) and its effect on retrieval quality.
- Monitoring: lag, throughput, failure rate, and cost per document.
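Batching embedding calls under a provider rate limit, as in the first deep-dive item, might look like the sketch below. It uses simple fixed-interval pacing rather than a token bucket, and `embed_batch`, `batch_size`, and `max_calls_per_second` are hypothetical parameters, not any provider's actual API. The `sleep` and `clock` arguments are injectable only to make the pacing testable.

```python
import time
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(chunks: Sequence[str],
              embed_batch: Callable[[Sequence[str]], list],
              batch_size: int = 64,
              max_calls_per_second: float = 5.0,
              sleep: Callable[[float], None] = time.sleep,
              clock: Callable[[], float] = time.monotonic) -> list:
    """Embed all chunks in batches, spacing calls at least 1/rate seconds apart."""
    min_interval = 1.0 / max_calls_per_second
    vectors: list = []
    last_call = None
    for batch in batched(chunks, batch_size):
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)  # pace requests instead of hitting the limit
        last_call = clock()
        vectors.extend(embed_batch(batch))
    return vectors
```

Larger batches reduce the number of calls per document, which lowers both the pressure on the rate limit and, typically, per-request overhead.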