Retrieval-augmented generation sounds straightforward in demos. In production — with messy documents, compliance requirements, and users who need citations — it's a different story. Here's what we've learned.
Why RAG Breaks in Production
- Documents are messy. Scanned PDFs with OCR errors. Excel files with merged cells. Word docs with inconsistent formatting. PowerPoints with text in images.
- Context gets lost. When you chunk a 200-page filing into 500-token blocks, “as described above” becomes meaningless.
- Retrieval is noisy. Semantic search returns topically related but not actually relevant chunks — leading to hallucinated answers.
- Users need citations. In regulated industries, the answer isn't enough. Users need exact document, page, and paragraph references.
- Data changes. Documents get updated or superseded. Your RAG system needs version control and recency handling.
Chunking Strategy Matters More Than You Think
Semantic Chunking Over Fixed-Size
Fixed-size chunking is easy but terrible for quality — cutting sentences in half and stripping context. We use semantic chunking that respects document structure, splitting on section headers and logical breaks.
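To make the idea concrete, here is a minimal pure-Python sketch of structure-aware chunking: split on section headers first, and fall back to paragraph breaks only when a section is still too large. The `semantic_chunks` name and the whitespace word-count heuristic are illustrative — a production pipeline would use a real tokenizer and format-specific structure detection.

```python
import re

def semantic_chunks(text: str, max_tokens: int = 500) -> list[str]:
    """Split on section headers first; fall back to paragraph breaks
    only when a section exceeds the budget. Token counts are
    approximated by whitespace word count (illustrative only)."""
    # Split before markdown-style headers, keeping each header with its section.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        if not section.strip():
            continue
        if len(section.split()) <= max_tokens:
            chunks.append(section.strip())
            continue
        # Section too large: accumulate paragraphs up to the budget.
        current: list[str] = []
        for para in section.split("\n\n"):
            if current and len(" ".join(current).split()) + len(para.split()) > max_tokens:
                chunks.append("\n\n".join(current).strip())
                current = []
            current.append(para)
        if current:
            chunks.append("\n\n".join(current).strip())
    return chunks
```

Note that chunk boundaries always land on paragraph or section breaks, so no sentence is ever cut in half — the failure mode fixed-size chunking can't avoid.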
Parent-Child Chunk Architecture
This was our biggest quality improvement. Each chunk has a “parent” — a larger block of surrounding context. Small chunks ensure retrieval precision; parent chunks ensure the model has enough context.
Parent chunk (section level, ~2000 tokens)
├── Child chunk 1 (paragraph, ~300 tokens) ← indexed for retrieval
├── Child chunk 2 (paragraph, ~300 tokens) ← indexed for retrieval
└── Child chunk 3 (paragraph, ~300 tokens) ← indexed for retrieval

Retrieval: search against child chunks
Context: pass parent chunk to the LLM
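The lookup step can be sketched in a few lines. This is an assumed data model, not our actual schema: the point is that retrieval returns child chunks, and the context builder swaps in each child's parent exactly once, even when several paragraphs of the same section matched.

```python
from dataclasses import dataclass

@dataclass
class ParentChunk:
    id: str
    text: str            # ~2000-token section passed to the LLM

@dataclass
class ChildChunk:
    id: str
    text: str            # ~300-token paragraph, indexed for retrieval
    parent_id: str

def context_from_hits(hits: list[ChildChunk],
                      parents: dict[str, ParentChunk]) -> list[str]:
    """Map retrieved children back to their parents, deduplicating so
    the model sees each section once."""
    seen: set[str] = set()
    context = []
    for child in hits:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            context.append(parents[child.parent_id].text)
    return context
```

Deduplication matters in practice: without it, a section whose paragraphs all match the query gets pasted into the prompt three times.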
Metadata-Enriched Chunks
Every chunk carries metadata: document title, section header, page number, type, date, version. Used for filtering and citation generation.
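A sketch of what that metadata might look like and how it feeds citation generation — field names and the citation format here are hypothetical, chosen to mirror the list above:

```python
import datetime
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkMetadata:
    document_title: str
    section_header: str
    page_number: int
    doc_type: str          # e.g. "filing", "contract", "policy"
    date: datetime.date
    version: str

def format_citation(meta: ChunkMetadata) -> str:
    """Render chunk metadata as the citation string shown to users."""
    return (f'{meta.document_title} ({meta.version}, {meta.date.isoformat()}), '
            f'"{meta.section_header}", p. {meta.page_number}')
```

The same fields double as filter predicates at query time (restrict to a document type or date range before vector search ever runs).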
Retrieval Quality Is Everything
We use a multi-stage retrieval pipeline:
- Query expansion. The user's question is reformulated into multiple search queries to increase recall.
- Hybrid search. Combine semantic (vector) search with keyword (BM25) search. Semantic catches concepts; keyword catches exact terms.
- Re-ranking. A cross-encoder model evaluates query-document relevance more accurately than embedding similarity.
- Relevance filtering. Below-threshold chunks are excluded. Better to say “I don't have enough information” than generate from marginal context.
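Two of the steps above are easy to show in miniature. For merging the semantic and keyword rankings, Reciprocal Rank Fusion is one common choice (the post doesn't prescribe a specific fusion method; RRF is attractive because it needs no score-scale tuning). The threshold filter then turns an empty result into an honest “I don't have enough information”:

```python
def rrf_fuse(semantic_ids: list[str], keyword_ids: list[str],
             k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge a vector-search ranking with a
    keyword ranking by summing 1/(k + rank) contributions."""
    scores: dict[str, float] = {}
    for ranking in (semantic_ids, keyword_ids):
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def filter_by_relevance(reranked: list[tuple[str, float]],
                        threshold: float) -> list[tuple[str, float]]:
    """Drop chunks the cross-encoder scored below threshold; an empty
    list signals the answer path to decline rather than guess."""
    return [(cid, score) for cid, score in reranked if score >= threshold]
```

A chunk ranked highly by both retrievers accumulates two large contributions and rises to the top, which is exactly the behavior you want from hybrid search.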
Citations and Explainability
- Inline citations. The LLM cites sources with numbered references mapping to specific chunks with full metadata.
- Source preview. Click any citation to see original text with the relevant passage highlighted.
- Confidence indicators. Each answer includes a confidence level based on retrieval quality.
- Audit logging. Every query, retrieval result, and answer is logged for regulatory compliance.
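The inline-citation step above hinges on one small piece of plumbing: resolving the `[n]` markers the model emits back to the retrieved chunks. A hedged sketch (the marker format and the fail-loudly behavior are assumptions, not the post's stated implementation):

```python
import re

def map_citations(answer: str, chunks: list[dict]) -> dict[int, dict]:
    """Resolve numbered [n] references in the model's answer to the
    retrieved chunks they point at (1-based, in retrieval order).
    Out-of-range numbers raise instead of being silently dropped —
    a citation to a nonexistent source is itself a hallucination."""
    cited: dict[int, dict] = {}
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        if not 1 <= n <= len(chunks):
            raise ValueError(
                f"answer cites [{n}] but only {len(chunks)} chunks were retrieved")
        cited[n] = chunks[n - 1]
    return cited
```

Failing loudly here is a deliberate choice: it feeds the response-validation stage, which can retry or downgrade the answer's confidence instead of shipping a broken reference to a compliance user.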
“The citation system convinced our compliance team to approve the tool. They can verify every statement against the source document in seconds.”
— Chief Compliance Officer
Our Production RAG Stack
Document Processing:
├── Apache Tika (PDF/Word/Excel parsing)
├── Custom chunking pipeline (semantic + parent-child)
└── Metadata extraction (titles, dates, versions)

Embedding & Storage:
├── Voyage AI embeddings (domain-optimized)
├── pgvector (PostgreSQL vector extension)
└── Document metadata in PostgreSQL

Retrieval:
├── Hybrid search (pgvector + pg_trgm for BM25)
├── Cross-encoder re-ranking
└── Relevance threshold filtering

Generation:
├── Claude API (structured citation prompts)
├── Response validation (hallucination detection)
└── Citation mapping and source linking
The most important lesson: RAG isn't one system — it's a pipeline of interconnected systems, each independently testable and optimizable.
Need a RAG system that works in production? We've deployed retrieval-augmented generation for financial services, compliance, and construction clients. Let's discuss your use case →
