Retrieval-augmented generation sounds straightforward in demos. In production — with messy documents, compliance requirements, and users who need citations — it's a different story. Here's what we've learned.
Why RAG Breaks in Production
- Documents are messy. Scanned PDFs with OCR errors. Excel files with merged cells. Word docs with inconsistent formatting. PowerPoints with text in images.
- Context gets lost. When you chunk a 200-page filing into 500-token blocks, “as described above” becomes meaningless.
- Retrieval is noisy. Semantic search returns topically related but not actually relevant chunks — leading to hallucinated answers.
- Users need citations. In regulated industries, the answer isn't enough. Users need exact document, page, and paragraph references.
- Data changes. Documents get updated or superseded. Your RAG system needs version control and recency handling.
Chunking Strategy Matters More Than You Think
Semantic Chunking Over Fixed-Size
Fixed-size chunking is easy but terrible for quality — cutting sentences in half and stripping context. We use semantic chunking that respects document structure, splitting on section headers and logical breaks.
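To make the idea concrete, here is a minimal pure-Python sketch of structure-aware chunking: split on section headers first, and fall back to paragraph breaks only when a section is still too large. The `semantic_chunks` name and the whitespace word-count heuristic are illustrative — a production pipeline would use a real tokenizer and format-specific structure detection.

```python
import re

def semantic_chunks(text: str, max_tokens: int = 500) -> list[str]:
    """Split on section headers first; fall back to paragraph breaks
    only when a section exceeds the budget. Token counts are
    approximated by whitespace word count (illustrative only)."""
    # Split before markdown-style headers, keeping each header with its section.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        if not section.strip():
            continue
        if len(section.split()) <= max_tokens:
            chunks.append(section.strip())
            continue
        # Section too large: accumulate paragraphs up to the budget.
        current: list[str] = []
        for para in section.split("\n\n"):
            if current and len(" ".join(current).split()) + len(para.split()) > max_tokens:
                chunks.append("\n\n".join(current).strip())
                current = []
            current.append(para)
        if current:
            chunks.append("\n\n".join(current).strip())
    return chunks
```

Note that chunk boundaries always land on paragraph or section breaks, so no sentence is ever cut in half — the failure mode fixed-size chunking can't avoid.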
Parent-Child Chunk Architecture
This was our biggest quality improvement. Each chunk has a “parent” — a larger block of surrounding context. Small chunks ensure retrieval precision; parent chunks ensure the model has enough context.
Parent chunk (section level, ~2000 tokens)
├── Child chunk 1 (paragraph, ~300 tokens) ← indexed for retrieval
├── Child chunk 2 (paragraph, ~300 tokens) ← indexed for retrieval
└── Child chunk 3 (paragraph, ~300 tokens) ← indexed for retrieval

Retrieval: search against child chunks
Context: pass parent chunk to the LLM
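The lookup step can be sketched in a few lines. This is an assumed data model, not our actual schema: the point is that retrieval returns child chunks, and the context builder swaps in each child's parent exactly once, even when several paragraphs of the same section matched.

```python
from dataclasses import dataclass

@dataclass
class ParentChunk:
    id: str
    text: str            # ~2000-token section passed to the LLM

@dataclass
class ChildChunk:
    id: str
    text: str            # ~300-token paragraph, indexed for retrieval
    parent_id: str

def context_from_hits(hits: list[ChildChunk],
                      parents: dict[str, ParentChunk]) -> list[str]:
    """Map retrieved children back to their parents, deduplicating so
    the model sees each section once."""
    seen: set[str] = set()
    context = []
    for child in hits:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            context.append(parents[child.parent_id].text)
    return context
```

Deduplication matters in practice: without it, a section whose paragraphs all match the query gets pasted into the prompt three times.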
Metadata-Enriched Chunks
Every chunk carries metadata: document title, section header, page number, type, date, version. Used for filtering and citation generation.
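A sketch of what that metadata might look like and how it feeds citation generation — field names and the citation format here are hypothetical, chosen to mirror the list above:

```python
import datetime
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkMetadata:
    document_title: str
    section_header: str
    page_number: int
    doc_type: str          # e.g. "filing", "contract", "policy"
    date: datetime.date
    version: str

def format_citation(meta: ChunkMetadata) -> str:
    """Render chunk metadata as the citation string shown to users."""
    return (f'{meta.document_title} ({meta.version}, {meta.date.isoformat()}), '
            f'"{meta.section_header}", p. {meta.page_number}')
```

The same fields double as filter predicates at query time (restrict to a document type or date range before vector search ever runs).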
Retrieval Quality Is Everything
We use a multi-stage retrieval pipeline:
- Query expansion. The user's question is reformulated into multiple search queries to increase recall.
- Hybrid search. Combine semantic (vector) search with keyword (BM25) search. Semantic catches concepts; keyword catches exact terms.
- Re-ranking. A cross-encoder model evaluates query-document relevance more accurately than embedding similarity.
- Relevance filtering. Below-threshold chunks are excluded. Better to say “I don't have enough information” than generate from marginal context.
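Two of the steps above are easy to show in miniature. For merging the semantic and keyword rankings, Reciprocal Rank Fusion is one common choice (the post doesn't prescribe a specific fusion method; RRF is attractive because it needs no score-scale tuning). The threshold filter then turns an empty result into an honest “I don't have enough information”:

```python
def rrf_fuse(semantic_ids: list[str], keyword_ids: list[str],
             k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge a vector-search ranking with a
    keyword ranking by summing 1/(k + rank) contributions."""
    scores: dict[str, float] = {}
    for ranking in (semantic_ids, keyword_ids):
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def filter_by_relevance(reranked: list[tuple[str, float]],
                        threshold: float) -> list[tuple[str, float]]:
    """Drop chunks the cross-encoder scored below threshold; an empty
    list signals the answer path to decline rather than guess."""
    return [(cid, score) for cid, score in reranked if score >= threshold]
```

A chunk ranked highly by both retrievers accumulates two large contributions and rises to the top, which is exactly the behavior you want from hybrid search.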
Citations and Explainability
- Inline citations. The LLM cites sources with numbered references mapping to specific chunks with full metadata.
- Source preview. Click any citation to see original text with the relevant passage highlighted.
- Confidence indicators. Each answer includes a confidence level based on retrieval quality.
- Audit logging. Every query, retrieval result, and answer is logged for regulatory compliance.
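The inline-citation step above hinges on one small piece of plumbing: resolving the `[n]` markers the model emits back to the retrieved chunks. A hedged sketch (the marker format and the fail-loudly behavior are assumptions, not the post's stated implementation):

```python
import re

def map_citations(answer: str, chunks: list[dict]) -> dict[int, dict]:
    """Resolve numbered [n] references in the model's answer to the
    retrieved chunks they point at (1-based, in retrieval order).
    Out-of-range numbers raise instead of being silently dropped —
    a citation to a nonexistent source is itself a hallucination."""
    cited: dict[int, dict] = {}
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        if not 1 <= n <= len(chunks):
            raise ValueError(
                f"answer cites [{n}] but only {len(chunks)} chunks were retrieved")
        cited[n] = chunks[n - 1]
    return cited
```

Failing loudly here is a deliberate choice: it feeds the response-validation stage, which can retry or downgrade the answer's confidence instead of shipping a broken reference to a compliance user.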
“The citation system convinced our compliance team to approve the tool. They can verify every statement against the source document in seconds.”
— Chief Compliance Officer
Our Production RAG Stack
Document Processing:
├── Apache Tika (PDF/Word/Excel parsing)
├── Custom chunking pipeline (semantic + parent-child)
└── Metadata extraction (titles, dates, versions)

Embedding & Storage:
├── Voyage AI embeddings (domain-optimized)
├── pgvector (PostgreSQL vector extension)
└── Document metadata in PostgreSQL

Retrieval:
├── Hybrid search (pgvector + pg_trgm for BM25)
├── Cross-encoder re-ranking
└── Relevance threshold filtering

Generation:
├── Claude API (structured citation prompts)
├── Response validation (hallucination detection)
└── Citation mapping and source linking
The most important lesson: RAG isn't one system — it's a pipeline of interconnected systems, each independently testable and optimizable.
Need a RAG system that works in production? We've deployed retrieval-augmented generation for financial services, compliance, and construction clients. Let's discuss your use case →
