May 20, 20265 min read

What nobody tells you about RAG in production

I spent two days tuning chunk sizes. None of it mattered as much as one prompt constraint I added at the end.

I spent two days tuning chunk sizes and overlap percentages for DeepScholar. Sentence boundaries, 256 tokens, 512 tokens. The WCSS curve eventually flattened. None of that mattered as much as one line I added at the end of the RAG prompt: 'If you cannot cite a retrieved passage, say so. Do not invent references.' That constraint cut hallucinated citations to near zero.

PDFs are a disaster. PyMuPDF is solid, but academic papers with two-column layouts, footnotes baked into the margin, and scanned pages encoded as image objects will break any extraction pipeline eventually. I kept a folder of known-bad PDFs to test against after every pipeline change. That folder grew faster than I expected.

The cost model surprised me. Embedding a 30-page paper during ingestion is cheap. Per-query retrieval costs add up fast if you're not caching. I added a simple hash-based query cache for repeated questions and the API bill dropped noticeably within a week.

The hardest part was not the retrieval. It was convincing myself the answers were actually grounded. I wrote a spot-check script that pulled 20 random QA pairs and verified citations manually. That loop taught me more about the system's failure modes than any automated metric ever did.