writing · 2025-02-18
Teaching a RAG pipeline to pick the right document
Generation quality is the part everyone demos. Retrieval quality is the part that decides whether the demo was a lie. Notes on automatic prompt tuning for document selection.
There is a comfortable illusion in RAG demos. The answer reads beautifully, so the system must be working. But a fluent answer built on the wrong document is worse than no answer. It is a confident wrong answer, and in an enterprise support setting that carries a real cost.
So the question I keep coming back to is not “how good is the generation?” It is “did we retrieve the right thing?” Everything downstream is decoration on top of that one decision.
Prompts are part of retrieval
It is easy to think of the prompt as a generation-time concern. But the prompt that asks the model to select among candidate documents is doing retrieval, and it is tunable. Treating that selection prompt as an optimization target rather than a fixed string changes how you work.
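To make "tunable" concrete, here is a minimal sketch of a selection prompt built from parameters instead of hard-coded. The function name, the default instruction wording, and the `kb-…` document ids are my own placeholders, not a fixed API. The only point is that phrasing and ordering become arguments you can vary.

```python
def build_selection_prompt(
    question,
    candidates,
    instruction="Pick the one document that answers the question.",
    reverse=False,
):
    """Assemble a document-selection prompt.

    candidates: list of (doc_id, text) pairs, in retrieval order.
    instruction and reverse are the tunable knobs (phrasing, ordering).
    """
    docs = list(reversed(candidates)) if reverse else list(candidates)
    lines = [instruction, "", f"Question: {question}", "", "Candidates:"]
    for doc_id, text in docs:
        lines.append(f"[{doc_id}] {text}")
    lines.append("")
    lines.append("Answer with the document id in brackets, then one sentence of reasoning.")
    return "\n".join(lines)


# Hypothetical candidates, for illustration only.
candidates = [
    ("kb-1", "How to reset a forgotten password."),
    ("kb-2", "How billing cycles are calculated."),
]
print(build_selection_prompt("I can't log in. What now?", candidates))
```

Once the prompt is a function of its knobs, every combination of knobs is a candidate you can score.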
Two families of approaches have served me well. The first is search over prompt variations: grid and random search across phrasings, orderings, and instructions. It is unglamorous, but it sets a baseline and surfaces which prompt structures actually move selection accuracy. The second is reinforcement-style tuning, with PPO and bandit methods that treat “did the model pick the right document?” as the reward. That is where the gains compound, because the objective is exactly the thing you care about.
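A rough sketch of how the two families fit together: build a small grid of prompt variants, then let an epsilon-greedy bandit spend its trials on the variants that pick the right document. This is the bandit side only, not PPO. Everything named here is an assumption for illustration. In particular, `toy_select` is a rigged stand-in for the real model call, so that the example runs without a model, and the instructions, orderings, and document ids are invented.

```python
import itertools
import random
from dataclasses import dataclass

LAZY = "Pick the best document."
COMPARE = "Compare every candidate before choosing."


@dataclass(frozen=True)
class PromptVariant:
    instruction: str
    doc_order: str  # "as_retrieved" or "reversed"


def build_grid(instructions, orders):
    """Search space: every instruction paired with every ordering."""
    return [PromptVariant(i, o) for i, o in itertools.product(instructions, orders)]


def toy_select(variant, example):
    """Stand-in for the model call (hypothetical). Rigged so that only the
    COMPARE instruction finds the gold document; any other instruction
    picks whichever document is shown first."""
    docs = list(example["candidates"])
    if variant.doc_order == "reversed":
        docs.reverse()
    if variant.instruction == COMPARE:
        return example["gold_id"]
    return docs[0]


def reward(variant, example):
    """1.0 if the selected document is the labeled one, else 0.0."""
    return 1.0 if toy_select(variant, example) == example["gold_id"] else 0.0


def epsilon_greedy(arms, examples, rounds, epsilon=0.1, seed=0):
    """Return (best_arm, mean_reward_per_arm) after `rounds` trials."""
    rng = random.Random(seed)
    pulls = [0] * len(arms)
    total = [0.0] * len(arms)
    for t in range(rounds):
        if t < len(arms):
            idx = t  # try every arm once before exploiting
        elif rng.random() < epsilon:
            idx = rng.randrange(len(arms))  # explore
        else:
            idx = max(range(len(arms)), key=lambda i: total[i] / pulls[i])  # exploit
        total[idx] += reward(arms[idx], rng.choice(examples))
        pulls[idx] += 1
    means = [total[i] / pulls[i] if pulls[i] else 0.0 for i in range(len(arms))]
    best = max(range(len(arms)), key=means.__getitem__)
    return arms[best], means


# Made-up labeled examples; the gold document is never shown first.
EXAMPLES = [
    {"candidates": ["kb-1", "kb-2", "kb-3"], "gold_id": "kb-2"},
    {"candidates": ["kb-7", "kb-8", "kb-9"], "gold_id": "kb-8"},
]

arms = build_grid([LAZY, COMPARE], ["as_retrieved", "reversed"])
best, means = epsilon_greedy(arms, EXAMPLES, rounds=200)
print(best, means)
```

With a real model behind `toy_select`, rewards are noisy, so the number of rounds and the value of `epsilon` matter far more than they do in this rigged version.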
The metric that matters
If you only measure answer quality, you will optimize for confident prose. Measure document-selection accuracy directly and the whole system starts pulling in the right direction. Force the model to explain why it chose a document and you get two things at once: explainability for the user, and a debugging signal for you.
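In code, the metric is small. This sketch assumes you have a labeled set where each question has one known-correct document, and that you store the model's stated reason next to its choice. The `Selection` record and its field names are my own, and the sample rows are made up.

```python
from dataclasses import dataclass


@dataclass
class Selection:
    question: str
    chosen_id: str   # document the model picked
    gold_id: str     # document a human labeled as correct
    rationale: str   # the model's stated reason for its pick


def selection_accuracy(selections):
    """Fraction of questions where the model picked the labeled document."""
    if not selections:
        raise ValueError("need at least one selection to score")
    hits = sum(s.chosen_id == s.gold_id for s in selections)
    return hits / len(selections)


def misses(selections):
    """Wrong picks, kept with their rationales for debugging."""
    return [s for s in selections if s.chosen_id != s.gold_id]


# Made-up rows, for illustration only.
log = [
    Selection("Reset password?", "kb-1", "kb-1", "Covers password reset steps."),
    Selection("Refund window?", "kb-4", "kb-5", "Mentions refunds in the title."),
    Selection("Change plan?", "kb-9", "kb-9", "Describes plan upgrades."),
]

print(f"selection accuracy: {selection_accuracy(log):.2f}")
for s in misses(log):
    print(f"{s.question} chose {s.chosen_id}, wanted {s.gold_id}: {s.rationale}")
```

Reading the rationales on the misses is usually where the next prompt variant comes from.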
Generation is the part you show people. Retrieval is the part that decides whether you should have.