How We Evaluate and Test AI Chatbot Quality

The methodology we use to measure and improve chatbot accuracy before deployment — evaluation datasets, metrics, and the ongoing quality improvement loop.

Why evaluation matters

You cannot improve what you cannot measure. Before any chatbot goes to production, we build an evaluation suite — a set of test questions with expected answers — and run it against the system to establish a quality baseline.

Building the evaluation set

A good evaluation set contains 50–150 question-answer pairs that represent the range of questions real users will ask. We build this from:

  • Your existing FAQ content
  • Support ticket history (anonymised)
  • Questions identified during discovery as critical
  • Edge cases — questions near the boundary of the bot's knowledge
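
For a concrete picture of what this looks like, the sketch below stores each pair as one JSON object per line. The field names (question, expected_answer, tags) and the REFUSE sentinel for out-of-scope questions are illustrative assumptions, not a required schema:

```python
import json

# One evaluation case per line; field names here are illustrative assumptions.
# "tags" records where the case came from (faq, tickets, edge-case) so results
# can be sliced by source later. "REFUSE" marks questions the bot should escalate.
EXAMPLE_JSONL = """\
{"question": "How do I reset my password?", "expected_answer": "Use the 'Forgot password' link on the sign-in page.", "tags": ["faq"]}
{"question": "Can I export my data as CSV?", "expected_answer": "Yes, from Settings > Export.", "tags": ["tickets"]}
{"question": "Do you offer phone support at weekends?", "expected_answer": "REFUSE", "tags": ["edge-case"]}
"""

def load_eval_set(text: str) -> list[dict]:
    """Parse a JSONL evaluation set into a list of test cases."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

cases = load_eval_set(EXAMPLE_JSONL)
print(f"Loaded {len(cases)} evaluation cases")
```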

Metrics we track

Answer accuracy

Does the bot answer the question correctly? We score each answer as correct, partially correct, or incorrect against the expected answer.
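
A sketch of how the three-way labels roll up into one number. In practice the correct / partially correct / incorrect label comes from a human reviewer or an LLM judge; giving half credit to a partially correct answer is our assumption here, not a fixed rule:

```python
# Per-answer labels assigned by a reviewer or LLM judge.
# Weighting a partially correct answer at 0.5 is an assumption for illustration.
WEIGHTS = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}

def answer_accuracy(labels: list[str]) -> float:
    """Aggregate per-answer labels into a single accuracy score in [0, 1]."""
    return sum(WEIGHTS[label] for label in labels) / len(labels)

labels = ["correct", "correct", "partial", "incorrect", "correct"]
print(f"Answer accuracy: {answer_accuracy(labels):.0%}")  # 70%
```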

Retrieval precision

Are the right document chunks being retrieved for each query? Good generation from bad context is impossible — retrieval quality is a prerequisite for answer quality.
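
One common way to quantify this is precision at k: of the top k chunks retrieved for a query, what fraction did a reviewer mark as relevant? A minimal sketch, assuming each evaluation question has a set of labelled relevant chunk IDs:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in relevant_ids) / len(top_k)

# Example: 3 of the 5 retrieved chunks were labelled relevant for this query.
retrieved = ["chunk-12", "chunk-07", "chunk-33", "chunk-41", "chunk-90"]
relevant = {"chunk-12", "chunk-07", "chunk-90"}
print(f"P@5 = {precision_at_k(retrieved, relevant):.2f}")  # 0.60
```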

Hallucination rate

Does the bot ever make claims not supported by the retrieved context? Any hallucination rate above 2% needs to be investigated and fixed before production.
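
Detecting an unsupported claim still takes a human (or LLM-judge) pass over each answer and its retrieved context; the production gate itself is then simple arithmetic against the 2% threshold:

```python
HALLUCINATION_THRESHOLD = 0.02  # the 2% production gate described above

def passes_hallucination_gate(flagged: int, total: int) -> bool:
    """True if the measured hallucination rate is at or below the threshold."""
    rate = flagged / total
    print(f"Hallucination rate: {rate:.1%} ({flagged}/{total} answers)")
    return rate <= HALLUCINATION_THRESHOLD

# Example: 3 answers with unsupported claims across a 120-question run.
if not passes_hallucination_gate(flagged=3, total=120):
    print("Above threshold: investigate and fix before production.")
```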

Refusal rate

How often does the bot correctly recognise that it does not know the answer and escalate? A bot that confidently answers questions it cannot answer is worse than one that says "I do not know".
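
One way to measure this is to seed the evaluation set with deliberately out-of-scope questions (the REFUSE cases sketched earlier) and count how many the bot escalates rather than answers. The record format below is an assumption for illustration:

```python
def refusal_metrics(results: list[tuple[bool, bool]]) -> dict[str, float]:
    """results holds (should_refuse, did_refuse) pairs, one per question.

    correct_refusal_rate -- out-of-scope questions the bot escalated (good)
    overconfident_rate   -- out-of-scope questions it answered anyway (bad)
    """
    out_of_scope = [r for r in results if r[0]]
    correct = sum(1 for _, did_refuse in out_of_scope if did_refuse)
    return {
        "correct_refusal_rate": correct / len(out_of_scope),
        "overconfident_rate": (len(out_of_scope) - correct) / len(out_of_scope),
    }

# Example: 10 out-of-scope questions, 8 correctly escalated, 2 answered anyway.
results = [(True, True)] * 8 + [(True, False)] * 2 + [(False, False)] * 40
print(refusal_metrics(results))
```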

The improvement loop

After launch, we analyse conversations weekly for the first month. Common patterns we look for:

  • Questions the bot failed to answer correctly — usually indicates a content gap
  • Questions that were retrieved correctly but answered poorly — usually a prompt issue
  • Questions that were escalated unnecessarily — usually a confidence threshold issue

Content gaps are fixed by adding documentation. Prompt issues are fixed by adjusting the system prompt. Confidence issues are fixed by recalibrating the threshold below which the bot escalates instead of answering.
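
For the confidence case, calibration can be as simple as replaying reviewed conversations at several candidate thresholds and picking the one that minimises both failure modes. A sketch, assuming each logged question carries a retrieval confidence score and a reviewer verdict on whether the answer given was acceptable; weighting a bad answer at twice the cost of an unnecessary escalation is our assumption, not a fixed rule:

```python
def best_escalation_threshold(records: list[tuple[float, bool]]) -> float:
    """records holds (confidence, answer_was_acceptable) pairs from reviewed logs."""
    best_threshold, best_cost = 0.0, float("inf")
    for step in range(21):  # candidate thresholds 0.00, 0.05, ..., 1.00
        threshold = step / 20
        bad_answers = sum(1 for conf, ok in records if conf >= threshold and not ok)
        needless_escalations = sum(1 for conf, ok in records if conf < threshold and ok)
        cost = 2 * bad_answers + needless_escalations  # assumed 2:1 cost trade-off
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold

records = [(0.92, True), (0.81, True), (0.77, True), (0.58, False), (0.43, False), (0.35, False)]
print(f"Escalate below confidence {best_escalation_threshold(records):.2f}")
```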

Still have questions?

Our team is happy to walk you through anything — just send us a message.