How We Evaluate and Test AI Chatbot Quality

The methodology we use to measure and improve chatbot accuracy before deployment — evaluation datasets, metrics, and the ongoing quality improvement loop.

Why evaluation matters

You cannot improve what you cannot measure. Before any chatbot goes to production, we build an evaluation suite — a set of test questions with expected answers — and run it against the system to establish a quality baseline.

Building the evaluation set

A good evaluation set contains 50–150 question-answer pairs that represent the range of questions real users will ask. We build this from:

  • Your existing FAQ content
  • Support ticket history (anonymised)
  • Questions identified during discovery as critical
  • Edge cases — questions near the boundary of the bot's knowledge
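
For a concrete picture of what this looks like, the sketch below stores each pair as one JSON object per line. The field names (question, expected_answer, tags) and the REFUSE sentinel for out-of-scope questions are illustrative assumptions, not a required schema:

```python
import json

# One evaluation case per line; field names here are illustrative assumptions.
# "tags" records where the case came from (faq, tickets, edge-case) so results
# can be sliced by source later. "REFUSE" marks questions the bot should escalate.
EXAMPLE_JSONL = """\
{"question": "How do I reset my password?", "expected_answer": "Use the 'Forgot password' link on the sign-in page.", "tags": ["faq"]}
{"question": "Can I export my data as CSV?", "expected_answer": "Yes, from Settings > Export.", "tags": ["tickets"]}
{"question": "Do you offer phone support at weekends?", "expected_answer": "REFUSE", "tags": ["edge-case"]}
"""

def load_eval_set(text: str) -> list[dict]:
    """Parse a JSONL evaluation set into a list of test cases."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

cases = load_eval_set(EXAMPLE_JSONL)
print(f"Loaded {len(cases)} evaluation cases")
```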

Metrics we track

Answer accuracy

Does the bot answer the question correctly? We score each answer as correct, partially correct, or incorrect against the expected answer.
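
A sketch of how the three-way labels roll up into one number. In practice the correct / partially correct / incorrect label comes from a human reviewer or an LLM judge; giving half credit to a partially correct answer is our assumption here, not a fixed rule:

```python
# Per-answer labels assigned by a reviewer or LLM judge.
# Weighting a partially correct answer at 0.5 is an assumption for illustration.
WEIGHTS = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}

def answer_accuracy(labels: list[str]) -> float:
    """Aggregate per-answer labels into a single accuracy score in [0, 1]."""
    return sum(WEIGHTS[label] for label in labels) / len(labels)

labels = ["correct", "correct", "partial", "incorrect", "correct"]
print(f"Answer accuracy: {answer_accuracy(labels):.0%}")  # 70%
```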

Retrieval precision

Are the right document chunks being retrieved for each query? Good generation from bad context is impossible — retrieval quality is a prerequisite for answer quality.
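
One common way to quantify this is precision at k: of the top k chunks retrieved for a query, what fraction did a reviewer mark as relevant? A minimal sketch, assuming each evaluation question has a set of labelled relevant chunk IDs:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in relevant_ids) / len(top_k)

# Example: 3 of the 5 retrieved chunks were labelled relevant for this query.
retrieved = ["chunk-12", "chunk-07", "chunk-33", "chunk-41", "chunk-90"]
relevant = {"chunk-12", "chunk-07", "chunk-90"}
print(f"P@5 = {precision_at_k(retrieved, relevant):.2f}")  # 0.60
```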

Hallucination rate

Does the bot ever make claims not supported by the retrieved context? Any hallucination rate above 2% needs to be investigated and fixed before production.
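
Detecting an unsupported claim still takes a human (or LLM-judge) pass over each answer and its retrieved context; the production gate itself is then simple arithmetic against the 2% threshold:

```python
HALLUCINATION_THRESHOLD = 0.02  # the 2% production gate described above

def passes_hallucination_gate(flagged: int, total: int) -> bool:
    """True if the measured hallucination rate is at or below the threshold."""
    rate = flagged / total
    print(f"Hallucination rate: {rate:.1%} ({flagged}/{total} answers)")
    return rate <= HALLUCINATION_THRESHOLD

# Example: 3 answers with unsupported claims across a 120-question run.
if not passes_hallucination_gate(flagged=3, total=120):
    print("Above threshold: investigate and fix before production.")
```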

Refusal rate

How often does the bot correctly recognise that it does not know the answer and escalate? A bot that confidently answers questions it cannot answer is worse than one that says "I do not know".
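
One way to measure this is to seed the evaluation set with deliberately out-of-scope questions (the REFUSE cases sketched earlier) and count how many the bot escalates rather than answers. The record format below is an assumption for illustration:

```python
def refusal_metrics(results: list[tuple[bool, bool]]) -> dict[str, float]:
    """results holds (should_refuse, did_refuse) pairs, one per question.

    correct_refusal_rate -- out-of-scope questions the bot escalated (good)
    overconfident_rate   -- out-of-scope questions it answered anyway (bad)
    """
    out_of_scope = [r for r in results if r[0]]
    correct = sum(1 for _, did_refuse in out_of_scope if did_refuse)
    return {
        "correct_refusal_rate": correct / len(out_of_scope),
        "overconfident_rate": (len(out_of_scope) - correct) / len(out_of_scope),
    }

# Example: 10 out-of-scope questions, 8 correctly escalated, 2 answered anyway.
results = [(True, True)] * 8 + [(True, False)] * 2 + [(False, False)] * 40
print(refusal_metrics(results))
```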

The improvement loop

After launch, we analyse conversations weekly for the first month. Common patterns we look for:

  • Questions the bot failed to answer correctly — usually indicates a content gap
  • Questions that were retrieved correctly but answered poorly — usually a prompt issue
  • Questions that were escalated unnecessarily — usually a confidence threshold issue

Content gaps are fixed by adding documentation. Prompt issues are fixed by adjusting the system prompt. Confidence issues are fixed by recalibrating the threshold below which the bot escalates instead of answering.
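
For the confidence case, calibration can be as simple as replaying reviewed conversations at several candidate thresholds and picking the one that minimises both failure modes. A sketch, assuming each logged question carries a retrieval confidence score and a reviewer verdict on whether the answer given was acceptable; weighting a bad answer at twice the cost of an unnecessary escalation is our assumption, not a fixed rule:

```python
def best_escalation_threshold(records: list[tuple[float, bool]]) -> float:
    """records holds (confidence, answer_was_acceptable) pairs from reviewed logs."""
    best_threshold, best_cost = 0.0, float("inf")
    for step in range(21):  # candidate thresholds 0.00, 0.05, ..., 1.00
        threshold = step / 20
        bad_answers = sum(1 for conf, ok in records if conf >= threshold and not ok)
        needless_escalations = sum(1 for conf, ok in records if conf < threshold and ok)
        cost = 2 * bad_answers + needless_escalations  # assumed 2:1 cost trade-off
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold

records = [(0.92, True), (0.81, True), (0.77, True), (0.58, False), (0.43, False), (0.35, False)]
print(f"Escalate below confidence {best_escalation_threshold(records):.2f}")
```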

Still have questions?

Our team is happy to walk you through anything — just send us a message.