Evaluating hallucination detection performance #27
Hi, I am following the Black Box demo example notebook and would like to better understand your method for evaluating hallucination detection performance (Section 3). What is meant by 'update the grading method accordingly'? In my use case I have my own prompts/questions and want to evaluate hallucination risk. Is my understanding correct that you do not have a built-in processor to determine whether an answer is correct? If I want to evaluate 5-10 sentence long LLM responses, do I need to build my own grading method to classify an answer as correct or not?

PS: Great work, thank you so much!
Replies: 1 comment
Hi 👋

Thank you for your question! The idea behind the 'grading' step is that we need to know whether the LLM actually hallucinated (by comparing the LLM response against the ground-truth/ideal response) if we want to evaluate the performance of the UQ-based scorers. This is something that would typically be done offline to see how well the different scorers detect hallucinations, which can inform the choice among the various scorers.

You are correct that the complexity of this grading task will vary based on the question/answer type. For the math questions in the demos, we simply postprocess the LLM response (extracting the integer part) and check whether it matches the correct (integer) answer. Similar approaches can be used for multiple-choice questions. For open-ended tasks like summarization, these match-based comparisons do not work as a grading approach; instead, you might consider using a pre-trained SLM that is designed to compare a 'ground truth' text (the ideal response) to the generated response.

Something we had considered is having a utility function (or class) that offers a few grading methods, such as math, multiple choice, and a model-based comparison along the lines described above.
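For concreteness, a minimal match-based grader along these lines might look like the sketch below. The function names and parsing rules are illustrative assumptions, not utilities built into the package:

```python
import re


def grade_math_response(response: str, correct_answer: int) -> bool:
    """Match-based grading for math questions: extract the last integer
    in the LLM response and compare it to the known correct answer."""
    matches = re.findall(r"-?\d+", response.replace(",", ""))
    return bool(matches) and int(matches[-1]) == correct_answer


def grade_multiple_choice(response: str, correct_choice: str) -> bool:
    """Match-based grading for multiple-choice questions: compare the first
    standalone uppercase choice letter (A-D) in the response to the key."""
    match = re.search(r"\b([A-D])\b", response)
    return bool(match) and match.group(1) == correct_choice.upper()


# Example usage: build boolean hallucination labels for a batch of responses,
# then use them to evaluate how well the UQ-based scorers separate
# hallucinated from correct answers.
# grades = [grade_math_response(r, a) for r, a in zip(responses, answers)]
```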
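For open-ended responses, one possible direction (just a stand-in sketch, not a recommendation from the demo) is to grade by semantic similarity between the generated response and the ideal response, e.g. with sentence-transformers embeddings and a threshold. The model name and threshold below are assumptions for illustration:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model and threshold -- tune both for your use case.
_embedder = SentenceTransformer("all-MiniLM-L6-v2")


def grade_open_ended(response: str, ideal_response: str, threshold: float = 0.8) -> bool:
    """Grade a free-form answer by cosine similarity to the ideal response."""
    embeddings = _embedder.encode([response, ideal_response], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return similarity >= threshold
```

A dedicated comparison model (e.g. an NLI or answer-equivalence model) would be more faithful to the 'ground truth vs. generated response' comparison described above; embedding similarity is simply the most self-contained stand-in for this sketch.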
I hope that helps! Please do let me know if you have additional questions.