Evaluating hallucination detection performance #27
Hi, I am following the Black Box demo example notebook and would like to better understand your method for evaluating hallucination detection performance (Section 3). What is meant by 'update the grading method accordingly'? In my use case I have my own prompts/questions and want to evaluate hallucination risk. Is my understanding correct that you do not have a built-in processor to determine whether an answer is correct? If I want to evaluate 5-10 sentence long LLM responses, do I need to build my own grading method to classify an answer as correct or not?

PS: Great work, thank you so much!
Replies: 1 comment
Hi 👋

Thank you for your question! The idea behind the 'grading' step is that we need to know whether the LLM actually hallucinated (by comparing the LLM response against the ground-truth/ideal response) if we want to evaluate the performance of the UQ-based scorers. This is something that would typically be done offline to see how well the different scorers detect hallucinations, which can inform the choice among the various scorers.

You are correct that the complexity of this grading task will vary based on the question/answer type. For the math questions in the demos, we simply postprocess the LLM response (extracting the integer part) and check whether it matches the correct (integer) answer. Similar approaches can be used for multiple-choice questions. For open-ended tasks like summarization, these match-based comparisons do not work as a grading approach; instead, you might consider using a pre-trained SLM that is designed to compare a 'ground truth' text (the ideal response) to the generated response.

Something we had considered is having a utility function (or class) that offers a few grading methods, such as math, multiple choice, and a model-based comparison along the lines described above.
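For concreteness, a minimal match-based grader along these lines might look like the sketch below. The function names and parsing rules are illustrative assumptions, not utilities built into the package:

```python
import re


def grade_math_response(response: str, correct_answer: int) -> bool:
    """Match-based grading for math questions: extract the last integer
    in the LLM response and compare it to the known correct answer."""
    matches = re.findall(r"-?\d+", response.replace(",", ""))
    return bool(matches) and int(matches[-1]) == correct_answer


def grade_multiple_choice(response: str, correct_choice: str) -> bool:
    """Match-based grading for multiple-choice questions: compare the first
    standalone uppercase choice letter (A-D) in the response to the key."""
    match = re.search(r"\b([A-D])\b", response)
    return bool(match) and match.group(1) == correct_choice.upper()


# Example usage: build boolean hallucination labels for a batch of responses,
# then use them to evaluate how well the UQ-based scorers separate
# hallucinated from correct answers.
# grades = [grade_math_response(r, a) for r, a in zip(responses, answers)]
```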
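For open-ended responses, one possible direction (just a stand-in sketch, not a recommendation from the demo) is to grade by semantic similarity between the generated response and the ideal response, e.g. with sentence-transformers embeddings and a threshold. The model name and threshold below are assumptions for illustration:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model and threshold -- tune both for your use case.
_embedder = SentenceTransformer("all-MiniLM-L6-v2")


def grade_open_ended(response: str, ideal_response: str, threshold: float = 0.8) -> bool:
    """Grade a free-form answer by cosine similarity to the ideal response."""
    embeddings = _embedder.encode([response, ideal_response], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return similarity >= threshold
```

A dedicated comparison model (e.g. an NLI or answer-equivalence model) would be more faithful to the 'ground truth vs. generated response' comparison described above; embedding similarity is simply the most self-contained stand-in for this sketch.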
I hope that helps! Please do let me know if you have additional questions.