Evaluating hallucination detection performance #27

Answered by dylanbouchard
Marnolean asked this question in Q&A

Hi 👋

Thank you for your question! The idea behind the 'grading' step is that we need to know whether the LLM actually hallucinated (by comparing LLM response vs. ground-truth/ideal response) if we want to evaluate the performance of the UQ-based scorers. This is something that would typically be done offline to see how well the different scorers detect hallucinations, which can inform the choice among various scorers.
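To make the offline evaluation concrete, here is a minimal sketch of how graded responses could be used to compare scorers. It assumes each scorer returns a confidence score where higher means "less likely to be a hallucination", and it uses AUROC as the comparison metric; that metric choice, the `auroc` helper, and the sample data are all illustrative assumptions, not something taken from the demos.

```python
def auroc(confidence_scores, is_correct):
    """Probability that a randomly chosen correct response gets a higher
    confidence score than a randomly chosen incorrect one (ties count 0.5)."""
    pos = [s for s, c in zip(confidence_scores, is_correct) if c]
    neg = [s for s, c in zip(confidence_scores, is_correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect response")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical offline data: grades come from comparing each LLM response
# against its ground-truth answer (True = correct, False = hallucination).
grades = [True, True, False, True, False]
scorer_a = [0.9, 0.8, 0.3, 0.7, 0.4]  # made-up confidence scores
scorer_b = [0.6, 0.4, 0.5, 0.7, 0.9]  # made-up confidence scores

for name, scores in [("scorer_a", scorer_a), ("scorer_b", scorer_b)]:
    print(name, round(auroc(scores, grades), 3))
```

A scorer whose AUROC is close to 1.0 separates correct responses from hallucinations well on this data, while a value near 0.5 means it does no better than chance.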

You are correct that the complexity of this grading task will vary based on question/answer type. For math questions in the demos, we simply postprocess the LLM response (extracting the integer part) and check if it matches the correct (integer) answer. Similar approaches ca…
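As a rough illustration of the math-question grading described above, the sketch below pulls an integer out of a free-text response and compares it with the correct answer. The function names are hypothetical. Taking the last number in the response and stripping thousands separators are assumptions made for this example; this is not the exact postprocessing used in the demos.

```python
import re


def extract_integer(response):
    """Return the integer part of the last number in the response, or None.

    Assumption: the final number in the text is the model's answer.
    """
    # Drop thousands separators such as the comma in "1,234".
    cleaned = re.sub(r"(?<=\d),(?=\d{3})", "", response)
    matches = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    if not matches:
        return None
    # int(float(...)) keeps only the integer part, e.g. "7.0" -> 7.
    return int(float(matches[-1]))


def grade_math_response(response, correct_answer):
    """True if the extracted integer matches the correct integer answer."""
    predicted = extract_integer(response)
    return predicted is not None and predicted == correct_answer


print(grade_math_response("The answer is 42.", 42))
print(grade_math_response("I'm not sure", 42))
```

Responses that contain no number at all are graded as incorrect here, which is one possible convention; for other question types the comparison step would need to be replaced with something suited to the answer format.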

Answer selected by Marnolean