Handling long answers #42
-
Hi, my question is this: long answers are varied, and that variety is where you can really tell how well a model detects hallucinations. For a long answer, there can be effectively an infinite number of correct responses and an infinite number of possible hallucinations for the same question. From what I've seen, your examples tend to use short answers. Can your system handle long answers, even with minor changes? Thanks
Replies: 2 comments
-
Hey @orannahum-qualifire! Thanks for your interest in UQLM. Long-form responses are something we're actively looking for robust techniques to incorporate into UQLM. We have issue #19 to consider incorporating Graph UQ as one potential path to better address this hallucination scenario. I don't believe the authors behind Graph UQ specifically considered the RAGTruth benchmark, but I will keep it in mind while considering an implementation. Please let me know if you have any other thoughts on this topic, perhaps other benchmarks or techniques that you find promising. Thanks again :)
-
👋 @orannahum-qualifire,
Thanks for creating this discussion! Building on what @dskarbrevik said above, we are actively working on two scorers specifically designed for long-form responses: #19 graph-based scorers proposed by Jiang et al., 2024, and #46 LUQ proposed by Zhang et al., 2024. The former decomposes responses into claims, while the latter averages across sentences. Thanks for sharing the RAGTruth benchmark. I took a look, and from what I understand, each row in this benchmark dataset contains a) a prompt containing a question and context, b) a generated response, and c) an indicator of whether the generated response contains a hallucination. To evaluate the effectiveness of UQLM scorers on this benchmark, there are a few options:
Please let us know if this helps or if you have any further questions. As @dskarbrevik mentioned, we are always looking for feedback on new metrics, benchmarks, etc. that we should look to include in UQLM. Cheers,
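As a rough sketch of what evaluating a scorer on RAGTruth-style rows could look like: a UQLM scorer yields a confidence score per response, and the benchmark's hallucination indicator gives a binary label, so AUROC over (score, label) pairs measures how often non-hallucinated responses outrank hallucinated ones. The column layout, scores, and the `auroc` helper below are illustrative assumptions, not part of UQLM's API.

```python
# Illustrative sketch: scoring a RAGTruth-style benchmark with AUROC.
# The rows and confidence scores below are toy/hypothetical values.

def auroc(scores, labels):
    """AUROC of confidence `scores` against binary `labels`
    (1 = response contains a hallucination).

    A good scorer assigns LOWER confidence to hallucinated responses,
    so we count how often a clean response's score beats a
    hallucinated one's (ties count as half)."""
    clean = [s for s, y in zip(scores, labels) if y == 0]
    hallu = [s for s, y in zip(scores, labels) if y == 1]
    if not clean or not hallu:
        raise ValueError("need at least one example of each class")
    wins = sum((c > h) + 0.5 * (c == h) for c in clean for h in hallu)
    return wins / (len(clean) * len(hallu))

# Toy rows mimicking the layout described above:
# (prompt, response, hallucination_label, hypothetical_scorer_confidence)
rows = [
    ("q1", "resp1", 0, 0.92),
    ("q2", "resp2", 1, 0.35),
    ("q3", "resp3", 0, 0.80),
    ("q4", "resp4", 1, 0.55),
]
labels = [r[2] for r in rows]
scores = [r[3] for r in rows]
print(f"AUROC: {auroc(scores, labels):.2f}")  # 1.00 on this toy data
```

An AUROC of 0.5 means the scorer is no better than chance at separating hallucinated from clean responses; 1.0 means perfect separation.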