Handling long answers #42
-
Hi, my question is this: long answers are varied, and that variety is where you can really tell how well a model detects hallucinations. For a long answer, there can be effectively an infinite number of correct responses and an infinite number of possible hallucinations for the same question. From what I've seen, your examples tend to use short answers. Can your system handle long answers, even with minor changes? Thanks
Replies: 2 comments
-
Hey @orannahum-qualifire! Thanks for your interest in UQLM. Long-form responses are something we're actively looking for robust techniques to incorporate into UQLM. We have issue #19 to consider incorporating Graph UQ as one potential path to better address this hallucination scenario. I don't believe the authors behind Graph UQ specifically considered the RAGTruth benchmark, but I will keep it in mind while considering an implementation. Please let me know if you have any other thoughts on this topic, perhaps other benchmarks or techniques that you find promising. Thanks again :)
-
👋 @orannahum-qualifire,
Thanks for creating this discussion! Building on what @dskarbrevik said above, we are actively working on two scorers specifically designed for long-form responses: #19 graph-based scorers proposed by Jiang et al., 2024, and #46 LUQ proposed by Zhang et al., 2024. The former decomposes responses into claims, while the latter averages across sentences. Thanks for sharing the RAGTruth benchmark. I took a look, and from what I understand, each row in this benchmark dataset contains a) a prompt containing a question and context, b) a generated response, and c) an indicator of whether the generated response contains a hallucination. To evaluate the effectiveness of UQLM scorers on this benchmark, there are a few options:
Please let us know if this helps or if you have any further questions. As @dskarbrevik mentioned, we are always looking for feedback on new metrics, benchmarks, etc. that we should look to include in UQLM. Cheers,
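As a rough sketch of what evaluating a scorer on RAGTruth-style rows could look like: a UQLM scorer yields a confidence score per response, and the benchmark's hallucination indicator gives a binary label, so AUROC over (score, label) pairs measures how often non-hallucinated responses outrank hallucinated ones. The column layout, scores, and the `auroc` helper below are illustrative assumptions, not part of UQLM's API.

```python
# Illustrative sketch: scoring a RAGTruth-style benchmark with AUROC.
# The rows and confidence scores below are toy/hypothetical values.

def auroc(scores, labels):
    """AUROC of confidence `scores` against binary `labels`
    (1 = response contains a hallucination).

    A good scorer assigns LOWER confidence to hallucinated responses,
    so we count how often a clean response's score beats a
    hallucinated one's (ties count as half)."""
    clean = [s for s, y in zip(scores, labels) if y == 0]
    hallu = [s for s, y in zip(scores, labels) if y == 1]
    if not clean or not hallu:
        raise ValueError("need at least one example of each class")
    wins = sum((c > h) + 0.5 * (c == h) for c in clean for h in hallu)
    return wins / (len(clean) * len(hallu))

# Toy rows mimicking the layout described above:
# (prompt, response, hallucination_label, hypothetical_scorer_confidence)
rows = [
    ("q1", "resp1", 0, 0.92),
    ("q2", "resp2", 1, 0.35),
    ("q3", "resp3", 0, 0.80),
    ("q4", "resp4", 1, 0.55),
]
labels = [r[2] for r in rows]
scores = [r[3] for r in rows]
print(f"AUROC: {auroc(scores, labels):.2f}")  # 1.00 on this toy data
```

An AUROC of 0.5 means the scorer is no better than chance at separating hallucinated from clean responses; 1.0 means perfect separation.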