Calibrating Large Language Models with Sample Consistency
CoRR (2024)
Abstract
Accurately gauging the confidence level of Large Language Models' (LLMs)
predictions is pivotal for their reliable application. However, LLMs are often
inherently uncalibrated, and their proprietary nature and massive scale put
them beyond the reach of conventional calibration techniques. In this work, we explore the
potential of deriving confidence from the distribution of multiple randomly
sampled model generations, via three measures of consistency. We perform an
extensive evaluation across various open-source and closed-source models on nine
reasoning datasets. Results show that consistency-based calibration methods
outperform existing post-hoc approaches. Meanwhile, we find that factors such
as intermediate explanations, model scaling, and larger sample sizes enhance
calibration, while instruction-tuning makes calibration more difficult.
Moreover, confidence scores obtained from consistency have the potential to
enhance model performance. Finally, we offer practical guidance on choosing
suitable consistency metrics for calibration, tailored to the characteristics
of various LMs.
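
As a rough illustration of how a confidence score might be derived from the distribution of sampled generations, the sketch below computes three consistency measures over a list of final answers. The abstract does not name its three measures, so the choices here (agreement with the majority answer, one minus normalized entropy, and the gap between the two most frequent answers) are assumptions, as are the function name `consistency_scores` and the normalization of entropy by the log of the number of distinct answers. It should be read as a minimal sketch, not as the authors' implementation.

```python
import math
from collections import Counter


def consistency_scores(answers):
    """Turn a list of sampled final answers into confidence scores in [0, 1].

    Hypothetical helper: the three measures below are illustrative
    assumptions, not definitions taken from the abstract.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")

    n = len(answers)
    counts = Counter(answers)
    ranked = counts.most_common()  # sorted by frequency, highest first
    top_answer, top_count = ranked[0]
    second_count = ranked[1][1] if len(ranked) > 1 else 0

    # Agreement: share of samples that match the majority answer.
    agreement = top_count / n

    # Entropy-based: 1 minus the entropy of the answer distribution,
    # normalized by log(number of distinct answers) so it stays in [0, 1].
    k = len(counts)
    if k == 1:
        entropy_conf = 1.0
    else:
        entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
        entropy_conf = 1.0 - entropy / math.log(k)

    # First-second distance: gap between the two most frequent answers.
    fsd = (top_count - second_count) / n

    return {
        "answer": top_answer,
        "agreement": agreement,
        "entropy": entropy_conf,
        "fsd": fsd,
    }


if __name__ == "__main__":
    # Ten made-up sampled answers to the same question.
    samples = ["42", "42", "42", "17", "42", "8", "17", "42", "42", "42"]
    print(consistency_scores(samples))
```

With the ten made-up samples above, seven agree on "42" and two on "17", so the agreement score is 0.7 and the first-second distance is 0.5. The entropy-based score uses the whole distribution rather than only the top one or two answers, so it also reacts to how the remaining samples are spread.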