Seminar: Machine Learning Seminar

Women in Electrical and Computer Engineering Community

LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

Date: January 19, 2026 Time: 14:00-15:00
Location: Room 506, Zisapel Building
Lecturer: Hadas Orgad
Large language models often produce errors—factual inaccuracies, biases, and reasoning failures known as “hallucinations”. We show that LLMs’ internal representations encode rich information about truthfulness, but in surprising ways. First, truthfulness information concentrates in specific tokens, allowing a dramatic improvement in error detection compared to using other token locations. However, these detectors don’t generalize across datasets, revealing that truthfulness encoding is multifaceted rather than universal. Second, internal representations can predict the types of errors a model will make, enabling targeted mitigation strategies. Finally, we uncover a striking discrepancy: LLMs sometimes internally encode correct answers while consistently generating incorrect ones. We’ll also discuss follow-up work building on these findings and their implications for developing more reliable language models.
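To make the first finding concrete, here is a minimal sketch (not the speaker's code) of the general probing technique it builds on: train a linear classifier on an LLM's hidden states at a chosen token position to predict whether a generated answer is correct. The model name ("gpt2"), layer index, probe position, and toy labeled examples are all illustrative placeholders, not details from the talk.

```python
# Minimal probing sketch: linear error detector over hidden states.
# All specifics (model, layer, token position, data) are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "gpt2"  # placeholder; the work discussed studies larger LLMs
LAYER = 6       # illustrative middle layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def hidden_at_last_token(text: str) -> torch.Tensor:
    """Hidden state of the final token at LAYER. The talk's point is that
    *which* token you probe matters a great deal; the last token is just
    one choice among many."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

# Toy labeled data: (prompt + model answer, is_correct). Hypothetical.
examples = [
    ("Q: What is the capital of France? A: Paris", 1),
    ("Q: What is the capital of France? A: Lyon", 0),
]
X = torch.stack([hidden_at_last_token(t) for t, _ in examples]).numpy()
y = [label for _, label in examples]

probe = LogisticRegression(max_iter=1000).fit(X, y)  # linear truthfulness probe
print(probe.predict_proba(X))  # per-example probability the answer is correct
```

The talk's results concern exactly this setup: which token the hidden state is taken from substantially changes how well such a probe detects errors, and probes trained this way do not transfer cleanly across datasets.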
Hadas is a Research Fellow at Harvard’s Kempner Institute, where she studies the internal mechanics of large AI models to improve their robustness, safety, and reliability. She focuses on problems where scaling compute and data falls short, such as hallucinations, harmful outputs, and biases, with the broader goal of developing controllable AI systems. She completed her PhD at the Technion under the supervision of Yonatan Belinkov.