Seminar: Machine Learning Seminar
Pragmatic LLM interpretability via learning on internal representations
Date:
July 1, 2026
Start Time:
11:30 - 12:30
Location:
506, Zisapel Building
Zoom:
Zoom link
Lecturer:
Guy Bar Shalom
As Large Language Models (LLMs) are used more widely in research, creative work, and decision-making, it becomes crucial to ensure that their outputs are safe, reliable, and accurate. Yet they often generate errors, such as factual inaccuracies, biases, and reasoning failures, collectively referred to as hallucinations. Previous work has shown that the internal computational traces of LLMs, namely their hidden states, carry useful signals about the truthfulness of model outputs and can be exploited to detect such errors. However, existing approaches largely ignore the structure and complexity of these traces and therefore fail to impose appropriate inductive biases when processing them for hallucination detection. This talk builds on three papers in which we characterize three types of data structures that appear in LLMs' computational traces and develop corresponding deep learning architectures capable of learning from each. These are (1) the complete output of the LLM's final layer, (2) a data structure that holds all intermediate activations across all layers and tokens, and (3) intermediate attention matrices. Our techniques outperform prior approaches, and our architectural design enables new error-detection capabilities, including generalization to new generation tasks and to new LLMs.
Ph.D. student under the supervision of Prof. Haggai Maron.

