Pragmatic LLM interpretability via learning on internal representations

Date: July 1, 2026 Time: 11:30 - 12:30

Location: 506, Zisapel Building

Zoom: Zoom link

Lecturer: Guy Bar Shalom

Research Areas:

Machine Learning and Intelligent Systems

As Large Language Models (LLMs) are increasingly used in research, creative work, and decision-making, it becomes crucial to ensure that their outputs are safe, reliable, and accurate. Yet they often generate errors (such as factual inaccuracies, biases, and reasoning failures), collectively referred to as hallucinations. Previous work has shown that the internal computational traces of LLMs, namely their hidden states, carry useful signals about the truthfulness of model outputs and can be exploited to detect such errors. However, existing approaches largely ignore the structure and complexity of these traces and therefore fail to impose appropriate inductive biases when processing them for hallucination detection. This talk builds on three papers in which we characterize three types of data structures that appear in LLMs’ computational traces and develop corresponding deep learning architectures capable of learning from each. These are (1) the complete output of the LLM’s final layer, (2) a data structure that holds all intermediate activations across all layers and tokens, and (3) intermediate attention matrices. Our techniques outperform prior approaches, and our architectural design enables new error-detection capabilities, including generalization to new generation tasks and to new LLMs.

Ph.D. student under the supervision of Prof. Haggai Maron.

Seminar: Machine Learning Seminar
