Seminar: Guest Lecture
LLM Post-Training and Reasoning via Efficient Value-Based RL.
Reinforcement learning (RL) has a newfound killer application in post-training LLMs: adapting models pre-trained to predict the next token to tasks like instruction following, math problem solving, and generating content or recommendations that maximize user outcomes. But are the same RL algorithms that animated robots and conquered Atari the right ones to post-train LLMs? In this talk I will present new value-based algorithms for post-training and for scaling test-time compute that leverage both the unique structure of autoregressive LLMs and recent advances in improving efficiency by changing the Q-learning loss function. I will show how (and argue why) these new algorithms achieve state-of-the-art performance on frontier math reasoning tasks with smaller models and at a fraction of the test-time FLOPs.
The talk covers these papers:
https://arxiv.org/abs/2505.17373 Value-Guided Search for Efficient Chain-of-Thought Reasoning (NeurIPS ’25)
https://arxiv.org/abs/2502.20548 Q#: Provably Optimal Distributional RL for LLM Post-Training (NeurIPS ’25)
https://arxiv.org/abs/2409.12799 The Central Role of the Loss Function in Reinforcement Learning (Statistical Science)
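To give a flavor of the first paper's theme, value-guided search can be thought of as a beam-style procedure in which a learned value model scores partial chains of thought and concentrates test-time compute on the most promising ones. The snippet below is only a minimal illustrative sketch with hypothetical stand-in functions (`sample_continuations`, `value`), not the authors' implementation or interfaces.

```python
# Illustrative sketch of value-guided search over chains of thought:
# sample several candidate continuations per step, score each partial
# solution with a value model, and keep only the highest-value candidates.
import random

random.seed(0)

def sample_continuations(prefix: str, k: int) -> list[str]:
    # Stand-in for sampling k short continuations from an autoregressive LLM.
    return [prefix + f" step{random.randint(0, 99)}" for _ in range(k)]

def value(partial_solution: str) -> float:
    # Stand-in for a value model estimating how likely this partial chain
    # of thought is to lead to a correct final answer.
    return random.random()

def value_guided_search(prompt: str, beam_width: int = 4,
                        samples_per_beam: int = 4, max_steps: int = 8) -> str:
    beams = [prompt]
    for _ in range(max_steps):
        candidates = [c for b in beams
                      for c in sample_continuations(b, samples_per_beam)]
        # Rank partial solutions by estimated value and keep the best few,
        # steering test-time compute toward promising reasoning paths.
        candidates.sort(key=value, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

print(value_guided_search("Solve: 12 * 7 = ?"))
```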
Bio:
Nathan Kallus is an associate professor of operations research and information engineering at Cornell Tech and Cornell Engineering. He also serves as Research Director for Machine Learning and Inference at Netflix.
Kallus's research interests include causal inference, especially when combined with machine learning; the statistics of optimization under uncertainty; sequential and dynamic decision making; and algorithmic fairness. He is the author of the book "Applied Causal Inference Powered by ML and AI."

