Seminar: ceClub: The Technion Computer Engineering Club
Efficient LLM Systems: From Algorithm Design to Deployment
Large Language Models (LLMs) have transformed what machines can do and how systems are designed to serve them. These models are demanding in both computation and memory, exposing the limits of traditional optimization methods that once sufficed for conventional systems. A central challenge in building LLM systems is improving system metrics, such as latency, while preserving response quality.
This talk presents approaches for reducing latency in LLM systems to support interactive applications, from scheduling algorithm design to deployment. It introduces scheduling frameworks that use lightweight predictions of request behavior to make informed decisions about prioritization and memory management across two core settings: standalone LLM inference and API-augmented LLMs that interact with external tools. Across both settings, prediction-guided scheduling delivers substantial latency reductions while remaining practical for deployment.
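To make the idea of prediction-guided prioritization concrete, here is a minimal sketch in Python of one plausible instantiation: a scheduler that orders requests by a lightweight prediction of their output length, approximating shortest-predicted-job-first. All names (`PredictionGuidedScheduler`, `predicted_tokens`) are illustrative assumptions, not the systems presented in the talk.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    # Predicted output length serves as the priority key; shorter
    # predictions are served first (smaller value = higher priority).
    predicted_tokens: int
    request_id: int = field(compare=False)

class PredictionGuidedScheduler:
    """Hypothetical sketch: prioritize requests by predicted length."""

    def __init__(self):
        self._queue = []  # min-heap keyed on predicted_tokens

    def submit(self, request_id, predicted_tokens):
        heapq.heappush(self._queue, Request(predicted_tokens, request_id))

    def next_request(self):
        # Serving the request predicted to finish soonest tends to
        # reduce mean latency versus FIFO when lengths are skewed.
        return heapq.heappop(self._queue).request_id if self._queue else None

sched = PredictionGuidedScheduler()
sched.submit(1, predicted_tokens=512)
sched.submit(2, predicted_tokens=32)
sched.submit(3, predicted_tokens=128)
print([sched.next_request() for _ in range(3)])  # → [2, 3, 1]
```

In a real serving system the priority key would come from a learned predictor and would interact with batching and KV-cache memory management; this sketch only illustrates the scheduling principle.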
Bio:
Rana Shahout is a Postdoctoral Fellow at Harvard University, working with Michael Mitzenmacher and Minlan Yu. She received her Ph.D. in Computer Science from the Technion and previously worked as a Senior Software Engineer at Mellanox (now NVIDIA). Her research combines machine learning, systems, and algorithmic theory to design efficient and scalable AI systems. Rana is a recipient of the Eric and Wendy Schmidt Postdoctoral Award, the Zuckerman Postdoctoral Fellowship, the Weizmann Institute Women's Postdoctoral Career Development Award, the VATAT Postdoctoral Fellowship, and first place in the ACC Feder Family Award for Best Student Work in Communications.

