Accelerating Distributed Training by Reducing Compute Time Variance

Date: April 10, 2024 Time: 16:00 - 17:00

Location: Zoom

Zoom: Zoom link

Lecturer: Shahar Gottlieb

Research Areas:

Machine learning and intelligent systems

Background. Distributed training is essential for large-scale training of deep neural networks (DNNs). The dominant methods for large-scale DNN training are synchronous (e.g., All-Reduce), but they require waiting for all workers at every step. These methods are therefore limited by the delays that straggling workers cause.

Results. We study a typical scenario in which workers straggle because of variability in compute time. We find an analytical relation between compute-time properties and the scalability limitations caused by such stragglers. Building on these findings, we propose a simple yet effective decentralized method that reduces the variation among workers and thus improves the robustness of synchronous training. The method can be integrated with the widely used All-Reduce. Our findings are validated on large-scale training tasks using 200 Gaudi accelerators.
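To make the straggler effect concrete, here is a minimal simulation sketch. It is not the analysis or the method presented in the talk. It assumes a hypothetical model in which each worker's compute time is an independent Gaussian draw, and the names `mean_step_time`, `mean`, and `std` are illustrative only. Because a synchronous step lasts as long as its slowest worker, the average step time grows with the number of workers and shrinks when the compute-time spread is reduced.

```python
import random


def mean_step_time(num_workers, num_steps=2000, mean=1.0, std=0.1, seed=0):
    """Average duration of a synchronous step under a toy compute-time model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_steps):
        # Assumption: per-worker compute times are i.i.d. Gaussian, clipped at zero.
        # A synchronous step ends only when the slowest worker has finished.
        total += max(max(0.0, rng.gauss(mean, std)) for _ in range(num_workers))
    return total / num_steps


if __name__ == "__main__":
    # More workers: the slowest of them is, on average, slower.
    for n in (1, 8, 64, 512):
        print(f"workers={n:4d}  mean step time={mean_step_time(n):.3f}")
    # Same worker count, smaller spread: shorter synchronous steps.
    for s in (0.2, 0.1, 0.05):
        print(f"std={s:.2f}  mean step time={mean_step_time(64, std=s):.3f}")
```

Under this toy model, lowering `std` at a fixed worker count shortens the average step, which is the intuition behind reducing compute-time variance.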

M.Sc. student under the supervision of Prof. Daniel Soudry.

Seminar: Graduate Seminar
