
Loss Landscape Dependent Self-Adjusting Learning Rates in Decentralized Stochastic Gradient Descent

NeurIPS 2022

Abstract
Distributed Deep Learning (DDL) is essential for large-scale Deep Learning (DL) training. Synchronous Stochastic Gradient Descent (SSGD) is the de facto DDL optimization method. Using a sufficiently large batch size is critical to achieving DDL runtime speedup. In a large batch setting, the learning rate must be increased to compensate for the reduced number of parameter updates. However, a large learning rate may harm convergence in SSGD and training could easily diverge. Recently, Decentralized Parallel SGD (DPSGD) has been proposed to improve distributed training speed. In this paper, we find that DPSGD not only has a system-wise runtime benefit but also a significant convergence benefit over SSGD in the large batch setting. Based on a detailed analysis of the DPSGD learning dynamics, we find that DPSGD introduces additional landscape-dependent noise that automatically adjusts the effective learning rate to improve convergence. In addition, we theoretically show that this noise smoothes the loss landscape, hence allowing a larger learning rate. We conduct extensive studies over 18 state-of-the-art DL models/tasks and demonstrate that DPSGD often converges in cases where SSGD diverges for large learning rates in the large batch setting. Our findings are consistent across two different application domains: Computer Vision (CIFAR10 and ImageNet-1K) and Automatic Speech Recognition (SWB300 and SWB2000), and two different types of neural network models: Convolutional Neural Networks and Long Short-Term Memory Recurrent Neural Networks.
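To make the SSGD/DPSGD contrast above concrete, the following is a minimal NumPy sketch of one update step of each method on a toy quadratic loss. It is an illustration under simplifying assumptions (ring topology, uniform mixing weights, Gaussian gradient noise, a hypothetical stochastic_grad helper), not the paper's implementation or experimental setup.

```python
# Minimal sketch (not the authors' code): one step of Decentralized Parallel SGD
# (DPSGD) on a ring topology, contrasted with a synchronous SGD (SSGD) step.
# The toy quadratic loss, worker count, and mixing weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, lr = 4, 10, 0.1

# Each worker holds its own copy of the parameters.
params = [rng.normal(size=dim) for _ in range(n_workers)]

def stochastic_grad(w):
    """Gradient of a toy quadratic loss plus sampling noise (stand-in for a minibatch)."""
    return w + 0.1 * rng.normal(size=w.shape)

def ssgd_step(params, lr):
    """SSGD: average gradients across all workers, then apply one global update."""
    g = np.mean([stochastic_grad(w) for w in params], axis=0)
    new_w = params[0] - lr * g          # all worker copies stay identical
    return [new_w.copy() for _ in params]

def dpsgd_step(params, lr):
    """DPSGD: each worker averages parameters with its ring neighbors,
    then applies its own local stochastic gradient."""
    n = len(params)
    mixed = [(params[i] + params[(i - 1) % n] + params[(i + 1) % n]) / 3.0
             for i in range(n)]
    return [mixed[i] - lr * stochastic_grad(params[i]) for i in range(n)]

for _ in range(100):
    params = dpsgd_step(params, lr)

# Worker copies differ slightly after decentralized updates.
consensus = np.mean(params, axis=0)
print("mean distance to consensus:",
      np.mean([np.linalg.norm(w - consensus) for w in params]))
```

In the DPSGD step each worker only averages with its neighbors, so the copies never agree exactly; that per-worker disagreement is the extra, landscape-dependent noise whose self-adjusting effect on the effective learning rate the paper analyzes.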
Key words
Decentralized Training, Loss Landscape Dependent Noise, Self-Adjusting Learning Rate, Learning Dynamics

Highlights: This paper presents a loss-landscape-dependent, self-adjusting learning rate mechanism in decentralized stochastic gradient descent that effectively improves the convergence of large-scale distributed deep learning.

Method: By analyzing the learning dynamics of Decentralized Parallel SGD (DPSGD), the authors find that the additional landscape-dependent noise it introduces automatically adjusts the effective learning rate, thereby improving convergence.

Experiments: Extensive studies on 18 state-of-the-art deep learning models/tasks, using the CIFAR10, ImageNet-1K, SWB300, and SWB2000 datasets, show that in the large-batch setting DPSGD still converges at large learning rates where SSGD diverges.