Core Concept
The paper proposes a new logarithmic step size for stochastic gradient descent (SGD) with warm restarts and shows that it achieves a convergence rate of O(1/√T) for smooth non-convex functions, matching the best-known rate for this class.
Summary
The paper introduces a new logarithmic step size for the stochastic gradient descent (SGD) algorithm with warm restarts. The key highlights are:
- The new logarithmic step size exhibits slower convergence to zero compared to many existing step sizes, yet converges faster than the cosine step size. This leads to a higher probability of selecting points from the final iterations compared to the cosine step size.
- For the new logarithmic step size, the authors establish a convergence rate of O(1/√T) for smooth non-convex functions, which matches the best-known convergence rate for such functions.
- Extensive experiments are conducted on the FashionMNIST, CIFAR10, and CIFAR100 datasets, comparing the new logarithmic step size with 9 other popular step size methods. The results demonstrate the effectiveness of the new step size, particularly on the CIFAR100 dataset where it achieves a 0.9% improvement in test accuracy over the cosine step size when using a convolutional neural network model.
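The claim about selection probabilities can be checked numerically. The sketch below is illustrative only: this summary does not state the step-size formulas, so it assumes the logarithmic step size has the form η_t = η_0 (1 − ln t / ln T) and the cosine step size the common form η_t = (η_0 / 2)(1 + cos(πt / T)). The function names and the values η_0 = 0.1 and T = 1000 are hypothetical choices made for the example.

```python
import math


def log_step_sizes(eta0, T):
    # ASSUMED form (not given in this summary): eta_t = eta0 * (1 - ln t / ln T)
    return [eta0 * (1.0 - math.log(t) / math.log(T)) for t in range(1, T + 1)]


def cosine_step_sizes(eta0, T):
    # ASSUMED form: the common cosine schedule, eta_t = eta0/2 * (1 + cos(pi*t/T))
    return [0.5 * eta0 * (1.0 + math.cos(math.pi * t / T)) for t in range(1, T + 1)]


def selection_probabilities(step_sizes):
    # Probability of returning iterate t as the output: eta_t / sum_t eta_t
    total = sum(step_sizes)
    return [s / total for s in step_sizes]


def tail_mass(probs, fraction):
    # Total probability assigned to the last `fraction` of the iterations
    k = max(1, int(len(probs) * fraction))
    return sum(probs[-k:])


T = 1000      # hypothetical number of iterations in one cycle
eta0 = 0.1    # hypothetical initial step size

log_probs = selection_probabilities(log_step_sizes(eta0, T))
cos_probs = selection_probabilities(cosine_step_sizes(eta0, T))

log_tail = tail_mass(log_probs, 0.1)
cos_tail = tail_mass(cos_probs, 0.1)

print(f"mass on last 10% of iterations, logarithmic: {log_tail:.5f}")
print(f"mass on last 10% of iterations, cosine:      {cos_tail:.5f}")
```

Under these assumptions, the logarithmic schedule places more probability on the last 10% of iterations than the cosine schedule does. With warm restarts, t would presumably index the iterations within the current cycle and T the cycle length, so the same functions could be reapplied per cycle.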
Statistics
The one headline figure reported here is a 0.9% improvement in test accuracy over the cosine step size on CIFAR100 with a convolutional neural network model. The remaining results are presented in the paper as figures and tables comparing the proposed method with other step size techniques.
Quotes
"The new proposed step size offers a significant advantage over the cosine step size Li et al. [2021] in terms of its probability distribution, denoted as $\eta_t / \sum_{t=1}^{T} \eta_t$ in Theorem 3.1. This distribution plays a crucial role in determining the likelihood of selecting a specific output during the iterations."
"For the new step size, we establish the convergence results of the SGD algorithm. By considering that $c \propto O(\sqrt{T} / \ln T)$, which leads to the initial value of the step size is greater than the initial value of the step length mentioned in Li et al. [2021], we demonstrate a convergence rate of $O(1/\sqrt{T})$ for a smooth non-convex function."