Non-Convergence and Convergence Rates of SGD methods in Deep Learning.
Speaker: Đỗ Minh Thắng (Institute of Mathematics)

Time: 2:00 PM, Thursday, October 2, 2025

Venue: Room 507, Building A6, Institute of Mathematics

Abstract: Stochastic gradient descent (SGD) methods and adaptive optimization methods such as Adam are nowadays key tools in the training of deep neural networks (DNNs). Despite the great success of these methods, it remains a fundamental open research problem to explain their success and limitations in rigorous theoretical terms. In this work we show, for a general class of activation functions, loss functions, random initializations, and SGD-type optimization methods (including standard SGD, momentum SGD, Nesterov accelerated SGD, Adagrad, RMSProp, Adadelta, Adam, Adamax, Nadam, Nadamax, and AMSGrad), that the considered optimizer does not converge with high probability to global minimizers of the objective function, nor does the true risk converge in probability to the optimal true risk value. Even stronger, we prove that the probability of not converging to a global minimizer converges to one at least exponentially fast as the width and depth of the DNN increase. Nonetheless, the risk may still converge, albeit to a strictly suboptimal value. In a further main result we establish convergence rates for Adam for strongly convex stochastic optimization problems and illustrate the Adam symmetry theorem, which shows convergence if and only if the underlying random variables are symmetrically distributed.
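
For readers unfamiliar with the optimizer discussed in the second part of the talk, the following is a minimal sketch of the standard Adam update (first/second moment estimates with bias correction), applied to a toy strongly convex stochastic problem. The toy objective, hyperparameter values, and function names are illustrative assumptions and do not reproduce the speaker's specific analysis or results.

    # Illustrative sketch of the standard Adam update; the toy quadratic
    # objective and hyperparameters below are assumptions for illustration only.
    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam step: exponential moving averages of the gradient and its square,
        bias-corrected, then a per-coordinate scaled update."""
        m = beta1 * m + (1 - beta1) * grad        # first moment estimate (momentum)
        v = beta2 * v + (1 - beta2) * grad**2     # second moment estimate (scaling)
        m_hat = m / (1 - beta1**t)                # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

    # Toy strongly convex problem: minimize E[(theta - Z)^2] from noisy gradients,
    # with Z symmetrically distributed around its mean.
    rng = np.random.default_rng(0)
    theta = np.array([5.0])
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, 2001):
        z = rng.normal(loc=1.0, scale=0.5, size=theta.shape)  # symmetric noise
        grad = 2 * (theta - z)                                # stochastic gradient of (theta - z)^2
        theta, m, v = adam_step(theta, grad, m, v, t)
    print(theta)  # approaches 1.0, the minimizer of the expected loss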
