AdaGrad has a Learning Rate

AdaGrad (PDF) is an interesting algorithm for adaptive element-wise learning rates. First consider a stochastic gradient descent (SGD) update

$$\theta^{(t)} = \theta^{(t-1)} + \gamma_t g_t,$$

where $\gamma_t$ is a step size (learning rate) and $g_t = \nabla f_t(\theta^{(t-1)})$ is the gradient of a noisy objective function. The AdaGrad algorithm replaces the step sizes $\gamma_t$ with a matrix inverse that scales the elements of the gradient:

$$\theta^{(t)} = \theta^{(t-1)} + \text{diag}(G_t)^{-\frac{1}{2}} g_t,$$

where $G_t = \sum_{i=1}^t g_i g_i^T$. In this form, it is suggested that AdaGrad does not have a learning rate. However, with very basic algebra, the AdaGrad update is

$$\theta^{(t)} = \theta^{(t-1)} + \frac{1}{\sqrt{t}}\,\text{diag}(G_t^*)^{-\frac{1}{2}} g_t,$$

where $G_t^* = t^{-1} G_t$. That is, if the matrix term producing the adaptive step sizes is considered as a mean instead of a sum, we see that AdaGrad has a learning rate $\gamma_t = 1/\sqrt{t}$.
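To spell out the algebra: $G_t = t\, G_t^*$, and because the scaling is a diagonal matrix, the factor of $t$ comes straight out of the inverse square root,

$$\text{diag}(G_t)^{-\frac{1}{2}} = \big(t\,\text{diag}(G_t^*)\big)^{-\frac{1}{2}} = \frac{1}{\sqrt{t}}\,\text{diag}(G_t^*)^{-\frac{1}{2}}.$$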

OnlineStats

In OnlineStats.jl, different weighting schemes can be plugged into any statistic or model (see docs on Weights). OnlineStats implements AdaGrad in the mean form, which opens up a variety of learning rates to try rather than just the default $\gamma_t = \frac{1}{\sqrt{t}}$. The plot below compares three different choices of learning rate. It's interesting to note that the blue line, which uses the above default, is the slowest at finding the optimum loss value.
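To get a feel for why the choice matters, here is a minimal, self-contained Julia sketch (not OnlineStats code) of AdaGrad written in the mean form with a pluggable learning rate $\gamma_t$, run on a toy least-squares stream. The data, loss, and the three rates compared are made up for illustration and aren't the ones behind the plot.

```julia
# Illustrative sketch only -- see OnlineStats.jl for a real implementation.
using LinearAlgebra, Random

# AdaGrad in "mean form": keep a running mean of squared gradients and
# multiply the scaled gradient by a pluggable learning rate γ(t).
function adagrad_mean_form(γ; p = 5, T = 10_000, rng = MersenneTwister(1))
    β = collect(1.0:p)          # true coefficients for the toy problem
    θ = zeros(p)                # parameter estimate
    Gbar = zeros(p)             # diag(G_t*): running mean of g .^ 2
    for t in 1:T
        x = randn(rng, p)
        y = dot(x, β) + randn(rng)
        g = (dot(x, θ) - y) .* x                  # gradient of 0.5 * (x'θ - y)^2
        Gbar .+= (g .^ 2 .- Gbar) ./ t            # online mean of squared gradients
        θ .-= γ(t) .* g ./ (sqrt.(Gbar) .+ 1e-8)  # descent step (gradient subtracted to minimize)
    end
    θ
end

# Compare three learning rates, including the default γ_t = 1/√t.
rates = [("1/sqrt(t)",    (t -> 1 / sqrt(t))),
         ("1/t^0.6",      (t -> 1 / t^0.6)),
         ("constant 0.1", (t -> 0.1))]

for (name, γ) in rates
    println(rpad(name, 14), "=> ", round.(adagrad_mean_form(γ), digits = 2))
end
```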

To try out OnlineStats.ADAGRAD, see StatLearn.
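If you want a quick starting point, something like the sketch below is roughly what I'd expect it to look like. Passing `ADAGRAD()` to `StatLearn` comes from the docs, but the exact constructor arguments here (in particular the `rate` keyword taking a `LearningRate`) are an assumption, so check the `StatLearn` docstring for the version you have.

```julia
# Hypothetical usage sketch -- the exact StatLearn arguments are assumed;
# check the StatLearn docstring in your OnlineStats version.
using OnlineStats

x = randn(10_000, 5)
y = x * (1:5) + randn(10_000)

# Assumed API: ADAGRAD() picks the algorithm and LearningRate(0.6) swaps the
# default γ_t = 1/√t for γ_t = 1/t^0.6.
o = StatLearn(ADAGRAD(); rate = LearningRate(0.6))
fit!(o, zip(eachrow(x), y))
coef(o)
```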
