where . That is, if the matrix term producing the adaptive step sizes is considered as a mean instead of a sum, we see that AdaGrad has a learning rate .
In OnlineStats.jl, different weighting schemes can be plugged into any statistic or model (see docs on Weights). OnlineStats implements AdaGrad in the mean form, which opens up a variety of learning rates to try rather than just the default . The plot below compares three different choices of learning rate. It's interesting to note that the blue line, which uses the above default, is the slowest at finding the optimum loss value.
To try out OnlineStats.ADAGRAD
, see StatLearn