Optimizer Momentum is Not a Ball (Part I)
If you’re like me and you learned about adaptive optimizers and momentum as practical tricks for training deep networks before you ever looked into optimization theory, you probably learned to accept momentum based on the ball analogy.
As the story goes, gradient descent is supposed to act like a ball rolling down to the bottom of a valley. You look at the slope at each point, and use that slope to know which direction to go in order to reach the bottom. When you use momentum, speed builds up gradually as the optimizer goes “downhill,” so if a small local minimum appears in the way, the ball will skip right over it.
It’s a pretty picture, and comfortingly simple, but that truthiness is hiding some much more interesting behaviors, and local minima have nothing to do with anything.
This post has been drafted for a while, ever since I got annoyed about the ball analogy one too many times.
What Momentum Is
Let’s start by making sure we’re on the same page about what momentum actually does on the micro level. Don’t dig too much into the math, because I’m not going to motivate this yet—that’s what the rest of the article is for—and this section is totally skippable if you already roughly know what the equation looks like.
Normally, when we update a parameter $\theta$ with plain gradient descent, we take a step against the gradient of the loss, scaled by a learning rate $\eta$:

$$\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t)$$
Gradient descent with momentum works the same way, except that we add a new state variable $v$, the “velocity,” which accumulates the gradients over time and is what we actually step along:

$$v_{t+1} = \beta v_t + \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta \, v_{t+1}$$

Here $\beta \in [0, 1)$ controls how quickly old gradients are forgotten.
The name “momentum” comes from how much this equation resembles the movement of an object with inertia.
I couldn’t even resist describing it as “velocity” earlier, it’s that similar.
It’s not exactly Newton’s First Law like you’re used to from physics class, though: besides integrating the force, the velocity is also decaying, since we multiply it by $\beta < 1$ at every step.[1]
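If it helps to see those update rules as code, here is a minimal sketch of the loop (the names `lr`, `beta`, and `grad_fn` are just placeholder choices of mine, not anything standard):

```python
import numpy as np

def sgd_momentum(theta, grad_fn, lr=0.01, beta=0.9, steps=100):
    """Gradient descent with Polyak-style momentum, as written above."""
    v = np.zeros_like(theta)      # the new state variable: the velocity
    for _ in range(steps):
        g = grad_fn(theta)        # gradient of the loss at the current point
        v = beta * v + g          # decay the old velocity, add the new "force"
        theta = theta - lr * v    # step along the smoothed direction
    return theta
```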
What Momentum Is Doing
The analogy started to fall apart for me when I began actually imagining loss landscapes in detail. First of all, trying to visualize the loss made me finally confront the fact that there’s not just one. Literally every batch has its own loss function. That’s what makes it SGD (stochastic gradient descent) instead of just gradient descent.
Each batch is a sample from the training set, so the loss on one batch is just an approximation of the true loss.[2] Instead of continually optimizing a single loss function the way we’re taught to imagine it, we alternate between optimizing many different approximations of the loss, and over time that converges as if we had optimized the loss function on the entire training set.
If you have a background in signal processing or data science, you’ll probably recognize the momentum equation above as an exponential moving average. It replaces the raw values of the gradient (which are noisy in SGD because of the changing loss landscape) with a smoothed version that might be a better approximation of the true underlying loss function.
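To make the moving-average reading concrete, here is a tiny check (my own illustration, not anything from the original derivation) that unrolling the recursion $v \leftarrow \beta v + g$ gives an exponentially weighted sum of the past gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(size=20)   # stand-in for noisy per-batch gradients
beta = 0.9

v = 0.0
for g in grads:
    v = beta * v + g          # the momentum recursion

# Unrolled form: v_T = sum over k of beta**k * g_{T-k}
unrolled = sum(beta**k * g for k, g in enumerate(reversed(grads)))
print(np.isclose(v, unrolled))   # True
```

Strictly speaking it is an unnormalized average (the weights sum to roughly $1/(1-\beta)$ rather than to 1), but the smoothing effect is the same.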
Until recently I thought that was the entire point of momentum, but then I discovered that it was invented for convex optimization problems, where the gradients are never noisy.[4]
What Momentum Was Originally For
The Soviet mathematician Boris Polyak proposed momentum in the 60s[3] to speed up convergence in convex optimization.
Computers were so new that optimizing a nontrivial convex function was already an achievement, and in a lot of cases the cost of each iteration meant that convergence speed was the main constraint.
He described the method intuitively in terms of a “heavy ball” which could speed up convergence by accelerating in regions where the gradient is consistent, and proved that for certain problems, it could converge twice as fast, with an appropriately tuned momentum coefficient.
That “heavy ball” metaphor makes intuitive sense in flat gradients, but in practice it’s most useful when the function has a trough shape—when there is a steep gradient in one direction, but the way towards the real minimum is not very steep. In that case, regular gradient descent tends to oscillate back and forth along the direction where the gradient is strong, while making slow progress in the correct direction. With momentum, the conflicting gradients from the two steep walls average out to zero, so the optimizer can find the correct direction.
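Here is a quick numerical illustration of that trough effect on a deliberately ill-conditioned quadratic (the function and all the constants are made up for this sketch, not taken from anywhere):

```python
import numpy as np

# f(x, y) = 0.5 * (100 * x**2 + y**2): steep walls along x, shallow slope along y.
def grad(p):
    x, y = p
    return np.array([100.0 * x, 1.0 * y])

def run(lr, beta, steps=100):
    p = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        v = beta * v + grad(p)
        p = p - lr * v
    return p

print("plain GD:     ", run(lr=0.015, beta=0.0))  # bounces across x, crawls along y
print("with momentum:", run(lr=0.015, beta=0.9))  # ends up much closer to (0, 0)
```

With the same learning rate, plain gradient descent is still a visible distance from the minimum along the shallow $y$ direction after 100 steps, while the momentum run has essentially arrived.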
If you have a background in numerical methods, there’s a neat explanation for this! Momentum is functioning as an approximate second-order method, meaning it uses previous evaluations of the gradient as a proxy for the local curvature of the function being optimized. This is similar to multistep integration methods like Adams-Bashforth, where you carry forward information from previous steps to make better predictions about future behavior.
That’s All For Now
So far, we’ve seen that momentum was originally supposed to accelerate convergence in clean, convex problems. It didn’t have anything to do with local minima—and even in modern nonconvex problems, it still doesn’t.
The geometric intuition is real, but it’s completely different from a ball bouncing down a hill. In the next post, Part II of this series, we’ll look at how momentum actually accomplishes this, with some real examples and graphs.
That $\beta$ term acts like viscous damping, as if the object is moving through a fluid. We could make that more precise, but that’s not what this article is about. ↩︎
This is true the same way the loss on the training set is only an approximation of the loss on the underlying distribution. We like to pretend that means we’ve optimized over the underlying distribution of the training data, but that’s only true in the limit as the training set grows. ↩︎
Polyak, B. “Some methods of speeding up the convergence of iteration methods.” USSR Computational Mathematics and Mathematical Physics, 1964. ↩︎
Though SGD did already exist as an approach to optimizing based on noisy observations, like adaptive controllers. ↩︎