I have two posts remaining in my series on “Optimization Methods for Large-Scale Machine Learning” by Bottou, Curtis, and Nocedal. You can find the entire series here. These last two posts will discuss improvements on the base stochastic gradient method. Below I have reproduced Figure 3.3, which suggests two general approaches. I will cover noise reduction in this post, and second-order methods in the next.
The left-to-right direction on the diagram signifies noise reduction techniques. We say that the SG search direction is “noisy” because it includes information from only one (randomly chosen) sample per iteration. We use a noisy direction, of course, because it’s too expensive to compute the full gradient at every iteration. But we can consider using a small batch of samples per iteration (a “minibatch”), or using information from previous iterations. The idea here is to find a happy medium between the far left of the diagram, which represents one sample per iteration, and the far right, which represents using the full gradient.
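To make the notion of a “noisy” direction concrete, here is a minimal sketch that measures how far a minibatch gradient estimate lands from the full gradient as the batch size grows. The synthetic least-squares problem, the helper names, and the batch sizes are all my own assumptions for illustration; none of them come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem (an assumption for illustration):
# f(w) = (1 / 2n) * sum_i (x_i . w - y_i)^2
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def full_gradient(w):
    """Gradient of the objective using all n samples."""
    return X.T @ (X @ w - y) / n

def minibatch_gradient(w, batch_size):
    """Gradient estimate from samples drawn uniformly at random, with replacement."""
    idx = rng.integers(0, n, size=batch_size)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

def estimate_noise(w, batch_size, trials=2000):
    """Average squared distance between the estimate and the full gradient."""
    g = full_gradient(w)
    errs = [np.sum((minibatch_gradient(w, batch_size) - g) ** 2)
            for _ in range(trials)]
    return float(np.mean(errs))

w0 = np.zeros(d)
# batch_size=1 corresponds to the far left of the diagram (plain SG).
noise_by_batch = {b: estimate_noise(w0, b) for b in (1, 10, 100)}
for b, err in noise_by_batch.items():
    print(f"batch size {b:4d}: mean squared error of estimate = {err:.4f}")
```

Because the samples are drawn independently, the expected squared error shrinks in proportion to one over the batch size, while the cost of each estimate grows in proportion to the batch size. That is the tradeoff the horizontal axis of the diagram is describing.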
Section 5 describes several noise reduction techniques. Dynamic sample size methods vary the number of samples in a minibatch per iteration, for example by increasing the batch size geometrically with the iteration count. Gradient aggregation, as the name suggests, involves the use of gradient information from past iterations. The SVRG method works in cycles: each cycle starts by computing a full batch gradient, and the subsequent iterations in that cycle use it to correct a stochastic gradient evaluated at a single sample. The SAGA method involves “taking the average of stochastic gradients evaluated at previous iterates”. Finally, iterate averaging methods run the usual SG iteration but maintain an average of the iterates it produces, and use that average, rather than the latest iterate, as the solution estimate.
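As a rough illustration of gradient aggregation, here is a sketch of an SVRG-style loop on a small least-squares problem. The function name `svrg`, the test problem, the step size, and the choice to use the last inner iterate as the next snapshot are my assumptions; the paper lists several options for that last choice, and this is not meant as a faithful copy of its algorithm.

```python
import numpy as np

def svrg(grad_i, n, w0, step, epochs, inner_steps, rng):
    """SVRG-style sketch: a full gradient per cycle, then corrected single-sample steps."""
    w_snap = np.array(w0, dtype=float)
    for _ in range(epochs):
        # Full batch gradient at the snapshot point, computed once per cycle.
        full_grad = np.mean([grad_i(w_snap, i) for i in range(n)], axis=0)
        w = w_snap.copy()
        for _ in range(inner_steps):
            i = rng.integers(n)
            # Single-sample gradient, corrected by the same sample's gradient
            # at the snapshot and by the stored full gradient.
            g = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w = w - step * g
        # Assumption: take the last inner iterate as the next snapshot.
        w_snap = w
    return w_snap

# Hypothetical test problem: f(w) = (1/n) * sum_i 0.5 * (x_i . w - y_i)^2
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grad_i(w, i):
    """Gradient of the i-th term of the objective."""
    return (X[i] @ w - y[i]) * X[i]

w_svrg = svrg(grad_i, n, np.zeros(d), step=0.01, epochs=20,
              inner_steps=2 * n, rng=rng)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]  # exact minimizer for comparison
print("distance to minimizer:", np.linalg.norm(w_svrg - w_star))
```

The correction term has zero mean, so the search direction is still an unbiased estimate of the gradient, but its variance shrinks as both the iterate and the snapshot approach the minimizer. That is why a fixed step size can be used here, whereas plain SG would need a diminishing one to converge to the same accuracy.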
The motivations behind these various noise reduction methods are more or less the same: make more progress on a single step without paying too much of a computational cost. The tradeoffs are a higher computational cost per iteration and, for some methods, the extra storage needed to keep the state used to compute the search direction. Section 5 of the paper discusses these tradeoffs in light of convergence criteria.
[Updated 8/24/2016] Going back to our diagram, the vertical dimension of the diagram represents so-called “second-order methods”. Gradient-based methods, including SG, are first-order methods because they use a first-order (linear) approximation to the objective function we want to optimize. Second-order methods attempt to look at the curvature of the objective function to obtain better search directions. Once again, there is a tradeoff: using the curvature is more work per iteration, but we hope that by computing better search directions we’ll need far fewer iterations to get a good solution. I had originally intended to cover Section 6 of the paper, which describes several such methods in detail, but I will leave the interested reader to dig through that section themselves!
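For a small taste of why curvature helps, here is a toy comparison on a deterministic, badly scaled quadratic. This is my own illustration, not an example from the paper, and it uses exact gradients and an exact Hessian rather than stochastic estimates.

```python
import numpy as np

# Toy objective (an assumption for illustration):
# f(w) = 0.5 * w^T A w - b^T w, with very different curvature in each coordinate.
A = np.diag([1.0, 100.0])
b = np.array([1.0, 1.0])
w_star = np.linalg.solve(A, b)  # exact minimizer

def grad(w):
    return A @ w - b

def hessian(w):
    return A  # constant for a quadratic

w0 = np.zeros(2)

# Second-order: a single Newton step rescales the gradient by the curvature.
w_newton = w0 - np.linalg.solve(hessian(w0), grad(w0))

# First-order: gradient descent with step 1/L, where L = 100 is the largest curvature.
w_gd = w0.copy()
for _ in range(100):
    w_gd = w_gd - (1.0 / 100.0) * grad(w_gd)

print("Newton error after 1 step:   ", np.linalg.norm(w_newton - w_star))
print("Gradient error after 100 steps:", np.linalg.norm(w_gd - w_star))
```

On a quadratic, one Newton step lands on the minimizer, while gradient descent is still far away in the low-curvature coordinate after 100 steps. The catch is the linear solve with the Hessian, which is the extra work per iteration that second-order methods try to approximate cheaply at large scale.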