I am continuing my series on “Optimization Methods for Large-Scale Machine Learning” by Bottou, Curtis, and Nocedal. You can find the entire series here.
Last time we discussed stochastic and batch methods for optimization in machine learning. In both cases we’re trying to optimize a loss function that will give us good learning parameters for a deep learning model. We do this iteratively. A pure stochastic gradient (SG) approach picks one sample per iteration to define a search direction, whereas a batch method will pick multiple samples.
In this post I want to cover most of Section 4, which concerns the convergence and complexity of SG methods. For completeness, I have reproduced the authors’ summary of a general-purpose SG algorithm below:
Choose an initial iterate w_1
for k = 1, 2, … do
    generate a random number s_k
    compute a stochastic vector g(w_k, s_k)
    choose a stepsize a_k > 0
    w_{k+1} <- w_k - a_k g(w_k, s_k)
end for
In this algorithm, g is the search direction. In this series, we’ve already discussed three different choices for g:
- g is the full gradient, computed over all samples. This is conventional gradient descent. In this case the random number is ignored.
- g is the gradient for a single sample. This is conventional stochastic gradient.
- g is the gradient over a subset of the samples. This is “mini-batch” SG.
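To make the three choices concrete, here is a minimal Python sketch of the loop above. Everything in it is an assumption made for illustration and is not taken from the paper: the 1-D least-squares problem, the fixed stepsize, and the helper names `make_data`, `sample_gradient`, `g`, and `run_sg`. The only thing that changes between the three variants is how many samples feed into g.

```python
import random

def make_data(n, seed=0):
    """Synthetic 1-D least-squares data: y = 3*x + noise (illustrative only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys

def sample_gradient(w, x, y):
    """Gradient of the per-sample loss 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

def g(w, indices, xs, ys):
    """Search direction: average per-sample gradient over the chosen indices."""
    return sum(sample_gradient(w, xs[i], ys[i]) for i in indices) / len(indices)

def run_sg(xs, ys, batch_size, steps=2000, stepsize=0.1, seed=1):
    """Generic SG loop; batch_size selects which of the three variants runs."""
    rng = random.Random(seed)
    n = len(xs)
    w = 0.0  # initial iterate w_1
    for _ in range(steps):
        if batch_size >= n:
            indices = range(n)  # full gradient: the random draw is ignored
        else:
            indices = rng.sample(range(n), batch_size)  # s_k picks the samples
        w = w - stepsize * g(w, indices, xs, ys)  # w_{k+1} <- w_k - a_k g(w_k, s_k)
    return w

xs, ys = make_data(200)
w_full = run_sg(xs, ys, batch_size=200)  # conventional gradient descent
w_single = run_sg(xs, ys, batch_size=1)  # one sample per iteration
w_mini = run_sg(xs, ys, batch_size=16)   # mini-batch SG
print(w_full, w_single, w_mini)
```

All three runs should land near the true slope of 3, with the single-sample run showing the most jitter because the stepsize is held fixed in this sketch.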
As the authors show, it is hard to make a definitive comparison between mini-batch and pure SG. The gist is that pure SG has better convergence properties in theory, but mini-batch methods can provide certain practical advantages. In the authors’ words: “one can, however, realize benefits of mini-batching in practice since it offers important opportunities for software optimization and parallelization; e.g., using sizeable mini-batches is often the only way to fully leverage a GPU processor.” The rest of Section 4 spells this out in more detail through convergence results and complexity analysis.
As a practitioner, I honestly don’t place huge weight on the convergence theorems presented in Section 4. The key result for me is Theorem 4.9, which covers a nonconvex objective (which we have in most deep learning scenarios). It states that the SG method converges, in the sense that the expected gradient norm cannot stay bounded away from zero, when the stepsizes diminish according to a fairly general schedule rather than one specific formula.
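As I understand it, the diminishing-stepsize requirement is the classical Robbins-Monro pair of conditions: the stepsizes must sum to infinity while their squares sum to a finite value. The sketch below checks this numerically for one schedule, a_k = beta / (gamma + k). The schedule, the constants `beta` and `gamma`, and the helper names are my own choices for illustration, not something the paper prescribes.

```python
def stepsize(k, beta=1.0, gamma=1.0):
    """Hypothetical diminishing schedule a_k = beta / (gamma + k)."""
    return beta / (gamma + k)

def partial_sums(num_steps):
    """Return (sum of a_k, sum of a_k**2) for k = 1..num_steps."""
    total = 0.0
    total_sq = 0.0
    for k in range(1, num_steps + 1):
        a_k = stepsize(k)
        total += a_k
        total_sq += a_k * a_k
    return total, total_sq

sum_1k, sq_1k = partial_sums(1_000)
sum_1m, sq_1m = partial_sums(1_000_000)

# The plain sum keeps growing (roughly like log k) ...
print("sum of stepsizes:", sum_1k, sum_1m)
# ... while the sum of squares levels off.
print("sum of squared stepsizes:", sq_1k, sq_1m)
```

Intuitively, the first condition lets the iterates travel as far as they need to, and the second makes the accumulated gradient noise stay bounded.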
More interesting (for me) is Section 4.4, which discusses the overall work complexity for applying SG to deep learning scenarios. In many real-world big data scenarios, we’re optimizing a loss function using previously trained model parameters under a particular computational time limit. This is different from many traditional optimization scenarios, where we let the code run until we achieve a particular solution accuracy. In our situation, the total expected error in the model is the sum of three components: the expected risk at the optimal parameters (the approximation error), the estimation error that comes from minimizing the empirical risk instead of the expected risk, and the optimization error that comes from stopping at an approximate minimizer. Minimizing this total involves tradeoffs. For example, “if one decides to make the optimization more accurate … one might need to make up for the additional computing time by: (i) reducing the sample size n, potentially increasing the estimation error; or (ii) simplifying the function family H, potentially increasing the approximation error.” These are familiar techniques for machine learning practitioners; the benefit provided here is a more formal characterization of how these techniques impact the overall solution error.
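The three-part decomposition can be written out as follows. The notation is my own shorthand and may differ from the paper's: R is the expected risk, w_* is the best parameter vector available in the function family H, w_n is the minimizer of the empirical risk on n samples, and \tilde{w}_n is the approximate minimizer the optimizer actually returns within the time budget.

```latex
\mathbb{E}\big[R(\tilde{w}_n)\big]
  = \underbrace{R(w_*)}_{\text{approximation error}}
  + \underbrace{\mathbb{E}\big[R(w_n) - R(w_*)\big]}_{\text{estimation error}}
  + \underbrace{\mathbb{E}\big[R(\tilde{w}_n) - R(w_n)\big]}_{\text{optimization error}}
```

The right-hand side telescopes back to the left-hand side, so the split is an identity rather than a bound. The tradeoff comes from the fact that n, the family H, and the optimization accuracy all compete for the same computing time.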
Returning to the choice of stochastic versus batch approaches, the discussion in Section 4.4 shows that “Even though a batch approach possesses a better dependency on epsilon, this advantage does not make up for its dependence on n. […] In conclusion, we have found that a stochastic optimization algorithm performs better in terms of expected error, and, hence, makes a better learning algorithm in the sense considered here.” Even this conclusion can be tempered somewhat by the computational benefits of batching on a particular computational infrastructure, that is, the benefits of GPUs and parallelism.
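A rough way to see the quoted tradeoff is to compare two simplified cost models: a batch method whose work to reach optimization accuracy epsilon scales like n * log(1/epsilon), and a stochastic method whose work scales like 1/epsilon regardless of n. These are order-of-magnitude models with all constants dropped, written from my reading of the argument, so treat the numbers as illustrative only. The function names are hypothetical.

```python
import math

def batch_cost(n, eps):
    """Simplified batch cost model ~ n * log(1/eps); constants dropped (assumption)."""
    return n * math.log(1.0 / eps)

def stochastic_cost(eps):
    """Simplified stochastic cost model ~ 1/eps, independent of n (assumption)."""
    return 1.0 / eps

for n in (10**2, 10**6, 10**8):
    for eps in (1e-2, 1e-4, 1e-6):
        b = batch_cost(n, eps)
        s = stochastic_cost(eps)
        cheaper = "stochastic" if s < b else "batch"
        print(f"n={n:>9} eps={eps:g} batch={b:.3g} stochastic={s:.3g} -> {cheaper}")
```

Under these models the batch method wins only when n is small and very high accuracy is wanted. Once n is large, its per-iteration dependence on n dominates, which matches the direction of the authors' conclusion.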
Section 4.5 is a commentary on some of the remaining challenges and questions that must be confronted when using SG for large-scale machine learning. Since the issues raised in Section 4.5 pair nicely with the discussion in Section 5, I’ll save both for next time.