I am continuing my series on “Optimization Methods for Large-Scale Machine Learning” by Bottou, Curtis, and Nocedal. You can find the entire series here.

In previous posts we discussed the use of **deep neural networks **(DNNs) in machine learning, and the pivotal role of the optimization of carefully selected **prediction functions** in training such models. These topics roughly correspond to the first two sections of the paper.

Section 3 provides an overview of optimization methods appropriate for DNNs. We begin where we left off in Section 2 by assuming a family of prediction functions parameterized by *w*. In other words, *w *represents the learning parameters. We also assume a loss function that depends on predicted and actual values, for example misclassification rate. Typically we’re minimizing the loss function *f *over a set of samples – the empirical risk. We call this *R_n(w)*, which in turn is the average of the loss for each sample: *(1/n) *Ʃ*_i∇f_i()*. Given all of this, there are two fundamental paths for minimizing risk. In each case we iteratively improve the learning parameters step by step, where we write the parameters at step *k* as *w_k*.

**stochastic**:*w_k+1 ← w_k − α_k ∇f_ik (w_k)*where the index*ik*is chosen randomly.**batch**:*w_k+1 ← w_k − (α_k / n)*Ʃ*_i∇f_i(w_k)*

The difference between the two paths is how many samples we consider on each step. The tradeoff is between per iteration cost (where stochastic wins) versus per iteration improvement (where batch wins). __Stochastic algorithms a good choice for DNNs because they employ information more efficiently__. If training samples are the same, or similar, than the added value of per step improvement from batch is not going to be worth it, because some (or most) of the added value turns out to be redundant. “If one believes that working with only, say, half of the data in the training set is sufficient to make good predictions on unseen data, then one may argue against working with the entire training set in every optimization iteration. Repeating this argument, working with only a quarter of the training set may be useful at the start, or even with only an eighth of the data, and so on. In this manner, we arrive at motivation for the idea that working with small samples, at least initially, can be quite appealing.”

The authors summarize: “there are intuitive, practical, and theoretical arguments in favor of stochastic over batch approaches in optimization methods for large-scale machine learning”. I will omit a summary of the theoretical motivations, but they are found in Section 3.3.

Next the authors consider improving on SG. One path involves trying to realize the best of both worlds between batch and stochastic methods, namely preserving the low per-iteration cost of stochastic methods while improving the per-iteration improvement. Another path is to “attempt to overcome the adverse effects of high nonlinearity and ill-conditioning”. This involves trying to employ information beyond just the gradient. We’ll examine these alternatives in future posts in this series, when we get to later sections in the paper.