Hey, so I mostly read a 93 page paper. The topic is a worthy one: optimization methods for large-scale machine learning. **Deep learning** powers best in class speech, image, and text intelligence on the web today, and deep learning is in turn powered by **optimization**. I will summarize “Optimization Methods for Large-Scale Machine Learning” by Bottou, Curtis, and Nocedal over the next few posts because it provides a useful operations research-centered evaluation of an important area in machine learning. In general, machine learning practitioners don’t know shit about operations research, and vice versa. This paper, along with work of Stephen Wright at Wisconsin (check out this talk), will certainly help to remedy this situation. I also predict that this paper will spur new advances in deep learning.

Here goes, and remember that I’m trying to summarize 93 pages!

The title of the paper is quite broad, but the focus is primarily on the use of the **stochastic gradient descent** (SGD) method (and variants) in deep learning applications. If you don’t have any previous experience with these topics, this series may not be for you, but I will try to summarize anyway. The term “deep learning” describes a range of machine learning algorithms that are used to classify or predict. Deep learning is primarily distinguished by:

- The use of much
**more input data**than is typical for machine learning, - Models that have
**many internal layers**of data manipulation and transformation, - A reliance on
**parallel and GPU**processing.

Training a deep learning algorithm involves finding model parameters that produce effective predictors or classifiers. Finding the values of variables that produce the best results for a particular objective (or “goal”) is the job of optimization. The stochastic gradient descent method is so-named because it repeatedly takes steps in the direction of steepest descent, which is defined by the gradient of the objective we want to optimize. If we think of the objective function as a hilly field, then the gradient always points in the steepest direction down, when we examine the immediate area around where we stand. The “stochastic” part of SGD applies because rather than looking at all of the samples over which the objective function is defined, we only look one (or a few) randomly determined sample. As compared to using the full gradient, this approach takes less time to take a single step, but the step is possibly less effective in improving the value of our objective function. In theory and practice we can establish that often the tradeoff is worth it. Characterizing these tradeoffs more concretely is one of the objectives of the paper. As as supplement to the paper and this post, check out this great post by Sebastian Ruder for an overview of gradient descent algorithms for machine learning.

In my next post in this series, I will cover Section 2 which describes the selection of a prediction function that is useful for modeling but practical for model training at scale.

**Updated 8/2/2016** to correctly summarize SGD. Thanks J-F!

Hi Nathan,

Not sure I agree with you summary of SGD definition. Rather than selecting randomly a subset of the dimensions, it selects randomly a sample to learn from. This may result in that, but it is not equivalent IMHO.

Of course you are correct! I have updated the post – thank you.