## A Lazier Way to Solve Knuth’s Skeleton Multiplication Puzzle

`        N N N  (1)    x N N N N  (2)    ---------        N N N 7  (3)    N N N N    (4)  N N N N      (5)  N N N        (6)  -----------    7 7 7 7 7 7  (N != 7)`

The answer found by Knuth is the following:

```        5 3 9  (1)
x 1 4 4 3  (2)
---------
1 6 1 7  (3)
2 1 5 6    (4)
2 1 5 6      (5)
5 3 9        (6)
-----------
7 7 7 7 7 7 ```

I was able to find a second solution where there was a leading zero in row (4). To find it, I considered all the prime factors of 777777 and reasoned about which pairs could fit the pattern of the puzzle.

In this post I want to show you how I originally found this solution.

I wrote a bit of code in Python to express part of the problem as a constraint programming (CSP) model. Instead of trying to work out the solution myself, I write down the conditions that must be true, and let the constraint programming solver do the work.

The basic approach for solving a puzzle using a solver is:

1. Write down the variables (unknowns) that I want the solver to find.
2. Write down all of the constraints (logical conditions) that need to be true.
3. Turn the above into the right API calls for the solver I am using.

Here’s how it works (and here’s a link to my code). I used the python_constraint package as my CSP solver. To specify the puzzle we need a Problem() object:

`problem = Problem()`

In a CSP I first define variables that I want the solver to find. The variables I care most about are n_1 and n_2. I know that n_1 < 1000 and n_2 < 10000. addVariable‘s first argument is the name of the variable and the second is the domain.

```problem.addVariable("n_1", list(range(100, 999+1)))
problem.addVariable("n_2", list(range(1000, 9999+1))) ```

I also am interested in the digits of each number. I will call the digits for the first (d_11, d_12, d_13), and the second (d_21, d_22, d_23, d_24). Remember that none of these digits can be a seven.

```not_seven = [0, 1, 2, 3, 4, 5, 6, 8, 9] # no sevens allowed!

# n_1 = d_13 d_12 d_11

# n_2 = d_24 d_23 d_22 d_21

Now I start writing down conditions that I know to be true. The first is that the product is a bunch of sevens:

`problem.addConstraint(lambda n_1, n_2: n_1 * n_2 == 777777, ['n_1', 'n_2'])`

Even if you know Python this may be confusing because you may not be familiar with the python_constraint package. The point is that I am providing an anonymous (lambda) function that evaluates the product of n_1 and n_2 as the first argument, and the names of the variables I defined earlier as the second.

Next, if I put the digits d_1x together, I get n_1:

```problem.addConstraint(lambda d_11, d_12, d_13, n_1:
100 * d_13 + 10 * d_12 + d_11 == n_0,
['d_11', 'd_12', 'd_13', 'n_1'])```

Same for n_2:

```problem.addConstraint(lambda d_21, d_22, d_23, d_24, n_2:
1000 * d_24 + 100 * d_23 + 10 * d_22 + d_21 == n_2,
['d_21', 'd_22', 'd_23', 'd_24', 'n_2']) ```

Also, as considered earlier, when I multiply the last two digits of n_1 and n_2, the result must end in 7:

`problem.addConstraint(lambda d_11, d_21: (d_11 * d_21) % 10 == 7, ['d_11', 'd_21'])`

Now, if I wanted to be complete, I would write down a bunch more conditions for all of the Xs that appear in rows 3-6, using the rules of long multiplication. But why would I want to do that? The point is to be lazy.

Let’s just solve the problem, ask for all the solutions, and print them out.

```s = problem.getSolutions()
# s is a list of solutions. Each solution is a dict of variable values.
for t in s:
print(f"{t['n_0']} x {t['n_1']} = {t['n_0'] * t['n_1']}") ```

Wow! The same four that I discovered above. This was “easier” for me in the sense that I already knew how to use the python_constraint package, so I could just do the coding above without having to “think”. Doing it “the old fashioned way” was in the end more fun, but requires more deliberation and focus for me.

## A Skeleton Multiplication Puzzle by Knuth

I want to share with you two ways to solve an interesting puzzle from Don Knuth’s “The Art of Computer Programming“. Here is the first!

The puzzle that caught my attention, quite by chance, is the “skeleton multiplication puzzle” given in Exercise 50 of Pre-Fascicle 9b (called “A Potpourri of Puzzles”). We want to find two numbers whose product is 777777, but we are given very little information. Here is the puzzle:

```        N N N  (1)
x N N N N  (2)
---------
N N N 7  (3)
N N N N    (4)
N N N N      (5)
N N N        (6)
-----------
7 7 7 7 7 7  (N != 7)```

All of the “Ns” are unknown, and none of them can be a 7. Notice that I’ve numbered the rows (1) – (6) because I am going to obsess about them below.

This doesn’t seem like a lot to go on, does it? I’ll give you some space to think about it, then I will show you Knuth’s solution.

….

OK, here’s what Knuth says:

“The only 3-digit divisors of 777777 = 3 x 7 x 7 x 11 x 13 x 37 that contain no 7s and don’t end in 1 are (143, 259, 429, 539). Their cofactors are (5439, 3003, 1813, 1443). And only 539 x 1443 gives just seven 7s.”

So that settles that, right? Well, not quite. Let’s dig into the details.

Let’s call the number in row (1) n_1, and the number in row (2) n_2. We want n_1 * n_2 = 777777.

Remember (or perhaps learn for the very first time…) that for every positive integer, there is a unique set of primes whose product is that number. This is called the prime factorization of a number.  We can use the prime factorization of 777777 to narrow down the possibilities for n_1 and n_2.

From Wolfram Alpha, we can get the prime factors: (3, 7, 7, 11, 13, 37).

Since n_1 * n_2 = 777777,  the prime factors of n_1 and n_2 also come from this set. Let’s focus on n_1. If I go through all the subsets of (3, 7, 7, 11, 13, 37) whose product has three digits, here’s what I find. Hopefully I didn’t miss any!

7 * 37 = 259
11 * 13 = 143
11 * 37 = 407
13 * 37 = 481
3 * 7 * 7 = 147
3 * 7 * 11 = 231
3 * 7 * 13 = 273
3 * 11 * 13 = 429
11 * 13 = 143
3 * 7 * 37 = 777
7 * 7 * 11 = 539
7 * 7 * 13 = 637

But remember that N != 7, so that means:

• n_1 can’t have a 7 as a digit
• The last number of n_1 cannot be 1, otherwise the last digit of n_2 would be 7, because the last digit of the product is 7: look at row (3).

So now we’re down to these possibilities:

259 (cofactor is 777777/259=3003)
143 (cofactor 5439)
429 (cofactor 1813)
539 (cofactor 1443)

But a full solution has to avoid 7s in rows (3)-(6). So we need to work out the long multiplication to see which of these possibilities are actually solutions. Go ahead and do that on paper now, or you can cheat here. When you do, you will see that Knuth’s solution using 539 and 1443 does solve the puzzle. Of course it does; he’s Don Frigging Knuth.

```        4 2 9  (1)
x 1 8 1 3  (2)
---------
1 2 8 7  (3)
0 4 2 9    (4)
3 4 3 2      (5)
4 2 9        (6)
-----------
7 7 7 7 7 7  ```

Hey, this works too! This new solution is a bit strange because it has a leading zero in row (4), but nobody said I can’t do that. When I wrote Prof. Knuth, he agreed.

There is one more plot twist. It turns out that what I have written here is *not* the way I discovered this alternate solution; I am far too lazy for that. Instead, I used Python to do the thinking for me. Read more next time!

## Optimizing 19th Century Typewriters

##### The long title for this post is: “Optimizing 19th Century Typewriters using 20th Century Code in the 21st Century”.

Patrick Honner recently shared Hardmath123’s wonderful article “Tuning a Typewriter“. In it, Hardmath123 explores finding the best way to order the letters A-Z on an old and peculiar typewriter. Rather than having a key for each letter as in a modern keyboard, the letters are laid out on a horizontal strip. You shift the strip left or right to find the letter you want, then press a key to enter it: What’s the best way to arrange the letters on the strip? You probably want to do it in such a way that you have to shift left and right as little as possible. If consecutive letters in the words you’re typing are close together on the strip, you will minimize shifting and type faster.

The author’s approach is to:

• Come up with an initial ordering at random,
• Compute the cost of the arrangement by counting how many shifts it takes to type out three well-known books,
• Try to find two letters that when you swap them results in a lower cost,
• Swap them and repeat until you can no longer find an improving swap.

This is a strong approach that leads to the same locally optimal arrangements, even when you start from very different initial orderings. It turns out that this is an instance of a more general optimization problem with an interesting history: quadratic assignment problems. I will explain what those are in a moment.

Each time I want to type a letter, I have to know how far to shift the letter strip. That depends on two factors:

1. The letter that I want to type in next, e.g. if I am trying to type THE and I am on “T”, “H” comes next.
2. The location of the next letter, relative to the current one T. For example, if H is immediately to the left of T, then the location is one shift away.

If I type in a bunch of letters, the total number of shifts can be computed by multiplying two matrices:

• A frequency matrix F. The entry in row R and column C is a count of how often letter R precedes letter C. If I encounter the word “THE” in my test set, then I will add 1 to F(“T”, “H”) and 1 to F(“H”, “E”).
• A distance matrix D. The entry in row X and column Y is the number of shifts between positions X and Y on the letter strip. For example, D(X, X+1) = 1 since position X is next to position X+1.

Since my problem is to assign letters to positions, if I permute the rows and columns of D and multiply this matrix with F, I will get the total number of shifts required. We can easily compute F and D for the typewriter problem:

• To obtain F, we can just count how often one letter follows another and record entries in the 26 x 26 matrix. Here is a heatmap for the matrix using the full Project Gutenberg files for the three test books: • The distance matrix D is simple: if position 0 is the extreme left of the strip and 25 the extreme right, d_ij = abs(i – j).

The total number of shifts is obtained by summing f_ij * d_p(i),p(j) for all i and j, where letter i is assigned to location p(i).

Our problem boils down to finding a permutation that minimizes this matrix multiplication. Since the cost depends on the product of two matrices, this is referred to as a Quadratic Assignment Problem (QAP). In fact, problems very similar to this one are part of the standard test suite of problems for QAP researchers, called “QAPLIB“. The so-called “bur” problems have similar flow matrices but different distance matrices.

We can use any QAP solution approach we like to try to solve the typewriter problem. Which one should we use? There are two types of approaches:

• Those that lead to provably global optimal solutions,
• Heuristic techniques that often provide good results, but no guarantees on “best”.

QAP is NP-hard, so finding provably optimal solutions is challenging. One approach for finding optimal solutions, called “branch and bound”, boils down to dividing and conquering by making partial assignments, solving less challenging versions of these problems, and pruning away assignments that cannot possibly lead to better solutions. I have written about this topic before. If you like allegories, try this post. If you prefer more details, try my PhD thesis.

The typewriter problem is size 26, which counts as “big” in the world of QAP. Around 20 years ago I wrote a very capable QAP solver, so I recompiled it and ran it on this problem – but didn’t let it finish. I am pretty sure it would take at least a day of CPU time to solve, and perhaps more. It would be interesting to see if someone could find a provably optimal solution!

In the meantime, this still leave us with heuristic approaches. Here are a few possibilities:

• Local optimization (Hardmath123’s approach finds a locally optimal “2-swap”)
• Simulated annealing
• Evolutionary algorithms

I ran a heuristic written by Éric Taillard called FANT (Fast ant system). I was able to re-run his 1998 code on my laptop and within seconds I was able to obtain the same permutation as Hardmath123. By the way, the zero-based permutation is [9, 21, 5, 6, 12, 19, 3, 10, 8, 24, 1, 16, 18, 7, 15, 22, 25, 14, 13, 11, 17, 2, 4, 23, 20, 0] (updated 12/7/2018 – a previous version of this post gave the wrong permutation. Thanks Paul Rubin for spotting the error!)

You can get the data for this problem, as well as a bit of Python code to experiment with, in this git repository.

It’s easy to think up variants to this problem. For example, what about mobile phones? Other languages? Adding punctuation? Gesture-based entry? With QAPs, anything is possible, even if optimality is not practical.

## Chaining Machine Learning and Optimization Models

Rahul Swamy recently wrote about mixed integer programming and machine learning. I encourage you to go and read his article.

Though Swamy’s article focuses on mixed integer programming (MIP), a specific category of optimization problems for which there is robust, efficient software, his article applies to optimization generally. Optimization is goal seeking; searching for the values of variables that lead to the best outcomes. Optimizers solve for the best variable values.

Swamy describes two relationships between optimization and machine learning:

1. Optimization as a means for doing machine learning,
2. Machine learning as a means for doing optimization.

I want to put forward a third, but we’ll get to that in a moment.

Relationship 1: you can always describe predicting in terms of solving. A typical flow for prediction in ML is

1. Get historical data for:
1. The thing you want to predict (the outcome).
2. Things that you believe may influence the predicted variable (“features” or “predictors”).
2. Train a model using the past data.
3. Use the trained model to predict future values of the outcome.

Training a model often means “find model parameters that minimize prediction error in the test set”. Training is solving. Here is a visual representation: Relationship 2. You can also use ML to optimize. Swami gives several examples of steps in optimization algorithms that can be described using the verbs “predict” or “classify”, so I won’t belabor the point. If the steps in our optimization algorithm are numbered 1, 2, 3, the relationship is like this: In these two relationships, one verb is used as a subroutine for the other: solving as part of predicting, or predicting as part of solving.

There is a third way in which optimization and ML relate: using the results of machine learning as input data for an optimization model. In other words, ML and optimization are independent operations but chained together sequentially, like this: My favorite example involves sales forecasting. Sales forecasting is a machine learning problem: predict sales given a set of features (weather, price, coupons, competition, etc). Typically business want to go further than this. They want to take actions that will increase future sales. This leads to the following chain of reasoning:

• If I can reliably predict future sales…
• and I can characterize the relationship between changes in feature values and changes in sales (‘elasticities’)…
• then I can find the set of feature values that will increase sales as much as possible.

The last step is an optimization problem.

But why are we breaking this apart? Why not just stick the machine learning (prediction) step inside the optimization? Why separate them? A couple of reasons:

• If the ML and optimization steps are separate, I can improve or change one without disturbing the other.
• I do not have to do the ML at the same time as I do the optimization.
• I can simplify or approximate the results of the ML model to produce a simpler optimization model, so it can run faster and/or at scale. Put a different way, I want the structure of the ML and optimization models to differ for practical reasons.

In the machine learning world it is common to refer to data pipelines. But ML pipelines can involve models feeding models, too! Chaining ML and optimization like this is often useful, so keep it in mind.

## Noise Reduction Methods for Large-Scale Machine Learning

I have two posts remaining in my series on “Optimization Methods for Large-Scale Machine Learning” by Bottou, Curtis, and Nocedal. You can find the entire series here. These last two posts will discuss improvements on the base stochastic gradient method. Below I have reproduced Figure 3.3, which suggests two general approaches. I will cover noise reduction in this post, and second-order methods in the next. The left-to-right direction on the diagram signifies noise reduction techniques. We say that the SG search direction is “noisy” because it includes information from only one (randomly generated) sample per iteration. We use a noisy direction, of course, because it’s too expensive to use the entire gradient. But we can consider using a small batch of samples per iteration (a “minibatch”), or using information from previous iterations. The idea here is to find a happy medium between the far left of the diagram, which represents one sample per iteration, and the right, which represents using the full gradient.

Section 5 describes several noise reduction techniques. Dynamic sample size methods vary the number of samples in a minibatch per iteration, for example by increasing the batch size geometrically with the iteration count. Gradient aggregation, as the name suggests, involves the use of gradient information from past iterations. The SVRG method involves starting with a full batch gradient, then for subsequent iterations updating the gradient using gradient information at a single sample. The SAGA method involves “taking the average of stochastic gradients evaluated at previous iterates”. Finally, iterate averaging methods use the iterates from multiple previous steps to update the current iterate.

The motivations behind these various noise reduction methods are more or less the same: make more progress on a single step without paying too much of a computational cost. The primary tradeoff, in addition to increased computational cost per iteration, of course, is the extra storage associated with keeping extra state to compute search direction. Section 5 of the paper discusses these tradeoffs in light of convergence criteria.

[Updated 8/24/2016] Going back to our diagram, the up and down dimension of the diagram represents so-called “second-order methods”. Gradient-based methods, including SG, are first-order methods because they use a first-order (linear) approximation to the objective function we want to optimize. Second-order methods attempt to look at the curvature of the objective function to obtain better search directions. Once again, there is a tradeoff: using the curvature is more work, but we hope that by computing better search directions we’ll need far fewer iterations to get a good solution. I had originally intended on covering Section 6 of the paper, which describes several such methods in detail, but I will leave the interested reader to dig through that section themselves!

## Analyses of Stochastic Gradient Methods

I am continuing my series on “Optimization Methods for Large-Scale Machine Learning” by Bottou, Curtis, and Nocedal. You can find the entire series here.

Last time we discussed stochastic and batch methods for optimization in machine learning. In both cases we’re trying to optimize a loss function that will give us good learning parameters for a deep learning model. We do this iteratively. A pure stochastic gradient (SG) approach picks one sample per iteration to define a search direction, whereas a batch method will pick multiple samples.

In this post I want to cover most of Section 4, which concerns the convergence and complexity of SG methods. For completeness I have reproduced the authors’ summary of a general purpose SG algorithm below:

`Choose an iterate w_1for k = 1, 2, … do   generate a random number s_k   compute a stochastic vector g(w_k, s_k)   choose a stepsize a_k > 0.   w_k+1 <- w_k – a_k g(w_k, s_k)`

In this algorithm, g is the search direction. In this series, we’ve already discussed three different choices for g:

• g is the gradient. This is conventional gradient descent. In this case the random number is ignored.
• g is the gradient for a single sample. This is conventional stochastic gradient.
• g is the gradient over a subset of the samples. This is “mini-batch” SG.

As the authors show, it’s not easy to make definitive statements on comparison between mini batch and SG. The gist of what they show is that SG has better convergence properties in theory, but batch methods can provide certain practical advantages. That is: “one can, however, realize benefits of mini-batching in practice since it offers important opportunities for software optimization and parallelization; e.g., using sizeable mini-batches is often the only way to fully leverage a GPU processor.” The rest of Section 4 spells this out in more detail through convergence results and complexity analysis.

As a practitioner, I honestly don’t place huge weight on the convergence theorems presented in Section 4. The key result for me is Theorem 4.9, which states that for a nonconvex objective (which we have in most deep learning scenarios) the SG method converges when the stepsize diminishes according to a somewhat loosely defined schedule.

More interesting (for me) is Section 4.4, which discusses the overall work complexity for applying SG to deep learning scenarios. In many real-world big data scenarios, we’re optimizing a loss function using previously trained model parameters under a particular computational time limit. This is different from many traditional optimization scenarios, where we let the code run until we achieve a particular solution accuracy. In our situation, the total expected error in the model is the sum of three components: the expected risk using optimal parameters, the estimated error in the expected and empirical risk, and the optimization accuracy. Minimizing this error involves tradeoffs, for example “if one decides to make the optimization more accurate … one might need to make up for the additional computing time by: (i) reducing the sample size n, potentially increasing the estimation error; or (ii) simplifying the function family H, potentially increasing the approximation error.” These are familiar techniques for machine learning practitioners; the benefit provided here is a more formal characterization of how these techniques impact the overall solution error.

Returning to the choice of conventional versus batch approaches, the discussion in Section 4.5 shows that “Even though a batch approach possesses a better dependency on epsilon, this advantage does not make up for its dependence on n. […] In conclusion, we have found that a stochastic optimization algorithm performs better in terms of expected error, and, hence, makes a better learning algorithm in the sense considered here.” Again, even this statement could be mitigated somewhat by the computational benefits associated with batching on a particular computational infrastructure, that is, the benefits of GPUs and parallelism.

Section 4.5 is a commentary on some of the remaining challenges and questions that must be confronted when using SG for large-scale machine learning. Since the issues raised in Section 4.5 pair nicely with the discussion in Section 5, I’ll save both for next time.

## Stochastic and Batch Methods for Machine Learning

I am continuing my series on “Optimization Methods for Large-Scale Machine Learning” by Bottou, Curtis, and Nocedal. You can find the entire series here.

In previous posts we discussed the use of deep neural networks (DNNs) in machine learning, and the pivotal role of the optimization of carefully selected prediction functions in training such models. These topics roughly correspond to the first two sections of the paper.

Section 3 provides an overview of optimization methods appropriate for DNNs. We begin where we left off in Section 2 by assuming a family of prediction functions parameterized by w. In other words, w represents the learning parameters. We also assume a loss function that depends on predicted and actual values, for example misclassification rate. Typically we’re minimizing the loss function f over a set of samples – the empirical risk. We call this R_n(w), which in turn is the average of the loss for each sample: (1/n) Ʃ_i∇f_i(). Given all of this, there are two fundamental paths for minimizing risk. In each case we iteratively improve the learning parameters step by step, where we write the parameters at step k as w_k.

• stochastic: w_k+1 ← w_k − α_k ∇f_ik (w_k) where the index ik is chosen randomly.
• batch: w_k+1 ← w_k − (α_k / n) Ʃ_i∇f_i(w_k)

The difference between the two paths is how many samples we consider on each step. The tradeoff is between per iteration cost (where stochastic wins) versus per iteration improvement (where batch wins). Stochastic algorithms a good choice for DNNs because they employ information more efficiently. If training samples are the same, or similar, than the added value of per step improvement from batch is not going to be worth it, because some (or most) of the added value turns out to be redundant. “If one believes that working with only, say, half of the data in the training set is sufficient to make good predictions on unseen data, then one may argue against working with the entire training set in every optimization iteration. Repeating this argument, working with only a quarter of the training set may be useful at the start, or even with only an eighth of the data, and so on. In this manner, we arrive at motivation for the idea that working with small samples, at least initially, can be quite appealing.”

The authors summarize: “there are intuitive, practical, and theoretical arguments in favor of stochastic over batch approaches in optimization methods for large-scale machine learning”. I will omit a summary of the theoretical motivations, but they are found in Section 3.3.

Next the authors consider improving on SG. One path involves trying to realize the best of both worlds between batch and stochastic methods, namely preserving the low per-iteration cost of stochastic methods while improving the per-iteration improvement. Another path is to “attempt to overcome the adverse effects of high nonlinearity and ill-conditioning”.  This involves trying to employ information beyond just the gradient. We’ll examine these alternatives in future posts in this series, when we get to later sections in the paper.

## Model Building for Large-Scale Machine Learning

In this post on my series on “Optimization Methods for Large-Scale Machine Learning” by Bottou, Curtis, and Nocedal, I want to focus on model building in machine learning.

Section 2 of the paper describes several case studies, with the purpose of showing how “the process of machine learning leads to the selection of a prediction function through solving an optimization problem.” A prediction function is a mathematical function that links the model inputs to the quantity we wish to predict. From the practitioner’s point of view, a prediction function is implicitly specified by the technique the data scientist has chosen (for example, regression or neural networks) and trained model parameters (what is actually learned when the technique is applied to data).

For example, the structure of a neural network amounts to a description of a family of related functions. In the diagram below I have given two simple neural networks with corresponding prediction functions. The first simply adds the two inputs together. The second specifies a linear function involving a vector of inputs and training parameters W and b. Training the neural network amounts to choosing a particular function from the family corresponding to the nodes. Neural networks are interesting because they yield “large-scale, highly nonlinear, and nonconvex optimization problems”. For optimization practitioners, the “nonconvex” part of this statement is important because nonconvex optimization problems are particularly challenging. Here is a snippet from Stephen Boyd’s Convex Optimization I class that makes the point well.

With this in mind we may be tempted to avoid neural networks, and deep learning, altogether. However, as section 2.2 points out, certain classification tasks, like those involving speech and images, are “not well performed in an automated manner using computer programs based on sets of prescribed rules.” Deep neural networks (DNNs) involve many internal layers of manipulations and transformations, which lead to very flexible, highly parameterized models. Therefore while the corresponding optimization models for DNN are really damn hard, the potential payoff is worth it.

When a machine learning application is trying to classify data, for example in handwriting recognition, it is typical to minimize a function that relates to the misclassification rate. There are various choices for the specific function, as noted here and here (for empirical risk minimization). While we want to minimize a loss function relating to the misclassification rate, we also want classifiers that are general. In other words, if they work great on the data that we have at the time the classifier is learned, but poorly on data that comes in after that, our classifier is not very useful. For this we often divide our data into training, validation, and testing sets. Read here for more.

Section 2.3 considers the determination of a prediction function that accurately predicts outputs given inputs. We want this function to work well over the set of inputs that we will see in the real world, not just the training set. Therefore “one should choose the prediction function h by attempting to minimize a risk measure over an adequately selected family of prediction functions”. A family of functions can be described in many ways, for example as a particular functional form with parameters in it as in m x + b for parameters (m, b). Adequately selected means:

• Able to achieve low empirical risk by choosing a rich family of functions or by using knowledge about the problem domain.
• The gap between expected and empirical risk should be small, that is, they should not be biased towards or underfit the input data.
• Chosen so the resulting optimization problem can be solved efficiently.

These considerations are at odds with one another as some point towards broader, more complicated families of functions and others simpler. With regards to the first consideration, increasing the number of training samples is helpful. So is choosing a function family with a high “capacity”, which can be loosely described as a function’s “complexity, expressive power, richness, or flexibility.

Having considered what makes a good prediction function, the authors next consider procedures for finding them. The approach considered in Section 2.3 is called structural risk minimization – here is a good overview. A nice visual representation is given in Figure 2.5, but the point is to avoid both underfitting and overfitting. Underfitting happens which happens when the observed empirical risk (the frequency of observed misclassification) is high. This happens when the prediction function is insufficiently expressive to link inputs to outputs, which can happen when the network structure doesn’t make sense or is too simplistic. Overfitting happens when increasing the number or complexity of the model parameters begins to increase the misclassification rate in real-world data. This can happen even as the misclassification rate on our training data decreases. In other words, the model no longer effectively generalizes to the real world – it is too highly tuned to the data at hand. All of this implies that picking functional families that give good empirical risk may be counterproductive. The remedy to this situation is to split input data into training, validation, and testing sets, as alluded to in the first post in this series.

In the next post in this series, we’ll cover Section 3, which describes the optimization methods used to train these models.

## Optimization Methods for Large-Scale Machine Learning

Hey, so I mostly read a 93 page paper. The topic is a worthy one: optimization methods for large-scale machine learning. Deep learning powers best in class speech, image, and text intelligence on the web today, and deep learning is in turn powered by optimization. I will summarize “Optimization Methods for Large-Scale Machine Learning” by Bottou, Curtis, and Nocedal over the next few posts because it provides a useful operations research-centered evaluation of an important area in machine learning. In general, machine learning practitioners don’t know shit about operations research, and vice versa. This paper, along with work of Stephen Wright at Wisconsin (check out this talk), will certainly help to remedy this situation. I also predict that this paper will spur new advances in deep learning.

Here goes, and remember that I’m trying to summarize 93 pages!

The title of the paper is quite broad, but the focus is primarily on the use of the stochastic gradient descent (SGD) method (and variants) in deep learning applications. If you don’t have any previous experience with these topics, this series may not be for you, but I will try to summarize anyway. The term “deep learning” describes a range of machine learning algorithms that are used to classify or predict. Deep learning is primarily distinguished by:

1. The use of much more input data than is typical for machine learning,
2. Models that have many internal layers of data manipulation and transformation,
3. A reliance on parallel and GPU processing.

Training a deep learning algorithm involves finding model parameters that produce effective predictors or classifiers. Finding the values of variables that produce the best results for a particular objective (or “goal”) is the job of optimization. The stochastic gradient descent method is so-named because it repeatedly takes steps in the direction of steepest descent, which is defined by the gradient of the objective we want to optimize. If we think of the objective function as a hilly field, then the gradient always points in the steepest direction down, when we examine the immediate area around where we stand. The “stochastic” part of SGD applies because rather than looking at all of the samples over which the objective function is defined, we only look one (or a few) randomly determined sample. As compared to using the full gradient, this approach takes less time to take a single step, but the step is possibly less effective in improving the value of our objective function. In theory and practice we can establish that often the tradeoff is worth it. Characterizing these tradeoffs more concretely is one of the objectives of the paper. As as supplement to the paper and this post, check out this great post by Sebastian Ruder for an overview of gradient descent algorithms for machine learning.

In my next post in this series, I will cover Section 2 which describes the selection of a prediction function that is useful for modeling but practical for model training at scale.

Updated 8/2/2016 to correctly summarize SGD. Thanks J-F!

## Finding Optimal State Capitol Tours on the Cloud with NEOS

My last article showed you how to find an optimal tour of all 48 continental US state capitols using operations research. I used the Python API of the popular Gurobi solver to create and solve a traveling salesman problem (TSP) model in a few seconds.

In this post I want to show you how to use Concorde, the world’s best TSP solver for free on the cloud using the NEOS optimization service. In less than 100 lines of Python code, you can find the best tour. Here it is: Using NEOS is pretty easy. You need to do three things to solve an optimization problem:

1. Create a NEOS account.
2. Create an input file for the problem you want to solve.
3. Give the input file to NEOS, either through their web interface, or by calling an API.

Let’s walk through those steps for the state capitol problem. If you just want to skip to the punchline, here is my code.

Concorde requires a problem specification in the TSPLIB format. This is a text based format where we specify the distances between all pairs of cities. Recall that Randy Olson found the distances between all state capitols using the Google Maps API in this post. Here is a file with this information. Using the distances, I created a TSPLIB input file with the distance matrix – here it is.

The next step is to submit the file to NEOS. Using the xmlrpc Python module, I wrote a simple wrapper to submit TSPLIB files to NEOS. The NEOS submission is an XML file that wraps the contents of the TSPLIB data, and also tells NEOS that we want to use the Concorde solver. The XML file is given to NEOS via an XML-RPC call. NEOS returns the results as a string – the end of the string contains the optimal tour. Here is the body of the primary Python function that carries out these steps:

`def solve_tsp_neos_concorde(dist):    xml = make_neos_concorde(dist)    neos = NeosClient()    result = neos.run(xml)    return tour_from_neos_concorde_result(result)`

When I run this code, I obtain the same tour as in my initial post. Hooray! You can also extend my code (which is based on NEOS documentation) to solve many other kinds of optimization models.