## Chaining Machine Learning and Optimization Models

Rahul Swamy recently wrote about mixed integer programming and machine learning. I encourage you to go and read his article.

Though Swamy’s article focuses on mixed integer programming (MIP), a specific category of optimization problems for which there is robust, efficient software, his article applies to optimization generally. Optimization is goal seeking: searching for the values of variables that lead to the best outcomes. Optimizers solve for the best variable values.

Swamy describes two relationships between optimization and machine learning:

1. Optimization as a means for doing machine learning,
2. Machine learning as a means for doing optimization.

I want to put forward a third, but we’ll get to that in a moment.

Relationship 1: you can always describe predicting in terms of solving. A typical flow for prediction in ML is

1. Get historical data for:
    1. The thing you want to predict (the outcome).
    2. Things that you believe may influence the predicted variable (“features” or “predictors”).
2. Train a model using the past data.
3. Use the trained model to predict future values of the outcome.

Training a model often means “find model parameters that minimize prediction error in the training set”. Training is solving. Here is a visual representation:

Relationship 2. You can also use ML to optimize. Swamy gives several examples of steps in optimization algorithms that can be described using the verbs “predict” or “classify”, so I won’t belabor the point. If the steps in our optimization algorithm are numbered 1, 2, 3, the relationship is like this:

In these two relationships, one verb is used as a subroutine for the other: solving as part of predicting, or predicting as part of solving.
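To make the first relationship concrete, here is a minimal sketch of training as solving: fitting a line by running gradient descent on the mean squared error over the training data. The function name `fit_line`, the learning rate, the step count, and the data are all illustrative assumptions, not anything taken from Swamy’s article.

```python
# Training as solving: fit y ≈ w * x + b by minimizing mean squared error
# on the training data with plain gradient descent.
# (fit_line and its default settings are hypothetical, for illustration only.)

def fit_line(xs, ys, learning_rate=0.01, steps=5000):
    """Return (w, b) that approximately minimize mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step downhill: this is the "solve" inside the "predict".
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up training data generated by y = 2x + 1.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

w, b = fit_line(train_x, train_y)
prediction = w * 5.0 + b  # use the trained model on a new input
```

The optimizer here is deliberately naive. The point is only that the “train” step is an optimization problem with the model parameters as its variables.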

There is a third way in which optimization and ML relate: using the results of machine learning as input data for an optimization model. In other words, ML and optimization are independent operations but chained together sequentially, like this:

My favorite example involves sales forecasting. Sales forecasting is a machine learning problem: predict sales given a set of features (weather, price, coupons, competition, etc.). Typically businesses want to go further than this. They want to take actions that will increase future sales. This leads to the following chain of reasoning:

• If I can reliably predict future sales…
• and I can characterize the relationship between changes in feature values and changes in sales (‘elasticities’)…
• then I can find the set of feature values that will increase sales as much as possible.

The last step is an optimization problem.
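A toy version of this chain might look like the sketch below. The ML step fits a linear relationship between price and sales, and a separate optimization step searches candidate prices using only the fitted coefficients. Everything here is an assumption made for illustration: the helper names `fit_demand` and `best_price`, the made-up history, and the choice to maximize predicted revenue rather than raw sales (with a single price feature, maximizing sales alone would trivially pick the lowest price).

```python
def fit_demand(prices, sales):
    """ML step: ordinary least squares for sales ≈ intercept + slope * price."""
    n = len(prices)
    mean_p = sum(prices) / n
    mean_s = sum(sales) / n
    cov = sum((p - mean_p) * (s - mean_s) for p, s in zip(prices, sales))
    var = sum((p - mean_p) ** 2 for p in prices)
    slope = cov / var  # the price "elasticity" in this toy model
    intercept = mean_s - slope * mean_p
    return intercept, slope


def best_price(intercept, slope, candidates):
    """Optimization step: choose the candidate price with the highest predicted revenue."""
    def predicted_revenue(price):
        predicted_sales = max(0.0, intercept + slope * price)
        return price * predicted_sales
    return max(candidates, key=predicted_revenue)


# Made-up history that follows sales = 200 - 10 * price.
history_prices = [8.0, 9.0, 10.0, 11.0, 12.0]
history_sales = [120.0, 110.0, 100.0, 90.0, 80.0]

# The two steps are chained: the optimizer sees only the fitted numbers.
intercept, slope = fit_demand(history_prices, history_sales)
candidates = [p / 2 for p in range(10, 31)]  # 5.0, 5.5, ..., 15.0
price = best_price(intercept, slope, candidates)
```

Because `best_price` only needs an intercept and a slope, the forecasting model could be swapped for something more elaborate, or refreshed on a different schedule, without touching the optimization code.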

But why are we breaking this apart? Why not just stick the machine learning (prediction) step inside the optimization? Why separate them? A few reasons:

• If the ML and optimization steps are separate, I can improve or change one without disturbing the other.
• I do not have to do the ML at the same time as I do the optimization.
• I can simplify or approximate the results of the ML model to produce a simpler optimization model, so it can run faster and/or at scale. Put a different way, I want the structure of the ML and optimization models to differ for practical reasons.

In the machine learning world it is common to refer to data pipelines. But ML pipelines can involve models feeding models, too! Chaining ML and optimization like this is often useful, so keep it in mind.

## Overview: Consciousness and the Brain

Consciousness and the Brain, written by French neuroscientist Stanislas Dehaene, is a fascinating overview of the mechanisms, boundaries, and possibilities of consciousness, from the point of view of an applied researcher. The work of great Dutch primatologist Frans de Waal led me to Dehaene; time permitting I will summarize some of de Waal’s insights in future posts.

The primary claim of Consciousness and the Brain is that genuine consciousness is indicated by conscious access: the ability for attended information to enter awareness and become reportable to others. It is the capacity to bring to mind accessible perceptions and thoughts. Vigilance, the state of wakefulness, and attention, the focusing of mental resources onto specific information, are not sufficient to constitute consciousness in Dehaene’s view.

“Consciousness is global information broadcasting within the cortex: it arises from a neuronal network whose reason to be is the massive sharing of pertinent information throughout the brain.” This network is referred to as a global neuronal workspace or simply “global workspace”. It enables a number of capabilities including:

• purely mental operations;
• the ability to keep data in mind indefinitely;
• performing arbitrary mental operations;
• reporting to others;
• the enablement of autonomy (augmenting the spontaneous activity that also occurs in the brain).

Experiments can teach us much about consciousness and the theory of the global workspace. Dehaene focuses on experiments with “minimal contrast”: “a pair of experimental situations that are minimally different but only one of which is consciously perceived”. Riding the line in this way helps determine the mechanisms that govern conscious perception.

Unconscious mechanisms play a massive role in our lives and being. Much of our mental activity, and much of what drives our lives, is unconscious, yet virtually all of the brain’s regions can participate in both conscious and unconscious thought. Unconscious mechanisms can be of a higher order than we may assume; for example, conscious attention is not required to bind the elements of a scene together. The unconscious binding together of systems occurs in vision, language, attention, and even certain mathematics. (Notably, several of these examples match areas of recent progress in machine learning using neural networks.) These unconscious operations can be detected and measured through careful experiment.

That said, consciousness provides unique capabilities that even sophisticated unconscious processing cannot. This is because “conscious perception transforms incoming information into an internal code that allows it to be processed in unique ways.” For example:

• the stable perception of objects moving through our visual space, even as we ourselves move;
• the ability to store and retain lasting thoughts, which permit us to learn over time;
• the ability to sequentially process information according to rules, similar to the functioning of a digital computer;
• the ability to route information to arbitrary systems in our brain;
• the ability to share our thoughts and perceptions through the use of language.

Through brain imaging and experimentation we are now able to identify the signatures of conscious thought. These signatures include:

1. The ignition of our parietal and prefrontal circuits.
2. A late slow wave in our brains, referred to as a “P3 wave”.
3. A late and sudden burst of high-frequency oscillations.
4. The synchronization of information exchanges across distant brain regions.

These signatures are detectable through various means including fMRI (functional magnetic resonance imaging), MEG (magnetoencephalography) and EEG (electroencephalography).

Because consciousness appears to emerge from the building, looping, and coordination of signals from different brain subsystems (see above), consciousness lags the real world. Consciousness is formed from loops in the brain. It is these loops that permit construction of mental images given incomplete sensory data; for example, there are massive differences between the raw, imperfect visual data that enters our eyes (e.g. the blind spot in our eyes directly behind our optic nerve, or our limited color range perception outside the center of our attention) and our conscious perception.

The ability to read the traces of conscious thought allows us to theorize about consciousness. Global neuronal workspace theory claims that the human brain has developed efficient long-distance networks to select relevant information and disseminate it throughout the brain. Consciousness is an evolved device that allows us to attend to a piece of information and keep it active. Conscious information can then be routed to other areas based on goals.

Within our brain there are collections of neurons that send reinforcing signals to each other under certain conditions, for example when a particular person, event, or sensation is perceived or remembered. During conscious perception, a small subset of workspace neurons become active, while most others are inhibited. The panoply of signaling related to inhibited clusters (for example, all clusters that do not pertain to the 2016 Chicago Cubs) forms a recognizable signal, which is referred to as the P3 wave. In other words, a primary signifier of conscious thought is the signal resulting from the repression of neurons. The P3 wave is in a sense a “negative thought signal”.

Not all unconscious thought is the same. Several types can be defined. Preconscious thought is information already encoded by an active assembly. It can become conscious at any time. Subliminal thought is input given or processed so weakly that we lack the capability of attending to it. Disconnected patterns are mental activities which have no relation to conscious thought, for example our regular breathing. Diluted mental activity is neural information that has been “downsampled” for use by other systems in the brain, and therefore cannot be brought to consciousness. An example is a visual pattern that flickers so fast that you cannot see it. Early levels in our visual system may register this flickering, but it is transformed by later levels.

Detecting and theorizing about consciousness allows us to ask, and potentially answer, deep questions. For example, a series of recent experiments seems to indicate the presence of consciousness within severely injured patients who cannot express their consciousness. An example is the case of Jean-Dominique Bauby, the author and subject of The Diving Bell and the Butterfly. A patient without the ability to communicate or move can be asked to imagine, for example, riding a bicycle for 30 seconds. The patient’s brain can be scanned via MRI during this period, and compared to a control group to establish that the neuronal clusters responding to bike riding, along with the signatures of consciousness, are present.

These questions can be asked of non-humans as well. Dehaene says “I would not be surprised if we discovered that all mammals, and probably many species of birds and fish, show evidence of a convergent evolution to the same sort of conscious workspace” as found in humans. (Here is the primary connection to Frans de Waal’s work.) Experiments show that monkeys, dolphins, and even rats and pigeons possess at least the rudiments of metacognition: thinking about thinking. “Animal behavior bears the hallmark of a conscious and reflexive mind.”

It is then worth asking what is uniquely human about human consciousness. Dehaene provides informed speculation. Perhaps it is our ability to combine our core brain systems using a “language of thought”: the inner voice that is nearly always present inside of us. Perhaps also it is our capacity to compose our thoughts using nested or recursive structures of symbols. This implies that language evolved as an internal representational device, not just as a communication system with other hairless apes. The ability to compose and nest may underlie many of our unique human abilities, such as crafting complex tools, performing higher mathematics, and being self-conscious. An examination of brain areas that are particularly well-developed among humans (as opposed to other primates) seems to support these theories at a high level.

Dehaene closes by theorizing what it would take to build artificial consciousness using computers. This topic, while fascinating, is not a primary subject of Dehaene’s book, and is best saved for another time.

## Other Minds: Consciousness and Evolution

I highly recommend Other Minds, by Peter Godfrey-Smith. It’s a fascinating exploration of the minds of cephalopods, who independently from vertebrates developed sophisticated nervous systems and what any reasonable person would call intelligence. In so doing, Godfrey-Smith explores the tree of life, the origins and components of complex thought and consciousness, and the ways of formless, curious creatures deep below. Read this book!

Godfrey-Smith describes two important revolutionary periods in Earth’s evolutionary history. In each case, a means of communication between organisms became a means of communication within them.

The first is the Sense-Signaling revolution. Roughly 700 million years ago, the first organisms that we could reasonably call animals – sensing and acting organisms – evolved. Just a bit later, around 542 million years ago according to Godfrey-Smith, certain organisms began to develop not only sensing mechanisms, but signaling mechanisms too. Both sensing and signaling, directed outward, provide evolutionary benefits: they help animals navigate and influence their environments. There is another advantage: these same sensing and signaling mechanisms can be used inside the space of the organism to better coordinate its sense-action loop. Input is processed through the senses, and then signaled in a targeted fashion to another part of the organism (be it tentacle, flipper, paw, or hand) to generate a specific response. Internal sensing and signaling happen only in higher-order organisms such as animals. The internalization of sensing and signaling marks the beginnings of the development of the nervous system. Millions of years later, the sensing and signaling mechanisms found in animals are often incredibly complex.

The second is Language. Less than half a million years ago, human language emerged from simpler forms of communication. Language is nothing but an elaborate, auditory form of sensing and signaling. Its more rudimentary forms are used by our primate cousins to warn, coax, plead, and threaten. In humans, these signals became more universal in expressive power, and more nuanced (despite recent examples to the contrary).

Godfrey-Smith, building on Hume, Vygotsky and others, notes that speech is not only for others, but for ourselves. Each of us has an inner dialogue that runs through our heads from wake to sleep, and even in our dreams. Our inner speech is inseparable from our conscious selves. In a beautiful passage, Godfrey-Smith writes, “inner speech is a way your brain creates a loop, intertwining the construction of thoughts and the reception of them.” This loop not only helps us direct our action, but it can clarify, integrate, and reinforce our conceptions. For Godfrey-Smith and others such as Baars and Dehaene (whom I’ll cover in a future post since Consciousness and the Brain is amazing), our inner speech is a necessary ingredient in our integrated subjective experience as human beings. It helps us to direct our thoughts in a deliberate, planful way – the “System 2 thinking” that Kahneman writes about in Thinking Fast and Slow.

The Sense-Signaling and Language revolutions were both forerunners of radical planetary change: in the first case, the Cambrian explosion; in the second, the rise to primacy of Homo sapiens.

Speaking of us: it is interesting to compare the revolutions described by Godfrey-Smith to those described by Yuval Harari in Sapiens: A Brief History of Humankind. Harari’s, as summarized by Wikipedia, are the following:

• “The Cognitive Revolution (c. 70,000 BCE, when Sapiens evolved imagination).
• The Agricultural Revolution (c. 10,000 BCE, the development of farming).
• The unification of humankind (the gradual consolidation of human political organisations towards one global empire).
• The Scientific Revolution (c. 1500 CE, the emergence of objective science).”

Harari’s Cognitive Revolution, in my view, maps reasonably well to Godfrey-Smith’s Language revolution. The remaining items in Harari’s list, when considered in Godfrey-Smith’s context, seem like nearly inevitable consequences of the first. Perhaps I am giving us too little credit, or perhaps too much.

## Four Things I Learned from Jack Dongarra

Opening the Washington Post today brought me a Proustian moment: encountering the name of Jack Dongarra. His op-ed on supercomputing involuntarily recalled to mind the dusty smell of the third floor MacLean Hall computer lab, xterm windows, clicking keys, and graphite smudges on spare printouts. Jack doesn’t know it, but he was a big part of my life for a few years in the 90s. I’d like to share some things I learned from him.

I am indebted to Jack. Odds are you are too. Nearly every data scientist on Earth uses Jack’s work every day, and most don’t even know it. Jack is one of the prime movers behind the BLAS and LAPACK numerical libraries, and many more. BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) are programming libraries that provide foundational routines for manipulating vectors and matrices. These routines range from the rocks and sticks of addition, subtraction, and scalar multiplication up to finely tuned engines for solving systems of linear equations, factorizing matrices, determining eigenvalues, and so on.

Much of modern data science is built upon these foundations. They are hidden by layers of abstractions, wheels, pips and tarballs, but when you hit bottom, this is what you reach. Much of ancient data science is built upon them too, including the solvers I wrote as a graduate student when I was first exposed to his work. As important as LAPACK and BLAS are, that’s not the reason I feel compelled to write about Jack. It’s more about how he and his colleagues went about the whole thing. Here are four lessons:

Layering. If you dig into BLAS and LAPACK, you quickly find that the routines are carefully organized. Level 1 routines are the simplest “base” routines, for example adding two vectors. They have no dependencies. Level 2 routines are more complex because they depend on Level 1 routines – for example multiplying a matrix and a vector (because this can be implemented as repeatedly taking the dot product of vectors, a Level 1 operation). Level 3 routines use Level 2 routines, and so on. Of course all of this is obvious. But we dipshits rarely do what is obvious, even these days. BLAS and LAPACK not only followed this pattern, they told you they were following this pattern.
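Here is a small sketch of that layering idea in plain Python: a dot product, a matrix-vector product built only from dot products, and a matrix-matrix product built only from matrix-vector products. The function names are my own illustrative choices, and this mirrors the conceptual layering described above rather than how tuned BLAS implementations are actually written.

```python
def dot(x, y):
    """Level 1 style: dot product of two vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(x, y))


def matvec(matrix, x):
    """Level 2 style: matrix-vector product as one dot product per row."""
    return [dot(row, x) for row in matrix]


def matmul(a, b):
    """Level 3 style: matrix-matrix product as one matvec per column of b."""
    columns = [list(col) for col in zip(*b)]            # columns of b
    result_columns = [matvec(a, col) for col in columns]
    return [list(row) for row in zip(*result_columns)]  # back to row-major


a = [[1, 2], [3, 4]]
mv = matvec(a, [5, 6])
mm = matmul(a, [[5, 6], [7, 8]])
```

Each level calls only the level beneath it, so a bug or a speedup in `dot` shows up everywhere above it without any other code changing.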

I guess I have written enough code to have acquired the habit of thinking this way too. I recall having to rewrite a hilariously complex beast of project scheduling routines when I worked for Microsoft Project, and I tried to structure my routines exactly in this way. I will spare you the details, but there is no damn way it would have worked had I not strictly planned and mapped out my routines just like Jack did. It worked, we shipped, and I got promoted.

Naming. Fortran seems insane to modern coders, but it is of course awesome. It launched scientific computing as we know it. In the old days there were tight restrictions on Fortran variable names: 1-6 characters from [a-z0-9]. With a large number of routines, how does one choose names that are best for programmer productivity? Jack and team zigged where others might have zagged and chose names with very little connection to English naming.

“All driver and computational routines have names of the form XYYZZZ”

where X represents data type, YY represents type of matrix, and ZZZ is a passing gesture at the operation that is being performed. So SGEMV means “single precision general matrix-vector multiplication”.
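Because the scheme is predictable, a name can be decoded mechanically. The sketch below is a hypothetical helper, `describe_routine`, with deliberately tiny lookup tables; the real libraries define many more matrix types and operations than the handful listed here, and the operation code can be shorter than three characters (as SGEMV shows).

```python
# Partial, illustrative lookup tables; not a complete list of BLAS/LAPACK codes.
DATA_TYPES = {
    "S": "single precision real",
    "D": "double precision real",
    "C": "single precision complex",
    "Z": "double precision complex",
}
MATRIX_TYPES = {
    "GE": "general",
    "SY": "symmetric",
    "TR": "triangular",
}
OPERATIONS = {
    "MV": "matrix-vector multiply",
    "MM": "matrix-matrix multiply",
    "SV": "solve a linear system",
    "TRF": "factorize",
}


def describe_routine(name):
    """Split an XYYZZZ-style routine name into its three documented parts."""
    name = name.upper()
    if not 4 <= len(name) <= 6:
        raise ValueError("expected a name of 4 to 6 characters")
    data_type, matrix_type, operation = name[0], name[1:3], name[3:]
    return {
        "data_type": DATA_TYPES.get(data_type, "unknown"),
        "matrix_type": MATRIX_TYPES.get(matrix_type, "unknown"),
        "operation": OPERATIONS.get(operation, "unknown"),
    }


sgemv = describe_routine("SGEMV")
dgesv = describe_routine("dgesv")
```

A decoder like this is only possible because the convention is documented and followed consistently, which is the whole argument for it.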

This scheme is not “intuitive” in the sense that it is not named GeneralMatrixVectorMultiply or general_matrix_vector_multiply, but it is predictable. There are no surprises and the naming scheme itself is explicitly documented. Developers of new routines have very clear guidance on how to extend the library. In my career I have learned that all surprises are bad, so sensible naming counts for a lot. I have noticed that engineers whom I respect also think hard about naming schemes.

Documentation. BLAS and LAPACK have always had comprehensive documentation. Every parameter of every routine is documented, the semantics of the routine are made clear, and “things you should know” are called out. This has set a standard that high quality libraries (such as the tidyverse and Keras – mostly) have carried forward, extending this proud and helpful tradition.

Pride in workmanship. I can’t point to a single website or routine as proof, but the pride in workmanship in Netlib has always shone through. It was in some sense a labor of love. This pride makes me happy, because I appreciate good work, and I aspire to good work. As a wise man once said:

Once a job is first begun,
Never leave it ’till it’s done.
Be the job great or small,
Do it right or not at all.

Jack Dongarra has done it right. That’s worth emulating. Read more about him here [pdf] and here.

## 2018 NCAA Tournament Picks

Every year since 2010 I have used data science to predict the results of the NCAA Men’s Basketball Tournament. In this post I will describe the methodology that I used to create my picks (full bracket here). The model has Virginia, Michigan, Villanova, and Michigan State in the Final Four with Virginia defeating Villanova in the championship game:

Here are my ground rules:

• The picks should not be embarrassingly bad.
• I shall spend no more than one work day on this activity (and 30 minutes for this post). This year I spent two hours cleaning up and running my code from last year.
• I will share my code and raw data. (The data is available on Kaggle. The code is not cleaned up but here it is anyway.)

I used a combination of game-by-game results and team metrics from 2003-2017 to build the features in my model. Here is a summary:

I also performed some post-processing:

• I transformed team ranks to continuous variables given a heuristic created by Jeff Sonas.
• Standard normalization.
• One hot encoding of categorical features.
• Upset generation. I found the results to be not interesting enough, so I added a post-processing function that looks for games where the win probability for the underdog (a significantly lower seed) is quite close to 0.5. In those cases the model picks the underdog instead.
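The upset-generation step could look something like the sketch below. The function name `pick_winner` and both thresholds (`min_seed_gap` and `closeness`) are illustrative placeholders, not values from the actual code, which is not shown in this post.

```python
def pick_winner(seed_a, seed_b, prob_a_wins, min_seed_gap=4, closeness=0.1):
    """Return "a" or "b".

    Normally the team with the higher win probability is picked. If the
    seeds are far apart but the predicted game is close to a coin flip,
    the underdog (the team with the larger seed number) is picked instead.
    """
    favorite = "a" if prob_a_wins >= 0.5 else "b"
    # In the NCAA bracket a larger seed number means a weaker seed.
    underdog_by_seed = "a" if seed_a > seed_b else "b"
    seed_gap = abs(seed_a - seed_b)
    is_close = abs(prob_a_wins - 0.5) <= closeness
    if seed_gap >= min_seed_gap and is_close:
        return underdog_by_seed
    return favorite


upset = pick_winner(3, 11, 0.55)      # big seed gap, near coin flip
blowout = pick_winner(1, 16, 0.95)    # big seed gap, lopsided game
even_seeds = pick_winner(8, 9, 0.52)  # close game, but seeds are close too
```

Keeping this as a separate post-processing function means the probabilities themselves stay untouched; only the bracket picks change.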

The model predicts the probability one team defeats another, for all pairs of teams in the tournament. The model is implemented in Python and uses logistic regression. The model usually performs well. Let’s see how it does this year!

## Advice for Underqualified Data Scientists

A talented individual seeking entry-level data science roles recently asked me for advice. “How can you show a potential employer that you’d be an asset when on paper your resume doesn’t show what other candidates have?”

I’ll stick to data science, but much of what I share applies to other roles, too.

Let’s think about the question first. Why do coursework and skills matter for employers? It depends. Different employers have different philosophies about how they evaluate candidates. Most job listings specify required skills and qualifications for applicants, for example “must have 3-5 years of experience programming in R or Python.” Usually there is more to the story. Sometimes employers don’t expect candidates to meet all the criteria. Other times, the criteria are impossible to meet.

In most situations, employers are looking for additional attributes not provided in the job listing. Some employers will tell you their philosophy by listing the attributes they value: “ability to deal with ambiguous situations”, “being a team player”, “putting the customer first”, “seeks big challenges”, and so on. Others don’t. Even if they tell you, you don’t typically know which attributes are most important. What really matters? If I am a so-so programmer but a brilliant statistician, do I have a shot?

Individuals who make hiring decisions have a mental image of how a successful candidate will perform on the job. This mental image includes possessing and using a certain set of skills. Qualifications such as a degree, a certificate, or code on GitHub provide part (but only part) of the evidence necessary to assure hiring managers that they are making a sound decision.

Let’s be simplistic and say that employers consider both “explicit skills” and “implicit skills”. Examples of explicit skills are demonstrated knowledge or capability with programming language X, technology Y, or methodology Z. Examples of implicit skills might be the ability to break down a complicated problem into its constituent parts, dealing with ambiguity, working collaboratively, and so on. Certainly some employers are very focused on finding candidates with explicit skills, sometimes to the exclusion of implicit skills.

A reframing of the question is then: “If I sense that a potential employer is looking for certain explicit skills and I don’t think I have them, what do I do?” Here are some ideas:

Provide evidence you are good at acquiring explicit skills. Give an example of learning an explicit skill. (“No, I don’t know R, but I know Python. In my blah blah class I had to learn Python so I could apply it to XYZ problem, and it was no big deal. I did ABC and now my code is up on GitHub. Learning R is really not a big deal, I’m confident I could hit the ground running. What would you have in mind for me for my first project?”)

Emphasize your implicit skills. Game plan about questions you’ll be asked and think about how you’d highlight what you believe to be your differentiating skills. (Without sounding like a politician.) By the way, now that I think about it, I followed my own advice when I interviewed at Market6 (now 84.51). I talked about the fact that I have worked in both software engineering and data science roles, and that made me uniquely qualified to work at a company that was trying to deliver data science at scale through SaaS offerings.

Do your own screening. Focus your search on employers who seem to value implicit skills. Rule out others. Do your research prior to applying. Ask friends or contacts. Early in your conversations with employers you can ask the recruiter about their philosophy. Not every job is right for you, so try and figure out which ones are.

That’s all I’ve got. I will close by telling two quick stories.

First story: My first job after finishing my PhD was as an entry level software engineer at Microsoft. When I interviewed, I was fortunate because Microsoft weighted implicit skills highly in their evaluation process. One of my favorite bosses at Microsoft was a classics major (as in Euripides, not the Stones). Another engineering manager started his career localizing dialog box messages into French. Oui, c’est vrai. Both had, and continue to have, a very strong set of implicit skills. They, in turn, looked for implicit skills. Talent comes in many different packages.

Second story: I believe that for early career stage positions it’s important to weight implicit skills more highly than explicit ones. Sometimes it’s a relief if certain explicit skills aren’t there! Several years ago, I had an entry level scientific researcher on my team who did not know how to code, in a position where lots of coding was required. This individual had very deep knowledge of optimization and statistics, was a hard worker, and was incredibly motivated. I was thrilled that they didn’t know how to code because then I could teach them! No bad habits!

## Two Frustrations With the Data Science Industry

I saw some serious BS about data science on LinkedIn last night. This is nothing new, but this time I couldn’t help myself. I went on a small rant:

I don’t give a shit if you call yourself a data scientist, an analyst, a machine learning practitioner, an operations research specialist, a data engineer, a modeler, a statistician, a code poet, or a squirrel. I don’t care if you have a PhD, if you went to MIT or a community college, if you were born on a farm or in a city, or if Andrew Ng DMs you for tips. I want to know what you can do, if you can share, if you can learn, if you can listen, and if you can stand for what is right even if it’s unpopular. If we’re good there, the rest we can figure out together.

I must have tapped into something, so I’d like to explain myself a bit more thoroughly.

My rant is rooted in two frustrations about data science.

My first frustration relates to overclassification. How many different terms can we use to refer to data scientists? I honestly don’t know. I have it on authority that there are six types of data scientists. No, wait, there are seven. Strike that, eight. Actually there are ten. Stop the insanity!

The industry itself is also subject to this kind of sillified stratification. I don’t know what the hell I do anymore. Is it operations research? Statistics? Analytics? Machine Learning? Artificial Intelligence? All of it? It depends which thought leadership piece I read. And what is the current state of this field, anyway? Are we in the age of Analytics 2.0? Or is it 3.0? Is big data saving the world, or is it the “trough of disillusionment”? I find all of this unhelpful.

Why is this happening? The use of computer models to learn from data has been around for at least five decades now, but data science has moved from an unnamed, specialized backwater into a rapidly growing and vital industry. This growth has created a market for teaching others about this hot new field. It has also led to the organization of a hierarchy of those who are “in the know” and those who are not. These are the factors driving the accelerating creation of labels and classifications.

However, knowing the names of things does not constitute understanding of essence; the proliferation of labels under the banner of “thought leadership” is often a gimmick; and as Martin Gardner said, inventing your own terminology is a sign of a crank. Debates about terminology often draw us away from doing good data science. Maybe it’s just me but sometimes I get the feeling these distractions are on purpose. They don’t help anyone solve any problems, that’s for sure.

The second frustration I have is overreliance on credentials. As opposed to academic or research positions, my own work in industry has been focused on the practical use of data science to address business problems. More often than not, I’ve worked as part of a team to get the job done. What matters for people like me is whether problems actually get solved, in a reasonable amount of time with a reasonable amount of expense.

I have encountered situations where employers would only consider applicants who had graduated from certain schools, or with certain degrees, or with a certain number of years of experience with a certain specific technical skill. All of these qualifications are proxies for what actually matters: whether someone can meaningfully contribute to team-based analytical problem solving. Focusing on proxies results in both Type I and Type II errors: hiring scientists with great credentials but an inability to deliver (“all hat and no cattle”), or even worse, missing out on the opportunity to hire the proverbial “unicorn” because they didn’t tick the right box. I’ve seen both happen. These proxies are not without their uses: if I really require the development of an MINLP solver to solve optimization models with a particular structure…the right candidate very likely has a PhD. The point is not to confuse correlation with causation. Having a PhD does not make me a great data scientist. Nor does GitHub, nor Coursera, nor Kaggle points. We need to dig deeper.

I suppose I should end positively. The last part of my rant was an appeal to inclusiveness and an appeal to pragmatism. Practical data science means making tradeoffs, large and small, every single day. It means seeing the big picture but also being willing to dig into the details. Let’s take this same practical mindset in growing our skills and building our teams.

The past couple of days I’ve been playing around with Facebook’s Prophet, a time series forecasting package.

I used Prophet to forecast quarterly sales of the Apple iPad, all in about 30 lines of Python. The repository for my code is here, and here’s a Jupyter notebook that walks through how it works.

It’s a lot of fun, and you get nice little visualizations like this one:

Check it out!

## 2017 NCAA Tournament Picks

Every year since 2010 I have used analytics to predict the results of the NCAA Men’s Basketball Tournament. I missed the boat on posting the model prior to the start of this year’s tournament. However, I did build and run a model, and I did submit picks based on the results. Here are my model’s picks – as I write this (before the Final Four) these picks are better than 88% of those submitted to ESPN.

Here are the ground rules I set for myself:

• The picks should not be embarrassingly bad.
• I shall spend no more than one work day on this activity (and 30 minutes for this post).
• I will share my code and raw data. (Here it is.)

I used a combination of game-by-game results and team metrics from 2003-2016 to build the features in my model. Here is a summary:

I also performed some post-processing:

• I transformed team ranks to continuous variables using a heuristic created by Jeff Sonas.
• Standard normalization.
• One hot encoding of categorical features.
• Upset generation. I found the results aesthetically displeasing for bracket purposes, so I added a post-processing function that looks for games between Davids and Goliaths (i.e. I compare seeds) where David and Goliath are relatively close in strength. For those games, I go with David.
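
The upset-generation step can be sketched roughly as follows. This is a minimal illustration, not the contents of ncaa.py: the dictionary keys, the `max_gap` threshold for "relatively close in strength", and the `min_seed_diff` threshold for what counts as a David-versus-Goliath game are all hypothetical.

```python
def apply_upsets(games, max_gap=0.05, min_seed_diff=5):
    """Pick winners, flipping close David-vs-Goliath games to the underdog.

    games: list of dicts with hypothetical keys
      'seed_a', 'seed_b' -- tournament seeds (lower number = stronger seed)
      'p_a'              -- model's probability that team A wins
    Returns a list of 'a' / 'b' picks, one per game.
    """
    picks = []
    for g in games:
        favorite = 'a' if g['p_a'] >= 0.5 else 'b'
        seed_diff = abs(g['seed_a'] - g['seed_b'])
        # "Close in strength" here means the win probability is near 50%.
        close = abs(g['p_a'] - 0.5) <= max_gap
        if seed_diff >= min_seed_diff and close:
            # David is the team with the worse (numerically higher) seed.
            picks.append('a' if g['seed_a'] > g['seed_b'] else 'b')
        else:
            picks.append(favorite)
    return picks


example_games = [
    {'seed_a': 5, 'seed_b': 12, 'p_a': 0.53},  # close game, big seed gap
    {'seed_a': 1, 'seed_b': 16, 'p_a': 0.97},  # big seed gap, not close
    {'seed_a': 8, 'seed_b': 9, 'p_a': 0.48},   # close, but seeds are similar
]
print(apply_upsets(example_games))
```

Only the first example game is flipped to the underdog; the other two follow the model's favorite.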

I submitted the model to the Kaggle NCAA competition, which asks for win probabilities for all possible tourney games, where submissions are scored by evaluating the log-loss of actual results and predictions. This naturally suggests logistic regression, which I used, since fitting a logistic regression amounts to minimizing log-loss on the training data. I also built a fancy-pants neural network model using Keras (which means to run my code you’ll need to get TensorFlow and Keras in addition to the usual Anaconda packages). The Keras model produces slightly better results in the log-loss sense. Both models predict something like 78% of past NCAA tournament games correctly.
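
For readers who have not seen it, here is a minimal sketch of how log-loss is computed for binary outcomes. The function name and the clipping constant `eps` are my own choices for illustration; the competition's scoring code may differ in details.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of 0/1 outcomes under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip so that a prediction of exactly 0 or 1 does not produce log(0).
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Predicting 50/50 for every game scores ln(2), regardless of the results.
print(log_loss([1, 0, 1], [0.5, 0.5, 0.5]))
```

The metric punishes confident wrong predictions heavily, which is why a good Kaggle submission and a good bracket are not quite the same thing.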

There are a couple of obvious problems with the model:

• I did not actually train the model on past tournaments, only on regular season games. That’s just because I didn’t take the time.
• It does not account for injuries.
• NCAA games are not purely “neutral site” games because sometimes game sites are closer to one team than another. I have code for this that I will probably use next year.
• I am splitting the difference between trying to create a good Kaggle submission and trying to create a “good” bracket. There are subtle differences between the two but I will spare you the details.

I will create a GitHub repo for this code…sometime. For now, you can look at the code and raw data files here. The code is in ncaa.py.

## Noise Reduction Methods for Large-Scale Machine Learning

I have two posts remaining in my series on “Optimization Methods for Large-Scale Machine Learning” by Bottou, Curtis, and Nocedal. You can find the entire series here. These last two posts will discuss improvements on the base stochastic gradient method. Below I have reproduced Figure 3.3, which suggests two general approaches. I will cover noise reduction in this post, and second-order methods in the next.

The left-to-right direction on the diagram signifies noise reduction techniques. We say that the SG search direction is “noisy” because it includes information from only one (randomly generated) sample per iteration. We use a noisy direction, of course, because it’s too expensive to use the entire gradient. But we can consider using a small batch of samples per iteration (a “minibatch”), or using information from previous iterations. The idea here is to find a happy medium between the far left of the diagram, which represents one sample per iteration, and the right, which represents using the full gradient.
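
To make the "noise" concrete, here is a small sketch on a toy one-dimensional least-squares problem. The data, the function names, and the batch sizes are all assumptions made up for this illustration. It measures how far a minibatch gradient typically lands from the full gradient as the batch size grows.

```python
import random


def sample_grad(w, x, y):
    """Gradient with respect to w of 0.5 * (w*x - y)**2 for a single sample."""
    return (w * x - y) * x


def full_grad(w, data):
    """Average gradient over the whole dataset (the right edge of the diagram)."""
    return sum(sample_grad(w, x, y) for x, y in data) / len(data)


def minibatch_grad(w, data, batch_size, rng):
    """Average gradient over a random minibatch drawn without replacement."""
    batch = rng.sample(data, batch_size)
    return sum(sample_grad(w, x, y) for x, y in batch) / batch_size


def grad_spread(w, data, batch_size, rng, trials=500):
    """Root-mean-square distance between minibatch gradients and the full gradient."""
    g_full = full_grad(w, data)
    sq = [(minibatch_grad(w, data, batch_size, rng) - g_full) ** 2
          for _ in range(trials)]
    return (sum(sq) / trials) ** 0.5


rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(200)]
data = [(x, 3.0 * x + rng.gauss(0, 0.5)) for x in xs]

for b in (1, 16, 200):
    print(b, grad_spread(0.0, data, b, rng))
```

A batch size of 1 is plain SG, and a batch size equal to the dataset size recovers the full gradient, so the spread shrinks toward zero as you move from left to right.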

Section 5 describes several noise reduction techniques. Dynamic sample size methods vary the number of samples in a minibatch per iteration, for example by increasing the batch size geometrically with the iteration count. Gradient aggregation, as the name suggests, involves the use of gradient information from past iterations. The SVRG method involves starting with a full batch gradient, then for subsequent iterations updating the gradient using gradient information at a single sample. The SAGA method involves “taking the average of stochastic gradients evaluated at previous iterates”. Finally, iterate averaging methods use the iterates from multiple previous steps to update the current iterate.
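
Here is a minimal sketch of the SVRG idea on a toy one-dimensional least-squares problem. It follows the standard textbook form of the method as I understand it; the step size, epoch count, and data are assumptions chosen for illustration, not values taken from the paper.

```python
import random


def grad_i(w, x, y):
    """Gradient with respect to w of 0.5 * (w*x - y)**2 for one sample."""
    return (w * x - y) * x


def svrg(data, w0=0.0, step=0.1, epochs=30, inner_steps=None, seed=0):
    """Stochastic variance reduced gradient on a scalar parameter w."""
    rng = random.Random(seed)
    n = len(data)
    m = inner_steps or n
    w = w0
    for _ in range(epochs):
        # Start each epoch with a full batch gradient at a snapshot point.
        w_snap = w
        mu = sum(grad_i(w_snap, x, y) for x, y in data) / n
        for _ in range(m):
            # Correct the full gradient using gradient information at one sample.
            x, y = data[rng.randrange(n)]
            g = grad_i(w, x, y) - grad_i(w_snap, x, y) + mu
            w -= step * g
    return w


# Noise-free toy data generated from y = 2x, so the minimizer is w = 2.
toy = [(0.5, 1.0), (1.0, 2.0), (1.5, 3.0), (2.0, 4.0)]
print(svrg(toy))
```

The extra state here is just the snapshot and its full gradient. SAGA-style methods instead keep stored gradients for previous samples, which is where the storage cost comes from.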

The motivations behind these various noise reduction methods are more or less the same: make more progress on a single step without paying too much of a computational cost. The primary tradeoff, in addition to the increased computational cost per iteration, is the extra storage needed to keep the additional state used to compute the search direction. Section 5 of the paper discusses these tradeoffs in light of convergence criteria.

[Updated 8/24/2016] Going back to our diagram, the up and down dimension of the diagram represents so-called “second-order methods”. Gradient-based methods, including SG, are first-order methods because they use a first-order (linear) approximation to the objective function we want to optimize. Second-order methods attempt to look at the curvature of the objective function to obtain better search directions. Once again, there is a tradeoff: using the curvature is more work, but we hope that by computing better search directions we’ll need far fewer iterations to get a good solution. I had originally intended on covering Section 6 of the paper, which describes several such methods in detail, but I will leave the interested reader to dig through that section themselves!
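
As a small taste of that tradeoff, here is a sketch comparing plain gradient descent with a Newton step on a badly scaled quadratic. The objective f(x, y) = 0.5 * (x² + 100y²), the step size, and the tolerance are assumptions picked for illustration; this example does not come from the paper.

```python
def grad(p):
    """Gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    x, y = p
    return (x, 100.0 * y)


def grad_norm(p):
    gx, gy = grad(p)
    return (gx * gx + gy * gy) ** 0.5


def gradient_descent(p, step=0.01, tol=1e-6, max_iter=100000):
    """First-order method: step along the negative gradient."""
    for k in range(max_iter):
        if grad_norm(p) < tol:
            return p, k
        gx, gy = grad(p)
        p = (p[0] - step * gx, p[1] - step * gy)
    return p, max_iter


def newton(p, tol=1e-6, max_iter=100):
    """Second-order method: rescale the gradient by the inverse Hessian."""
    for k in range(max_iter):
        if grad_norm(p) < tol:
            return p, k
        gx, gy = grad(p)
        # The Hessian is diag(1, 100), so its inverse divides each component.
        p = (p[0] - gx / 1.0, p[1] - gy / 100.0)
    return p, max_iter


_, gd_iters = gradient_descent((1.0, 1.0))
_, newton_iters = newton((1.0, 1.0))
print("gradient descent iterations:", gd_iters)
print("Newton iterations:", newton_iters)
```

On an exact quadratic, Newton's method uses the curvature to jump straight to the minimizer, while gradient descent is held back by the step size that the steep direction allows. In large-scale problems, forming and inverting the Hessian is the expensive part, which is the extra work referred to above.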