Two Frustrations With the Data Science Industry

I saw some serious BS about data science on LinkedIn last night. This is nothing new, but this time I couldn’t help myself. I went on a small rant:

I don’t give a shit if you call yourself a data scientist, an analyst, a machine learning practitioner, an operations research specialist, a data engineer, a modeler, a statistician, a code poet, or a squirrel. I don’t care if you have a PhD, if you went to MIT or a community college, if you were born on a farm or in a city, or if Andrew Ng DMs you for tips. I want to know what you can do, if you can share, if you can learn, if you can listen, and if you can stand for what is right even if it’s unpopular. If we’re good there, the rest we can figure out together.

I must have tapped into something, so I’d like to myself a bit more thoroughly.

My rant is rooted in two frustrations about data science.

My first frustration relates to overclassification. How many different terms can we use to refer to data scientists? I honestly don’t know. I have it on authority that there are six types of data scientists. No, wait, there are seven. Strike that, eight. Actually there are ten. Stop the insanity!

Susan-Powter-image

The industry itself is also subject to this kind of sillified stratification. I don’t know what the hell I do anymore. Is it operations research? Statistics? Analytics? Machine Learning? Artificial Intelligence? All of it? It depends which thought leadership piece I read. And what is the current state of this field, anyway? Are we in the age of Analytics 2.0? Or is it 3.0? Is big data saving the world, or is it the “trough of disillusionment”? I find all of this unhelpful.

Why is this happening? The use of computer models to learn from data has been around for at least five decades now, but data science has moved from an unnamed, specialized backwater into a rapidly growing and vital industry. This growth has created a market for teaching others about this hot new field. It has also led to the organization of a hierarchy of those who are “in the know” and those who are not. These are the factors driving the accelerating creation of labels and classifications.

However, knowing the names of things does not constitute understanding of essence; the proliferation of labels under the banner of “thought leadership” is often a gimmick; and as Martin Gardner said, inventing your own terminology is a sign of a crank. Debates about terminology often draw us away from doing good data science. Maybe it’s just me but sometimes I get the feeling these distractions are on purpose. They don’t help anyone solve any problems, that’s for sure.

The second frustration I have is overreliance on credentials. As opposed to academic or research positions, my own work in industry has been focused on the practical use of data science to address business problems. More often than not, I’ve worked as part of a team to get the job done. What matters for people like me is whether problems actually get solved, in a reasonable amount of time with a reasonable amount of expense.

I have encountered situations where employers would only consider applicants who had graduated from certain schools, or with certain degrees, or with a certain number of years of experience with a certain specific technical skill. All of these qualifications are proxies for what actually matters: whether someone can meaningfully contribute to team-based analytical problem solving. Focusing on proxies results in both Type I and Type II errors: hiring scientists with great credentials but an inability to deliver (“all hat and no cattle“), or even worse, missing out on the opportunity to hire the proverbial “unicorn” because they didn’t tick the right box. I’ve seen both happen. These proxies are not without their uses: if I really require the development of an MINLP solver to solve optimization models with a particular structure…the right candidate very likely has a PhD. The point is not to confuse correlation with causation. Having a PhD does not make me a great data scientist. Nor does github, nor Coursera, nor Kaggle points. We need to dig deeper.

I suppose I should end positively. The last part of my rant was an appeal to inclusiveness and an appeal to pragmatism. Practical data science means making tradeoffs, large and small, every single day. It means seeing the big picture but also being willing to dig into the details. Let’s take this same practical mindset in growing our skills and building our teams.

Advertisements

Forecasting iPad Sales using Facebook’s Prophet

The past couple of days I’ve been playing around with Facebook’s Prophet, a time series forecasting package.

I used Prophet to forecast quarterly sales of the Apple iPad, all in about 30 lines of Python. The repository for my code is here, and here’s a Jupyter notebook that walks through how it works.

It’s a lot of fun, and you get nice little visualizations like this one:

iPad

Check it out!

 

2017 NCAA Tournament Picks

Every year since 2010 I have used analytics to predict the results of the NCAA Men’s Basketball Tournament. I missed the boat on posting the model prior to the start of this year’s tournament. However, I did build and run a model, and I did submit picks based on the results. Here are my model’s picks – as I write this (before the Final Four) these picks are better than 88% of those submitted to ESPN.

Here are the ground rules I set for myself:

  • The picks should not be embarrassingly bad.
  • I shall spend no more than one work day on this activity (and 30 minutes for this post).
  • I will share my code and raw data. (Here it is.)

I used a combination of game-by-game results and team metrics from 2003-2016 to build the features in my model. Here is a summary:

I also performed some post-processing:

  • I transformed team ranks to continuous variables given a heuristic created by Jeff Sonos.
  • Standard normalization.
  • One hot encoding of categorical features.
  • Upset generation. I found the results aesthetically displeasing for bracket purposes, so I added a post-processing function that looks for games between Davids and Goliaths (i.e. I compare seeds) where David and Goliath are relatively close in strength. For those games, I go with David.

I submitted the model to the Kaggle NCAA competition, which asks for win probabilities for all possible tourney games, where submissions are scored by evaluating the log-loss of actual results and predictions. This naturally suggests logistic regression, which I used. I also built a fancy pants neural network model using Keras (which means to run my code you’ll need to get TensorFlow and Keras in addition to the usual Anaconda packages). Keras produces slightly better results in the log-loss sense. Both models predict something like 78% of past NCAA tournament games correctly.

There are a couple of obvious problems with the model:

  • I did not actually train the model on past tournaments, only on regular season games. That’s just because I didn’t take the time.
  • Not accounting for injuries.
  • NCAA games are not purely “neutral site” games because sometimes game sites are closer to one team than another. I have code for this that I will probably use next year.
  • I am splitting the difference between trying to create a good Kaggle submission and trying to create a “good” bracket. There are subtle differences between the two but I will spare you the details.

I will create a github repo for this code…sometime. For now, you can look at the code and raw data files here. The code is in ncaa.py.

Noise Reduction Methods for Large-Scale Machine Learning

I have two posts remaining in my series on “Optimization Methods for Large-Scale Machine Learning” by Bottou, Curtis, and Nocedal. You can find the entire series here. These last two posts will discuss improvements on the base stochastic gradient method. Below I have reproduced Figure 3.3, which suggests two general approaches. I will cover noise reduction in this post, and second-order methods in the next.

SGDirections

The left-to-right direction on the diagram signifies noise reduction techniques. We say that the SG search direction is “noisy” because it includes information from only one (randomly generated) sample per iteration. We use a noisy direction, of course, because it’s too expensive to use the entire gradient. But we can consider using a small batch of samples per iteration (a “minibatch”), or using information from previous iterations. The idea here is to find a happy medium between the far left of the diagram, which represents one sample per iteration, and the right, which represents using the full gradient.

Section 5 describes several noise reduction techniques. Dynamic sample size methods vary the number of samples in a minibatch per iteration, for example by increasing the batch size geometrically with the iteration count. Gradient aggregation, as the name suggests, involves the use of gradient information from past iterations. The SVRG method involves starting with a full batch gradient, then for subsequent iterations updating the gradient using gradient information at a single sample. The SAGA method involves “taking the average of stochastic gradients evaluated at previous iterates”. Finally, iterate averaging methods use the iterates from multiple previous steps to update the current iterate.

The motivations behind these various noise reduction methods are more or less the same: make more progress on a single step without paying too much of a computational cost. The primary tradeoff, in addition to increased computational cost per iteration, of course, is the extra storage associated with keeping extra state to compute search direction. Section 5 of the paper discusses these tradeoffs in light of convergence criteria.

[Updated 8/24/2016] Going back to our diagram, the up and down dimension of the diagram represents so-called “second-order methods”. Gradient-based methods, including SG, are first-order methods because they use a first-order (linear) approximation to the objective function we want to optimize. Second-order methods attempt to look at the curvature of the objective function to obtain better search directions. Once again, there is a tradeoff: using the curvature is more work, but we hope that by computing better search directions we’ll need far fewer iterations to get a good solution. I had originally intended on covering Section 6 of the paper, which describes several such methods in detail, but I will leave the interested reader to dig through that section themselves!

Analyses of Stochastic Gradient Methods

I am continuing my series on “Optimization Methods for Large-Scale Machine Learning” by Bottou, Curtis, and Nocedal. You can find the entire series here.

Last time we discussed stochastic and batch methods for optimization in machine learning. In both cases we’re trying to optimize a loss function that will give us good learning parameters for a deep learning model. We do this iteratively. A pure stochastic gradient (SG) approach picks one sample per iteration to define a search direction, whereas a batch method will pick multiple samples.

In this post I want to cover most of Section 4, which concerns the convergence and complexity of SG methods. For completeness I have reproduced the authors’ summary of a general purpose SG algorithm below:

Choose an iterate w_1
for k = 1, 2, … do
   generate a random number s_k
   compute a stochastic vector g(w_k, s_k)
   choose a stepsize a_k > 0.
   w_k+1 <- w_k – a_k g(w_k, s_k)

In this algorithm, g is the search direction. In this series, we’ve already discussed three different choices for g:

  • g is the gradient. This is conventional gradient descent. In this case the random number is ignored.
  • g is the gradient for a single sample. This is conventional stochastic gradient.
  • g is the gradient over a subset of the samples. This is “mini-batch” SG.

As the authors show, it’s not easy to make definitive statements on comparison between mini batch and SG. The gist of what they show is that SG has better convergence properties in theory, but batch methods can provide certain practical advantages. That is: “one can, however, realize benefits of mini-batching in practice since it offers important opportunities for software optimization and parallelization; e.g., using sizeable mini-batches is often the only way to fully leverage a GPU processor.” The rest of Section 4 spells this out in more detail through convergence results and complexity analysis.

As a practitioner, I honestly don’t place huge weight on the convergence theorems presented in Section 4. The key result for me is Theorem 4.9, which states that for a nonconvex objective (which we have in most deep learning scenarios) the SG method converges when the stepsize diminishes according to a somewhat loosely defined schedule.

More interesting (for me) is Section 4.4, which discusses the overall work complexity for applying SG to deep learning scenarios. In many real-world big data scenarios, we’re optimizing a loss function using previously trained model parameters under a particular computational time limit. This is different from many traditional optimization scenarios, where we let the code run until we achieve a particular solution accuracy. In our situation, the total expected error in the model is the sum of three components: the expected risk using optimal parameters, the estimated error in the expected and empirical risk, and the optimization accuracy. Minimizing this error involves tradeoffs, for example “if one decides to make the optimization more accurate … one might need to make up for the additional computing time by: (i) reducing the sample size n, potentially increasing the estimation error; or (ii) simplifying the function family H, potentially increasing the approximation error.” These are familiar techniques for machine learning practitioners; the benefit provided here is a more formal characterization of how these techniques impact the overall solution error.

Returning to the choice of conventional versus batch approaches, the discussion in Section 4.5 shows that “Even though a batch approach possesses a better dependency on epsilon, this advantage does not make up for its dependence on n. […] In conclusion, we have found that a stochastic optimization algorithm performs better in terms of expected error, and, hence, makes a better learning algorithm in the sense considered here.” Again, even this statement could be mitigated somewhat by the computational benefits associated with batching on a particular computational infrastructure, that is, the benefits of GPUs and parallelism.

Section 4.5 is a commentary on some of the remaining challenges and questions that must be confronted when using SG for large-scale machine learning. Since the issues raised in Section 4.5 pair nicely with the discussion in Section 5, I’ll save both for next time.

Stochastic and Batch Methods for Machine Learning

I am continuing my series on “Optimization Methods for Large-Scale Machine Learning” by Bottou, Curtis, and Nocedal. You can find the entire series here.

In previous posts we discussed the use of deep neural networks (DNNs) in machine learning, and the pivotal role of the optimization of carefully selected prediction functions in training such models. These topics roughly correspond to the first two sections of the paper.

Section 3 provides an overview of optimization methods appropriate for DNNs. We begin where we left off in Section 2 by assuming a family of prediction functions parameterized by w. In other words, w represents the learning parameters. We also assume a loss function that depends on predicted and actual values, for example misclassification rate. Typically we’re minimizing the loss function f over a set of samples – the empirical risk. We call this R_n(w), which in turn is the average of the loss for each sample: (1/n) Ʃ_i∇f_i(). Given all of this, there are two fundamental paths for minimizing risk. In each case we iteratively improve the learning parameters step by step, where we write the parameters at step k as w_k.

  • stochastic: w_k+1 ← w_k − α_k ∇f_ik (w_k) where the index ik is chosen randomly.
  • batch: w_k+1 ← w_k − (α_k / n) Ʃ_i∇f_i(w_k)

The difference between the two paths is how many samples we consider on each step. The tradeoff is between per iteration cost (where stochastic wins) versus per iteration improvement (where batch wins). Stochastic algorithms a good choice for DNNs because they employ information more efficiently. If training samples are the same, or similar, than the added value of per step improvement from batch is not going to be worth it, because some (or most) of the added value turns out to be redundant. “If one believes that working with only, say, half of the data in the training set is sufficient to make good predictions on unseen data, then one may argue against working with the entire training set in every optimization iteration. Repeating this argument, working with only a quarter of the training set may be useful at the start, or even with only an eighth of the data, and so on. In this manner, we arrive at motivation for the idea that working with small samples, at least initially, can be quite appealing.”

The authors summarize: “there are intuitive, practical, and theoretical arguments in favor of stochastic over batch approaches in optimization methods for large-scale machine learning”. I will omit a summary of the theoretical motivations, but they are found in Section 3.3.

Next the authors consider improving on SG. One path involves trying to realize the best of both worlds between batch and stochastic methods, namely preserving the low per-iteration cost of stochastic methods while improving the per-iteration improvement. Another path is to “attempt to overcome the adverse effects of high nonlinearity and ill-conditioning”.  This involves trying to employ information beyond just the gradient. We’ll examine these alternatives in future posts in this series, when we get to later sections in the paper.

Model Building for Large-Scale Machine Learning

In this post on my series on “Optimization Methods for Large-Scale Machine Learning” by Bottou, Curtis, and Nocedal, I want to focus on model building in machine learning.

Section 2 of the paper describes several case studies, with the purpose of showing how “the process of machine learning leads to the selection of a prediction function through solving an optimization problem.” A prediction function is a mathematical function that links the model inputs to the quantity we wish to predict. From the practitioner’s point of view, a prediction function is implicitly specified by the technique the data scientist has chosen (for example, regression or neural networks) and trained model parameters (what is actually learned when the technique is applied to data).

For example, the structure of a neural network amounts to a description of a family of related functions. In the diagram below I have given two simple neural networks with corresponding prediction functions. The first simply adds the two inputs together. The second specifies a linear function involving a vector of inputs and training parameters W and b.

NeuralNetwork

Training the neural network amounts to choosing a particular function from the family corresponding to the nodes. Neural networks are interesting because they yield “large-scale, highly nonlinear, and nonconvex optimization problems”. For optimization practitioners, the “nonconvex” part of this statement is important because nonconvex optimization problems are particularly challenging. Here is a snippet from Stephen Boyd’s Convex Optimization I class that makes the point well.

With this in mind we may be tempted to avoid neural networks, and deep learning, altogether. However, as section 2.2 points out, certain classification tasks, like those involving speech and images, are “not well performed in an automated manner using computer programs based on sets of prescribed rules.” Deep neural networks (DNNs) involve many internal layers of manipulations and transformations, which lead to very flexible, highly parameterized models. Therefore while the corresponding optimization models for DNN are really damn hard, the potential payoff is worth it.

When a machine learning application is trying to classify data, for example in handwriting recognition, it is typical to minimize a function that relates to the misclassification rate. There are various choices for the specific function, as noted here and here (for empirical risk minimization). While we want to minimize a loss function relating to the misclassification rate, we also want classifiers that are general. In other words, if they work great on the data that we have at the time the classifier is learned, but poorly on data that comes in after that, our classifier is not very useful. For this we often divide our data into training, validation, and testing sets. Read here for more.

Section 2.3 considers the determination of a prediction function that accurately predicts outputs given inputs. We want this function to work well over the set of inputs that we will see in the real world, not just the training set. Therefore “one should choose the prediction function h by attempting to minimize a risk measure over an adequately selected family of prediction functions”. A family of functions can be described in many ways, for example as a particular functional form with parameters in it as in m x + b for parameters (m, b). Adequately selected means:

  • Able to achieve low empirical risk by choosing a rich family of functions or by using knowledge about the problem domain.
  • The gap between expected and empirical risk should be small, that is, they should not be biased towards or underfit the input data.
  • Chosen so the resulting optimization problem can be solved efficiently.

These considerations are at odds with one another as some point towards broader, more complicated families of functions and others simpler. With regards to the first consideration, increasing the number of training samples is helpful. So is choosing a function family with a high “capacity”, which can be loosely described as a function’s “complexity, expressive power, richness, or flexibility.

Having considered what makes a good prediction function, the authors next consider procedures for finding them. The approach considered in Section 2.3 is called structural risk minimization – here is a good overview. A nice visual representation is given in Figure 2.5, but the point is to avoid both underfitting and overfitting. Underfitting happens which happens when the observed empirical risk (the frequency of observed misclassification) is high. This happens when the prediction function is insufficiently expressive to link inputs to outputs, which can happen when the network structure doesn’t make sense or is too simplistic. Overfitting happens when increasing the number or complexity of the model parameters begins to increase the misclassification rate in real-world data. This can happen even as the misclassification rate on our training data decreases. In other words, the model no longer effectively generalizes to the real world – it is too highly tuned to the data at hand. All of this implies that picking functional families that give good empirical risk may be counterproductive. The remedy to this situation is to split input data into training, validation, and testing sets, as alluded to in the first post in this series.

In the next post in this series, we’ll cover Section 3, which describes the optimization methods used to train these models.