A simple prediction algorithm for the NCAA tournament

A friend of mine asked me if I wanted to participate in an NCAA tournament pool. The twist: you have to write a program to predict the results. Here are the rules I was given:

The algorithm for your [Java] code goes below. As you can see in the method header, you are passed two objects: Team A and Team B. Spend some time thinking about this and writing your algorithm out, as once you submit your code you won’t get any feedback until your bracket is sent to you. The team class is a class defined by us, here’s what it looks like:

public class Team {
    public String name;    
    public int seed, RPI_rank;    
    public double points_scored, points_against, wins, losses, RPI;        
    public Team(String n, int s, double rpi, int rpi_rank, int w, int l, 
      double ps, double pa) 
    {
      name = n; seed = s; RPI = rpi; RPI_rank = rpi_rank;
      wins = w; losses = l;     points_scored = ps; points_against = pa;
    }
   }

I need to fill this in:

public Team Game(Team A, Team B, int round) {
    // Fill me in
  }

I gave myself one hour to come up with something. For better or worse, here’s what I did. I decided that I wouldn’t "cheat", i.e. use any information other than what I have been given. Otherwise a reasonable approach would be to have a giant switch statement based on name, and look up the Sagarin ratings for each team! I also noticed right away that I don’t have any information about past games between teams. So it seems clear that I need to have a "scoring" based approach, where I compute a metric for team A and Team B and return the team with the higher score.

My first idea was to try and come up with a metric based on points_scored and points_against. I remembered from Mathletics that there is a "pythagorean expectation" formula for predicting win percentage. I quickly learned that there is a variant for basketball. I have a big problem: I don’t have the "defensive and offensive efficiencies", I only have average points for and against. A simple hack is to scale these values by the average number of points scored per game this year, which appears to be 137.3137725. (I downloaded stats from ncaa.org as CSV and threw them in an Excel spreadsheet). Once you have the "normalized" points per game on offense and defense, you can apply the formula on the wikipedia page with 11.5 as the exponent. Here are the first few records:

Name      OPP PTS    OPP PPG    PPG    TotalPPG    NormPPG    NormOpp    Pythag
Kansas    2169    63.8    81.8    145.6    77.1446    60.16908    0.945732
Murr. St. 2056    60.5    77.5    138    77.11461    60.19915    0.945204
BYU       2216    65.2    83    148.2    76.90312    60.41064    0.941358
Duke      2100    61.8    78    139.8    76.61283    60.70093    0.935671
Cst. Car. 2038    59.9    74.6    134.5    76.16065    61.15312    0.925796
Utah St.  2028    59.6    73.7    133.3    75.91916    61.39460    0.919973
Syracuse  2140    66.9    81.5    148.4    75.41153    61.90223    0.906370
Kentucky  2219    65.3    79.2    144.5    75.26125    62.05252    0.901971

You can see the problem: this metric doesn’t account for quality of opposition. Teams that beat up on bad teams will be unjustly rewarded. Murray State has had a good year, but they are not the second best team in the country! So I decided to weight this factor equally with RPI. RPI attempts to take strength of schedule into account, and is one of the factors the tournament selection committee takes into account. Let’s look at this same list of teams once I incorporate the RPI:

Name      OPP PPG PTS    PPG    TotalPPG    NormPPG    NormOpp    Pythag    RPI    Score
Kansas    63.8    2780    81.8    145.6    77.14468    60.16908    0.945732    0.688    1.633732
Duke      61.8    2653    78    139.8    76.61283    60.70093    0.935671    0.664    1.599671
Kentucky  65.3    2694    79.2    144.5    75.26125    62.05252    0.901971    0.666    1.567971
Syracuse  66.9    2607    81.5    148.4    75.41153    61.90223    0.906374    0.651    1.557374
BYU       65.2    2821    83    148.2    76.90312    60.41064    0.941358    0.61    1.551358
Utah St.  59.6    2506    73.7    133.3    75.91916    61.39460    0.919973    0.602    1.521973
Murr. St. 60.5    2635    77.5    138    77.11461    60.19915    0.945204    0.575    1.520204
Cst. Car. 59.9    2535    74.6    134.5    76.16065    61.15312    0.925796    0.519    1.444796

That’s not totally crazy! My hour was pretty much up at that point, so I went with it. Here’s the code (in C#):

public static Team Game(Team A, Team B, int round) {
      // Fake the 'pythagorean calculation' and weight it equally with RPI.
      double avg_points_total = 137.31377245509;

      double a_points_total = A.points_scored + A.points_against;
      double a_adj_points_scored = (avg_points_total / a_points_total) * A.points_scored;
      double a_adj_points_against = (avg_points_total / a_points_total) * A.points_against;
      double a_pythag = Math.Pow(a_adj_points_scored, 11.5) / (Math.Pow(a_adj_points_scored, 11.5) + Math.Pow(a_adj_points_against, 11.5));
      double a_score = a_pythag + A.RPI;

      double b_points_total = B.points_scored + B.points_against;
      double b_adj_points_scored = (avg_points_total / b_points_total) * B.points_scored;
      double b_adj_points_against = (avg_points_total / b_points_total) * B.points_against;
      double b_pythag = Math.Pow(b_adj_points_scored, 11.5) / (Math.Pow(b_adj_points_scored, 11.5) + Math.Pow(b_adj_points_against, 11.5));
      double b_score = b_pythag + B.RPI;

      return a_score >= b_score ? A : B;
    }

And here is the resulting bracket.  The Final Four is Kansas, Syracuse, Kentucky, and Duke.

NCAA picks

Not crazy, uses only the information I have been given, and more fun than just using the RPI directly! Mission accomplished. Here is a link to my picks on ESPN.com.

Nash equilibria and 4th down, continued

I left a small cliffhanger in my last post. After a long week I finally had a chance to read through the Adams paper about estimating the value of “going for it” on 4th down.  I admit I was a little bit let down.   As a reminder – the question is what action a football team should take on fourth down.  Failure to gain the necessary yards means the ball is turned over to the opposing side, kicking turns over the ball but with better field position, and making the first down allows the drive to continue, potentially leading to more points.  The conclusion of the Romer paper was that coaches are too conservative and kick the ball away in situations where they should go for it instead.

Adams hits the nail on the head by asserting that the results of the Romer paper just do not pass the “smell test”.  It’s nuts to suggest that it’s a good idea to go for it on 4th and 4 on your own 25 yard line.  But that leaves us only with more questions – is the conclusion of the Romer paper still valid, even if overstated?  Can we identify a flaw in the reasoning?  Is there a better way to model the problem?

Adams first suggestion for improving the model is to include more historical data.  Adams and Romer both claim it’s hard to come up with a good model for the “going for it” problem because teams seldom go for it on fourth down in practice – data is hard to come by.    Romer and Adams both use game data from the 1998 – 2000 seasons, but Adams uses data from the entire game, not just the first quarter.  But why not include more recent data?  [The Adams paper was written in ’06, so he could have doubled the data set.  We have a couple more seasons-worth of data now.]  So I’m not sure I even buy the premise that data is lacking.

Adams’ second approach is to use Madden ’07 to simulate 4th down situations.  I initially thought this was a really cool idea, and it kind of is, and then I remembered something I once read.  Madden himself asked the designers at EA to make 4th downs more difficult to convert!  You cannot find a better example of Galbraith’s notion of “conventional wisdom” in action.  So as far as I am concerned, you have to throw out the middle section of the paper.  Madden is not a simulation: it is pretending to be a simulation.  It wants to make you feel like you are experiencing real NFL football.  But the problem is that we as players do not make decisions the way that GMs, coaches, and players do.  Our motivations are completely different, and there are no real consequences for our actions (other than bragging rights over your roommate).  My GM will not fire me if I go for it on 4th and 5 on my own 25.  Thus the game must be tuned to correct for this, otherwise you will get Tecmo-like gameplay.

The last section proposes a game-theoretic approach.  Adams introduces a zero-sum game with the offense and defense as opponents.  The offense and defense both have the choice of choosing a pass- or run-oriented strategy.  The payoffs depend on their choices.  Adams points out that this is a “simplified version of reality.”  (It’s very close to the original Tecmo Bowl – two choices instead of four.)  He uses this approach primarily to make the point that it is not a good idea (as Romer proposes) to use third down data to model fourth down choices, because the payoffs change enough to matter.  It is an interesting line of argument for the claim that Romer’s conclusions are overstated, but it does not provide insight into how to better model the problem.

Anyway, in the course of poking around the web I came across the ZEUS Football simulation engine. It is frequently referenced in the NYTimes “5th down” blog. For example, here is an interesting discussion about taking an intentional safety late in the game.  (I won’t bother to explain what that means, because if you have made it this far, you clearly already know what I am talking about.)

Int’l Symposium on Mathematical Programming 2009, Day 2

Here are my notes from the second day of ISMP.  (Here are my day 1 notes.)  I spent the whole day going to IPM talks:

Meszaros: Unfortunately I missed the plenary talk by Friedrich Eisenbrand due to microbrews.  I attended an IPM session after that – first up was Csaba Meszaros – interesting to hear from him because BPMPD has been such an impressive code for so many years.  His talk was about recent improvements in BPMPD.  BPMPD has been in continuous development for 15 years or so.  His recent work appears to focus on a) QCQP b) exploiting “hypersparsity”.  He is using a primal-dual log barrier method  – not a primal-dual HSD method as we at Solver Foundation have chosen for our SOCP solver.  So his handling of starting points and infeasibility detection is different.  He went into the tradeoffs between using augmented and normal systems for the search direction, which affect both performance and numerical stability.  He has implemented both and has logic on top to determine which to use.  He also talked about QCQP presolve, which often whittles down the size of “real” models substantially.
 
An interesting bit of trivia: he showed a table that compared lines of code in various modules, “then and now”:

 	2009	1981
Ordering	5700	155
Cholesky	2825	100
Backsolve	820	45

I think the difference is indicative of both Csaba’s commitment to improving BPMPD as well as the number of practical issues you need to think about when building an industrial grade solver.
 
Bliek (IBM): Next up – IPM session chair Christian Bliek from IBM.  His talk was about barrier improvements in CPLEX.  For the single threaded case, CPLEX 12 is about 5% faster than 11 [note that these numbers are totally unofficial and any errors are my own].  With 4 threads it is about 20% better.  There was no magic bullet here, more tuning at the solver and algebra layer.  With a sophisticated solver such as CPLEX the improvements come through hard work, no tricks.  There was also some data on simplex vs. barrier.  For their complete set of their benchmark LPs, for problems that take 100 seconds or more barrier (without crossover) beats dual simplex on average.  If you allow crossover, then performance vs dual simplex is about even for small problems (~1s solution time) and after that crossover really takes over.  For problems that take 1000s and up, the geometric mean for the performance ratio against dual is 0.55.  So almost twice as fast.
 
For QP, barrier is way better than their active set approach.  Of course this does not take into account scenarios where warmstart could be used, where simplex rocks.  More on that later.  They also have a “concurrent” setting, meaning to race the simplex & IPM solvers.  This setting gives the best results.  Solver Foundation Services has also supported this type of “transparent parallelism” since version 1.0. The last bit of the talk was about some experiments using PARDISO instead of the solver’s own algebra stack.  Their conclusion was that using a “black box” linear algebra module shows potential.
 
Warmstart: The last talk in the session was by Anthony Vannelli, who talked about IPM warmstart.  He takes a very pragmatic approach and layed out a very sensible procedure for attacking the difficult problem of using the solution of an IPM problem to seed the solution of a related problem.  The net was that using a relatively simple warmstart procedure they were able to get (roughly) a factor of 3 speedup for their test set.  There were also some other IPM warmstart researchers in attendance – in particular Alper Yilidrim and Hande Benson. Alper had a great survey talk of existing IPM warmstart methods, and Hande is one of the top researchers in the area.  I am sure there have been interesting discussions this week!
 
Erling Andersen (Mosek):The afternoon IPM session had talks by Erling Andersen from Mosek, Jacek Gondzio, and myself.  Erling talked about SOCP improvements in MOSEK 6 (which will be out later this year).  He’s had SOCP support forever, but for version 6 he has refactored and reworked much of the implementation.  He has run into problems getting high accuracy solutions (I feel his pain), and he has had some problems with infeasible or nearly infeasible problems.  He had some interesting thoughts on infeasibility criteria, rooted in his experience with real problems.  He’s going to get about a 15% speedup, which is quite impressive since he is already starting with a very fast solver.  Erling has built a great set of solvers and I wish him well with Mosek 6.

Jacek Gondzio talked about the use of indirect methods (such as preconditioned conjugate gradient) in IPM.  This hasn’t worked out well in the past because such methods are not able to solve the systems to the accuracy required by IPM search directions.  And the problem is still not solved, but for his new approach he did show some impressive results for dense problems.
 
The title of my talk was Interior Point Methods in Solver Foundation.  I gave an overview of where IPM sits in the Solver Foundation stack, discussed our user and solver models.  (And why different user and solver models are necessary – the “user model” serves Solver Foundation Services as well as programmers, whereas the solver model is structured so as to attain the most numerically stable, high performance solver possible.  Separation of concerns!)  Then I gave a high-level overview of our solution approach, and talked about a couple of implementation issues including handling of free variables, equality constraints, etc.  Lastly I talked about some numerical stability issues. I got nice feedback from people afterwards and there was a lot of curiosity about who is using our stuff and what is next. 

 

Int’l Symposium on Mathematical Programming 2009, Day 1

I am attending the ISMP math programming conference in Chicago, representing the Solver Foundation team. Here are some of my impressions of Day 1 at ISMP:

Plenary [Steven Boyd]: The plenary speaker was Steve Boyd who talked about real-time embedded convex optimization.  He considered cases where problems need to be solved in milliseconds or less: control, signal processing, resource allocation, and even finance (think flash trading).   The approach is basically to extend “disciplined convex programming”.  DCP is a modeling system where problems are convex “by construction” because they combine operators with well-known properties; the system is then able to rewrite the problem in the standard form required by a convex programming solver (such as an IPM QP/SOCP solver). The new step is that the system can now generate highly optimized C code for a custom solver for the particular problem family described by the model.  (The algorithm itself is standard IPM.)  You can do all sorts of optimizations in this case: the problem structure is known so the symbolic ordering can be done in advance, you can ensure better memory locality, etc.  He gave many example where small QP instances were solved in tensof microseconds.  Fun stuff.

Morning Sessions: I attended a morning session concerning approaches for nonconvex optimization problems. Jon Lee talked about solving a particular class of parametric nonlinear problems, Kurt Anstreicher talked about *nonconvex* QP and associated bounds.  I then went to a straight-up theory session talking about convergence rates & asymptotic costs for IPM – there were a couple of new results and an interesting “corrector-predictor” (instead of the other way around) approach that attains better asymptotic convergence.  There is an embarrassment of riches at ISMP: I missed by John Hooker about integrating MIP, constraint & global optimization, a great MIP session including Bob Bixby from Gurobi, and all sorts of other stuff.  There are a staggering number of good talks to attend.

Afternoon Sessions [Modeling / Stochastic]: I went to a modeling languages track in the afternoon.  The first speaker was from LINDO Systems who talked about their new LINGO offering.  The primary new feature is to support stochastic modeling. There were some screenshots from What’s Best, their spreadsheet solver. They had a few nice samples as well.

Gautam Mitra from OptiRisk was up next – he talked about SAMPL, stochastic extensions for AMPL.  They have extended SAMPL to support:

  1. Chance constraints
  2. Integrated chance constraints
  3. Robust optimization

Chance constraints are interesting – they occur in a range of financial problems including VaR and CVaR calculation.  Gautam riffed on the difference between stochastic and robust optimization by invoking Rumsfeld:  there are known knowns [deterministic optimization], known unknowns [stochastic],  and unknown unknowns [robust optimization].  In robust optimization you may have problem coefficients that are in some range with an unknown distribution and you want the constraints to hold in all (or most) possible circumstances.  Robust optimization also has SOCP reformulations, and I am a big believer because I think RO can be cleanly expressed in modeling languages.  The last talk in the session was about YALMIP support for robust optimization – YALMIP has had widespread adoption by control theory specialists and has had a long history of improvements.  This talk was very code-oriented and fun to watch. 

A few overall impressions: I think it is fair to say that SOCP is entering the mainstream with LP/QP/MIP – it came up a lot today.  On the modeling side, it was interesting to see that stochastic was featured in all three talks in the session I attended.  A larger trend is that MINLP (mixed integer nonlinear programming) is hot.  There are tons of tracks on it, and it is [unsurprisingly] being taken on by both the MIP and NLP communities.  The problem description is very general so of course there are applications, but the dust has not settled on the solver side.