A simple prediction algorithm for the NCAA tournament

A friend of mine asked me if I wanted to participate in an NCAA tournament pool. The twist: you have to write a program to predict the results. Here are the rules I was given:

The algorithm for your [Java] code goes below. As you can see in the method header, you are passed two objects: Team A and Team B. Spend some time thinking about this and writing your algorithm out, as once you submit your code you won’t get any feedback until your bracket is sent to you. The team class is a class defined by us, here’s what it looks like:

public class Team {
    public String name;    
    public int seed, RPI_rank;    
    public double points_scored, points_against, wins, losses, RPI;        
    public Team(String n, int s, double rpi, int rpi_rank, int w, int l, 
      double ps, double pa) 
    {
      name = n; seed = s; RPI = rpi; RPI_rank = rpi_rank;
      wins = w; losses = l;     points_scored = ps; points_against = pa;
    }
   }

I need to fill this in:

public Team Game(Team A, Team B, int round) {
    // Fill me in
  }

I gave myself one hour to come up with something. For better or worse, here’s what I did. I decided that I wouldn’t "cheat", i.e. use any information other than what I have been given. Otherwise a reasonable approach would be to have a giant switch statement based on name, and look up the Sagarin ratings for each team! I also noticed right away that I don’t have any information about past games between teams. So it seems clear that I need to have a "scoring" based approach, where I compute a metric for team A and Team B and return the team with the higher score.

My first idea was to try and come up with a metric based on points_scored and points_against. I remembered from Mathletics that there is a "pythagorean expectation" formula for predicting win percentage. I quickly learned that there is a variant for basketball. I have a big problem: I don’t have the "defensive and offensive efficiencies", I only have average points for and against. A simple hack is to scale these values by the average number of points scored per game this year, which appears to be 137.3137725. (I downloaded stats from ncaa.org as CSV and threw them in an Excel spreadsheet). Once you have the "normalized" points per game on offense and defense, you can apply the formula on the wikipedia page with 11.5 as the exponent. Here are the first few records:

Name      OPP PTS    OPP PPG    PPG    TotalPPG    NormPPG    NormOpp    Pythag
Kansas    2169    63.8    81.8    145.6    77.1446    60.16908    0.945732
Murr. St. 2056    60.5    77.5    138    77.11461    60.19915    0.945204
BYU       2216    65.2    83    148.2    76.90312    60.41064    0.941358
Duke      2100    61.8    78    139.8    76.61283    60.70093    0.935671
Cst. Car. 2038    59.9    74.6    134.5    76.16065    61.15312    0.925796
Utah St.  2028    59.6    73.7    133.3    75.91916    61.39460    0.919973
Syracuse  2140    66.9    81.5    148.4    75.41153    61.90223    0.906370
Kentucky  2219    65.3    79.2    144.5    75.26125    62.05252    0.901971

You can see the problem: this metric doesn’t account for quality of opposition. Teams that beat up on bad teams will be unjustly rewarded. Murray State has had a good year, but they are not the second best team in the country! So I decided to weight this factor equally with RPI. RPI attempts to take strength of schedule into account, and is one of the factors the tournament selection committee takes into account. Let’s look at this same list of teams once I incorporate the RPI:

Name      OPP PPG PTS    PPG    TotalPPG    NormPPG    NormOpp    Pythag    RPI    Score
Kansas    63.8    2780    81.8    145.6    77.14468    60.16908    0.945732    0.688    1.633732
Duke      61.8    2653    78    139.8    76.61283    60.70093    0.935671    0.664    1.599671
Kentucky  65.3    2694    79.2    144.5    75.26125    62.05252    0.901971    0.666    1.567971
Syracuse  66.9    2607    81.5    148.4    75.41153    61.90223    0.906374    0.651    1.557374
BYU       65.2    2821    83    148.2    76.90312    60.41064    0.941358    0.61    1.551358
Utah St.  59.6    2506    73.7    133.3    75.91916    61.39460    0.919973    0.602    1.521973
Murr. St. 60.5    2635    77.5    138    77.11461    60.19915    0.945204    0.575    1.520204
Cst. Car. 59.9    2535    74.6    134.5    76.16065    61.15312    0.925796    0.519    1.444796

That’s not totally crazy! My hour was pretty much up at that point, so I went with it. Here’s the code (in C#):

public static Team Game(Team A, Team B, int round) {
      // Fake the 'pythagorean calculation' and weight it equally with RPI.
      double avg_points_total = 137.31377245509;

      double a_points_total = A.points_scored + A.points_against;
      double a_adj_points_scored = (avg_points_total / a_points_total) * A.points_scored;
      double a_adj_points_against = (avg_points_total / a_points_total) * A.points_against;
      double a_pythag = Math.Pow(a_adj_points_scored, 11.5) / (Math.Pow(a_adj_points_scored, 11.5) + Math.Pow(a_adj_points_against, 11.5));
      double a_score = a_pythag + A.RPI;

      double b_points_total = B.points_scored + B.points_against;
      double b_adj_points_scored = (avg_points_total / b_points_total) * B.points_scored;
      double b_adj_points_against = (avg_points_total / b_points_total) * B.points_against;
      double b_pythag = Math.Pow(b_adj_points_scored, 11.5) / (Math.Pow(b_adj_points_scored, 11.5) + Math.Pow(b_adj_points_against, 11.5));
      double b_score = b_pythag + B.RPI;

      return a_score >= b_score ? A : B;
    }

And here is the resulting bracket.  The Final Four is Kansas, Syracuse, Kentucky, and Duke.

NCAA picks

Not crazy, uses only the information I have been given, and more fun than just using the RPI directly! Mission accomplished. Here is a link to my picks on ESPN.com.