## NCAA Tournament Wrapup

Hail to the champions, the University of Kentucky. I only got one of the four Final Four picks correct, but luckily that turned out to be the champion. My SAS-generated picks turned out to be in the 71st percentile of all those who participated in the Yahoo NCAA challenge. Not bad.

Careful readers may have noticed that my code had a somewhat serious bug in it: the exponent in the “pythagorean” formula that I relied upon was supposed to be 11.5, but I had 1.5 in my code. Oh well!

## NCAA tournament prediction using SAS

I don’t watch as much basketball as I used to, so the past few years I have been writing programs to do my picks. Two years ago I did very well, and last year was a complete disaster. This year since I am in a new role at a new company, I chose a new language and a different approach.

I gave myself two hours to write SAS code to generate picks. This was something of a challenge because I am an inexperienced SAS programmer! As in past years I am using simple season-level statistics for all of the teams in the tournament. I obtained the following data:

• Win-loss records for each team.
• RPI rating.
• Strength of schedule.
• Points scored per game.
• Points allowed per game.
• Number of games.

This year’s twist is that I am also incorporating the “wisdom of the crowds” by using the percentage that each team was selected to win by all of the entries on ESPN.com. (Here is the link.)

For each round, I compute a score for each team. The team with the higher score advances. Here’s how I arrive at a score:

• Compute the”adjusted points per game” and “adjusted points allowed per game” by scaling points for and against by the average points per game in all NCAA games this year. This means that if a team scores on average 50 and allows 45, their adjusted PPG will be something closer to 75-67.
• Using adjusted PPG, use the “pythagorean formula” with an exponent of 11.5.
• Add this to the RPI. I call this the “base score”. Everything up until this point is similar to my entry from two years ago.
• When computing the “score” for a team in a round, if the team was judged to be in the top K of brackets on ESPN.com then we multiply the score by 100.  We start K at 16 for the first round (with 64 teams) and decrease it by round. This means that if the base score for a team Ais lower than its opponent B but the public strongly believes that A will advance, we will ignore the base score and choose A.

Once the tournament starts you should be able to view my picks here: http://tournament.fantasysports.yahoo.com/t1/2239853

Unfortunately, these picks are INCREDIBLY BORING: all of the #1 and #2 seeds advance to the Elite Eight. I am somewhat disappointed it turned out that way, but that’s the way the code worked out. Final Four is Kentucky, Michigan State, Syracuse, North Carolina with Kentucky beating Syracuse in the final. A couple of other observations:

• Most of the code is data preparation. The raw data files referenced in the code are screen scraped from espn.com and ncaa.com. You can download them below – rename the extensions to csv to use. I did a little bit of massaging of the data files, but I tried to do most of that in SAS.
• I intentionally tried to avoid macro language. I didn’t worry about trying to factor the code nicely (as you can see by the repeated “team canonicalization” code).
• I found the code to read in the more unstructured data files a bit tedious to write. I could have coded it much faster in C#.
• I probably should have left out the bit about RPI. That would have made the picks more interesting. RPI is already a prediction model so we essentially have a “model of a model”.
• I wanted to add a bit more code at the end that parses bracket.csv (which is a list of all the teams with opponents grouped next to each other as in the bracket) and computes the winners for each round. This is not that hard to do (I don’t think), but I ran out of time. A SAS person could whip this up easily, but as a C/C++/C# guy I still tend to think about this procedurally (and would want to just write a few for-loops).
• PROC MEANS is awesome.
• In order to continue to make this interesting I will need to consider per-game data rather than per-season data. Unfortunately that takes more time and I generally don’t have a lot of that to spare.

Here’s the code:

%macro DoPicks;

data espnpicksraw;

array r{6} \$ 50;

infile ‘c:\temp\ncaa\espnpicks.csv’ dlm=‘,’;

input r{1} r{2} r{3} r{4} r{5} r{6};

run;

data espnpicks;

length Pct 8 Round 3;

set espnpicksraw;

array r{6} r1-r6;

do Round = 1 to 6;

* example: “1 Kentucky – 99.4%”  *;

*Seed = scan(r{i}, 1, ‘ ‘);

Pct = compress(scan(r{Round}, –1‘ ‘), ‘%’);

first = find(r{Round}, ‘ ‘1);

last = find(r{Round}, ‘-‘, -length(r{Round}));

Team = substr(r{Round}, first, last – first);

Team = compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”“ST”), “STATE”“ST”)), ‘.’);

output;

end;

keep Team Round Pct;

run;

proc sort data=espnpicks;

by Team Round;

run;

proc transpose data=espnpicks out=espnscores(drop=_Name_) prefix=Round;

BY Team;

VAR Pct;

ID Round;

run;

%local r ;

%do r = 1 %to 6 ;

proc sort data=espnscores;

by descending Round&r.;

run;

data espnscores;

set espnscores;

Rank&r.=_N_;

run;

%end;

data espnscores;

retain Team Round1-Round6 Rank1-Rank6;

set espnscores;

run;

proc sort data=espnscores;

by Team;

run;

data rpi;

infile ‘c:\temp\ncaa\rpi.csv’ dlm=‘,’ dsd;

length Rank 3 Team \$ 30 Wins 4 Losses 4 RPI 8;

length SOS 8 SOS_Projected 8 SOS_Rank 8 SOS_Projected_Rank 8 Conference \$ 25 Schedule \$25;

input Rank Team Wins Losses RPI SOS SOS_Projected SOS_Rank SOS_Projected_Rank Conference Schedule;

Team=compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”“ST”), “STATE”“ST”)), ‘.’);

keep Team Wins Losses RPI;

run;

data points_off;

infile ‘c:\temp\ncaa\points_off.csv’ dlm=‘,’ dsd;

length Rank 3 Team \$ 30 Games 4 WinsLosses \$12 PointsOff 8 PPGOff 8;

input Rank Team Games WinsLosses PointsOff PPGOff;

Team=compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”“ST”), “STATE”“ST”)), ‘.’);

keep Team Games PointsOff PPGOff;

run;

data points_def;

infile ‘c:\temp\ncaa\points_def.csv’ dlm=‘,’ dsd;

length Rank 3 Team \$ 30 Games 4 WinsLosses \$12 PointsDef 8 PPGDef 8;

input Rank Team Games WinsLosses PointsDef PPGDef;

Team=compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”“ST”), “STATE”“ST”)), ‘.’);

keep Team PointsDef PPGDef;

run;

data bracket;

infile ‘c:\temp\ncaa\bracket.csv’;

input Line & \$50.;

first = find(Line, ‘ ‘1);

last = find(Line, ‘(‘, first);

Team = substr(Line, first, last – first);

Team = compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”“ST”), “STATE”“ST”)), ‘.’);

ID = _N_;

keep ID Team;

run;

*****;

proc sort data=rpi;

by Team;

run;

proc sort data=points_off;

by Team;

run;

proc sort data=points_def;

by Team;

run;

proc sort data=bracket;

by team;

run;

* Get data for the 68 tournament teams. *;

data team_stats;

merge bracket rpi points_off points_def espnscores (in=in_espn);

by Team;

if in_espn;

if ID;

run;

* Find the average number of points per game over all teams. *;

proc means noprint data=team_stats n mean max min range std sum;

var PointsOff Games;

output out=ncaa_points(keep=_Points _Games)

sum(PointsOff Games)= _Points _Games;

run;

* The average is in _PPG. I use the value 72.8 below. *;

data ncaa_points;

set ncaa_points;

_PPG = _Points/_Games;

run;

data team_stats;

set team_stats;

array s{6} s1-s6;

array r{6} Rank1-Rank6;

PointsTotal = PointsOff + PointsDef;

* Scale points for and against based on NCAA average. *;

AdjPointsOff = (72.8 * 2 / PointsTotal) * PointsOff;

AdjPointsDef = (72.8 * 2 / PointsTotal) * PointsDef;

* The base score. *;

Score = Pythag + RPI;

do Round = 1 to 6;

* If this team was very popular on ESPN.com, jack up the score. *;

if r{Round} <= max(2**(5-Round),4) then s{Round} = 100 * Score;

else s{Round} = Score;

end;

run;

proc sort data=team_stats;

by id;

run;

%mend;

%DoPicks;

I don’t watch as much basketball as I used to, so the past few years I have been writing programs to do my picks. Two years ago I did very well, and last year was a complete disaster. This year since I am in a new role at a new company, I chose a new language and a slightly different approach.
I gave myself two hours to write SAS code to generate picks. This was something of a challenge because I am an inexperienced SAS programmer! As in past years I am using simple season-level statistics for all of the teams in the tournament. I obtained the following data:
·         Win-loss records for each team.
·         RPI rating.
·         Strength of schedule.
·         Points scored per game.
·         Points allowed per game.
·         Number of games.
This year’s twist is that I am also incorporating the “wisdom of the crowds” by using the percentage that each team was selected to win by all of the entries on ESPN.com.
Here’s how I arrive at a score:
·         Compute the”adjusted points per game” and “adjusted points allowed per game” by scaling points for and against by the average points per game in all NCAA games this year. This means that if a team scores on average 50 and allows 45, their adjusted PPG will be something closer to 75-67.
·         Using adjusted PPG, use the “pythagorean formula” with an exponent of 11.5.
·         Add this to the RPI. I call this the “base score”.
·         When computing the “score” for a team in a round, if the team was judged to be in the top K of brackets on ESPN.com then we multiply the score by 100.  We start K at 16 for the first round (with 64 teams) and decrease it by round. This means that if the base score for a team Ais lower than its opponent B but the public strongly believes that A will advance, we will ignore the base score and choose A.
Once the tournament starts you should be able to view my picks here: http://tournament.fantasysports.yahoo.com/t1/2239853
Unfortunately, these picks are INCREDIBLY BORING: all of the #1 and #2 seeds advance to the Elite Eight. I am somewhat disappointed it turned out that way, but that’s the way the code worked out.
%macro DoPicks;
data espnpicksraw;
array r{6} \$ 50;
infile ‘c:\temp\ncaa\espnpicks.csv’ dlm=’,’;
input r{1} r{2} r{3} r{4} r{5} r{6};
run;
data espnpicks;
length Pct 8 Round 3;
set espnpicksraw;
array r{6} r1-r6;
do Round = 1 to 6;
* example: “1 Kentucky – 99.4%”  *;
*Seed = scan(r{i}, 1, ‘ ‘);
Pct = compress(scan(r{Round}, -1, ‘ ‘), ‘%’);
first = find(r{Round}, ‘ ‘, 1);
last = find(r{Round}, ‘-‘, -length(r{Round}));
Team = substr(r{Round}, first, last – first);
Team = compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”, “ST”), “STATE”, “ST”)), ‘.’);
output;
end;
keep Team Round Pct;
run;
proc sort data=espnpicks;
by Team Round;
run;
proc transpose data=espnpicks out=espnscores(drop=_Name_) prefix=Round;
BY Team;
VAR Pct;
ID Round;
run;
%local r ;
%do r = 1 %to 6 ;
proc sort data=espnscores;
by descending Round&r.;
run;
data espnscores;
set espnscores;
Rank&r.=_N_;
run;
%end;
data espnscores;
retain Team Round1-Round6 Rank1-Rank6;
set espnscores;
run;
proc sort data=espnscores;
by Team;
run;
data rpi;
infile ‘c:\temp\ncaa\rpi.csv’ dlm=’,’ dsd;
length Rank 3 Team \$ 30 Wins 4 Losses 4 RPI 8;
length SOS 8 SOS_Projected 8 SOS_Rank 8 SOS_Projected_Rank 8 Conference \$ 25 Schedule \$25;
input Rank Team Wins Losses RPI SOS SOS_Projected SOS_Rank SOS_Projected_Rank Conference Schedule;
Team=compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”, “ST”), “STATE”, “ST”)), ‘.’);
keep Team Wins Losses RPI;
run;
data points_off;
infile ‘c:\temp\ncaa\points_off.csv’ dlm=’,’ dsd;
length Rank 3 Team \$ 30 Games 4 WinsLosses \$12 PointsOff 8 PPGOff 8;
input Rank Team Games WinsLosses PointsOff PPGOff;
Team=compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”, “ST”), “STATE”, “ST”)), ‘.’);
keep Team Games PointsOff PPGOff;
run;
data points_def;
infile ‘c:\temp\ncaa\points_def.csv’ dlm=’,’ dsd;
length Rank 3 Team \$ 30 Games 4 WinsLosses \$12 PointsDef 8 PPGDef 8;
input Rank Team Games WinsLosses PointsDef PPGDef;
Team=compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”, “ST”), “STATE”, “ST”)), ‘.’);
keep Team PointsDef PPGDef;
run;
data bracket;
infile ‘c:\temp\ncaa\bracket.csv’;
input Line & \$50.;
first = find(Line, ‘ ‘, 1);
last = find(Line, ‘(‘, first);
Team = substr(Line, first, last – first);
Team = compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”, “ST”), “STATE”, “ST”)), ‘.’);
ID = _N_;
keep ID Team;
run;
*****;
proc sort data=rpi;
by Team;
run;
proc sort data=points_off;
by Team;
run;
proc sort data=points_def;
by Team;
run;
proc sort data=bracket;
by team;
run;
* Get data for the 68 tournament teams. *;
data team_stats;
merge bracket rpi points_off points_def espnscores (in=in_espn);
by Team;
if in_espn;
if ID;
run;
* Find the average number of points per game over all teams. *;
proc means noprint data=team_stats n mean max min range std sum;
var PointsOff Games;
output out=ncaa_points(keep=_Points _Games)
sum(PointsOff Games)= _Points _Games;
run;
* The average is in _PPG. I use the value 72.8 below. *;
data ncaa_points;
set ncaa_points;
_PPG = _Points/_Games;
run;
data team_stats;
set team_stats;
array s{6} s1-s6;
array r{6} Rank1-Rank6;
PointsTotal = PointsOff + PointsDef;
* Scale points for and against based on NCAA average. *;
AdjPointsOff = (72.8 * 2 / PointsTotal) * PointsOff;
AdjPointsDef = (72.8 * 2 / PointsTotal) * PointsDef;
* The base score. *;
Score = Pythag + RPI;
do Round = 1 to 6;
* If this team was very popular on ESPN.com, jack up the score. *;
if r{Round} <= max(2**(5-Round),4) then s{Round} = 100 * Score;
else s{Round} = Score;
end;
run;
proc sort data=team_stats;
by id;
run;
%mend;
%DoPicks;

## A simple prediction algorithm for the NCAA tournament

A friend of mine asked me if I wanted to participate in an NCAA tournament pool. The twist: you have to write a program to predict the results. Here are the rules I was given:

The algorithm for your [Java] code goes below. As you can see in the method header, you are passed two objects: Team A and Team B. Spend some time thinking about this and writing your algorithm out, as once you submit your code you won’t get any feedback until your bracket is sent to you. The team class is a class defined by us, here’s what it looks like:

```public class Team {
public String name;
public int seed, RPI_rank;
public double points_scored, points_against, wins, losses, RPI;
public Team(String n, int s, double rpi, int rpi_rank, int w, int l,
double ps, double pa)
{
name = n; seed = s; RPI = rpi; RPI_rank = rpi_rank;
wins = w; losses = l;     points_scored = ps; points_against = pa;
}
}```

I need to fill this in:

```public Team Game(Team A, Team B, int round) {
// Fill me in
}```

I gave myself one hour to come up with something. For better or worse, here’s what I did. I decided that I wouldn’t "cheat", i.e. use any information other than what I have been given. Otherwise a reasonable approach would be to have a giant switch statement based on name, and look up the Sagarin ratings for each team! I also noticed right away that I don’t have any information about past games between teams. So it seems clear that I need to have a "scoring" based approach, where I compute a metric for team A and Team B and return the team with the higher score.

My first idea was to try and come up with a metric based on points_scored and points_against. I remembered from Mathletics that there is a "pythagorean expectation" formula for predicting win percentage. I quickly learned that there is a variant for basketball. I have a big problem: I don’t have the "defensive and offensive efficiencies", I only have average points for and against. A simple hack is to scale these values by the average number of points scored per game this year, which appears to be 137.3137725. (I downloaded stats from ncaa.org as CSV and threw them in an Excel spreadsheet). Once you have the "normalized" points per game on offense and defense, you can apply the formula on the wikipedia page with 11.5 as the exponent. Here are the first few records:

```Name      OPP PTS    OPP PPG    PPG    TotalPPG    NormPPG    NormOpp    Pythag
Kansas    2169    63.8    81.8    145.6    77.1446    60.16908    0.945732
Murr. St. 2056    60.5    77.5    138    77.11461    60.19915    0.945204
BYU       2216    65.2    83    148.2    76.90312    60.41064    0.941358
Duke      2100    61.8    78    139.8    76.61283    60.70093    0.935671
Cst. Car. 2038    59.9    74.6    134.5    76.16065    61.15312    0.925796
Utah St.  2028    59.6    73.7    133.3    75.91916    61.39460    0.919973
Syracuse  2140    66.9    81.5    148.4    75.41153    61.90223    0.906370
Kentucky  2219    65.3    79.2    144.5    75.26125    62.05252    0.901971```

You can see the problem: this metric doesn’t account for quality of opposition. Teams that beat up on bad teams will be unjustly rewarded. Murray State has had a good year, but they are not the second best team in the country! So I decided to weight this factor equally with RPI. RPI attempts to take strength of schedule into account, and is one of the factors the tournament selection committee takes into account. Let’s look at this same list of teams once I incorporate the RPI:

```Name      OPP PPG PTS    PPG    TotalPPG    NormPPG    NormOpp    Pythag    RPI    Score
Kansas    63.8    2780    81.8    145.6    77.14468    60.16908    0.945732    0.688    1.633732
Duke      61.8    2653    78    139.8    76.61283    60.70093    0.935671    0.664    1.599671
Kentucky  65.3    2694    79.2    144.5    75.26125    62.05252    0.901971    0.666    1.567971
Syracuse  66.9    2607    81.5    148.4    75.41153    61.90223    0.906374    0.651    1.557374
BYU       65.2    2821    83    148.2    76.90312    60.41064    0.941358    0.61    1.551358
Utah St.  59.6    2506    73.7    133.3    75.91916    61.39460    0.919973    0.602    1.521973
Murr. St. 60.5    2635    77.5    138    77.11461    60.19915    0.945204    0.575    1.520204
Cst. Car. 59.9    2535    74.6    134.5    76.16065    61.15312    0.925796    0.519    1.444796```

That’s not totally crazy! My hour was pretty much up at that point, so I went with it. Here’s the code (in C#):

```public static Team Game(Team A, Team B, int round) {
// Fake the 'pythagorean calculation' and weight it equally with RPI.
double avg_points_total = 137.31377245509;

double a_points_total = A.points_scored + A.points_against;
double a_adj_points_scored = (avg_points_total / a_points_total) * A.points_scored;
double a_adj_points_against = (avg_points_total / a_points_total) * A.points_against;
double a_score = a_pythag + A.RPI;

double b_points_total = B.points_scored + B.points_against;
double b_adj_points_scored = (avg_points_total / b_points_total) * B.points_scored;
double b_adj_points_against = (avg_points_total / b_points_total) * B.points_against;