# NCAA tournament prediction using SAS

I don’t watch as much basketball as I used to, so the past few years I have been writing programs to do my picks. Two years ago I did very well, and last year was a complete disaster. This year since I am in a new role at a new company, I chose a new language and a different approach.

I gave myself two hours to write SAS code to generate picks. This was something of a challenge because I am an inexperienced SAS programmer! As in past years I am using simple season-level statistics for all of the teams in the tournament. I obtained the following data:

• Win-loss records for each team.
• RPI rating.
• Strength of schedule.
• Points scored per game.
• Points allowed per game.
• Number of games.

This year’s twist is that I am also incorporating the “wisdom of the crowds” by using the percentage that each team was selected to win by all of the entries on ESPN.com. (Here is the link.)

For each round, I compute a score for each team. The team with the higher score advances. Here’s how I arrive at a score:

• Compute the”adjusted points per game” and “adjusted points allowed per game” by scaling points for and against by the average points per game in all NCAA games this year. This means that if a team scores on average 50 and allows 45, their adjusted PPG will be something closer to 75-67.
• Using adjusted PPG, use the “pythagorean formula” with an exponent of 11.5.
• Add this to the RPI. I call this the “base score”. Everything up until this point is similar to my entry from two years ago.
• When computing the “score” for a team in a round, if the team was judged to be in the top K of brackets on ESPN.com then we multiply the score by 100.  We start K at 16 for the first round (with 64 teams) and decrease it by round. This means that if the base score for a team Ais lower than its opponent B but the public strongly believes that A will advance, we will ignore the base score and choose A.

Once the tournament starts you should be able to view my picks here: http://tournament.fantasysports.yahoo.com/t1/2239853

Unfortunately, these picks are INCREDIBLY BORING: all of the #1 and #2 seeds advance to the Elite Eight. I am somewhat disappointed it turned out that way, but that’s the way the code worked out. Final Four is Kentucky, Michigan State, Syracuse, North Carolina with Kentucky beating Syracuse in the final. A couple of other observations:

• Most of the code is data preparation. The raw data files referenced in the code are screen scraped from espn.com and ncaa.com. You can download them below – rename the extensions to csv to use. I did a little bit of massaging of the data files, but I tried to do most of that in SAS.
• I intentionally tried to avoid macro language. I didn’t worry about trying to factor the code nicely (as you can see by the repeated “team canonicalization” code).
• I found the code to read in the more unstructured data files a bit tedious to write. I could have coded it much faster in C#.
• I probably should have left out the bit about RPI. That would have made the picks more interesting. RPI is already a prediction model so we essentially have a “model of a model”.
• I wanted to add a bit more code at the end that parses bracket.csv (which is a list of all the teams with opponents grouped next to each other as in the bracket) and computes the winners for each round. This is not that hard to do (I don’t think), but I ran out of time. A SAS person could whip this up easily, but as a C/C++/C# guy I still tend to think about this procedurally (and would want to just write a few for-loops).
• PROC MEANS is awesome.
• In order to continue to make this interesting I will need to consider per-game data rather than per-season data. Unfortunately that takes more time and I generally don’t have a lot of that to spare.

Here’s the code:

%macro DoPicks;

data espnpicksraw;

array r{6} \$ 50;

infile ‘c:\temp\ncaa\espnpicks.csv’ dlm=‘,’;

input r{1} r{2} r{3} r{4} r{5} r{6};

run;

data espnpicks;

length Pct 8 Round 3;

set espnpicksraw;

array r{6} r1-r6;

do Round = 1 to 6;

* example: “1 Kentucky – 99.4%”  *;

*Seed = scan(r{i}, 1, ‘ ‘);

Pct = compress(scan(r{Round}, –1‘ ‘), ‘%’);

first = find(r{Round}, ‘ ‘1);

last = find(r{Round}, ‘-‘, -length(r{Round}));

Team = substr(r{Round}, first, last – first);

Team = compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”“ST”), “STATE”“ST”)), ‘.’);

output;

end;

keep Team Round Pct;

run;

proc sort data=espnpicks;

by Team Round;

run;

proc transpose data=espnpicks out=espnscores(drop=_Name_) prefix=Round;

BY Team;

VAR Pct;

ID Round;

run;

%local r ;

%do r = 1 %to 6 ;

proc sort data=espnscores;

by descending Round&r.;

run;

data espnscores;

set espnscores;

Rank&r.=_N_;

run;

%end;

data espnscores;

retain Team Round1-Round6 Rank1-Rank6;

set espnscores;

run;

proc sort data=espnscores;

by Team;

run;

data rpi;

infile ‘c:\temp\ncaa\rpi.csv’ dlm=‘,’ dsd;

length Rank 3 Team \$ 30 Wins 4 Losses 4 RPI 8;

length SOS 8 SOS_Projected 8 SOS_Rank 8 SOS_Projected_Rank 8 Conference \$ 25 Schedule \$25;

input Rank Team Wins Losses RPI SOS SOS_Projected SOS_Rank SOS_Projected_Rank Conference Schedule;

Team=compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”“ST”), “STATE”“ST”)), ‘.’);

keep Team Wins Losses RPI;

run;

data points_off;

infile ‘c:\temp\ncaa\points_off.csv’ dlm=‘,’ dsd;

length Rank 3 Team \$ 30 Games 4 WinsLosses \$12 PointsOff 8 PPGOff 8;

input Rank Team Games WinsLosses PointsOff PPGOff;

Team=compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”“ST”), “STATE”“ST”)), ‘.’);

keep Team Games PointsOff PPGOff;

run;

data points_def;

infile ‘c:\temp\ncaa\points_def.csv’ dlm=‘,’ dsd;

length Rank 3 Team \$ 30 Games 4 WinsLosses \$12 PointsDef 8 PPGDef 8;

input Rank Team Games WinsLosses PointsDef PPGDef;

Team=compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”“ST”), “STATE”“ST”)), ‘.’);

keep Team PointsDef PPGDef;

run;

data bracket;

infile ‘c:\temp\ncaa\bracket.csv’;

input Line & \$50.;

first = find(Line, ‘ ‘1);

last = find(Line, ‘(‘, first);

Team = substr(Line, first, last – first);

Team = compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”“ST”), “STATE”“ST”)), ‘.’);

ID = _N_;

keep ID Team;

run;

*****;

proc sort data=rpi;

by Team;

run;

proc sort data=points_off;

by Team;

run;

proc sort data=points_def;

by Team;

run;

proc sort data=bracket;

by team;

run;

* Get data for the 68 tournament teams. *;

data team_stats;

merge bracket rpi points_off points_def espnscores (in=in_espn);

by Team;

if in_espn;

if ID;

run;

* Find the average number of points per game over all teams. *;

proc means noprint data=team_stats n mean max min range std sum;

var PointsOff Games;

output out=ncaa_points(keep=_Points _Games)

sum(PointsOff Games)= _Points _Games;

run;

* The average is in _PPG. I use the value 72.8 below. *;

data ncaa_points;

set ncaa_points;

_PPG = _Points/_Games;

run;

data team_stats;

set team_stats;

array s{6} s1-s6;

array r{6} Rank1-Rank6;

PointsTotal = PointsOff + PointsDef;

* Scale points for and against based on NCAA average. *;

AdjPointsOff = (72.8 * 2 / PointsTotal) * PointsOff;

AdjPointsDef = (72.8 * 2 / PointsTotal) * PointsDef;

* The base score. *;

Score = Pythag + RPI;

do Round = 1 to 6;

* If this team was very popular on ESPN.com, jack up the score. *;

if r{Round} <= max(2**(5-Round),4) then s{Round} = 100 * Score;

else s{Round} = Score;

end;

run;

proc sort data=team_stats;

by id;

run;

%mend;

%DoPicks;

I don’t watch as much basketball as I used to, so the past few years I have been writing programs to do my picks. Two years ago I did very well, and last year was a complete disaster. This year since I am in a new role at a new company, I chose a new language and a slightly different approach.
I gave myself two hours to write SAS code to generate picks. This was something of a challenge because I am an inexperienced SAS programmer! As in past years I am using simple season-level statistics for all of the teams in the tournament. I obtained the following data:
·         Win-loss records for each team.
·         RPI rating.
·         Strength of schedule.
·         Points scored per game.
·         Points allowed per game.
·         Number of games.
This year’s twist is that I am also incorporating the “wisdom of the crowds” by using the percentage that each team was selected to win by all of the entries on ESPN.com.
Here’s how I arrive at a score:
·         Compute the”adjusted points per game” and “adjusted points allowed per game” by scaling points for and against by the average points per game in all NCAA games this year. This means that if a team scores on average 50 and allows 45, their adjusted PPG will be something closer to 75-67.
·         Using adjusted PPG, use the “pythagorean formula” with an exponent of 11.5.
·         Add this to the RPI. I call this the “base score”.
·         When computing the “score” for a team in a round, if the team was judged to be in the top K of brackets on ESPN.com then we multiply the score by 100.  We start K at 16 for the first round (with 64 teams) and decrease it by round. This means that if the base score for a team Ais lower than its opponent B but the public strongly believes that A will advance, we will ignore the base score and choose A.
Once the tournament starts you should be able to view my picks here: http://tournament.fantasysports.yahoo.com/t1/2239853
Unfortunately, these picks are INCREDIBLY BORING: all of the #1 and #2 seeds advance to the Elite Eight. I am somewhat disappointed it turned out that way, but that’s the way the code worked out.
%macro DoPicks;
data espnpicksraw;
array r{6} \$ 50;
infile ‘c:\temp\ncaa\espnpicks.csv’ dlm=’,’;
input r{1} r{2} r{3} r{4} r{5} r{6};
run;
data espnpicks;
length Pct 8 Round 3;
set espnpicksraw;
array r{6} r1-r6;
do Round = 1 to 6;
* example: “1 Kentucky – 99.4%”  *;
*Seed = scan(r{i}, 1, ‘ ‘);
Pct = compress(scan(r{Round}, -1, ‘ ‘), ‘%’);
first = find(r{Round}, ‘ ‘, 1);
last = find(r{Round}, ‘-‘, -length(r{Round}));
Team = substr(r{Round}, first, last – first);
Team = compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”, “ST”), “STATE”, “ST”)), ‘.’);
output;
end;
keep Team Round Pct;
run;
proc sort data=espnpicks;
by Team Round;
run;
proc transpose data=espnpicks out=espnscores(drop=_Name_) prefix=Round;
BY Team;
VAR Pct;
ID Round;
run;
%local r ;
%do r = 1 %to 6 ;
proc sort data=espnscores;
by descending Round&r.;
run;
data espnscores;
set espnscores;
Rank&r.=_N_;
run;
%end;
data espnscores;
retain Team Round1-Round6 Rank1-Rank6;
set espnscores;
run;
proc sort data=espnscores;
by Team;
run;
data rpi;
infile ‘c:\temp\ncaa\rpi.csv’ dlm=’,’ dsd;
length Rank 3 Team \$ 30 Wins 4 Losses 4 RPI 8;
length SOS 8 SOS_Projected 8 SOS_Rank 8 SOS_Projected_Rank 8 Conference \$ 25 Schedule \$25;
input Rank Team Wins Losses RPI SOS SOS_Projected SOS_Rank SOS_Projected_Rank Conference Schedule;
Team=compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”, “ST”), “STATE”, “ST”)), ‘.’);
keep Team Wins Losses RPI;
run;
data points_off;
infile ‘c:\temp\ncaa\points_off.csv’ dlm=’,’ dsd;
length Rank 3 Team \$ 30 Games 4 WinsLosses \$12 PointsOff 8 PPGOff 8;
input Rank Team Games WinsLosses PointsOff PPGOff;
Team=compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”, “ST”), “STATE”, “ST”)), ‘.’);
keep Team Games PointsOff PPGOff;
run;
data points_def;
infile ‘c:\temp\ncaa\points_def.csv’ dlm=’,’ dsd;
length Rank 3 Team \$ 30 Games 4 WinsLosses \$12 PointsDef 8 PPGDef 8;
input Rank Team Games WinsLosses PointsDef PPGDef;
Team=compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”, “ST”), “STATE”, “ST”)), ‘.’);
keep Team PointsDef PPGDef;
run;
data bracket;
infile ‘c:\temp\ncaa\bracket.csv’;
input Line & \$50.;
first = find(Line, ‘ ‘, 1);
last = find(Line, ‘(‘, first);
Team = substr(Line, first, last – first);
Team = compress(strip(tranwrd(tranwrd(upcase(Team), “SAINT”, “ST”), “STATE”, “ST”)), ‘.’);
ID = _N_;
keep ID Team;
run;
*****;
proc sort data=rpi;
by Team;
run;
proc sort data=points_off;
by Team;
run;
proc sort data=points_def;
by Team;
run;
proc sort data=bracket;
by team;
run;
* Get data for the 68 tournament teams. *;
data team_stats;
merge bracket rpi points_off points_def espnscores (in=in_espn);
by Team;
if in_espn;
if ID;
run;
* Find the average number of points per game over all teams. *;
proc means noprint data=team_stats n mean max min range std sum;
var PointsOff Games;
output out=ncaa_points(keep=_Points _Games)
sum(PointsOff Games)= _Points _Games;
run;
* The average is in _PPG. I use the value 72.8 below. *;
data ncaa_points;
set ncaa_points;
_PPG = _Points/_Games;
run;
data team_stats;
set team_stats;
array s{6} s1-s6;
array r{6} Rank1-Rank6;
PointsTotal = PointsOff + PointsDef;
* Scale points for and against based on NCAA average. *;
AdjPointsOff = (72.8 * 2 / PointsTotal) * PointsOff;
AdjPointsDef = (72.8 * 2 / PointsTotal) * PointsDef;
* The base score. *;
Score = Pythag + RPI;
do Round = 1 to 6;
* If this team was very popular on ESPN.com, jack up the score. *;
if r{Round} <= max(2**(5-Round),4) then s{Round} = 100 * Score;
else s{Round} = Score;
end;
run;
proc sort data=team_stats;
by id;
run;
%mend;
%DoPicks;