Data Science and NCAA Bracketology – Part 1

Monday, March 3, 2014 - 13:15

I want to thank Steve Miller for his assistance in the development of this blog post.

I frequently travel to New York City for business. Whenever I'm in town on the first Wednesday of the month, I head to the Museum of Mathematics to attend the monthly Math Encounters lecture. MoMath is a wonderful new museum in lower Manhattan that brings math to life for everyone. In January, Davidson College professor Tim Chartier presented an application of linear algebra to NCAA college basketball ratings/bracketology. Given my passion for sports and analytics, I was hooked and began tinkering immediately.   

The Chartier Methodology

In Chartier's methodology, a game matrix and a results matrix are used to “solve” for a ratings matrix that is used to rank the teams. Since there are 351 Division I men’s basketball teams, I created a 351 x 351 game matrix with one row and one column per team, so that it covers every team combination. Each diagonal element pairs a team with itself and is populated using the formula: 2 + total games played by that team. Each non-diagonal element is assigned as: -1 * number of games played between the row team and the column team. The results matrix is 351 x 1 with one row per team, and its elements are populated with the formula: 1 + 0.5 * (wins - losses). The ratings are then the 351 x 1 matrix that satisfies the linear system (game matrix) x (ratings) = (results), and sorting the teams by rating gives the ranking.
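To make the recipe above concrete, here is a minimal sketch in plain Python rather than the R I actually used. The three teams and their results are made up for illustration, the function names `build_system` and `solve` are my own, and the small Gaussian elimination routine simply stands in for whatever linear solver you prefer.

```python
def build_system(teams, games):
    """Build the game matrix and results vector from (winner, loser) pairs."""
    index = {team: i for i, team in enumerate(teams)}
    n = len(teams)
    # Diagonal starts at 2 (the "2 + total games played" rule); off-diagonals at 0.
    game_matrix = [[2.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    wins = [0] * n
    losses = [0] * n
    for winner, loser in games:
        w, l = index[winner], index[loser]
        game_matrix[w][w] += 1  # one more game played by the winner
        game_matrix[l][l] += 1  # one more game played by the loser
        game_matrix[w][l] -= 1  # -1 per game between the two teams
        game_matrix[l][w] -= 1
        wins[w] += 1
        losses[l] += 1
    results = [1 + 0.5 * (wins[i] - losses[i]) for i in range(n)]
    return game_matrix, results


def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


# Made-up mini season: A beats B, A beats C, B beats C.
teams = ["A", "B", "C"]
games = [("A", "B"), ("A", "C"), ("B", "C")]
game_matrix, results = build_system(teams, games)
ratings = dict(zip(teams, solve(game_matrix, results)))
ranking = sorted(teams, key=ratings.get, reverse=True)
```

For this toy season the ratings work out to 0.7, 0.5 and 0.3 for A, B and C, so the ranking follows the win-loss records. With all 351 teams the code is the same, only the matrix is bigger.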

The methodology seems simple enough, so what’s the analytics magic? Well, Chartier suggests that not all games are equal – some have more meaning than others. A matrix can be personalized by applying unique weights to each game played. Perhaps road wins are weighted more than home wins. Or larger game point spreads are weighted higher. Or maybe overtime wins are more important. Or when the game was played... There are a lot of options to experiment with, so I started working...
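One plausible way to apply such weights, sketched below in plain Python, is to let each game contribute its weight instead of 1 to the games-played, wins and losses counts. This is my reading of the idea, not necessarily Chartier's exact formulation, and it is not my own weighting scheme: the road-win and home-win values, along with the names `game_weight` and `build_weighted_system`, are hypothetical and chosen only for illustration.

```python
# Hypothetical weights, chosen only for illustration.
ROAD_WIN_WEIGHT = 1.5
HOME_WIN_WEIGHT = 1.0


def game_weight(winner_was_home):
    """Return the weight of a game; here a road win counts more than a home win."""
    return HOME_WIN_WEIGHT if winner_was_home else ROAD_WIN_WEIGHT


def build_weighted_system(teams, games):
    """Build a weighted game matrix and results vector.

    Each game is a (winner, loser, winner_was_home) tuple and adds its
    weight, rather than 1, to every count it touches.
    """
    index = {team: i for i, team in enumerate(teams)}
    n = len(teams)
    game_matrix = [[2.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    wins = [0.0] * n
    losses = [0.0] * n
    for winner, loser, winner_was_home in games:
        wt = game_weight(winner_was_home)
        w, l = index[winner], index[loser]
        game_matrix[w][w] += wt
        game_matrix[l][l] += wt
        game_matrix[w][l] -= wt
        game_matrix[l][w] -= wt
        wins[w] += wt
        losses[l] += wt
    results = [1 + 0.5 * (wins[i] - losses[i]) for i in range(n)]
    return game_matrix, results


# Made-up example: A beats B at home, then B beats A on the road.
weighted_matrix, weighted_results = build_weighted_system(
    ["A", "B"],
    [("A", "B", True), ("B", "A", False)],
)
```

In this example the two teams split their games, but B's road win carries more weight, so B ends up with the larger entry in the results vector (1.25 versus 0.75). The weighted system is then solved exactly like the unweighted one.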

Building The Dataset with Pentaho & R

First I had to find a data source that provided results for every game of the current season. After a few Internet searches, I found www.masseyratings.com. I then needed to scrape this site, do a little data wrangling to produce game and result matrices and then solve for the ratings matrix. I chose to use the Pentaho Data Integration (PDI) tool with its handy built-in functions for scraping and parsing HTML. I could have used Python and the BeautifulSoup package, but I'm biased toward tools and was frankly more comfortable with the PDI environment. Though PDI enabled me to easily wrangle game data and to apply custom game weights, it’s not the best technology for performing linear algebra and analytics. Enter R...

I spoke with colleagues at Pentaho and they mentioned they'd begun developing a new RScript step for PDI. Using this, I could stream wrangled game data into an R data frame and then use R functions to generate matrices and perform the linear algebra. I installed the beta software and, after working through a few minor glitches, had it working like a champ. I was able to develop a single PDI transformation that scraped game data, applied game weights, and used R to perform the linear algebra needed to produce my very own power ratings. And it was 100% automated! (By the way, I’ll keep my weighting algorithm to myself as I have some bracket pools to win!)

The Season is a Graph

This isn't the end of the story. My bracketology journey continued in February at Strata in Santa Clara. As I listened to a lecture about the features of the new GraphX module of the Berkeley Data Analytics Stack (BDAS), I wondered how I could apply graph analytics to domains other than the web, social media, and system networks. That’s when my “sports analytics” mindset kicked in:  It hit me that the entire NCAA season was a massive graph. Imagine the nodes on the graph being the 351 Division I teams and the edges being the games played. The graph could be made directed by having the loser point to the winner. As with the Linear Algebra approach, each game (i.e. graph edge) can be weighted using game statistics. But, how could I turn this graph into a rating system? As the GraphX lecture progressed, I realized that the ubiquitous PageRank Algorithm could produce the ratings.

PageRank was developed by Google co-founders Larry Page and Sergey Brin to measure the importance of web pages in order to improve search engine results. In general, the more a page is referenced by other pages, the higher its rank. In addition, if the referencing pages are themselves highly ranked, the effect is compounded. The PageRank algorithm is recursive, but it converges. A quick search showed that I would not have to write my own version of PageRank, since the R igraph package provides a page.rank function out of the box.

Snapping igraph page.rank into my existing PDI-R framework was easy. I simply created a new RScript step that converted the same stream of weighted game results I used for the Linear Algebra approach into inputs for the page.rank function. Within minutes, I had a new rating system – again 100% automated.
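For readers who want to see the idea without R, here is a rough sketch of PageRank on a season graph in plain Python. It is a simple power iteration and not igraph's implementation. The damping factor of 0.85 is the conventional default, the `pagerank` function and the three-team season are made up for illustration, and spreading the rank of an undefeated team (a node with no outgoing edges) evenly across all teams is one common convention that I am assuming here.

```python
def pagerank(nodes, edges, damping=0.85, tol=1e-12, max_iter=1000):
    """Weighted PageRank by power iteration.

    edges is a list of (source, target, weight) tuples. For a season graph
    the source is the loser, the target is the winner, and the weight is
    the game weight.
    """
    n = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for src, _dst, w in edges:
        out_weight[src] += w
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes with no outgoing edges (undefeated teams)
        # is redistributed evenly so the total stays at 1.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, dst, w in edges:
            new_rank[dst] += damping * rank[src] * w / out_weight[src]
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Made-up mini season: A beats B, A beats C, B beats C.
# Each edge points from the loser to the winner, with weight 1.
teams = ["A", "B", "C"]
edges = [("B", "A", 1.0), ("C", "A", 1.0), ("C", "B", 1.0)]
ranks = pagerank(teams, edges)
page_ranking = sorted(teams, key=ranks.get, reverse=True)
```

In this toy season the undefeated team A collects rank from both of its opponents and comes out on top, followed by B and then C. Swapping in real game weights only changes the third element of each edge.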

What's Next?

In Part 2 of this series, I’ll compare the Linear Algebra and Page Rank ratings with established industry rating systems including RPI, BPI, Ken Pomeroy and Jeff Sagarin indexes. And I'll make heavy use of R’s statistical analysis and plotting libraries to show correlations between the different approaches and determine whether there are observable biases. After the NCAA tournament is over, I’ll blog in Part 3 about the bracket picking performance of each rating system – revealing the winners and losers!

In case you're wondering, as of Monday, March 3, here are the top 10 teams for the generic and weighted Linear Algebra and Page Rank approaches as well as the most recent AP Poll. Notice that undefeated Wichita State (31-0!) gets more respect from the human observers at AP than from the systematic ranking systems. Florida is barely noticed by the Page Rank approach yet is at or near the top in the Linear Algebra ratings and the AP Poll. In the next blog, I’ll endeavor to explain these anomalies.

| Rank | Generic Linear Algebra | Weighted Linear Algebra | Generic Page Rank | Weighted Page Rank | AP Poll |
|------|------------------------|-------------------------|-------------------|--------------------|---------|
| 1 | Arizona | Arizona | Wisconsin | Wisconsin | Florida |
| 2 | Florida | Kansas | Arizona | Arizona | Wichita St |
| 3 | Villanova | San Diego St | Kansas | Kansas | Arizona |
| 4 | Wichita St | Florida | Arizona St | Syracuse | Duke |
| 5 | Kansas | Iowa St | Syracuse | Creighton | Virginia |
| 6 | Syracuse | UCLA | Creighton | Connecticut | Villanova |
| 7 | Wisconsin | New Mexico | Duke | Duke | Syracuse |
| 8 | Creighton | Wisconsin | Michigan | Arizona St | Kansas |
| 9 | Iowa St | Oregon | California | Virginia | Wisconsin |
| 10 | Virginia | Colorado | Connecticut | Florida | San Diego St |
