Data Science and NCAA Bracketology – Part 2

Monday, March 17, 2014 - 12:45

The brackets are set. Let the madness begin!

In Part 1 of this series, I introduced two methods for ranking NCAA teams: a Linear Algebra approach based on a lecture given by Tim Chartier at the Museum of Mathematics, and a PageRank approach which I applied by reframing the entire NCAA season as a directed, weighted graph. In this blog, we'll compare these ratings approaches with the established industry rating systems of RPI, BPI, Ken Pomeroy and Jeff Sagarin. We'll make use of R's statistical analysis and plotting libraries to understand the strength of correlation between these systems, taking note of any biases we observe. We'll conclude with some fun bracketology discussion, including analysis of the field of 68 and predictions for the Final Four and Champion.

Before getting started, I need to clarify that the Linear Algebra method presented by Dr. Chartier is actually based on research conducted by Dr. Wesley N. Colley: the Colley Matrix, originally explained in this white paper. Colley's initial use case was to produce an unbiased ranking for college football's BCS system.

Industry Indexes

The Ratings Percentage Index (RPI) is the primary measure used by the NCAA tournament selection committee. It produces a rating based on the team's winning percentage and strength of schedule. Specifically, the RPI formula considers a team's winning percentage (WP), the winning percentage of the team's opponents (OWP) and the winning percentage of the team's opponents' opponents (OOWP). The formula is given as:

RPI = (WP * 0.25) + (OWP * 0.50) + (OOWP * 0.25)

There are several nuances to computing WP, OWP and OOWP, including game weights for home versus road victories and the removal of redundant information (e.g., the OWP must be computed by excluding games played against the team being rated).
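The combination step itself is simple. Here is a minimal sketch (in Python, rather than the R used for the plots later in this post), assuming WP, OWP and OOWP have already been computed with the nuances above:

```python
# Hedged sketch of the RPI combination step. WP, OWP and OOWP are assumed
# to be pre-computed; the home/road game weights and opponent-removal
# details described above are deliberately omitted.
def rpi(wp, owp, oowp):
    """Ratings Percentage Index from its three winning-percentage parts."""
    return 0.25 * wp + 0.50 * owp + 0.25 * oowp

# A team winning 75% of its games, whose opponents win 60% and whose
# opponents' opponents win 55%:
print(round(rpi(0.75, 0.60, 0.55), 4))
```

Note that the formula weights the opponents' winning percentage twice as heavily as the team's own record; strength of schedule dominates.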

RPI is biased toward who a team plays and, other than the game result (win/loss), does not factor in how well a team plays or when the game was played. In the aggregate, it's simply a summation of the season -- giving similar weight to an overtime win in November's Maui Classic and a blowout win in the heat of March's conference tournaments.

Perhaps there is more to learn about a team's performance and its likelihood to succeed in the NCAA tournament than what is encoded in the RPI formula? To that end, numerous other rating systems have emerged, including Massey, Sagarin, Ken Pomeroy (KenPom) and ESPN's Basketball Power Index (BPI). While RPI only weights games based on location (home vs. road), these indexes (to varying degrees) also consider margin of victory, pace of the game and opponents' strength of schedule (not measured simply as winning percentage). BPI additionally considers whether a team was missing key players. ESPN has produced a nice summary comparison of these systems.

Custom Index Redux

In Part 1, we explained how to produce indexes based on Colley's Linear Algebra (LA) method and the use of the ubiquitous PageRank algorithm (PR). We generated two variants for each approach: generic (LAGeneric, PRGeneric) and weighted (LAWeighted, PRWeighted).

Similar to RPI, the LAGeneric approach does not reflect the quality of each win (how the game was played) but does consider the strength of the opponent through its linear equations. However, unlike RPI and the other indexes, LAGeneric doesn't consider the location of the game (home vs road). Indeed, LAGeneric rankings are simply the unbiased Colley Rankings.
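The Colley system is small enough to reproduce from scratch. The following is an illustrative Python sketch (not the code behind our actual rankings): it builds the Colley matrix from a list of (winner, loser) results for hypothetical teams and solves Cr = b by Gaussian elimination.

```python
# Illustrative Colley ratings: diagonal = 2 + games played, off-diagonal =
# -(games between the pair), b_i = 1 + (wins - losses)/2. Team names are
# hypothetical.
def colley_ratings(teams, games):
    n = len(teams)
    idx = {t: i for i, t in enumerate(teams)}
    C = [[2.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [1.0] * n
    for winner, loser in games:
        w, l = idx[winner], idx[loser]
        C[w][w] += 1; C[l][l] += 1          # total games on the diagonal
        C[w][l] -= 1; C[l][w] -= 1          # head-to-head counts off it
        b[w] += 0.5; b[l] -= 0.5            # b_i = 1 + (wins - losses)/2
    # Solve C r = b by Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(C[r][col]))
        C[col], C[piv] = C[piv], C[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = C[r][col] / C[col][col]
            for c in range(col, n):
                C[r][c] -= f * C[col][c]
            b[r] -= f * b[col]
    r = [0.0] * n
    for i in range(n - 1, -1, -1):
        r[i] = (b[i] - sum(C[i][j] * r[j] for j in range(i + 1, n))) / C[i][i]
    return dict(zip(teams, r))

# A beats B, B beats C, A beats C:
res = colley_ratings(["A", "B", "C"], [("A", "B"), ("B", "C"), ("A", "C")])
print(res)
```

In this three-team round robin the ratings come out to A = 0.7, B = 0.5, C = 0.3 -- nicely centered around the 0.5 prior every team starts with, which is the "unbiased" property Colley designed for.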

In our LAWeighted rankings, we introduced game weights for the following factors: location (home, road, neutral), margin of victory (with diminishing value for blowouts), game pace (a ten point win is worth more if the score was 50-40 than if it is 100-90), and game recency (March games matter much more than November games). LAWeighted rankings ought to be more correlated to KenPom and Sagarin than RPI.
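To make those factors concrete, here is a hypothetical weight function in the spirit of LAWeighted. The multipliers, the logarithmic margin curve and the 30-day recency scale are all invented for illustration; they are not the actual values used in my rankings.

```python
import math

# Hypothetical LAWeighted-style game weight. Every constant here is an
# illustrative assumption, not the author's actual tuning.
def game_weight(location, margin, total_points, days_until_march):
    w = {"home": 0.8, "road": 1.2, "neutral": 1.0}[location]  # road wins count more
    w *= 1 + math.log1p(margin) / 4       # margin, diminishing value for blowouts
    w *= 1 + margin / total_points        # pace: a 10-point win at 50-40 > at 100-90
    w *= 1 / (1 + days_until_march / 30)  # recency: March games matter most
    return w

# A 10-point road win in a 90-point game played in early March:
print(round(game_weight("road", 10, 90, 5), 3))
```

The log term captures the diminishing-returns idea: the jump from a 5- to a 15-point win matters more than the jump from a 25- to a 35-point blowout.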

The PageRank approach considers the season as a graph, with each team a node and each game an edge. The graph is directed by having the loser point to the winner (much like one web page referencing another). The PR recursive computation accumulates a node's rating based on the rating strength of the nodes that point to it. Applied to our NCAA example, the PR algorithm gives a team more credit for beating stronger teams. You'd expect to see PR's top-ranked teams with more wins against RPI Top 50 opponents and a likely bias toward major conferences. Conversely, losing to a poorly-rated team doesn't hurt a team's rating any more than losing to a highly-rated team. PR clearly values quality wins much more heavily than it punishes bad losses.

Like LAGeneric, the PRGeneric approach doesn't weight games. The PRWeighted approach applies similar weights as LAWeighted. Given the nature of the PR algorithm, it's hard to reason how well it will correlate with the other indexes.
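The loser-points-to-winner construction is straightforward to sketch. The toy Python example below illustrates the PRGeneric idea with power iteration; the 0.85 damping factor and the three-team schedule are illustrative assumptions, not the actual season graph.

```python
# PRGeneric sketch: every game adds a loser -> winner edge, and power
# iteration distributes each team's rating across its out-edges.
def pagerank_ratings(teams, games, damping=0.85, iters=100):
    n = len(teams)
    out_edges = {t: [] for t in teams}
    for winner, loser in games:
        out_edges[loser].append(winner)      # the loser "votes for" the winner
    rank = {t: 1.0 / n for t in teams}
    for _ in range(iters):
        new = {t: (1 - damping) / n for t in teams}
        for t, targets in out_edges.items():
            if targets:
                share = damping * rank[t] / len(targets)
                for tgt in targets:
                    new[tgt] += share
            else:                            # undefeated (dangling) node:
                for u in teams:              # spread its mass evenly
                    new[u] += damping * rank[t] / n
        rank = new
    return rank

# A beats B and C; B beats C. The undefeated team A should rank highest.
ranks = pagerank_ratings(["A", "B", "C"], [("A", "B"), ("B", "C"), ("A", "C")])
print(sorted(ranks, key=ranks.get, reverse=True))
```

Notice that C's rating comes only from the baseline redistribution -- losses carry no information about *who* you lost to, which is exactly the asymmetry described above.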

Index Comparison

The relationships among the various indexes are best shown with a scatter plot matrix (splom). In a splom, each index is represented in both the rows and columns of the graphical matrix, and the intersection of a particular row and column is an xyplot detailing the scatter relationship between the chosen indexes. A pure splom is symmetric, with the upper and lower halves identical. In addition, the diagonal of a splom relates indexes to themselves, which of course results in straight lines. Since no additional information comes from showing both halves, only one is usually given, and the cells along the diagonal are generally replaced by other statistical representations.

Consider the splom in Figure 1, produced with R's lattice package. There are nine indexes in the graph: the eight discussed so far and a ninth, Ensemble, derived from the original eight by a statistical procedure known as principal components analysis. The PC computations distill the commonality of the eight indexes into the Ensemble.
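To sketch how such an Ensemble can be distilled (the actual analysis used R's principal components routines; the toy index columns below are invented): standardize each index, find the leading eigenvector of their correlation matrix by power iteration, and score each team by its projection onto those loadings.

```python
# First-principal-component "Ensemble" sketch in stdlib Python.
def first_pc_scores(columns, iters=200):
    n, m = len(columns[0]), len(columns)
    # Standardize each index column to mean 0, sd 1
    std_cols = []
    for col in columns:
        mu = sum(col) / n
        sd = (sum((x - mu) ** 2 for x in col) / n) ** 0.5
        std_cols.append([(x - mu) / sd for x in col])
    # Correlation matrix of the standardized columns
    corr = [[sum(a[k] * b[k] for k in range(n)) / n for b in std_cols]
            for a in std_cols]
    # Power iteration for the leading eigenvector (the index loadings)
    v = [1.0] * m
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Each team's Ensemble score is its projection onto the loadings
    return [sum(v[j] * std_cols[j][k] for j in range(m)) for k in range(n)]

# Three highly correlated toy "indexes" rating five teams:
idx1 = [0.95, 0.90, 0.80, 0.60, 0.40]
idx2 = [0.93, 0.88, 0.82, 0.55, 0.45]
idx3 = [0.96, 0.91, 0.78, 0.62, 0.38]
scores = first_pc_scores([idx1, idx2, idx3])
print([round(s, 2) for s in scores])
```

Because the toy indexes agree so strongly, the first component captures nearly all their shared variance -- the same reason a single Ensemble can stand in for eight real rating systems.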

Figure 1

The matrix details the scatter relationships among the nine indexes. In addition, a smoothing curve in each off-diagonal cell further elucidates the relationships. Finally, the correlation coefficients, on a scale of -1 (strongly negative) to 1 (strongly positive), are noted. That all correlations are close to 1 corroborates the strong positive relationships among the indexes. BPI is so highly correlated to the Ensemble that it could probably serve as a proxy for it.
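The coefficients in the splom are ordinary Pearson correlations. A quick stdlib-Python sketch, using invented rating vectors in place of the real index columns:

```python
# Pearson correlation coefficient, as annotated in each splom cell.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Toy rating vectors standing in for two strongly correlated indexes:
rpi_like = [0.95, 0.88, 0.80, 0.61, 0.40]
bpi_like = [0.93, 0.90, 0.78, 0.60, 0.45]
print(round(pearson(rpi_like, bpi_like), 3))
```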

The diagonal cells of the matrix are density plots of the individual indexes. Think of these as frequency histograms with the bin size approaching zero and the area underneath equal to 1. Each density has somewhat of a bell shape, though PRGeneric and PRWeighted had to be log-transformed to reduce skewness.

All of the index combinations produce extremely strong correlations. With the weakest correlation at .92, it's pretty clear that any differentiation between the indexes enters splitting-hairs territory statistically. The now-mature BPI, Sagarin and KenPom indexes inter-correlate at or above .99. RPI correlates to that set of indexes at .96. We had expected LAGeneric to be most closely correlated to RPI, and indeed it is. We reasoned that LAWeighted would correlate better with BPI/KenPom/Sagarin than LAGeneric does; instead, it shows the opposite. Likely, my weighting algorithm is dissimilar to those used by BPI, KenPom and Sagarin. As expected, the PR rankings show the weakest correlation to all the other indexes. If you want to be a contrarian in your bracket pool, you might use the PageRank ratings.

Bracketology

Each year, 68 teams are divided into 4 regions. Each region has 16 seeds, with redundant seeds given to 2 teams that compete in a play-in game which the NCAA politely calls Round 1. Teams are selected and seeded by a selection committee that considers a team's season-long body of work, recent success and geography. The committee uses statistics to inform its decisions, including the RPI and Strength of Schedule (SOS); it does not use the other rating systems discussed in this blog. Of the 68 teams selected, 32 are given automatic bids for winning their conference's post-season tournament, the lone exception being the Ivy League, which still awards its bid based on the regular-season conference title. The remaining 36 teams are at-large selections made at the discretion of the committee.

The committee's work is scrutinized by the experts to determine if they chose the right at-large teams, properly seeded the selected teams and rewarded the best conferences with more at-large bids. Let's use the indexes to assess these questions.

Were the best 36 at-large teams selected?

Each year, the usual debate ensues about the last few teams in and out of the tournament. The last four at-large teams for 2014 are Iowa, Tennessee, NC State and Xavier. Were these the right ones? Were there any snubs? The following tables may shed some light. The IN column lists the top 4 excluded (snubbed) teams that each index suggests should have been in the tournament; the OUT column contains the teams they would replace.

| RPI IN | RPI OUT | BPI IN | BPI OUT | KenPom IN | KenPom OUT | Sagarin IN | Sagarin OUT |
|---|---|---|---|---|---|---|---|
| South. Miss | Iowa | SMU | NC State | SMU | NC State | SMU | NC State |
| Toledo | NC State | Utah | Nebraska | La. Tech | Colorado | La. Tech | Dayton |
| Missouri | Xavier | Maryland | Colorado | Utah | Dayton | Arkansas | Colorado |
| Minnesota | Kansas St | Arkansas | Kansas St | St John's | UMass | Utah | Nebraska |

 

| LAWeighted IN | LAWeighted OUT | LAGeneric IN | LAGeneric OUT | PRWeighted IN | PRWeighted OUT | PRGeneric IN | PRGeneric OUT |
|---|---|---|---|---|---|---|---|
| Toledo | Xavier | Missouri | Nebraska | La. Tech | Tennessee | California | Tennessee |
| South. Miss | NC State | Toledo | Xavier | South. Miss | Iowa | SMU | BYU |
| WI Green Bay | Nebraska | South. Miss | Iowa | California | Nebraska | Georgetown | Pittsburgh |
| Missouri | Iowa | SMU | NC State | SMU | NC State | Indiana | Iowa |

The indexes scream one consistent recommendation: NC State should be bounced and replaced by SMU. Conference USA gets no love, with both Louisiana Tech and Southern Miss appearing on multiple IN lists. The casualties for these additions would be Big 10 teams Nebraska and Iowa or the Big East's Xavier. Finally, three of the indexes suggest that Utah should have been given the nod over Pac 12 rival Colorado. Utah and Colorado split their season series, but Colorado had the better overall record. Perhaps the quality of Utah's schedule and wins mattered more? In the end, this could be much ado about nothing, as all of these teams are likely fodder for the top seeds of the tournament. Assuming that the selection committee seeded properly...

How well did the selection committee seed?

The following table shows the top four seeds by each index. The Committee's picks are bolded in black if the other rating systems mostly agree, green if the committee seeded higher than the consensus of the systems and red if lower. (I only considered the weighted rankings for LA and PR, ignoring generic.)

 

| Seed | Committee | RPI | BPI | KenPom | Sagarin | LAWeight | LAGeneric | PRWeight | PRGeneric |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Florida | Florida | Arizona | Arizona | Arizona | Wichita St | Wichita St | Wisconsin | Wisconsin |
| 1 | Arizona | Arizona | Florida | Louisville | Louisville | Florida | Florida | Arizona | Arizona |
| 1 | Wichita St | Kansas | Wichita St | Florida | Florida | Arizona | Arizona | UConn | Kansas |
| 1 | Virginia | Wichita St | Louisville | Virginia | Villanova | Villanova | Wisconsin | Kansas | UConn |
| 2 | Villanova | Villanova | Kansas | Wichita St | Virginia | Kansas | Villanova | UCLA | UCLA |
| 2 | Michigan | Wisconsin | Virginia | Villanova | Kansas | Iowa St | Kansas | Louisville | Michigan |
| 2 | Wisconsin | Iowa St | Villanova | Duke | Duke | Virginia | Iowa St | Florida | Duke |
| 2 | Kansas | Duke | Duke | Creighton | Michigan St | Wisconsin | Virginia | Wichita St | Creighton |
| 3 | Creighton | Virginia | Michigan St | Kansas | Wisconsin | Duke | Syracuse | Michigan | Iowa St |
| 3 | Duke | Creighton | Kentucky | Michigan St | Creighton | Syracuse | Duke | Creighton | Michigan St |
| 3 | Iowa St | Michigan | Wisconsin | Wisconsin | Michigan | San Diego St | San Diego St | New Mexico | Syracuse |
| 3 | Syracuse | New Mexico | Iowa St | VCU | Iowa St | Creighton | St Louis | Michigan St | Virginia |
| 4 | UCLA | VCU | Pittsburgh | Tennessee | Wichita St | VCU | Creighton | Virginia | Florida |
| 4 | San Diego St | San Diego St | Syracuse | Michigan | Ohio St | UCLA | Louisville | VCU | Louisville |
| 4 | Louisville | UCLA | Creighton | Syracuse | UCLA | Michigan | UCLA | Duke | Oregon |
| 4 | Michigan St | Syracuse | UCLA | UCLA | Okla St | Louisville | VCU | Texas | Texas |

All told, the indexes agreed with 9 of the selection committee's 16 top-4 seeds. Arizona and Florida were consensus 1 seeds. The indexes were not unanimous on Wichita State, but there was enough evidence to consider it a reasonable 1 seed. Virginia appears to be a reach, with the cold statistical computations recommending a 2 seed; Virginia's RPI, the one index used by the committee, suggested a 3 seed! It's clear that the committee felt a team which won both the regular-season and post-season tournament championships of the prestigious ACC had earned a 1 seed. Who should replace Virginia? The indexes lean toward Louisville.

The 2-line was fairly consistent, with Michigan getting a pass from the committee despite barely escaping with a victory against a young Illinois squad in the Big 10 quarterfinals and then losing to eventual 4 seed Michigan State in the Big 10 championship game. The indexes had trouble deciding Wisconsin's seed: the PageRank algorithm makes Wisconsin the overall tournament favorite based on its quality wins against strong opponents, while BPI, KenPom and Sagarin squarely place the Badgers as a 3. We'll agree with the RPI and weighted Linear Algebra placement of a 2 seed.

Other than Creighton, the indexes had the committee's 3 seeds all over the map. Iowa State's seeding is a challenge, as two indexes each had the Cyclones as a 2, as a 3, and as worse than a 4 seed. We'll average out this inconsistency and accept Iowa State as a 3 seed. The indexes place Duke on the 2-line, the committee somehow not agreeing despite Duke's perennial NCAA darling status. Syracuse gets a boost from the committee, with only one index considering the Orange worthy of a 3.

The 4 seeds in this year's brackets are loaded, with the committee undervaluing returning champion Louisville and a healthy, peaking Michigan State squad. In last night's ESPN Bracketology analysis, the expert panel unanimously selected Sparty as the 2014 champ. Well, let's not get ahead of ourselves just yet! The indexes suggest that Michigan State deserved a 3 seed, with Sagarin suggesting a 2. Louisville shows up as a 1 seed on BPI, KenPom and Sagarin. The fact that the RPI suggested Louisville as a 5 might have had something to do with the committee's decision. That and perhaps a desire to drive up ticket sales at the Indianapolis-based Midwest regional, just a short drive from campus!

So, did any teams get snubbed? 4 out of 6 indexes felt VCU deserved a 4 seed; perhaps losing in the Atlantic 10 championship caused the committee to drop the Rams to a 5? Bracketology experts were up in arms over Kentucky's 8 seed, and perhaps they are right: the indexes indicate a 5 or 6 seed would have been more apt, and ESPN's BPI had Kentucky as a 3!

Were the best conferences properly rewarded?

Figure 2

Another debate stems from whether the most competitive conferences were properly rewarded with tournament bids. This almost always boils down to a debate over which conference was most competitive and whether an exceptional team from a small or mid-major conference would fare better than a middle-of-the-pack team from a major power conference.

The R-generated strip plot in Figure 2 shows quartile distributions for each conference based on the Ensemble ranking discussed above. The strip plots mark the 0th, 25th, 50th, 75th and 100th percentiles, with the inter-quartile range depicted as the white space between the strips. The conferences are sorted by summing their 0th, 25th, 50th and 75th percentile ranks -- eliminating the worst 25% of teams in each conference from consideration.
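That sorting rule is easy to express in code. A Python sketch with invented conference ranks (lower rank = better team); the conference membership lists are toy data, not the actual 2014 Ensemble ranks:

```python
# Sum a conference's 0th/25th/50th/75th percentile ranks (nearest-rank
# method), dropping its worst quartile from consideration.
def quartile_score(ranks):
    s = sorted(ranks)
    n = len(s)
    return sum(s[min(int(q * (n - 1) / 100 + 0.5), n - 1)]
               for q in (0, 25, 50, 75))

# Toy national ranks for three conferences' teams:
conferences = {
    "Big 12": [3, 8, 12, 20, 25, 40, 55, 70, 90, 120],
    "Big 10": [6, 15, 22, 35, 48, 60, 75, 95, 110, 140],
    "A10":    [10, 18, 30, 45, 80, 100, 130, 150, 180, 200],
}
# Lower rank = better team, so sort ascending by the quartile sum
order = sorted(conferences, key=lambda c: quartile_score(conferences[c]))
print(order)
```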

The data validates what the experts have been saying: the Big 12 was this season's best conference. Notice how the Big 10 median is roughly the same as the Big 12's 75th percentile. This suggests that the top half of the Big 10 is no better than the top 75% of the Big 12. That is a huge disparity! Deservedly, the Big 12 received the most bids, 7, of any conference. So, did any other conference get unduly rewarded or snubbed? Consider the following table.

| Conference | Bids |
|---|---|
| Big 12 | 7 |
| Big 10 | 6 |
| Pac 12 | 6 |
| ACC | 6 |
| Big East | 4 |
| SEC | 3 |
| Atlantic 10 | 6 |
| American | 4 |

It would seem that the committee felt the Atlantic 10 was on a par with the Big 10, Pac 12 and ACC. As a result, the Big East and SEC received roughly half the bids of the A10. Was this fair? The strip plot helps us visualize why the answer is a resounding YES! Note that the A10's top 25% (4 teams) were ranked better than the 25th-percentile team from both the Big East and the SEC. Further, the two lowest dots in the A10 strip rank lower than the median ranking for both the Big East and SEC. In short, the top 6 A10 teams were ranked better than those conferences' top halves. By the same logic, the American conference's top quartile was also very strong, and the 5th-best American conference team appears better suited for a bid than the last 2 A10 teams. Guess which team that is? SMU, reinforcing its spot as this year's biggest selection committee snub.

Predictions

So what do the indexes say about who will take home the championship?

First, let's see if there are any bracket busters lurking in the data. Were there any double-digit seeds that could make a run to the Sweet Sixteen or beyond? Nada. The only major Sweet Sixteen upset is picked by PRGeneric, which has 9 seed Kansas State romping past 1 seed Wichita State in the second round before falling to Louisville in the regional semis. Given the lack of Sweet Sixteen surprises, it's reasonable to conclude that, based on the indexes, the seeding committee did a pretty good job.

So, how do the various systems predict this year's Final Four and Champion?

 

|  | RPI | BPI | KenPom | Sagarin | LAWeight | LAGeneric | PRWeight | PRGeneric |
|---|---|---|---|---|---|---|---|---|
| East | Villanova | Virginia | Virginia | Villanova | Villanova | Villanova | UConn | UConn |
| South | Florida | Florida | Florida | Florida | Florida | Florida | Kansas | Kansas |
| West | Arizona | Arizona | Arizona | Arizona | Arizona | Arizona | Wisconsin | Wisconsin |
| Midwest | Wichita St | Wichita St | Louisville | Louisville | Wichita St | Wichita St | Louisville | Michigan |
| Semi-finalist | Florida | Florida | Florida | Florida | Florida | Florida | UConn | Kansas |
| Semi-finalist | Arizona | Arizona | Arizona | Arizona | Wichita St | Wichita St | Wisconsin | Wisconsin |
| Champion | Florida | Arizona | Arizona | Arizona | Wichita St | Wichita St | Wisconsin | Wisconsin |

The industry indexes are fairly unanimous in selecting an Arizona-Florida showdown for the title. The PageRank models suggest that the Badgers will hoist the trophy, and our Linear Algebra models predict Wichita State to finish an historic 40-0 season as champs!

Well, as the saying goes, games are not won on paper! Enjoy the next few weeks and check back after the tournament, when I'll recap each system's performance, crowning winners and losers!
