What's Changed In Baseball? Home Run and Strikeout Analysis with Actian, Tableau and Pentaho

Thursday, August 22, 2013 - 16:00

I admit it. I am addicted to baseball stats. Although in my mid 40’s, every Christmas morning I am a kid again, anxiously awaiting the unwrapping of gifts. I know my wife will give me the new season’s Bill James Handbook. It’s always the first one I open. I fidget while the rest of the ritual unwrapping is done…one…gift…at…a…time. Finally, I can take time off from being a dad and husband and bury myself into a morass of baseball data. It’s the same sensation I felt as a 10-year-old reading the backs of 1000’s of baseball cards in my collection. Nirvana.

In case you don’t know, Bill James is the father of sabermetrics, the analysis of baseball statistics to measure activity and performance. Its data science applied to baseball. Given that I am mad about baseball and data science is what we do at Inquidia, I figured perhaps I ought to find some data and play ball.   

My quest for baseball data led me to Retrosheet where, amongst other things, you can download play-by-play datasets containing records for every “event” from nearly all regular season games since 1948. (There are a few missing or incomplete games and there are new seasons being added all the time.) I used Retrosheet’s software tools to crunch 64 seasons (1948 – 2012) of data. The tool generated one file per season containing event records of nearly 100 fields providing such details as who was batting, pitching, playing RF, running on second base, pitch count, result of the at bat, etc. All totaled, I generated about 9.7 million delimited records requiring almost 2.8GB of disk. Finally, I downloaded Retrosheet supplied lookup files containing players, coaches, ballparks and franchise/team information, knowing this would be helpful if I wanted to display player and team names as I analyze the data. Retrosheet has terms of use which can be found here.

Next, I needed a database that I could use to query this data on my Windows laptop. I thought I’d give Actian’s Vectorwise a try. Of course, Vectorwise is meant for SMP boxes and much larger datasets. It’s not designed for 10 million records on a multi-core laptop with a middle of the road hard drive. That said, each record in my dataset contains 100 fields. Since Vectorwise is a column store database it ought to be more efficient at doing OLAP-y queries than a row-store like MySQL. Beyond the benefits of column-level data access, Vectorwise amps up query performance through data compression and parallel vector processing. 

I used Pentaho Data Integration (PDI) ETL software to load Vectorwise with the raw event and lookup data. (Aside:  I could have used the PDI community edition, called Kettle, as all of the needed capabilities are readily available.) I then designed a star schema centered on hitting statistics. I loaded the central fact table with an “insert into … select …” statement from my raw events data -- focused on the batter, with one record per plate appearance. The new fact table contained 9.4 million rows with only 50 columns, mostly integers. In the end, I had “raw” data available in three formats:  flat file event data, Vectorwise event data and Vectorwise hitting data in a star schema.

Now to the fun part:  visual data analysis. I chose Tableau because of the richness of its graphics, the ease with which I can get connected to my data and the intuitiveness of its user interface. For starters, I decided to investigate the correlation between Home Runs (HR) and strikeouts (K). Most baseball fans intuit that home run hitters tend to strikeout more so this correlation by itself is not too interesting. What is more interesting is to see how this correlation has changed over time.  

The following chart compares players from the 1950 and 2012 seasons. Each circle is a player who had at least 300 plate appearances. Players from 2012 are orange while 1950 are blue. The X-axis is HRs and the Y-axis is Ks. A degree-2 polynomial trend line is drawn (both with very small p-values) for each population.

This chart shows a stark contrast between the players of 1950 and 2012, confirming something my dad and grandfather always told me about how the old-timers may have hit fewer home runs, but also struck out less. Why? Well, there may be myriad reasons including the nature of pitching, financial incentives for home runs, and players not choking up when they get 2 strikes in the count. I suspect you can use this graphic to start a healthy debate at your next BBQ….You’re welcome!

In case you are wondering who the boppers and strikeout kings were from each season, I created another Tableau image extract which highlights a few players. 

Note that Adam Dunn’s numbers are stunning. He struck out in over a third of his plate appearance while 37% of his hits were home runs. As expected, MVP and Triple Crown winner, Miguel Cabrera led the majors with 44 HRs but struck out fewer than 100 times – a remarkable stat for a modern day ballplayer. Speaking of MVP, of the top 2012 HR hitters, Josh Hamilton and Ryan Braun were top 5 vote getters. Back in 1950, Al Rosen led the majors with 37 HR and wasn’t even in the 15 for the MVP vote. Ralph Kiner had fewer Ks per HR and ended up 5th in MVP voting. Luke Easter was the 1950 version of Adam Dunn (well kinda, he struck out at half the rate of Dunn and had a healthy .280 batting average). Easter led the league with 95 strikeouts. Look at the number of 2012 players (including Miguel Cabrera) who had more K’s than him!  How times have changed.

Contact us today to find out how Inquidia can show you how to collect, integrate and enrich your data. We do data. You can, too.

Would you like to know more?

Sign up for our fascinating (albeit infrequent) emails. Get the latest news, tips, tricks and other cool info from Inquidia.