Later that morning I received an incendiary email from Matt Dowle, author of the R “data.table” package, basically questioning my sanity for settling on the inefficient read.csv kluge rather than the library's speedy fread function for delimited text data.
I'm a big fan of 'data.table', and said so in a dedicated blog post three years ago; I also mentioned the package favorably last summer. Truth be told, though, I was unaware of the newest goodies – and was excited about a potentially better solution for my script.
Once Dowle realized I wasn't conspiring against data.table, our correspondence became much more collaborative. He even pointed me to the latest data.table pre-release code to install for testing. And though this version of fread is “Not for production use yet”, I gladly made the changes and re-timed the scripts.
The performance benefits of just adding fread were substantial. The raw data load time decreased from 12.5 minutes to 2, about the same as Python-pandas. The entire R script could then execute in just under 4 minutes, much closer to Python-pandas, which completed in 2 minutes, 10 seconds.
Though my intention in posting the comparison code was more to show Python-R similarities than to optimize R performance, Dowle suggested that my loop through the input files using R's rbind function was costly and should be replaced with data.table's more efficient rbindlist. I disagreed, feeling the six-second savings that change would have afforded wasn't worth the diminished readability and comparability with the Python version. I was reminded of programming grandmaster Brian Kernighan's admonition on code development from many years ago: don't sacrifice clarity for small gains in efficiency.
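The same tradeoff exists on the pandas side: row-binding inside the loop re-copies the accumulated frame on every pass, while collecting the pieces in a list and binding once at the end (the analogue of rbindlist) does a single copy. A minimal sketch, with small stand-in frames in place of the real per-file reads:

```python
import pandas as pd

# Stand-ins for the data frames read from each input file
frames = [pd.DataFrame({"x": [i, i + 1]}) for i in range(3)]

# Costly pattern: repeated row-binding, copying the accumulated frame each pass
acc = pd.DataFrame()
for f in frames:
    acc = pd.concat([acc, f], ignore_index=True)

# Cheaper pattern: bind once at the end, analogous to data.table's rbindlist
once = pd.concat(frames, ignore_index=True)

assert acc.equals(once)  # same result, fewer copies
```

With only a handful of input files the difference is seconds, which is why readability won out here.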
With initial read times now similar between R and Python-pandas, the remaining difference in execution speed boiled down to the calculation of the final data frame variables using numpy's where function versus R's ifelse. Indeed, the similarity of these functions had been a motivator for writing the blog. Unfortunately, their performance was quite dissimilar: the Python where calcs ran in 10 seconds vs. almost 2 minutes for R's ifelse.
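The two functions are near drop-in replacements for each other: R's ifelse(test, yes, no) and numpy's where(cond, x, y) both evaluate an elementwise conditional over a vector. A small numpy illustration, with a hypothetical schl column of years-of-schooling codes:

```python
import numpy as np

schl = np.array([3, 9, 14, 16])  # hypothetical years-of-schooling codes

# np.where(cond, value_if_true, value_if_false) mirrors R's ifelse(test, yes, no)
hs_or_more = np.where(schl >= 12, "yes", "no")
# elementwise result: 'no', 'no', 'yes', 'yes'
```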
So Dowle and I looked for alternatives. He proposed cut, a function available in both R and pandas for binning continuous data. cut is blazingly fast and would have reduced the R calc timings from 2 minutes to under 30 seconds – and the Python timings from 10 seconds to 5. But while cut works for this well-behaved data, it's not a general-purpose recoding solution and can in fact produce incorrect results with wayward input.
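cut assigns each value to a bin by interval, which is why it is so fast – but also why wayward input can slip through silently. A pandas sketch with hypothetical bin edges and labels:

```python
import pandas as pd

schl = pd.Series([0, 5, 11, 12, 16, 99])  # 99 is a wayward, invalid code

bins = [-1, 11, 15, 16]                     # hypothetical interval edges
labels = ["lt_hs", "some_college", "degree"]
binned = pd.cut(schl, bins=bins, labels=labels)

# Values outside the edges (here 99) become NaN rather than raising an error,
# so bad input produces silently missing categories instead of a warning.
print(binned.isna().sum())  # 1
```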
I was able to locate a function called, appropriately enough, recode, from the car package that appears to be a suitable replacement for ifelse in R – providing most of the functionality and lowering the runtime by 50%. Contrast the following original R ifelse code with recode.
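The original R code blocks are not reproduced here, but the ifelse threshold style translates directly to chained numpy where calls. A sketch in pandas/numpy terms, with hypothetical cutoffs and category labels:

```python
import numpy as np

schl = np.array([0, 8, 12, 14, 16])  # integer codes in 0..16

# Nested np.where mirrors nested R ifelse with "<=" thresholds: each test
# assumes the data is well behaved, since any stray value simply falls
# through to the final bucket.
educ = np.where(schl <= 11, "lt_hs",
       np.where(schl <= 15, "some_college", "degree"))
```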
The schl variable is coded as integers in the range 0 to 16. The ifelse code “knows” that schl is well behaved; if it weren't, the recoded categories might well have included errors. In retrospect, I should have coded the ifelse with “in”s rather than “<=”s.
The recode snippet, in contrast, looks for each valid value of schl; those outside are lumped into ‘8.O’. The code in this example is tighter – and also faster than ifelse by a factor of three.
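A pandas/numpy analogue of car's recode is np.select with an explicit default: each valid code is matched by a membership test, and anything outside the listed values lands in the catch-all bucket, much as recode's else clause does. The conditions and labels below are hypothetical:

```python
import numpy as np

schl = np.array([2, 12, 16, 99])  # 99 is a stray, invalid code

conditions = [
    np.isin(schl, range(0, 12)),   # membership tests, not "<=" thresholds
    np.isin(schl, range(12, 16)),
    schl == 16,
]
choices = ["lt_hs", "some_college", "degree"]

# Anything not matched by a condition falls into the default bucket,
# so invalid codes are flagged instead of silently misclassified.
educ = np.select(conditions, choices, default="other")
# result: 'lt_hs', 'some_college', 'degree', 'other'
```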
The net-net of the re-programming exercise? Two scripts that performed the same tasks on over 15M initial records and finished quite close in performance – just under 3 minutes for R and about 2 minutes, 10 seconds for Python-pandas. I'm impressed.
Takeaways from the entire experience? Both R and Python-pandas-numpy are serious tools for data analysis. And for each platform, the contributions from the open source communities outside core development make big differences in capabilities.
Python-pandas-numpy and R aficionados are also quite competitive. My Python colleagues often bemoan R, while R friends dis Python-pandas. This competition is good for the data analysis world. Me? Though I’ve probably used R more in the last 12 years, I’m rooting for both – anything to make my life easier.
Originally published in Information Management.