Last week I participated in a webinar on machine learning with Ruby. The specific application revolved around sentiment analysis using support vector machines, via the Ruby bindings for the libsvm library.
I'm a big Ruby fan, and wrote glowingly about it for Information Management almost five years ago. Extending the scripting language tradition of Perl and Python, Ruby's ideal for many of the core data-munging challenges of data science. The combination of Ruby data types with methods, along with blocks and iterators, can make for very powerful – and terse – code, as this snippet from a portfolio returns program I wrote five years ago illustrates:
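The original snippet is not reproduced here. As a stand-in, a minimal sketch of the same idea (blocks and iterators applied to plain hashes and arrays) might look like the following. The tickers, prices, and helper names are hypothetical and are not taken from the original program.

```ruby
# Hypothetical month-end prices per ticker (made-up numbers for illustration).
prices = {
  "AAA" => [100.0, 110.0, 99.0],
  "BBB" => [50.0, 50.0, 55.0]
}

# Percent change from one price to another, rounded to two decimals.
def pct_change(from, to)
  ((to / from - 1.0) * 100).round(2)
end

# Period-over-period returns: each_cons(2) yields consecutive price pairs,
# and the block turns each pair into a percent return.
returns = prices.transform_values do |series|
  series.each_cons(2).map { |prev, curr| pct_change(prev, curr) }
end

# Cumulative return over the whole window, first price to last.
cumulative = prices.transform_values { |series| pct_change(series.first, series.last) }

returns.each { |ticker, rets| puts "#{ticker}: #{rets.inspect}" }
cumulative.each { |ticker, ret| puts "#{ticker} cumulative: #{ret}%" }
```

The point is less the arithmetic than the shape of the code: no explicit loops or index variables, just methods on the data structures taking blocks.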
And Ruby's more than just a scripting language, as Rails web developers will assuredly attest.
Alas, when it came time for the webinar author/presenter to walk through code, his screen-sharing technology failed, and the last 30 minutes of the allotted hour were given over to extended ML and Ruby question and answer.
For me, this wasn't such a bad thing, since I got to ask a few questions and learned quite a bit about the Ruby ecosystem in the process. One question pertained to the use of RinRuby, “a Ruby library that integrates the R interpreter in Ruby, making R's statistical routines and graphics available within Ruby.”
Though RinRuby's been around for a number of years, I hadn't test-driven it. I actually installed the library and got it up and running before the webinar ended. It turns out RinRuby's one of at least three R-Ruby libraries available, the others being Rserve and RSRuby. I later installed Rserve and compared its functionality with RinRuby's.
Each of the R-Ruby libraries provides capabilities to execute blocks of R code within a Ruby program, as well as push/pull data between R and Ruby variables. There are limitations on the complexity of data types supported.
I ended up coding a modification to the “hello world” illustration cited by RinRuby authors Dahl and Crawford. The program reads the text of the Gettysburg Address, breaks it into words, and counts word occurrences – all in Ruby. Once the counts are completed, the Ruby arrays are “assigned” to R variables, where a subsequent “here document” R script graphs word frequencies. An R-calculated correlation coefficient is finally returned to Ruby. My adaptation of this simple program is as follows:
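The original listing is not reproduced here. A minimal sketch of the steps described, under stated assumptions, might look like this. It uses only the opening sentence of the address rather than the full text, and the hand-off to R is wrapped in a method that is defined but not called, so the Ruby side runs without R installed. The method name `plot_in_r` is hypothetical, and the choice of statistic (correlation between word length and frequency) is an assumption, since the text does not say which correlation was computed.

```ruby
# Opening sentence of the Gettysburg Address, standing in for the full text.
text = <<~EOT
  Four score and seven years ago our fathers brought forth on this continent,
  a new nation, conceived in Liberty, and dedicated to the proposition that
  all men are created equal.
EOT

# Break the text into lowercased words and tally occurrences in a hash.
def word_counts(text)
  text.downcase.scan(/[a-z']+/).each_with_object(Hash.new(0)) do |word, tally|
    tally[word] += 1
  end
end

tally = word_counts(text)

# Flatten the hash into two parallel arrays, most frequent words first.
# Simple arrays like these are what move most easily across to R.
words, counts = tally.sort_by { |word, n| [-n, word] }.transpose

# Optional hand-off to R via RinRuby. Requires R and the rinruby gem,
# so it is defined here but not invoked.
def plot_in_r(words, counts)
  require "rinruby"                       # defines the constant R
  R.assign "words", words                 # Ruby strings -> R character vector
  R.assign "counts", counts.map(&:to_f)   # Ruby numbers -> R numeric vector
  R.eval <<~RSCRIPT
    barplot(counts, names.arg = words, las = 2, main = "Word frequencies")
  RSCRIPT
  # Assumed statistic: correlation of word length with frequency.
  R.pull "cor(nchar(words), counts)"
end

puts "#{tally.size} distinct words, #{counts.sum} total"
```

With R and the gem available, `plot_in_r(words, counts)` would draw the bar chart and return the correlation to Ruby as a float.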
I was also able to replicate and extend this example with the Ruby Rserve library. My biggest frustration with both is the difficulty of moving complex data types such as data frames and hashes between the two languages.
As much as I love the elegance of Ruby and enjoy programming in it, I can't recommend it as a foundational data science platform over Python. If you're a Rails developer or have programmed in Ruby for years, it may make sense. But the Python analytics ecosystem is much more advanced, with a portfolio of productivity-enhancing libraries like NumPy, SciPy, pandas, statsmodels, scikit-learn, et al. not found in Ruby. Indeed, the array orientation of these libraries makes Python programming look much more like R than Ruby.
The good news, as always, is that the enhancement of these open source platforms is great for the data science world.