Split-Apply-Combine with R and Python – Part 1

Tuesday, January 19, 2016 - 15:45

The split-apply-combine pattern of analytic computation is prominent in the tool-chest of many data scientists. Fundamentally, s-a-c is quite simple: “you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together.”

s-a-c is often used for analysis of large data sets in which subsets are created by a combination of one or more grouping/dimensional variables. An example might be a census file of household records organized by state, where it makes sense to aggregate analyics/models at the state level.

Hadoop map-reduce programmers are certainly familiar with split-apply-combine: “In map-reduce, the map step corresponds to split and apply, and reduce corresponds to combine” Indeed, s-a-c is a tried and true approach to handling many big data problems, and is an important strategy for memory-constrained statistical engines like R.

There are now several ways to implement split-apply-combine in R. Historically, there've been the core “apply” functions we all know and hate. Mercifully, a number of add-on packages that make s-a-c more comprehensible have been introduced over the years. The remainder of this article focuses on showing one such R s-a-c approach in action.

By choice, many of the R functions used in this demo derive from the outstanding packages architected by project luminary Hadley Wickham. Indeed, dplyr, tidyr, ggplot, readr, and lubridate were all developed by Hadley.

The problem I investigate is over-time index performance of stock portfolios managed by Russell Investments. I construct an historical daily final index level file for 26 Russell portfolios such as the 1000, 2000, and 3000. The data are “stacked” in a csv file by portfolio, with additional attributes date, index level with dividends re-invested, and level with dividends not re-invested. As of the downloading run of 2016-01-18, there were over 130,000 daily final level records starting in 1995.

The dplyr s-a-c functions used in this exercise are select, filter, mutate, summarize, group_by and arrange. %>% serves as a pipe operator that enables chaining of operations. From tidyr, the functions utilized are gather, which handles the “melt” operation (columns to rows), and separate, which performs “pivot” or “cast” (rows to columns).

So let's take a look at the html markdown doc below for the R script analyzing the latest Russell portfolio index data. RStudio is development environment, and the rmarkdown package is used to present the code/results. I'll comment on each of the nine code/output sections in turn.

0) assigns script-wide options while 1) loads the relevant R libraries and sets the working directory. 2) details a simple print function used in subsequent steps to show the data.

3) loads the raw Russell csv file using the readr package. Note the “chaining” with the %>% operator that progresses from initial read to “mutate” that refines attributes pdate and portfolio, to finally sorting “arrange” of the data.frame by portfolio and pdate. There are 130,071 records with the 4 attributes portfolio, pdate, idxwodiv and idxwdiv in russellstack from the most recent data build. idxwodiv/idxwdiv represent portfolio index values without and with dividends re-invested.

Just to complicate our split-apply-combine lives, 4) “melts” russellstack and creates a longer, skinnier russellmelt data.frame/data.table with a new attribute indextype that indicates either idxwodiv or idxwdiv, with value attribute indexvalue. The columns have been replaced by rows, then arranged in order by portfolio and pdate. After filtering out invalid indexvalues, there are still almost twice as many records in russellmelt than in russellstack – as would be expected.

In 5), we apply the split-apply-combine principles to demonstrate how to create new variables pctch, grdollar, grdollarend1 and grdollarend2 across each grouping of portfolio and indextype. We first initialize the grouping variable grp by portfolio and indexype. For each unique combination of grp, grdollar is a normalized calculation that divides the entire indexvalue vector by its first value. With that, the performance of each portfolio can be compared. In contrast, each grdollarend1 and grdollarend2 calculation is unique just once per group; that value is then “broadcast” across the remainder of group records. printfnct shows the details.

6) summarizes calculations by grouping of portfolio and pdate. Each of the calculations produces a single value per group, in contrast to the mutate of 5), which returns a value for each record in the group. The count function produces frequencies/crosstabs, while sqldf provides SQL access to data.frames/data.tables. The summarize function allows us to invoke aggregate functions at each portfolio, indexvalue occurrence. Note that we can define our own function (sharpe) to be called by summarize.

7) illustrates how dplyr-produced data.frames might be used for analysis. Given a subset of portfolios, an indextype of one or both of idxwodiv and idxwdiv, and a start date, a computation on russellmelt is fed to ggplot to produce a graph of the growth of an initial $1 investment over time. In this case,, the Midcap portfolio, now at about $2.75 from a $1 investment in early 2009, has outperformed its 1000 and 2000 kin. The steep early 2016 decline is troubling.

Like 7), 8) first subsets russellmelt by portfolio, pdate, and indextype, then computes a daily percent change variable, pctch, “pivots” the data by portfolio using the spread function, and deploys the auxiliary ggplot package GGally to graph pairwise scatterplots of pctch among the selected portfolios. Note the strong linear relationship between the 1000 and 3000, not surprising given that both are competing proxies for the larger US stock market.

Finally, 9) consolidates the data handling and graphics for growth of $1 portfolio comparisons into a re-usable function that's invoked three times. The first graphs shows the benefits of long-term investing, while the second shows the benefits of dividend re-investment. The final chart confirms the horrendous early-2016 market performance.

Hopefully, this exercise gives readers a sense of split-apply-combine using the dplyr and tidyr packages. I could just have easily demoed the computations using the data.table library, or the core R apply functions. Even though they're not part of core R, I prefer working with the dplyr and data.table packages for their coherency, ease of use, and performance.

Next time, I'll run through pretty much the same Russell portfolio exercise using Python and it's powerful data management add-on package, Pandas.



split-apply-combine demonstration of Russell stock portfolio data using R,dplyr,tidyr and ggplot

Tue Jan 19 08:19:55 2016

0) set global options.

options(datatable.print.topn=200)
options(scipen=10)
options(width=2000)

knitr::opts_chunk$set(fig.width=12, fig.height=9,tidy=FALSE)

1) load libraries and set the working directory.

library(data.table)
library(dplyr)
library(reshape2)
library(tidyr)
library(lubridate)
library(readr)
library(ggplot2)
library(GGally)
library(ggthemes)
library(quantmod)
library(gridExtra)
library(sqldf)

setwd("d:/data/russell/2015")

2) helper print function.

printfnct <- function(dt)
{
    print(dim(dt))
    print(str(dt))
    print(head(dt))
    print(tail(dt))
}

3) load the russellstack data.frame/data.table.

russellstack <- read_csv("russellstack.csv", 
    col_types=cols(col_character(), col_character(), col_double(), col_double()))  %>% 
        mutate(pdate=ymd(pdate), portfolio=as.factor(portfolio)) %>% data.table() %>%
        arrange(portfolio, pdate)

printfnct(russellstack)
## [1] 130071      4
## Classes 'data.table' and 'data.frame':   130071 obs. of  4 variables:
##  $ portfolio: Factor w/ 26 levels "1000","1000Growth",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ pdate    : POSIXct, format: "1995-01-03" "1995-01-04" "1995-01-05" "1995-01-06" ...
##  $ idxwodiv : num  469 471 471 471 472 ...
##  $ idxwdiv  : num  887 890 890 891 892 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
## NULL
##    portfolio      pdate idxwodiv idxwdiv
## 1:      1000 1995-01-03   469.15  886.87
## 2:      1000 1995-01-04   470.85  890.35
## 3:      1000 1995-01-05   470.55  889.89
## 4:      1000 1995-01-06   471.31  891.32
## 5:      1000 1995-01-09   471.56  891.91
## 6:      1000 1995-01-10   472.47  893.63
##    portfolio      pdate idxwodiv  idxwdiv
## 1:  WorldCap 2016-01-11 1506.769 2127.716
## 2:  WorldCap 2016-01-12 1510.513 2133.064
## 3:  WorldCap 2016-01-13 1494.028 2109.924
## 4:  WorldCap 2016-01-14 1497.944 2115.477
## 5:  WorldCap 2016-01-15 1470.449 2076.695
## 6:  WorldCap 2016-01-18 1463.606 2067.044

4) create the russellmelt data.frame/data.table.

russellmelt <- russellstack %>% gather(indextype, indexvalue, idxwodiv, idxwdiv) %>% 
    filter(indexvalue>0) %>% data.table() %>% arrange(portfolio, indextype, pdate)

printfnct(russellmelt)
## [1] 260124      4
## Classes 'data.table' and 'data.frame':   260124 obs. of  4 variables:
##  $ portfolio : Factor w/ 26 levels "1000","1000Growth",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ pdate     : POSIXct, format: "1995-01-03" "1995-01-04" "1995-01-05" "1995-01-06" ...
##  $ indextype : Factor w/ 2 levels "idxwodiv","idxwdiv": 1 1 1 1 1 1 1 1 1 1 ...
##  $ indexvalue: num  469 471 471 471 472 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
## NULL
##    portfolio      pdate indextype indexvalue
## 1:      1000 1995-01-03  idxwodiv     469.15
## 2:      1000 1995-01-04  idxwodiv     470.85
## 3:      1000 1995-01-05  idxwodiv     470.55
## 4:      1000 1995-01-06  idxwodiv     471.31
## 5:      1000 1995-01-09  idxwodiv     471.56
## 6:      1000 1995-01-10  idxwodiv     472.47
##    portfolio      pdate indextype indexvalue
## 1:  WorldCap 2016-01-11   idxwdiv   2127.716
## 2:  WorldCap 2016-01-12   idxwdiv   2133.064
## 3:  WorldCap 2016-01-13   idxwdiv   2109.924
## 4:  WorldCap 2016-01-14   idxwdiv   2115.477
## 5:  WorldCap 2016-01-15   idxwdiv   2076.695
## 6:  WorldCap 2016-01-18   idxwdiv   2067.044
print(sum(russellmelt$indexvalue==0))
## [1] 0

5) create new russellmelt variables per group. demo broadcasting.

grp <- russellmelt %>% group_by(portfolio, indextype) %>% select(portfolio, pdate, indextype, indexvalue) 

russellmelt <- grp %>% 
    mutate(pctch=Delt(indexvalue), grplen=n(), grdollar=indexvalue/indexvalue[1], 
        grdollarend1=indexvalue[grplen]/indexvalue[1], grdollarend2=prod(1+pctch,na.rm=TRUE)) %>%
        data.table() %>% arrange(desc(grdollarend1))

printfnct(russellmelt)
## [1] 260124      9
## Classes 'data.table' and 'data.frame':   260124 obs. of  9 variables:
##  $ portfolio   : Factor w/ 26 levels "1000","1000Growth",..: 21 21 21 21 21 21 21 21 21 21 ...
##  $ pdate       : POSIXct, format: "1995-01-31" "1995-02-28" "1995-03-31" "1995-05-31" ...
##  $ indextype   : Factor w/ 2 levels "idxwodiv","idxwdiv": 2 2 2 2 2 2 2 2 2 2 ...
##  $ indexvalue  : num  289 303 309 328 329 ...
##  $ pctch       : num  NA 0.05047 0.01881 0.06219 0.00198 ...
##  $ grplen      : int  5378 5378 5378 5378 5378 5378 5378 5378 5378 5378 ...
##  $ grdollar    : num  1 1.05 1.07 1.14 1.14 ...
##  $ grdollarend1: num  8.97 8.97 8.97 8.97 8.97 ...
##  $ grdollarend2: num  8.97 8.97 8.97 8.97 8.97 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
## NULL
##      portfolio      pdate indextype indexvalue        pctch grplen grdollar grdollarend1 grdollarend2
## 1: MidcapValue 1995-01-31   idxwdiv   288.8370           NA   5378 1.000000     8.968305     8.968305
## 2: MidcapValue 1995-02-28   idxwdiv   303.4150 0.0504713731   5378 1.050471     8.968305     8.968305
## 3: MidcapValue 1995-03-31   idxwdiv   309.1220 0.0188092217   5378 1.070230     8.968305     8.968305
## 4: MidcapValue 1995-05-31   idxwdiv   328.3460 0.0621890386   5378 1.136786     8.968305     8.968305
## 5: MidcapValue 1995-06-01   idxwdiv   328.9968 0.0019821347   5378 1.139040     8.968305     8.968305
## 6: MidcapValue 1995-06-02   idxwdiv   329.3022 0.0009281518   5378 1.140097     8.968305     8.968305
##         portfolio      pdate indextype indexvalue        pctch grplen grdollar grdollarend1 grdollarend2
## 1: MicrocapGrowth 2016-01-07  idxwodiv   1485.879 -0.035936787   4050 1.462495     1.382606     1.382606
## 2: MicrocapGrowth 2016-01-08  idxwodiv   1455.111 -0.020706755   4050 1.432212     1.382606     1.382606
## 3: MicrocapGrowth 2016-01-11  idxwodiv   1431.356 -0.016325391   4050 1.408830     1.382606     1.382606
## 4: MicrocapGrowth 2016-01-12  idxwodiv   1435.682  0.003022308   4050 1.413088     1.382606     1.382606
## 5: MicrocapGrowth 2016-01-13  idxwodiv   1377.052 -0.040837625   4050 1.355381     1.382606     1.382606
## 6: MicrocapGrowth 2016-01-14  idxwodiv   1404.713  0.020086927   4050 1.382606     1.382606     1.382606

6) do some aggregations/summarizations on russellmelt.

##
# how many market days for each portfolio?
##
russellmelt %>% count(portfolio, indextype) %>% arrange(desc(n), portfolio)
## Source: local data table [52 x 3]
## Groups: portfolio
## 
##     portfolio indextype     n
##        (fctr)    (fctr) (int)
## 1        1000   idxwdiv  5477
## 2        1000  idxwodiv  5477
## 3  1000Growth   idxwdiv  5478
## 4  1000Growth  idxwodiv  5478
## 5   1000Value   idxwdiv  5478
## 6   1000Value  idxwodiv  5478
## 7        2000   idxwdiv  5478
## 8        2000  idxwodiv  5478
## 9  2000Growth   idxwdiv  5478
## 10 2000Growth  idxwodiv  5478
## ..        ...       ...   ...
q <- "select count(*), portfolio, indextype from russellmelt group by portfolio, indextype order by portfolio, indextype"
sqldf(q)
##    count(*)      portfolio indextype
## 1      5477           1000   idxwdiv
## 2      5477           1000  idxwodiv
## 3      5478     1000Growth   idxwdiv
## 4      5478     1000Growth  idxwodiv
## 5      5478      1000Value   idxwdiv
## 6      5478      1000Value  idxwodiv
## 7      5478           2000   idxwdiv
## 8      5478           2000  idxwodiv
## 9      5478     2000Growth   idxwdiv
## 10     5478     2000Growth  idxwodiv
## 11     5478      2000Value   idxwdiv
## 12     5478      2000Value  idxwodiv
## 13     5478           2500   idxwdiv
## 14     5478           2500  idxwodiv
## 15     3556     2500Growth   idxwdiv
## 16     3556     2500Growth  idxwodiv
## 17     3556      2500Value   idxwdiv
## 18     3556      2500Value  idxwodiv
## 19     5478           3000   idxwdiv
## 20     5478           3000  idxwodiv
## 21     5378     3000Growth   idxwdiv
## 22     5375     3000Growth  idxwodiv
## 23     5378      3000Value   idxwdiv
## 24     5375      3000Value  idxwodiv
## 25     5101         Global   idxwdiv
## 26     5101         Global  idxwodiv
## 27     5101   GlobalGrowth   idxwdiv
## 28     5101   GlobalGrowth  idxwodiv
## 29     5101    GlobalValue   idxwdiv
## 30     5101    GlobalValue  idxwodiv
## 31     4052       Microcap   idxwdiv
## 32     4052       Microcap  idxwodiv
## 33     4050 MicrocapGrowth   idxwdiv
## 34     4050 MicrocapGrowth  idxwodiv
## 35     4051  MicrocapValue   idxwdiv
## 36     4051  MicrocapValue  idxwodiv
## 37     5378         Midcap   idxwdiv
## 38     5378         Midcap  idxwodiv
## 39     5378   MidcapGrowth   idxwdiv
## 40     5375   MidcapGrowth  idxwodiv
## 41     5378    MidcapValue   idxwdiv
## 42     5375    MidcapValue  idxwodiv
## 43     5377         Top200   idxwdiv
## 44     5377         Top200  idxwodiv
## 45     5378   Top200Growth   idxwdiv
## 46     5375   Top200Growth  idxwodiv
## 47     5378    Top200Value   idxwdiv
## 48     5375    Top200Value  idxwodiv
## 49     3556   Top50MegaCap   idxwdiv
## 50     3556   Top50MegaCap  idxwodiv
## 51     5101       WorldCap   idxwdiv
## 52     5101       WorldCap  idxwodiv
sharpe <- function(pct)
{
    return(mean(pct,na.rm=T)/sd(pct,na.rm=T))
}

grp <- russellmelt %>% group_by(portfolio, indextype) %>% select(portfolio, pdate, indextype, indexvalue, pctch) 

sumrussell <- grp %>% 
    summarize(grplen=n(), grdollar=indexvalue[n()]/indexvalue[1], mean=mean(pctch,na.rm=T), 
        sd=sd(pctch,na.rm=T), sharpe=sharpe(pctch)) %>% 
        data.table() %>% arrange(desc(sharpe))

printfnct(sumrussell)
## [1] 52  7
## Classes 'data.table' and 'data.frame':   52 obs. of  7 variables:
##  $ portfolio: Factor w/ 26 levels "1000","1000Growth",..: 21 19 3 12 1 7 10 22 19 6 ...
##  $ indextype: Factor w/ 2 levels "idxwodiv","idxwdiv": 2 2 2 2 2 2 2 2 1 2 ...
##  $ grplen   : int  5378 5378 5478 5378 5477 5478 5478 5377 5378 5478 ...
##  $ grdollar : num  8.97 8.39 6.51 6.31 6.42 ...
##  $ mean     : num  0.000479 0.000475 0.000412 0.000415 0.000411 ...
##  $ sd       : num  0.0119 0.0126 0.0119 0.012 0.0119 ...
##  $ sharpe   : num  0.0402 0.0378 0.0348 0.0345 0.0345 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
## NULL
##      portfolio indextype grplen grdollar         mean         sd     sharpe
## 1: MidcapValue   idxwdiv   5378 8.968305 0.0004794423 0.01193434 0.04017335
## 2:      Midcap   idxwdiv   5378 8.385614 0.0004747336 0.01257102 0.03776414
## 3:   1000Value   idxwdiv   5478 6.505966 0.0004123303 0.01185257 0.03478826
## 4:   3000Value   idxwdiv   5378 6.309938 0.0004148701 0.01201047 0.03454237
## 5:        1000   idxwdiv   5477 6.419003 0.0004105501 0.01190820 0.03447626
## 6:        2500   idxwdiv   5478 7.347173 0.0004499299 0.01307858 0.03440205
##         portfolio indextype grplen grdollar         mean         sd     sharpe
## 1:       WorldCap  idxwodiv   5101 2.153213 0.0002021829 0.01016842 0.01988341
## 2:       Microcap  idxwodiv   4052 1.994760 0.0002730003 0.01430652 0.01908223
## 3:   GlobalGrowth  idxwodiv   5101 1.981138 0.0001894395 0.01051535 0.01801552
## 4:   Top50MegaCap  idxwodiv   3556 1.500860 0.0001850941 0.01190846 0.01554308
## 5: MicrocapGrowth   idxwdiv   4050 1.494418 0.0002116496 0.01497869 0.01413005
## 6: MicrocapGrowth  idxwodiv   4050 1.382606 0.0001924406 0.01497840 0.01284788

7) growth of $1 ggplot using russellmelt.

ports <- c("1000", "2000","Midcap")
pd  <- "1990-01-01"
it <- c("idxwdiv")

grp <- russellmelt %>% group_by(portfolio, indextype) %>% select(portfolio, pdate, indextype, indexvalue) 

pd <- ymd(pd)
minrdte <- min(russellstack$pdate)
mindte <- ifelse (pd <= minrdte, minrdte, tail(russellmelt[which(russellmelt$pdate<ymd(pd))]$pdate,1))

slug <- grp %>% filter(portfolio %in% ports & pdate >= mindte & indextype %in% it) %>%      
        mutate(pdate, grplen=n(), grdollar=indexvalue/indexvalue[1], pctch=Delt(indexvalue), 
            indextype=factor(indextype, levels=c("idxwdiv","idxwodiv"))) %>% data.table() 
        
g1 <- ggplot(slug, 
    aes(y=grdollar, x=pdate, color=portfolio, linetype=indextype)) +
    geom_line() +
    theme_economist_white() +
    theme(plot.title = element_text(size = 10,colour="black")) +  
    theme(axis.text=element_text(size=rel(0.5)),axis.title=element_text(size=rel(0.6))) +  
    theme(legend.text = element_text(size=rel(0.7)), legend.title=element_text(size=rel(.7)), legend.position="right") + 
    scale_linetype_manual(breaks=c("idxwdiv","idxwodiv"), values=c(1,3)) +
    scale_y_continuous(limits = c(0.0,1.2*max(slug$grdollar))) +
    ggtitle(paste("Russell Portfolios -- ", min(slug$pdate)," to ", max(slug$pdate), "\n\n",sep="")) +
    ylab("\nGrowth of $1") +
    xlab("Date\n")
print(g1)

8) pairs ggplot using russellmelt.

ports <- c("1000", "2000", "3000", "2500")
pd  <- "1999-12-31"
it <- c("idxwdiv")

grp <- russellmelt %>% group_by(portfolio, indextype) %>% select(portfolio, pdate, indextype, indexvalue) 

slug <- grp %>% select(portfolio, pdate, indextype, indexvalue) %>% 
        filter(portfolio %in% ports & pdate >= pd & indextype %in% it) %>% 
        mutate(pdate,  pctch=Delt(indexvalue), 
            indextype=factor(indextype, levels=c("idxwdiv","idxwodiv"))) %>% data.table() %>% 
            filter(!is.na(pctch)) %>% mutate(portfolio = paste("r", portfolio, sep="")) %>%
            select(portfolio, pdate, pctch, indextype) %>% spread(portfolio, pctch) %>% filter(!is.na(r1000))

slug %>% select(3:ncol(slug)) %>% ggpairs() + 
                theme_economist() +
                theme(axis.text=element_text(size=rel(0.5)),axis.title=element_text(size=rel(0.5)))   

9) growth of $1 function.

mkgraph <- function(rm,ports=c("1000","2000","3000"),it=c("idxwdiv"), pd="1899-12-31")
{ 
    
    grp <- rm %>% group_by(portfolio, indextype) %>% select(portfolio,pdate,indextype,indexvalue) 
    
    pd <- ymd(pd)
    minrdte <- min(russellstack$pdate)
    mindte <- ifelse (pd <= minrdte, minrdte, tail(russellmelt[which(russellmelt$pdate<pd)]$pdate,1))

        
    slug <- grp %>% filter(portfolio %in% ports & pdate >= mindte & indextype %in% it) %>% 
        mutate(pdate, grplen=n(), grdollar=indexvalue/indexvalue[1], pctch=Delt(indexvalue), 
            indextype=factor(indextype, levels=c("idxwdiv","idxwodiv"))) %>% data.table()

    g1 <- ggplot(slug, 
    aes(y=grdollar, x=pdate, color=portfolio, linetype=indextype)) +
    geom_line() +
    theme_economist_white() +
    theme(plot.title = element_text(size = 10,colour="black")) +  
    theme(axis.text=element_text(size=rel(0.5)),axis.title=element_text(size=rel(0.6))) +  
    theme(legend.text = element_text(size=rel(0.7)), legend.title=element_text(size=rel(.7)), legend.position="bottom", legend.box="horizontal") + 
    scale_linetype_manual(breaks=c("idxwdiv","idxwodiv"), values=c(1,3)) +
    scale_color_brewer(palette="Dark2") +
    ggtitle(paste("Russell Portfolios -- ", min(slug$pdate)," to ", max(slug$pdate), "\n\n",sep="")) +
    ylab("\nGrowth of $1") +
    xlab("Date\n")
    print(g1)

}

mkgraph(russellmelt,ports=c("3000","Midcap"))

mkgraph(russellmelt, ports=c("1000Growth","1000Value"), it=c("idxwdiv","idxwodiv"),pd="2000-01-01")

mkgraph(russellmelt,pd="2016-01-01")