Primer on R for Finance

Daniel P. Palomar (2025). Portfolio Optimization: Theory and Application. Cambridge University Press.

Last update: February 06, 2025


Introduction

What is R?

R vs Python

  • The syntax of R, Python, Matlab, and Julia is really not that different (see cheatsheet).

  • R and Python are the most popular languages for data science. Some comparisons by Norm Matloff, Arslan Shahid, Martijn Theuwissen.

  • Data visualization: both R and Python have many plotting libraries, but the R package ggplot2 (based on the concept of “grammar of graphics”) is the clear winner (see gallery). Now Python also has a ggplot library.

  • Modeling libraries: both R and Python have tons of libraries. It seems that R has more in data science. For statistics, R is the clear winner.

  • Ease of learning: Python was designed in 1989 for programmers with a syntax inspired by C. R was developed around 1993 for statisticians and scientists, also with a syntax inspired by C. Some people think Python is easier while others think R is easier. Perhaps R was initially more difficult to learn than Python, but with modern IDEs like RStudio this is no longer the case. Some people may say that Python is more elegant, but that depends on what one is used to.

  • Speed: Python was initially faster than R. However, with modern packages that is no longer true. For example, the famous R package data.table for manipulating huge amounts of data is the clear winner (see benchmark). In fact, the R package Rcpp allows combining R with C++, leading to very fast package implementations. More recently, R has also benefited from many parallel computation packages.

  • Community support: both languages have a large user base; hence, both have a very active support community.

  • Machine learning: Python is more popular for neural networks. However, the truth is that the popular deep learning libraries (e.g., TensorFlow, MXNet, etc.) are coded in C++ and have interfaces with Python, R, and other languages. Interestingly, random forests (one of the most popular machine learning methods) are far superior in R. The reason is that neural networks traditionally come from a computer science background whereas random forests come from a statistics background.

  • Why R?: R has been used for statistical computing for over two decades. You can get started with writing useful code in no time. It has been used extensively by data scientists and has an insane number of packages available for a lot of data science related tasks.

  • Why Python?: Python is more of a general purpose programming language. For web-based applications, Python seems to be more popular.

  • Finance: again both R and Python are heavily used in finance. You can easily find very passionate defenders and opponents of each language. From my own observations in the academic and industrial sectors, I can say that R is unbeatable for quick testing and prototype development and perhaps Python is more used for a later stage where the final product (probably web-based) has to be developed for clients.

Installation

To install, just follow these simple steps:

  1. Install R from CRAN.
  2. Install the free IDE RStudio.

Now you are ready to start using R from within RStudio (note that you can also use R directly from the command line without the need for RStudio or you can use another IDE of your preference).

Packages

To see the versions of R and the installed packages just type sessionInfo():

sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.7.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Hong_Kong
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] portfolioBacktest_0.4.1.9000 PerformanceAnalytics_2.0.4  
[3] quantmod_0.4.26              TTR_0.24.4                  
[5] rbokeh_0.5.2                 reshape2_1.4.4              
[7] ggplot2_3.5.1                xts_0.14.1                  
[9] zoo_1.8-12                  

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        generics_0.1.3    stringi_1.8.4     lattice_0.22-6   
 [5] httpcode_0.3.0    digest_0.6.35     magrittr_2.0.3    evaluate_0.24.0  
 [9] grid_4.3.3        fastmap_1.2.0     maps_3.4.2        plyr_1.8.9       
[13] jsonlite_1.8.8    crul_1.4.2        httr_1.4.7        fansi_1.0.6      
[17] scales_1.3.0      pbapply_1.7-2     codetools_0.2-20  lazyeval_0.2.2   
[21] cli_3.6.2         rlang_1.1.4       gistr_0.9.0       munsell_0.5.1    
[25] withr_3.0.0       yaml_2.3.8        parallel_4.3.3    tools_4.3.3      
[29] pryr_0.1.6        dplyr_1.1.4       colorspace_2.1-0  curl_5.2.1       
[33] assertthat_0.2.1  vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4  
[37] stringr_1.5.1     htmlwidgets_1.6.4 pkgconfig_2.0.3   pillar_1.9.0     
[41] hexbin_1.28.3     gtable_0.3.5      glue_1.7.0        Rcpp_1.0.12      
[45] xfun_0.49         tibble_3.2.1      tidyselect_1.2.1  rstudioapi_0.16.0
[49] knitr_1.47        htmltools_0.5.8.1 rmarkdown_2.27    compiler_4.3.3   
[53] quadprog_1.5-8   

To see the version of a specific package use packageVersion("package_name").

As time progresses, you will have to install different packages, either from CRAN with the command install.packages("package_name") or from GitHub with the command devtools::install_github("user_name/package_name"). After a package is installed, it must be loaded with the command library("package_name") or library(package_name) before it can be used:

# let's try to use the function xts() from package xts:
x <- xts()
#> Error in xts() : could not find function "xts"

# let's try to load the package first:
library(xts)
#> Error in library(xts) : there is no package called ‘xts’

# let's first install it:
install.packages("xts")
# now we can load it and use it:
library(xts)
x <- xts()
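The install-then-load steps above can be wrapped so that a package is only installed when it is missing. The following is a minimal sketch, not a standard utility: the function name ensure_package() is hypothetical, and it assumes that a CRAN mirror and network access are available whenever an installation is actually needed.

```r
# Hypothetical helper (not part of base R): install a CRAN package only if
# it is not already available, then attach it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # assumes a CRAN mirror and network access
  }
  library(pkg, character.only = TRUE)  # pkg is a string, hence character.only
  invisible(packageVersion(pkg))       # return the installed version invisibly
}

# "stats" ships with R, so nothing is downloaded in this call
v <- ensure_package("stats")
print(v)
```

With this helper, ensure_package("xts") would replace the manual sequence of install.packages("xts") followed by library(xts).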

Variables and data types

In R, we can easily assign a value to a variable or object with <- (if the variable does not exist it will be created):

x <- "Hello"
x
[1] "Hello"

We can combine several elements with c():

y <- c("Hello", "everyone")
y
[1] "Hello"    "everyone"

We can always see the variables in memory with ls():

ls()
[1] "x" "y"

My favorite command is str(variable). It gives you various information about the variable, i.e., type of variable, dimensions, contents, etc.

str(x)
 chr "Hello"
str(y)
 chr [1:2] "Hello" "everyone"

Another useful pair of commands is head() and tail(). They are especially useful for variables of large dimensions, showing you the first and last few elements, respectively.

x <- c(1:1000)
str(x)
 int [1:1000] 1 2 3 4 5 6 7 8 9 10 ...
head(x)
[1] 1 2 3 4 5 6
tail(x)
[1]  995  996  997  998  999 1000

It is important to remark that R is a functional language where almost everything is done through functions of all sorts (such as str(), print(), head(), ls(), tail(), max(), etc.).

There are a variety of functions for getting help:

help(matrix)       # help about function matrix()
?matrix            # same thing
example(matrix)    # show an example of function matrix()
apropos("matrix")  # list all functions containing string "matrix"

# get vignettes of installed packages
vignette()       # show available vignettes
vignette("xts")  # show specific vignette

Operators in R: arithmetic operators include +, -, *, /, ^ and comparison operators include <, <=, >, >=, ==, !=.
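As a small illustration, these operators work element by element on vectors. The variable names below are only for the example; the logical operator & is standard R, although it is not listed above.

```r
a <- c(1, 2, 3)
b <- c(3, 2, 1)

total   <- a + b     # elementwise sum
squares <- a^2       # elementwise power
at_least <- a >= b   # elementwise comparison, returns a logical vector
equal   <- a == b    # elementwise equality test
both    <- (a > 1) & (b > 1)  # combine two comparisons elementwise

print(total)
print(at_least)
print(both)
```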

R has a wide variety of data types including scalars, vectors, matrices, data frames, and lists.

Vectors

A vector is just a collection of several variables of the same type (numerical, character, logical, etc.).

a <- c(1, 2, 5.3, 6, -2, 4)  # numeric vector
b <- c("one", "two", "three")  # character vector
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE) # logical vector

Refer to elements of a vector using subscripts:

a[2]  # 2nd element of vector
[1] 2
a[c(2, 4)]  # 2nd and 4th elements of vector
[1] 2 6

Note that in R vectors are not column vectors or row vectors, they do not have any orientation. If one desires a column vector, then that is actually an \(n\times 1\) matrix.
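The lack of orientation can be checked directly with dim(), which returns NULL for a plain vector. A short sketch (the variable names are illustrative):

```r
v <- c(1, 2, 3)
print(dim(v))                   # NULL: a plain vector has no dimensions

col_vec <- matrix(v, ncol = 1)  # explicit n x 1 matrix (column vector)
print(dim(col_vec))

row_vec <- t(v)                 # t() turns a plain vector into a 1 x n matrix
print(dim(row_vec))
```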

It is also important to differentiate elementwise multiplication * from inner product %*% and outer product %o%:

x <- c(1, 2)
y <- c(10, 20)
x * y
[1] 10 40
x %*% y
     [,1]
[1,]   50
t(x) %*% y
     [,1]
[1,]   50
x %o% y
     [,1] [,2]
[1,]   10   20
[2,]   20   40

One can name the elements of a vector:

names(y)
NULL
names(y) <- c("convex", "optimization")
y  # same as print(y)
      convex optimization 
          10           20 
str(y)
 Named num [1:2] 10 20
 - attr(*, "names")= chr [1:2] "convex" "optimization"
# we can get the length
length(y)
[1] 2

Matrices

A matrix is a two-dimensional collection of several variables of the same type (numerical, character, logical, etc.).

We can easily create a matrix with matrix():

# generate 5 x 4 numeric matrix 
x <- matrix(1:20, nrow = 5, ncol = 4)
x
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
# we can name the columns and the rows
colnames(x) <- c("col1", "col2", "col3", "col4")
rownames(x) <- c("row1", "row2", "row3", "row4", "row5")
x
     col1 col2 col3 col4
row1    1    6   11   16
row2    2    7   12   17
row3    3    8   13   18
row4    4    9   14   19
row5    5   10   15   20
str(x)
 int [1:5, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:5] "row1" "row2" "row3" "row4" ...
  ..$ : chr [1:4] "col1" "col2" "col3" "col4"
# we can get the dimensions or number of rows/columns
dim(x)
[1] 5 4
nrow(x)
[1] 5
ncol(x)
[1] 4
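As the printed matrix above shows, matrix() fills the entries column by column. If the data is laid out row by row, the argument byrow = TRUE can be used instead. A brief sketch with illustrative names:

```r
by_col <- matrix(1:6, nrow = 2)                # default: fill column by column
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)  # fill row by row

print(by_col)
print(by_row)
```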

Identify rows, columns or elements using subscripts:

x[, 4]  # 4th column of matrix (returned as vector)
row1 row2 row3 row4 row5 
  16   17   18   19   20 
str(x[, 4])
 Named int [1:5] 16 17 18 19 20
 - attr(*, "names")= chr [1:5] "row1" "row2" "row3" "row4" ...
x[, 4, drop = FALSE]  # 4th column of matrix (returned as one-column matrix)
     col4
row1   16
row2   17
row3   18
row4   19
row5   20
str(x[, 4, drop = FALSE])
 int [1:5, 1] 16 17 18 19 20
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:5] "row1" "row2" "row3" "row4" ...
  ..$ : chr "col4"
x[3, ]  # 3rd row of matrix 
col1 col2 col3 col4 
   3    8   13   18 
x[2:4, 1:3]  # rows 2,3,4 of columns 1,2,3
     col1 col2 col3
row2    2    7   12
row3    3    8   13
row4    4    9   14
str(x[2:4, 1:3])
 int [1:3, 1:3] 2 3 4 7 8 9 12 13 14
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:3] "row2" "row3" "row4"
  ..$ : chr [1:3] "col1" "col2" "col3"

Arrays

Arrays are similar to matrices but can have more than two dimensions.
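For instance, a three-dimensional array can be created with array() and indexed with one subscript per dimension. This is only a small sketch; the dimensions are chosen arbitrarily.

```r
# 2 x 3 x 4 array filled (column-major) with the integers 1 to 24
arr <- array(1:24, dim = c(2, 3, 4))

print(dim(arr))
last_elem   <- arr[2, 3, 4]  # a single element
first_slice <- arr[, , 1]    # a 2 x 3 matrix (the first "slice")
print(first_slice)
```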

Data frames

A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.).

d <- c(1, 2, 3, 4)
e <- c("red", "white", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
myframe <- data.frame(d, e, f)
names(myframe) <- c("ID", "Color", "Passed")  # variable names
myframe
  ID Color Passed
1  1   red   TRUE
2  2 white   TRUE
3  3   red   TRUE
4  4  <NA>  FALSE
str(myframe)
'data.frame':   4 obs. of  3 variables:
 $ ID    : num  1 2 3 4
 $ Color : chr  "red" "white" "red" NA
 $ Passed: logi  TRUE TRUE TRUE FALSE

There are a variety of ways to identify the elements of a data frame:

myframe[c(1, 3)]  # columns 1,3 of data frame
  ID Passed
1  1   TRUE
2  2   TRUE
3  3   TRUE
4  4  FALSE
myframe[c("ID", "Color")] # columns ID and Color from data frame
  ID Color
1  1   red
2  2 white
3  3   red
4  4  <NA>
myframe["ID"]  # select column ID
  ID
1  1
2  2
3  3
4  4
myframe$ID  # extract variable ID in the data frame (as a vector), like myframe[["ID"]]
[1] 1 2 3 4
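Rows can be selected as well, either by position, with a logical condition, or with subset(). The sketch below rebuilds the same data frame so that it runs on its own; note that subset() silently drops rows where the condition evaluates to NA.

```r
myframe <- data.frame(ID     = c(1, 2, 3, 4),
                      Color  = c("red", "white", "red", NA),
                      Passed = c(TRUE, TRUE, TRUE, FALSE))

second_row <- myframe[2, ]                # row 2, all columns
passed     <- myframe[myframe$Passed, ]   # rows where Passed is TRUE
red_only   <- subset(myframe, Color == "red", select = c(ID, Color))

print(passed)
print(red_only)
```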

Data frames in R are very powerful and versatile. They are commonly used in machine learning where each row is one observation and each column one variable (each variable can be of a different type). For financial applications, we mainly deal with multivariate time series, which can be seen as a matrix or data frame but with some particularities: each row is an observation but in a specific order (properly indexed with dates or times) and each column is of the same type (numeric). For multivariate time series we will later explore the class xts, which is more appropriate than matrices or data frames.

Lists

A list is an ordered collection of objects (components) of (possibly) different types. A list allows you to gather a variety of (possibly unrelated) objects under one name.

# example of a list with 4 components: a string, a numeric vector, a matrix, and a scalar 
w <- list(name = "Fred", mynumbers = c(1:10), mymatrix = matrix(NA, 3, 3), age = 5.3)
w
$name
[1] "Fred"

$mynumbers
 [1]  1  2  3  4  5  6  7  8  9 10

$mymatrix
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
[3,]   NA   NA   NA

$age
[1] 5.3
str(w)
List of 4
 $ name     : chr "Fred"
 $ mynumbers: int [1:10] 1 2 3 4 5 6 7 8 9 10
 $ mymatrix : logi [1:3, 1:3] NA NA NA NA NA NA ...
 $ age      : num 5.3
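The components of a list can be accessed with $ or [[ ]], which return the component itself, whereas single brackets [ ] return a sub-list. A short sketch using the same list (the added component city is made up for illustration):

```r
w <- list(name = "Fred", mynumbers = 1:10, mymatrix = matrix(NA, 3, 3), age = 5.3)

first_name <- w$name                # component by name
third_num  <- w[["mynumbers"]][3]   # component, then its 3rd element
sub_list   <- w["age"]              # still a list (of length 1)
age_value  <- w[["age"]]            # the numeric value itself

w$city <- "Hong Kong"               # add a new component (hypothetical field)
print(names(w))
```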

Useful functions

length(object)  # number of elements or components of a vector
str(object)     # structure of an object 
class(object)   # class or type of an object
names(object)   # names

c(object, object, ...)       # combine objects into a vector
cbind(object, object, ...)   # combine objects as columns
rbind(object, object, ...)   # combine objects as rows 

object          # prints the object
print(object)   # prints the object

ls()        # list current objects
rm(object)  # delete an object
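As a quick illustration of the combining functions listed above, the following sketch joins two small vectors in the three possible ways (the names u and v are arbitrary):

```r
u <- c(1, 2, 3)
v <- c(4, 5, 6)

long_vec <- c(u, v)      # one vector of length 6
as_cols  <- cbind(u, v)  # 3 x 2 matrix with columns named "u" and "v"
as_rows  <- rbind(u, v)  # 2 x 3 matrix with rows named "u" and "v"

print(as_cols)
print(as_rows)
```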

Plotting

Base R

The base/native plotting functions in R are quite simple and may not look as appealing as desired:

x <- rnorm(1000)  # generate normal random numbers
x <- cumsum(x)    # random walk
plot(x, type = "l", col = "blue", main = "Random walk", xlab = "time", ylab = "log-price")
lines(cumsum(rnorm(1000)), col = "red")

For multiple plots:

par(mfrow = c(2, 2))  # define a 2x2 matrix of plots
plot(cumsum(rnorm(1000)), type = "l", ylab = "x1")
plot(cumsum(rnorm(1000)), type = "l", ylab = "x2")
plot(cumsum(rnorm(1000)), type = "l", ylab = "x3")
plot(cumsum(rnorm(1000)), type = "l", ylab = "x4")
par(mfrow = c(1, 1))  # set it to default single plot

There are two highly recommended packages for plotting: ggplot2 and rbokeh.

ggplot2

The package ggplot2 is extremely popular within the R community. Here is the official webpage, the free online book ggplot2: Elegant Graphics for Data Analysis, and a cheatsheet. It is particularly versatile and adapted to data frames. It is based on the concept of “grammar of graphics”.

library(ggplot2)   # install.packages("ggplot2")
library(reshape2)  # required for melting the data frame

# create data frame with data
df <- data.frame(index   = 1:1000,
                 series1 = cumsum(rnorm(1000)),
                 series2 = cumsum(rnorm(1000)),
                 series3 = cumsum(rnorm(1000)),
                 series4 = cumsum(rnorm(1000)))
molten_df <- melt(df, id.vars = "index", measure.vars = c("series1", "series2", "series3", "series4"))
str(molten_df)
'data.frame':   4000 obs. of  3 variables:
 $ index   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ variable: Factor w/ 4 levels "series1","series2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value   : num  -0.5903 -0.0357 -0.0803 -0.1943 0.1056 ...
# plot
ggplot(molten_df, aes(x = index, y = value, col = variable)) + 
  geom_line() +
  ggtitle("Random walks")

ggplot(molten_df, aes(x = index, y = value, col = variable)) + 
  geom_line(show.legend = FALSE) +
  facet_wrap(~ variable) +
  ggtitle("Random walks")

rbokeh

The package rbokeh is adapted from Python and allows for interactive plotting (however, it seems that it is no longer maintained, since the last update was in 2016):

library(rbokeh)  # install.packages("rbokeh") or devtools::install_github("bokeh/rbokeh")

figure(width = 700, title = "Random walks", 
       xlab = "T", ylab = "log-price", legend_location = "top_right") %>%
  ly_lines(cumsum(rnorm(1000)), color = "blue", width = 2, legend = "line 1") %>%
  ly_lines(cumsum(rnorm(1000)), color = "green", width = 2, legend = "line 2") %>%
  ly_lines(cumsum(rnorm(1000)), color = "purple", width = 2, legend = "line 3")

Key packages for finance

Package xts

As previously mentioned, in finance we mainly deal with multivariate time series that can be thought of as matrices where each row is an observation in a specific order (properly indexed with dates or times) and all columns are of the same type (numeric), corresponding to different assets. One could simply use an object of class matrix or class data.frame. However, there is a very convenient class (from package xts) that has been specifically designed for that purpose: xts (actually, it is the culmination of a long history of development of other classes like ts, fts, mts, irts, tseries, timeSeries, and zoo).

Creating xts

One can easily convert existing time series data into xts with as.xts():

library(xts)

data(sample_matrix)  # load some data from package xts
class(sample_matrix)
[1] "matrix" "array" 
str(sample_matrix)
 num [1:180, 1:4] 50 50.2 50.4 50.4 50.2 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:180] "2007-01-02" "2007-01-03" "2007-01-04" "2007-01-05" ...
  ..$ : chr [1:4] "Open" "High" "Low" "Close"
matrix_xts <- as.xts(sample_matrix, dateFormat = "Date")
class(matrix_xts)
[1] "xts" "zoo"
str(matrix_xts)
An xts object on 2007-01-02 / 2007-06-30 containing: 
  Data:    double [180, 4]
  Columns: Open, High, Low, Close
  Index:   Date [180] (TZ: "UTC")

Alternatively, one can create new data with the xts constructor xts():

xts(1:10, as.Date("2000-01-01") + 1:10)
           [,1]
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
2000-01-08    7
2000-01-09    8
2000-01-10    9
2000-01-11   10

Subsetting xts

The most noticeable difference in the behavior of xts objects will be apparent in the use of the “[” operator. Using special notation, one can use date-like strings to extract data based on the time-index. Using increasing levels of time-detail, it is possible to subset the object by year, week, days, or even seconds.

The i (row) argument to the subset operator “[”, in addition to accepting numeric values for indexing, can also be a character string, a time-based object, or a vector of either. The format must be left-specified with respect to the standard ISO 8601 time format “CCYY-MM-DD HH:MM:SS”. This means that to extract a particular month, it is necessary to fully specify the year as well. To identify a particular hour, say all observations in the eighth hour on January 1, 2007, one would likewise need to include the full year, month and day, e.g. “2007-01-01 08”.

It is also possible to explicitly request a range of times via this index-based subsetting, using the ISO-recommended “/” as the range separator. The basic form is “from/to”, where both from and to are optional. If either side is missing, it is interpreted as a request to retrieve data from the beginning, or through the end of the data object.

Another benefit to this method is that exact starting and ending times need not match the underlying data: the nearest available observation will be returned that is within the requested time period.

The following example shows how to extract the entire month of March 2007:

matrix_xts["2007-03"]
#>                Open     High      Low    Close
#> 2007-03-01 50.81620 50.81620 50.56451 50.57075
#> 2007-03-02 50.60980 50.72061 50.50808 50.61559
#> 2007-03-03 50.73241 50.73241 50.40929 50.41033
#> 2007-03-04 50.39273 50.40881 50.24922 50.32636
#> 2007-03-05 50.26501 50.34050 50.26501 50.29567
#> ...

Now extract all the data from the beginning through January 7, 2007:

matrix_xts["/2007-01-07"]  # same as: matrix_xts["::2007-01-07"]
               Open     High      Low    Close
2007-01-02 50.03978 50.11778 49.95041 50.11778
2007-01-03 50.23050 50.42188 50.23050 50.39767
2007-01-04 50.42096 50.42096 50.26414 50.33236
2007-01-05 50.37347 50.37347 50.22103 50.33459
2007-01-06 50.24433 50.24433 50.11121 50.18112
2007-01-07 50.13211 50.21561 49.99185 49.99185
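Ranges with both endpoints, and ranges whose endpoints fall outside the data, work in the same way. The sketch below assumes the package xts is installed and recreates matrix_xts from the bundled sample_matrix so that it runs on its own; the particular dates are chosen only for illustration.

```r
library(xts)

data(sample_matrix)
matrix_xts <- as.xts(sample_matrix, dateFormat = "Date")

# range spanning the end of March and the beginning of April
across_months <- matrix_xts["2007-03-29/2007-04-02"]

# the start of the range precedes the data (which begins on 2007-01-02):
# only the observations inside the requested period are returned
early <- matrix_xts["2006-12-25/2007-01-03"]

# open-ended range: from June 28 through the end of the data
from_june_28 <- matrix_xts["2007-06-28/"]

print(nrow(across_months))
print(index(early))
print(index(from_june_28))
```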

Additional xts tools providing subsetting are the first and last functions. In the spirit of head and tail from the utils package, they allow for string-based subsetting, without forcing the user to conform to the specifics of the time index. Here is the first week of the data:

first(matrix_xts,"1 week")
               Open     High      Low    Close
2007-01-02 50.03978 50.11778 49.95041 50.11778
2007-01-03 50.23050 50.42188 50.23050 50.39767
2007-01-04 50.42096 50.42096 50.26414 50.33236
2007-01-05 50.37347 50.37347 50.22103 50.33459
2007-01-06 50.24433 50.24433 50.11121 50.18112
2007-01-07 50.13211 50.21561 49.99185 49.99185

… and here are the first 3 days of the last week of the data:

first(last(matrix_xts,"1 week"),"3 days")
               Open     High      Low    Close
2007-06-25 47.20471 47.42772 47.13405 47.42772
2007-06-26 47.44300 47.61611 47.44300 47.61611
2007-06-27 47.62323 47.71673 47.60015 47.62769

While the subsetting ability of the above makes exactly which time-based class you choose for your index a bit less relevant, it is nonetheless a factor that is beneficial to have control over.

To that end, xts provides facilities for indexing based on any of the current time-based classes. These include Date, POSIXct, chron, yearmon, yearqtr, and timeDate. The index itself may be accessed via the zoo generics extended to xts: index and the replacement function index<-.

It is also possible to directly query and set the index class of an xts object by using the respective functions tclass and tclass<-. Temporary conversion, resulting in a new object with the requested index class, can be accomplished via the convertIndex function.

tclass(matrix_xts)
[1] "Date"
matrix_xts_POSIX <- convertIndex(matrix_xts,'POSIXct')
tclass(matrix_xts_POSIX)
[1] "POSIXct" "POSIXt" 
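The index itself can be read and replaced with index() and index<-. As a minimal sketch (assuming the package xts is installed; the small series and the one-week shift are made up for illustration):

```r
library(xts)

x <- xts(1:3, order.by = as.Date("2000-01-01") + 0:2)
old_index <- index(x)    # extract the time index (a Date vector here)

index(x) <- index(x) + 7 # replace the index: shift every date by one week

print(old_index)
print(index(x))
print(tclass(x))         # the index class is unchanged
```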

Of course one can also use the traditional indexing for matrices:

matrix_xts[1:5]  # same as matrix_xts[1:5, ]
               Open     High      Low    Close
2007-01-02 50.03978 50.11778 49.95041 50.11778
2007-01-03 50.23050 50.42188 50.23050 50.39767
2007-01-04 50.42096 50.42096 50.26414 50.33236
2007-01-05 50.37347 50.37347 50.22103 50.33459
2007-01-06 50.24433 50.24433 50.11121 50.18112
matrix_xts[1:5, 4]
              Close
2007-01-02 50.11778
2007-01-03 50.39767
2007-01-04 50.33236
2007-01-05 50.33459
2007-01-06 50.18112
matrix_xts[1:5, "Close"]
              Close
2007-01-02 50.11778
2007-01-03 50.39767
2007-01-04 50.33236
2007-01-05 50.33459
2007-01-06 50.18112
matrix_xts[1:5]$Close
              Close
2007-01-02 50.11778
2007-01-03 50.39767
2007-01-04 50.33236
2007-01-05 50.33459
2007-01-06 50.18112

Finally, it is straightforward to combine different xts objects into one with multiple columns and properly aligned by the time index with merge() or simply the more standard cbind() (which calls merge()):

open_close <- cbind(matrix_xts$Open, matrix_xts$Close)
str(open_close)
An xts object on 2007-01-02 / 2007-06-30 containing: 
  Data:    double [180, 2]
  Columns: Open, Close
  Index:   Date [180] (TZ: "UTC")
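The alignment by time index becomes visible when the two series do not cover exactly the same dates. In the following sketch (assuming the package xts is installed; the two toy series are made up), merge() fills the non-overlapping dates with NA by default, while join = "inner" keeps only the common dates:

```r
library(xts)

a <- xts(1:3, order.by = as.Date("2020-01-01") + 0:2)  # Jan 1 to Jan 3
b <- xts(4:6, order.by = as.Date("2020-01-02") + 0:2)  # Jan 2 to Jan 4

outer_join <- merge(a, b)                  # default: union of the dates
inner_join <- merge(a, b, join = "inner")  # intersection of the dates

print(outer_join)
print(inner_join)
```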

Plotting xts

Another advantage of using the class xts is for plotting. While the base R plot function is not very visually appealing, when plotting an xts object with plot() it is actually plot.xts() that is invoked and it is much prettier:

# base R plot for matrices
plot(sample_matrix[, 4], type = "l", main = "Stock prices")

# plot for xts (actually uses plot.xts under the hood)
plot(matrix_xts$Close, main = "Stock prices")

One can also use the awesome ggplot2 package. Recall that first we need to melt the multivariate xts object with the function ggplot2::fortify():

library(ggplot2)

# first we melt the xts
molten_df <- fortify(matrix_xts, melt = TRUE)
str(molten_df)
'data.frame':   720 obs. of  3 variables:
 $ Index : Date, format: "2007-01-02" "2007-01-03" ...
 $ Series: Factor w/ 4 levels "Open","High",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Value : num  50 50.2 50.4 50.4 50.2 ...
# plot
ggplot(molten_df, aes(x = Index, y = Value, col = Series)) +
  geom_line()

# configure the plot a bit more
ggplot(molten_df, aes(x = Index, y = Value, col = Series)) +
  geom_line() +
  ggtitle("Stock prices") + xlab(element_blank()) + ylab(element_blank()) + 
  theme(legend.title = element_blank()) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y", date_minor_breaks = "1 week")

Alternatively, we can use the convenient function autoplot() (from the package ggfortify) that will do the melting for us:

library(ggfortify)
autoplot(matrix_xts, facets = FALSE, main = "Stock prices")  # names of molten df: Index, value, plot_group

Note: the package ggTimeSeries contains nice extensions of ggplot2 for time series (including calendar heatmaps, horizon plots, steamgraphs, waterfalls, etc.).

Additional time-based tools

Calculate periodicity: The periodicity function provides a quick summary as to the underlying periodicity of time series objects:

periodicity(matrix_xts)
Daily periodicity from 2007-01-02 to 2007-06-30 

Find endpoints by time: Another common issue with time-series data is identifying the endpoints with respect to time. Often it is necessary to break data into hourly or monthly intervals to calculate some statistic. A simple call to endpoints returns a vector of row positions suitable for subsetting a dataset. Note that the first element is zero; it serves as a delimiter so that every interval, including the first, starts one row after the previous endpoint.

endpoints(matrix_xts, on = "months")
[1]   0  30  58  89 119 150 180
matrix_xts[endpoints(matrix_xts, on = "months")]
               Open     High      Low    Close
2007-01-31 50.07049 50.22578 50.07049 50.22578
2007-02-28 50.69435 50.77091 50.59881 50.77091
2007-03-31 48.95616 49.09728 48.95616 48.97490
2007-04-30 49.13825 49.33974 49.11500 49.33974
2007-05-31 47.82845 47.84044 47.73780 47.73780
2007-06-30 47.67468 47.94127 47.67468 47.76719
endpoints(matrix_xts, on = "weeks")
 [1]   0   6  13  20  27  34  41  48  55  62  69  76  83  90  97 104 111 118 125
[20] 132 139 146 153 160 167 174 180
head(matrix_xts[endpoints(matrix_xts, on = "weeks")])
               Open     High      Low    Close
2007-01-07 50.13211 50.21561 49.99185 49.99185
2007-01-14 50.46359 50.62395 50.46359 50.60145
2007-01-21 50.16188 50.42090 50.16044 50.42090
2007-01-28 49.96586 50.00217 49.87468 49.88096
2007-02-04 50.48183 50.55509 50.40203 50.55509
2007-02-11 50.67849 50.91776 50.67849 50.91160

Change periodicity: One of the most ubiquitous types of data in finance is OHLC data (Open-High-Low-Close). Often it is necessary to change the periodicity of this data to something coarser, e.g. take daily data and aggregate to weekly or monthly. With to.period and related wrapper functions it is a simple proposition.

to.period(matrix_xts, "months")
           matrix_xts.Open matrix_xts.High matrix_xts.Low matrix_xts.Close
2007-01-31        50.03978        50.77336       49.76308         50.22578
2007-02-28        50.22448        51.32342       50.19101         50.77091
2007-03-31        50.81620        50.81620       48.23648         48.97490
2007-04-30        48.94407        50.33781       48.80962         49.33974
2007-05-31        49.34572        49.69097       47.51796         47.73780
2007-06-30        47.74432        47.94127       47.09144         47.76719
periodicity(to.period(matrix_xts, "months"))
Monthly periodicity from 2007-01-31 to 2007-06-30 
# changing the index to something more appropriate
to.monthly(matrix_xts)
         matrix_xts.Open matrix_xts.High matrix_xts.Low matrix_xts.Close
Jan 2007        50.03978        50.77336       49.76308         50.22578
Feb 2007        50.22448        51.32342       50.19101         50.77091
Mar 2007        50.81620        50.81620       48.23648         48.97490
Apr 2007        48.94407        50.33781       48.80962         49.33974
May 2007        49.34572        49.69097       47.51796         47.73780
Jun 2007        47.74432        47.94127       47.09144         47.76719

Periodically apply a function: Often it is desirable to be able to calculate a particular statistic, or evaluate a function, over a set of non-overlapping time periods. With the period.apply family of functions it is quite simple. The following examples illustrate a simple application of the max function to our example data:

# the general function, internally calls sapply
period.apply(matrix_xts[, "Close"], INDEX = endpoints(matrix_xts), FUN = max)
              Close
2007-01-31 50.67835
2007-02-28 51.17899
2007-03-31 50.61559
2007-04-30 50.32556
2007-05-31 49.58677
2007-06-30 47.76719
# same result as above, just a monthly interface
apply.monthly(matrix_xts[, "Close"], FUN = max)
              Close
2007-01-31 50.67835
2007-02-28 51.17899
2007-03-31 50.61559
2007-04-30 50.32556
2007-05-31 49.58677
2007-06-30 47.76719
# using one of the optimized functions - about 4x faster
period.max(matrix_xts[,4], endpoints(matrix_xts))
               [,1]
2007-01-31 50.67835
2007-02-28 51.17899
2007-03-31 50.61559
2007-04-30 50.32556
2007-05-31 49.58677
2007-06-30 47.76719

In addition to apply.monthly, there are wrappers to other common time frames including: apply.daily, apply.weekly, apply.quarterly, and apply.yearly. Current optimized functions include period.max, period.min, period.sum, and period.prod.

Package quantmod

The package quantmod is designed to assist the quantitative trader in the development, testing, and deployment of statistically based trading models.

Getting data: The most useful function in quantmod is getSymbols(), which allows one to conveniently load data from several sources like YahooFinance, GoogleFinance, FRED, etc.:

library(quantmod)

getSymbols(c("AAPL", "GOOG"), from = "2013-01-01", to = "2015-12-31")
[1] "AAPL" "GOOG"
str(AAPL)
An xts object on 2013-01-02 / 2015-12-30 containing: 
  Data:    double [755, 6]
  Columns: AAPL.Open, AAPL.High, AAPL.Low, AAPL.Close, AAPL.Volume ... with 1 more column
  Index:   Date [755] (TZ: "UTC")
  xts Attributes:
    $ src    : chr "yahoo"
    $ updated: POSIXct[1:1], format: "2025-02-06 08:49:30"
head(AAPL)
           AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2013-01-02  19.77929  19.82143 19.34393   19.60821   560518000      16.68734
2013-01-03  19.56714  19.63107 19.32143   19.36071   352965200      16.47671
2013-01-04  19.17750  19.23679 18.77964   18.82143   594333600      16.01775
2013-01-07  18.64286  18.90357 18.40000   18.71071   484156400      15.92353
2013-01-08  18.90036  18.99607 18.61607   18.76107   458707200      15.96639
2013-01-09  18.66071  18.75036 18.42821   18.46786   407604400      15.71685

The OHLCV basics: Data commonly contains the open, high, low, close, and adjusted close prices, as well as volume. There are many handy functions to extract those data, e.g., Op(), Hi(), Lo(), Cl(), Ad(), Vo(), as well as to answer a variety of questions such as is.OHLC(), has.Vo(), etc.

getSymbols("GS")  # Goldman OHLC from yahoo 
[1] "GS"
is.OHLC(GS)  # does the data contain at least OHL and C? 
[1] TRUE
has.Vo(GS)  # how about volume? 
[1] TRUE
head(Op(GS))  # just the Open column please. 
           GS.Open
2007-01-03  200.60
2007-01-04  200.22
2007-01-05  198.43
2007-01-08  199.05
2007-01-09  203.54
2007-01-10  203.40
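These helper functions identify the columns by their names, so they can also be tried offline on a hand-made object. The following sketch assumes the package quantmod is installed; the object toy and all its values are made up for illustration and do not correspond to any real asset.

```r
library(quantmod)  # also attaches xts

dates <- as.Date("2020-01-01") + 0:2
toy <- xts(cbind(Open   = c(10.0, 11.0, 12.0),
                 High   = c(11.0, 12.0, 13.0),
                 Low    = c( 9.0, 10.0, 11.0),
                 Close  = c(10.5, 11.5, 12.5),
                 Volume = c(100, 200, 300)),
           order.by = dates)

ohlc_ok <- is.OHLC(toy)        # are the Open, High, Low, Close columns present?
vol_ok  <- has.Vo(toy)         # is there a Volume column?
closes  <- Cl(toy)             # extract the Close column
ranges  <- Hi(toy) - Lo(toy)   # daily high-low range

print(closes)
print(ranges)
```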

Charting with quantmod: The function chartSeries() is a nice tool to visualize financial time series in a way that many practitioners are familiar with: line charts, as well as OHLC bar and candle charts. There are convenience wrappers to these different styles (lineChart(), barChart(), and candleChart()), though chartSeries() does quite a bit to automatically handle data in the most appropriate way.

chartSeries(AAPL["2013-8/2013-12"], name = "AAPL") 

# Add multi-coloring and change background to white 
candleChart(AAPL["2013-8/2013-12"], multi.col = TRUE, theme = "white", name = "AAPL")

# Now weekly with custom color candles using the function to.weekly (from package xts)
chartSeries(to.weekly(AAPL), up.col = "white", dn.col = "blue", name = "AAPL")

Technical analysis charting tools: One can add technical analysis studies from package TTR to the above charts:

chartSeries(AAPL["2014"], name = "AAPL",
            TA = "addMACD(); addBBands()")

chartSeries(AAPL["2014"], name = "AAPL", 
            TA = "addMomentum(); addEMA(); addRSI()")

reChart(subset = "2014-6/2014-12", theme = "white", type = "candles")

Package TTR

The package TTR (Technical Trading Rules) is designed for traditional technical analysis and charting.

Moving averages: One can easily compute moving averages.

library(TTR)

# simple moving average
sma10 <- SMA(Cl(AAPL), n = 10)
head(sma10, 20)
                SMA
2013-01-02       NA
2013-01-03       NA
2013-01-04       NA
2013-01-07       NA
2013-01-08       NA
2013-01-09       NA
2013-01-10       NA
2013-01-11       NA
2013-01-14       NA
2013-01-15 18.62829
2013-01-16 18.47493
2013-01-17 18.33414
2013-01-18 18.23771
2013-01-22 18.16939
2013-01-23 18.12904
2013-01-24 17.89118
2013-01-25 17.59250
2013-01-28 17.34082
2013-01-29 17.18554
2013-01-30 17.08164
# exponential moving average
ema10 <- EMA(Cl(AAPL), n = 10)
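
To make explicit what these two averages compute, here is a minimal base R sketch on a synthetic price series. It is not the TTR implementation; in particular, seeding the exponential average with the simple average of the first n prices is an assumed convention:

```r
# Toy price series (synthetic, for illustration only)
set.seed(123)
price <- 100 + cumsum(rnorm(30))
n <- 10

# Simple moving average: mean of the last n prices (NA until n prices are available)
sma <- as.numeric(stats::filter(price, rep(1 / n, n), sides = 1))

# Exponential moving average with smoothing ratio 2/(n+1),
# seeded with the simple average of the first n prices (assumed convention)
alpha <- 2 / (n + 1)
ema <- rep(NA_real_, length(price))
ema[n] <- mean(price[1:n])
for (t in (n + 1):length(price)) {
  ema[t] <- alpha * price[t] + (1 - alpha) * ema[t - 1]
}

head(cbind(price, sma, ema), 12)
```

The first n - 1 entries are NA in both cases, which matches the leading NAs in the SMA output above.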

Bollinger Bands:

bb20 <- BBands(HLC(AAPL), sd = 2.0)
str(bb20)
An xts object on 2013-01-02 / 2015-12-30 containing: 
  Data:    double [755, 4]
  Columns: dn, mavg, up, pctB
  Index:   Date [755] (TZ: "UTC")
  xts Attributes:
    $ src    : chr "yahoo"
    $ updated: POSIXct[1:1], format: "2025-02-06 08:49:30"
plot(bb20)
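
The four columns above can be reproduced in spirit with a few lines of base R. This is a sketch under stated assumptions (a single synthetic price series instead of the HLC typical price, a 20-period window, and the population standard deviation), not the TTR code:

```r
# Toy price series (synthetic, for illustration only)
set.seed(1)
price <- 100 + cumsum(rnorm(60))
n <- 20   # window length (assumed)
k <- 2    # number of standard deviations

mavg <- rep(NA_real_, length(price))
sdev <- rep(NA_real_, length(price))
for (t in n:length(price)) {
  w <- price[(t - n + 1):t]
  mavg[t] <- mean(w)
  sdev[t] <- sqrt(mean((w - mean(w))^2))  # population standard deviation (assumed)
}

up   <- mavg + k * sdev          # upper band
dn   <- mavg - k * sdev          # lower band
pctB <- (price - dn) / (up - dn) # position of the price within the bands

tail(cbind(dn, mavg, up, pctB), 3)
```
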

RSI – Relative Strength Index:

rsi14 <- RSI(Cl(AAPL), n = 14)
plot(cbind(Cl(AAPL), rsi14), legend.loc = "topleft")
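
For reference, the RSI maps the ratio of average gains to average losses into the range from 0 to 100. The function below is a hypothetical helper (`rsi_wilder` is not part of TTR) that sketches the computation with Wilder's smoothing, which is an assumption about the averaging scheme:

```r
# Hypothetical helper: RSI with Wilder's smoothing (requires length(price) > n + 1)
rsi_wilder <- function(price, n = 14) {
  change <- diff(price)
  gain <- pmax(change, 0)    # positive price changes
  loss <- pmax(-change, 0)   # negative price changes (as positive numbers)
  avg_gain <- avg_loss <- rep(NA_real_, length(change))
  avg_gain[n] <- mean(gain[1:n])
  avg_loss[n] <- mean(loss[1:n])
  for (t in (n + 1):length(change)) {
    avg_gain[t] <- (avg_gain[t - 1] * (n - 1) + gain[t]) / n
    avg_loss[t] <- (avg_loss[t - 1] * (n - 1) + loss[t]) / n
  }
  rs <- avg_gain / avg_loss
  c(NA, 100 - 100 / (1 + rs))  # pad with NA so the output aligns with the prices
}

set.seed(2)
price <- 100 + cumsum(rnorm(60))  # synthetic prices, for illustration only
rsi <- rsi_wilder(price, n = 14)
tail(rsi, 3)
```

A series that only rises has no losses and saturates at 100, whereas a series that only falls gives 0.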

MACD:

macd <- MACD(Cl(AAPL), nFast = 12, nSlow = 26, nSig = 9, maType = SMA)
str(macd)
An xts object on 2013-01-02 / 2015-12-30 containing: 
  Data:    double [755, 2]
  Columns: macd, signal
  Index:   Date [755] (TZ: "UTC")
  xts Attributes:
    $ src    : chr "yahoo"
    $ updated: POSIXct[1:1], format: "2025-02-06 08:49:30"
plot(cbind(Cl(AAPL), macd), legend.loc = "topleft")
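
The two columns can be sketched in base R as follows. This assumes that the MACD line is expressed as the percentage difference between the fast and slow moving averages and that simple moving averages are used throughout (as in the call above); it is an illustration on synthetic prices, not the TTR code:

```r
# Simple moving average helper (NA until n observations are available)
sma <- function(x, n) as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))

set.seed(7)
price <- 100 + cumsum(rnorm(100))  # synthetic prices, for illustration only
n_fast <- 12
n_slow <- 26
n_sig  <- 9

# MACD line: percentage difference between fast and slow averages (assumed convention)
macd_line <- 100 * (sma(price, n_fast) / sma(price, n_slow) - 1)

# Signal line: moving average of the MACD line
signal_line <- sma(macd_line, n_sig)

tail(cbind(macd_line, signal_line), 3)
```
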

Package PerformanceAnalytics

The package PerformanceAnalytics contains a large list of convenient functions for plotting and evaluation of performance.

library(PerformanceAnalytics)

# compute returns
ret <- CalculateReturns(cbind(Cl(AAPL), Cl(GOOG)))  # for each column, same as Cl(AAPL)/lag(Cl(AAPL)) - 1
head(ret)
             AAPL.Close    GOOG.Close
2013-01-02           NA            NA
2013-01-03 -0.012622234  0.0005807685
2013-01-04 -0.027854637  0.0197603839
2013-01-07 -0.005882336 -0.0043632609
2013-01-08  0.002691287 -0.0019735156
2013-01-09 -0.015628793  0.0065730483
# performance measures
table.AnnualizedReturns(ret)
                          AAPL.Close GOOG.Close
Annualized Return             0.1105     0.2895
Annualized Std Dev            0.2575     0.2449
Annualized Sharpe (Rf=0%)     0.4291     1.1821
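table.CalendarReturns(ret)  # note: this function assumes monthly returns, so interpret the table below (from daily data) with care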
      Jan  Feb  Mar  Apr  May Jun  Jul  Aug  Sep  Oct  Nov  Dec AAPL.Close
2013 -0.3 -0.7 -2.1  2.9 -0.4 0.7 -0.2 -0.9 -1.2 -0.4  1.9  1.2        0.3
2014  0.2 -0.3  0.0 -0.4 -0.4 1.0 -2.6  0.2  0.6  1.0 -0.1 -1.9       -2.6
2015 -1.5 -1.5 -1.5 -2.7 -1.1 0.7 -0.9 -0.5  1.1 -0.9  0.4 -1.3       -9.2
     GOOG.Close
2013       -0.3
2014        1.5
2015       -2.8
table.Stats(ret)
                AAPL.Close GOOG.Close
Observations      754.0000   754.0000
NAs                 1.0000     1.0000
Minimum            -0.1236    -0.0531
Quartile 1         -0.0076    -0.0068
Median              0.0001     0.0000
Arithmetic Mean     0.0005     0.0011
Geometric Mean      0.0004     0.0010
Quartile 3          0.0100     0.0084
Maximum             0.0820     0.1605
SE Mean             0.0006     0.0006
LCL Mean (0.95)    -0.0006     0.0000
UCL Mean (0.95)     0.0017     0.0022
Variance            0.0003     0.0002
Stdev               0.0162     0.0154
Skewness           -0.5988     2.6857
Kurtosis            6.5047    24.1815
table.DownsideRisk(ret)
                              AAPL.Close GOOG.Close
Semi Deviation                    0.0118     0.0092
Gain Deviation                    0.0105     0.0142
Loss Deviation                    0.0120     0.0082
Downside Deviation (MAR=210%)     0.0163     0.0138
Downside Deviation (Rf=0%)        0.0116     0.0086
Downside Deviation (0%)           0.0116     0.0086
Maximum Drawdown                  0.2887     0.1918
Historical VaR (95%)             -0.0249    -0.0196
Historical ES (95%)              -0.0369    -0.0273
Modified VaR (95%)               -0.0266    -0.0029
Modified ES (95%)                -0.0538    -0.0241
# plots
charts.PerformanceSummary(ret, wealth.index = TRUE, main = "Buy & Hold performance")

chart.Boxplot(ret)

chart.Histogram(ret, note.cex = 0.5, 
                methods = c("add.density", "add.normal", "add.risk", "add.qqplot"))

chart.RiskReturnScatter(ret)
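
As a sanity check on the tables above, the main performance measures are simple to compute by hand. The sketch below uses hypothetical helper names and synthetic daily returns, and assumes 252 trading days per year with geometric compounding. Note that in the annualized table above the Sharpe ratio is simply the annualized return divided by the annualized standard deviation (0.1105/0.2575 ≈ 0.4291).

```r
# Hypothetical helpers (not PerformanceAnalytics functions)
annualized_return <- function(r, scale = 252) prod(1 + r)^(scale / length(r)) - 1
annualized_sd     <- function(r, scale = 252) sd(r) * sqrt(scale)
max_drawdown <- function(r) {
  wealth <- c(1, cumprod(1 + r))    # wealth index starting at 1
  max(1 - wealth / cummax(wealth))  # largest drop from a running peak
}

set.seed(10)
r <- rnorm(504, mean = 0.0005, sd = 0.01)  # two years of synthetic daily returns

ann_ret    <- annualized_return(r)
ann_vol    <- annualized_sd(r)
ann_sharpe <- ann_ret / ann_vol            # risk-free rate assumed to be 0
mdd        <- max_drawdown(r)

round(c(return = ann_ret, sd = ann_vol, sharpe = ann_sharpe, max_dd = mdd), 4)
```
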

Package portfolioBacktest

When a trader designs a portfolio strategy, the first thing to do is to backtest it. Backtesting is the process by which the portfolio strategy is put to the test using the available historical market data.

A common approach is to do a single backtest against the existing historical data and then plot graphs and draw conclusions from that. This is a big mistake. A single backtest is not representative, as it is just one realization, and one will almost certainly overfit the tested strategy if parameter tuning or portfolio comparisons are involved. Section 1 of this book chapter on backtesting illustrates the dangers of backtesting.

The package portfolioBacktest performs multiple backtests of portfolios in an automated way on a rolling-window basis by taking data randomly from different markets, different time periods, and different stock universes. Here is a simple usage example with the equally weighted portfolio:

  • Step 1 - load package & datasets (you should download many more datasets, see vignette)
library(portfolioBacktest)  # install.packages("portfolioBacktest")
data("dataset10")
  • Step 2 - define your own portfolio
my_portfolio <- function(dataset, ...) {
  prices <- dataset$adjusted
  N <- ncol(prices)
  w <- rep(1/N, N)
  return(w)
}
  • Step 3 - do backtest (dataset10 just contains 10 datasets for illustration purposes)
bt <- portfolioBacktest(my_portfolio, dataset10)
  • Step 4 - check your portfolio performance
backtestSummary(bt)$performance
                           fun1
Sharpe ratio       1.476203e+00
max drawdown       8.937890e-02
annual return      1.594528e-01
annual volatility  1.218623e-01
Sortino ratio      2.057677e+00
downside deviation 8.351402e-02
Sterling ratio     2.122653e+00
Omega ratio        1.295090e+00
VaR (0.95)         1.101934e-02
CVaR (0.95)        1.789425e-02
rebalancing period 1.000000e+00
turnover           8.641594e-03
ROT (bps)          7.334458e+02
cpu time           1.307692e-03
failure rate       0.000000e+00

Examples of the produced tables/plots include a performance table, a barplot, and a boxplot.
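
To give an idea of what a rolling-window backtest over multiple datasets involves, here is a minimal base R sketch. It is not the implementation of portfolioBacktest: the function names `rolling_backtest` and `make_prices`, the window lengths, and the synthetic price data are all assumptions made for illustration.

```r
# Equally weighted portfolio, with the same interface as in Step 2 above
my_portfolio <- function(dataset, ...) {
  N <- ncol(dataset$adjusted)
  rep(1 / N, N)
}

# Hypothetical rolling-window backtest: the weights are redesigned every
# `reoptimize_every` days using only the last `lookback` prices
rolling_backtest <- function(prices, portfolio_fun, lookback = 50, reoptimize_every = 20) {
  n_obs <- nrow(prices)
  ret <- prices[-1, , drop = FALSE] / prices[-n_obs, , drop = FALSE] - 1  # row t: return from t to t+1
  port_ret <- rep(NA_real_, n_obs - 1)
  w <- NULL
  for (t in lookback:(n_obs - 1)) {
    if (is.null(w) || (t - lookback) %% reoptimize_every == 0) {
      window <- prices[(t - lookback + 1):t, , drop = FALSE]  # no look-ahead
      w <- portfolio_fun(list(adjusted = window))
    }
    port_ret[t] <- sum(w * ret[t, ])  # daily rebalancing to the weights w
  }
  port_ret[!is.na(port_ret)]
}

# Synthetic price paths standing in for different markets and periods
make_prices <- function(n_obs = 300, N = 5) {
  r <- matrix(rnorm(n_obs * N, mean = 0.0005, sd = 0.01), n_obs, N)
  apply(r, 2, function(x) 100 * cumprod(1 + x))
}

set.seed(2025)
sharpes <- sapply(1:10, function(i) {
  pr <- rolling_backtest(make_prices(), my_portfolio)
  mean(pr) / sd(pr) * sqrt(252)  # annualized Sharpe ratio of one backtest
})
summary(sharpes)  # a distribution of results rather than a single number
```

The spread of the Sharpe ratios across the ten synthetic datasets illustrates why conclusions drawn from a single backtest can be misleading.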

R Scripts and R Markdown

R scripts

One simple way to use R is by typing the commands in the command window one by one. However, this quickly becomes inconvenient and it is necessary to write scripts. In RStudio one can simply create a new R script or open a .R file, where the commands are written in the same order as they will be later executed (this point cannot be overemphasized).

With the R script open, one can execute it line by line (either by clicking a button or with a keyboard shortcut) or source the whole R file (again, either by clicking a button or with a keyboard shortcut). Alternatively, one can source the R file from the command line with source("filename.R"), but first one has to make sure to be in the correct folder (to see and set the current directory use the commands getwd() and setwd("folder_name")). Sourcing with the command source("filename.R") is very convenient when one has a library of useful functions or data that is needed prior to the execution of another main R script.
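
As a small illustration of this workflow, the following sketch writes a helper file to a temporary folder and then sources it from a "main" script. The file name helpers.R and the function annualized_vol() are made up for this example:

```r
# Create a small library of helper functions in a temporary folder
helper_file <- file.path(tempdir(), "helpers.R")
writeLines(c(
  "# helper function: annualized volatility from daily returns",
  "annualized_vol <- function(r, scale = 252) sd(r) * sqrt(scale)"
), helper_file)

# In the main script: load the helpers first, then use them
source(helper_file)
getwd()  # shows the current working directory
vol <- annualized_vol(c(0.01, -0.02, 0.015, 0.003))
vol
```

Using the full path (as done here with file.path()) avoids having to call setwd() before sourcing.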

R Markdown

Another important type of script is the R Markdown format (with file extension .Rmd). It is an extremely versatile format that allows the combination of formatted text, mathematics based on LaTeX code, and R code (or code in other languages), with automatic inclusion of the results from executing the code (plots, tables, or any other type of output). A similar format exists for Python, namely Jupyter Notebooks. Such formats have recently become key in the context of reproducible research (because anybody can execute the source .Rmd file and reproduce all the plots and output). This document that you are now reading is an example of an R Markdown script.

R Markdown files can be directly created or opened from within RStudio. To compile the source .Rmd file, just click the button called Knit and an HTML file will be automatically generated after executing all the chunks of code (other formats, like PDF, can also be generated).
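
The compilation can also be triggered from the R console with the function render() from package rmarkdown. The sketch below writes a tiny .Rmd file to a temporary folder (the file name hello.Rmd is made up) and compiles it only if rmarkdown and pandoc happen to be available:

```r
# Write a minimal R Markdown file to a temporary folder
rmd_file <- file.path(tempdir(), "hello.Rmd")
fence <- strrep("`", 3)  # three backticks delimit a code chunk
writeLines(c(
  "---",
  'title: "Hello R Markdown"',
  "output: html_document",
  "---",
  "",
  "# First header",
  "",
  paste0(fence, "{r}"),
  'plot(cumsum(rnorm(100)), type = "l")',
  fence
), rmd_file)

# Compile to HTML (skipped if rmarkdown or pandoc is not installed)
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_file, output_format = "html_document", quiet = TRUE)
  print(out_file)  # path of the generated HTML file
}
```
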

The following is a simple header/body template that can be used to prepare projects/reports for this course:

---
title: "Hello R Markdown"
author: "Awesome Me"
date: "2018-09-05"
output: html_document
---
Summary of this document here.

# First header

## First subheader

## Second subheader

# Second header

- bullet list 1
- bullet list 2
    + more 2a
    + more 2b
  
This is a link: [R Markdown tutorial](http://rmarkdown.rstudio.com) 

```{r}
# here some R code
plot(cumsum(rnorm(100)), type = "l")
```

For more information on the R Markdown formatting:

To explore further

There are several CRAN Task Views relevant to financial applications, each of which encompasses many packages: