Data visualization: both R and Python have many plotting libraries, but the R package ggplot2 (based on the concept of “grammar of graphics”) is the clear winner (see gallery). Now Python also has a ggplot library.
Modeling libraries: both R and Python have tons of libraries. It seems that R has more in data science. For statistics, R is the clear winner.
Ease of learning: Python was designed in 1989 for programmers with a syntax inspired by C. R was developed around 1993 for statisticians and scientists, also with a syntax inspired by C. Some people think Python is easier while others think R is easier. Perhaps initially R was more difficult to learn than Python, but with modern IDEs like RStudio this is not the case anymore. Some people may say that Python is more elegant, but that depends on what one is used to.
Speed: Python was initially faster than R. However, with the existing modern packages that is not true anymore. For example, the famous R package data.table for manipulating huge amounts of data is the clear winner (see benchmark). In fact, the R package Rcpp allows the combination of R with C++, leading to very fast implementations of packages. More recently, R has benefitted from many parallel computation packages as well.
Community support: both languages have a significant user base and hence a very active support community.
Machine learning: Python is more popular for neural networks. However, the truth is that the popular deep learning libraries (e.g., TensorFlow, MXNet, etc.) are coded in C/C++ and have interfaces with Python, R, and other languages. Interestingly, random forests (one of the most popular machine learning methods) are far superior in R. The reason is that neural networks traditionally come from a computer science background, whereas random forests come from a statistics background.
Why R?: R has been used for statistical computing for over two decades. You can get started with writing useful code in no time. It has been used extensively by data scientists and has an insane number of packages available for a lot of data science related tasks.
Why Python?: Python is more of a general purpose programming language. For web-based applications, Python seems to be more popular.
Finance: again both R and Python are heavily used in finance. You can easily find very passionate defenders and opponents of each language. From my own observations in the academic and industrial sectors, I can say that R is unbeatable for quick testing and prototype development and perhaps Python is more used for a later stage where the final product (probably web-based) has to be developed for clients.
Installation
To install, just follow these simple steps:
Now you are ready to start using R from within RStudio (note that you can also use R directly from the command line without the need for RStudio or you can use another IDE of your preference).
Packages
To see the versions of R and the installed packages just type sessionInfo():
To see the version of a specific package use packageVersion("package_name").
As time progresses, you will have to install different packages from CRAN with the command install.packages("package_name") or from GitHub with the command devtools::install_github("package_name"). After installing a package, it needs to be loaded before it can be used with the command library("package_name") or library(package_name):
# let's try to use the function xts() from package xts:
x <- xts()
#> Error in xts() : could not find function "xts"

# let's try to load the package first:
library(xts)
#> Error in library(xts) : there is no package called ‘xts’

# let's first install it:
install.packages("xts")
# now we can load it and use it:
library(xts)
x <- xts()
Variables and data types
In R, we can easily assign a value to a variable or object with <- (if the variable does not exist it will be created):
x <-"Hello"x
[1] "Hello"
We can combine several elements with c():
y <-c("Hello", "everyone")y
[1] "Hello" "everyone"
We can always see the variables in memory with ls():
ls()
[1] "x" "y"
My favorite command is str(variable). It gives you various information about the variable, i.e., type of variable, dimensions, contents, etc.
str(x)
chr "Hello"
str(y)
chr [1:2] "Hello" "everyone"
Another useful pair of commands is head() and tail(). They are especially good for variables of large dimensions, showing you the first and last few elements, respectively.
x <- c(1:1000)
str(x)
int [1:1000] 1 2 3 4 5 6 7 8 9 10 ...
head(x)
[1] 1 2 3 4 5 6
tail(x)
[1] 995 996 997 998 999 1000
It is important to remark that R is a functional language where almost everything is done through functions of all sorts (such as str(), print(), head(), ls(), tail(), max(), etc.).
There are a variety of functions for getting help:
help(matrix)       # help about function matrix()
?matrix            # same thing
example(matrix)    # show an example of function matrix()
apropos("matrix")  # list all functions containing string "matrix"
# get vignettes of installed packages
vignette()         # show available vignettes
vignette("xts")    # show specific vignette
Operators in R: arithmetic operators include +, -, *, /, and ^; comparison operators include >, >=, ==, and !=; and logical operators include &, |, and !.
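For example, here is a quick base-R illustration (note that arithmetic, comparison, and logical operators all act elementwise on vectors):

```r
x <- c(1, 2, 3, 4)
x^2                # elementwise power: 1 4 9 16
x > 2              # comparison: FALSE FALSE TRUE TRUE
(x > 1) & (x < 4)  # elementwise logical AND: FALSE TRUE TRUE FALSE
```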
R has a wide variety of data types including scalars, vectors, matrices, data frames, and lists.
Vectors
A vector is just a collection of several variables of the same type (numerical, character, logical, etc.).
Note that in R vectors are not column vectors or row vectors, they do not have any orientation. If one desires a column vector, then that is actually an \(n\times 1\) matrix.
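This can be verified directly: a plain vector has no dim attribute, whereas a column vector is an explicit \(n\times 1\) matrix:

```r
x <- c(1, 2, 3)
dim(x)                        # NULL: a plain vector has no orientation
x_col <- matrix(x, ncol = 1)  # an explicit 3 x 1 column vector
dim(x_col)                    # 3 1
dim(t(x))                     # 1 3: t() promotes a vector to a row matrix
```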
It is also important to differentiate elementwise multiplication * from inner product %*% and outer product %o%:
x <- c(1, 2)
y <- c(10, 20)
x * y
[1] 10 40
x %*% y
[,1]
[1,] 50
t(x) %*% y
[,1]
[1,] 50
x %o% y
[,1] [,2]
[1,] 10 20
[2,] 20 40
One can name the elements of a vector:
names(y)
NULL
names(y) <- c("convex", "optimization")
y  # same as print(y)
convex optimization
10 20
str(y)
Named num [1:2] 10 20
- attr(*, "names")= chr [1:2] "convex" "optimization"
# we can get the length
length(y)
[1] 2
Matrices
A matrix is a two-dimensional collection of several variables of the same type (numerical, character, logical, etc.).
We can easily create a matrix with matrix():
# generate 5 x 4 numeric matrix
x <- matrix(1:20, nrow = 5, ncol = 4)
x

     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

Data frames
A data frame is like a matrix, but each column (variable) can be of a different type. We can create one with data.frame():

myframe <- data.frame(ID = c(1, 2, 3, 4),
                      Color = c("red", "white", "red", NA),
                      Passed = c(TRUE, TRUE, TRUE, FALSE))
myframe
ID Color Passed
1 1 red TRUE
2 2 white TRUE
3 3 red TRUE
4 4 <NA> FALSE
str(myframe)
'data.frame': 4 obs. of 3 variables:
$ ID : num 1 2 3 4
$ Color : chr "red" "white" "red" NA
$ Passed: logi TRUE TRUE TRUE FALSE
There are a variety of ways to identify the elements of a data frame:
myframe[c(1, 3)] # columns 1,3 of data frame
ID Passed
1 1 TRUE
2 2 TRUE
3 3 TRUE
4 4 FALSE
myframe[c("ID", "Color")] # columns ID and Color from data frame
ID Color
1 1 red
2 2 white
3 3 red
4 4 <NA>
myframe["ID"] # select column ID
ID
1 1
2 2
3 3
4 4
myframe$ID # extract variable ID in the data frame (as a vector), like myframe[["ID"]]
[1] 1 2 3 4
Data frames in R are very powerful and versatile. They are commonly used in machine learning, where each row is one observation and each column one variable (each variable can be of a different type). For financial applications, we mainly deal with multivariate time series, which can be seen as a matrix or data frame but with some particularities: each row is an observation in a specific order (properly indexed with dates or times) and each column is of the same type (numeric). For multivariate time series we will later explore the class xts, which is more appropriate than matrices or data frames.
Lists
A list is an ordered collection of objects (components) of (possibly) different types. A list allows you to gather a variety of (possibly unrelated) objects under one name.
# example of a list with 4 components: a string, a numeric vector, a matrix, and a scalar
w <- list(name = "Fred", mynumbers = c(1:10), mymatrix = matrix(NA, 3, 3), age = 5.3)
w
$name
[1] "Fred"
$mynumbers
[1] 1 2 3 4 5 6 7 8 9 10
$mymatrix
[,1] [,2] [,3]
[1,] NA NA NA
[2,] NA NA NA
[3,] NA NA NA
$age
[1] 5.3
str(w)
List of 4
$ name : chr "Fred"
$ mynumbers: int [1:10] 1 2 3 4 5 6 7 8 9 10
$ mymatrix : logi [1:3, 1:3] NA NA NA NA NA NA ...
$ age : num 5.3
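The components of a list can be accessed by name or by position (illustrated here with a reduced version of the list w above):

```r
w <- list(name = "Fred", mynumbers = 1:10, age = 5.3)
w$name       # access a component by name: "Fred"
w[["age"]]   # equivalent double-bracket access: 5.3
w[[2]][3]    # third element of the second component: 3
w["age"]     # single brackets return a sub-list, not the component itself
```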
Useful functions
length(object)              # number of elements or components of a vector
str(object)                 # structure of an object
class(object)               # class or type of an object
names(object)               # names
c(object, object, ...)      # combine objects into a vector
cbind(object, object, ...)  # combine objects as columns
rbind(object, object, ...)  # combine objects as rows
object                      # prints the object
print(object)               # prints the object
ls()                        # list current objects
rm(object)                  # delete an object
Plotting
Base R
The base/native plotting functions in R are quite simple and may not look as appealing as desired:
x <- rnorm(1000)  # generate normal random numbers
x <- cumsum(x)    # random walk
plot(x, type = "l", col = "blue", main = "Random walk", xlab = "time", ylab = "log-price")
lines(cumsum(rnorm(1000)), col = "red")
For multiple plots:
par(mfrow = c(2, 2))  # define a 2x2 matrix of plots
plot(cumsum(rnorm(1000)), type = "l", ylab = "x1")
plot(cumsum(rnorm(1000)), type = "l", ylab = "x2")
plot(cumsum(rnorm(1000)), type = "l", ylab = "x3")
plot(cumsum(rnorm(1000)), type = "l", ylab = "x4")
par(mfrow = c(1, 1))  # set it back to the default single plot
There are two highly recommended packages for plotting: ggplot2 and rbokeh.
The package rbokeh is adopted from Python and allows for interactive plotting (however it seems that it is not being maintained anymore since the last update was in 2016):
library(rbokeh)  # install.packages("rbokeh") or devtools::install_github("bokeh/rbokeh")
figure(width = 700, title = "Random walks", xlab = "T", ylab = "log-price", legend_location = "top_right") %>%
  ly_lines(cumsum(rnorm(1000)), color = "blue", width = 2, legend = "line 1") %>%
  ly_lines(cumsum(rnorm(1000)), color = "green", width = 2, legend = "line 2") %>%
  ly_lines(cumsum(rnorm(1000)), color = "purple", width = 2, legend = "line 3")
As previously mentioned, in finance we mainly deal with multivariate time series that can be thought of as matrices where each row is an observation in a specific order (properly indexed with dates or times) and all columns are of the same type (numeric), corresponding to different assets. One could simply use an object of class matrix or class data.frame. However, there is a very convenient class (from package xts) that has been specifically designed for that purpose: xts (actually, it is the culmination of a long history of development of other classes like ts, fts, mts, irts, tseries, timeSeries, and zoo).
Creating xts
One can easily convert an existing time series data into xts with as.xts():
library(xts)
data(sample_matrix)  # load some data from package xts
class(sample_matrix)
matrix_xts <- as.xts(sample_matrix)  # convert into an xts object
The most noticeable difference in the behavior of xts objects will be apparent in the use of the “[” operator. Using special notation, one can use date-like strings to extract data based on the time index. Using increasing levels of time detail, it is possible to subset the object by year, week, days, or even seconds.
The i (row) argument of the subset operator “[”, in addition to accepting numeric values for indexing, can also be a character string, a time-based object, or a vector of either. The format must be left-specified with respect to the standard ISO 8601 time format “CCYY-MM-DD HH:MM:SS”. This means that to extract a particular month, it is necessary to fully specify the year as well. To identify a particular hour, say all observations in the eighth hour on January 1, 2007, one would likewise need to include the full year, month, and day, e.g., “2007-01-01 08”.
It is also possible to explicitly request a range of times via this index-based subsetting, using the ISO-recommended “/” as the range separator. The basic form is “from/to”, where both from and to are optional. If either side is missing, it is interpreted as a request to retrieve data from the beginning, or through the end, of the data object.
Another benefit to this method is that exact starting and ending times need not match the underlying data: the nearest available observation will be returned that is within the requested time period.
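As a small sketch of this range subsetting (assuming the xts object matrix_xts built from the package's sample_matrix data, as used below):

```r
library(xts)
data(sample_matrix)
matrix_xts <- as.xts(sample_matrix)
matrix_xts["2007-01-03/2007-01-06"]  # explicit "from/to" range
matrix_xts["/2007-01-05"]            # from the beginning through January 5
matrix_xts["2007-06-25/"]            # from June 25 through the end
```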
The following example shows how to extract the entire month of March 2007:

matrix_xts["2007-03"]
Additional xts tools providing subsetting are the first and last functions. In the spirit of head and tail from the recommended utils package, they allow for string-based subsetting, without forcing the user to conform to the specifics of the time index. Here is the first week of the data:

first(matrix_xts, "1 week")
… and here are the first 3 days of the last week of the data:
first(last(matrix_xts, "1 week"), "3 days")
Open High Low Close
2007-06-25 47.20471 47.42772 47.13405 47.42772
2007-06-26 47.44300 47.61611 47.44300 47.61611
2007-06-27 47.62323 47.71673 47.60015 47.62769
While the subsetting ability of the above makes exactly which time-based class you choose for your index a bit less relevant, it is nonetheless a factor that is beneficial to have control over.
To that end, xts provides facilities for indexing based on any of the current time-based classes. These include Date, POSIXct, chron, yearmon, yearqtr, and timeDate. The index itself may be accessed via the zoo generics extended to xts: index and the replacement function index<-.
It is also possible to directly query and set the index class of an xts object by using the respective functions tclass and tclass<-. Temporary conversion, resulting in a new object with the requested index class, can be accomplished via the convertIndex function.
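A brief sketch of these index facilities (assuming matrix_xts from above):

```r
head(index(matrix_xts))  # the time index itself (a Date vector here)
tclass(matrix_xts)       # query the class of the index, e.g., "Date"
```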
Finally, it is straightforward to combine different xts objects into one with multiple columns, properly aligned by the time index, with merge() or simply the more standard cbind() (which calls merge()):

str(cbind(matrix_xts$Open, matrix_xts$Close))
An xts object on 2007-01-02 / 2007-06-30 containing:
Data: double [180, 2]
Columns: Open, Close
Index: Date [180] (TZ: "UTC")
Plotting xts
Another advantage of using the class xts is for plotting. While the base R plot function is not very visually appealing, when plotting an xts object with plot() it is actually plot.xts() that is invoked, which produces much prettier results:
# base R plot for matrices
plot(sample_matrix[, 4], type = "l", main = "Stock prices")
# plot for xts (actually uses plot.xts under the hood)
plot(matrix_xts$Close, main = "Stock prices")
One can also use the awesome ggplot2 package. Recall that first we need to melt the multivariate xts object with the function ggplot2::fortify():
library(ggplot2)
# first we melt the xts
molten_df <- fortify(matrix_xts, melt = TRUE)
str(molten_df)
# plot
ggplot(molten_df, aes(x = Index, y = Value, col = Series)) + geom_line()
# configure the plot a bit more
ggplot(molten_df, aes(x = Index, y = Value, col = Series)) +
  geom_line() +
  ggtitle("Stock prices") +
  xlab(element_blank()) + ylab(element_blank()) +
  theme(legend.title = element_blank()) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y", date_minor_breaks = "1 week")
Alternatively, we can use the convenient function autoplot() (from the package ggfortify) that will do the melting for us:
library(ggfortify)
autoplot(matrix_xts, facets = FALSE, main = "Stock prices")  # names of molten df: Index, value, plot_group
Note: the package ggTimeSeries contains nice extensions of ggplot2 for time series (including calendar heatmaps, horizon plots, steamgraphs, waterfalls, etc.).
Additional time-based tools
Calculate periodicity: The periodicity function provides a quick summary as to the underlying periodicity of time series objects:
periodicity(matrix_xts)
Daily periodicity from 2007-01-02 to 2007-06-30
Find endpoints by time: Another common issue with time-series data is identifying the endpoints with respect to time. Often it is necessary to break data into hourly or monthly intervals to calculate some statistic. A simple call to endpoints offers a quick vector of values suitable for subsetting a dataset by. Note that the first element is zero, which is used to delineate the start.
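For example (assuming matrix_xts from above), the row numbers of the last observation of each period:

```r
endpoints(matrix_xts, on = "months")  # 0 followed by the last row of each month
endpoints(matrix_xts, on = "weeks")   # likewise by weeks
```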
Change periodicity: One of the most ubiquitous types of data in finance is OHLC data (Open-High-Low-Close). Often it is necessary to change the periodicity of this data to something coarser, e.g., take daily data and aggregate to weekly or monthly. With to.period and related wrapper functions it is a simple proposition.
# changing the index to something more appropriate
to.monthly(matrix_xts)
matrix_xts.Open matrix_xts.High matrix_xts.Low matrix_xts.Close
Jan 2007 50.03978 50.77336 49.76308 50.22578
Feb 2007 50.22448 51.32342 50.19101 50.77091
Mar 2007 50.81620 50.81620 48.23648 48.97490
Apr 2007 48.94407 50.33781 48.80962 49.33974
May 2007 49.34572 49.69097 47.51796 47.73780
Jun 2007 47.74432 47.94127 47.09144 47.76719
Periodically apply a function: Often it is desirable to be able to calculate a particular statistic, or evaluate a function, over a set of non-overlapping time periods. With the period.apply family of functions it is quite simple. The following examples illustrate a simple application of the max function to our example data:
# the general function, internally calls sapply
period.apply(matrix_xts[, "Close"], INDEX = endpoints(matrix_xts), FUN = max)
In addition to apply.monthly, there are wrappers for other common time frames including apply.daily, apply.weekly, apply.quarterly, and apply.yearly. Currently optimized functions include period.max, period.min, period.sum, and period.prod.
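For instance, the period.apply() call above can equivalently be written with the monthly wrapper (a sketch assuming matrix_xts from above; endpoints() defaults to monthly):

```r
apply.monthly(matrix_xts[, "Close"], FUN = max)  # same result as period.apply with monthly endpoints
```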
The package quantmod is designed to assist the quantitative trader in the development, testing, and deployment of statistically based trading models.
Getting data: The most useful function in quantmod is getSymbols(), which allows one to conveniently load data from several sources like Yahoo Finance, Google Finance, FRED, etc.:
library(quantmod)
getSymbols(c("AAPL", "GOOG"), from = "2013-01-01", to = "2015-12-31")
[1] "AAPL" "GOOG"
str(AAPL)
An xts object on 2013-01-02 / 2015-12-30 containing:
Data: double [755, 6]
Columns: AAPL.Open, AAPL.High, AAPL.Low, AAPL.Close, AAPL.Volume ... with 1 more column
Index: Date [755] (TZ: "UTC")
xts Attributes:
$ src : chr "yahoo"
$ updated: POSIXct[1:1], format: "2025-02-06 08:49:30"
The OHLCV basics: Data commonly has the prices open, high, low, close, adjusted close, as well as volume. There are many handy functions to extract those data, e.g., Op(), Hi(), Lo(), Cl(), Ad(), Vo(), as well as to query a variety of questions such as is.OHLC(), has.Vo(), etc.
getSymbols("GS") # Goldman OHLC from yahoo
[1] "GS"
is.OHLC(GS) # does the data contain at least OHL and C?
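A sketch of the extractor functions in action (assuming the AAPL object downloaded earlier with getSymbols()):

```r
head(Cl(AAPL))  # closing prices
head(Ad(AAPL))  # adjusted closing prices
has.Vo(AAPL)    # does the data contain a volume column?
```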
Charting with quantmod: The function chartSeries() is a nice tool to visualize financial time series in a way that many practitioners are familiar with: line charts, as well as OHLC bar and candle charts. There are convenience wrappers for these different styles (lineChart(), barChart(), and candleChart()), though chartSeries() does quite a bit to automatically handle data in the most appropriate way.
chartSeries(AAPL["2013-8/2013-12"], name ="AAPL")
# Add multi-coloring and change background to white
candleChart(AAPL["2013-8/2013-12"], multi.col = TRUE, theme = "white", name = "AAPL")
# Now weekly with custom color candles using the quantmod function to.weekly
chartSeries(to.weekly(AAPL), up.col = "white", dn.col = "blue", name = "AAPL")
Technical analysis charting tools: One can add technical analysis studies from package TTR to the above charts:
chartSeries(AAPL["2014"], name ="AAPL",TA ="addMACD(); addBBands()")
chartSeries(AAPL["2014"], name ="AAPL", TA ="addMomentum(); addEMA(); addRSI()")
reChart(subset ="2014-6/2014-12", theme ="white", type ="candles")
library(TTR)
# simple moving average
sma10 <- SMA(Cl(AAPL), n = 10)
head(sma10, 20)
SMA
2013-01-02 NA
2013-01-03 NA
2013-01-04 NA
2013-01-07 NA
2013-01-08 NA
2013-01-09 NA
2013-01-10 NA
2013-01-11 NA
2013-01-14 NA
2013-01-15 18.62829
2013-01-16 18.47493
2013-01-17 18.33414
2013-01-18 18.23771
2013-01-22 18.16939
2013-01-23 18.12904
2013-01-24 17.89118
2013-01-25 17.59250
2013-01-28 17.34082
2013-01-29 17.18554
2013-01-30 17.08164
# exponential moving average
ema10 <- EMA(Cl(AAPL), n = 10)
When a trader designs a portfolio strategy, the first thing to do is to backtest it. Backtesting is the process by which the portfolio strategy is put to test using the past historical market data available.
A common approach is to do a single backtest against the existing historical data and then plot graphs and draw conclusions from that. This is a big mistake. Performing a single backtest is not representative as it is just one realization and one will definitely overfit the tested strategy if there is parameter tuning involved or portfolio comparisons involved. Section 1 of this book chapter on backtesting illustrates the dangers of backtesting.
The package portfolioBacktest performs multiple backtesting of portfolios in an automated way on a rolling-window basis by taking data randomly from different markets, different time periods, and different stock universes. Here is a simple usage example with the equally weighted portfolio:
Step 1 - load package & datasets (you should download many more datasets, see vignette)

library(portfolioBacktest)
data(dataset10)
Step 2 - define your portfolio

my_portfolio <- function(dataset, ...) {
  prices <- dataset$adjusted
  N <- ncol(prices)
  w <- rep(1/N, N)
  return(w)
}
Step 3 - do backtest (dataset10 just contains 10 datasets for illustration purposes)
bt <-portfolioBacktest(my_portfolio, dataset10)
Step 4 - check your portfolio performance
backtestSummary(bt)$performance
fun1
Sharpe ratio 1.476203e+00
max drawdown 8.937890e-02
annual return 1.594528e-01
annual volatility 1.218623e-01
Sortino ratio 2.057677e+00
downside deviation 8.351402e-02
Sterling ratio 2.122653e+00
Omega ratio 1.295090e+00
VaR (0.95) 1.101934e-02
CVaR (0.95) 1.789425e-02
rebalancing period 1.000000e+00
turnover 8.641594e-03
ROT (bps) 7.334458e+02
cpu time 1.307692e-03
failure rate 0.000000e+00
Examples of the produced tables/plots include:
Performance table:
Barplot:
Boxplot:
R Scripts and R Markdown
R scripts
One simple way to use R is by typing the commands in the command window one by one. However, this quickly becomes inconvenient and it is necessary to write scripts. In RStudio one can simply create a new R script or open a .R file, where the commands are written in the same order as they will be later executed (this point cannot be overemphasized).
With the R script open, one can execute it line by line (either clicking a button or with a keyboard shortcut) or source the whole R file (again, either clicking a button or with a keyboard shortcut). Alternatively, one can also source the R file from the command line with source("filename.R"), but first one has to make sure to be in the correct folder (to see and set the current directory use the commands getwd() and setwd("folder_name")). Sourcing with source("filename.R") is very convenient when one has a library of useful functions or data that is needed prior to the execution of another main R script.
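For instance, a main script might look like this (the folder and file names are hypothetical placeholders):

```r
# main.R: assumes the hypothetical helper file my_functions.R sits in my_project
setwd("~/my_project")     # set the working directory
source("my_functions.R")  # load helper functions and/or data first
x <- cumsum(rnorm(100))   # then run the main analysis
plot(x, type = "l")
```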
R Markdown
Another important type of script is the R Markdown format (with file extension .Rmd). It is an extremely versatile format that allows the combination of formattable text, mathematics based on LaTeX code, R code (or any other language), and the automatic inclusion of the results from the execution of the code (plots, tables, or other types of output). This type of format also exists for Python, generally referred to as Jupyter Notebooks, and has recently become key in the context of reproducible research (because anybody can execute the source .Rmd file and reproduce all the plots and output). This document that you are now reading is an example of an R Markdown script.
R Markdown files can be directly created or opened from within RStudio. To compile the source .Rmd file, just click the button called Knit and an html will be automatically generated after executing all the chunks of code (other formats can also be generated like pdf).
The following is a simple header/body template that can be used to prepare projects/reports for this course:
---
title: "Hello R Markdown"
author: "Awesome Me"
date: "2018-09-05"
output: html_document
---
Summary of this document here.

# First header

## First subheader

## Second subheader

# Second header

- bullet list 1
- bullet list 2
    + more 2a
    + more 2b

This is a link: [R Markdown tutorial](http://rmarkdown.rstudio.com)

```{r}
# here some R code
plot(cumsum(rnorm(100)), type = "l")
```
For more information on the R Markdown formatting: