Primer on Python for Finance
Daniel P. Palomar (2025). Portfolio Optimization: Theory and Application. Cambridge University Press.
Last update: February 06, 2025
Contributors:
Introduction¶
What is Python?¶
- Python is a general-purpose, high-level, interpreted language that has become one of the de facto tools for data science, data analysis, machine learning, and finance. For instance, Python has topped the IEEE Spectrum ranking of top programming languages for many years in a row.
- Python was created by Guido van Rossum in 1991. Guido acted as Python's Benevolent Dictator for Life (BDFL) and recently retired as a Principal Engineer from Dropbox. Python's main philosophy focuses on code readability, i.e., it provides a simple, easy-to-follow grammar, which often translates into rapid code development.
- Python is an actively developed open source language. The Python Software Foundation (PSF) holds the intellectual property rights and protects the trademarks associated with Python. Agencies, foundations, private companies, and non-profit organizations support the development of Python and many of its open source libraries. Just to name a few, the Moore Foundation, the NumFOCUS organization, Microsoft, J.P. Morgan, and so on, have been funding the development of Python for many years.
- Python is distributed by a variety of sources, of which we recommend the Anaconda distribution, primarily due to its straightforward installation procedure.
- Useful Python links:
- Anaconda
- Searching packages: PyPI
- Python documentation
- JupyterLab is the de facto environment for data analysis
- Learning: LearnPython, Python for Beginners, Learn Python 3, The Hitchhiker's Guide to Python, NumPy basics by Andrej Karpathy, Matplotlib basics, Seaborn basics, Pandas Crash Course by Datacamp
- Stack Overflow
- Python homepage
- Other resources: Book: Python for Finance, mastering data-driven finance
Python vs R¶
Let's not even get started :) Both are great! Learn both! You'll have to use whatever your boss/advisor/team needs. But don't forget to know a bit of C++ too :)
Installation¶
First, install a Python distribution, for example, from Anaconda (on macOS and Linux you can also use brew install python).
Then, install your favorite code editor or IDE. Some examples are:
- JupyterLab, which is the de facto IDE. It can be installed from a terminal window with pip install jupyterlab (on macOS and Linux you can also use brew install jupyterlab), and there is also a desktop version of JupyterLab.
- PyCharm
- VS Code
- Spyder
To get started coding, start your code editor or IDE. For example, with JupyterLab either click the app or, from a terminal (or cmd on Windows), simply type jupyter lab (and a browser window will pop up). A Jupyter notebook is an environment where you can write code and interactively evaluate its output. This feature is very convenient for exploratory analysis.
Now you are ready to start using Python from within JupyterLab.
Packages¶
To see the versions of Python and the installed packages, just type !pip list or !conda list in a Jupyter notebook cell and press SHIFT+ENTER. Alternatively, type pip list or conda list in a terminal window and press ENTER.
To see the version of a specific package, use import package_name; print(package_name.__version__).
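Not every package exposes a __version__ attribute, so the standard library offers another route. The following is a minimal sketch using importlib.metadata (available from Python 3.8); the helper name installed_version and the package names in the loop are only illustrative.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# the last name is deliberately fake, to show the "not installed" case
for name in ["numpy", "pandas", "a-package-that-does-not-exist"]:
    print(name, installed_version(name))
```

This queries the installed distribution metadata without importing the package itself.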
As time progresses, you will have to install different packages from PyPI or Conda with the command pip install package_name or conda install package_name. Note that you can execute these commands from within a Jupyter notebook by prepending them with an exclamation mark (!).
After installing a package, it needs to be imported with the command import package_name before it can be used:
# we need to import it first and then we can use it:
import numpy # to install do: pip install numpy
x = [1, 2, 3]
y = numpy.mean(x)
y
2.0
It is common to use shortcuts for the names of the imported libraries:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 5, 100)
y = (x - np.pi) * (x - 1.618)
plt.plot(x, y, label="2nd-degree polynomial")
plt.legend()
plt.show()
Good style¶
There are several Python style guides:
Style Guide for Python Code: The Python Enhancement Proposal (PEP) 8 is a widely accepted document that outlines good programming practices for Python. It was created by Guido van Rossum, Barry Warsaw, and Alyssa Coghlan, and it evolves over time as new conventions are identified and old ones become obsolete. The guide emphasizes that code is read much more often than it is written, and therefore, readability and consistency are crucial. It provides guidelines on various aspects of coding in Python, including naming conventions, indentation, and use of whitespace, among others.
Google style guide: Google also has its own style guide for Python, which includes a list of dos and don'ts for Python programs. It emphasizes the importance of using descriptive names for public APIs, making modules importable, and using the right style for module, function, method docstrings, and inline comments. It also recommends using tools like pylint for finding bugs and style problems in Python source code.
The Hitchhiker's Guide to Python: It is another resource that also recommends following PEP 8. It highlights the importance of readability and provides some common Python idioms.
Remember, while these style guides provide useful guidelines, they are not absolute rules. They are intended to improve the readability and consistency of your code, but there may be instances where it makes sense to deviate from the guidelines. As PEP 8 itself says, "A Foolish Consistency is the Hobgoblin of Little Minds". When in doubt, use your best judgment and consider the readability and maintainability of your code.
Variables: lists, dicts, arrays, and data frames¶
In Python, we can easily assign a value to a variable or object with =
(if the variable does not exist it will be created):
x = "Hello"
x
'Hello'
We can combine several elements with lists:
y = ["Hello", "everyone"]
y
['Hello', 'everyone']
Note that elements in a list need not have the same data type (we'll see the data types in a few minutes):
y = [1, "hello", 2., "everyone"]
y
[1, 'hello', 2.0, 'everyone']
A dictionary, or simply "dict", is a data structure that allows mappings between keywords and values.
There are many ways to create a dict, the simplest one is just to use curly brackets {}
as follows:
x = {'a': 1, 'b': 2, 'c': 3}
x
{'a': 1, 'b': 2, 'c': 3}
Another way is to explicitly specify the keywords and values:
x = dict(a = 1, b = 2, c = 3)
x
{'a': 1, 'b': 2, 'c': 3}
A dict can store different data types for different keywords:
x = dict(a = '1', b = 2, c = 3)
x
{'a': '1', 'b': 2, 'c': 3}
The usual way to query a value from a dict is to pass the desired keyword:
x['a']
'1'
We can also modify the contents of dicts or add new entries:
x = dict(a = 1, b = 2, c = 3)
x['a'] = 2
x
{'a': 2, 'b': 2, 'c': 3}
x['d'] = 10
x
{'a': 2, 'b': 2, 'c': 3, 'd': 10}
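A few more dict operations come up constantly in practice. The sketch below reuses the same small dict and shows safe lookups with get(), membership tests, iteration over key/value pairs, and removal with pop(); the key 'z' is just an example of a missing key.

```python
x = dict(a=1, b=2, c=3)

# x['z'] would raise a KeyError; get() returns a default value instead
missing = x.get('z')        # None
fallback = x.get('z', 0)    # 0

# membership tests look at the keys
has_a = 'a' in x            # True

# iterate over key/value pairs (insertion order is preserved in Python 3.7+)
pairs = [(key, value) for key, value in x.items()]

# remove an entry and get its value back
removed = x.pop('c')        # 3

print(missing, fallback, has_a, pairs, removed, x)
```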
Sets in Python are collections of unordered unique elements. The main purposes of sets are to verify membership, remove duplicate elements from a sequence, and compute standard math operations on sets.
We can create sets from lists as follows:
x = set([1, 2, 3, 1])
y = set([2, 4, 5, 3])
x.intersection(y)
{2, 3}
x.difference(y)
{1}
x.symmetric_difference(y)
{1, 4, 5}
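The other two uses mentioned above, removing duplicates and testing membership, can be sketched as follows. The ticker symbols are only illustrative labels, and the operators |, &, -, and ^ are the shorthand equivalents of union, intersection, difference, and symmetric_difference.

```python
tickers = ["AAPL", "MSFT", "AAPL", "GOOG", "MSFT"]

# remove duplicates (the resulting set has no guaranteed order)
unique = set(tickers)

# remove duplicates while keeping the order of first appearance
unique_in_order = list(dict.fromkeys(tickers))

# membership test
has_goog = "GOOG" in unique

# operator shorthand for the set methods
x = {1, 2, 3}
y = {2, 3, 4, 5}
union = x | y            # {1, 2, 3, 4, 5}
intersection = x & y     # {2, 3}
difference = x - y       # {1}
sym_difference = x ^ y   # {1, 4, 5}

print(sorted(unique), unique_in_order, has_goog)
```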
A useful command is ?variable. It gives you various information about the variable, e.g., type, dimensions, contents, etc. Note that this is a feature only in IPython and JupyterLab.
Another useful feature in Python is slicing. It is especially handy for large arrays or lists, for example to show only the first or last n elements:
x = np.arange(1000)
x[:10]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
x[-10:]
array([990, 991, 992, 993, 994, 995, 996, 997, 998, 999])
It is important to keep in mind that in Python almost everything is done through functions or methods of all sorts such as max(), min(), arange(), linspace(), and so on.
Data types¶
Operators in Python: arithmetic operators include +, -, *, /, ** for addition, subtraction, multiplication, division, and exponentiation. Binary comparison operators are <, <=, >, >=, ==, !=. Boolean operators are and, or, not, and the bitwise counterparts of and, or are &, |.
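A quick sketch of these operators in action. Floor division (//) and the remainder operator (%) are not listed above but are added here because they are easy to confuse with /.

```python
# arithmetic
true_div = 7 / 2       # 3.5: "/" always returns a float
floor_div = 7 // 2     # 3
remainder = 7 % 2      # 1
power = 2 ** 10        # 1024

# comparisons can be chained
in_range = 0 <= 5 < 10         # True

# "and" / "or" work on truth values and short-circuit
both = (3 > 2) and (1 > 2)     # False
either = (3 > 2) or (1 > 2)    # True

# "&" and "|" act bit by bit on integers: 6 is 0b110 and 3 is 0b011
bit_and = 6 & 3        # 2 (0b010)
bit_or = 6 | 3         # 7 (0b111)

print(true_div, floor_div, remainder, power, in_range, both, either, bit_and, bit_or)
```

With NumPy arrays and Pandas objects, the elementwise logical operations are written with & and | rather than and and or.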
Python has a wide variety of data types including scalars (integers, floats, which are double precision, and complex numbers), strings, lists, tuples, dictionaries (dicts), sets, nd-arrays, and data frames.
Scalars and strings¶
Scalars are basically floats and integers, for example:
x = 1
type(x)
int
x = 1.1
type(x)
float
Can you think of why Python gives the following answer to the computation 3.3 - (1.1 + 2.2)?
x = 3.3 - (1.1 + 2.2)
x
-4.440892098500626e-16
Try the same thing in R and MATLAB. What do you see? Is it different from what Python computed?
What is the binary expansion of 0.333333...? How do computers represent anything? :)
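The tiny nonzero result comes from the fact that 1.1, 2.2, and 3.3 have no finite binary expansion, so each is stored as the nearest double-precision number. As a minimal sketch of how to live with this, the code below compares floats with a tolerance via math.isclose and uses fractions.Fraction to inspect the exact stored value; the tolerance 1e-12 is an arbitrary choice for illustration.

```python
import math
from fractions import Fraction

x = 3.3 - (1.1 + 2.2)

# exact comparison fails because of rounding in the binary representation
exactly_zero = (x == 0.0)                              # False

# compare with a tolerance instead
sum_is_close = math.isclose(1.1 + 2.2, 3.3)            # True (relative tolerance)
x_is_close = math.isclose(x, 0.0, abs_tol=1e-12)       # True (absolute tolerance is needed near zero)

# the float 1.1 is not exactly 11/10
stored = Fraction(1.1)
is_exact = (stored == Fraction(11, 10))                # False

# exact rational arithmetic does not suffer from this problem
rational_ok = Fraction("1.1") + Fraction("2.2") == Fraction("3.3")  # True

print(x, exactly_zero, sum_is_close, x_is_close, stored, rational_ok)
```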
Strings are written between single or double quotes:
x = "Hello IEDA/ELEC 3180"
x
'Hello IEDA/ELEC 3180'
Complex numbers are defined by using the literal aj, where a is any float or integer number, which represents $a\sqrt{-1}$:
x = 1 + 2j
x
(1+2j)
x.conjugate()
(1-2j)
x.real
1.0
x.imag
2.0
abs(x)
2.23606797749979
Note that the function abs, which in the case of real scalars computes the absolute value, calculates the modulus of a complex number, i.e., its length as if it were a vector in $\mathbb{R}^2$.
Lists¶
The most basic data structure in Python is a list. It is an ordered collection of variables of any type defined with square brackets []. For example:
x = [1, 2, 3., "hello", True]
Unlike R, MATLAB, and Julia, Python is 0-index based, which means that x[0] is the actual first element of the list x.
The length of a list may be obtained via the function len().
len(x)
5
To access the value in a given position of a list, use indexes:
x[1] # 2nd element of the list x
2
Python allows for negative indexes to be given, e.g., x[-1] returns the last item of a list, x[-2] returns the second-to-last item, and so on:
x[-1]
True
x[-2]
'hello'
Additionally, we can retrieve sublists of a list by using slices, e.g., x[1:3] returns a sublist containing the elements x[1] and x[2]; x[1:] returns a sublist containing all the elements to the right of (and including) x[1]; x[:3] returns a sublist containing all the elements to the left of x[3]. In general, x[a:b], for integers a, b with b > a, returns the ordered sublist from x[a] to x[b-1].
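The slicing rules above can be checked directly on the list x. The sketch also shows two extras that are not described in the text: an optional third value, the step, as in x[a:b:step], and the fact that out-of-range slice bounds are silently clipped.

```python
x = [1, 2, 3., "hello", True]

middle = x[1:3]         # [2, 3.0]: elements x[1] and x[2]
from_second = x[1:]     # [2, 3.0, 'hello', True]
first_three = x[:3]     # [1, 2, 3.0]
last_two = x[-2:]       # ['hello', True]

# optional step: every second element starting from x[1]
odd_positions = x[1::2]     # [2, 'hello']

# a negative step walks backwards, so this reverses the list
reversed_x = x[::-1]        # [True, 'hello', 3.0, 2, 1]

# slice bounds beyond the end are clipped (indexing x[100] would raise an IndexError)
clipped = x[3:100]          # ['hello', True]

print(middle, from_second, first_three, last_two, odd_positions, reversed_x, clipped)
```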
However, in many cases we would like to retrieve a sublist whose indexes possess a particular property. A convenient tool for this is what Python jargon calls a "list comprehension". For instance, let's retrieve the sublist of elements which are in even positions of the original list x:
y = [x[i] for i in range(len(x)) if i % 2 == 0]
y
[1, 3.0, True]
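For comparison, here is a short sketch of equivalent ways to write the same selection, plus comprehensions that filter on the values themselves or transform them. The variable names are only illustrative.

```python
x = [1, 2, 3., "hello", True]

# three equivalent ways to pick the elements in even positions
by_range = [x[i] for i in range(len(x)) if i % 2 == 0]
by_enumerate = [item for i, item in enumerate(x) if i % 2 == 0]
by_slice = x[::2]

# filter on the values rather than on the positions: keep only the strings
strings = [item for item in x if isinstance(item, str)]   # ['hello']

# transform each element: squares of 0, 1, ..., 4
squares = [n ** 2 for n in range(5)]                      # [0, 1, 4, 9, 16]

print(by_range, by_enumerate, by_slice, strings, squares)
```

The enumerate version is usually preferred over range(len(x)) because it gives the index and the element at the same time.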
We can concatenate lists by using the "+" operator:
[1, 2, 3] + [3, 2, 1]
[1, 2, 3, 3, 2, 1]
NumPy arrays¶
Python was not designed specifically for scientific computing; however, libraries such as NumPy, started by Travis Oliphant, extend the language's data structures so as to deal more easily with vectors, matrices, and the mathematical operations involved.
Note that in Python, 1d numpy arrays (or simply 1d-arrays) are not column vectors or row vectors, they do not have any orientation. If one desires a column vector, then that is actually an $n\times 1$ matrix.
It is also important to differentiate elementwise multiplication * from the inner or dot product @ (also np.dot()):
x = np.array([1, 2])
y = np.array([10, 20])
z = x.reshape((len(x), 1))
x * y
array([10, 40])
x @ y
50
z @ np.transpose(z)
array([[1, 2],
       [2, 4]])
z @ z.T
array([[1, 2],
       [2, 4]])
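The point about 1d-arrays having no orientation is easiest to see by looking at the shapes involved. The following is a small sketch under the assumption that NumPy is installed; the names col and row are only illustrative.

```python
import numpy as np

x = np.array([1, 2, 3])

# a 1d-array has a single dimension, and transposing it changes nothing
shape_1d = x.shape           # (3,)
shape_1d_T = x.T.shape       # (3,)

# explicit column and row "vectors" are really matrices
col = x.reshape(-1, 1)       # shape (3, 1)
row = x.reshape(1, -1)       # shape (1, 3)

inner = x @ x                # scalar: 1 + 4 + 9 = 14
outer = col @ row            # 3 x 3 matrix, same as np.outer(x, x)
inner_2d = row @ col         # 1 x 1 matrix [[14]], not a scalar

# elementwise "*" between a column and a row broadcasts to a 3 x 3 matrix
broadcast = col * row

print(shape_1d, shape_1d_T, col.shape, row.shape, inner, inner_2d.shape)
```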
Outer product between two arrays can be done via the function np.outer
:
x = np.array([1, 2])
np.outer(x, x)
array([[1, 2],
       [2, 4]])
The number of elements of a numpy array can be retrieved via len
:
y = np.array([10, 20])
len(y)
2
Be careful when using len with arrays with more than one dimension! len always returns the "size" of the first dimension:
y = np.array([1, 2]).reshape((2, 1))
len(y)
2
len(np.transpose(y))
1
len(y.T)
1
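To avoid this ambiguity, one can query the array attributes directly instead of using len. A minimal sketch, assuming NumPy is installed:

```python
import numpy as np

y = np.arange(6).reshape((2, 3))

n_first_dim = len(y)     # 2: only the first dimension
shape = y.shape          # (2, 3): size of every dimension
n_cols = y.shape[1]      # 3
n_elements = y.size      # 6: total number of elements
n_dims = y.ndim          # 2: number of dimensions

print(n_first_dim, shape, n_cols, n_elements, n_dims)
```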
Matrices¶
A matrix is a two-dimensional collection of several variables of the same type.
We can easily create a matrix as a two-dimensional NumPy array:
# generate 5 x 4 numeric matrix
x = np.random.uniform(size=20).reshape((5, 4))
x
array([[0.99986311, 0.85728769, 0.8866678 , 0.74687179],
       [0.18634443, 0.93322182, 0.4948763 , 0.33255002],
       [0.47607828, 0.34158423, 0.84196406, 0.1878747 ],
       [0.76494651, 0.91290035, 0.87672029, 0.70672659],
       [0.22290809, 0.4503684 , 0.68715706, 0.26613961]])
# we can get the dimensions or number of rows/columns
np.shape(x)
(5, 4)
x.shape
(5, 4)
Identify rows, columns or elements using subscripts:
x[:, 3] # 4th column of matrix (returned as a 1D array)
array([0.74687179, 0.33255002, 0.1878747 , 0.70672659, 0.26613961])
x[2, :] # 3rd row of matrix (returned as a 1D array)
array([0.47607828, 0.34158423, 0.84196406, 0.1878747 ])
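Since indexing with a single integer drops a dimension, it is sometimes useful to keep the result two-dimensional or to select several rows at once. The sketch below uses a deterministic matrix (np.arange) instead of random numbers so that the results are easy to follow.

```python
import numpy as np

x = np.arange(20).reshape((5, 4))   # rows are [0..3], [4..7], ..., [16..19]

col_1d = x[:, 3]          # 1D array of shape (5,)
col_2d = x[:, [3]]        # list index keeps the result as a 5 x 1 matrix
col_2d_alt = x[:, 3:4]    # a slice of length one also keeps two dimensions

block = x[1:3, :2]        # rows 1-2 and columns 0-1: [[4, 5], [8, 9]]

# boolean mask: keep the rows whose first element is larger than 8
mask = x[:, 0] > 8
selected_rows = x[mask]   # rows starting with 12 and 16

print(col_1d, col_2d.shape, block, selected_rows)
```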
Pandas data frames¶
Pandas is a data analysis library, started by Wes McKinney, whose main data structure is the so-called DataFrame. The basic building block of a Pandas dataframe is the Series class. Basically, a dataframe is a collection of column-stacked Series objects that share the same "index". In finance, that "index" usually corresponds to time data (seconds, minutes, hours, days, weeks, months, etc.).
A Pandas data frame is more general than a NumPy nd-array in the sense that we can attach labels to the columns and each column can have a different data type:
import pandas as pd
df = pd.DataFrame(
{
'float': [1., 2., 3.],
'int': [1, 2, 3],
'datetime': [pd.Timestamp('20180310'), pd.Timestamp('20190310'), pd.Timestamp('20200310')],
'string': ['foo', 'bar', 'buzz']
}
)
print(df)
   float  int   datetime string
0    1.0    1 2018-03-10    foo
1    2.0    2 2019-03-10    bar
2    3.0    3 2020-03-10   buzz
Let's check the type of one of the variables:
type(df['float'])
pandas.core.series.Series
There are a variety of ways to retrieve the elements of a data frame:
df['float']
0    1.0
1    2.0
2    3.0
Name: float, dtype: float64
df['float'][0]
1.0
df['datetime']
0   2018-03-10
1   2019-03-10
2   2020-03-10
Name: datetime, dtype: datetime64[ns]
df['datetime'][2]
Timestamp('2020-03-10 00:00:00')
df.at[1, 'float']
2.0
df.loc[1]
float                       2.0
int                           2
datetime    2019-03-10 00:00:00
string                      bar
Name: 1, dtype: object
df.loc[:, 'float']
0    1.0
1    2.0
2    3.0
Name: float, dtype: float64
df.loc[1, 'float']
2.0
We can also set values, provided that they can be cast to the type of the column:
df.at[1, 'float'] = 10
print(df)
   float  int   datetime string
0    1.0    1 2018-03-10    foo
1   10.0    2 2019-03-10    bar
2    3.0    3 2020-03-10   buzz
The variable (column) names can be retrieved via the .columns attribute:
df.columns
Index(['float', 'int', 'datetime', 'string'], dtype='object')
A few other useful methods to inspect dataframes are head() and tail(), which show the first and last few rows (observations) in a dataframe:
print(df.head(n=2))
   float  int   datetime string
0    1.0    1 2018-03-10    foo
1   10.0    2 2019-03-10    bar
print(df.tail(n=2))
   float  int   datetime string
1   10.0    2 2019-03-10    bar
2    3.0    3 2020-03-10   buzz
In finance, data often comes with missing values, usually labeled as "NaN" (not a number) or "NaT" (not a time, for time values). Let's check out a basic example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"name": ['Superman', 'Batman', 'Spiderman'],
"toy": [np.nan, 'Batmobile', 'Spiderman toy'],
"born": [pd.NaT, pd.Timestamp("1956-06-26"), pd.NaT]})
print(df)
        name            toy       born
0   Superman            NaN        NaT
1     Batman      Batmobile 1956-06-26
2  Spiderman  Spiderman toy        NaT
Now, in case we simply would like to remove the rows (observations) where at least one element is NaN or NaT, we use df.dropna():
print(df.dropna())
     name        toy       born
1  Batman  Batmobile 1956-06-26
In case we would like to remove the columns (variables) where at least one element is missing, we make use of the argument axis:
print(df.dropna(axis='columns'))
        name
0   Superman
1     Batman
2  Spiderman
Many other options are available to deal with NaNs and NaTs, such as specifying which columns to check for missing values:
print(df.dropna(subset=['name', 'born']))
     name        toy       born
1  Batman  Batmobile 1956-06-26
Note that these operations do not happen in place, i.e., the original dataframe is kept intact.
In case we would like to perform in-place modifications, we use the argument inplace=True:
print(df)
        name            toy       born
0   Superman            NaN        NaT
1     Batman      Batmobile 1956-06-26
2  Spiderman  Spiderman toy        NaT
df.dropna(inplace=True)
print(df)
     name        toy       born
1  Batman  Batmobile 1956-06-26
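Dropping observations is not always desirable: with price data, a common alternative is to carry the last available value forward. The sketch below uses a made-up price table (the column names A and B and all numbers are purely illustrative) to count the missing values and fill them with ffill() or fillna().

```python
import numpy as np
import pandas as pd

# hypothetical prices with gaps, indexed by dates
prices = pd.DataFrame(
    {"A": [10.0, np.nan, 10.4, 10.6],
     "B": [20.0, 20.2, np.nan, np.nan]},
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

# number of missing values per column
n_missing = prices.isna().sum()

# forward fill: each gap takes the last observed value in its column
filled = prices.ffill()

# alternatively, replace the missing values with a constant
constant = prices.fillna(0.0)

print(n_missing)
print(filled)
```

As with dropna(), both methods return a new dataframe and leave the original untouched.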
Data frames in Python are very powerful and versatile. They are commonly used in machine learning, where each row is one observation and each column one variable (each variable can be of a different type). For financial applications, we mainly deal with multivariate time series, which can be seen as a matrix or data frame but with some particularities: each row is an observation in a specific order (properly indexed with dates or times) and each column is of the same type (float).
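As a minimal sketch of such a multivariate time series, the code below builds a dataframe of made-up prices for two hypothetical assets A and B indexed by dates, selects a date range, and computes simple and log returns.

```python
import numpy as np
import pandas as pd

# hypothetical prices: one row per date, one float column per asset
dates = pd.date_range("2020-01-01", periods=4, freq="D")
prices = pd.DataFrame(
    {"A": [100.0, 110.0, 99.0, 99.0],
     "B": [50.0, 50.0, 55.0, 60.5]},
    index=dates,
)

# select observations by date (label-based slices include both endpoints)
window = prices.loc["2020-01-02":"2020-01-03"]

# simple returns p_t / p_{t-1} - 1; the first row has no previous price and is dropped
simple_returns = prices.pct_change().dropna()

# log returns log(p_t) - log(p_{t-1})
log_returns = np.log(prices).diff().dropna()

print(window)
print(simple_returns)
print(log_returns)
```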
Plotting¶
We will make full use of Matplotlib, Seaborn, and Plotly for all our plots :) See the example below for how to plot data stored in a Pandas dataframe with seaborn.
Let's examine this code snippet from matplotlib's documentation page:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cbook as cbook
# Load a numpy record array from yahoo csv data with fields date, open, close,
# volume, adj_close from the mpl-data/example directory. The record array
# stores the date as an np.datetime64 with a day unit ('D') in the date column.
data_file = cbook.get_sample_data('goog.npz', asfileobj=False)
price_data = np.load(data_file)['price_data']
price_data = price_data[-250:] # get the most recent 250 trading days
type(price_data)
delta1 = np.diff(price_data['adj_close']) / price_data['adj_close'][:-1]
# Plot
volume = (15 * price_data['volume'][:-2] / price_data['volume'][0])**2 # for size
close = 0.003 * price_data['close'][:-2] / 0.003 * price_data['open'][:-2] # for color
fig, ax = plt.subplots()
ax.scatter(delta1[:-1], delta1[1:], c=close, s=volume, alpha=0.5)
ax.set_xlabel(r'$\Delta_i$', fontsize=15)
ax.set_ylabel(r'$\Delta_{i+1}$', fontsize=15)
ax.set_title('Volume and percent change')
ax.grid(True)
fig.tight_layout()
plt.show()
Matplotlib basically deals with NumPy nd-arrays and their subclasses, like the np.recarray used in the previous example. Seaborn, on the other hand, can deal with Pandas DataFrames too. Let's see a basic example from seaborn's documentation page:
import seaborn as sns
sns.set(style="darkgrid")
# Load an example dataset with long-form data
fmri = sns.load_dataset("fmri")
print(type(fmri))
print(fmri.head())
# Plot the responses for different events and regions
sns.lineplot(x="timepoint", y="signal",
hue="region", style="event",
data=fmri)
<class 'pandas.core.frame.DataFrame'>
  subject  timepoint event    region    signal
0     s13         18  stim  parietal -0.017552
1      s5         14  stim  parietal -0.080883
2     s12         18  stim  parietal -0.081033
3     s11         18  stim  parietal -0.046134
4     s10         18  stim  parietal -0.037970
<AxesSubplot: xlabel='timepoint', ylabel='signal'>
Key packages for finance¶
We will make use of several key libraries in Python.
Package yfinance¶
The package yfinance lets us query financial instrument data from the Yahoo! Finance platform.
Let's see how to get Apple stock price data from Yahoo! Finance:
import yfinance as yf
# apple = yf.download('AAPL', auto_adjust=False)  # full history; auto_adjust=False keeps both Close and Adj Close
apple = yf.download("AAPL", start="2017-01-01", end="2017-04-30", auto_adjust=False)
[*********************100%***********************] 1 of 1 completed
Now, let's inspect the Pandas dataframe apple via methods such as head() and tail(), which show the first and last few observations of the dataframe:
apple.head()
Date | Open | High | Low | Close | Adj Close | Volume |
---|---|---|---|---|---|---|
2017-01-03 00:00:00-05:00 | 28.950001 | 29.082500 | 28.690001 | 29.037500 | 26.891964 | 115127600 |
2017-01-04 00:00:00-05:00 | 28.962500 | 29.127501 | 28.937500 | 29.004999 | 26.861866 | 84472400 |
2017-01-05 00:00:00-05:00 | 28.980000 | 29.215000 | 28.952499 | 29.152500 | 26.998465 | 88774400 |
2017-01-06 00:00:00-05:00 | 29.195000 | 29.540001 | 29.117500 | 29.477501 | 27.299444 | 127007600 |
2017-01-09 00:00:00-05:00 | 29.487499 | 29.857500 | 29.485001 | 29.747499 | 27.549498 | 134247600 |
apple.tail()
Date | Open | High | Low | Close | Adj Close | Volume |
---|---|---|---|---|---|---|
2017-04-24 00:00:00-04:00 | 35.875000 | 35.987499 | 35.794998 | 35.910000 | 33.400845 | 68537200 |
2017-04-25 00:00:00-04:00 | 35.977501 | 36.224998 | 35.967499 | 36.132500 | 33.607796 | 75486000 |
2017-04-26 00:00:00-04:00 | 36.117500 | 36.150002 | 35.845001 | 35.919998 | 33.410152 | 80164800 |
2017-04-27 00:00:00-04:00 | 35.980000 | 36.040001 | 35.827499 | 35.947498 | 33.435722 | 56985200 |
2017-04-28 00:00:00-04:00 | 36.022499 | 36.075001 | 35.817501 | 35.912498 | 33.403168 | 83441600 |
We can also plot selected columns of a dataframe using the plot() method:
print(apple[['High', 'Low', 'Open', 'Close', 'Adj Close']].plot())
AxesSubplot(0.125,0.2;0.775x0.68)
Package empyrical¶
The empyrical package is an open source Python package developed by Quantopian Inc. It's widely used by practitioners to compute common risk and performance measures.
from empyrical import max_drawdown, roll_max_drawdown, cum_returns, omega_ratio, sharpe_ratio
# create a synthetic array of returns
returns = np.array([.01, .02, .03, -.4, -.06, -.02])
max_drawdown(returns) # calculate the maximum drawdown
-0.4472800000000001
roll_max_drawdown(returns, window=3) # calculate the maximum drawdown in a rolling window fashion
array([ 0. , -0.4 , -0.436 , -0.44728])
cum_returns(returns) # calculate the cumulative returns
array([ 0.01 , 0.0302 , 0.061106 , -0.3633364 , -0.40153622, -0.41350549])
omega_ratio(returns) # calculate the Omega ratio
0.12499999999999999
sharpe_ratio(returns) # calculate the Sharpe ratio
-6.7377339531573535
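To see what these measures compute, they can be reproduced with a few lines of NumPy. The following is a sketch of the textbook definitions, not empyrical's actual implementation; it assumes a zero risk-free rate, a zero threshold for the Omega ratio, and 252 periods per year for annualizing the Sharpe ratio, which reproduces the numbers shown above.

```python
import numpy as np

returns = np.array([.01, .02, .03, -.4, -.06, -.02])

# cumulative returns: compounded wealth of one unit invested, minus one
wealth = np.cumprod(1 + returns)
cum_returns = wealth - 1

# maximum drawdown: largest relative drop of the wealth from its running peak
wealth_with_start = np.concatenate(([1.0], wealth))   # include the initial unit of wealth
running_peak = np.maximum.accumulate(wealth_with_start)
max_drawdown = np.min(wealth_with_start / running_peak - 1)

# annualized Sharpe ratio (assumption: zero risk-free rate, 252 periods per year)
sharpe_ratio = np.sqrt(252) * returns.mean() / returns.std(ddof=1)

# Omega ratio (assumption: zero threshold): sum of gains over sum of losses
omega_ratio = returns[returns > 0].sum() / -returns[returns < 0].sum()

print(cum_returns, max_drawdown, sharpe_ratio, omega_ratio)
```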
Package riskparityportfolio¶
riskparityportfolio is a package to design risk parity portfolios, a different approach to investment that is primarily used to control how much risk goes into each asset.
import riskparityportfolio as rpp # pip install riskparityportfolio (dependencies: numpy, jax, quadprog, pybind, and tqdm)
import numpy as np
import matplotlib.pyplot as plt
cov_matrix = np.vstack((np.array((1.0000, 0.0015, -0.0119)),
np.array((0.0015, 1.0000, -0.0308)),
np.array((-0.0119, -0.0308, 1.0000))))
risk_budget_vector = np.array((0.1594, 0.0126, 0.8280))
w = rpp.vanilla.design(cov_matrix, risk_budget_vector)
plt.bar(["stock " + item for item in ["A", "B", "C"]], w)
plt.xlabel("")
plt.ylabel("portfolio weight")
plt.show()
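To check how much risk goes into each asset for a given portfolio, one can compute the relative risk contributions $w_i(\Sigma w)_i / (w^T \Sigma w)$. The sketch below is independent of the riskparityportfolio package: the function name relative_risk_contributions and the volatilities are hypothetical, and the example uses a diagonal covariance matrix, for which the inverse-volatility portfolio is known to equalize the risk contributions.

```python
import numpy as np


def relative_risk_contributions(w, cov_matrix):
    """Relative risk contribution of each asset: w_i * (Sigma w)_i / (w' Sigma w)."""
    w = np.asarray(w, dtype=float)
    marginal = cov_matrix @ w          # Sigma w
    return w * marginal / (w @ marginal)


# hypothetical uncorrelated assets with different volatilities
vols = np.array([0.1, 0.2, 0.4])
cov = np.diag(vols ** 2)

# inverse-volatility portfolio, normalized to sum to one
w_inv_vol = (1 / vols) / np.sum(1 / vols)

rrc = relative_risk_contributions(w_inv_vol, cov)
print(w_inv_vol, rrc)   # the three contributions are equal
```

Applied to the portfolio w and cov_matrix of the previous example, the result should be close to risk_budget_vector.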
Python Scripts and Jupyter Notebooks¶
Python Scripts¶
One simple way to use Python is by typing the commands in the IPython terminal one by one. However, this quickly becomes impractical, and it is then necessary to write scripts. You can use your favourite text editor to create Python scripts.
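A script is just a text file with the .py extension that is run from a terminal with python followed by the file name. The following is a minimal sketch of a common layout; the file name mean_return.py, the function names, and the sample numbers are all hypothetical.

```python
"""mean_return.py: a hypothetical example script."""


def mean_return(returns):
    """Arithmetic mean of a list of returns."""
    return sum(returns) / len(returns)


def main():
    sample = [0.01, 0.02, -0.01]   # made-up returns for illustration
    print(f"mean return: {mean_return(sample):.4f}")


# run main() only when the file is executed as a script, not when it is imported
if __name__ == "__main__":
    main()
```

The final guard lets other scripts or notebooks import mean_return from this file without triggering the printout.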
Jupyter Notebook/Lab¶
Jupyter notebooks enable you to write report-like documents containing code, documentation, mathematical equations, figures, and so on. This document is an example.
To explore further¶
Check out Awesome Quant for a list of curated packages relevant to financial applications.