Primer on Python for Finance

Daniel P. Palomar (2025). Portfolio Optimization: Theory and Application. Cambridge University Press.

Last update: February 06, 2025

Contributors:


Introduction

What is Python?

Python vs R

Let's not even get started :) Both are great! Learn both! You'll have to use whatever your boss/advisor/team needs. But don't forget to know a bit of C++ too :)

Installation

First, install a Python distribution, for example, from Anaconda (in macOS and Linux you can also use brew install python).

Then, install your favorite code editor or IDE. Some examples are:

To get started coding, start your code editor or IDE. For example, with JupyterLab either click the app or from a terminal (or cmd on Windows) simply type jupyter lab (and a browser window will pop up). A Jupyter notebook is an environment where you can write code and interactively evaluate its output. This feature is very convenient for exploratory analysis. Now you are ready to start using Python from within JupyterLab.

Packages

To see the versions of Python and the installed packages just type !pip list or !conda list on a Jupyter notebook and press SHIFT+ENTER. Alternatively, type pip list or conda list in a terminal window and press ENTER.

To see the version of a specific package use import package_name; print(package_name.__version__).
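The versions can also be queried programmatically without importing each package. The following is a minimal sketch using the standard-library module importlib.metadata (available from Python 3.8); the helper name package_version and the package names in the loop are illustrative assumptions.

```python
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


# version of the Python interpreter itself
python_version = sys.version.split()[0]
print("Python:", python_version)

# the package names below are only examples; replace them with your own
for name in ["numpy", "pandas", "matplotlib"]:
    print(name, package_version(name))
```

A package that is not installed is reported as None instead of raising an error, which is convenient when checking an environment before running a notebook.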

As time progresses, you will have to install different packages from PyPI or Conda with the command pip install package_name or conda install package_name. Note that you can execute these commands from within a Jupyter notebook by prepending them with an exclamation mark (!).

After installing a package, it needs to be imported with the command import package_name before it can be used:

# we need to import it first and then we can use it:
import numpy  # to install do: pip install numpy
x = [1, 2, 3]
y = numpy.mean(x)
y
2.0

It is common to use short aliases for the names of the imported libraries:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 100)
y = (x - np.pi) * (x - 1.618)
plt.plot(x, y, label="2nd-degree polynomial")
plt.legend()
plt.show()

Good style

There are several Python style guides:

  • Style Guide for Python Code: The Python Enhancement Proposal (PEP) 8 is a widely accepted document that outlines good programming practices for Python. It was created by Guido van Rossum, Barry Warsaw, and Alyssa Coghlan, and it evolves over time as new conventions are identified and old ones become obsolete. The guide emphasizes that code is read much more often than it is written, and therefore, readability and consistency are crucial. It provides guidelines on various aspects of coding in Python, including naming conventions, indentation, and use of whitespace, among others.

  • Google style guide: Google also has its own style guide for Python, which includes a list of dos and don'ts for Python programs. It emphasizes the importance of using descriptive names for public APIs, making modules importable, and using the right style for module, function, method docstrings, and inline comments. It also recommends using tools like pylint for finding bugs and style problems in Python source code.

  • The Hitchhiker's Guide to Python: Another resource that also recommends following PEP 8. It highlights the importance of readability and provides some common Python idioms.

Remember, while these style guides provide useful guidelines, they are not absolute rules. They are intended to improve the readability and consistency of your code, but there may be instances where it makes sense to deviate from the guidelines. As PEP 8 itself says, "A Foolish Consistency is the Hobgoblin of Little Minds". When in doubt, use your best judgment and consider the readability and maintainability of your code.

Variables: lists, dicts, arrays, and data frames

In Python, we can easily assign a value to a variable or object with = (if the variable does not exist it will be created):

x = "Hello"
x
'Hello'

We can combine several elements with lists:

y = ["Hello", "everyone"]
y
['Hello', 'everyone']

Note that the elements of a list need not have the same data type (we'll see the data types in a few minutes):

y = [1, "hello", 2., "everyone"]
y
[1, 'hello', 2.0, 'everyone']

A dictionary, or simply "dict", is a data structure that maps keys (keywords) to values. There are many ways to create a dict; the simplest one is to use curly brackets {} as follows:

x = {'a': 1, 'b': 2, 'c': 3}
x
{'a': 1, 'b': 2, 'c': 3}

Another way is to explicitly specify the keywords and values:

x = dict(a = 1, b = 2, c = 3)
x
{'a': 1, 'b': 2, 'c': 3}

A dict can store different data types for different keywords:

x = dict(a = '1', b = 2, c = 3)
x
{'a': '1', 'b': 2, 'c': 3}

The usual way to query a value from a dict is to pass the desired keyword:

x['a']
'1'

We can also modify the contents of dicts or add new entries:

x = dict(a = 1, b = 2, c = 3)
x['a'] = 2
x
{'a': 2, 'b': 2, 'c': 3}
x['d'] = 10
x
{'a': 2, 'b': 2, 'c': 3, 'd': 10}
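A few other dict operations come up often in practice: checking whether a key exists, querying with a default value, and looping over the entries. A minimal sketch, where the tickers and weights are made-up illustrative values:

```python
# hypothetical portfolio weights keyed by ticker
weights = {"AAPL": 0.5, "MSFT": 0.3}

# membership tests act on the keys
has_aapl = "AAPL" in weights
has_goog = "GOOG" in weights

# .get() returns a default instead of raising KeyError for a missing key
goog_weight = weights.get("GOOG", 0.0)

# loop over (key, value) pairs in insertion order
for ticker, w in weights.items():
    print(ticker, w)

# remove an entry and get its value back
removed = weights.pop("MSFT")
remaining_keys = list(weights.keys())
```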

Sets in Python are unordered collections of unique elements. The main purposes of sets are to verify membership, remove duplicate elements from a sequence, and compute standard mathematical operations on sets.

We can create sets from lists as follows:

x = set([1, 2, 3, 1])
y = set([2, 4, 5, 3])
x.intersection(y)
{2, 3}
x.difference(y)
{1}
x.symmetric_difference(y)
{1, 4, 5}
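The other two uses of sets, membership tests and duplicate removal, can be sketched as follows (the ticker list is a made-up example):

```python
tickers = ["AAPL", "MSFT", "AAPL", "GOOG", "MSFT"]

# converting to a set removes duplicates (the order is not guaranteed)
unique_tickers = set(tickers)

# membership test
is_member = "GOOG" in unique_tickers

# if the original order matters, dict.fromkeys() keeps the first occurrence of each element
ordered_unique = list(dict.fromkeys(tickers))

# the operators | and & are shortcuts for union and intersection
x = {1, 2, 3}
y = {2, 3, 4, 5}
union = x | y
intersection = x & y
```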

A useful command is ?variable. It gives you various information about the variable, i.e., type, dimensions, contents, etc. Note that this is a feature only in IPython and JupyterLab.

Another useful feature in Python is slicing. It is especially handy for large arrays or lists, for example, to show only the first or last n elements:

x = np.arange(1000)
x[:10]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
x[-10:]
array([990, 991, 992, 993, 994, 995, 996, 997, 998, 999])

It is important to keep in mind that in Python almost everything is done through functions or methods of all sorts such as max(), min(), arange(), linspace(), and so on.

Data types

Operators in Python: arithmetic operators include +, -, *, /, ** for addition, subtraction, multiplication, division, and exponentiation. Comparison operators are <, <=, >, >=, ==, !=. Boolean operators are and, or, not; the bitwise operators & and | play the analogous role bit by bit on integers (and elementwise on NumPy arrays).
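A short sketch of these operators in action. It also uses floor division // and the remainder %, which are standard Python operators not listed above:

```python
# arithmetic
a = 7 / 2      # true division always returns a float
b = 7 // 2     # floor division
c = 7 % 2      # remainder
d = 2 ** 10    # exponentiation

# comparisons return booleans and can be chained
e = 1 < 2 <= 2

# "and", "or", "not" operate on truth values
f = (1 < 2) and not (2 < 1)

# "&" and "|" operate bit by bit on integers
g = 6 & 3      # 0b110 & 0b011 = 0b010
h = 6 | 3      # 0b110 | 0b011 = 0b111
```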

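Python has a wide variety of data types including scalars (integers, floats, complex numbers), strings, lists, tuples, dictionaries (dicts), sets, and, through NumPy and Pandas, nd-arrays and data frames. Note that Python's float is a double-precision number; there is no separate double type.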

Scalars and strings

Scalars are basically floats and integers, for example:

x = 1
type(x)
int
x = 1.1
type(x)
float

Can you think about why Python gives the following answer to the expression 3.3 - (1.1 + 2.2)?

x = 3.3 - (1.1 + 2.2)
x
-4.440892098500626e-16

Try the same thing in R and MATLAB. What do you see? Is it different from what Python computed? What is the binary expansion of 0.333333...? How do computers represent anything? :)
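As a hint for the questions above, the standard library lets us look at the value that is actually stored for a decimal literal such as 1.1. The sketch below uses decimal.Decimal and math.isclose; it is just one possible way of exploring the issue.

```python
import math
from decimal import Decimal

# Decimal(float) shows the exact binary value stored for the literal 1.1
print(Decimal(1.1))

x = 3.3 - (1.1 + 2.2)
print(x == 0)                        # False: the rounding errors do not cancel exactly

# compare floats with a tolerance instead of ==
print(math.isclose(1.1 + 2.2, 3.3))  # True
```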

Strings are sequences of characters enclosed in single or double quotes:

x = "Hello IEDA/ELEC 3180"
x
'Hello IEDA/ELEC 3180'

Complex numbers are defined by using the literal aj, where a is any float or integer number, which represents $a\sqrt{-1}$:

x = 1 + 2j
x
(1+2j)
x.conjugate()
(1-2j)
x.real
1.0
x.imag
2.0
abs(x)
2.23606797749979

Note that the function abs, which in the case of real scalars computes the absolute value, returns the modulus of a complex number, i.e., its length as if it were a vector in $\mathbb{R}^2$.

Lists

The most basic data structure in Python is a list. It is an ordered collection of objects of any type defined with square brackets []. For example:

x = [1, 2, 3., "hello", True]

Unlike R, MATLAB, and Julia, Python is 0-index based, which means that x[0] is the actual first element of the list x. The length of a list may be obtained via the function len().

len(x)
5

To access the value in a given position of a list, use indexes:

x[1]  # 2nd element of the list x
2

Python allows for negative indexes to be given, e.g., x[-1] returns the last item of a list, x[-2] returns the second last item, and so on:

x[-1]
True
x[-2]
'hello'

Additionally, we can retrieve sublists of a list by using slices, e.g., x[1:3] returns a sublist containing the elements x[1] and x[2]; x[1:] returns a sublist containing all the elements to the right of (and including) x[1]; x[:3] returns a sublist containing all the elements to the left of x[3]. In general, x[a:b], for integers a,b, b > a, returns the ordered sublist from x[a] to x[b-1].
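The slicing rules above can be checked on a small list. This sketch also uses the optional third slice parameter (the step), which is not discussed in the text:

```python
x = [1, 2, 3., "hello", True]

a = x[1:3]    # elements x[1] and x[2]
b = x[1:]     # from x[1] to the end
c = x[:3]     # from the start up to (but excluding) x[3]
d = x[::2]    # every second element, starting at x[0]
e = x[::-1]   # reversed copy of the list
```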

However, in many cases we would like to retrieve a sublist whose indexes possess a particular property. This can be done with what in Python jargon is called a "list comprehension". For instance, let's retrieve the sublist of elements which are in even positions of the original list x:

y = [x[i] for i in range(len(x)) if i % 2 == 0]
y
[1, 3.0, True]
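List comprehensions can also filter on the values themselves or transform each element. A short sketch with made-up returns:

```python
returns = [0.01, -0.02, 0.03, -0.004, 0.015]

# keep only the negative returns
losses = [r for r in returns if r < 0]

# transform each element: returns in percent, rounded to two decimals
returns_pct = [round(100 * r, 2) for r in returns]

# enumerate() gives both the position and the value
positive_positions = [i for i, r in enumerate(returns) if r > 0]
```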

We can concatenate lists by using the "+" operator:

[1, 2, 3] + [3, 2, 1]
[1, 2, 3, 3, 2, 1]

NumPy arrays

Python was not designed specifically for scientific computing; however, libraries such as NumPy, started by Travis Oliphant, extend the language's data structures to deal more easily with vectors, matrices, and the mathematical operations involved.

Note that in Python, 1d numpy arrays (or simply 1d-arrays) are neither column vectors nor row vectors: they do not have any orientation. If one desires a column vector, then that is actually an $n\times 1$ matrix.
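The difference between a 1d-array and a column or row vector shows up in the shape attribute. A minimal sketch:

```python
import numpy as np

x = np.array([1, 2, 3])
print(x.shape)      # (3,): one dimension, no orientation
print(x.T.shape)    # (3,): transposing a 1d-array does nothing

col = x.reshape(-1, 1)   # column vector as a 3 x 1 matrix
row = x.reshape(1, -1)   # row vector as a 1 x 3 matrix
print(col.shape, row.shape)

# np.newaxis is an equivalent way to add a dimension
col2 = x[:, np.newaxis]
```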

It is also important to differentiate elementwise multiplication * from inner or dot product @ (also np.dot()):

x = np.array([1, 2])
y = np.array([10, 20])
z = x.reshape((len(x), 1))
x * y
array([10, 40])
x @ y
50
z @ np.transpose(z)
array([[1, 2],
       [2, 4]])
z @ z.T
array([[1, 2],
       [2, 4]])

Outer product between two arrays can be done via the function np.outer:

x = np.array([1, 2])
np.outer(x, x)
array([[1, 2],
       [2, 4]])

The number of elements of a 1d numpy array can be retrieved via len:

y = np.array([10, 20])
len(y)
2

Be careful when using len with arrays with more than one dimension! len always returns the "size" of the first dimension:

y = np.array([1, 2]).reshape((2, 1))
len(y)
2
len(np.transpose(y))
1
len(y.T)
1
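To avoid this pitfall, the attributes shape, size, and ndim are usually safer than len. A minimal sketch:

```python
import numpy as np

y = np.arange(6).reshape((2, 3))

n_first = len(y)   # size of the first dimension only
shape = y.shape    # size of every dimension
n_total = y.size   # total number of elements
n_dims = y.ndim    # number of dimensions
```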

Matrices

A matrix is a two-dimensional collection of elements of the same type.

We can easily create a matrix as a two-dimensional NumPy array:

# generate 5 x 4 numeric matrix 
x = np.random.uniform(size=20).reshape((5, 4))
x
array([[0.99986311, 0.85728769, 0.8866678 , 0.74687179],
       [0.18634443, 0.93322182, 0.4948763 , 0.33255002],
       [0.47607828, 0.34158423, 0.84196406, 0.1878747 ],
       [0.76494651, 0.91290035, 0.87672029, 0.70672659],
       [0.22290809, 0.4503684 , 0.68715706, 0.26613961]])
# we can get the dimensions or number of rows/columns
np.shape(x)
(5, 4)
x.shape
(5, 4)

Identify rows, columns or elements using subscripts:

x[:,3]  # 4th column of matrix (returned as a 1D array)
array([0.74687179, 0.33255002, 0.1878747 , 0.70672659, 0.26613961])
x[2,:]  # 3rd row of matrix (returned as a 1D array)
array([0.47607828, 0.34158423, 0.84196406, 0.1878747 ])
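Indexing with a single integer drops that dimension. To keep the result as a matrix, index with a slice or a list instead. A short sketch (a small deterministic matrix is used here instead of the random one above so that the results are reproducible):

```python
import numpy as np

x = np.arange(20).reshape((5, 4))

col_1d = x[:, 3]        # 4th column as a 1D array, shape (5,)
col_2d = x[:, [3]]      # 4th column as a 5 x 1 matrix
elem = x[2, 3]          # element in the 3rd row and 4th column
block = x[1:3, :2]      # 2nd-3rd rows and 1st-2nd columns as a 2 x 2 matrix
```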

Pandas data frames

Pandas is a data analysis library, started by Wes McKinney, whose main data structure is the so-called DataFrame. The basic building block of a Pandas data frame is the Series class. Basically, a data frame is a collection of column-stacked Series objects that share the same "index". In finance, that "index" usually corresponds to time stamps (seconds, minutes, hours, days, weeks, months, etc.).

A Pandas data frame is more general than a numpy nd-array in the sense that we can attach labels to the columns and the columns can have different data types:

import pandas as pd
df = pd.DataFrame(
    {
        'float': [1., 2., 3.],
        'int': [1, 2, 3],
        'datetime': [pd.Timestamp('20180310'), pd.Timestamp('20190310'), pd.Timestamp('20200310')],
        'string': ['foo', 'bar', 'buzz']
    }
)
print(df)
   float  int   datetime string
0    1.0    1 2018-03-10    foo
1    2.0    2 2019-03-10    bar
2    3.0    3 2020-03-10   buzz

Let's check the type of one of the variables:

type(df['float'])
pandas.core.series.Series

There are a variety of ways to retrieve the elements of a data frame:

df['float']
0    1.0
1    2.0
2    3.0
Name: float, dtype: float64
df['float'][0]
1.0
df['datetime']
0   2018-03-10
1   2019-03-10
2   2020-03-10
Name: datetime, dtype: datetime64[ns]
df['datetime'][2]
Timestamp('2020-03-10 00:00:00')
df.at[1, 'float']
2.0
df.loc[1]
float                       2.0
int                           2
datetime    2019-03-10 00:00:00
string                      bar
Name: 1, dtype: object
df.loc[:, 'float']
0    1.0
1    2.0
2    3.0
Name: float, dtype: float64
df.loc[1, 'float']
2.0
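Two other common access patterns are position-based indexing with .iloc and row selection with a boolean mask. A minimal sketch on a small data frame similar to the one above:

```python
import pandas as pd

df = pd.DataFrame({'float': [1., 2., 3.],
                   'int': [1, 2, 3],
                   'string': ['foo', 'bar', 'buzz']})

first_row = df.iloc[0]               # first row by position (a Series)
value = df.iloc[1, 0]                # 2nd row, 1st column by position
subset = df[df['float'] > 1.5]       # rows where the condition holds
two_cols = df[['float', 'string']]   # several columns at once (a DataFrame)
```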

We can also set values provided that they can be cast to the type of the column:

df.at[1, 'float'] = 10
print(df)
   float  int   datetime string
0    1.0    1 2018-03-10    foo
1   10.0    2 2019-03-10    bar
2    3.0    3 2020-03-10   buzz

The variable (column) names can be retrieved via the .columns attribute:

df.columns
Index(['float', 'int', 'datetime', 'string'], dtype='object')

A few other useful methods to inspect dataframes are head() and tail() that show the first and last few rows (observations) in a dataframe:

print(df.head(n=2))
   float  int   datetime string
0    1.0    1 2018-03-10    foo
1   10.0    2 2019-03-10    bar
print(df.tail(n=2))
   float  int   datetime string
1   10.0    2 2019-03-10    bar
2    3.0    3 2020-03-10   buzz

In finance, data often comes with missing values, usually labeled as "NaN" (not a number) or "NaT" (not a time, for time values). Let's check out a basic example:

import pandas as pd
import numpy as np

df = pd.DataFrame({"name": ['Superman', 'Batman', 'Spiderman'],
                   "toy": [np.nan, 'Batmobile', 'Spiderman toy'],
                   "born": [pd.NaT, pd.Timestamp("1956-06-26"), pd.NaT]})
print(df)
        name            toy       born
0   Superman            NaN        NaT
1     Batman      Batmobile 1956-06-26
2  Spiderman  Spiderman toy        NaT

Now, in case we simply would like to remove the rows (observations) where at least one element is NaN or NaT, we use df.dropna():

print(df.dropna())
     name        toy       born
1  Batman  Batmobile 1956-06-26

In case we would like to remove the columns (variables) where at least one element is missing, we make use of the argument axis:

print(df.dropna(axis='columns'))
        name
0   Superman
1     Batman
2  Spiderman

Many other options are available to deal with NaNs and NaTs, such as specifying which columns to look for missing values:

print(df.dropna(subset=['name', 'born']))
     name        toy       born
1  Batman  Batmobile 1956-06-26

Note that these operations do not happen in place, i.e., the original dataframe is kept intact. In case we would like to perform in-place modifications, we use the argument inplace=True:

print(df)
        name            toy       born
0   Superman            NaN        NaT
1     Batman      Batmobile 1956-06-26
2  Spiderman  Spiderman toy        NaT
df.dropna(inplace=True)
print(df)
     name        toy       born
1  Batman  Batmobile 1956-06-26

Data frames in Python are very powerful and versatile. They are commonly used in machine learning where each row is one observation and each column one variable (each variable can be of a different type). For financial applications, we mainly deal with multivariate time series, which can be seen as a matrix or data frame, but with some particularities: each row is an observation but in a specific order (properly indexed with dates or times) and all columns are of the same type (float).
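As an illustration, a small multivariate time series can be stored as a data frame indexed by dates. The prices and column names below are made-up numbers for the sketch; ffill() and pct_change() are standard Pandas methods.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=4, freq="D")
prices = pd.DataFrame({"A": [100.0, 101.0, np.nan, 103.0],
                       "B": [50.0, 49.5, 50.5, 51.0]},
                      index=dates)

# fill a missing price with the last available observation
prices_filled = prices.ffill()

# simple returns: p_t / p_{t-1} - 1 (the first row has no previous price and is dropped)
returns = prices_filled.pct_change().dropna()

print(returns)
print(returns.dtypes)   # every column has the same floating-point type
```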

Plotting

We will make full use of Matplotlib, Seaborn, and Plotly for all our plots :) See the example below for how to plot data stored in a Pandas dataframe with seaborn.

Let's examine this code snippet from matplotlib's documentation page:

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cbook as cbook

# Load a numpy record array from yahoo csv data with fields date, open, close,
# volume, adj_close from the mpl-data/example directory. The record array
# stores the date as an np.datetime64 with a day unit ('D') in the date column.
data_file = cbook.get_sample_data('goog.npz', asfileobj=False)
price_data = np.load(data_file)['price_data']
price_data = price_data[-250:]  # get the most recent 250 trading days
type(price_data)
delta1 = np.diff(price_data['adj_close']) / price_data['adj_close'][:-1]

# Plot
volume = (15 * price_data['volume'][:-2] / price_data['volume'][0])**2  # for size
close = 0.003 * price_data['close'][:-2] / 0.003 * price_data['open'][:-2]  # for color
fig, ax = plt.subplots()
ax.scatter(delta1[:-1], delta1[1:], c=close, s=volume, alpha=0.5)
ax.set_xlabel(r'$\Delta_i$', fontsize=15)
ax.set_ylabel(r'$\Delta_{i+1}$', fontsize=15)
ax.set_title('Volume and percent change')
ax.grid(True)
fig.tight_layout()
plt.show()

Matplotlib basically deals with numpy nd-arrays and its subclasses like np.recarray used in the previous example. Seaborn, on the other hand, can deal with Pandas DataFrames too. Let's see a basic example from seaborn's documentation page:

import seaborn as sns
sns.set(style="darkgrid")
# Load an example dataset with long-form data
fmri = sns.load_dataset("fmri")
print(type(fmri))
print(fmri.head())
# Plot the responses for different events and regions
sns.lineplot(x="timepoint", y="signal",
             hue="region", style="event",
             data=fmri)
<class 'pandas.core.frame.DataFrame'>
  subject  timepoint event    region    signal
0     s13         18  stim  parietal -0.017552
1      s5         14  stim  parietal -0.080883
2     s12         18  stim  parietal -0.081033
3     s11         18  stim  parietal -0.046134
4     s10         18  stim  parietal -0.037970
<AxesSubplot: xlabel='timepoint', ylabel='signal'>

Key packages for finance

We will make use of several key libraries in Python.

Package yfinance

The package yfinance lets us query financial instrument data from the Yahoo! Finance platform.

Let's see how to get Apple stock price data from Yahoo! Finance:

import yfinance as yf

# apple = yf.download('AAPL', auto_adjust=False)  # auto_adjust=False also returns the adjusted close price
apple = yf.download("AAPL", start="2017-01-01", end="2017-04-30", auto_adjust=False)
[*********************100%***********************]  1 of 1 completed

Now, let's inspect the Pandas dataframe apple via methods such as head() and tail(), which show the first and last few observations of the dataframe:

apple.head()
                                Open       High        Low      Close  Adj Close     Volume
Date
2017-01-03 00:00:00-05:00  28.950001  29.082500  28.690001  29.037500  26.891964  115127600
2017-01-04 00:00:00-05:00  28.962500  29.127501  28.937500  29.004999  26.861866   84472400
2017-01-05 00:00:00-05:00  28.980000  29.215000  28.952499  29.152500  26.998465   88774400
2017-01-06 00:00:00-05:00  29.195000  29.540001  29.117500  29.477501  27.299444  127007600
2017-01-09 00:00:00-05:00  29.487499  29.857500  29.485001  29.747499  27.549498  134247600
apple.tail()
                                Open       High        Low      Close  Adj Close     Volume
Date
2017-04-24 00:00:00-04:00  35.875000  35.987499  35.794998  35.910000  33.400845   68537200
2017-04-25 00:00:00-04:00  35.977501  36.224998  35.967499  36.132500  33.607796   75486000
2017-04-26 00:00:00-04:00  36.117500  36.150002  35.845001  35.919998  33.410152   80164800
2017-04-27 00:00:00-04:00  35.980000  36.040001  35.827499  35.947498  33.435722   56985200
2017-04-28 00:00:00-04:00  36.022499  36.075001  35.817501  35.912498  33.403168   83441600
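Once the prices are in a Pandas object, returns are one method call away. Since downloading requires an internet connection, the sketch below hard-codes the first five adjusted close prices shown above instead of calling yfinance:

```python
import numpy as np
import pandas as pd

# first five "Adj Close" values of AAPL copied from the output of apple.head() above
adj_close = pd.Series(
    [26.891964, 26.861866, 26.998465, 27.299444, 27.549498],
    index=pd.to_datetime(["2017-01-03", "2017-01-04", "2017-01-05",
                          "2017-01-06", "2017-01-09"]),
    name="Adj Close",
)

simple_returns = adj_close.pct_change().dropna()   # p_t / p_{t-1} - 1
log_returns = np.log(adj_close).diff().dropna()    # log(p_t) - log(p_{t-1})
print(simple_returns)
```

The two kinds of returns are related by log_returns = log(1 + simple_returns), so for small daily moves they are nearly identical.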

We can also plot selected columns of a dataframe using the plot() method:

print(apple[['High', 'Low', 'Open', 'Close', 'Adj Close']].plot())
AxesSubplot(0.125,0.2;0.775x0.68)

Package empyrical

The empyrical package is an open source Python package developed by Quantopian Inc. It's widely used by practitioners to compute common risk and performance measures.

from empyrical import max_drawdown, roll_max_drawdown, cum_returns, omega_ratio, sharpe_ratio

# create a synthetic array of returns
returns = np.array([.01, .02, .03, -.4, -.06, -.02])
max_drawdown(returns)  # calculate the maximum drawdown
-0.4472800000000001
roll_max_drawdown(returns, window=3) # calculate the maximum drawdown in a rolling window fashion
array([ 0.     , -0.4    , -0.436  , -0.44728])
cum_returns(returns) # calculate the cumulative returns
array([ 0.01      ,  0.0302    ,  0.061106  , -0.3633364 , -0.40153622,
       -0.41350549])
omega_ratio(returns) # calculate the Omega ratio
0.12499999999999999
sharpe_ratio(returns) # calculate the Sharpe ratio
-6.7377339531573535
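To see what some of these measures compute, the cumulative returns and the maximum drawdown of the same synthetic returns can be reproduced with plain NumPy. This is a sketch of the standard definitions, not the package's actual implementation:

```python
import numpy as np

returns = np.array([.01, .02, .03, -.4, -.06, -.02])

# wealth of 1 unit invested at the start, compounding the returns
wealth = np.cumprod(1 + returns)
cumulative_returns = wealth - 1

# running peak of the wealth (including the initial wealth of 1)
running_peak = np.maximum.accumulate(np.concatenate(([1.0], wealth)))[1:]

# drawdown: relative drop from the running peak; the maximum drawdown is its worst value
drawdown = wealth / running_peak - 1
mdd = drawdown.min()
print(cumulative_returns)
print(mdd)   # approximately -0.44728, in line with max_drawdown(returns) above
```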

Package riskparityportfolio

riskparityportfolio is a package to design risk parity portfolios, an approach to portfolio design that controls how much risk each asset contributes to the portfolio.

import riskparityportfolio as rpp  # pip install riskparityportfolio (dependencies: numpy, jax, quadprog, pybind, and tqdm)
import numpy as np
import matplotlib.pyplot as plt

cov_matrix = np.vstack((np.array((1.0000, 0.0015, -0.0119)),
                        np.array((0.0015, 1.0000, -0.0308)),
                        np.array((-0.0119, -0.0308, 1.0000))))
risk_budget_vector = np.array((0.1594, 0.0126, 0.8280))
w = rpp.vanilla.design(cov_matrix, risk_budget_vector)
plt.bar(["stock " + item for item in ["A", "B", "C"]], w)
plt.xlabel("")
plt.ylabel("portfolio weight")
plt.show()
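A risk parity (or risk budgeting) design can be checked by computing the relative risk contribution of each asset, $w_i(\Sigma w)_i / (w^T\Sigma w)$, and comparing it with the budget. The helper below is a hypothetical function written for this sketch with NumPy only (it is not part of riskparityportfolio); the sanity check uses an identity covariance matrix, for which equal weights give equal risk contributions.

```python
import numpy as np


def relative_risk_contributions(w, cov):
    """Relative risk contribution of each asset: w_i * (cov @ w)_i / (w' cov w)."""
    w = np.asarray(w, dtype=float)
    cov = np.asarray(cov, dtype=float)
    marginal = cov @ w
    return w * marginal / (w @ marginal)


# sanity check: uncorrelated assets with equal variance and equal weights
w_equal = np.ones(3) / 3
rrc = relative_risk_contributions(w_equal, np.eye(3))
print(rrc)   # each asset contributes one third of the risk
```

Applied to the weights w designed above, relative_risk_contributions(w, cov_matrix) would be expected to come out close to risk_budget_vector.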

Python Scripts and Jupyter Notebooks

Python Scripts

One simple way to use Python is by typing the commands in the IPython terminal one by one. However, this quickly becomes impractical and it is necessary to write scripts. You can use your favorite text editor to create Python scripts.
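A common layout for a script is sketched below: the code lives in functions and the entry point is guarded by if __name__ == "__main__", so that the file can be both run from a terminal (python my_script.py) and imported from other scripts. The file name my_script.py and the function names are hypothetical.

```python
# contents of a hypothetical file my_script.py


def mean_return(returns):
    """Arithmetic mean of a list of returns."""
    return sum(returns) / len(returns)


def main():
    returns = [0.01, -0.02, 0.03]   # made-up data
    print("mean return:", mean_return(returns))


if __name__ == "__main__":
    main()   # runs only when the file is executed, not when it is imported
```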

Jupyter Notebook/Lab

Jupyter notebooks enable you to write report-like documents containing code, documentation, mathematical equations, figures, and so on. This document is an example.

To explore further

Check out Awesome Quant for a list of curated packages relevant to financial applications.