Primer on Python for Finance

Daniel P. Palomar (2025). Portfolio Optimization: Theory and Application. Cambridge University Press.

Last update: February 06, 2025

Introduction¶

What is Python?¶

Python is a general-purpose, high-level, interpreted language that has become one of the de facto tools for data science, data analysis, machine learning, and finance. For instance, Python has topped the IEEE Spectrum ranking of top programming languages for many years in a row.
Python was created by Guido van Rossum and first released in 1991. Guido served as Python's Benevolent Dictator for Life (BDFL) until 2018 and retired from Dropbox, where he was a Principal Engineer, in 2019. Python's main philosophy focuses on code readability, i.e., it provides a simple, easy-to-follow grammar, which often translates into rapid code development.
Python is an actively developed open source language. The Python Software Foundation (PSF) holds the intellectual property rights and protects the trademarks associated with Python. Agencies, foundations, private companies, and non-profit organizations support the development of Python and many of its open source libraries. To name a few, the Moore Foundation, NumFOCUS, Microsoft, and J.P. Morgan have funded the development of Python for many years.
Python is distributed by a variety of sources, of which we recommend the Anaconda distribution, primarily because of its straightforward installation procedure.
Useful Python links:
- Anaconda
- Searching packages: PyPI
- Python documentation
- JupyterLab is the de facto environment for data analysis
- Learning: LearnPython, Python for Beginners, Learn Python 3, The Hitchhiker's Guide to Python, NumPy basics by Andrej Karpathy, Matplotlib basics, Seaborn basics, Pandas Crash Course by DataCamp
- Stack Overflow
- Python homepage
- Other resources: Book: Python for Finance, mastering data-driven finance

Python vs R¶

Let's not even get started :) Both are great! Learn both! You'll have to use whatever your boss/advisor/team needs. But don't forget to know a bit of C++ too :)

Installation¶

First, install a Python distribution, for example, from Anaconda (in macOS and Linux you can also use brew install python).

Then, install your favorite code editor or IDE. Some examples are:

- JupyterLab, which is the de facto IDE. It can be installed from a terminal window with pip install jupyterlab (in macOS and Linux you can also use brew install jupyterlab), and there is also a desktop version of JupyterLab.
- PyCharm
- VS Code
- Spyder

To get started coding, start your code editor or IDE. For example, with JupyterLab, either launch the app or type jupyter lab in a terminal (or cmd on Windows), and a browser window will pop up. A Jupyter notebook is an environment where you can write code and interactively evaluate its output. This feature is very convenient for exploratory analysis. Now you are ready to start using Python from within JupyterLab.

Packages¶

To see the installed packages and their versions (conda list also shows the Python version), type !pip list or !conda list in a Jupyter notebook cell and press SHIFT+ENTER. Alternatively, type pip list or conda list in a terminal window and press ENTER.

To see the version of a specific package use import package_name; print(package_name.__version__).
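
Not every package exposes a __version__ attribute. As a complementary sketch, the standard-library module importlib.metadata can look up the installed version by distribution name. The helper name installed_version and the package names in the loop are assumptions made for illustration only:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


# The last name is deliberately bogus to show the "not installed" case
for name in ["numpy", "pandas", "not-a-real-package-xyz"]:
    print(name, "->", installed_version(name))
```

The lookup does not import the package, so it also works for packages that are slow or impossible to import in the current session.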

As time progresses, you will have to install different packages from PyPI or Conda with the command pip install package_name or conda install package_name. Note that you can execute these commands from within a Jupyter notebook by prepending them with an exclamation mark (!).

After a package is installed, it must be imported with the command import package_name before it can be used:

# we need to import it first and then we can use it:
import numpy  # to install do: pip install numpy
x = [1, 2, 3]
y = numpy.mean(x)
y

2.0

It is common to use shortcuts for the names of the imported libraries:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 100)
y = (x - np.pi) * (x - 1.618)
plt.plot(x, y, label="2nd-degree polynomial")
plt.legend()
plt.show()

Good style¶

There are several Python style guides:

- Style Guide for Python Code: The Python Enhancement Proposal (PEP) 8 is a widely accepted document that outlines good programming practices for Python. It was created by Guido van Rossum, Barry Warsaw, and Alyssa Coghlan, and it evolves over time as new conventions are identified and old ones become obsolete. The guide emphasizes that code is read much more often than it is written and that, therefore, readability and consistency are crucial. It provides guidelines on various aspects of coding in Python, including naming conventions, indentation, and use of whitespace, among others.
- Google style guide: Google also has its own style guide for Python, which includes a list of dos and don'ts for Python programs. It emphasizes the importance of using descriptive names for public APIs, making modules importable, and using the right style for module, function, and method docstrings and for inline comments. It also recommends using tools like pylint for finding bugs and style problems in Python source code.
- The Hitchhiker's Guide to Python: This is another resource that also recommends following PEP 8. It highlights the importance of readability and provides some common Python idioms.

Remember, while these style guides provide useful guidelines, they are not absolute rules. They are intended to improve the readability and consistency of your code, but there may be instances where it makes sense to deviate from the guidelines. As PEP 8 itself says, "A Foolish Consistency is the Hobgoblin of Little Minds". When in doubt, use your best judgment and consider the readability and maintainability of your code.
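
To make the naming and layout conventions concrete, here is a small sketch written in the spirit of PEP 8. The class, method, and constant names are hypothetical and chosen only to illustrate the conventions; the returns are made-up numbers:

```python
import math

# Constants: UPPER_CASE_WITH_UNDERSCORES
TRADING_DAYS_PER_YEAR = 252


# Classes: CapWords
class ReturnSeries:
    """Container for a sequence of daily returns (hypothetical example)."""

    def __init__(self, daily_returns):
        self.daily_returns = list(daily_returns)

    # Functions, methods, and variables: lowercase_with_underscores
    def annualized_volatility(self):
        """Sample standard deviation of the returns, scaled to one year."""
        n = len(self.daily_returns)
        mean = sum(self.daily_returns) / n
        variance = sum((r - mean) ** 2 for r in self.daily_returns) / (n - 1)
        return math.sqrt(variance * TRADING_DAYS_PER_YEAR)


series = ReturnSeries([0.01, -0.02, 0.015, 0.003])
print(round(series.annualized_volatility(), 4))
```

Note the four-space indentation, the two blank lines around top-level definitions, the docstrings, and the lines kept under 79 characters.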

Variables: lists, dicts, arrays, and data frames¶

In Python, we can easily assign a value to a variable or object with = (if the variable does not exist it will be created):

x = "Hello"
x

'Hello'

We can combine several elements with lists:

y = ["Hello", "everyone"]
y

['Hello', 'everyone']

Note that the elements of a list need not have the same data type (data types are covered below):

y = [1, "hello", 2., "everyone"]
y

[1, 'hello', 2.0, 'everyone']

A dictionary, or simply "dict", is a data structure that maps keys (keywords) to values. There are many ways to create a dict; the simplest one is to use curly brackets {} as follows:

x = {'a': 1, 'b': 2, 'c': 3}
x

{'a': 1, 'b': 2, 'c': 3}

Another way is to explicitly specify the keywords and values:

x = dict(a = 1, b = 2, c = 3)
x

{'a': 1, 'b': 2, 'c': 3}

A dict can store different data types for different keywords:

x = dict(a = '1', b = 2, c = 3)
x

{'a': '1', 'b': 2, 'c': 3}

The usual way to query a value from a dict is to pass the desired keyword:

x['a']

'1'

We can also modify the contents of dicts or add new entries:

x = dict(a = 1, b = 2, c = 3)
x['a'] = 2
x

{'a': 2, 'b': 2, 'c': 3}

x['d'] = 10
x

{'a': 2, 'b': 2, 'c': 3, 'd': 10}
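
A few other dict operations come up constantly in practice. The sketch below uses only built-in dict methods; the tickers and prices are illustrative values, not real data:

```python
prices = {"AAPL": 29.04, "MSFT": 62.58}  # illustrative numbers only

# Membership test and safe lookup with a default value
print("AAPL" in prices)           # True
print(prices.get("GOOG"))         # None (no KeyError is raised)
print(prices.get("GOOG", 0.0))    # 0.0

# Iterate over keys and values together
for ticker, price in prices.items():
    print(ticker, price)

# Remove an entry and get its value back
removed = prices.pop("MSFT")
print(removed, list(prices))      # 62.58 ['AAPL']
```

Querying a missing key with prices["GOOG"] would raise a KeyError, which is why get() is often preferable when a key may be absent.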

A set in Python is an unordered collection of unique elements. The main purposes of sets are to test membership, remove duplicate elements from a sequence, and compute standard mathematical set operations.

We can create sets from lists as follows:

x = set([1, 2, 3, 1])
y = set([2, 4, 5, 3])
x.intersection(y)

{2, 3}

x.difference(y)

{1}

x.symmetric_difference(y)

{1, 4, 5}
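
The set methods above also have operator forms, and sets give a quick way to remove duplicates. A minimal sketch with made-up ticker names:

```python
tickers = ["AAPL", "MSFT", "AAPL", "GOOG", "MSFT"]

unique_tickers = set(tickers)          # duplicates are dropped
print(len(unique_tickers))             # 3
print("GOOG" in unique_tickers)        # True

x = {1, 2, 3}
y = {2, 3, 4, 5}
print(x & y == x.intersection(y))          # True
print(x | y == x.union(y))                 # True
print(x - y == x.difference(y))            # True
print(x ^ y == x.symmetric_difference(y))  # True

# Deduplicate while keeping first-seen order (sets themselves are unordered)
ordered_unique = list(dict.fromkeys(tickers))
print(ordered_unique)                  # ['AAPL', 'MSFT', 'GOOG']
```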

A useful command is ?variable, which displays information about the variable such as its type, dimensions, and contents. Note that this feature is available only in IPython and Jupyter.

Another useful feature in Python is slicing. It is especially handy for large arrays or lists, e.g., to show only the first or the last n elements:

x = np.arange(1000)
x[:10]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

x[-10:]

array([990, 991, 992, 993, 994, 995, 996, 997, 998, 999])

It is important to keep in mind that in Python almost everything is done through functions or methods of all sorts such as max(), min(), arange(), linspace(), and so on.

Data types¶

Operators in Python: the arithmetic operators +, -, *, /, and ** perform addition, subtraction, multiplication, division, and exponentiation. The comparison operators are <, <=, >, >=, ==, and !=. The Boolean operators are and, or, and not; their bitwise counterparts are &, |, and ~.

Python has a wide variety of data types including scalars (integers, floats, complex numbers), strings, lists, tuples, dictionaries (dicts), and sets, while libraries add nd-arrays (NumPy) and data frames (Pandas). Note that Python's float is a double-precision number; there is no separate double type.
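
The following short sketch exercises these operators on plain Python scalars:

```python
print(7 / 2)     # 3.5  (true division always returns a float)
print(7 // 2)    # 3    (floor division)
print(7 % 2)     # 1    (remainder)
print(2 ** 10)   # 1024 (exponentiation; ^ is bitwise XOR, not a power)

a, b = 3, 5
print(a < b and b <= 5)   # True
print(a == b or a != b)   # True
print(not a > b)          # True

# Bitwise operators act on the binary representation of integers
print(6 & 3, 6 | 3, 6 ^ 3)   # 2 7 5
```

With NumPy arrays and Pandas objects, elementwise logical conditions are combined with & and | rather than and and or.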

Scalars and strings¶

Scalars are basically floats and integers, for example:

x = 1
type(x)

int

x = 1.1
type(x)

float

Can you think about why Python gives the following answer to the expression 3.3 - (1.1 + 2.2)?

x = 3.3 - (1.1 + 2.2)
x

-4.440892098500626e-16

Try the same thing in R and MATLAB. What do you see? Is it different from what Python computed? What is the binary expansion of 0.333333...? How do computers represent anything? :)
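
The result above is a consequence of binary floating-point arithmetic: numbers such as 1.1, 2.2, and 3.3 have no finite binary expansion, so they are stored with a tiny rounding error. As a sketch using only the standard library, floats can be compared with a tolerance, and exact decimal or rational arithmetic is available when needed (the tolerance 1e-12 is an arbitrary choice for illustration):

```python
import math
from decimal import Decimal
from fractions import Fraction

x = 3.3 - (1.1 + 2.2)
print(x == 0.0)                               # False
print(math.isclose(x, 0.0, abs_tol=1e-12))    # True

# The exact binary value actually stored for the literal 1.1
print(Decimal(1.1))

# Exact decimal or rational arithmetic avoids the rounding error
exact_decimal = Decimal("3.3") - (Decimal("1.1") + Decimal("2.2"))
exact_fraction = Fraction("3.3") - (Fraction("1.1") + Fraction("2.2"))
print(exact_decimal, exact_fraction)          # 0.0 0
```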

Strings are written between single or double quotes:

x = "Hello IEDA/ELEC 3180"
x

'Hello IEDA/ELEC 3180'

Complex numbers are defined with literals of the form aj, where a is any float or integer and aj represents $a\sqrt{-1}$:

x = 1 + 2j
x

(1+2j)

x.conjugate()

(1-2j)

x.real

1.0

x.imag

2.0

abs(x)

2.23606797749979

Note that the function abs, which in the case of real scalars computes the absolute value, computes the modulus of a complex number, i.e., its length as if it were a vector in $\mathbb{R}^2$.
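
The relation between abs and the vector length can be checked directly. This sketch uses the standard-library modules cmath and math and also shows the polar form of the same number:

```python
import cmath
import math

x = 1 + 2j
modulus, phase = cmath.polar(x)   # polar form: modulus and angle in radians

print(math.isclose(modulus, abs(x)))                      # True
print(math.isclose(modulus, math.hypot(x.real, x.imag)))  # True
print(cmath.isclose(cmath.rect(modulus, phase), x))       # True
```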

Lists¶

The most basic data structure in Python is the list. It is an ordered collection of objects of any type, defined with square brackets []. For example:

x = [1, 2, 3., "hello", True]

Unlike R, MATLAB, and Julia, Python uses 0-based indexing, which means that x[0] is the first element of the list x. The length of a list can be obtained with the function len():

len(x)

5

To access the value in a given position of a list, use indexes:

x[1]  # 2nd element of the list x

2

Python allows for negative indexes to be given, e.g., x[-1] returns the last item of a list, x[-2] returns the second last item, and so on:

x[-1]

True

x[-2]

'hello'

Additionally, we can retrieve sublists of a list by using slices, e.g., x[1:3] returns a sublist containing the elements x[1] and x[2]; x[1:] returns a sublist containing all the elements to the right of (and including) x[1]; x[:3] returns a sublist containing all the elements to the left of x[3]. In general, x[a:b], for integers a,b, b > a, returns the ordered sublist from x[a] to x[b-1].
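
The slicing rules just described can be tried out on the list x defined earlier in this section (redefined here so that the snippet is self-contained). The step and reversal forms are standard Python slicing as well:

```python
x = [1, 2, 3., "hello", True]

print(x[1:3])    # [2, 3.0]
print(x[1:])     # [2, 3.0, 'hello', True]
print(x[:3])     # [1, 2, 3.0]
print(x[::2])    # [1, 3.0, True]   every second element
print(x[::-1])   # [True, 'hello', 3.0, 2, 1]   reversed copy
print(x[10:])    # []   out-of-range slices do not raise an error
```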

However, in many cases we would like to retrieve a sublist whose indexes (or elements) satisfy a particular property. A concise way to build such a list is the so-called "list comprehension". For instance, let's retrieve the sublist of elements in even positions of the original list x:

y = [x[i] for i in range(len(x)) if i % 2 == 0]
y

[1, 3.0, True]
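
List comprehensions can filter by value as well as by position, and they can transform the elements along the way. A short sketch with made-up daily returns:

```python
returns = [0.01, -0.02, 0.03, -0.004, 0.015]  # made-up daily returns

# Filter by value: keep only the losses
losses = [r for r in returns if r < 0]
print(losses)                 # [-0.02, -0.004]

# Transform every element: gross returns 1 + r
gross = [1 + r for r in returns]

# Filter by position with enumerate (same result as returns[::2])
even_positions = [r for i, r in enumerate(returns) if i % 2 == 0]
print(even_positions)         # [0.01, 0.03, 0.015]
```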

We can concatenate lists by using the "+" operator:

[1, 2, 3] + [3, 2, 1]

[1, 2, 3, 3, 2, 1]

NumPy arrays¶

Python was not designed specifically for scientific computing; however, libraries such as NumPy, started by Travis Oliphant, extend the language's data structures to deal more easily with vectors, matrices, and the mathematical operations involved.

Note that in Python, 1d numpy arrays (or simply 1d-arrays) are not column vectors or row vectors, they do not have any orientation. If one desires a column vector, then that is actually an $n\times 1$ matrix.
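
The lack of orientation of 1d-arrays is easy to verify through the shape attribute. The sketch below also shows two common ways of creating explicit column and row vectors:

```python
import numpy as np

x = np.array([1, 2, 3])
print(x.shape)        # (3,)  a 1d-array has a single axis
print(x.T.shape)      # (3,)  transposing a 1d-array changes nothing

col = x.reshape(-1, 1)    # explicit column vector: 3 x 1 matrix
row = x[np.newaxis, :]    # explicit row vector:    1 x 3 matrix
print(col.shape, row.shape)   # (3, 1) (1, 3)

print((row @ col).shape)  # (1, 1)  inner product as a matrix
print((col @ row).shape)  # (3, 3)  outer product
```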

It is also important to differentiate elementwise multiplication * from the matrix product @ (which, for two 1d-arrays, gives the inner or dot product, also available as np.dot()):

x = np.array([1, 2])
y = np.array([10, 20])
z = x.reshape((len(x), 1))
x * y

array([10, 40])

x @ y

50

z @ np.transpose(z)

array([[1, 2],
       [2, 4]])

z @ z.T

array([[1, 2],
       [2, 4]])

The outer product of two arrays can be computed with the function np.outer:

x = np.array([1, 2])
np.outer(x, x)

array([[1, 2],
       [2, 4]])

The number of elements of a 1d numpy array can be retrieved via len:

y = np.array([10, 20])
len(y)

2

Be careful when using len with arrays with more than one dimension! len always returns the "size" of the first dimension:

y = np.array([1, 2]).reshape((2, 1))
len(y)

2

len(np.transpose(y))

1

len(y.T)

1
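
To avoid this ambiguity, NumPy arrays carry attributes that describe their dimensions explicitly. A minimal sketch:

```python
import numpy as np

y = np.arange(6).reshape((2, 3))
print(len(y))      # 2       size of the first dimension only
print(y.shape)     # (2, 3)  size of every dimension
print(y.size)      # 6       total number of elements
print(y.ndim)      # 2       number of dimensions
```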

Matrices¶

A matrix is a two-dimensional collection of several variables of the same type.

We can easily create a matrix as a two-dimensional NumPy array:

# generate a 5 x 4 matrix of uniform random numbers
x = np.random.uniform(size=20).reshape((5, 4))
x

array([[0.99986311, 0.85728769, 0.8866678 , 0.74687179],
       [0.18634443, 0.93322182, 0.4948763 , 0.33255002],
       [0.47607828, 0.34158423, 0.84196406, 0.1878747 ],
       [0.76494651, 0.91290035, 0.87672029, 0.70672659],
       [0.22290809, 0.4503684 , 0.68715706, 0.26613961]])

# we can get the dimensions or number of rows/columns
np.shape(x)

(5, 4)

x.shape

(5, 4)

Identify rows, columns or elements using subscripts:

x[:,3]  # 4th column of matrix (returned as a 1D-array)

array([0.74687179, 0.33255002, 0.1878747 , 0.70672659, 0.26613961])

x[2,:]  # 3rd row of matrix (returned as a 1D-array)

array([0.47607828, 0.34158423, 0.84196406, 0.1878747 ])
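
Since indexing with a single integer drops a dimension, it is sometimes useful to index in a way that keeps the result two-dimensional. The sketch below uses a deterministic matrix (instead of the random one above) so that the results are reproducible:

```python
import numpy as np

x = np.arange(20).reshape((5, 4))   # deterministic 5 x 4 matrix

print(x[:, 3].shape)      # (5,)    integer index drops the dimension
print(x[:, [3]].shape)    # (5, 1)  list index keeps a column matrix
print(x[2:3, :].shape)    # (1, 4)  slice keeps a row matrix
print(x[1:3, :2])         # rows 1-2 and columns 0-1
print(x[x[:, 0] > 8])     # boolean mask: rows whose first entry exceeds 8
```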

Pandas data frames¶

Pandas is a data analysis library, started by Wes McKinney, whose main data structure is the DataFrame. The basic building block of a Pandas dataframe is the Series class. Basically, a dataframe is a collection of Series objects stacked as columns that share the same "index". In finance, that "index" usually corresponds to time (seconds, minutes, hours, days, weeks, months, etc.).

A Pandas data frame is more general than a numpy nd-array in the sense that we can attach labels to the columns and the columns can have different data types:

import pandas as pd
df = pd.DataFrame(
    {
        'float': [1., 2., 3.],
        'int': [1, 2, 3],
        'datetime': [pd.Timestamp('20180310'), pd.Timestamp('20190310'), pd.Timestamp('20200310')],
        'string': ['foo', 'bar', 'buzz']
    }
)
print(df)

   float  int   datetime string
0    1.0    1 2018-03-10    foo
1    2.0    2 2019-03-10    bar
2    3.0    3 2020-03-10   buzz

Let's check the type of one of the variables:

type(df['float'])

pandas.core.series.Series

There are a variety of ways to retrieve the elements of a data frame:

df['float']

0    1.0
1    2.0
2    3.0
Name: float, dtype: float64

df['float'][0]

1.0

df['datetime']

0   2018-03-10
1   2019-03-10
2   2020-03-10
Name: datetime, dtype: datetime64[ns]

df['datetime'][2]

Timestamp('2020-03-10 00:00:00')

df.at[1, 'float']

2.0

df.loc[1]

float                       2.0
int                           2
datetime    2019-03-10 00:00:00
string                      bar
Name: 1, dtype: object

df.loc[:, 'float']

0    1.0
1    2.0
2    3.0
Name: float, dtype: float64

df.loc[1, 'float']

2.0
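
In the data frame above the index labels are the integers 0, 1, 2, so labels and positions coincide. The sketch below uses a small made-up data frame with string labels to show the difference between label-based access (loc) and position-based access (iloc):

```python
import pandas as pd

df_demo = pd.DataFrame({"price": [10.0, 11.0, 12.0]}, index=["a", "b", "c"])

print(df_demo.loc["b", "price"])         # label-based:    11.0
print(df_demo.iloc[1, 0])                # position-based: 11.0
print(df_demo.loc["a":"b"])              # label slices include the end label
print(df_demo.iloc[0:1])                 # positional slices exclude the end
print(df_demo[df_demo["price"] > 10.5])  # boolean filtering of rows
```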

We can also set values provided that they can be cast to the type of the column:

df.at[1, 'float'] = 10
print(df)

   float  int   datetime string
0    1.0    1 2018-03-10    foo
1   10.0    2 2019-03-10    bar
2    3.0    3 2020-03-10   buzz

The variable (column) names can be retrieved via the .columns attribute:

df.columns

Index(['float', 'int', 'datetime', 'string'], dtype='object')

A few other useful methods to inspect dataframes are head() and tail() that show the first and last few rows (observations) in a dataframe:

print(df.head(n=2))

   float  int   datetime string
0    1.0    1 2018-03-10    foo
1   10.0    2 2019-03-10    bar

print(df.tail(n=2))

   float  int   datetime string
1   10.0    2 2019-03-10    bar
2    3.0    3 2020-03-10   buzz

In finance, data often comes with missing values, usually labeled as "NaN" (not a number) or "NaT" (not a time, for time values). Let's check out a basic example:

import pandas as pd
import numpy as np

df = pd.DataFrame({"name": ['Superman', 'Batman', 'Spiderman'],
                   "toy": [np.nan, 'Batmobile', 'Spiderman toy'],
                   "born": [pd.NaT, pd.Timestamp("1956-06-26"), pd.NaT]})
print(df)

        name            toy       born
0   Superman            NaN        NaT
1     Batman      Batmobile 1956-06-26
2  Spiderman  Spiderman toy        NaT

Now, in case we simply would like to remove the rows (observations) where at least one element is NaN or NaT, we use df.dropna():

print(df.dropna())

     name        toy       born
1  Batman  Batmobile 1956-06-26

In case we would like to remove the columns (variables) where at least one element is missing, we make use of the argument axis:

print(df.dropna(axis='columns'))

        name
0   Superman
1     Batman
2  Spiderman

Many other options are available to deal with NaNs and NaTs, such as specifying which columns to look for missing values:

print(df.dropna(subset=['name', 'born']))

     name        toy       born
1  Batman  Batmobile 1956-06-26

Note that these operations do not happen in place, i.e., the original dataframe is kept intact. In case we would like to modify the dataframe in place, we use the argument inplace=True:

print(df)

        name            toy       born
0   Superman            NaN        NaT
1     Batman      Batmobile 1956-06-26
2  Spiderman  Spiderman toy        NaT

df.dropna(inplace=True)
print(df)

     name        toy       born
1  Batman  Batmobile 1956-06-26
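
Dropping observations is not always desirable with price data. As a sketch of two common alternatives, missing values can be counted and then filled, either by carrying the last observation forward or by substituting a constant. The prices below are made-up numbers, and which strategy is appropriate depends on the application:

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame(
    {"A": [10.0, np.nan, 10.4, 10.3], "B": [20.0, 20.1, np.nan, np.nan]},
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

print(prices.isna().sum())       # number of missing values per column
filled = prices.ffill()          # carry the last observation forward
constant = prices.fillna(0.0)    # or replace missing values by a constant
print(filled)
```

Both ffill() and fillna() return new data frames and leave prices untouched.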

Data frames in Python are very powerful and versatile. They are commonly used in machine learning, where each row is one observation and each column is one variable (each variable can be of a different type). For financial applications, we mainly deal with multivariate time series, which can be seen as a matrix or data frame, but with some particularities: each row is an observation in a specific order (properly indexed with dates or times) and all the columns are of the same type (float).
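
As an illustration of such a multivariate time series, the sketch below builds a small date-indexed data frame and computes linear and log returns. The asset names and prices are made up for the example:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-01-01", periods=5, freq="B")  # business days
prices_ts = pd.DataFrame(
    {"STOCK_A": [100.0, 101.0, 99.0, 102.0, 103.0],
     "STOCK_B": [50.0, 50.5, 51.0, 50.0, 52.0]},
    index=dates,
)

linear_returns = (prices_ts / prices_ts.shift(1) - 1).dropna()  # p_t/p_{t-1} - 1
log_returns = np.log(prices_ts).diff().dropna()         # log(p_t) - log(p_{t-1})

print(prices_ts.dtypes)                    # every column is float64
print(linear_returns.round(4))
window = prices_ts.loc["2020-01-02":"2020-01-06"]  # select a date range
print(window)
```

The date index is what allows selecting observations by period, as in the last lines, which would not be possible with a plain matrix.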

Plotting¶

We will make full use of Matplotlib, Seaborn, and Plotly for all our plots :) See the example below for how to plot data stored in a Pandas dataframe with seaborn.

Let's examine this code snippet from Matplotlib's documentation page:

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cbook as cbook

# Load a numpy record array from yahoo csv data with fields date, open, close,
# volume, adj_close from the mpl-data/example directory. The record array
# stores the date as an np.datetime64 with a day unit ('D') in the date column.
data_file = cbook.get_sample_data('goog.npz', asfileobj=False)
price_data = np.load(data_file)['price_data']
price_data = price_data[-250:]  # get the most recent 250 trading days
type(price_data)
delta1 = np.diff(price_data['adj_close']) / price_data['adj_close'][:-1]

# Plot
volume = (15 * price_data['volume'][:-2] / price_data['volume'][0])**2  # for size
close = 0.003 * price_data['close'][:-2] / 0.003 * price_data['open'][:-2]  # for color
fig, ax = plt.subplots()
ax.scatter(delta1[:-1], delta1[1:], c=close, s=volume, alpha=0.5)
ax.set_xlabel(r'$\Delta_i$', fontsize=15)
ax.set_ylabel(r'$\Delta_{i+1}$', fontsize=15)
ax.set_title('Volume and percent change')
ax.grid(True)
fig.tight_layout()
plt.show()

Matplotlib basically deals with numpy nd-arrays and their subclasses, like the np.recarray used in the previous example. Seaborn, on the other hand, can deal with Pandas DataFrames too. Let's see a basic example from Seaborn's documentation page:

import seaborn as sns
sns.set(style="darkgrid")
# Load an example dataset with long-form data
fmri = sns.load_dataset("fmri")
print(type(fmri))
print(fmri.head())
# Plot the responses for different events and regions
sns.lineplot(x="timepoint", y="signal",
             hue="region", style="event",
             data=fmri)

<class 'pandas.core.frame.DataFrame'>
  subject  timepoint event    region    signal
0     s13         18  stim  parietal -0.017552
1      s5         14  stim  parietal -0.080883
2     s12         18  stim  parietal -0.081033
3     s11         18  stim  parietal -0.046134
4     s10         18  stim  parietal -0.037970

<AxesSubplot: xlabel='timepoint', ylabel='signal'>

Key packages for finance¶

We will make use of several key Python libraries.

Package yfinance¶

The package yfinance lets us query data on financial instruments from the Yahoo! Finance platform.

Let's see how to get Apple stock price data from Yahoo! Finance:

import yfinance as yf

#apple = yf.download('AAPL', auto_adjust=False)  # auto_adjust=False is to get also the adjusted close price
apple = yf.download("AAPL", start="2017-01-01", end="2017-04-30", auto_adjust=False)

[*********************100%***********************]  1 of 1 completed

Now, let's inspect the Pandas dataframe apple via the methods head() and tail(), which show the first and last few observations of the dataframe:

apple.head()

	Open	High	Low	Close	Adj Close	Volume
Date
2017-01-03 00:00:00-05:00	28.950001	29.082500	28.690001	29.037500	26.891964	115127600
2017-01-04 00:00:00-05:00	28.962500	29.127501	28.937500	29.004999	26.861866	84472400
2017-01-05 00:00:00-05:00	28.980000	29.215000	28.952499	29.152500	26.998465	88774400
2017-01-06 00:00:00-05:00	29.195000	29.540001	29.117500	29.477501	27.299444	127007600
2017-01-09 00:00:00-05:00	29.487499	29.857500	29.485001	29.747499	27.549498	134247600

apple.tail()

	Open	High	Low	Close	Adj Close	Volume
Date
2017-04-24 00:00:00-04:00	35.875000	35.987499	35.794998	35.910000	33.400845	68537200
2017-04-25 00:00:00-04:00	35.977501	36.224998	35.967499	36.132500	33.607796	75486000
2017-04-26 00:00:00-04:00	36.117500	36.150002	35.845001	35.919998	33.410152	80164800
2017-04-27 00:00:00-04:00	35.980000	36.040001	35.827499	35.947498	33.435722	56985200
2017-04-28 00:00:00-04:00	36.022499	36.075001	35.817501	35.912498	33.403168	83441600

We can also plot selected columns of a dataframe using the plot() method:

print(apple[['High', 'Low', 'Open', 'Close', 'Adj Close']].plot())

AxesSubplot(0.125,0.2;0.775x0.68)

Package empyrical¶

The empyrical package is an open source Python package developed by Quantopian Inc. It's widely used by practitioners to compute common risk and performance measures.

from empyrical import max_drawdown, roll_max_drawdown, cum_returns, omega_ratio, sharpe_ratio

# create a synthetic array of returns
returns = np.array([.01, .02, .03, -.4, -.06, -.02])
max_drawdown(returns)  # calculate the maximum drawdown

-0.4472800000000001

roll_max_drawdown(returns, window=3) # calculate the maximum drawdown in a rolling window fashion

array([ 0.     , -0.4    , -0.436  , -0.44728])

cum_returns(returns) # calculate the cumulative returns

array([ 0.01      ,  0.0302    ,  0.061106  , -0.3633364 , -0.40153622,
       -0.41350549])

omega_ratio(returns) # calculate the Omega ratio

0.12499999999999999

sharpe_ratio(returns) # calculate the Sharpe ratio

-6.7377339531573535
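
To see what these measures compute, here is a plain NumPy sketch that appears to reproduce the numbers above. It is not empyrical's implementation; the conventions used for the Sharpe ratio (zero risk-free rate, 252 periods per year, sample standard deviation) and for the Omega ratio (zero threshold) are assumptions chosen to match the printed values:

```python
import numpy as np

returns = np.array([.01, .02, .03, -.4, -.06, -.02])

# Wealth of 1 unit invested, including the starting value
wealth = np.concatenate(([1.0], np.cumprod(1 + returns)))
cum_ret = wealth[1:] - 1                       # cumulative returns

running_peak = np.maximum.accumulate(wealth)   # highest value seen so far
mdd = np.min(wealth / running_peak - 1)        # maximum drawdown

# Assumed conventions: zero risk-free rate, 252 periods per year, sample std
sharpe = np.mean(returns) / np.std(returns, ddof=1) * np.sqrt(252)

# Omega ratio with a zero threshold: total gains over total losses
omega = returns[returns > 0].sum() / -returns[returns < 0].sum()

print(cum_ret[-1], mdd, sharpe, omega)
```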

Package riskparityportfolio¶

riskparityportfolio is a package to design risk parity portfolios, an approach to investment that controls how much risk each asset contributes to the portfolio.

import riskparityportfolio as rpp  # pip install riskparityportfolio (dependencies: numpy, jax, quadprog, pybind, and tqdm)
import numpy as np
import matplotlib.pyplot as plt

cov_matrix = np.vstack((np.array((1.0000, 0.0015, -0.0119)),
                        np.array((0.0015, 1.0000, -0.0308)),
                        np.array((-0.0119, -0.0308, 1.0000))))
risk_budget_vector = np.array((0.1594, 0.0126, 0.8280))
w = rpp.vanilla.design(cov_matrix, risk_budget_vector)
plt.bar(["stock " + item for item in ["A", "B", "C"]], w)
plt.xlabel("")
plt.ylabel("portfolio weight")
plt.show()
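
One way to sanity-check such a design is to compute the fraction of the portfolio variance contributed by each asset; for a risk parity portfolio these fractions are meant to match the risk budget vector. The sketch below is written in plain NumPy and is not part of riskparityportfolio; the function name relative_risk_contributions and the volatilities are hypothetical:

```python
import numpy as np


def relative_risk_contributions(weights, cov):
    """Fraction of the portfolio variance attributed to each asset.

    Asset i contributes w_i * (cov @ w)_i, and the contributions
    add up to the portfolio variance w' cov w.
    """
    weights = np.asarray(weights, dtype=float)
    marginal = cov @ weights
    return weights * marginal / (weights @ marginal)


# Illustrative check with uncorrelated assets (made-up volatilities):
# weights proportional to 1 / volatility equalize the risk contributions.
vols = np.array([0.10, 0.20, 0.40])
cov_diag = np.diag(vols ** 2)
w_inv_vol = (1 / vols) / np.sum(1 / vols)

rrc = relative_risk_contributions(w_inv_vol, cov_diag)
print(w_inv_vol)
print(rrc)   # approximately [1/3, 1/3, 1/3]
```

The same function can be applied to w and cov_matrix from the example above and compared against risk_budget_vector.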

Python Scripts and Jupyter Notebooks¶

Python Scripts¶

One simple way to use Python is by typing the commands in the IPython terminal one by one. However, this quickly becomes impractical, and it becomes necessary to write scripts. You can use your favourite text editor to create Python scripts.
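
A minimal sketch of what such a script might look like is shown below. The file name mean_return.py, the function names, and the returns are hypothetical:

```python
# Contents of a hypothetical file named mean_return.py


def mean_return(returns):
    """Arithmetic mean of a sequence of returns."""
    return sum(returns) / len(returns)


def main():
    returns = [0.01, -0.02, 0.03]  # made-up returns
    print(f"mean return: {mean_return(returns):.4f}")


# Run main() only when the file is executed as a script,
# not when it is imported from another module.
if __name__ == "__main__":
    main()
```

The script can be run from a terminal with python mean_return.py, while the guard at the bottom allows another script to reuse the function via from mean_return import mean_return without triggering the printout.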

Jupyter Notebook/Lab¶

Jupyter notebooks enable you to write report-like documents containing code, documentation, mathematical equations, figures, and so on. This document is an example.

To explore further¶

Check out Awesome Quant for a list of curated packages relevant to financial applications.