Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorized implementations of 3/4 partner selection approaches #6

Open
wants to merge 21 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,8 @@ __pycache__/

# C extensions
*.so

.idea
.empty
# Distribution / packaging
.Python
build/
Expand Down Expand Up @@ -127,3 +128,4 @@ dmypy.json

# Pyre type checker
.pyre/
.vscode/
13 changes: 13 additions & 0 deletions FranzKrekeler/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
The vinecopulaslab is a basic implementation of the paper [Statistical Arbitrage with Vine Copulas [Stübinger, Mangold, Krauss (2016)]](https://www.econstor.eu/bitstream/10419/147450/1/870932616.pdf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool name: vinecopulaslab


Its purpose is to find correlated stock partners for statistical arbitrage

For installation currently you can only use git clone

Afterwards you can access the tutorial ipynb

The requirements are mainly numpy, pandas.
The work was done in effort for the application of the march apprenticeship of Hudson & Thames


![uml diagram](UML_data.png)
Binary file added FranzKrekeler/UML_data.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
608 changes: 608 additions & 0 deletions FranzKrekeler/submission_and_tutorial.ipynb

Large diffs are not rendered by default.

Empty file.
6 changes: 6 additions & 0 deletions FranzKrekeler/tests/unit/datafetchertest.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
from unittest import TestCase


class TestDownload(TestCase):
def test_download_sample(self):
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least the thought is there :)

18 changes: 18 additions & 0 deletions FranzKrekeler/tests/unit/geometrictest.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
import sys
sys.path.append("./../../")
import unittest
from unittest import TestCase
import numpy as np
from vinecopulaslab.partnerselection.geometric import GeometricSelection

class TestDownload(TestCase):
def test_distance_to_diagonal(self):
# test template
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Start with a capital letter. (Our convention)

# TODO: add tests
line = np.array([1, 1, 1])
pts = np.array([[0, 0, 0]])
self.assertEqual(GeometricSelection.distance_to_line(line, pts), 0, "Should be 0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted



if __name__ == '__main__':
unittest.main()
Binary file added FranzKrekeler/tmp/sp500_2017-01-01_2018-01-01.p
Binary file not shown.
3 changes: 3 additions & 0 deletions FranzKrekeler/vinecopulaslab/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
from vinecopulaslab.partnerselection import TraditionalSelection, ExtendedSelection, GeometricSelection, \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to see that you put the framework in place to build your own package.

ExtremalSelection
from vinecopulaslab.universe.universe import UniverseDownloader
8 changes: 8 additions & 0 deletions FranzKrekeler/vinecopulaslab/partnerselection/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
"""
Classes from the partner selection module
"""

from vinecopulaslab.partnerselection import TraditionalSelection, ExtendedSelection, GeometricSelection, \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

ExtremalSelection

# TODO: Add PartnerSelection wrapper class
82 changes: 82 additions & 0 deletions FranzKrekeler/vinecopulaslab/partnerselection/base.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
# Author: Franz Krekeler 2021
from typing import List
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good import order

import itertools
import numpy as np
import pandas as pd


class SelectionBase(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this was an abstract base class, wouldn't that be better?

"""The base class for the partner selection framework.
"""

def __init__(self):
"""Initialization
"""
self.corr_returns_top_n = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be a good idea to list this attribute in the docstring.


@staticmethod
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad to see the use of decorators. Many people missed this.

def calculate_returns(prices: pd.DataFrame) -> pd.DataFrame:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good for hard typings.

"""Calculate percentage based daily returns
:param: prices (pd.DataFrame): The columns must the closing prices for the stocks
:return: returns (pd.DataFrame)
"""
return prices.pct_change(fill_method='ffill').dropna(how='all')

@staticmethod
def _ranked_correlation(returns: pd.DataFrame) -> pd.DataFrame:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hidden and public methods, nice.

"""Given a df of returns calculated it's Spearman correlation matrix
:param: returns (pd.DataFrame): The input needs to be in percentage based returns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for matching our style of docstrings.

:return: returns_correlation (pd.DataFrame)
"""
return returns.corr("spearman")

@staticmethod
def _rankings_pct(returns: pd.DataFrame):
"""Calculate the rank of a given dataframe and then convert it to percentage based
:param: returns (pd.DataFrame)
:return: returns_ranked_percentile (pd.DataFrame)
"""
return returns.rank(pct=True)

@staticmethod
def _top_n_correlations(corr_returns: pd.DataFrame, top_n: int = 50) -> pd.DataFrame:
"""For correlation matrix return the top n correlations (default 50)
:param: corr_returns (pd.DataFrame): correlation matrix
:return: corr_returns_top_n pd.DataFrame shape is (n,n)
"""
# Filter self correlated and self correlated stocks
corr_returns_unstacked = corr_returns[corr_returns < 1].unstack().sort_values(ascending=False)
corr_returns_unstacked = corr_returns_unstacked.reset_index().dropna()
corr_returns_unstacked.columns = "TARGET_STOCK", "STOCK_PAIR", "CORRELATION"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you opt for all capital letters?

# Note pandas was chosen here, but many ways lead to rome
corr_returns_top_n = corr_returns_unstacked.groupby("TARGET_STOCK").head(top_n)
return corr_returns_top_n.sort_values(['TARGET_STOCK', 'CORRELATION'])

@staticmethod
def _prepare_combinations_of_partners(stock_selection: List[str]) -> pd.DataFrame:
"""Helper function to calculate all combinations for a target stock and it's potential partners
:param: stock_selection (pd.DataFrame): the target stock has to be the first element of the array
:return: the possible combinations for the quadruples.Shape (19600,4) or
if the target stock is left out (19600,3)
"""
# We will convert the stock names into integers and then get a list of all combinations with a length of 3
num_of_stocks = len(stock_selection)
# We turn our partner stocks into numerical indices so we can use them directly for indexing
partner_stocks_idx = np.arange(1, num_of_stocks) # basically exclude the target stock
partner_stocks_idx_combs = itertools.combinations(partner_stocks_idx, 3)
return np.array(list((0,) + comb for comb in partner_stocks_idx_combs))

def _find_partners(self, target_stocks: List[str] = []):
"""
Helper functions where we apply the approach to each stock. Optional a subset of target stocks can be chosen.
:param: return_target_stock (List[str]): the subset of target stocks to analyze (default [])
:return: (pd.DataFrame)
"""
assert self.corr_returns_top_n is not None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code is a bit clumped together. Using whitespace would improve readability.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice assertion

corr_returns_top_n = self.corr_returns_top_n.copy()
if len(target_stocks):
sublist = corr_returns_top_n.TARGET_STOCK.isin(target_stocks)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm what is: TARGET_STOCK?

An ENV VAR? Why all caps?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you are ref the column name in the DF. Hmm ok.

corr_returns_top_n = corr_returns_top_n[sublist]
target_stocks_partners_quadruples = corr_returns_top_n.groupby('TARGET_STOCK').apply(
self._partner_selection_approach)
return target_stocks_partners_quadruples
82 changes: 82 additions & 0 deletions FranzKrekeler/vinecopulaslab/partnerselection/extended.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
from typing import List
import numpy as np
import pandas as pd
import scipy.special
from statsmodels.distributions.empirical_distribution import ECDF
from vinecopulaslab.partnerselection.base import SelectionBase


class ExtendedSelection(SelectionBase):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good use of class inheritance. I was waiting for someone to use this pattern.

"""
This class implements the extended approach for partner selection. Mentioned section 3.1
of the paper "Statistical arbitrage with vine copulas"
https://www.econstor.eu/bitstream/10419/147450/1/870932616.pdf
It is an extension to the spearman correlation
"""

def __init__(self):
"""Initialization
"""
super().__init__()
self.corr_returns_top_n = None

def _partner_selection_approach(self, group) -> List[str]:
"""
Find the partners stocks for the groupby group of the data df.groupby("TARGET_STOCK").apply(...)
:param: group (pd.group) The group of n most correlated stocks
:return: (List[str]) returns a list of highest correlated quadruple
"""
target_stock = group.name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code is very grouped together. Readability is suffering.

partner_stocks = group.STOCK_PAIR.tolist()
stock_selection = [target_stock] + partner_stocks
# We create a subset of our ecdf dataframe to increase lookup speed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good use of inline comments.

data_subset = self.ecdf_df[stock_selection].copy()
# We turn our partner stocks into numerical indices so we can use them directly for indexing
quadruples_combinations = self._prepare_combinations_of_partners(stock_selection)
# We can now use our list of possible quadruples as an index
quadruples_combinations_data = data_subset.values[:, quadruples_combinations]
# Now we can get closer to a vectorized calculation
# n is equal to the total number of returns d to the number of stocks
# we use lodash because we don't need the 19600 dimension
n, _, d = quadruples_combinations_data.shape
# We split up the given formula
# For reference:
# https://github.com/hudson-and-thames/march_applications_21/blob/main/Guide%20for%20the%20Extended%20Approach.pdf
hd = (d + 1) / (2 ** d - d - 1)
ecdf_df_product_1 = np.product(1-quadruples_combinations_data, axis=-1)
ecdf_df_product_2 = np.product(quadruples_combinations_data, axis=-1)
est1 = hd * (-1 + (2 ** d / n) * ecdf_df_product_1.sum(axis=0))
est2 = hd * (-1 + (2 ** d / n) * ecdf_df_product_2.sum(axis=0))
# here we create the index as we will use it on specific dimensions
idx = np.array([(k, l) for l in range(0, d) for k in range(0, l)])
est3 = -3 + (12 / (n * scipy.special.comb(n, 2, exact=True))) * (
(1 - quadruples_combinations_data[:, :, idx[:, 0]]) * (
1 - quadruples_combinations_data[:, :, idx[:, 1]])).sum(axis=(1, 2))
quadruples_scores = (est1 + est2 + est3) / 3
# The quadruple scores have the shape of (19600,1) now
max_index = np.argmax(quadruples_scores)
return data_subset.columns[list(quadruples_combinations[max_index])].tolist()

def _preprocess(self, close: pd.DataFrame) -> pd.DataFrame:
"""
Helper function for preparing the data. Here we already prepare the ecdf
:param close: (pd.DataFrame) the closing prices
"""
close.sort_index(axis=1, inplace=True)
self.close_returns = self.calculate_returns(close)
self.ranked_correlation = self._ranked_correlation(self.close_returns)
self.corr_returns_top_n = self._top_n_correlations(self.ranked_correlation)
self.ecdf_df = self.close_returns.apply(lambda x: ECDF(x)(x), axis=0)

def find_partners(self, close: pd.DataFrame, target_stocks: List[str] = []):
"""
Find partners based on an extension of the Spearmann correlation. Mentioned in section 3.1
of the paper "Statistical arbitrage with vine copulas"
https://www.econstor.eu/bitstream/10419/147450/1/870932616.pdf
:param: close (pd.DataFrame) The close prices of the SP500
:param: target_stocks (List[str]) A list of target stocks to analyze
:return: (List[str]) returns a list of highest correlated quadruple
"""
self._preprocess(close)
# find_partners could be moved to the base class but then it wouldn't have the right docstring... looking for best practice
return self._find_partners(target_stocks)
99 changes: 99 additions & 0 deletions FranzKrekeler/vinecopulaslab/partnerselection/extremal.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,99 @@
from typing import List
import itertools
import logging
import pandas as pd
import numpy as np
import scipy.special
import scipy.linalg
from vinecopulaslab.partnerselection.base import SelectionBase


class ExtremalSelection(SelectionBase):
"""
Class for partner selection based on "A multivariate linear rank test of independence
based on a multiparametric copula with cubic sections"
Mangold 2015
"""

def __init__(self):
"""Initialization
"""
super().__init__()
self.corr_returns_top_n = None

def _partner_selection_approach(self, group):
"""
Approach function Partner selection based on "A multivariate linear rank test of independence based on
a multiparametric copula with cubic sections"
for df.groupby("TARGET_STOCK").apply(...)
This has only been implemented for performance testing.
References:
https://www.researchgate.net/publication/309408947_A_multivariate_linear_rank_test_of_independence_based_on_a_multiparametric_copula_with_cubic_sections
https://pypi.org/project/Independence-test/
:param group: (group) The group of 50 most correlated stocks
:return: (List[str]) returns a list of highest correlated quadruple
"""
logging.warning("Extremal approach is still under construction")
target_stock = group.name
partner_stocks = group.STOCK_PAIR.tolist()
stock_selection = [target_stock] + partner_stocks
# We create a subset of our ranked returns dataframe to increase lookup speed.
data_subset = self.ranked_returns[stock_selection].copy()
# We turn our partner stocks into numerical indices so we can use them directly for indexing
quadruples_combinations = self._prepare_combinations_of_partners(stock_selection)
# We can now use our list of possible quadruples as an index
quadruples_combinations_data = data_subset.values[:, quadruples_combinations]
# Now we can get closer to a vectorized calculation
# n is equal to the total number of returns d to the number of stocks
# we use lodash because we don't need the 19600 dimension
n, _, d = quadruples_combinations_data.shape
# Here the math from the Mangold 2015 paper begins
permut_mat = np.array(list(itertools.product([-1, 1], repeat=d)), dtype=np.int8)
sub_mat = permut_mat @ permut_mat.T
F = (d + sub_mat) / 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of variables that are all CAPS unless they are env variables. Maybe you tried to match the notation in the paper? If not then perhaps a more descriptive name would help.

D = (d - sub_mat) / 2
cov_mat = ((2 / 15) ** F) * ((1 / 30) ** D)
cov_mat_inv = scipy.linalg.inv(cov_mat)
rank_df_norm = quadruples_combinations_data / (n + 1)
pos_rank_df = (rank_df_norm - 1) * (3 * rank_df_norm - 1)
neg_rank_df = rank_df_norm * (2 - 3 * rank_df_norm)
# performance here is still lagging
# Proposition 3.3. from the paper
pos_neg_combined = np.add(np.einsum('ijk,lmk->jmik', pos_rank_df, np.expand_dims(permut_mat > 0, axis=0)),
np.einsum('ijk,lmk->jmik', neg_rank_df, np.expand_dims(permut_mat < 0, axis=0)))
TNP = pos_neg_combined.prod(axis=-1).mean(-1)
# Incomplete: Still not documented
# performance here is also not optimal here
T = ((np.expand_dims(TNP, axis=1) @ np.expand_dims(cov_mat_inv, axis=0)) @ TNP.T)
T_results = np.diag(T[:, 0, :]) * n
max_index = np.argmax(T_results)
partners = data_subset.columns[list(quadruples_combinations[max_index])].tolist()
# Please take this with a grain of salt, I was too obsessed with a proof of concept.
return partners

def _preprocess(self, close: pd.DataFrame) -> pd.DataFrame:
"""
Helper function for preparing the data.
:param close: (pd.DataFrame) the closing prices
"""
close.sort_index(axis=1, inplace=True)
self.close_returns = self.calculate_returns(close)
self.ranked_correlation = self._ranked_correlation(self.close_returns)
self.corr_returns_top_n = self._top_n_correlations(self.ranked_correlation)
self.ranked_returns = self.close_returns.rank()

def find_partners(self, close: pd.DataFrame, target_stocks: List[str] = []):
"""
Find partners based on the extremal approach mentioned in section 3.1
of the paper "Statistical arbitrage with vine copulas"
https://www.econstor.eu/bitstream/10419/147450/1/870932616.pdf
Based on the paper Class for partner selection based on "A multivariate linear rank test of independence
based on a multiparametric copula with cubic sections" Mangold 2015
:param: close (pd.DataFrame) The close prices of the SP500
:param: target_stocks (List[str]) A list of target stocks to analyze
:return: (List[str]) returns a list of highest correlated quadruple
"""
self._preprocess(close)
# find_partners could be moved to the base class but then it wouldn't have the right docstring...
# looking for best practice
return self._find_partners(target_stocks)
Loading