The R-package TdaCpdSim
https://github.com/chroetz/TdaCpdSim
provides functions to execute and evaluate simulations for the
evaluation of different change point detection (CP/CPD) methods in time
series (TS) using topological data analysis (TDA).
To install the package from github, execute following commands:
install.packages("remotes")
remotes::install_github("chroetz/TdaCpdSim")
Call user functions of the package by TdaCpdSim::user_function()
or by
calling library(TdaCpdSim)
once and then user_function()
. Call
internal functions of the package by TdaCpdSim:::internal_function()
.
The package makes use of the tidyverse
, a collection of packages for
data manipulation. So it is usually convenient to load the tidyverse
.
options(tidyverse.quiet = TRUE)
library(tidyverse)
library(TdaCpdSim)
- Sample n sets X1, …, Xn ⊂ ℝp of points in ℝp (usually p = 2) to create a time series of length n. Typically each set has the same size |Xi| = m, but in principle different sizes are allowed.
- Apply a filtration to X1, …, Xn to obtain persistence diagrams P1, …, Pn.
- Apply estimators for a change point in the time series.
- Repeat 1 to 3 several times.
- Evaluate the relative mean absolute error of the estimators.
To run a simulation, call the function TdaCpdSim::simulate_ts()
.
# samplers <- ...
# estimators <- ...
# required_dists <- ...
results <- simulate_ts(samplers, estimators, required_dists)
All three arguments are tables (R-objects of class tibble
).
The rows of samplers
describe the distributions of the time series
sampled during the simulation, as well as the number of repetitions and
the filtration applied to the time series of sets of points.
The rows of required_dists
describe for which distances d of
persistence diagrams the values
d(Pi, Pj) are calculated for use in the
estimators. This is not done by individual estimators as it may be
computationally intensive and should not be calculated multiple times.
The rows of estimators
describe the estimators applied to the time
series. They take the time series of sets of points, the time series of
persistence diagrams, and the distance matrices as arguments and
ultimately produce an element in [1, n] as estimated value of the
change point.
Here is an example of a table of samplers.
# s_name must be unique
samplers <- tribble(
~s_name, ~distri1, ~distri2,
"o-.", "circ", "dot",
"o-n", "circ", "horseshoe") %>%
mutate(m=20, n=300, n1=200, rate=0.05,
kind = "birth_death",
filt = "Rips",
reps=3)
samplers
## # A tibble: 2 x 10
## s_name distri1 distri2 m n n1 rate kind filt reps
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 o-. circ dot 20 300 200 0.05 birth_death Rips 3
## 2 o-n circ horseshoe 20 300 200 0.05 birth_death Rips 3
s_name
: a unique identifier of the sampler.n
: the length of the TS to be produced by the sampler.kind
: describes which sampling method to use. Currently"birth_death"
or"iid"
.filt
: filtration to create persistence diagram as supported by theTDA
R-package. Currently"Rips"
or"Alpha"
.reps
: number of repetitions of one experiment (sample TS, apply estimators)n1
: The TS has two parts of length n1 and n2 respectively such that n1 + n2 = n is the total length of the time series.n1
is n1. n2 will be calculated formn1
andn
. The true location of the CP is defined as n1 + 0.5.distri1
,distri2
: The distributions used to sample points in the two parts of the TS.- only
kind="birth_death"
rate
: rate for the exponential distribution of the lifetime of pointsm
: fix number of point at every time point
- only
kind="iid"
m1
,m2
: number of points for the two parts of the TS
The argument distri1
and distri2
name a distribution that is created
from the images in a subfolder named img
of the working directory
(this path can be changed via set_image_folder_path("new/path/")
, see
also get_image_folder_path()
). A distribution (on ℝ2) is
created from an image by a mixture of Gaussians on each pixel, where the
individual Gaussian is weighted by the lightness of the pixel. The
variance of each Gaussian is set to (numberOfPixels) − 1. As
we do not want to just find changes in mean or variance, these
distributions are then normalized to have zero mean and an identity
covariance matrix.
"iid"
Sample n1
times m1
points form distri1
(independently) to obtain
X1, …, Xn1. Then sample
n-n1
times m2
points form distri2
(independently) to obtain
Xn1 + 1, …, Xn. All
Xi are independent. Inside each of the two parts of the
time series, the Xi are iid. Thus, also the resulting
persistence diagrams are independent / iid.
"birth_death"
Each Xi consists of m
points which were sampled from
distri1
or distri2
. To create X1, sample m
points
from distri1
(independently) and for each point sample a lifetime from
an exponential distribution (independently) with rate rate
. Given
Xi with lifetimes associated to each point, create
Xi + 1 as follows: Reduce the lifetimes of each point
by 1. Delete all points which now have a negative lifetime. Sample new
points to again have m
points in total. Reset the lifetimes of the new
points to newly sampled values from the exponential distribution with
rate rate
. For i ≤ n1 + 0.5 − 0.5 * (1/rate) distri1
is used. If i is larger, distri2
is used. This should make the “mean
change point” n1 + 0.5.
# e_name must be unique
estimators <- tribble(
~e_name, ~prepro, ~select, ~trim, ~postpro, ~detector,
"IYG_PC1_D0", "IYG_D0", "1", 0, "cusum", "argmax",
"IYG_PC3_D0", "IYG_D0", "1:3", 0, "cusum", "argmax",
"IYG_PC1_D1", "IYG_D1", "1", 0, "cusum", "argmax",
"IYG_PC3_D1", "IYG_D1", "1:3", 0, "cusum", "argmax",
"LT_D0", "LT_D0", "", 0, "cusum", "argmax",
"LT_D1", "LT_D1", "", 0, "cusum", "argmax",
"FR_M_D0", "FR_RM_D0", "1", 20, "id", "argmax",
"FR_MV_D0", "FR_RM_D0", "2", 20, "id", "argmax",
"FR_V_D0", "FR_RM_D0", "3", 20, "id", "argmax",
"FR_VV_D0", "FR_VV_D0", "", 20, "id", "argmax",
"FR_M_D1", "FR_RM_D1", "1", 20, "id", "argmax",
"FR_MV_D1", "FR_RM_D1", "2", 20, "id", "argmax",
"FR_V_D1", "FR_RM_D1", "3", 20, "id", "argmax",
"FR_VV_D1", "FR_VV_D1", "", 20, "id", "argmax")
estimators
## # A tibble: 14 x 6
## e_name prepro select trim postpro detector
## <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 IYG_PC1_D0 IYG_D0 "1" 0 cusum argmax
## 2 IYG_PC3_D0 IYG_D0 "1:3" 0 cusum argmax
## 3 IYG_PC1_D1 IYG_D1 "1" 0 cusum argmax
## 4 IYG_PC3_D1 IYG_D1 "1:3" 0 cusum argmax
## 5 LT_D0 LT_D0 "" 0 cusum argmax
## 6 LT_D1 LT_D1 "" 0 cusum argmax
## 7 FR_M_D0 FR_RM_D0 "1" 20 id argmax
## 8 FR_MV_D0 FR_RM_D0 "2" 20 id argmax
## 9 FR_V_D0 FR_RM_D0 "3" 20 id argmax
## 10 FR_VV_D0 FR_VV_D0 "" 20 id argmax
## 11 FR_M_D1 FR_RM_D1 "1" 20 id argmax
## 12 FR_MV_D1 FR_RM_D1 "2" 20 id argmax
## 13 FR_V_D1 FR_RM_D1 "3" 20 id argmax
## 14 FR_VV_D1 FR_VV_D1 "" 20 id argmax
Estimators consist of following parts:
- A unique name
e_name
. - A Pre-Processor
prepro
. - Selection
select
and Trimmingtrim
. - A Post-Processor
postpro
. - A Detector
detector
.
The Pre-Processor maps from the three lists of a) point sets, b) persistence diagrams, and c) distance matrices to a matrix ℝℓ × q which represents a TS of length ℓ (typically ℓ = n or ℓ is close to n) of points of dimension q, e.g., a q-dimensional statistic calculated form each persistence diagram. The entry in the table is a character value that refers to an implemented pre-processor, see below.
Via a selection, one can select q′ columns of the q columns of the output of the pre-processor. The entry in the table is a character value that can be evaluated as R-code. An empty string selects all q columns.
With trim
, one can trim the output of the pre-processor from both ends
of the TS. For a trim value t, the TS is reduced to length
ℓ′ = ℓ − 2t.
The Post-Processor takes the (ℓ′ × q′)-matrix
and maps it to a one-dimensional TS of length k. The trivial
post-processor id
maps
ℝℓ′ × 1 → ℝℓ′ via the
identity such that k = ℓ′. The cusum
post-processor
applies a cusum-like transformation to each column and takes the
squared Euclidean norm in each row. Here, we have
k = ℓ′ − 1.
The Detector takes the one-dimensional TS of length k and the original length n and maps them to a value in [1, n], the estimated location of the change point. This is usually done by an arg max in combination with centering, i.e., adding 0.5 * (n − k), to account for k ≠ n.
Following functions list the names of all available pre-/post-processors and detectors.
get_prepro_names()
## [1] "FR_RM_D0" "FR_RM_D1" "FR_VV_D0" "FR_VV_D1" "IYG_D0" "IYG_D1" "LT_D0"
## [8] "LT_D1"
get_postpro_names()
## [1] "id" "cusum"
get_detector_names()
## [1] "argmax"
# d_name must be unique
required_dists <- tribble(
~d_name, ~power, ~dim,
"0-Inf", Inf, 0,
"1-Inf", Inf, 1)
required_dists
## # A tibble: 2 x 3
## d_name power dim
## <chr> <dbl> <dbl>
## 1 0-Inf Inf 0
## 2 1-Inf Inf 1
This table must contain all distances that are used by one of the
estimators. The estimators may refer to a distance matrix via the name
d_name
. The available distances are Wasserstein distances of power
power
(1 to ∞ Inf
) applied to features of dimension dim
of the
persistence diagrams.
The three arguments are all tables that can be saved and loaded as
CSV-files. See, e.g., the functions readr::read_csv()
and
readr::write_csv()
.
Simulations may take hours to complete. Thus, it recommended to always
save the results (and how the results were created, i.e., the arguments
of simulate_ts()
) in a file. For example, following code saves results
and arguments as an RDS-file with a file name that contains a time stamp
(requires library(tidyverse)
).
# results <- simulate_ts(samplers, estimators, required_dists)
write_rds(
list(results = results,
samplers = samplers,
estimators = estimators,
required_dists = required_dists),
paste0("sim_", format(Sys.time(), "%Y%m%d-%H%M%S"), ".RDS"))
To load these elements and add them to the global environment, commands similar to the following can be used.
lst <- read_rds("sim_example.RDS")
list2env(lst, rlang::global_env())
simulate_ts()
returns a table with one row for each repetition of each
sampler and each estimator. The columns are
e_name
: name of the estimators_name
: name of the samplerrep_nr
: number of repetitionestimation
: the estimated location of the change pointpostpro_ts
: the time series of length k returned by the post-processorlen
: k
See also rmae_eval()
, plot_result_summary()
, and
plot_result_all()
.
Use the argument n_cores
to enable parallel runs of repetitions. The
default is n_cores=1
, which runs the code non-parallel. There is some
overhead when using multiple cores. Thus, n_cores=2
might not always
be faster than n_cores=1
and is always less than twice as fast.
results <- simulate_ts(
samplers, estimators, required_dists,
n_cores=parallel::detectCores())
Set a random seed via set.seed()
to make simulations reproducible.
This should also work with parallel execution, i.e., with an argument
n_cores
greater than 1, but was not extensively tested.
set.seed(1)
results <- simulate_ts(samplers, estimators, required_dists)
In file def_prepros.R
add a named function object to the list
prepros
. The function must take the three arguments
point_ts, pd_ts, dist_mats
and return a matrix. Alternatively, one can
add Pre-Processors without changing the package code using the function
register_prepro(name, fun)
, e.g.:
f <- function(point_ts, pd_ts, dist_mats) {
t(sapply(point_ts, rowMeans))
}
register_prepro("EUCL_MEAN", f)
In file def_postpros.R
add a named function object to the list
postpros
. The function must one argument – a
(ℓ′ × q′)-matrix and return a vector.
Alternatively, one can add Post-Processors without changing the package
code using the function register_postpro(name, fun)
, e.g.:
f <- function(X) {
rowMeans(X^2)
}
register_postpro("norm_2_sq", f)
In file def_dectectors.R
add a named function object to the list
detectors
. The function must 2 argument – a vector representing the
post-processed TS and the length of the original TS n as a single
integer – and return a single double
(real value) – the estimated
location of the CP (should be in [1, n]. Alternatively, one can add
Detectors without changing the package code using the function
register_detector(name, fun)
, e.g.:
f <- function(y, n) {
which.min(y) + (n-length(y))/2
}
register_detector("argmin", f)
Just create new PNGs and put them in a subfolder img
of your working
directory. They can be referred to by their filename (without ending
.png
).
Currently, no other types of distributions are implemented. But the
function create_distris()
(file distris.R
) has an argument type
which allows for such extensions in the future. Then the call of
create_distris()
inside of sample_ts()
might also need change.
Change sample_point_ts()
in file simulate.R
.