Variable Length Markov Chain

Build variable-length Markov models
Blazingly fast top-down approach using the Peres-Shield method
Built with Rust

Documentation • Paper • Author • Literature

Implementation of Variable Length Markov Chains (VLMC) for Python.

The suffix tree is built top-down using the Peres-Shield order estimation method. The library is written in Rust with Python bindings.

Installation

Pre-built packages for many Linux, Windows, and macOS systems are available on PyPI and can be installed with:

pip install vlmc

On uncommon architectures, you may need to first install Cargo before running pip install vlmc.

Compilation from source

To compile from source you will need to install Rust/Cargo, plus maturin for the Python bindings. Maturin is best used within a Python virtual environment:

# activate your desired virtual environment first, then:
pip install maturin
git clone https://github.com/antonio-leitao/vlmc.git
cd vlmc
# build and install the package:
maturin develop --release

Usage

Complete documentation is available here

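import vlmc

# alphabet_size must exceed the largest symbol in your data
# (4 covers the symbols 0..3 used in the fit example).
alphabet_size = 4
tree = vlmc.VLMC(alphabet_size, max_depth=10)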

Parameters:

alphabet_size: Total number of symbols in the alphabet. It must be strictly greater than the largest integer in the data; otherwise runtime errors will occur.
max_depth: Maximum depth of the tree. Subsequences longer than max_depth are neither considered nor counted.
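Because alphabet_size has to exceed the largest symbol, it can be derived from the data rather than guessed. The helper below is a minimal sketch; infer_alphabet_size is a hypothetical name and not part of the vlmc package.

```python
def infer_alphabet_size(data):
    """Return the smallest valid alphabet size for integer-encoded sequences."""
    symbols = [symbol for sequence in data for symbol in sequence]
    if not symbols:
        raise ValueError("data contains no symbols")
    if min(symbols) < 0:
        raise ValueError("symbols must be non-negative integers")
    # Symbols run from 0 to alphabet_size - 1, so add one to the maximum.
    return max(symbols) + 1


data = [[1, 2, 3], [2, 3], [1, 0, 1], [2]]
alphabet_size = infer_alphabet_size(data)
```

The result can then be passed as the first argument of vlmc.VLMC.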

`fit`

Note that fit returns None rather than self. This is by design, so that the Rust object is not exposed to Python.

data = [
  [1,2,3],
  [2,3],
  [1,0,1],
  [2]
]

tree.fit(data)

Arguments:

data: List of lists containing the sequences of discrete values to fit on. Values are assumed to be integers from 0 to alphabet_size - 1. The list is expected to be two-dimensional.
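To make the max_depth cutoff concrete, here is a pure-Python sketch that counts every contiguous subsequence up to a given length. It only illustrates the idea; count_subsequences is a hypothetical helper, and the Rust implementation behind fit may organize its counts differently.

```python
from collections import Counter


def count_subsequences(data, max_depth):
    """Count every contiguous subsequence of length 1..max_depth."""
    counts = Counter()
    for sequence in data:
        for start in range(len(sequence)):
            # Stop at max_depth: longer subsequences are never counted.
            longest = min(max_depth, len(sequence) - start)
            for length in range(1, longest + 1):
                counts[tuple(sequence[start:start + length])] += 1
    return counts


data = [[1, 2, 3], [2, 3], [1, 0, 1], [2]]
counts = count_subsequences(data, max_depth=2)
```

With max_depth=2, the subsequence (1, 2, 3) never appears in counts even though it occurs in the data.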

`get_suffix`

Given a sequence, returns the longest suffix that is present in the VLMC.

suffix = tree.get_suffix(sequence)

Arguments:

sequence: list of integers representing a sequence of discrete variables.

Returns:

suffix : longest suffix of sequence that is present in the VLMC.
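The lookup can be pictured as trying suffixes from longest to shortest. The sketch below assumes the stored contexts are available as a set of tuples; longest_known_suffix is a hypothetical helper, and returning None when nothing matches is a choice made for this sketch, not documented library behavior.

```python
def longest_known_suffix(sequence, contexts):
    """Return the longest suffix of `sequence` found in `contexts`, or None."""
    for start in range(len(sequence)):
        # Suffixes are tried from longest to shortest.
        candidate = tuple(sequence[start:])
        if candidate in contexts:
            return list(candidate)
    return None


contexts = {(3,), (2, 3)}
suffix = longest_known_suffix([1, 2, 3], contexts)
```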

`get_counts`

Gets the total number of occurrences of a given sequence of integers. Raises a KeyError if the sequence is not a tree node. Consider calling get_suffix first to make sure you have a tree node.

counts = tree.get_counts(sequence)

Arguments:

sequence: list of integers representing a sequence of discrete variables.

Returns:

counts : integer
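The get_suffix advice can be wrapped in a small helper. This is a sketch that assumes the value returned by get_suffix can be passed straight to get_counts, as the description suggests; counts_for_longest_suffix is a hypothetical name, and what happens when no suffix of the sequence is stored is not covered here.

```python
def counts_for_longest_suffix(tree, sequence):
    """Return (suffix, counts) for the longest suffix of `sequence` in `tree`.

    `tree` is expected to offer get_suffix and get_counts as described above.
    """
    # Reduce the sequence to a stored tree node before asking for counts.
    suffix = tree.get_suffix(sequence)
    return suffix, tree.get_counts(suffix)
```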

`get_distribution`

Gets the vector of probabilities over the entire alphabet for the given sequence. Raises a KeyError if the sequence is not a tree node. Consider calling get_suffix first to make sure you have a tree node.

probabilities = tree.get_distribution(sequence)

Arguments:

sequence: list of integers representing a sequence of discrete variables.

Returns:

probabilities : list of floats representing the probability of observing a specific state (index) as the next symbol.
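Since each index of the returned list is a symbol, the list can be used directly for prediction or sampling. The helpers below are an illustrative sketch using only the standard library; most_likely_symbol and sample_symbol are hypothetical names, and the probabilities shown are made up for the example.

```python
import random


def most_likely_symbol(probabilities):
    """Return the index (symbol) with the highest probability."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


def sample_symbol(probabilities, rng=random):
    """Draw one symbol index according to `probabilities`."""
    symbols = range(len(probabilities))
    return rng.choices(symbols, weights=probabilities, k=1)[0]


probabilities = [0.1, 0.6, 0.3, 0.0]  # example values, not library output
best = most_likely_symbol(probabilities)
```

Passing a seeded random.Random instance as rng makes the sampling reproducible.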

`get_contexts`

contexts = tree.get_contexts()

Returns:

contexts: list of relevant contexts according to the Peres-Shield tree pruning method. Contexts are ordered by length.
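A quick way to inspect the result is to group contexts by length. The sketch below assumes each context is a list of integers, which the description implies but does not state; group_by_length is a hypothetical helper.

```python
from collections import defaultdict


def group_by_length(contexts):
    """Map each context length to the contexts of that length."""
    groups = defaultdict(list)
    for context in contexts:
        groups[len(context)].append(context)
    return dict(groups)


# Example input standing in for the output of tree.get_contexts().
groups = group_by_length([[1], [2], [2, 3]])
```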

TODO

Parallelization

After experimentation, the most promising approach to parallelization is to build a separate hashmap for each subsequence length. The hashmaps are then joined from longest to shortest, and the hashmap at max_depth + 1 can be discarded afterwards. This could be very fast, depending on the merging algorithm.
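One possible reading of this idea is sketched below in Python rather than Rust. It builds one counter per subsequence length in a thread pool, then folds them together from longest to shortest into a table of next-symbol counts. The names count_length and build_transitions, and the way the maps are joined, are assumptions made for illustration and do not describe the planned implementation. In CPython the threads show the structure only, not a real speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_length(data, length):
    """Count contiguous subsequences of exactly `length` symbols."""
    counts = Counter()
    for sequence in data:
        for start in range(len(sequence) - length + 1):
            counts[tuple(sequence[start:start + length])] += 1
    return counts


def build_transitions(data, alphabet_size, max_depth):
    """Build context -> next-symbol counts from one hashmap per length."""
    # Length max_depth + 1 supplies the transitions of the deepest contexts.
    lengths = list(range(1, max_depth + 2))
    with ThreadPoolExecutor() as pool:
        maps = pool.map(lambda k: count_length(data, k), lengths)
        per_length = dict(zip(lengths, maps))

    transitions = {}
    # Join from the longest subsequences down to the shortest,
    # dropping each per-length hashmap once it has been merged.
    for length in sorted(per_length, reverse=True):
        for subsequence, count in per_length.pop(length).items():
            context, symbol = subsequence[:-1], subsequence[-1]
            row = transitions.setdefault(context, [0] * alphabet_size)
            row[symbol] += count
    return transitions


data = [[1, 2, 3], [2, 3], [1, 0, 1], [2]]
transitions = build_transitions(data, alphabet_size=4, max_depth=1)
```

Each per-length counter is independent of the others, which is what makes the counting step easy to distribute; only the final join is sequential.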
