Skip to content
/ LCSvec Public

Package solving the Longest Common Subsequence/Substring (LCS) for integer vectors (numpy, PyTorch, Jax, Tensorflow)

License

Notifications You must be signed in to change notification settings

Natooz/LCSvec

Folders and files

NameName
Last commit message
Last commit date
Mar 25, 2025
Mar 7, 2024
Mar 12, 2024
Mar 12, 2024
Mar 8, 2024
Mar 7, 2024
Aug 5, 2024
Sep 18, 2024
Aug 22, 2024
Feb 26, 2024
Aug 5, 2024
Sep 18, 2024

Repository files navigation

Python 3.8 GitHub CI GitHub license Downloads Code style

LCSvec

Longest Common Subsequence/Substring (LCS) solving package for vectors.

Why LCSvec

While looking for fast implementations solving the Longest Common Subsequence and Longest Common Substring (i.e. Contiguous Subsequence) (LCCS) problems, I only found pieces of code for strings, while I needed something working for vectors of integers. Yet, string LCS implementations 1) work at the character level, which might not be suitable for certain use-cases where one wants to work at word or sentence level; 2) only work with strings, other modalities will not work and might not be designed to be converted to bytes.

LCSvec aims to solve this gap by providing user-friendly and fast implementations of the LCS and LCCS problems for vectors. It works with numpy, pytorch, tensorflow and jax arrays/tensors! The code is written in C++ and the methods are bind with Python with nanobind for optimal performances.

Example

You can install the package with pip: pip install lcsvec. Here we will use numpy, the usage for other deep learning libraries is the same.

from lcsvec import lccs, lccs_length, lcs, lcs_length
import numpy as np

seq1 = np.arange(0, 12)
seq2 = np.array([8, 0, 1, 2, 8, 2, 3, 4, 5, 6], dtype=np.int64)

lcs_ = lcs(seq1, seq2)  # [0, 1, 2, 3, 4, 5, 6]
lcs_len = lcs_length(seq1, seq2)  # 7, more efficient than calling len(lcs(seq1, seq2))

lccs = lccs(seq1, seq2)  # [2, 3, 4, 5, 6]
lccs_len = lccs_length(seq1, seq2)  # 5, more efficient than calling len(lccs(seq1, seq2))

TODOs

  • implement lccs with suffix tree;
  • batch methods, i.e. supporting 2D arrays;
  • batch methods with any number of dimensions (nD array) and add a dim argument;
  • make it work with an unlimited number of sequences, and set dim and pad_token as kwargs only;

About

Package solving the Longest Common Subsequence/Substring (LCS) for integer vectors (numpy, PyTorch, Jax, Tensorflow)

Resources

License

Stars

Watchers

Forks

Packages

No packages published