Fast NLP in Rust with Python bindings
This package aims to provide a convenient and high-performance toolkit for ingesting textual data for machine learning applications. Another goal of this project is to make existing Rust NLP crates accessible from Python through a common API.
The API is currently unstable.
- Tokenization: Regexp tokenizer, Unicode segmentation (see the segmentation sketch after this list)
- Stemming: Snowball (in Python 15-20x faster than NLTK)
- Analyzers (planned): word and character n-grams, skip grams
- Token counting: converting token counts to sparse matrices for use in machine learning libraries, similar to CountVectorizer and HashingVectorizer in scikit-learn.
- Feature weighting (planned): feature weighting based on document frequency (TF-IDF), supervised term weighting (e.g. TF-IGM), feature normalization.
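To illustrate what Unicode segmentation produces, here is a small sketch that calls the unicode-segmentation crate directly to apply the UAX #29 word-boundary rules this kind of tokenization is based on. It is illustration only, not vtext's own tokenizer API, and assumes `unicode-segmentation = "1"` has been added to `[dependencies]`.

```rust
// Illustrative sketch only: uses the `unicode-segmentation` crate directly to
// show UAX #29 word segmentation; this is not vtext's own tokenizer API.
extern crate unicode_segmentation;

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let text = "The \"brown\" fox can't jump 32.3 feet, right?";
    // unicode_words() yields word tokens, dropping punctuation and whitespace.
    let tokens: Vec<&str> = text.unicode_words().collect();
    println!("{:?}", tokens);
}
```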
Add the following to Cargo.toml,
```toml
[dependencies]
text-vectorize = { git = "https://github.com/rth/vtext" }
```
A simple example can be found below,
```rust
extern crate vtext;

use vtext::CountVectorizer;

#[allow(non_snake_case)]
fn main() {
    let documents = vec![
        String::from("Some text input"),
        String::from("Another line"),
    ];
    let mut vect = CountVectorizer::new();
    let X = vect.fit_transform(&documents);
}
```
where `X` is a `CSRArray` struct with the following attributes: `X.indptr`, `X.indices`, `X.values`.
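For readers unfamiliar with the CSR (compressed sparse row) layout behind these attributes, the sketch below shows how an indptr/indices/values triplet encodes the non-zero token counts of a document-term matrix and how to read them back row by row. It is a hand-written illustration under that assumption, not vtext's `CSRArray` implementation.

```rust
// Minimal sketch of the CSR layout; illustration only, not vtext's CSRArray.
struct Csr {
    indptr: Vec<usize>,  // row i spans values[indptr[i]..indptr[i + 1]]
    indices: Vec<usize>, // column (vocabulary term) index of each stored count
    values: Vec<i64>,    // the non-zero token counts themselves
}

fn main() {
    // Two documents over a vocabulary of 4 terms:
    // doc 0 counts {term 0: 1, term 2: 2}; doc 1 counts {term 1: 1, term 3: 1}.
    let x = Csr {
        indptr: vec![0, 2, 4],
        indices: vec![0, 2, 1, 3],
        values: vec![1, 2, 1, 1],
    };

    // Read the non-zero entries back row by row.
    for row in 0..x.indptr.len() - 1 {
        for k in x.indptr[row]..x.indptr[row + 1] {
            println!("doc {}, term {} -> count {}", row, x.indices[k], x.values[k]);
        }
    }
}
```

This is the same layout used by scipy.sparse.csr_matrix (its indptr, indices and data attributes), which is what scikit-learn's vectorizers return.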
The API aims to be compatible with scikit-learn's CountVectorizer and HashingVectorizer, though only a subset of their features will be implemented.
Below are some very preliminary benchmarks on the 20 newsgroups dataset of 19924 documents (~91 MB in total),
| | CountVectorizer | HashingVectorizer |
|---|---|---|
| scikit-learn 0.20.1 | 14 MB/s | 18 MB/s |
| vtext 0.1.0-a1 | 33 MB/s | 68 MB/s |
See benchmarks/README.md for more details. Note that the implementations are not strictly equivalent, so these numbers are meant only as a rough estimate of the possible performance improvements.
text-vectorize is released under the BSD 3-clause license.