You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
From a quick perusal of this code it consists of a series of checks to be performed on a corpus. The checks are fundamentally looking for similar names in the corpus, and the corpus is implemented as a Map (either hash or btree). Fundamentally this all looks like fuzzy queries over a data set, which is a well studied problem.
The fst as described in the excellent blog post Index 1,600,000,000 Keys with Automata and Rust allows storing the database in a very dense fashion while still supporting fuzzy queries. Levenstein is implemented in the crate, but it also supports defining your own similarity metrics. Fst really shines with extremely large data sets. I recently put all crate names in Fst and it was <2MB. I should still have that script around if you would like me to retrieve a more accurate number.
In situations where Fst is heavyweight for the number of items being searched there are other data structures that are efficient for doing similarity matching. I have heard of the fuzzy-search crate, but don't know how production ready it is.
The text was updated successfully, but these errors were encountered:
From a quick perusal of this code it consists of a series of checks to be performed on a corpus. The checks are fundamentally looking for similar names in the corpus, and the corpus is implemented as a Map (either hash or btree). Fundamentally this all looks like fuzzy queries over a data set, which is a well studied problem.
The fst as described in the excellent blog post Index 1,600,000,000 Keys with Automata and Rust allows storing the database in a very dense fashion while still supporting fuzzy queries. Levenstein is implemented in the crate, but it also supports defining your own similarity metrics. Fst really shines with extremely large data sets. I recently put all crate names in Fst and it was <2MB. I should still have that script around if you would like me to retrieve a more accurate number.
In situations where Fst is heavyweight for the number of items being searched there are other data structures that are efficient for doing similarity matching. I have heard of the fuzzy-search crate, but don't know how production ready it is.
The text was updated successfully, but these errors were encountered: