Skip to content

Latest commit

 

History

History
198 lines (133 loc) · 7.14 KB

README.org

File metadata and controls

198 lines (133 loc) · 7.14 KB

Clojure Tightly Packed Trie

https://img.shields.io/clojars/v/com.owoga/tightly-packed-trie.svg

What does this do?

Tries as hash-maps are common, but hash-maps take up a lot of memory (relatively speaking).

For example, creating a hash-map trie of 1, 2, and 3-grams of short story by Edgar Allen Poe results in a hash-map that consumes over 2 megabytes of memory. See this markov language model example.

If you’re dealing with much larger corpuses, the memory footprint could become an issue.

A tightly packed trie, on the other hand, is tiny. A tightly packed trie on the same corpus is only 37 kilobytes. That’s ~4% of the original trie’s size, even after the original trie’s keys/values have all been condensed to numbers!

How do you use library?

A trie is created similar to a hash-map by passing a variable number of “trie entries” to trie.

A “trie entry” is basically the same thing as a map entry. It’s just a key and a value.

But for a Trie, the key must be seqable and for a tightly-packed trie all keys must be comparable.

(require '[com.owoga.trie :as trie])

(def loosely-packed-trie (trie/make-trie "dog" :dog "dot" :dot "do" :do "day" :day))
loosely-packed-trie
;; => {[\d \a \y] :day, [\d \o \g] :dog, [\d \o \t] :dot, [\d \o] :do}

You’ll see from the output of that last line above that the default REPL representation of a Trie is a flat hash-map-looking-thing. It’s actually a sorted-hash-map-looking-thing, because if you seq over it, you’ll get the trie-entries in depth-first post-order traversal.

In some ways, a Trie behaves a lot like a map.

`get` returns the value at the key.

(get loosely-packed-trie "dog")
;; => :dog
(get loosely-packed-trie "do")
;; => :do
(get (assoc loosely-packed-trie "dove" {:value "dove" :count 10}) "dove")
;; => {:value "dove", :count 10}

But there’s a couple cool Trie-specific functions.

`lookup` returns the Trie at the key. This way, you have access to all of the node’s descendants.

(trie/lookup loosely-packed-trie "do")
;; => {[\g] :dog, [\t] :dot}
(seq (trie/lookup loosely-packed-trie "do"))

`children` returns the direct children of a node.

(trie/children (trie/lookup loosely-packed-trie "do"))
;; => ({} {})

That’s odd… there’s two things in there that look like empty maps.

(map #(get % []) (trie/children (trie/lookup loosely-packed-trie "do")))
;; => (:dog :dot)

The REPL representation of a Trie only shows children key/values. The “root” node (not necessarily the “true” root node if you’ve travsersed down with `lookup`) doesn’t print any data to REPL. So if you’re looking ata node with no children, you’ll see `{}` in the REPL. But you can get the value of that node with `(get node [])`

Tightly Packed Tries

The trie above is backed by regular old Clojure data structures: hash-maps and vectors.

It’s not very efficient. All of the strings, nested maps, pointers… it all adds up to a lot of wasted memory.

A tightly packed trie provides the same functionality at an impressively small fraction of the memory footprint.

One restriction though: all keys and values must be integers. To convert them from integer identifiers back into the values that your biological self can process, you’ll need to keep some type of database or in-memory map of ids to human-parseable things.

Here’s a similar example to that above, but with values that we can tightly pack.

(require '[com.owoga.tightly-packed-trie :as tpt]
         '[com.owoga.tightly-packed-trie.encoding :as encoding])

(defn encode-fn [v]
  (if (nil? v)
    (encoding/encode 0)
    (encoding/encode v)))

(defn decode-fn [byte-buffer]
  (let [v (encoding/decode byte-buffer)]
    v
    (if (zero? v) nil v)))

(def tight-ready-loosely-packed-trie
  (trie/make-trie '(1 2 3) 123 '(1 2 1) 121 '(1 2 2) 122 '(1 3 1) 131))

(def tightly-packed-trie
  (tpt/tightly-packed-trie
   tight-ready-loosely-packed-trie
   encode-fn
   decode-fn))

(get tightly-packed-trie [1 2 3])
;; => 123

(map #(get % []) (trie/children (trie/lookup tightly-packed-trie [1 2])))
;; => (121 122 123)

(seq tightly-packed-trie)
;; => ([[1 2 1] 121]
;;     [[1 2 2] 122]
;;     [[1 2 3] 123]
;;     [[1 2] nil]
;;     [[1 3 1] 131]
;;     [[1 3] nil]
;;     [[1] nil])

Instead of a map with all of its pointers, we are storing all of the information necessary for this trie in just 39 bytes!

(require '[cljol.dig9 :as d])

(.capacity (.byte-buffer tightly-packed-trie))
;; => 39

It’s backed by a byte-buffer so saving to disk is trivial, but there’s a helper for that.

Here’s the process of saving to and loading from disk. (Only works for tightly-packed tries.)

(tpt/save-tightly-packed-trie-to-file "/tmp/tpt.bin" tightly-packed-trie)

(def saved-and-loaded-tpt
  (tpt/load-tightly-packed-trie-from-file "/tmp/tpt.bin" decode-fn))

(get saved-and-loaded-tpt '(1 2 3))
;; => 123

Credits

Ulrich Germann, Eric Joanis, and Samuel Larkin of the National Research Institute of Canada for the paper Tightly Packed Tries: How to Fit Large Models into Memory,and Make them Load Fast, Too.

Lots of credit also goes to the Clojurians community.

Why would you want a trie data structure?

TODO: The below is closer to a CSCI lesson than library documentation. If it’s necessary, figure out where to put it, how to word it, etc… It might not be worth cluttering documentation with so much detail.

Autocomplete

A user types in the characters “D” “O” and you want to show all possible autocompletions.

Typical “List” data structure

  • Iterate through each word starting from the beginning.
  • When you get to the first word that starts with the letters “D” “O”, start keeping track of words
  • When you get to the next word that doesn’t start with “D” “O”, you have all the words you want to use for autocomplete.
(def dictionary ["Apple" "Banana" "Carrot" "Do" "Dog" "Dot" "Dude" "Egg"])

Problems with a list.

It’s slow if you have a big list. If you have a dictionary with hundreds of thousands of words and the user is typing in letters that don’t show up until the end of the list, then you’re searching through the first few hundred thousand items in the list before you get to what you need.

If you’re familiar with binary search over sorted lists, you’ll know this is a contrived example.

Typical “Trie” in Clojure

{"A" {:children {"P" {,,,} :value nil}}
 "D" {:children {"O"
                 :children {"G" {:children {} :value "DOG"}
                            "T" {:children {} :value "DOT"}}
                 :value "DO"}
      :value nil}}

How is a trie faster?

Development