docs(mkdwarfs): more accurate details of nilsimsa ordering algorithm

mhx · mhx · commit 91a58b1eea91 · 2024-02-27T19:52:47.000+01:00
diff --git a/doc/mkdwarfs.md b/doc/mkdwarfs.md
@@ -298,15 +298,13 @@ Most other options are concerned with compression tuning:
   determine a good ordering.
   However, the new implementation of this algorithm can be parallelized and
   will perform much better on huge numbers of files. `nilsimsa` ordering can
-  be tweaked by specifying `max-children` and `max-cluster-size`. Both options
-  determine how the set of files will be split into clusters, each of which will
-  be further split recursively. `max-children` is the maximum number of child
-  nodes resulting from a clustering step. If `max-children` distinct clusters
-  have been found, new files will be added to the closest cluster. `max-cluster-size`
-  determines at which point a cluster will no longer be split further. Typically,
-  larger values will result in better ordering, but will also make the algorithm
-  slower. Unlike the old implementation, `nilsimsa` ordering is now completely
-  deterministic.
+  be tweaked by specifying `max-children` and `max-cluster-size`. In general,
+  larger values for `max-cluster-size` tend to result in better compression,
+  but will slow down the algorithm quadratically. There is no point in setting
+  `max-cluster-size` larger than the number of files in the input.
+  Unlike the old implementation, `nilsimsa` ordering is now completely
+  deterministic. See [Nilsimsa Ordering](#nilsimsa-ordering) for a detailed
+  description of the algorithm.
 
 - `--max-similarity-size=`*value*:
   Don't perform similarity ordering for fragments (or files if they are not split
@@ -792,6 +790,44 @@ When using different ordering schemes, the file inodes will be
 either sorted upfront, or just sent to the segmenter in the order
 in which they were discovered.
 
+### Nilsimsa Ordering
+
+The actual ordering step for nilsimsa ordering uses a recursive
+divide-and-conquer approach in order for the algorithm to be both
+parallelizable and deterministic. In the following description, a
+"node" is typically equivalent to a "unique file", although that's
+not a requirement for the algorithm.
+
+1. Nodes are clustered by distance to centroids into up to `max-children`
+   clusters. With `max-children` set to 1, all nodes end up in the same
+   cluster. If `max-children` is larger than 1, a new cluster is created
+   as soon as all existing clusters are further away than a "maximum
+   distance" that gets smaller as recursion depth increases.
+
+2. After the nodes have been clustered, each cluster that is larger than
+   `max-cluster-size` is recursively clustered again with a smaller
+   "maximum distance" as per (1). These recursive clustering steps can
+   potentially run in parallel.
+
+3. As soon as a cluster is smaller than `max-cluster-size`, its nodes
+   will be ordered by performing a nearest neighbour search. Note that
+   this only happens for leaf clusters in the tree. The ordering of each
+   leaf cluster can also run parallel with clustering / ordering of other
+   clusters.
+
+4. Once all clustering / ordering is done, the nodes are "collected" from
+   the clusters in the tree. There is currently no similarity ordering
+   between individual clusters, but this is something that can be explored
+   further in the future.
+
+By setting `max-children` to 1 and `max-cluster-size` to a really large
+number, only a single cluster will be created and the nearest neighbour
+search will be performed on the set of all nodes. Since the algorithm is
+O(n^2), this does not scale well for `max-cluster-size` beyong a few
+100,000. Also, since the algorithm does not minimize the global distance
+between all nodes, there's no guarantee that the result will be better
+if you use only a single cluster.
+
 ## AUTHOR
 
 Written by Marcus Holland-Moritz.