@@ -298,15 +298,13 @@ Most other options are concerned with compression tuning:
298
298
determine a good ordering.
299
299
However, the new implementation of this algorithm can be parallelized and
300
300
will perform much better on huge numbers of files. `nilsimsa` ordering can
301
- be tweaked by specifying `max-children` and `max-cluster-size`. Both options
302
- determine how the set of files will be split into clusters, each of which will
303
- be further split recursively. `max-children` is the maximum number of child
304
- nodes resulting from a clustering step. If `max-children` distinct clusters
305
- have been found, new files will be added to the closest cluster. `max-cluster-size`
306
- determines at which point a cluster will no longer be split further. Typically,
307
- larger values will result in better ordering, but will also make the algorithm
308
- slower. Unlike the old implementation, `nilsimsa` ordering is now completely
309
- deterministic.
301
+ be tweaked by specifying `max-children` and `max-cluster-size`. In general,
302
+ larger values for `max-cluster-size` tend to result in better compression,
303
+ but run time grows quadratically with the cluster size. There is no point in setting
304
+ `max-cluster-size` larger than the number of files in the input.
305
+ Unlike the old implementation, `nilsimsa` ordering is now completely
306
+ deterministic. See [Nilsimsa Ordering](#nilsimsa-ordering) for a detailed
307
+ description of the algorithm.
310
308
311
309
- `--max-similarity-size=`*value*:
312
310
Don't perform similarity ordering for fragments (or files if they are not split
@@ -792,6 +790,44 @@ When using different ordering schemes, the file inodes will be
792
790
either sorted upfront, or just sent to the segmenter in the order
793
791
in which they were discovered.
794
792
793
+ ### Nilsimsa Ordering
794
+
795
+ The actual ordering step for nilsimsa ordering uses a recursive
796
+ divide-and-conquer approach in order for the algorithm to be both
797
+ parallelizable and deterministic. In the following description, a
798
+ "node" is typically equivalent to a "unique file", although that's
799
+ not a requirement for the algorithm.
800
+
801
+ 1. Nodes are clustered by distance to centroids into up to `max-children`
802
+ clusters. With `max-children` set to 1, all nodes end up in the same
803
+ cluster. If `max-children` is larger than 1, a new cluster is created
804
+ as soon as all existing clusters are further away than a "maximum
805
+ distance" that gets smaller as recursion depth increases.
806
+
807
+ 2. After the nodes have been clustered, each cluster that is larger than
808
+ `max-cluster-size` is recursively clustered again with a smaller
809
+ "maximum distance" as per (1). These recursive clustering steps can
810
+ potentially run in parallel.
811
+
812
+ 3. As soon as a cluster is smaller than `max-cluster-size`, its nodes
813
+ will be ordered by performing a nearest neighbour search. Note that
814
+ this only happens for leaf clusters in the tree. The ordering of each
815
+ leaf cluster can also run in parallel with clustering / ordering of other
816
+ clusters.
817
+
818
+ 4. Once all clustering / ordering is done, the nodes are "collected" from
819
+ the clusters in the tree. There is currently no similarity ordering
820
+ between individual clusters, but this is something that can be explored
821
+ further in the future.
822
+
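The four steps above can be sketched in a few lines of Python. This is only a rough, sequential illustration and not the actual implementation: the names `hamming`, `cluster`, `nearest_neighbour_order` and `order_nodes` are hypothetical, nodes are assumed to be integer bit hashes compared by Hamming distance, the first member of a cluster stands in for its centroid, the "maximum distance" is assumed to halve at each recursion level, and a cluster of exactly `max_cluster_size` nodes is treated as a leaf.

```python
from typing import List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes (assumed distance metric)."""
    return bin(a ^ b).count("1")


def nearest_neighbour_order(nodes: List[int]) -> List[int]:
    """Greedy chain: always continue with the node closest to the last one."""
    if not nodes:
        return []
    remaining = list(nodes)
    ordered = [remaining.pop(0)]
    while remaining:
        last = ordered[-1]
        # min() returns the first minimal index, so ties are resolved by
        # input order and the result is deterministic.
        idx = min(range(len(remaining)),
                  key=lambda i: hamming(last, remaining[i]))
        ordered.append(remaining.pop(idx))
    return ordered


def cluster(nodes: List[int], max_children: int,
            max_distance: int) -> List[List[int]]:
    """Step 1: assign each node to the closest cluster, or open a new one."""
    clusters: List[List[int]] = []
    for node in nodes:
        best, best_dist = None, None
        for c in clusters:
            d = hamming(node, c[0])  # c[0] stands in for the centroid
            if best is None or d < best_dist:
                best, best_dist = c, d
        too_far = best is None or best_dist > max_distance
        if best is None or (too_far and len(clusters) < max_children):
            clusters.append([node])
        else:
            best.append(node)
    return clusters


def order_nodes(nodes: List[int], max_children: int,
                max_cluster_size: int, max_distance: int = 128) -> List[int]:
    # Step 3: leaf clusters are ordered by nearest neighbour search.
    if len(nodes) <= max_cluster_size:
        return nearest_neighbour_order(nodes)
    clusters = cluster(nodes, max_children, max_distance)
    if len(clusters) == 1:
        # Assumed fallback: retry with a smaller distance, and give up
        # splitting once the distance cannot shrink any further.
        if max_distance == 0:
            return nearest_neighbour_order(nodes)
        return order_nodes(nodes, max_children, max_cluster_size,
                           max_distance // 2)
    # Step 2: recurse into each cluster (these calls are independent and
    # could run in parallel); step 4: collect the results in cluster order.
    result: List[int] = []
    for c in clusters:
        result.extend(order_nodes(c, max_children, max_cluster_size,
                                  max_distance // 2))
    return result
```

For example, `order_nodes([0b0000, 0b1111, 0b0001, 0b1110], max_children=2, max_cluster_size=2, max_distance=2)` splits the input into the clusters `[0b0000, 0b0001]` and `[0b1111, 0b1110]` and returns them concatenated in that order. Note that the two clusters are simply appended to each other; as described in step 4, there is no similarity ordering between them.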
823
+ By setting `max-children` to 1 and `max-cluster-size` to a really large
824
+ number, only a single cluster will be created and the nearest neighbour
825
+ search will be performed on the set of all nodes. Since the algorithm is
826
+ O(n^2), this does not scale well for `max-cluster-size` beyond a few
827
+ 100,000. Also, since the algorithm does not minimize the global distance
828
+ between all nodes, there's no guarantee that the result will be better
829
+ if you use only a single cluster.
830
+
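To get a feeling for the quadratic cost, the following back-of-the-envelope sketch counts distance evaluations under a simplified cost model. The model is an assumption, not a measurement: it considers only a greedy nearest neighbour chain, ignores the cost of the clustering steps themselves, and assumes the nodes are spread evenly across leaf clusters. The function names are hypothetical.

```python
def nn_comparisons(n: int) -> int:
    """Distance evaluations for a greedy nearest neighbour chain over n nodes."""
    return n * (n - 1) // 2


def clustered_comparisons(n: int, clusters: int) -> int:
    """Same count when n nodes are split evenly into `clusters` leaf clusters."""
    size, extra = divmod(n, clusters)
    # `extra` clusters hold one additional node each.
    return (extra * nn_comparisons(size + 1)
            + (clusters - extra) * nn_comparisons(size))


single = nn_comparisons(100_000)              # 4_999_950_000
split = clustered_comparisons(100_000, 100)   # 49_950_000
print(single // split)                        # about 100x fewer evaluations
```

Under this model, splitting 100,000 nodes into 100 leaf clusters of 1,000 nodes each reduces the work of the ordering step by roughly a factor of 100, which is why a moderate `max-cluster-size` is usually the more practical choice.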
795
831
## AUTHOR
796
832
797
833
Written by Marcus Holland-Moritz.