Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation #12633
Conversation
Here are the results from running …

One WTF (wow that's funny) is why a … Another observation is that it takes quite a few MB of RAM to bring the final FST size close-ish to its optimal / minimal size (350.2 MB). It's also curious how the FST build time grows with a larger … I will try soonish to post a similar table from …
For comparison, this is how the curve (RAM required during construction vs final FST size) looks on trunk, using the god-like parameters as best I could. I sorted the results in reverse … I'll try to turn these two tables into a single chart comparing RAM used, final FST size, and maybe build time:
I didn't get into all the details but I think this looks good. Your questions are indeed intriguing -- I can't provide any explanation off the top of my head, really.
Translating/merging the above two tables into a graph … Some observations: …

If we replace …
@@ -99,31 +87,23 @@ public class FSTCompiler<T> {
 * tuning and tweaking, see {@link Builder}.
 */
public FSTCompiler(FST.INPUT_TYPE inputType, Outputs<T> outputs) {
Unrelated to this PR, just my thought: would it be more maintainable in the long run if we had only a single way to build the FST, through the Builder?
Yeah, that's a good point -- I'm not sure why we have a public ctor on this class. I'll see if I can remove it, maybe as follow-on PR.
this.fst = fst; | ||
this.in = in; | ||
} | ||
|
||
/** Compares an unfrozen node (UnCompiledNode) with a frozen node at byte location address (long), returning | ||
* true if they are equal. */ | ||
private boolean nodesEqual(FSTCompiler.UnCompiledNode<T> node, long address) throws IOException { | ||
fst.readFirstRealTargetArc(address, scratchArc, in); |
So does this mean we still need to read from the in-progress FST? I'm wondering whether it would be possible to read only from the cache, so we could decouple the FST from NodeHash.
I think we should do that (fully decouple NodeHash from the FST's byte[]) as a follow-on issue? This change is tricky enough :)
I love this idea! This way the hash grows to consume only as much RAM as needed, up until the specified limit, at which point it begins pruning to cap the RAM usage at the limit.
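As an aside for readers, the double-barrel LRU idea discussed here can be sketched in isolation. This is a minimal, hypothetical illustration (class names `DoubleBarrelLRU` and `Demo` are mine, and it uses plain `HashMap`s, not Lucene's paged packed-int tables): two generations of entries are kept; when the primary fills, it becomes the fallback and a fresh primary starts, implicitly evicting the oldest generation; hits in the fallback are promoted so recently used entries survive the next swap.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (NOT Lucene's actual NodeHash code) of a double-barrel
// LRU cache: two "barrels" (generations). When the primary barrel fills, the
// barrels swap and the old fallback's entries are dropped wholesale.
class DoubleBarrelLRU<K, V> {
  private final int maxSize;
  private Map<K, V> primary = new HashMap<>();
  private Map<K, V> fallback = new HashMap<>();

  DoubleBarrelLRU(int maxSize) {
    this.maxSize = maxSize;
  }

  V get(K key) {
    V v = primary.get(key);
    if (v == null) {
      v = fallback.get(key);
      if (v != null) {
        // promote: recently used entries survive the next barrel swap
        put(key, v);
      }
    }
    return v;
  }

  void put(K key, V value) {
    primary.put(key, value);
    if (primary.size() >= maxSize) {
      // swap barrels: anything not touched since the last swap is evicted
      fallback = primary;
      primary = new HashMap<>();
    }
  }
}

public class Demo {
  public static void main(String[] args) {
    DoubleBarrelLRU<String, Integer> cache = new DoubleBarrelLRU<>(2);
    cache.put("a", 1);
    cache.put("b", 2);                  // primary is full: barrels swap
    System.out.println(cache.get("a")); // fallback hit, promoted -> prints 1
    cache.put("c", 3);                  // swap again; "b" was never promoted
    System.out.println(cache.get("b")); // evicted -> prints null
  }
}
```

The appeal for suffix sharing is that eviction is O(1) amortized (no per-entry LRU bookkeeping): pruning happens in bulk at each swap, which caps RAM at roughly two barrels' worth.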
Thanks for the suggestions @dungba88! I took the approach you suggested, with a few more pushed commits just now. Despite the increase in …
I tested again on all terms from …

Rendered as a graph vs … It's less RAM than the previous …
assert node != 0;

// confirm frozen hash and unfrozen hash are the same
assert hash(node) == hash : "mismatch frozenHash=" + hash(node) + " vs hash=" + hash;
Not necessarily related to this CR:

This seems to be only for an assertion, but hash(long) requires reading the FST and is thus one obstacle to decoupling the NodeHash and FST. I'm wondering if it would make sense to let fst.addNode (it should be fstCompiler.addNode() now) return the node value instead of the node address.

The 2nd and final obstacle is nodeEquals. But it seems it would require the cache table to store the node value & address instead of just the address. Maybe we could build an LRU cache from nodeAddress to nodeValue? Storing values would mean more heap for small FSTs, though.
> This seems to be only for assertion, but hash(long) require reading the FST and thus one obstacle to decouple the NodeHash & FST. I'm wondering if it would make sense to let fst.addNode (it should be fstCompiler.addNode() now) return the node value instead of node address.

This assertion is quite important for now, because we have two completely different hash(...) implementations: one reads from an already frozen node written into the growing FST (a byte[]) and the other from an unfrozen node.
> The 2nd and final obstacle is nodeEquals. But it seems it would require the cache table to store the node value & address instead of just address. Maybe we can build a LRU cache from nodeAddress to nodeValue? Storing value would mean more heap for small FST though.

Hmm, if the hash table stores the full node value byte[] (instead of a reference into the growing FST byte[]), I think we would still need a nodeEquals method to compare unfrozen and frozen nodes? Or maybe we could just always freeze the pending unfrozen node to do our comparisons, but often that would be wasted work since the same node (suffix) already exists in the NodeHash?
OK, I switched the accounting to approximate RAM usage of the … If you pass … I think this is nearly ready -- I'll next clean up the remaining …
dd7ca3c to f0db846
OK, I think this is ready -- I removed/downgraded all … I set the default RAM limit to 32 MB, which should be plenty to get close (within 10%? at least for "real" terms) to the true minimal FST. I think we could backport this to 9.x. There is an API break in …
Thanks @mikemccand ! Great work!
I left some minor comments/questions, otherwise looks good to me :)
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java
LGTM, Thanks @mikemccand ! This is really a huge improvement in param friendliness :)
Thanks @gf2121 -- I agree! So much more intuitive to tell the FST compiler how much RAM it can use to make as minimal an FST as it can. This means we can build bigger FSTs with less worry about heap sizes, and once we get "stream to disk" working, then FST building in a fixed amount of RAM is truly possible. Thanks to Tantivy's FST implementation for inspiring these changes! I'll leave this for another day or so, and then merge at first only to …
boolean doShareSuffix,
boolean doShareNonSingletonNodes,
int shareMaxTailLength,
double suffixRAMLimitMB, // pass 0 to disable suffix compression/trie; larger values create
@param in method javadoc instead? This is not easy to read with spotless truncation.
Thanks @bruno-roustant -- I agree it looks awful :) -- but I think I'll just remove this wimpy comment. These private ctor parameters are better documented on the Builder setters.
* bounded by the number of unique suffixes. If you pass a value smaller than the builder would
* use, the least recently used suffixes will be discarded, thus reducing suffix sharing and
* creating a non-minimal FST. In this case, the larger the limit, the closer the FST will be to
* its true minimal size, with diminishing returns as you increasea the limit. Pass {@code 0} to
"increasea"
// there -- we would not need any long per entry -- we'd be able to start at the FST end node and
// work backwards from the transitions

// TODO: couldn't we prune natrually babck until we see a transition with an output? it's highly
"natrually babck"
@@ -110,8 +110,10 @@ protected long baseRamBytesUsed() {
public long ramBytesUsed() {
  long bytesUsed = RamUsageEstimator.alignObjectSize(baseRamBytesUsed());
  bytesUsed += RamUsageEstimator.alignObjectSize(RamUsageEstimator.shallowSizeOf(subMutables));
  // System.out.println("abstract.ramBytesUsed:");
Do we keep these commented prints?
I don't think we have a general rule :) I'll remove these ones.
@@ -99,20 +184,18 @@ private long hash(FSTCompiler.UnCompiledNode<T> node) {
    h += 17;
  }
}
// System.out.println("  ret " + (h&Integer.MAX_VALUE));
Do we need to mask with Long.MAX_VALUE below, since we mask anyway with the table mask?
Instead, we should multiply with the gold constant BitMixer#PHI_C64 (make it public).
This really makes a difference in the evenness of the value distribution. This is one of the secrets of the HPPC hashing. By applying this, we get multiple advantages:
- lookups should be improved (fewer hash collisions)
- we can try to rehash at 3/4 occupancy because the performance should not be impacted until this point.
- in case of hash collision, we can probe linearly with pos = pos + 1 instead of a quadratic probe (lines 95 and 327); this may avoid some memory cache misses.
(same for the other hash method)
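To illustrate the suggestion (this is my sketch, not Lucene or HPPC code; the class name GoldenMixDemo is hypothetical, though 0x9E3779B97F4A7C15L is the usual 64-bit golden-ratio constant): multiplying by the golden constant spreads the input's entropy across the bits before masking down to a power-of-two table slot, and linear probing then walks sequentially on collision.

```java
// Sketch of golden-ratio hash mixing plus linear probing, as suggested above.
public class GoldenMixDemo {
  // 2^64 / golden ratio -- the same value as HPPC BitMixer's PHI_C64.
  static final long PHI_C64 = 0x9E3779B97F4A7C15L;

  // Mix a raw hash so the masked low bits depend on all input bits.
  static int slot(long rawHash, int mask) {
    long h = rawHash * PHI_C64;
    h ^= h >>> 32; // fold the well-mixed high bits into the low bits
    return (int) h & mask;
  }

  // Linear probing: pos = (pos + 1) & mask, instead of a quadratic probe;
  // sequential probes are friendlier to the memory cache.
  // Assumes keys are nonzero and 0 marks an empty table slot.
  static int insert(long[] table, long key, int mask) {
    int pos = slot(key, mask);
    while (table[pos] != 0) {
      pos = (pos + 1) & mask;
    }
    table[pos] = key;
    return pos;
  }

  public static void main(String[] args) {
    int mask = 15;
    long[] table = new long[mask + 1];
    for (long key = 1; key <= 8; key++) {
      System.out.println("key " + key + " -> slot " + insert(table, key, mask));
    }
  }
}
```

Note how sequential keys, which would land in sequential slots under a plain mask, get scattered across the table after mixing; that evenness is what allows the higher fill factor mentioned above.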
Do we need to mask with Long.MAX_VALUE below, since we mask anyway with the table mask?
You're right, this is pointless -- I'll remove from both hash functions -- then we preserve that top (sign) bit for this following change:
Instead, we should multiply with the gold constant BitMixer#PHI_C64 (make it public).
Whoa, this sounds awesome! I was wondering if we could improve the simplistic hashing here ... I'll open a spinoff issue with this idea. Sounds like a low hanging hashing fruit!
I'll open a spinoff issue with this idea. Sounds like a low hanging hashing fruit!
I opened #12704. Thanks @bruno-roustant!
I love the graphs!
I'll also confirm …
…n rehash using the wrong bitsRequired (count vs node)
…lockTreeTermsWriter
…util wikipedia index and build an FST from them
…; when half of the double barrel is full, allocate new primary hash at full size to save cost of continuously rehashing for a large FST
…r users); clean up some stale nocommits
…ze in TestFSTs.testRealTerms
…during merge conflict resolution
… JVM; fix bogus assert (uncovered by Test2BFST); add TODO to Test2BFST anticipating building massive FSTs in small bounded RAM
e73bf04 to e954556
…w to run it directly
That last check failed because of formatting -- I re-tidy'd and I think it's ready -- I'll push shortly.
…ache.org

* upstream/main: (239 commits)
  Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation (apache#12633)
  Fix index out of bounds when writing FST to different metaOut (apache#12697) (apache#12698)
  Avoid object construction when linear searching arcs (apache#12692)
  chore: update the Javadoc example in Analyzer (apache#12693)
  coorect position on entry in CHANGES.txt
  Refactor ByteBlockPool so it is just a "shift/mask big array" (apache#12625)
  Extract the hnsw graph merging from being part of the vector writer (apache#12657)
  Specialize `BlockImpactsDocsEnum#nextDoc()`. (apache#12670)
  Speed up TestIndexOrDocValuesQuery. (apache#12672)
  Remove over-counting of deleted terms (apache#12586)
  Use MergeSorter in StableStringSorter (apache#12652)
  Use radix sort to speed up the sorting of terms in TermInSetQuery (apache#12587)
  Add timeouts to github jobs. Estimates taken from empirical run times (actions history), with a generous buffer added. (apache#12687)
  Optimize OnHeapHnswGraph's data structure (apache#12651)
  Add createClassLoader to replicator permissions (block specific to jacoco). (apache#12684)
  Move changes entry before backporting CHANGES
  Move testing properties to provider class (no classloading deadlock possible) and fallback to default provider in non-test mode
  simple cleanups to vector code (apache#12680)
  Better detect vector module in non-default setups (e.g., custom module layers) (apache#12677)
  ...
…ST compilation (#12633)

* tweak comments; change if to switch
* remove old SOPs, minor comment styling, fixed silly performance bug on rehash using the wrong bitsRequired (count vs node)
* first raw cut; some nocommits added; some tests fail
* tests pass!
* fix silly fallback hash bug
* remove SOPs; add some temporary debugging metrics
* add temporary tool to test FST performance across differing NodeHash sizes
* remove (now deleted) shouldShareNonSingletonNodes call from Lucene90BlockTreeTermsWriter
* add simple tool to render results table to GitHub MD
* add simple temporary tool to iterate all terms from a provided luceneutil wikipedia index and build an FST from them
* first cut at using packed ints for hash table again
* add some nocommits; tweak test_all_sizes.py to new RAM usage approach; when half of the double barrel is full, allocate new primary hash at full size to save cost of continuously rehashing for a large FST
* switch to limit suffix hash by RAM usage not count (more intuitive for users); clean up some stale nocommits
* switch to more intuitive approximate RAM (mb) limit for allowed size of NodeHash
* nuke a few nocommits; a few more remain
* remove DO_PRINT_HASH_RAM
* no more FST pruning
* remove final nocommit: randomly change allowed NodeHash suffix RAM size in TestFSTs.testRealTerms
* remove SOP
* tidy
* delete temp utility tools
* remove dead (FST pruning) code
* add CHANGES entry; fix one missed fst.addNode -> fstCompiler.addNode during merge conflict resolution
* remove a mal-formed nocommit
* fold PR feedback
* fold feedback
* add gradle help test details on how to specify heap size for the test JVM; fix bogus assert (uncovered by Test2BFST); add TODO to Test2BFST anticipating building massive FSTs in small bounded RAM
* suppress sysout checks for Test2BFSTs; add helpful comment showing how to run it directly
* tidy
This is a first attempt (not committable yet! work in progress! many nocommits!) to use a bounded NodeHash when finding common suffixes during FST compilation. It uses a simple double-barrel LRU cache to hold the NodeHash entries.

I created a simple tool, IndexToFST, to iterate all terms from a luceneutil benchy index and build an FST from them, and test_all_sizes.py to run this tool multiple times with different NodeHash sizes and gather the resulting metrics.

Relates #12542