revlist is slow for non-local storage layers #909
Description
Over in keybase/client#12730, we have a user of Keybase Git who noticed it can be very slow (and CPU-intensive) to fetch a single new commit from certain repos. Under the cover this is powered by a go-git Remote.Push()
call, which uses the Keybase storage layer as the Remote.Storer
instance, and uses the local git checkout as the remote side of the connection. (So the "fetch" becomes a push from the Keybase repo to the local checkout.)
The push is implemented by first getting the advertised reference heads from the local git checkout, and passing those commits as the ignore list to revlist.Objects()
, along with the commit hashes being pushed out of Keybase repo. The performance problem seems to happen here, which walks the entire commit history and the entire directory tree for each of the commits, in order to build up a more complete ignore list for the real revlist computation a few lines below.
This works acceptably fast when the storage layer is a local git repo, but when it's a distributed storage layer like KBFS, this can be very slow. And it seems unnecessary -- most of the ignored object IDs likely won't be used, since the tree-walking at the real computation will bail out at the first seen
object it sees while walking the tree, which for most single-commit fetches will be most of the entries of the root directory.
I think there is a better algorithm for this. I started playing around with one here, but it's pretty stupid and while it does definitely improves things for Keybase, it's not by as much as I think it should be able to.
I haven't worked out a complete solution yet, as I am hoping for dev feedback first. But I think the outline could be something like this:
- In
revlist.Objects
, for each object inobjs
, find the most-recent parent commit ID inignore
. Call the set of these commitsparents
. - For each commit in
parents
, compute agit diff-tree
-like algorithm against the corresponding commit inobjs
. (I'm not sure if this algorithm exists yet ingo-git
anywhere.) This would traverse and compare the directory entries between the two commits, and find the exact places they differ (traversing only as deeply as needed in each tree to figure out the differences). - The list of similarities between the two are added to the
ignore
list, along with all theparents
. - When constructing the final set of objects for the
revlist
using this newignore
list,reachableObjects
should return early as soon as it reaches a commit in the ignore list, and the pending list is empty.
I could be wrong, but I think this is a more efficient algorithm in all cases. Of course, it requires a bit more code complexity.
What do the devs think? cc: @smola @erizocosmico @mcuadros