revlist is slow for non-local storage layers

Over in https://github.com/keybase/client/issues/12730, we have a user of Keybase Git who noticed it can be very slow (and CPU-intensive) to fetch a single new commit from certain repos. Under the covers this is powered by a go-git `Remote.Push()` call, which uses the Keybase storage layer as the `Remote.Storer` instance and uses the local git checkout as the remote side of the connection. (So the "fetch" becomes a push from the Keybase repo to the local checkout.)

The push is implemented by first getting the advertised reference heads from the local git checkout and passing those commits as the ignore list to `revlist.Objects()`, along with the commit hashes being pushed out of the Keybase repo. The performance problem seems to happen [here](https://github.com/src-d/go-git/blob/43d17e14b714665ab5bc2ecc220b6740779d733f/plumbing/revlist/revlist.go#L24), where the code walks the entire commit history and the entire directory tree for each of those commits, in order to build up a more complete ignore list for the _real_ revlist computation a few lines below.

This works acceptably fast when the storage layer is a local git repo, but when it's a distributed storage layer like KBFS, this can be very slow.  And it seems unnecessary -- most of the ignored object IDs likely won't be used, since the tree-walking at the _real_ computation will bail out at the first `seen` object it sees while walking the tree, which for most single-commit fetches will be most of the entries of the root directory.
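To make the cost concrete, here is a rough, self-contained sketch of that two-phase shape. It does not use go-git at all: `Hash`, `Entry`, `CountingStore`, `walk`, `eagerRevlist` and `exampleStore` are all made-up names for illustration, and it only models one ignored commit and one pushed commit (with real history, phase 1 repeats for every ignored commit). The store counts tree reads, which is the expensive part on something like KBFS.

```go
package main

// Hash stands in for an object ID (hypothetical, not go-git's type).
type Hash string

// Entry is one directory entry of a tree.
type Entry struct {
	Name  string
	Hash  Hash
	IsDir bool
}

// CountingStore is a toy storage layer that counts tree reads.
type CountingStore struct {
	Trees map[Hash][]Entry
	Reads int
}

// Tree simulates one (potentially slow) read from the storage layer.
func (s *CountingStore) Tree(h Hash) []Entry {
	s.Reads++
	return s.Trees[h]
}

// walk adds every object reachable from h to out. Anything already in
// seen is recorded in hits and pruned without reading what is below it.
func walk(s *CountingStore, h Hash, isDir bool, seen, out, hits map[Hash]bool) {
	if seen[h] {
		hits[h] = true
		return
	}
	if out[h] {
		return
	}
	out[h] = true
	if !isDir {
		return
	}
	for _, e := range s.Tree(h) {
		walk(s, e.Hash, e.IsDir, seen, out, hits)
	}
}

// Result summarizes what each phase cost and produced.
type Result struct {
	Ignored, Send, Hits      map[Hash]bool
	Phase1Reads, Phase2Reads int
}

// eagerRevlist first expands the ignored root into a full ignore set
// (phase 1), then walks the pushed root and prunes at ignored objects
// (phase 2).
func eagerRevlist(s *CountingStore, oldRoot, newRoot Hash) Result {
	r := Result{Ignored: map[Hash]bool{}, Send: map[Hash]bool{}, Hits: map[Hash]bool{}}
	walk(s, oldRoot, true, map[Hash]bool{}, r.Ignored, map[Hash]bool{})
	r.Phase1Reads = s.Reads
	walk(s, newRoot, true, r.Ignored, r.Send, r.Hits)
	r.Phase2Reads = s.Reads - r.Phase1Reads
	return r
}

// exampleStore holds two commits' trees; only src/a.go differs.
func exampleStore() *CountingStore {
	return &CountingStore{Trees: map[Hash][]Entry{
		"R1": {{"src", "S1", true}, {"docs", "D", true}, {"README", "r", false}},
		"S1": {{"a.go", "a1", false}, {"pkg", "P", true}},
		"P":  {{"p.go", "p", false}},
		"D":  {{"d.md", "d", false}},
		"R2": {{"src", "S2", true}, {"docs", "D", true}, {"README", "r", false}},
		"S2": {{"a.go", "a2", false}, {"pkg", "P", true}},
	}}
}
```

In this toy repo, `eagerRevlist(exampleStore(), "R1", "R2")` spends 4 tree reads building an 8-object ignore set, while the second walk needs only 2 reads and consults just 3 of those ignored IDs (`P`, `D` and `r`) before it is done.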

I think there is a better algorithm for this. I started playing around with one [here](https://github.com/keybase/go-git/commit/1925dd710a8888f662d1a52df1fda636545b4388), but it's pretty stupid, and while it definitely improves things for Keybase, the improvement is smaller than I think it should be.

I haven't worked out a complete solution yet, as I am hoping for dev feedback first.  But I think the outline could be something like this:

1. In `revlist.Objects`, for each object in `objs`, find the most-recent parent commit ID in `ignore`. Call the set of these commits `parents`.
2. For each commit in `parents`, run a `git diff-tree`-like algorithm against the corresponding commit in `objs`. (I'm not sure if this algorithm exists yet in `go-git` anywhere.) This would traverse and compare the directory entries between the two commits and find the exact places they differ, traversing only as deeply as needed in each tree to figure out the differences.
3. The objects that the two commits have in common are added to the `ignore` list, along with all the `parents`.
4. When constructing the final set of objects for the `revlist` using this new `ignore` list, `reachableObjects` should return early as soon as it reaches a commit in the ignore list and the pending list is empty.
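As a rough illustration of steps 2 and 3, the tree comparison could look something like the sketch below. This is not go-git code: `Hash`, `Entry`, `Store`, `diffTrees` and `markNew` are hypothetical names, and the storage layer is modeled as a plain in-memory map. The point is only that identical subtrees are recognized by hash and never opened.

```go
package main

// Hash stands in for an object ID (hypothetical, not go-git's type).
type Hash string

// Entry is one directory entry of a tree.
type Entry struct {
	Name  string
	Hash  Hash
	IsDir bool
}

// Store maps tree hashes to their entries; a stand-in for the storer.
type Store map[Hash][]Entry

// diffTrees compares the tree of a parent commit (oldTree) with the
// tree of a pushed commit (newTree). Objects found in both at the same
// path go into shared; objects only in the new tree go into changed.
// It descends only into directories whose hashes differ.
func diffTrees(s Store, oldTree, newTree Hash, shared, changed map[Hash]bool) {
	if oldTree == newTree {
		shared[newTree] = true // identical subtree: no need to open it
		return
	}
	changed[newTree] = true
	old := make(map[string]Entry)
	for _, e := range s[oldTree] {
		old[e.Name] = e
	}
	for _, e := range s[newTree] {
		o, ok := old[e.Name]
		switch {
		case ok && o.Hash == e.Hash:
			shared[e.Hash] = true
		case ok && o.IsDir && e.IsDir:
			diffTrees(s, o.Hash, e.Hash, shared, changed)
		default:
			// Added, or changed type: treat everything below as new.
			// This may re-send an object that also exists at another
			// path, which is redundant but harmless.
			markNew(s, e, changed)
		}
	}
}

// markNew records e and everything beneath it as changed.
func markNew(s Store, e Entry, changed map[Hash]bool) {
	changed[e.Hash] = true
	if !e.IsDir {
		return
	}
	for _, c := range s[e.Hash] {
		markNew(s, c, changed)
	}
}
```

The `shared` set is what step 3 would add to `ignore`. It stays small because it only holds the top of each unchanged subtree, which is all the final walk needs in order to prune there.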

I could be wrong, but I think this is a more efficient algorithm in all cases.  Of course, it requires a bit more code complexity.

What do the devs think? cc: @smola @erizocosmico @mcuadros 

revlist is slow for non-local storage layers #909

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

revlist is slow for non-local storage layers #909

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions