Skip to content
This repository was archived by the owner on Sep 11, 2020. It is now read-only.
This repository was archived by the owner on Sep 11, 2020. It is now read-only.

revlist is slow for non-local storage layers #909

Open
@strib

Description

@strib

Over in keybase/client#12730, we have a user of Keybase Git who noticed it can be very slow (and CPU-intensive) to fetch a single new commit from certain repos. Under the cover this is powered by a go-git Remote.Push() call, which uses the Keybase storage layer as the Remote.Storer instance, and uses the local git checkout as the remote side of the connection. (So the "fetch" becomes a push from the Keybase repo to the local checkout.)

The push is implemented by first getting the advertised reference heads from the local git checkout, and passing those commits as the ignore list to revlist.Objects(), along with the commit hashes being pushed out of Keybase repo. The performance problem seems to happen here, which walks the entire commit history and the entire directory tree for each of the commits, in order to build up a more complete ignore list for the real revlist computation a few lines below.

This works acceptably fast when the storage layer is a local git repo, but when it's a distributed storage layer like KBFS, this can be very slow. And it seems unnecessary -- most of the ignored object IDs likely won't be used, since the tree-walking at the real computation will bail out at the first seen object it sees while walking the tree, which for most single-commit fetches will be most of the entries of the root directory.

I think there is a better algorithm for this. I started playing around with one here, but it's pretty stupid and while it does definitely improves things for Keybase, it's not by as much as I think it should be able to.

I haven't worked out a complete solution yet, as I am hoping for dev feedback first. But I think the outline could be something like this:

  1. In revlist.Objects, for each object in objs, find the most-recent parent commit ID in ignore. Call the set of these commits parents.
  2. For each commit in parents, compute a git diff-tree-like algorithm against the corresponding commit in objs. (I'm not sure if this algorithm exists yet in go-git anywhere.) This would traverse and compare the directory entries between the two commits, and find the exact places they differ (traversing only as deeply as needed in each tree to figure out the differences).
  3. The list of similarities between the two are added to the ignore list, along with all the parents.
  4. When constructing the final set of objects for the revlist using this new ignore list, reachableObjects should return early as soon as it reaches a commit in the ignore list, and the pending list is empty.

I could be wrong, but I think this is a more efficient algorithm in all cases. Of course, it requires a bit more code complexity.

What do the devs think? cc: @smola @erizocosmico @mcuadros

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions