Resolve occasional "not our ref" errors in Prow on git clone/fetch operations #16130

Closed
clarketm opened this issue Feb 5, 2020 · 9 comments
Labels: kind/feature, lifecycle/rotten
clarketm (Contributor) commented Feb 5, 2020


Problem:

Prow occasionally fails to git clone or git fetch with the following error:

remote error: upload-pack: not our ref $SHA

where $SHA is a commit SHA that the remote advertised during ref advertisement but that is no longer available by the time the objects are fetched.

Related Issues:

Explanation from GitHub

In Git, "not our ref" errors occur when a client performing a fetch has to make multiple requests to negotiate with the server over which Git objects the server needs to return. The error typically occurs when a different process pushes to the repository between these requests. Git traditionally refuses to serve objects which aren't the tip of a reference (i.e. branch, tag, etc.). The problem is that the Git-over-HTTP/S protocol is stateless: the ref advertisement and the actual object fetch are split into two requests, so coming back to the same server with an intermediate write in between may cause the second request to see a different state.

Git's HTTP/S protocol does try to handle this case by allowing any reachable commit to be fetched (so an intermediate write which advances a branch won't cause the problem). However, that doesn't work if a branch is rewound or deleted.

Solutions

  • Use the SSH protocol rather than HTTP/HTTPS for your Git fetches. The SSH protocol does not involve multiple separate requests, so it is not susceptible to this problem. This would be the simplest and most reliable solution, provided you are able to configure SSH authentication for your build servers.

  • Be prepared to retry the fetch if this error occurs. It is a harmless race condition that only occurs when another process simultaneously pushes to one of the references that you are trying to fetch, so a retry is likely to succeed. (Extend retryCmd with configurable retry and backoff factor #16107) A minimal sketch of this approach appears after this list.

  • Fetch only the references that you need. For example, if you are only testing master, then try fetching only master. Such a fetch won't be disturbed if somebody else pushes to a different branch at the same time. In particular, avoid fetching all of refs/pull/* (for example, don't clone with --mirror), as those references change quite frequently.

  • Lastly, you could use Git's v2 protocol. It relaxes the reachability requirement entirely and should make the problem go away. You would want to do something like git config --global protocol.version 2 as the user that performs the fetches (or drop the --global to set it only inside the one problematic repo's config). The v2 protocol will become the default in a future version of Git, possibly the next release, but that is not confirmed yet.
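
Not from the issue itself: a minimal Go sketch of what option 2 (retry with exponential backoff) might look like, in the spirit of the retryCmd extension proposed in #16107. The function name, parameters, and backoff schedule are assumptions for illustration, not Prow's actual API.

```go
// Hypothetical sketch: retry a git fetch with exponential backoff when the
// remote reports "not our ref". Names and parameters are illustrative;
// Prow's real retry logic lives in its own git/clonerefs packages.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// fetchWithRetry runs `git fetch <args>` in dir, retrying with exponential
// backoff when the transient "not our ref" race is hit.
func fetchWithRetry(dir string, args []string, maxRetries int, baseDelay time.Duration) error {
	var lastErr error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		cmd := exec.Command("git", append([]string{"fetch"}, args...)...)
		cmd.Dir = dir
		out, err := cmd.CombinedOutput()
		if err == nil {
			return nil
		}
		lastErr = fmt.Errorf("git fetch failed: %v: %s", err, out)
		// Only retry the known-transient race; surface any other error immediately.
		if !strings.Contains(string(out), "not our ref") {
			return lastErr
		}
		if attempt < maxRetries {
			time.Sleep(baseDelay << attempt) // e.g. 2s, 4s, 8s, ... for baseDelay = 2s
		}
	}
	return lastErr
}

func main() {
	// Fetch only master (option 3) and retry on the race (option 2).
	if err := fetchWithRetry(".", []string{"origin", "master"}, 3, 2*time.Second); err != nil {
		fmt.Println(err)
	}
}
```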

@clarketm clarketm added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 5, 2020
@clarketm clarketm changed the title Resolve occasionaly "not our ref" errors in Prow on git clone or fetch operations Resolve occasional "not our ref" errors in Prow on git clone or fetch operations Feb 5, 2020
@k8s-ci-robot k8s-ci-robot assigned clarketm and unassigned clarketm Feb 5, 2020
clarketm (Contributor, Author) commented Feb 5, 2020

I would appreciate opinions from Prow and @kubernetes/sig-testing folks regarding the best long-term solution. The solutions are described in the issue description.

petr-muller (Member) commented

I like the git v2 protocol option, but I don't have enough information about how stable it currently is, and how stable it is in the git client versions people are likely to use. Otherwise, I think the retry is a fine way to tackle this. Configuring SSH is a mess. The third option sounds like it would only reduce occurrences but not get rid of all of them?

cjwagner (Member) commented Feb 5, 2020

I like the 3rd and 4th options.

The third option sounds like it would only reduce occurrences but not get rid of all of them?

I think this would actually get rid of occurrences, since PR branches are really the only place where "a branch is rewound or deleted", which GH's response suggests is the cause of the error. It sounds like git clone --mirror simply doesn't behave well with repos that have lots of active PRs. Even if there are lots of feature or release branches, those shouldn't cause problems unless they are force-pushed or deleted, which should be very rare.

If we can seamlessly switch to git v2 without any other changes that definitely sounds like the best option though.

@clarketm clarketm changed the title Resolve occasional "not our ref" errors in Prow on git clone or fetch operations Resolve occasional "not our ref" errors in Prow on git clone/fetch operations Feb 5, 2020
clarketm (Contributor, Author) commented Feb 14, 2020

sgtm.

I will implement the 4th option (i.e. the v2 protocol) and fall back to the 3rd option (i.e. fetch only what is needed, without a mirror clone) if the former is not portable; a rough sketch of this approach is below.

In the meantime, can we go with the 2nd option (i.e. increase the number of retries) to mitigate (#16107)? This error is only really a problem for larger repos (e.g. istio/istio) with more source/refs, so a few extra retries with exponential backoff could greatly improve the user experience – and do so immediately.
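
Not part of the original comment: a rough Go sketch of the plan above (v2 protocol plus a targeted fetch), driving the git CLI directly. The repository path, remote name, and branch are assumptions for illustration; Prow's actual git client code differs.

```go
// Illustrative only: enable Git protocol v2 for one repository and fetch a
// single ref instead of mirroring all of refs/*. This is not Prow's code.
package main

import (
	"fmt"
	"os/exec"
)

// run executes a git command in dir and returns a wrapped error on failure.
func run(dir string, args ...string) error {
	cmd := exec.Command("git", args...)
	cmd.Dir = dir
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("git %v: %v: %s", args, err, out)
	}
	return nil
}

func main() {
	repo := "/tmp/istio" // hypothetical checkout location

	// Option 4: opt this one repo into the v2 wire protocol (no --global needed).
	if err := run(repo, "config", "protocol.version", "2"); err != nil {
		fmt.Println(err)
		return
	}

	// Option 3 as the fallback: fetch only the branch under test, not refs/pull/*,
	// so concurrent pushes to other refs can't disturb the fetch.
	if err := run(repo, "fetch", "origin", "master"); err != nil {
		fmt.Println(err)
	}
}
```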

clarketm (Contributor, Author) commented

/assign

fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 14, 2020
fejta-bot commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 13, 2020
fejta-bot commented

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot (Contributor) commented

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
