Resolve occasional "not our ref" errors in Prow on git clone/fetch operations #16130

Closed
clarketm opened this issue Feb 5, 2020 · 9 comments
Labels: kind/feature, lifecycle/rotten
clarketm (Contributor) commented Feb 5, 2020


Problem:

Prow occasionally fails to git clone or git fetch with the following error:

remote error: upload-pack: not our ref $SHA

where $SHA is a commit SHA that the remote advertised during ref advertisement but that is no longer available by the time the objects are fetched.

Related Issues:

Explanation from GitHub

In Git, "not our ref" errors occur when a client performing a fetch has to make multiple requests to negotiate with the server over which Git objects the server needs to return. The error typically occurs when a different process pushes to the repository between these requests. Git traditionally refuses to serve objects which aren't the tip of a reference (i.e. branch, tag, etc.). The problem is that the Git-over-HTTP/S protocol is stateless: the ref advertisement and the actual object fetch are split into two requests, so coming back to the same server with an intermediate write in between may cause the second request to see a different state.

Git's HTTP/S protocol does try to handle this case by allowing any reachable commit to be fetched (so an intermediate write which advances a branch won't cause the problem). However, that doesn't work if a branch is rewound or deleted.

Solutions

  • Use the SSH protocol rather than HTTP/HTTPS for your Git fetches. The SSH protocol does not involve multiple separate requests, so it is not susceptible to this problem. This would be the simplest and most reliable solution, provided you are able to configure SSH authentication for your build servers.

  • Be prepared to retry the fetch if this error occurs. It is a harmless race condition that only occurs when another process simultaneously pushes to one of the references that you are trying to fetch, so a retry is likely to succeed. (Extend retryCmd with configurable retry and backoff factor #16107) A minimal sketch of this approach appears after this list.

  • Fetch only the references that you need. For example, if you are only testing master, then try fetching only master. Such a fetch won't be disturbed if somebody else pushes to a different branch at the same time. In particular, avoid fetching all of refs/pull/* (for example, don't clone with --mirror), as those references change quite frequently.

  • Lastly, you could use Git's v2 protocol. It relaxes the reachability requirement entirely and should make the problem go away. You would want to do something like git config --global protocol.version 2 as the user that performs the fetches (or drop the --global to set it only inside the one problematic repo's config). The v2 protocol will become the default in a future version of Git, possibly the next release, but that is not confirmed yet.
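
Not from the issue itself: a minimal Go sketch of what option 2 (retry with exponential backoff) might look like, in the spirit of the retryCmd extension proposed in #16107. The function name, parameters, and backoff schedule are assumptions for illustration, not Prow's actual API.

```go
// Hypothetical sketch: retry a git fetch with exponential backoff when the
// remote reports "not our ref". Names and parameters are illustrative;
// Prow's real retry logic lives in its own git/clonerefs packages.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// fetchWithRetry runs `git fetch <args>` in dir, retrying with exponential
// backoff when the transient "not our ref" race is hit.
func fetchWithRetry(dir string, args []string, maxRetries int, baseDelay time.Duration) error {
	var lastErr error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		cmd := exec.Command("git", append([]string{"fetch"}, args...)...)
		cmd.Dir = dir
		out, err := cmd.CombinedOutput()
		if err == nil {
			return nil
		}
		lastErr = fmt.Errorf("git fetch failed: %v: %s", err, out)
		// Only retry the known-transient race; surface any other error immediately.
		if !strings.Contains(string(out), "not our ref") {
			return lastErr
		}
		if attempt < maxRetries {
			time.Sleep(baseDelay << attempt) // e.g. 2s, 4s, 8s, ... for baseDelay = 2s
		}
	}
	return lastErr
}

func main() {
	// Fetch only master (option 3) and retry on the race (option 2).
	if err := fetchWithRetry(".", []string{"origin", "master"}, 3, 2*time.Second); err != nil {
		fmt.Println(err)
	}
}
```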

@clarketm clarketm added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 5, 2020
@clarketm clarketm changed the title Resolve occasionaly "not our ref" errors in Prow on git clone or fetch operations Resolve occasional "not our ref" errors in Prow on git clone or fetch operations Feb 5, 2020
@k8s-ci-robot k8s-ci-robot assigned clarketm and unassigned clarketm Feb 5, 2020
clarketm (Contributor, Author) commented Feb 5, 2020

I would appreciate opinions from Prow and @kubernetes/sig-testing folks regarding the best long-term solution. The solutions are described in the issue description.

petr-muller (Member) commented

I like the git v2 protocol option, but I don't have enough information about how stable it currently is, and how stable it is in the git client versions people are likely to use. Otherwise, I think the retry is a fine way to tackle this. Configuring SSH is a mess. The third option sounds like it would only reduce occurrences but not get rid of all of them?

cjwagner (Member) commented Feb 5, 2020

I like the 3rd and 4th options.

The third option sounds like it would only reduce occurrences but not get rid of all of them?

I think this would actually get rid of occurrences, since PR branches are really the only place where "a branch is rewound or deleted", which GH's response suggests is the cause of the error. It sounds like git clone --mirror simply doesn't behave well with repos that have lots of active PRs. Even if there are lots of feature or release branches, those shouldn't cause problems unless they are force-pushed or deleted, which should be very rare.

If we can seamlessly switch to git v2 without any other changes that definitely sounds like the best option though.

@clarketm clarketm changed the title Resolve occasional "not our ref" errors in Prow on git clone or fetch operations Resolve occasional "not our ref" errors in Prow on git clone/fetch operations Feb 5, 2020
clarketm (Contributor, Author) commented Feb 14, 2020

sgtm.

I will implement the 4th option (i.e. the v2 protocol) and fall back to the 3rd option (i.e. fetch only what is needed, without a mirror clone) if the former is not portable; a rough sketch of this approach is below.

In the meantime, can we go with the 2nd option (i.e. increase the number of retries) to mitigate (#16107)? This error is only really a problem for larger repos (e.g. istio/istio) with more source/refs, so a few extra retries with exponential backoff could greatly improve the user experience – and do so immediately.
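
Not part of the original comment: a rough Go sketch of the plan above (v2 protocol plus a targeted fetch), driving the git CLI directly. The repository path, remote name, and branch are assumptions for illustration; Prow's actual git client code differs.

```go
// Illustrative only: enable Git protocol v2 for one repository and fetch a
// single ref instead of mirroring all of refs/*. This is not Prow's code.
package main

import (
	"fmt"
	"os/exec"
)

// run executes a git command in dir and returns a wrapped error on failure.
func run(dir string, args ...string) error {
	cmd := exec.Command("git", args...)
	cmd.Dir = dir
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("git %v: %v: %s", args, err, out)
	}
	return nil
}

func main() {
	repo := "/tmp/istio" // hypothetical checkout location

	// Option 4: opt this one repo into the v2 wire protocol (no --global needed).
	if err := run(repo, "config", "protocol.version", "2"); err != nil {
		fmt.Println(err)
		return
	}

	// Option 3 as the fallback: fetch only the branch under test, not refs/pull/*,
	// so concurrent pushes to other refs can't disturb the fetch.
	if err := run(repo, "fetch", "origin", "master"); err != nil {
		fmt.Println(err)
	}
}
```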

clarketm (Contributor, Author) commented

/assign

fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 14, 2020
fejta-bot commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 13, 2020
fejta-bot commented

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot (Contributor) commented

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
