Action sporadically times out when fetching Ubuntu #260
Comments
This is happening almost 1 out of every 5 times for us. Seems like something that might be getting rate limited or has just slowed down. |
This happens to us regularly:
Is there a way to add a retry? |
I'd say we're hitting this on about 5-10% of action invocations, multiplied by several jobs in a workflow all using the same action. It's causing a painfully high incidence of spurious CI failures. |
Yes, there is retry functionality in the toolkit we use; we may also need to consider a new/better caching strategy. |
Hey all, we actually already retry :) I had to take a second to look at the code; we can increase this number and possibly make it configurable. I'll open up a PR and do some experimentation. |
@ericmj or @wojtekmach do you know of any ongoing issues with Hex in regards to where the builds.txt files reside? It may be nothing on your end, and just problems inside GitHub runners. |
Additional problem: their HTTP client interface does not currently retry 408 (request timeout); instead it just throws an error. Thus, even if we increase the number of retries in the HTTP client itself, it will make no difference. The only thing we can do, it seems, is wait for this to be merged and released by GitHub, or add retries ourselves. We might be able to use their retry helper. Here are the options at our disposal:
@paulo-ferraz-oliveira Have any thoughts on this? Edit: additional note: we may have to add retry logic ourselves. Without going through what tool-cache will and will not retry on, when I looked at the error in the screenshot above and the one in #261, this appears to be a socket timeout rather than a 408. |
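To make the idea concrete, here is a minimal sketch of what adding our own retries could look like, assuming we wrap the toolkit's `HttpClient` directly; the `fetchWithRetry` name, attempt count, and backoff values are hypothetical and not anything the action currently implements:

```typescript
// Hypothetical sketch: retry a GET ourselves, since the toolkit's http-client
// does not retry 408s and socket timeouts surface as thrown errors.
import {HttpClient} from '@actions/http-client'

async function fetchWithRetry(
  url: string,
  attempts = 5,
  baseDelayMs = 1000
): Promise<string> {
  const client = new HttpClient('setup-beam')
  let lastError: unknown
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const response = await client.get(url)
      const status = response.message.statusCode ?? 0
      if (status >= 200 && status < 300) {
        return await response.readBody()
      }
      lastError = new Error(`HTTP ${status} fetching ${url}`)
    } catch (err) {
      // Socket timeouts land here as thrown errors rather than a 408 response.
      lastError = err
    }
    // Simple linear backoff between attempts.
    await new Promise(resolve => setTimeout(resolve, baseDelayMs * attempt))
  }
  throw lastError
}
```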
We are not aware of any issues other than GitHub having trouble fetching it. |
This 408 error is interesting; where is it coming from? We do not explicitly return it in our custom CDN code, and I cannot find any documentation about Fastly returning it. |
I fired off too fast here; this seems like the socket timing out rather than a 408 being returned. I could be wrong, but the HTTP client code in GitHub's actions toolkit reads that way. As such, there's not much we can do besides retry ourselves. |
As an aside, the action could have a built-in cache (if it already does, please ignore this whole comment) for the build artifacts as well as builds.txt, so it would mostly download from GH infra as opposed to repo.hex.pm, increasing performance and reliability. I suppose builds.txt ideally stays fresh to resolve versions, so maybe it'd be a fallback to the cache unless version matching is strict. This does not solve the network problem directly; however, for projects with a warm cache, which I think would be a lot of projects, it kind of does. |
This is where my head is at. There's no build information on GH right now, though? But yes, cache builds.txt; if there is a build not in what is cached, then invalidate the cache and attempt to obtain a fresh copy. We could cache builds.txt for, say, 24h (it isn't updated too frequently). |
could you elaborate?
Just to be clear, I meant using the actions/cache API, and quickly looking at the docs it doesn't have a TTL option. On second thought, caching builds.txt is maybe not such a great idea. The point is that if we already have a fresh OTP build, there's no point in hitting builds.txt to resolve versions. So scratch that. The point still stands, though: if we have a built-in cache, we make it less likely to hit the network outside GH infra and run into issues. |
What I meant is
Yes, agreed, and on inspection there is no TTL option; we'd simply overwrite the cache if there is no corresponding build there. I also agree with your other point: if we already have the cached artifact, then there's no point in even looking at builds.txt, but yes, it should still be cached in order to mitigate network boo-boos. |
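For reference, a rough sketch of the kind of built-in caching being discussed, using the `@actions/cache` and `@actions/tool-cache` toolkit APIs; the cache key layout, file names, and download URL below are made up for illustration and are not what setup-beam actually does today:

```typescript
// Hypothetical sketch: restore a previously downloaded OTP build from the
// GitHub Actions cache and only hit repo.hex.pm on a cache miss.
import * as cache from '@actions/cache'
import * as tc from '@actions/tool-cache'

async function getOtpBuild(otpVersion: string, osVersion: string): Promise<string> {
  const archive = `otp-${osVersion}-${otpVersion}.tar.gz`   // illustrative file name
  const key = `setup-beam-otp-${osVersion}-${otpVersion}`   // illustrative cache key

  // Try the GitHub-hosted cache first; a hit avoids the call to repo.hex.pm entirely.
  const hit = await cache.restoreCache([archive], key)
  if (!hit) {
    // Cache miss: download from repo.hex.pm as usual, then save for next time.
    await tc.downloadTool(
      `https://repo.hex.pm/builds/otp/${osVersion}/${otpVersion}.tar.gz`, // illustrative URL
      archive
    )
    await cache.saveCache([archive], key)
  }
  return archive
}
```

Whether builds.txt itself should be cached (and how to invalidate it) is the open question from the discussion above.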
Right, builds.txt is not on GitHub but on repo.hex.pm. There is a link to it on github.com/hexpm/bob. |
@starbelly, my thoughts on this are ( |
Hello all! This is happening commonly for our projects. It would be really helpful if anything could be done to help with the instability of this action 🙏 |
I'm not sure there's anything unstable about the action itself; it's rather GitHub (as I see it), and perhaps the http-client in the toolkit. Based on the conversations above, what would you like to see? Caching is the only viable course, I believe, but it has its own problems, though builds.txt should change very infrequently, assuming that is your main issue. |
This action is definitely abnormally unstable. |
I'm running a test locally whereby builds.txt is fetched every 10 seconds; it will fail if the fetch fails. My thoughts here are still that, for some reason, GitHub has problems fetching from Hex sometimes (although I rarely see this myself). The results shall be interesting; I will let this run for, say, 12 hours (every 10 seconds). |
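For context, roughly what that local probe looks like, sketched in TypeScript; the 10-second interval matches the description above, while the URL and failure handling are just illustrative:

```typescript
// Rough sketch of the local experiment: fetch builds.txt every 10 seconds and
// stop on the first failure, to see whether timeouts reproduce outside GitHub runners.
import {HttpClient} from '@actions/http-client'

const url = 'https://repo.hex.pm/builds/otp/ubuntu-22.04/builds.txt' // illustrative URL
const client = new HttpClient('builds-txt-probe')

async function probe(): Promise<void> {
  for (let i = 1; ; i++) {
    const response = await client.get(url)
    const status = response.message.statusCode
    await response.readBody()
    if (status !== 200) {
      throw new Error(`fetch ${i} failed with HTTP ${status}`)
    }
    console.log(`fetch ${i} ok`)
    await new Promise(resolve => setTimeout(resolve, 10_000))
  }
}

probe().catch(err => {
  console.error(err)
  process.exit(1)
})
```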
My tests aside (as noted above, they are for my own curiosity), would it be possible to publish builds.txt to GitHub when it's made available on repo.hex.pm? I suspect that everything that hits GitHub is going to have better odds. Edit: meant to ping @wojtekmach |
This has run for over 12 hours without problems on my machine. This doesn't tell us what goes wrong inside GitHub, of course, when hitting repo.hex.pm (Fastly), but it does give some merit to the notion that the timeout events experienced are isolated to that environment. There are at least 3 possibilities here:
My curiosity is satisfied regardless. I think the best we can do is either cache builds.txt or have the hexpm team push those files to GitHub on release. |
could you elaborate? On release of what? |
Whenever builds are updated, and builds.txt is updated. It seems, for whatever reason, this is what people see fail the most. The hypothesis here is that if it were on GitHub, then this would cease to be a problem (or be less likely anyway). Yet, that's only a hypothesis. |
Got it, thanks. I don't think there is a great place on GitHub to attach these builds to, though. There are hacks like a repo that stores builds in git or a repo that stores builds in "fake" releases (i.e. releases follow OTP releases but the underlying repo doesn't actually change), but neither sounds great. |
You will find no disagreement here, just trying to obviate caching 😄 It also doesn't fix the problem at the source. Maybe we just need to knock on github's door. |
I think caching is the way to go. If anyone is able to add basic built-in build caching I think it will go a long way. Perhaps it works just on |
@wojtekmach I suppose. I think we already stated a PR for this would be welcome. |
There's a linked pull request with a potential fix for this, if y'all wanna test it. Instead of |
The bug
On occasion, while running the action, a request to fetch the underlying Ubuntu image times out. Due to the timeout, the action fails. To resolve the timeouts, we can attempt to increase the request timeout for fetching the Ubuntu image.
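As an illustration only, the toolkit's `@actions/http-client` accepts request options such as a socket timeout and its own retry settings when constructed directly; whether setup-beam can thread these through its download path is a separate question, so treat the values below as placeholders:

```typescript
// Hypothetical sketch: construct the toolkit's HttpClient with a longer socket
// timeout and its built-in retries enabled, rather than relying on the defaults.
import {HttpClient} from '@actions/http-client'

const client = new HttpClient('setup-beam', undefined, {
  socketTimeout: 60_000, // placeholder: 60s instead of the default
  allowRetries: true,    // enable the client's own retries (these do not cover every failure mode discussed above)
  maxRetries: 3          // placeholder retry count
})

async function fetchBuildsTxt(url: string): Promise<string> {
  const response = await client.get(url)
  return response.readBody()
}
```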
Software versions
A list of software versions where the bug is apparent, as detailed as possible:
setup-beam: @v1
ubuntu: ubuntu-22.04
How to replicate
erlef/setup-beam@v1
Expected behaviour
erlef/setup-beam@v1
Additional context