bug: 403 errors when validating a URL #216

Closed
emmambd opened this issue Sep 8, 2022 · 7 comments · Fixed by #702

@emmambd
Contributor

emmambd commented Sep 8, 2022

What problem is your feature request trying to solve?
One of the GitHub workflow checks evaluates whether the GTFS feed can be downloaded. This check often returns requests.exceptions.HTTPError: 403 Client Error: Forbidden for url due to SSL certificate errors, even when the URL can be downloaded manually. This blocks us from adding new sources. Example here.

Other examples where this is affecting our ability to get feeds: mdb-534 http://www.centro.org/CentroGTFS/CentroGTFS.zip

https://www.fayettevillenc.gov/home/showpublisheddocument/16121/638612293378070000

Describe the solution you'd like
Add headers to the request in the Python operations (adding and updating GTFS Schedule and Realtime feeds).

How will we know when this is done?
As a user, I can add a source when the source is downloadable manually.

@emmambd emmambd self-assigned this Sep 8, 2022
@emmambd emmambd changed the title SSL errors when validating a URL bug: SSL errors when validating a URL Sep 8, 2022
@emmambd emmambd added bug Something isn't working help wanted Extra attention is needed labels Sep 8, 2022
@maximearmstrong
Contributor

It seems that part of the problem is that the request header does not identify a User-Agent, as described here. Adding a User-Agent has solved the problem for many URLs locally, so it would be useful to add it to the workflow and code.
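A minimal sketch of the User-Agent idea, for illustration only. It uses the standard library's urllib so that it has no dependencies, while the workflow itself uses requests. The `USER_AGENT` value and the `build_request` and `download` helpers are hypothetical names, not the project's actual code.

```python
import urllib.request

# Hypothetical User-Agent string; not the value used by the project.
USER_AGENT = "Mozilla/5.0 (compatible; feed-url-check/0.1)"


def build_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Build a GET request that identifies itself with a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url: str, timeout: float = 30.0) -> bytes:
    """Fetch the URL body; urlopen raises urllib.error.HTTPError on 4xx/5xx."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

With requests, the equivalent is passing the same dictionary through the `headers=` argument of `requests.get`. Some servers may still answer 403 for other reasons, so this is a mitigation rather than a guaranteed fix.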

@emmambd
Contributor Author

emmambd commented Sep 20, 2022

This PR seems to have decreased instances of the error, but it still has not removed all of them.

@emmambd emmambd changed the title bug: SSL errors when validating a URL bug: 403 errors when validating a URL Oct 6, 2022
@themightychris

themightychris commented Oct 11, 2022

The example referenced (http://datos.gob.cl/dataset/c77c9a50-6dd1-449d-b5ab-947ec0139b31/resource/a4edcf07-0657-456d-bbbc-54b2aec1de8d/download/coquimbo10feb16.zip) fails checks for complete certificate chain in a couple of popular SSL checkers:

Screen Shot 2022-10-11 at 12 15 45 PM

Screen Shot 2022-10-11 at 12 15 29 PM

It looks like it's using an SSL root that's not widely distributed yet. Fixing this would mean either updating the root certificates installed at the operating system level, or instructing the command that checks that the URLs can be downloaded to ignore SSL errors. These appear to be Amazon-issued certificates, so it's surprising that the GitHub runners don't come with them installed. Bumping the runner to ubuntu-22.04 may fix the issue, but the current runner is ubuntu-20.04, which is LTS, and it's surprising that it wouldn't have Amazon's CA backported into the default trusted root certs.

Edit: it looks like Python doesn't use system certificates by default, so this could be a matter of the Python version. This SO post shows how to tell Python to use the system certificates, which might be a good idea here: https://stackoverflow.com/a/42982144/964125
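As a rough sketch of one common way to do this (not necessarily the approach in the linked post): requests verifies against the certifi bundle by default, but honors the `REQUESTS_CA_BUNDLE` environment variable. The helper names and the list of candidate paths below are assumptions for illustration; the first path is the usual Debian/Ubuntu location.

```python
import os
from pathlib import Path
from typing import Iterable, Optional

# Assumed system CA bundle locations (Debian/Ubuntu first, then RHEL-style).
SYSTEM_CA_BUNDLES = (
    "/etc/ssl/certs/ca-certificates.crt",
    "/etc/pki/tls/certs/ca-bundle.crt",
)


def find_system_ca_bundle(candidates: Iterable[str] = SYSTEM_CA_BUNDLES) -> Optional[str]:
    """Return the first candidate path that exists as a file, or None."""
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None


def use_system_ca_bundle(candidates: Iterable[str] = SYSTEM_CA_BUNDLES) -> Optional[str]:
    """Point requests at the system bundle via REQUESTS_CA_BUNDLE, if one is found."""
    bundle = find_system_ca_bundle(candidates)
    if bundle is not None:
        os.environ["REQUESTS_CA_BUNDLE"] = bundle
    return bundle
```

Calling `use_system_ca_bundle()` before any download is made should be enough for requests to pick up the variable; alternatively, the returned path can be passed explicitly as `verify=` on a single call.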

There's also a bigger question of how strict SSL checking should be to consider a feed valid. Using the system-installed root certs that come with ubuntu-latest, rather than depending on what comes with the particular Python version being installed, is probably a good baseline. Anything failing SSL checks under that should probably be flagged as a failing feed.

@emmambd
Contributor Author

emmambd commented Oct 20, 2022

@themightychris Thank you for digging into this! Re: ubuntu, it looks like GitHub Actions hasn't yet made ubuntu-22.04 the default for ubuntu-latest, which is why it's running an older version.

I added a draft PR that points to the system certificates to see if that would have an impact on the workflow test, but it seems to not have made a difference (which could definitely be a problem on my end). Do you mind taking a look?

As a short-term solution, we've talked about ignoring the test when it fails and manually testing that the URL works and downloads a ZIP file. This is obviously not ideal, but it may help with adding feeds while we debug and evaluate the certs problem.
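The manual check could be partly scripted. The sketch below only tests whether already-downloaded bytes are a readable ZIP archive; `looks_like_zip` and `list_members` are hypothetical helper names, and how the bytes are fetched is left out.

```python
import io
import zipfile
from typing import List


def looks_like_zip(payload: bytes) -> bool:
    """Return True if the payload contains a ZIP end-of-central-directory record."""
    return zipfile.is_zipfile(io.BytesIO(payload))


def list_members(payload: bytes) -> List[str]:
    """Return the file names inside the archive; raises BadZipFile if it is not a ZIP."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        return archive.namelist()
```

An HTML error page returned with a 200 status would fail `looks_like_zip`, which is the case a plain status-code check would miss.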

@dancory-urbanfootprint
Contributor

ubuntu-latest was upgraded to 22.04 in Nov 2022: actions/runner-images#6512
The 24.04 rollout will begin on December 5th and will complete on January 17th, 2025: actions/runner-images#10636

@dancory-urbanfootprint
Contributor

These URLs could all be added as http instead of https for now, so at least we'd have the data. I checked, and they all work over http except transporlis.

Hopefully all URLs in the database will be changed to https at some point.

@emmambd
Contributor Author

emmambd commented Jan 27, 2025

Thanks for the further investigation on this @dancory-urbanfootprint! We're exploring resolving the core issue right now so that fewer feeds are blocked regardless of HTTP settings. What you suggest is a good workaround, though, if we run into issues. cc @AlfredNwolisa
