Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better solutions to web.archive.org rate limiting #21

Open
jclark1913 opened this issue Nov 7, 2023 · 1 comment
Open

Better solutions to web.archive.org rate limiting #21

jclark1913 opened this issue Nov 7, 2023 · 1 comment
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@jclark1913
Copy link
Collaborator

Overview

The tool works best when given smaller requests of <10 urls and a snapshot limit of <500. Currently, the asyncio library's build in semaphore does an ok job of avoiding rate limiting when kept to these recommended parameters, but I wonder if there is a better or more dynamic way to deal with this issue? The issue does not appear to be with the CDX api itself, but rather a larger issue with making numerous requests to web.archive.org when getting snapshots that causes a temporary ban. All in all, I'm finding web.archive.org to be a bit unpredictable and cannot find consistent documentation for making requests to the site.

Possible solutions

Incorporating a library w/ exponential delays

There are some Python libraries like Backoff and aiohttp_retry that provide some wrappers for dealing with getting rate limited. I've messed around with both, but wasn't able to get large requests (>50 urls + >1000 limit) to work reliably.

Custom solution

There might be a way to determine the best parameters based on the size of the request. Such a solution might dynamically generate a semaphore value or incorporate some kind of jitter between calls, or maybe pause the operation and prompt the user to wait 5 minutes before attempting to resume.

@jclark1913 jclark1913 added the enhancement New feature or request label Nov 7, 2023
@msramalho
Copy link
Contributor

So I recently discovered that the cdx api has the following rate limit logic:
Requests are limited to an average of 60/min. Over that and we start getting 429s. If 429s are ignored for more than a minute the IP gets blocked for 1 hour. Subsequent 429s over a given period will double that time each occurrence.
So ideally, If we can keep the api request < 60/minute we will prevent this from happening.

@msramalho msramalho added the help wanted Extra attention is needed label Apr 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants