Better solutions to web.archive.org rate limiting #21

jclark1913 · 2023-11-07T20:57:50Z

Overview

The tool works best when given smaller requests of <10 urls and a snapshot limit of <500. Currently, the asyncio library's built-in semaphore does an OK job of avoiding rate limiting when kept to these recommended parameters, but I wonder if there is a better or more dynamic way to deal with this? The problem does not appear to be with the CDX API itself; rather, making numerous requests to web.archive.org when fetching snapshots triggers a temporary ban. All in all, I'm finding web.archive.org to be a bit unpredictable, and I cannot find consistent documentation for making requests to the site.

Possible solutions

Incorporating a library w/ exponential delays

There are some Python libraries like Backoff and aiohttp_retry that provide wrappers for handling rate limiting. I've experimented with both, but wasn't able to get large requests (>50 urls + >1000 limit) to work reliably.
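For reference, here is a minimal standard-library sketch of the exponential-delay idea those libraries wrap. It does not use Backoff or aiohttp_retry themselves. `RateLimited`, `backoff_delay`, and `retry_with_backoff` are hypothetical names, and the `base` and `cap` values are assumptions rather than tuned numbers.

```python
import asyncio
import random


class RateLimited(Exception):
    """Hypothetical exception: raised by the caller's fetch function on HTTP 429."""


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Upper bound in seconds for the wait before retry `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


async def retry_with_backoff(fetch, max_tries=5, base=1.0, cap=60.0,
                             sleep=asyncio.sleep):
    """Await `fetch()` until it succeeds or `max_tries` attempts are used up."""
    for attempt in range(max_tries):
        try:
            return await fetch()
        except RateLimited:
            if attempt == max_tries - 1:
                raise  # out of attempts: let the caller decide what to do
            # "Full jitter": wait a random time up to the exponential bound,
            # so concurrent tasks do not all retry at the same moment.
            await sleep(random.uniform(0, backoff_delay(attempt, base, cap)))
```

With these assumed defaults the bound grows 1s, 2s, 4s, 8s and is capped at 60s. `fetch` is any zero-argument coroutine function that raises `RateLimited` when the server answers 429.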

Custom solution

There might be a way to determine the best parameters based on the size of the request. Such a solution might dynamically generate a semaphore value or incorporate some kind of jitter between calls, or maybe pause the operation and prompt the user to wait 5 minutes before attempting to resume.
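A rough sketch of what the dynamic semaphore plus jitter could look like. The thresholds in `pick_concurrency` are made-up placeholders, not measured values, and `fetch_one` stands in for whatever coroutine actually downloads a snapshot; neither is part of the current tool.

```python
import asyncio
import random


def pick_concurrency(n_urls, snapshot_limit, max_concurrency=8):
    """Hypothetical heuristic: allow fewer concurrent requests as the job grows."""
    total = n_urls * snapshot_limit  # worst-case number of snapshot fetches
    if total <= 500:
        return max_concurrency
    if total <= 5000:
        return max(1, max_concurrency // 2)
    return 1


async def fetch_all(urls, fetch_one, snapshot_limit, max_jitter=0.5,
                    sleep=asyncio.sleep):
    """Fetch every URL under a semaphore sized from the request, with jitter."""
    sem = asyncio.Semaphore(pick_concurrency(len(urls), snapshot_limit))

    async def guarded(url):
        async with sem:
            # Random pause so requests do not hit the server in lockstep.
            await sleep(random.uniform(0, max_jitter))
            return await fetch_one(url)

    # gather() returns results in the same order as `urls`.
    return await asyncio.gather(*(guarded(u) for u in urls))
```

The cut-offs would need to be calibrated against real runs; the point is only that the semaphore value can be derived from the request size instead of being fixed.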

msramalho · 2023-11-27T12:08:45Z

So I recently discovered that the CDX API has the following rate limit logic:
Requests are limited to an average of 60/min. Over that, we start getting 429s. If 429s are ignored for more than a minute, the IP gets blocked for 1 hour. Subsequent 429s over a given period will double that time on each occurrence.
So ideally, if we can keep API requests under 60/minute, we will prevent this from happening.
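One way to stay under that average is to pace requests with a minimum interval between them. The sketch below is an assumption-laden illustration: `MinIntervalLimiter` is a hypothetical helper, and the default of 50 requests per minute is an arbitrary safety margin below the 60/min figure above.

```python
import asyncio
import time


class MinIntervalLimiter:
    """Spaces calls at least `60 / per_minute` seconds apart (hypothetical helper)."""

    def __init__(self, per_minute=50, clock=time.monotonic, sleep=asyncio.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_slot = 0.0  # earliest time the next request may start

    async def wait(self):
        # There is no await between reading and updating _next_slot, so tasks
        # on a single event loop cannot reserve the same slot.
        now = self._clock()
        delay = max(0.0, self._next_slot - now)
        self._next_slot = max(now, self._next_slot) + self.interval
        if delay > 0:
            await self._sleep(delay)
```

Each task would call `await limiter.wait()` on a shared limiter right before issuing its request. This caps the request rate regardless of how many tasks the semaphore lets through.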

jclark1913 added the enhancement label (New feature or request) on Nov 7, 2023

msramalho added the help wanted label (Extra attention is needed) on Apr 16, 2024
