Cached parsing
#2881
Replies: 1 comment
-
Hello, and thanks for your interest in Crawlee! This use case is not explicitly supported by the framework, but it should be fairly easy to achieve anyway. For caching websites, I'd recommend a standalone caching proxy such as https://www.npmjs.com/package/@loopback/http-caching-proxy. Then you can simply repeat steps 2 and 3, overwriting the results each time, and adjust your parsing until you're happy with the result. Does that work for you?
-
Hi, is it possible to split the whole crawling process into 3 steps?
This approach is very handy when pages are very unstructured: you can freely experiment with the parsing function (in step 2) without making real requests to the website.