Unable to crawl heavy Javascript based website #1148

Open
deathofabat opened this issue Feb 20, 2025 · 2 comments

Comments

@deathofabat

Any tips on crawling the websites below?

https://www.samsung.com/us/smartphones/galaxy-s25-ultra/compare/
https://www.apple.com/macbook-pro/compare/

So far, I'm only able to get gibberish content in the Jina Reader API response.

@nomagick
Member

Hi.
For the two pages you mention, JavaScript is not the problem.
The problem is the semantic content of the two pages: they are structured to repeat similar content in a row.

At this point, there's not much Reader can do about it.
Maybe try x-return-format: pageshot to get a graphical screenshot of the page, then present the screenshot to LLMs together with the text content.
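
For reference, a minimal sketch of what sending that header could look like in Python. The `x-return-format: pageshot` header comes from the suggestion above; the `https://r.jina.ai/` endpoint prefix, the helper names, and the timeout are assumptions for illustration and are not stated in this thread. The shape of the response (image bytes or a reference to the image) is also not confirmed here, so the sketch just returns the raw body.

```python
import urllib.request

# Assumption: Reader is reached by prefixing the target URL with this endpoint.
READER_ENDPOINT = "https://r.jina.ai/"


def build_pageshot_request(target_url: str) -> urllib.request.Request:
    """Build (without sending) a Reader request that asks for a page screenshot."""
    return urllib.request.Request(
        READER_ENDPOINT + target_url,
        # Header name and value as suggested in the comment above.
        headers={"x-return-format": "pageshot"},
    )


def fetch_pageshot(target_url: str, timeout: float = 60.0) -> bytes:
    """Send the request and return the raw response body.

    The body format is not confirmed in this thread, so the caller
    should inspect it before treating it as an image.
    """
    request = build_pageshot_request(target_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Example call (performs a network request, so it is left commented out):
# body = fetch_pageshot("https://www.apple.com/macbook-pro/compare/")
```

The request building is kept separate from the network call so the header and URL can be checked without hitting the service.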

@deathofabat
Author

This is still not working for me.
Is there any other guidance on scraping content from these types of webpages (which are heavily JS-based or make heavy use of AJAX calls)?
