feat(academy): add advanced crawling section with sitemaps and search #1217

Open
wants to merge 6 commits into
base: master

Conversation

metalwarrior665
Member

No description provided.

@metalwarrior665
Member Author

metalwarrior665 commented Sep 17, 2024

Will process the lint issues soon

@fnesveda added the t-academy label (Issues related to Web Scraping and Apify academies.) on Sep 18, 2024
@metalwarrior665
Member Author

@TC-MO If we change the URL of an article, do I need to contact the web team to set up a hard redirect?

@TC-MO
Contributor

TC-MO commented Sep 18, 2024

I think we do redirects in the nginx.conf file; I'm not sure if there is any other way.

@metalwarrior665
Member Author

TODO redirect
..._web_scraping/scraping_paginated_sites.md → ...scraping/crawling/crawling-with-search.md

@@ -9,6 +9,16 @@ import Example from '!!raw-loader!roa-loader!./scraping_from_sitemaps.js';

# How to scrape from sitemaps {#scraping-with-sitemaps}

>Crawlee recently introduced a new feature that allows you to scrape sitemaps with ease. If you are using Crawlee, you can skip the following steps and just gather all the URLs from the sitemap in a few lines of code:

Contributor

This feels like unnecessary dating. When is "recently"? Also, I think this could work better as an admonition instead of a blockquote.

---
title: Crawling sitemaps
description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
menuWeight: 2

Contributor

Is this something custom-created by Apify? I haven't seen this anywhere else.

title: Crawling sitemaps
description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
menuWeight: 2
paths:

Contributor

Isn't this supposed to be slug: instead?


Apify provides the [Sitemap Sniffer actor](https://apify.com/vaclavrut/sitemap-sniffer) (open-source code), that scans the URL variations automatically for you so that you don't have to check manually.
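
For readers following the quoted paragraph, here is a minimal sketch of what checking sitemap URL variations by hand could look like. The path list and the `sitemapCandidates` helper are assumptions made for illustration; they are not taken from the Sitemap Sniffer actor's source, which may check a different set of locations.

```javascript
// Hypothetical list of commonly used sitemap locations (an assumption,
// not the list the Sitemap Sniffer actor actually uses).
const COMMON_SITEMAP_PATHS = [
    '/sitemap.xml',
    '/sitemap_index.xml',
    '/sitemap.xml.gz',
    '/sitemap.txt',
    '/sitemap/sitemap.xml',
];

// Build absolute candidate URLs from any page URL on the target site.
// robots.txt comes first because it can declare sitemaps via "Sitemap:" lines.
function sitemapCandidates(siteUrl) {
    const { origin } = new URL(siteUrl);
    return [
        `${origin}/robots.txt`,
        ...COMMON_SITEMAP_PATHS.map((path) => origin + path),
    ];
}

const candidates = sitemapCandidates('https://example.com/some/page?x=1');
console.log(candidates);
```

Each candidate would then be requested in turn, keeping the ones that return a valid sitemap.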

## [](#how-to-set-up-http-requests-to-download-sitemaps) How to set up HTTP requests to download sitemaps

Contributor

If the anchors do not differ from the headings, then these are unnecessary, from what I remember.


Fortunately, you don't have to worry about any of the above steps if you use [Crawlee](https://crawlee.dev) which has rich traversing and parsing support for sitemap. Crawlee can traverse nested sitemaps, download, and parse compressed sitemaps, and extract URLs from them. You can get all URLs in a few lines of code:
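
To make concrete what the quoted paragraph says is being automated, here is a dependency-free sketch of traversing nested sitemaps and extracting URLs. It is not Crawlee's implementation or API: `extractLocs`, `isSitemapIndex`, `collectPageUrls`, and the in-memory `downloaded` map are hypothetical names, regex parsing stands in for a real XML parser, and downloading and decompression are left out.

```javascript
// Pull every <loc> value out of a sitemap document.
// A regex keeps the sketch short; it does not handle CDATA sections.
function extractLocs(xml) {
    const locs = [];
    for (const match of xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)) {
        // Sitemaps XML-escape ampersands, so undo that to get usable URLs.
        locs.push(match[1].replace(/&amp;/g, '&'));
    }
    return locs;
}

// A sitemap index lists other sitemaps instead of pages.
function isSitemapIndex(xml) {
    return /<sitemapindex[\s>]/.test(xml);
}

// Walk a sitemap and any nested sitemaps, returning only page URLs.
// `getXml` is a hypothetical callback returning the XML text for a URL.
function collectPageUrls(sitemapUrl, getXml, seen = new Set()) {
    if (seen.has(sitemapUrl)) return []; // guard against cycles
    seen.add(sitemapUrl);
    const xml = getXml(sitemapUrl) ?? '';
    const locs = extractLocs(xml);
    if (!isSitemapIndex(xml)) return locs;
    return locs.flatMap((child) => collectPageUrls(child, getXml, seen));
}

// Made-up sample data standing in for already-downloaded sitemaps.
const downloaded = new Map([
    ['https://example.com/sitemap.xml',
        '<sitemapindex><sitemap><loc>https://example.com/products.xml</loc></sitemap></sitemapindex>'],
    ['https://example.com/products.xml',
        '<urlset><url><loc>https://example.com/p/1</loc></url>'
        + '<url><loc> https://example.com/p/2?a=1&amp;b=2 </loc></url></urlset>'],
]);

const urls = collectPageUrls('https://example.com/sitemap.xml', (url) => downloaded.get(url));
console.log(urls);
```

Passing the fetch step in as a callback keeps the traversal logic testable without network access.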

```javascript

Contributor

Could we switch it to ```js? Some time back we changed this for consistency across Academy & Platform docs. I'll add this info to the contributing guidelines.

- advanced-web-scraping/crawling/sitemaps-vs-search
---

The core crawling problem comes down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end as we did in the Web Scraping for Beginners course.

Contributor

I'm not entirely sure if "home page" is correct; perhaps @TheoVasilis could weigh in? Home page or homepage?


The core crawling problem comes down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end as we did in the Web Scraping for Beginners course.

Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10 thousand products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson.

Contributor

Suggested change
- Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10 thousand products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson.
+ Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10 000 products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson.

category: web scraping & automation
slug: /advanced-web-scraping
paths:
- advanced-web-scraping
---

# Advanced web scraping

Contributor

If the title in the frontmatter does not differ from the h1, the h1 is unnecessary; it will be automatically generated by Docusaurus.

---
In this course, we will take all of that knowledge, add a few more advanced concepts, and apply them to learn how to build a production-ready web scraper.

## [](#what-does-production-ready-mean) What does production-ready mean?

Contributor

If I remember correctly, headers should not use punctuation.

@honzajavorek
Collaborator

Will review, but I think I will wait for @TC-MO's comments to be addressed first.
