Frontmatter

Abstract

This workshop covers how to scrape the web using the Python library Beautiful Soup 4, or bs4. In short, bs4 is a Python library for "web scraping," or pulling data out of HTML and XML files. It allows people to take information, including text and images, from websites. It's often used to collect many kinds of data, such as news, financial, sports, and social media data. Once we collect this data, we usually put it into a format like JSON or CSV (spreadsheet) for further analysis. In this workshop, we will use bs4 to scrape news data from the New York Times website. By the end of this workshop, you will have a Python script that can grab data from a website and export that data to a CSV file. Then, at the very end, I will show you a couple of other ways to scrape websites that go beyond bs4, for scraping social media.
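
To give a sense of where the workshop ends up, here is a minimal sketch of the scrape-then-export flow described above. It is only an illustration: the HTML snippet, the tag names, and the helper functions `extract_articles` and `rows_to_csv` are assumptions made up for this example, not the actual structure of the New York Times site or the script built in the workshop.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded news page. A real page
# would be fetched over the network and would use different tags and classes.
SAMPLE_HTML = """
<html><body>
  <article><h2><a href="/story-one">First headline</a></h2><p>Summary one.</p></article>
  <article><h2><a href="/story-two">Second headline</a></h2><p>Summary two.</p></article>
</body></html>
"""


def extract_articles(html):
    """Return one dict per <article> element found in the HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.find_all("article"):
        link = article.find("a")
        summary = article.find("p")
        rows.append({
            "headline": link.get_text(strip=True),
            "url": link["href"],
            "summary": summary.get_text(strip=True),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the scraped rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["headline", "url", "summary"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


articles = extract_articles(SAMPLE_HTML)
csv_text = rows_to_csv(articles)
print(csv_text)
```

To save a spreadsheet instead of printing, the same text could be written to a file opened with `open("articles.csv", "w", newline="")`; the file name is likewise just a placeholder.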

Learning Objectives

In this workshop, participants will:

discuss web scraping, distinguishing it from other data-gathering methods like APIs
quickly review the basics of HTML (for those who need a refresher)
use bs4 to explore web pages, learning its main functions and methods in some detail
learn to write Python scripts with bs4 that automate searching for website data and export the scraped data to CSV files (spreadsheets)
explore web browser tools (the inspector) for identifying elements to scrape from websites
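
As a rough preview of how the inspector and bs4 fit together: once the inspector shows which tags, ids, and classes wrap the data you want, those same names can be passed to bs4. The sketch below assumes a made-up page layout (the `top-stories` id and the `story`, `headline`, and `byline` classes are hypothetical); real sites will use their own names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, as it might appear in a browser's inspector.
SAMPLE_HTML = """
<div id="top-stories">
  <section class="story"><h3 class="headline">Council approves budget</h3><span class="byline">By A. Writer</span></section>
  <section class="story"><h3 class="headline">Storm closes schools</h3><span class="byline">By B. Reporter</span></section>
</div>
<div id="sidebar"><h3 class="headline">Advertisement</h3></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns only the first matching element.
first_headline = soup.find("h3", class_="headline").get_text(strip=True)

# find_all() returns every match, including the one in the sidebar.
all_headlines = [h.get_text(strip=True) for h in soup.find_all("h3", class_="headline")]

# select() takes a CSS selector, so the search can be limited to one region.
top_headlines = [h.get_text(strip=True) for h in soup.select("#top-stories h3.headline")]

# The child combinator (>) matches only direct children of each story.
bylines = [s.get_text(strip=True) for s in soup.select("section.story > span.byline")]

print(first_headline)
print(all_headlines)
print(top_headlines)
print(bylines)
```

Narrowing the selector to a specific container is one way to keep unrelated elements, like the sidebar headline here, out of the results.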

Estimated time

2 hours.

Prerequisites

DHRI workshop on Python (complete lessons 1-7, through "Loops")
DHRI workshop on the Command Line (complete lessons 1-6, through "Navigation")
DHRI workshop on HTML/CSS (complete lessons 1-6, through "Links")

Contexts

Pre-reading suggestions

Beautiful Soup 4 on PyPI

Projects that use these skills

Scraping Reddit into JSON format. Script (with walkthrough) on how to scrape data from Reddit. Although Reddit also has a handy .
Instagram-scraper. A module for scraping Instagram that automatically outputs to a JSON file.

Ethical Considerations

Always consider the people behind the data you are scraping. Is it ethical to work with this data? To pull it from websites? In some cases there is also the question of legality: some websites do not allow you to scrape their data.
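
One practical signal of what a site permits is its robots.txt file, which many sites publish to tell automated clients which paths they should stay out of. This is an addition not covered in the text above, and it does not settle the ethical or legal questions; a site's terms of service may say more. The sketch below uses Python's standard `urllib.robotparser` on a made-up robots.txt; the rules, the `workshop-bot` user agent, and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real one lives at the site's
# /robots.txt path and would be downloaded before checking.
SAMPLE_ROBOTS_TXT = """
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


news_ok = is_allowed(SAMPLE_ROBOTS_TXT, "workshop-bot", "https://example.com/news/today.html")
private_ok = is_allowed(SAMPLE_ROBOTS_TXT, "workshop-bot", "https://example.com/private/data.html")

print(news_ok)
print(private_ok)
```

Against a live site, the parser's `set_url()` and `read()` methods can fetch the file directly instead of parsing a string.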

Resources (optional)

Corey Schafer's YouTube tutorial, from HTML to CSV
Real Python's web scraping with bs4 tutorial
Beautiful Soup 4 docs

License

Workshop leader: Filipa Calado, Graduate Center Digital Fellows

Creative Commons Attribution-ShareAlike 4.0 International License.
