Frontmatter

Skip to Workshop.

Abstract

This workshop covers web scraping with the Python library Beautiful Soup 4, or bs4. In short, bs4 is a Python library for "web scraping," or pulling data out of HTML and XML files. It lets people take information, including text and images, from websites. It is often used to collect many kinds of data, such as news, financial, sports, and social media data. Once we have this data, we usually put it into a format like JSON or CSV (spreadsheet) for further analysis. In this workshop, we will use bs4 to scrape news data from the New York Times website. By the end of this workshop, you will have a Python script that can grab data from a website and export that data to a CSV file. Then, at the very end, I will show you a couple of other ways to scrape websites that go beyond bs4 and are suited to social media.
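As a rough preview of the kind of script described above, here is a minimal sketch that parses HTML with bs4 and writes the results as CSV. The sample HTML, the tag structure (`article`, `h2`, `a`), and the helper names `extract_headlines` and `write_csv` are assumptions made for illustration; they are not taken from the workshop or from the New York Times site, whose real markup will differ.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded news page.
SAMPLE_HTML = """
<html><body>
  <article><h2><a href="/first-story.html">First headline</a></h2></article>
  <article><h2><a href="/second-story.html">Second headline</a></h2></article>
</body></html>
"""


def extract_headlines(html):
    """Return one dict per article link found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.find_all("article"):
        link = article.find("a")
        if link is None:
            continue  # skip articles without a link
        rows.append({
            "headline": link.get_text(strip=True),
            "url": link.get("href"),
        })
    return rows


def write_csv(rows, file_obj):
    """Write the rows to an open text file (or file-like object) as CSV."""
    writer = csv.DictWriter(file_obj, fieldnames=["headline", "url"])
    writer.writeheader()
    writer.writerows(rows)


rows = extract_headlines(SAMPLE_HTML)

# An in-memory buffer keeps the sketch free of side effects.
buffer = io.StringIO()
write_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To save a real file instead, the same `write_csv` call could be given a file opened with `open("headlines.csv", "w", newline="", encoding="utf-8")`.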

Learning Objectives

In this workshop, participants will:

  • discuss web scraping, distinguishing it from other data-gathering methods like APIs
  • quickly review the basics of HTML (for those who need a refresher)
  • use bs4 to explore web pages, looking closely at several of its functions and methods
  • learn to write scripts with bs4 that automate searching for website data, and export the scraped data to CSV files (spreadsheets)
  • explore web browser tools (the inspector) for identifying elements to scrape from websites

Estimated time

2 hours.

Prerequisites

  • DHRI workshop on Python (complete lessons 1-7, through "Loops")
  • DHRI workshop on the Command Line (complete lessons 1-6, through "Navigation")
  • DHRI workshop on HTML/CSS (complete lessons 1-6, through "Links")

Contexts

Pre-reading suggestions

Projects that use these skills

Ethical Considerations

Always consider the people behind the data you are scraping. Is it ethical to work with this data? To pull it from websites? In some cases there is also the question of legality. Some websites do not allow you to scrape their data.
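One common convention for seeing what a site asks automated visitors to avoid is its `robots.txt` file. The sketch below checks URLs against such rules using Python's standard `urllib.robotparser`. The rules, the URLs, and the helper name `can_scrape` are hypothetical, and this is only one possible check: `robots.txt` is advisory, so passing it does not settle the ethical or legal questions above, and a site's terms of service may say more.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def can_scrape(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt rules allow fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed = can_scrape(ROBOTS_TXT, "https://example.com/news/story.html")
blocked = can_scrape(ROBOTS_TXT, "https://example.com/private/data.html")
print(allowed, blocked)
```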

Resources (optional)

License

Workshop leader: Filipa Calado, Graduate Center Digital Fellows

Creative Commons Attribution-ShareAlike 4.0 International License.