This workshop introduces web scraping with the Python library bs4. It can be taught asynchronously or synchronously, either as its own workshop or as part of the Python track at the digital institutes.
This workshop was written by Filipa Calado.
It was first taught at CUNY GC by Filipa Calado in the Spring of 2021 as a two-hour online synchronous workshop.
Abstract:
This workshop covers web scraping with the Python library Beautiful Soup 4, or bs4. In short, bs4 is a Python library for "web scraping," or pulling data out of HTML and XML files. In this workshop, we will use bs4 to scrape news data from the New York Times website. By the end of this workshop, you will have a Python script that can grab data from a website and export that data to a CSV file. Then, at the very end, I will show you a couple of other ways to scrape websites that go beyond bs4, for scraping social media.
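To give a sense of where the workshop ends up, here is a minimal sketch of the parse-then-export workflow described above. It is not the workshop's own script. The HTML snippet, the tag structure, and the function names `extract_headlines` and `write_csv` are all hypothetical, chosen for illustration. The sketch parses an inline string rather than a live page, and it uses Python's built-in "html.parser" instead of lxml so that it runs with only bs4 installed.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded news page. A real script
# would first fetch the page, for example with requests.get(url).text.
SAMPLE_HTML = """
<html><body>
  <article><h2><a href="/story-one">First headline</a></h2></article>
  <article><h2><a href="/story-two">Second headline</a></h2></article>
</body></html>
"""


def extract_headlines(html):
    """Return a list of (title, link) tuples, one per <article> element."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.find_all("article"):
        link = article.find("a")
        if link is None:
            continue  # skip articles without a link
        rows.append((link.get_text(strip=True), link.get("href")))
    return rows


def write_csv(rows, handle):
    """Write a header row followed by the scraped rows to an open file handle."""
    writer = csv.writer(handle)
    writer.writerow(["title", "link"])
    writer.writerows(rows)


rows = extract_headlines(SAMPLE_HTML)

# An in-memory buffer keeps the sketch free of side effects; to write a real
# file, use open("headlines.csv", "w", newline="") instead.
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

Separating extraction from export means the same `write_csv` step can be reused when the parsing logic changes for a different site.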
- Students need to be familiar with the Python language, having completed the Introduction to Python workshop before taking this workshop.
- Students should install the most recent Anaconda Python distribution on their computers, as well as the Python libraries requests, bs4, lxml, and csv.
Feedback was very good. Students thought the pace and content were effective.
There is some interest in expanding this workshop into a two- or three-part web scraping series.
Workshop leader: Filipa Calado, Graduate Center Digital Fellows
Creative Commons Attribution-ShareAlike 4.0 International License.