scrape games on metacritic.com
git clone https://github.com/atloo1/metacritic-games-scrape.git
cd metacritic-games-scrape/
poetry install --only scrape
poetry run scrapy crawl metacritic_games -L DEBUG -O post_scrape/data.json --logfile post_scrape/logs.txt
poetry run python -m post_scrape.clean_json
http_cache.tar.gz: HTTP cache of the scrape underlying write-up.pdf
- use with
tar -xzf http_cache.tar.gz -C <project-root>/.scrapy/httpcache/
data_cleaned.json.gz: the dataset for all analyses & starting point if not scraping for yourself
- use with
poetry install --only analyze
poetry run jupyter notebook ./post_scrape/
# or do as you please with data_cleaned.json
data.json.gz: scrapy crawl output
logs.json.gz: scrapy crawl logs
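For anyone who prefers to skip the notebooks, here is a minimal sketch of reading the dataset directly in Python. It assumes only that the file holds a single JSON document; the record schema is not described here, so the sketch does not rely on any field names. The helper name load_records and the path in the usage block are illustrative, not part of the repository.

```python
import gzip
import json
from pathlib import Path


def load_records(path):
    """Load one JSON document from a plain or gzip-compressed file."""
    path = Path(path)
    # Pick the opener from the file extension; both accept text mode.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # Assumed location: adjust to wherever the file was downloaded.
    data_path = Path("post_scrape/data_cleaned.json.gz")
    if data_path.exists():
        records = load_records(data_path)
        print(type(records).__name__, len(records))
```

The same helper reads an already decompressed data_cleaned.json, since it falls back to the built-in open when the path does not end in .gz.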
reading materials available upon request:
supplement_citations.pdf: textual citations; an alternative to the write-up's hyperlinks
supplement_figures.pdf: enlarged figures from the write-up
supplement_table_1.pdf: table 1 as alluded to in the write-up
supplement_table_3.pdf: complete table 3 as alluded to in the write-up
write-up.pdf: a research-style paper presenting this repository's work
clean_data_validation.ipynb: data_cleaned.json inspection
clean_json.py: data.json → data_cleaned.json
fig_1.ipynb: reproduce figure 1
fig_2.ipynb: reproduce figure 2
fig_3.ipynb: reproduce figure 3
fig_4.ipynb: reproduce figure 4
load_scrape_data.py: data_cleaned.json → DataFrame
nlp_utils.py: text normalization, topic modeling, & cross validation pipeline
table_2.py: reproduce underlying data for table 2
tables_3_4_fig_5.ipynb: reproduce underlying data for tables 3 & 4 + figure 5
tables_2_3_4_fig_5_pretty.ipynb: reproduce tables 2, 3, & 4 + figure 5 as seen in write-up
pyenv install 3.9 --skip-existing # or your choice
pyenv local 3.9 # or your choice
poetry install
poetry run pre-commit install
wc -l post_scrape/data.json
# list HTTP cache entries whose meta file begins with the queried URL
query_url="https://www.metacritic.com/game/halo-2/" # set me
query_str="{'url': '${query_url}'"
find .scrapy/httpcache/metacritic_games/ -type f -name "meta" -exec bash -c '[[ "$(head -n 1 "$0")" == "$1"* ]] && echo "$0 starts with query_str"' {} "$query_str" \;
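The same cache lookup can be sketched in Python for anyone who would rather not use the find one-liner. Like the shell version, it assumes that each cache entry has a file named meta whose first line begins with {'url': '<url>'. The function name find_cached is hypothetical and not part of the repository.

```python
from pathlib import Path


def find_cached(cache_dir, url):
    """Return the entry directories whose `meta` file starts with the URL."""
    # Mirrors the shell query string: {'url': '<url>'
    prefix = f"{{'url': '{url}'"
    matches = []
    for meta in sorted(Path(cache_dir).rglob("meta")):
        if not meta.is_file():
            continue
        # Only the first line is compared, as with `head -n 1`.
        with meta.open("r", encoding="utf-8", errors="replace") as f:
            first_line = f.readline()
        if first_line.startswith(prefix):
            matches.append(meta.parent)
    return matches


if __name__ == "__main__":
    query_url = "https://www.metacritic.com/game/halo-2/"  # set me
    for entry in find_cached(".scrapy/httpcache/metacritic_games/", query_url):
        print(entry)
```

Because the closing quote is part of the prefix, a query for one URL does not match longer URLs that merely share its beginning.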