metacritic-games-scrape

scrape games on metacritic.com

prerequisites

poetry

git clone https://github.com/atloo1/metacritic-games-scrape.git
cd metacritic-games-scrape/

run

scrape

poetry install --only scrape
poetry run scrapy crawl metacritic_games -L DEBUG -O post_scrape/data.json --logfile post_scrape/logs.txt
poetry run python -m post_scrape.clean_json

artifacts available upon request:

http_cache.tar.gz: HTTP cache from the scrape that underlies write-up.pdf
- extract with tar -xzf http_cache.tar.gz -C <project-root>/.scrapy/httpcache/

analyze

poetry install --only analyze
poetry run jupyter notebook ./post_scrape/
# or do as you please with data_cleaned.json

artifacts available upon request:

data_cleaned.json.gz: the dataset for all analyses & starting point if not scraping for yourself
data.json.gz: scrapy crawl output
logs.json.gz: scrapy crawl logs
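
if you would rather not decompress the .gz artifacts by hand, the sketch below reads either form directly. it assumes the file holds a single JSON document (not JSON lines); load_records is a hypothetical helper, not the repository's load_scrape_data.py, and the path in the usage comment is an assumption.

```python
import gzip
import json
from pathlib import Path


def load_records(path):
    """Load one JSON document, decompressing on the fly if the name ends in .gz."""
    path = Path(path)
    # assumption: the file is a single JSON document, not JSON lines
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)


# usage (assumed location; adjust to wherever the file actually lives):
# records = load_records("post_scrape/data_cleaned.json.gz")
# print(len(records))
```

the return value is whatever the JSON decodes to, so inspect its type before assuming a list of games.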

file descriptions

reading materials available upon request:

supplement_citations.pdf: textual citations; an alternative to write-up's hyperlinks
supplement_figures.pdf: enlarged figures from write-up
supplement_table_1.pdf: table 1 as alluded to in write-up
supplement_table_3.pdf: complete table 3 as alluded to in write-up
write-up.pdf: a research-style paper presenting this repository's work

`./post_scrape/*`:

clean_data_validation.ipynb: data_cleaned.json inspection
clean_json.py: data.json → data_cleaned.json
fig_1.ipynb: reproduce figure 1
fig_2.ipynb: reproduce figure 2
fig_3.ipynb: reproduce figure 3
fig_4.ipynb: reproduce figure 4
load_scrape_data.py: data_cleaned.json → DataFrame
nlp_utils.py: text normalization, topic modeling, & cross validation pipeline
table_2.py: reproduce underlying data for table 2
tables_3_4_fig_5.ipynb: reproduce underlying data for tables 3 & 4 + figure 5
tables_2_3_4_fig_5_pretty.ipynb: reproduce tables 2, 3, & 4 + figure 5 as seen in write-up

develop

prerequisites

pyenv

1st time setup

pyenv install 3.9 --skip-existing   # or your choice
pyenv local 3.9   # or your choice
poetry install
poetry run pre-commit install

helpful Bash

monitor scrape progress: 1 line = 1 web page

wc -l post_scrape/data.json
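
where wc is unavailable, a rough Python stand-in for the same count might look like the sketch below. count_lines is a hypothetical name; like wc -l, it counts newline characters, so a final line without a trailing newline is not counted.

```python
from pathlib import Path


def count_lines(path):
    """Count newline characters in a file, like `wc -l`."""
    count = 0
    with Path(path).open("rb") as f:
        # read in 1 MiB chunks so a large, still-growing file is not loaded whole
        for chunk in iter(lambda: f.read(1 << 20), b""):
            count += chunk.count(b"\n")
    return count


# usage:
# print(count_lines("post_scrape/data.json"))
```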

find a page in the HTTP cache; if you delete it, the next scrape re-downloads it

query_url="https://www.metacritic.com/game/halo-2/"  # set me
query_str="{'url': '${query_url}'"
find .scrapy/httpcache/metacritic_games/ -type f -name "meta" -exec bash -c '[[ "$(head -n 1 "$0")" == "$1"* ]] && echo "$0 starts with query_str"' {} "$query_str" \;
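
the same search can be sketched in Python. like the Bash command above, it assumes each cached page has a file named meta whose first line begins with {'url': '<url>'; find_cached_pages is a hypothetical name, not part of this repository.

```python
from pathlib import Path


def find_cached_pages(cache_dir, query_url):
    """Return paths of `meta` files whose first line starts with the query prefix."""
    # same prefix the Bash snippet builds in query_str
    query_str = "{'url': '" + query_url + "'"
    matches = []
    for meta in Path(cache_dir).rglob("meta"):
        if not meta.is_file():
            continue
        with meta.open("r", encoding="utf-8", errors="replace") as f:
            first_line = f.readline()
        if first_line.startswith(query_str):
            matches.append(meta)
    return sorted(matches)


# usage:
# for meta in find_cached_pages(".scrapy/httpcache/metacritic_games/",
#                               "https://www.metacritic.com/game/halo-2/"):
#     print(meta.parent)  # directory holding the cached page
```

the match is a prefix test on the exact URL string, so a trailing slash or a different scheme in query_url will change the result.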

