This repo supports the activities described in this repo.
- Linux or macOS system
- Node v22+
- Bun (for package management only, as a faster/smaller replacement for Yarn)
The automated steps in this repo are roughly as follows:
- Gather
  - Get raw data from an external resource, e.g. scraping an HTML page, downloading/parsing a PDF/CSV, making a request to an API, etc.
  - Save raw data exactly as-is for provenance and caching.
  - Collate the most important information from the raw data into a common high-level output data format suited to making the desired dashboard pages and PDF reports.
  - Repeat previous steps in order of dependency (e.g. opportunity number -> grant numbers) until all needed info is gathered.
- Print
  - Run dashboard webapp.
  - Import output data from the gather step, and do some minimal final processing (e.g. combine journal info with each publication listing).
  - Render select dashboard pages (e.g. `/core-project/abc123`) to PDF reports.
  - Deploy dashboard and PDFs to private web addresses.
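The gather steps above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: the `Grant` type, the `FetchRaw` function, the JSON shape of the raw response, and all function names are assumptions made for the example.

```typescript
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { dirname, join } from "node:path";

// Hypothetical output record; the repo's real output format is not shown here.
type Grant = { id: string; opportunity: string };

// Hypothetical function that fetches raw text for an opportunity (e.g. an API call).
type FetchRaw = (opportunity: string) => Promise<string>;

// Return raw text from the on-disk cache, fetching and saving it on a miss.
async function getRaw(
  cachePath: string,
  fetchRaw: () => Promise<string>,
): Promise<string> {
  if (existsSync(cachePath)) return readFileSync(cachePath, "utf8");
  const raw = await fetchRaw();
  mkdirSync(dirname(cachePath), { recursive: true });
  writeFileSync(cachePath, raw); // saved unmodified, for provenance
  return raw;
}

// Collate: keep only the fields the reports need, in a common format.
// The `{ grants: [{ id }] }` raw shape is assumed for illustration.
function collateGrants(raw: string, opportunity: string): Grant[] {
  const parsed = JSON.parse(raw) as { grants: { id: string }[] };
  return parsed.grants.map(({ id }) => ({ id, opportunity }));
}

// Gather grants for one opportunity: get raw data (cached), then collate.
async function gatherGrants(
  opportunity: string,
  rawDir: string,
  fetchRaw: FetchRaw,
): Promise<Grant[]> {
  const cachePath = join(rawDir, `${opportunity}.json`);
  const raw = await getRaw(cachePath, () => fetchRaw(opportunity));
  return collateGrants(raw, opportunity);
}
```

In a sketch like this, the grant IDs returned for an opportunity could feed the next dependent gather step, and a second run would read from the raw folder instead of hitting the external resource again.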
- `/app` - Dashboard webapp made with Vue. Also used for generating PDF reports.
  - `/public/pdfs` - Outputted PDF reports.
- `/data` - All other functionality involving data.
  - `/api` - Types and functions for getting raw data from external APIs.
  - `/raw` - Raw data gathered from external sources, for provenance.
  - `/gather` - Functions for gathering data and putting it in a common format.
  - `/output` - Gathered data in format for making desired reports.
  - `/print` - Functions specific to making printed reports.
  - `/util` - Small-scope general purpose functions.
- TypeScript - Language used to provide type-safety from beginning to end of pipeline.
- Playwright - Tool used for scraping public web pages and rendering dashboard pages to PDF reports.
- Netlify - Service used for privately hosting dashboard webapp (and PR previews).
The pipeline is optimized wherever possible and appropriate. Things like network requests and rendering are parallelized (e.g. PDF reports are printed simultaneously in separate tabs of the same Playwright browser instance). External resources are cached in their raw format to speed up subsequent runs, and to avoid being rate-limited or blocked by those providers.
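As a rough illustration of printing reports in parallel tabs of a single browser, here is a minimal sketch. The `Tab` and `BrowserLike` interfaces are assumptions that mirror only the Playwright methods used (`browser.newPage`, `page.goto`, `page.pdf`, `page.close`); the `printAll` name, the base URL, and the output file naming are hypothetical and not taken from this repo.

```typescript
// Minimal subset of a Playwright page, declared locally so the sketch is self-contained.
interface Tab {
  goto(url: string): Promise<unknown>;
  pdf(options: { path: string }): Promise<unknown>;
  close(): Promise<void>;
}

// Minimal subset of a Playwright browser.
interface BrowserLike {
  newPage(): Promise<Tab>;
}

// Render each route to a PDF, one tab per route, all tabs running concurrently.
async function printAll(
  browser: BrowserLike,
  baseUrl: string,
  routes: string[],
  outDir: string,
): Promise<string[]> {
  return Promise.all(
    routes.map(async (route) => {
      const page = await browser.newPage();
      try {
        await page.goto(baseUrl + route);
        // Hypothetical naming: "/core-project/abc123" -> "core-project-abc123.pdf"
        const name = route.replace(/^\//, "").replace(/\//g, "-");
        const path = `${outDir}/${name}.pdf`;
        await page.pdf({ path });
        return path;
      } finally {
        await page.close(); // always release the tab, even on failure
      }
    }),
  );
}
```

With Playwright installed, a browser obtained from `chromium.launch()` should satisfy `BrowserLike`, so it could be passed in directly and closed once all PDFs are written. For a large number of pages, a concurrency limit may be worth adding so that too many tabs are not open at once.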
Use `./run.sh` with a `--flag` to conveniently run a script of the same name in `/data/package.json` and `/app/package.json` (if it exists) from the root of this repo.
Most important scripts:
| Flag | Description |
|---|---|
| `--install` | Install packages and dependencies |
| `--install-playwright` | Install Playwright |
| no flag | Run main pipeline steps in order |
| `--test` | Run all tests (type-checking, linting/formatting checks, etc.) |
| `--lint` | Auto-fix linting/formatting |
| `--dev` | Run dashboard webapp in dev mode |
See the readmes in subfolders for all commands.