public-law/open-gov-crawlers

Open-gov spiders written in Python

Data Sources

Each row lists the jurisdiction, the dataset, the source code (parser | spider | tests), and the published dataset (json):
Australia Family, domestic and sexual violence... parser | spider | tests json
Australia IP Glossary parser | spider | tests json
Australia Design IP Glossary parser | spider | tests json
Australia Law Handbook Glossary parser | spider | tests json
Canada Parliamentary Glossary parser | spider | tests json
Canada Patents Glossary parser | spider | tests json
Great Britain Criminal Procedure Rules Glossary parser | spider | tests json
Great Britain Family Procedure Rules Glossary parser | spider | tests json
Ireland Courts Glossary parser | spider | tests json
New Zealand Justice Glossary parser | spider | tests json
USA US Courts Glossary parser | spider | tests json
USA USCIS Glossary parser | spider | tests json
USA Criminal Glossary parser | spider | tests json
Intergovernmental Rome Statute parser | spider | tests json

Tip

The USA Courts Glossary spider and parser are the best examples of our new architecture and coding style.

Example: USA Courts Glossary Parser

The spiders retrieve HTML pages and output well-formed JSON. All glossary parsers emit the same JSON format.

First, we can see which spiders are available:

$ scrapy list

aus_designip_glossary
aus_dv_glossary
aus_ip_glossary
aus_lawhandbook_glossary
can_parliamentary_glossary
can_patents_glossary
gbr_cpr_glossary
gbr_fpr_glossary
int_rome_statute
irl_courts_glossary
nzl_justice_glossary
usa_courts_glossary
usa_criminal_glossary
usa_uscis_glossary
...

Then we can run one of the spiders, e.g. the USA Courts Glossary:

$ scrapy crawl --overwrite-output tmp/output.json usa_courts_glossary

Here's a snippet of the output:

{
  "phrase": "Sentence",
  "definition": "The punishment ordered by a court for a defendant convicted of a crime."
},
{
  "phrase": "Sentencing guidelines",
  "definition": "A set of rules and principles established by the United States Sentencing Commission that trial judges use to determine the sentence for a convicted defendant."
},
{
  "phrase": "Sequester",
  "definition": "To separate. Sometimes juries are sequestered from outside influences during their deliberations."
},
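Every glossary parser returns entries in this same shape. As a rough sketch, the shared entry model might look like this (the real GlossaryEntry in the repo may carry extra fields or validation):

from dataclasses import dataclass

@dataclass(frozen=True)
class GlossaryEntry:
    """One phrase/definition pair; every glossary parser emits a tuple of these."""
    phrase: str       # e.g. "Sequester"
    definition: str   # e.g. "To separate. Sometimes juries are sequestered..."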

Note

See the wiki for a deep-dive explanation of our parsing strategy.

Development Environment Notes

Python 3.12

I'm using asdf to manage Python versions; its Homebrew distribution is more up-to-date than pyenv's.

Poetry for dependency management

It's always worth making sure the current dependencies are installed:

poetry install

Pytest for testing

The pytest tests run easily:

poetry run pytest

I use pytest-watcher to re-run them whenever a file changes. It's installed with the project's dependencies:

poetry run ptw .

Other tools

  • Java is required by the Python Tika package.
  • Pylance/Pyright for type-checking

Contributing

To add a new glossary crawler:

  1. Create the parser in public_law/glossaries/parsers/{jurisdiction}/:

    • Write a pure parse_entries(html: HtmlResponse) -> tuple[GlossaryEntry, ...] function (see the sketch after this list)
    • Focus only on HTML → data extraction, no metadata
    • Add parser tests under tests/glossaries/parsers/{jurisdiction}/
  2. Create the spider in public_law/glossaries/spiders/{jurisdiction}/:

    • Inherit from EnhancedAutoGlossarySpider
    • Implement get_metadata(response: HtmlResponse) -> Metadata method
    • Configure name and start_urls attributes
    • Add spider tests under tests/glossaries/spiders/{jurisdiction}/
  3. Test and run:

    • Run tests: poetry run pytest tests/glossaries/{parsers,spiders}/{jurisdiction}/
    • Run spider: scrapy crawl --overwrite-output tmp/output.json {spider_name}
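
A minimal sketch of how the parser and spider fit together. The names parse_entries, GlossaryEntry, EnhancedAutoGlossarySpider, and Metadata come from the steps above; the jurisdiction "xyz", the import paths, the URL, and the CSS selectors are hypothetical:

# public_law/glossaries/parsers/xyz/courts.py
from scrapy.http import HtmlResponse

from public_law.shared.models import GlossaryEntry  # assumed import path

def parse_entries(html: HtmlResponse) -> tuple[GlossaryEntry, ...]:
    """Pure HTML -> data extraction; no metadata handling."""
    return tuple(
        GlossaryEntry(
            phrase=dt.css("::text").get(default="").strip(),
            definition=dd.css("::text").get(default="").strip(),
        )
        for dt, dd in zip(html.css("dl dt"), html.css("dl dd"))
    )

# public_law/glossaries/spiders/xyz/courts.py
from public_law.shared.spiders import EnhancedAutoGlossarySpider  # assumed import path
from public_law.shared.models import Metadata                     # assumed import path

class XyzCourtsGlossary(EnhancedAutoGlossarySpider):
    name = "xyz_courts_glossary"
    start_urls = ["https://example.gov/glossary"]  # hypothetical URL

    def get_metadata(self, response: HtmlResponse) -> Metadata:
        # Build the dataset's Metadata here; the fields depend on the real
        # Metadata model, so see an existing spider (e.g. usa_courts_glossary)
        # for a complete example.
        ...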

To add a new legal text crawler:

  1. Add a new spider under public_law/legal_texts/spiders/{jurisdiction}/.
  2. Write a parser in public_law/legal_texts/parsers/{jurisdiction}/ that extracts document structure and metadata (a generic sketch follows this list).
  3. Add a test case under tests/legal_texts/parsers/{jurisdiction}/.
  4. Run the spider using scrapy crawl --overwrite-output tmp/output.json {spider_name}.
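
These steps name no specific base class, so as a generic illustration only, a legal-text spider could start from plain scrapy.Spider (the module path, spider name, URL, and selectors are all hypothetical):

# public_law/legal_texts/spiders/xyz/statutes.py
import scrapy

class XyzStatutes(scrapy.Spider):
    name = "xyz_statutes"
    start_urls = ["https://example.gov/statutes"]  # hypothetical URL

    def parse(self, response):
        # The HTML -> structure extraction belongs in
        # public_law/legal_texts/parsers/xyz/ per step 2; it's inlined here
        # only to keep the sketch self-contained.
        for section in response.css("section.statute"):
            yield {
                "heading": section.css("h2::text").get(),
                "text": " ".join(section.css("p::text").getall()),
            }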

The repository follows a business-domain-first organization:

  • Glossaries: Legal term definitions and dictionaries
  • Legal Texts: Full legal documents, statutes, and regulations
  • Shared: Common utilities, base classes, and models used across domains

Need help? Just ask in GitHub Issues or ping @robb.
