public-law/open-gov-crawlers

Open-gov spiders written in Python

Data Sources

Each row lists the jurisdiction, the dataset, the source code (parser | spider | tests), and the published dataset (json):
Australia Family, domestic and sexual violence... parser | spider | tests json
Australia IP Glossary parser | spider | tests json
Australia Design IP Glossary parser | spider | tests json
Australia Law Handbook Glossary parser | spider | tests json
Canada Parliamentary Glossary parser | spider | tests json
Canada Patents Glossary parser | spider | tests json
Great Britain Criminal Procedure Rules Glossary parser | spider | tests json
Great Britain Family Procedure Rules Glossary parser | spider | tests json
Ireland Courts Glossary parser | spider | tests json
New Zealand Justice Glossary parser | spider | tests json
USA US Courts Glossary parser | spider | tests json
USA USCIS Glossary parser | spider | tests json
USA Criminal Glossary parser | spider | tests json
Intergovernmental Rome Statute parser | spider | tests json

Tip

The USA Courts Glossary spider and parser are the best examples of our new architecture and coding style.

Example: USA Courts Glossary Parser

The spiders retrieve HTML pages and output well-formed JSON. All glossary parsers emit the same JSON format.

First, we can see which spiders are available:

$ scrapy list

aus_designip_glossary
aus_dv_glossary
aus_ip_glossary
aus_lawhandbook_glossary
can_parliamentary_glossary
can_patents_glossary
gbr_cpr_glossary
gbr_fpr_glossary
int_rome_statute
irl_courts_glossary
nzl_justice_glossary
usa_courts_glossary
usa_criminal_glossary
usa_uscis_glossary
...

Then we can run one of the spiders, e.g. the USA Courts Glossary:

$ scrapy crawl --overwrite-output tmp/output.json usa_courts_glossary

Here's a snippet of the output:

{
  "phrase": "Sentence",
  "definition": "The punishment ordered by a court for a defendant convicted of a crime."
},
{
  "phrase": "Sentencing guidelines",
  "definition": "A set of rules and principles established by the United States Sentencing Commission that trial judges use to determine the sentence for a convicted defendant."
},
{
  "phrase": "Sequester",
  "definition": "To separate. Sometimes juries are sequestered from outside influences during their deliberations."
},
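Every glossary parser returns entries in this same shape. As a rough sketch, the shared entry model might look like this (the real GlossaryEntry in the repo may carry extra fields or validation):

from dataclasses import dataclass

@dataclass(frozen=True)
class GlossaryEntry:
    """One phrase/definition pair; every glossary parser emits a tuple of these."""
    phrase: str       # e.g. "Sequester"
    definition: str   # e.g. "To separate. Sometimes juries are sequestered..."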

Note

See the wiki for a deep-dive explanation of our parsing strategy.

Development Environment Notes

Python 3.12

I'm using asdf to manage Python versions; its Homebrew distribution is more up-to-date than pyenv's.

Poetry for dependency management

It's always worth making sure the current dependencies are installed:

poetry install

Pytest for testing

The pytest tests run easily:

poetry run pytest

I use pytest-watcher to re-run them whenever a file changes. It's installed with the project's dependencies:

poetry run ptw .

Other tools

  • Java is required by the Python Tika package.
  • Pylance/Pyright for type-checking

Contributing

To add a new glossary crawler:

  1. Create the parser in public_law/glossaries/parsers/{jurisdiction}/:

    • Write a pure parse_entries(html: HtmlResponse) -> tuple[GlossaryEntry, ...] function (see the sketch after this list)
    • Focus only on HTML → data extraction, no metadata
    • Add parser tests under tests/glossaries/parsers/{jurisdiction}/
  2. Create the spider in public_law/glossaries/spiders/{jurisdiction}/:

    • Inherit from EnhancedAutoGlossarySpider
    • Implement get_metadata(response: HtmlResponse) -> Metadata method
    • Configure name and start_urls attributes
    • Add spider tests under tests/glossaries/spiders/{jurisdiction}/
  3. Test and run:

    • Run tests: poetry run pytest tests/glossaries/{parsers,spiders}/{jurisdiction}/
    • Run spider: scrapy crawl --overwrite-output tmp/output.json {spider_name}
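
A minimal sketch of how the parser and spider fit together. The names parse_entries, GlossaryEntry, EnhancedAutoGlossarySpider, and Metadata come from the steps above; the jurisdiction "xyz", the import paths, the URL, and the CSS selectors are hypothetical:

# public_law/glossaries/parsers/xyz/courts.py
from scrapy.http import HtmlResponse

from public_law.shared.models import GlossaryEntry  # assumed import path

def parse_entries(html: HtmlResponse) -> tuple[GlossaryEntry, ...]:
    """Pure HTML -> data extraction; no metadata handling."""
    return tuple(
        GlossaryEntry(
            phrase=dt.css("::text").get(default="").strip(),
            definition=dd.css("::text").get(default="").strip(),
        )
        for dt, dd in zip(html.css("dl dt"), html.css("dl dd"))
    )

# public_law/glossaries/spiders/xyz/courts.py
from public_law.shared.spiders import EnhancedAutoGlossarySpider  # assumed import path
from public_law.shared.models import Metadata                     # assumed import path

class XyzCourtsGlossary(EnhancedAutoGlossarySpider):
    name = "xyz_courts_glossary"
    start_urls = ["https://example.gov/glossary"]  # hypothetical URL

    def get_metadata(self, response: HtmlResponse) -> Metadata:
        # Build the dataset's Metadata here; the fields depend on the real
        # Metadata model, so see an existing spider (e.g. usa_courts_glossary)
        # for a complete example.
        ...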

To add a new legal text crawler:

  1. Add a new spider under public_law/legal_texts/spiders/{jurisdiction}/.
  2. Write a parser in public_law/legal_texts/parsers/{jurisdiction}/ that extracts document structure and metadata (a generic sketch follows this list).
  3. Add a test case under tests/legal_texts/parsers/{jurisdiction}/.
  4. Run the spider using scrapy crawl --overwrite-output tmp/output.json {spider_name}.
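
These steps name no specific base class, so as a generic illustration only, a legal-text spider could start from plain scrapy.Spider (the module path, spider name, URL, and selectors are all hypothetical):

# public_law/legal_texts/spiders/xyz/statutes.py
import scrapy

class XyzStatutes(scrapy.Spider):
    name = "xyz_statutes"
    start_urls = ["https://example.gov/statutes"]  # hypothetical URL

    def parse(self, response):
        # The HTML -> structure extraction belongs in
        # public_law/legal_texts/parsers/xyz/ per step 2; it's inlined here
        # only to keep the sketch self-contained.
        for section in response.css("section.statute"):
            yield {
                "heading": section.css("h2::text").get(),
                "text": " ".join(section.css("p::text").getall()),
            }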

The repository follows a business-domain-first organization:

  • Glossaries: Legal term definitions and dictionaries
  • Legal Texts: Full legal documents, statutes, and regulations
  • Shared: Common utilities, base classes, and models used across domains

Need help? Just ask in GitHub Issues or ping @robb.
