| Jurisdiction | Glossary | Source code | Dataset |
|---|---|---|---|
| Australia | Family, domestic and sexual violence... | parser · spider · tests | json |
| Australia | IP Glossary | parser · spider · tests | json |
| Australia | Design IP Glossary | parser · spider · tests | json |
| Australia | Law Handbook Glossary | parser · spider · tests | json |
| Canada | Parliamentary Glossary | parser · spider · tests | json |
| Canada | Patents Glossary | parser · spider · tests | json |
| Great Britain | Criminal Procedure Rules Glossary | parser · spider · tests | json |
| Great Britain | Family Procedure Rules Glossary | parser · spider · tests | json |
| Ireland | Courts Glossary | parser · spider · tests | json |
| New Zealand | Justice Glossary | parser · spider · tests | json |
| USA | US Courts Glossary | parser · spider · tests | json |
| USA | USCIS Glossary | parser · spider · tests | json |
| USA | Criminal Glossary | parser · spider · tests | json |
| Intergovernmental | Rome Statute | parser · spider · tests | json |
**Tip:** The USA Courts Glossary spider and parser are the best examples of our new architecture and coding style.

The spiders retrieve HTML pages and output well-formed JSON. All glossary parsers output the same JSON format.
First, we can see which spiders are available:
$ scrapy list
aus_designip_glossary
aus_dv_glossary
aus_ip_glossary
aus_lawhandbook_glossary
can_parliamentary_glossary
can_patents_glossary
gbr_cpr_glossary
gbr_fpr_glossary
int_rome_statute
irl_courts_glossary
nzl_justice_glossary
usa_courts_glossary
usa_criminal_glossary
usa_uscis_glossary
...
Then we can run one of the spiders, e.g. the USA Courts Glossary:
$ scrapy crawl --overwrite-output tmp/output.json usa_courts_glossary
Here's a snippet of the output:
{
  "phrase": "Sentence",
  "definition": "The punishment ordered by a court for a defendant convicted of a crime."
},
{
  "phrase": "Sentencing guidelines",
  "definition": "A set of rules and principles established by the United States Sentencing Commission that trial judges use to determine the sentence for a convicted defendant."
},
{
  "phrase": "Sequester",
  "definition": "To separate. Sometimes juries are sequestered from outside influences during their deliberations."
},
**Note:** See the wiki for a deep dive into our parsing strategy.
I'm using asdf because its Homebrew distribution is more up-to-date than pyenv's.
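For reference, a typical asdf-via-Homebrew setup looks something like this (exact commands may vary with your asdf version):

```
$ brew install asdf
$ asdf plugin add python
$ asdf install python latest
```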
Poetry for dependency management
It's always a good idea to make sure the current deps are installed:
poetry install
The pytest tests run easily:
poetry run pytest
I use pytest-watcher to re-run them whenever a file changes; it's installed with the project's dependencies:
poetry run ptw .
- Java is required by the Python Tika package.
- Pylance/Pyright for type-checking
To add a new glossary crawler (a sketch of the resulting files follows these steps):

- Create the parser in `public_law/glossaries/parsers/{jurisdiction}/`:
  - Write a pure `parse_entries(html: HtmlResponse) -> tuple[GlossaryEntry, ...]` function
  - Focus only on HTML → data extraction, no metadata
  - Add parser tests under `tests/glossaries/parsers/{jurisdiction}/`
- Create the spider in `public_law/glossaries/spiders/{jurisdiction}/`:
  - Inherit from `EnhancedAutoGlossarySpider`
  - Implement a `get_metadata(response: HtmlResponse) -> Metadata` method
  - Configure the `name` and `start_urls` attributes
  - Add spider tests under `tests/glossaries/spiders/{jurisdiction}/`
- Test and run:
  - Run tests: `poetry run pytest tests/glossaries/{parsers,spiders}/{jurisdiction}/`
  - Run the spider: `scrapy crawl --overwrite-output tmp/output.json {spider_name}`
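As a rough illustration, here's what the parser/spider pair might look like. This is a minimal sketch: the `xyz` jurisdiction, the URL, the CSS selectors, and the import paths for `GlossaryEntry`, `Metadata`, and `EnhancedAutoGlossarySpider` are assumptions for illustration, not the project's actual layout.

```python
# public_law/glossaries/parsers/xyz/courts_glossary.py -- hypothetical jurisdiction "xyz"
from scrapy.http import HtmlResponse

from public_law.shared.models import GlossaryEntry  # assumed import path


def parse_entries(html: HtmlResponse) -> tuple[GlossaryEntry, ...]:
    """Pure HTML -> data extraction; no metadata concerns."""
    # Assumes the page is a <dl> of <dt>/<dd> pairs; adjust selectors per site.
    return tuple(
        GlossaryEntry(
            phrase=dt.css("::text").get(default="").strip(),
            definition=dd.css("::text").get(default="").strip(),
        )
        for dt, dd in zip(html.css("dl dt"), html.css("dl dd"))
    )
```

```python
# public_law/glossaries/spiders/xyz/courts_glossary.py -- hypothetical jurisdiction "xyz"
from scrapy.http import HtmlResponse

from public_law.shared.models import Metadata                         # assumed import path
from public_law.glossaries.spiders import EnhancedAutoGlossarySpider  # assumed import path


class XYZCourtsGlossary(EnhancedAutoGlossarySpider):
    name = "xyz_courts_glossary"                   # shows up in `scrapy list`
    start_urls = ["https://example.gov/glossary"]  # hypothetical source URL

    def get_metadata(self, response: HtmlResponse) -> Metadata:
        # Build the shared Metadata model (title, source URL, etc.);
        # the exact fields are project-specific, so they're elided here.
        ...
```

With those two files and their tests in place, the spider appears in `scrapy list` and the crawl command above works unchanged.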
To add a new legal text crawler (a sketch follows this list):

- Add a new spider under `public_law/legal_texts/spiders/{jurisdiction}/`.
- Write a parser in `public_law/legal_texts/parsers/{jurisdiction}/` that extracts document structure and metadata.
- Add a test case under `tests/legal_texts/parsers/{jurisdiction}/`.
- Run the spider using `scrapy crawl --overwrite-output tmp/output.json {spider_name}`.
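The legal text spiders don't have a documented base class in this section, so this sketch uses a plain `scrapy.Spider`; the module path, URL, and `parse_statute` helper are hypothetical.

```python
# public_law/legal_texts/spiders/xyz/example_statute.py -- hypothetical jurisdiction "xyz"
from scrapy import Spider
from scrapy.http import HtmlResponse

from public_law.legal_texts.parsers.xyz.example_statute import parse_statute  # hypothetical parser


class XYZExampleStatute(Spider):
    name = "xyz_example_statute"
    start_urls = ["https://example.gov/statutes/example"]  # hypothetical source URL

    def parse(self, response: HtmlResponse):
        # Delegate to the pure parser, which extracts document structure and metadata.
        yield parse_statute(response)
```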
The repository follows a business-domain-first organization:
- Glossaries: Legal term definitions and dictionaries
- Legal Texts: Full legal documents, statutes, and regulations
- Shared: Common utilities, base classes, and models used across domains
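Concretely, that maps onto a layout like this (the `{jurisdiction}` paths follow the convention shown above; individual entries are illustrative):

```
public_law/
├── glossaries/
│   ├── parsers/{jurisdiction}/
│   └── spiders/{jurisdiction}/
├── legal_texts/
│   ├── parsers/{jurisdiction}/
│   └── spiders/{jurisdiction}/
└── shared/        # common utilities, base classes, models
```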
Need help? Just ask in GitHub Issues or ping @robb.