The version of Zorro used to evaluate BabyBERTa in the paper "BabyBERTa: Learning More Grammar With Small-Scale Child-Directed Language", published in the proceedings of CoNLL 2021, is out of date. This repository contains an updated version with title-cased proper nouns and different pairings of content words. To use the version used by the authors, please refer to the commit tagged "CoNLL2021".
Inspired by BLiMP, Zorro is a Python project for creating minimal pairs that exhibit a variety of grammatical contrasts, used to quantify the grammatical knowledge of masked language models.
Sentences are created using templates, filled with words from custom, human-curated word lists (a toy sketch of this idea follows the list below). There are 13 phenomena, each consisting of one or more paradigms:
- determiner-subject agreement: across a prepositional phrase or relative clause; in questions with 1 or 2 verbs
- subject-verb agreement: across 0, 1, or 2 adjectives
- anaphor agreement: gender
- argument structure: dropped arguments, swapped arguments, transitive
- binding: principle A
- case: subjective pronoun (not in BLiMP)
- ellipsis: N-bar
- filler-gap: wh-question object or subject
- irregular: verb
- island-effects: adjunct, coordinate structure constraint
- local attractor: in question with auxiliary (not in BLiMP)
- NPI licensing: "only" licensor
- quantifiers: existential there, superlative
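To make the template idea concrete, here is a toy sketch. The template and word lists below are made up for illustration; they are not the repository's actual paradigms or curated word lists:

```python
# Toy illustration of the template idea: a single template plus small word
# lists yields minimal pairs for a subject-verb agreement contrast.
from itertools import product

NOUNS_SINGULAR = ["dog", "story"]        # hypothetical word list
NOUNS_PLURAL = ["dogs", "stories"]       # hypothetical word list
VERBS_SINGULAR = ["disappears", "falls"]

def make_pairs():
    """Yield (ungrammatical, grammatical) pairs for a toy subject-verb agreement paradigm."""
    for (sg, pl), verb in product(zip(NOUNS_SINGULAR, NOUNS_PLURAL), VERBS_SINGULAR):
        yield f"the {pl} {verb} .", f"the {sg} {verb} ."  # plural subject is the violation

for bad, good in make_pairs():
    print(bad)
    print(good)
```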
The data is located in sentences/babyberta.
Each text file represents one paradigm and contains 4,000 sentences, making up 2,000 minimal pairs.
The grammatical correctness of each sentence is determined by its position in the text file:
- sentences on odd numbered lines (1, 3, etc.) are un-grammatical
- sentences on even numbered lines (2, 4, etc.) are grammatical
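Given this convention, a paradigm file can be loaded into minimal pairs with a few lines of Python. The file name below is only illustrative; any file under sentences/babyberta follows the same layout:

```python
# Minimal sketch: read one paradigm file into (ungrammatical, grammatical)
# pairs, relying on the odd/even line convention described above.
from pathlib import Path

def load_pairs(path):
    lines = Path(path).read_text().splitlines()
    # index 0 = line 1 (ungrammatical), index 1 = line 2 (grammatical), etc.
    return list(zip(lines[0::2], lines[1::2]))

pairs = load_pairs("sentences/babyberta/some_paradigm.txt")  # illustrative path
print(len(pairs))  # 2,000 pairs per paradigm (4,000 sentences)
```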
Words that make up test sentences are all derived from frequent nouns, verbs and adjectives in 5M words of child-directed speech, 5M words of child-directed written text from the Newsela corpus, and 10M words of adult-directed written text from English Wikipedia.
Note: All words were derived from whole words in the BPE vocabulary of BabyBERTa, which was trained on the above corpora using the Python tokenizers package.
Because BabyBERTa is trained on lower-cased data, all words in test sentences, except for proper nouns, are lower-cased.
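For illustration, the whole-word check can be approximated as follows. This is a sketch rather than the repository's actual code: the tokenizer path is hypothetical, and the "Ġ" prefix assumes a byte-level BPE vocabulary saved by the tokenizers package:

```python
# Sketch: test whether a word is a single (whole-word) entry in a BPE vocabulary.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # hypothetical path
vocab = tokenizer.get_vocab()

def is_whole_word(word: str) -> bool:
    word = word.lower()  # test sentences are lower-cased, except proper nouns
    # byte-level BPE marks word-initial tokens with a leading "Ġ"
    return word in vocab or ("Ġ" + word) in vocab

print(is_whole_word("dog"))
```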
- Using scripts/tag_and_count_vocab_words.py, we removed any word that is:
- not a whole word in original corpus files (it is a sub-word)
- not in the English dictionary
- a Stanford CoreNLP stop-word
- Using scripts/chose_legal_words.py (see the sketch after this list), we:
- automatically retrieved words tagged with desired POS
- manually tagged words as legal or illegal
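The following sketch condenses both selection steps. It uses stand-in resources (NLTK's word list, stop-word list, and POS tagger) rather than the English dictionary, Stanford CoreNLP stop-words, and tagging actually used by the scripts, so treat it as an approximation of the workflow, not a reimplementation:

```python
# Approximate sketch of the two word-selection steps, with stand-in resources.
import nltk
from nltk.corpus import stopwords, words

for resource in ("stopwords", "words", "averaged_perceptron_tagger"):
    nltk.download(resource, quiet=True)

english = {w.lower() for w in words.words()}   # stand-in for an English dictionary
stops = set(stopwords.words("english"))        # stand-in for CoreNLP stop-words

def filter_vocab(whole_words):
    """Step 1: drop words that are not dictionary words or that are stop-words."""
    return [w for w in whole_words if w in english and w not in stops]

def candidates_by_pos(filtered, desired_tag="NN"):
    """Step 2: automatically retrieve candidates with the desired POS tag;
    the final legal/illegal decision is made by hand."""
    return [w for w, tag in nltk.pos_tag(filtered) if tag == desired_tag]

print(candidates_by_pos(filter_vocab(["dog", "the", "quickly", "house"])))
```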
Currently, information about each sentence's template is saved but unused. This information could support more in-depth analyses, such as breaking down performance by singular vs. plural nouns or verbs.
- get the vocab from which words will be chosen for inclusion in test sentences, using scripts/tag_and_count_vocab_words.py
- choose words to be included, by part of speech, using scripts/chose_legal_words.py
- make and save test sentences using scripts/make_sentences.py
Use UnMasked to load a custom model from the huggingface model hub. The evaluation script is located at UnMasked/scripts/score_model_from_repo.py.
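For orientation, here is a minimal sketch of one common way to score a masked language model on a minimal pair: pseudo-log-likelihood, where each token is masked in turn and the log-probabilities of the original tokens are summed. This only illustrates the general idea and is not necessarily how score_model_from_repo.py computes its scores; the model name is an example assumed to be available on the huggingface model hub:

```python
# Sketch: pseudo-log-likelihood scoring of a minimal pair with a masked LM.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

name = "phueb/BabyBERTa-1"  # example hub model
tokenizer = AutoTokenizer.from_pretrained(name, add_prefix_space=True)
model = AutoModelForMaskedLM.from_pretrained(name)
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip special tokens at the edges
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

bad = "the dogs near the house falls ."
good = "the dog near the house falls ."
print(pseudo_log_likelihood(good) > pseudo_log_likelihood(bad))  # expect True
```

In this kind of evaluation, a model is typically counted as correct on a pair when it assigns the higher score to the grammatical sentence.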