Text correction benchmarks

Benchmarks and baselines for various text correction tasks. Lightweight and easy to use.

Installation

pip install text-correction-benchmarks

or

git clone https://github.com/ad-freiburg/text-correction-benchmarks
cd text-correction-benchmarks && pip install .

After installation you will have two commands available to you:

tcb.evaluate for evaluating model predictions on benchmarks
tcb.baseline for running baselines on benchmarks

Usage

This repository contains benchmarks for text correction tasks such as

Whitespace correction (wsc)
Spelling error correction (sec)
Word-level spelling error detection (sedw)
Sequence-level spelling error detection (seds)

in the following simple text file format:

Whitespace correction
- corrupt.txt: Input text with whitespace errors (can also contain spelling errors, but they remain uncorrected in the groundtruth)
- correct.txt: Groundtruth text without whitespace errors
"Th isis a tset." > "This is a tset."
Spelling error correction
- corrupt.txt: Input text with spelling errors (can also contain whitespace errors to make the task harder)
- correct.txt: Groundtruth text without whitespace and spelling errors
"Th isis a tset." > "This is a test."
Word-level spelling error detection
- corrupt.txt: Input text with spelling errors (should not contain whitespace errors, since we assume they are already fixed)
- correct.txt: Groundtruth label for each word in the input (split by whitespace), indicating whether a word contains a spelling error (1) or not (0)
"This is a tset." > "0 0 0 1"
Sequence-level spelling error detection
- corrupt.txt: Input text with spelling errors (can also contain whitespace errors)
- correct.txt: Groundtruth label for each sequence in the input, indicating whether the sequence contains a spelling error (1) or not (0)
"this is a tset." > "1"

For each format one line corresponds to one benchmark sample.
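The line-aligned format above can be read with a few lines of Python. This is a minimal sketch, not part of the package: the helper names `read_benchmark` and `parse_word_labels` are hypothetical, and the sketch only assumes that corrupt.txt and correct.txt sit in the same benchmark directory and have the same number of lines.

```python
from pathlib import Path
from typing import List, Tuple


def _read_lines(path: Path) -> List[str]:
    # One line corresponds to one benchmark sample; strip only the line ending.
    with open(path, encoding="utf8") as f:
        return [line.rstrip("\r\n") for line in f]


def read_benchmark(benchmark_dir: str) -> List[Tuple[str, str]]:
    """Pair line i of corrupt.txt with line i of correct.txt (hypothetical helper)."""
    directory = Path(benchmark_dir)
    corrupt = _read_lines(directory / "corrupt.txt")
    correct = _read_lines(directory / "correct.txt")
    if len(corrupt) != len(correct):
        raise ValueError(
            f"expected the same number of lines, got {len(corrupt)} and {len(correct)}"
        )
    return list(zip(corrupt, correct))


def parse_word_labels(groundtruth_line: str) -> List[int]:
    """Turn a word-level detection groundtruth line like "0 0 0 1" into integers."""
    return [int(label) for label in groundtruth_line.split()]
```

For the correction tasks each pair is (input text, groundtruth text); for the detection tasks the second element holds the labels, which `parse_word_labels` converts for the word-level case.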

Note that for some benchmarks we also provide splits other than test, e.g. dev or tuning, which can be used to assess the performance of your method during development. Final evaluations should always be done on the test split.

To evaluate predictions on a benchmark using tcb.evaluate, the following procedure is recommended:

Run your model on benchmarks/<split>/<task>/<benchmark>/corrupt.txt
Save your predictions in the expected format for the benchmark in benchmarks/<split>/<task>/<benchmark>/predictions/<model_name>.txt

Evaluate your predictions on a benchmark using tcb.evaluate:

# evaluate your predictions on the benchmark
tcb.evaluate -b benchmarks/<split>/<task>/<benchmark>

# optionally sort by some metric and highlight the best predictions
tcb.evaluate -b benchmarks/<split>/<task>/<benchmark> --sort "<metric>" --highlight
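The first two steps of the procedure (running a model on corrupt.txt and saving its output under predictions/<model_name>.txt) could be scripted roughly as follows. This is a sketch under assumptions: `write_predictions` is a hypothetical helper, not a function of this package, and `predict` stands in for whatever model you want to evaluate.

```python
from pathlib import Path
from typing import Callable


def write_predictions(
    benchmark_dir: str,
    model_name: str,
    predict: Callable[[str], str],
) -> Path:
    """Run `predict` on every line of corrupt.txt and save one prediction per line."""
    directory = Path(benchmark_dir)
    out_dir = directory / "predictions"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{model_name}.txt"
    with open(directory / "corrupt.txt", encoding="utf8") as inf, open(
        out_file, "w", encoding="utf8"
    ) as outf:
        for line in inf:
            # Keep the one-line-per-sample layout of the benchmark files.
            outf.write(predict(line.rstrip("\r\n")) + "\n")
    return out_file
```

Calling `write_predictions("benchmarks/<split>/<task>/<benchmark>", "my_model", my_model_fn)` with your own `my_model_fn` leaves the predictions file in the location that the tcb.evaluate calls above expect.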

You can also evaluate across multiple benchmarks like so:

# when evaluating across multiple benchmarks you always need to specify a metric,
# otherwise you will get an error 

# listing multiple benchmarks
tcb.evaluate -b benchmarks/<split>/<task>/<benchmark1> \
    benchmarks/<split>/<task>/<benchmark2> ... -m "<metric>"

# using glob pattern
tcb.evaluate -b benchmarks/<split>/<task>/<gl*ob_patte*rn> -m "<metric>"

# you can highlight the best predictions per benchmark
tcb.evaluate -b benchmarks/<split>/<task>/<gl*ob_patte*rn> -m "<metric>" --highlight

Depending on the task the following metrics are calculated:

Whitespace correction
- F1 (micro-averaged)
- F1 (sequence-averaged)
- Sequence accuracy
Spelling error correction
- F1 (micro-averaged)
- F1 (sequence-averaged)
- Sequence accuracy
Word-level spelling error detection
- Word accuracy
- Binary F1 (micro-averaged)
Sequence-level spelling error detection
- Binary F1
- Sequence accuracy
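To make two of the metric names above concrete, here is a sketch of sequence accuracy and binary F1 using their standard textbook definitions. It is not taken from the tcb.evaluate implementation, which may handle edge cases differently; the function names are hypothetical, and returning 0.0 when there are no positive predictions or labels is a convention chosen for this sketch.

```python
from typing import Sequence


def sequence_accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of samples whose prediction matches the groundtruth exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


def binary_f1(predictions: Sequence[int], targets: Sequence[int]) -> float:
    """F1 for the positive class (1 = contains a spelling error)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    tp = sum(p == 1 and t == 1 for p, t in zip(predictions, targets))
    fp = sum(p == 1 and t == 0 for p, t in zip(predictions, targets))
    fn = sum(p == 0 and t == 1 for p, t in zip(predictions, targets))
    denominator = 2 * tp + fp + fn
    # Assumption of this sketch: F1 is 0.0 if nothing is predicted or labeled positive.
    return 2 * tp / denominator if denominator > 0 else 0.0
```

A micro-averaged variant for word-level detection would first concatenate the word labels of all sequences and then call `binary_f1` once on the flattened lists.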

Baselines

We also provide baselines for each task:

Whitespace correction:
- Dummy (wsc_dummy)
Spelling error correction:
- Dummy (sec_dummy)
- Close to dictionary (sec_ctd)
- Norvig (sec_norvig)
- Aspell (sec_aspell)
- Hunspell (sec_hunspell)
- Jamspell (sec_jamspell)
- Neuspell (BERT) (sec_neuspell_bert)
Word-level spelling error detection:
- Dummy (sedw_dummy)
- Out of dictionary (sedw_ood)
- From spelling error correction¹ (sedw_from_sec)
Sequence-level spelling error detection:
- Dummy (seds_dummy)
- Out of dictionary (seds_ood)
- From spelling error correction¹ (seds_from_sec)

The dummy baselines produce the predictions one gets by leaving the inputs unchanged.

¹ We can reuse spelling error correction baselines to detect spelling errors both on a word and sequence level. For the word level we simply predict that all words changed by a spelling corrector contain a spelling error. For the sequence level we predict that a sequence contains a spelling error if it is changed by a spelling corrector. All spelling error correction baselines or prediction files can be used as underlying spelling correctors for this purpose.
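The idea in the footnote can be sketched as follows. This is an illustration, not the code behind sedw_from_sec or seds_from_sec: the function names are hypothetical, and the word-level variant assumes that the spelling corrector keeps the number of whitespace-separated words unchanged.

```python
from typing import List


def word_labels_from_correction(corrupt: str, corrected: str) -> List[int]:
    """Label a word 1 if the spelling corrector changed it, else 0."""
    corrupt_words = corrupt.split()
    corrected_words = corrected.split()
    if len(corrupt_words) != len(corrected_words):
        # Assumption of this sketch: the corrector neither splits nor merges words.
        raise ValueError("corrector output has a different number of words")
    return [int(a != b) for a, b in zip(corrupt_words, corrected_words)]


def sequence_label_from_correction(corrupt: str, corrected: str) -> int:
    """Label a sequence 1 if the spelling corrector changed it at all, else 0."""
    return int(corrupt != corrected)
```

For the input "This is a tset." and the corrector output "This is a test.", the word-level labels are `[0, 0, 0, 1]` and the sequence-level label is `1`.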

You can run a baseline using tcb.baseline:

# run baseline on stdin and output to stdout
tcb.baseline <baseline_name>

# run baseline on file and output to stdout
tcb.baseline <baseline_name> -f <input_file>

# run baseline on file and write predictions to file
tcb.baseline <baseline_name> -f <input_file> -o <output_file>

# some baselines require you to pass additional arguments,
# you will get error messages if you don't
# e.g. all dictionary based baselines like the out of dictionary baseline
# for word-level spelling error detection need the path to a dictionary
# as additional argument
tcb.baseline sedw_ood -f <input_file> --dictionary <dictionary_file>

Dictionaries can be found in the dictionaries directory of this repository.

Predictions of the baselines and other models from the literature can be found in a subdirectory predictions in each benchmark, see e.g. here.

This repository is backed by the text-correction-utils package.
