Skip to content

dataset of csc, chinese spelling correct / check; 14万中文文本纠错数据集(中文拼写纠错), 主要是"地得的"

License

Notifications You must be signed in to change notification settings

yongzhuo/csc_dataset_de3

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

license task_categories language tags
apache-2.0
text2text-generation
text-generation
zh
csc
text-correct
macro-correct
chinese-spelling-correct
corrector

csc_public_de3数据集

数据来源

  • 1.由人民日报/学习强国/chinese-poetry等高质量数据人工生成;
  • 2.来自人民日报高质量语料;
  • 3.来自学习强国网站的高质量语料;
  • 4.源数据为qwen生成的好词好句;
  • 5.古诗词chinese-poetry; 文言文garychowcmu/daizhigev20;

测评指标

模型[Macropodus/macbert4mdcspell_v1](https://huggingface.co/Macropodus/macbert4mdcspell_v1)
common Sentence Level detection: acc:0.9547, precision:1.0000, recall:0.9547, f1:0.9768
common Sentence Level correction: acc:0.9309, precision:1.0000, recall:0.9309, f1:0.9642
strict Sentence Level detection: acc:0.9547, precision:0.9629, recall:0.9547, f1:0.9588
strict Sentence Level correction: acc:0.9309, precision:0.9389, recall:0.9309, f1:0.9349
common Token Level correction: precision:0.8865, recall:0.9898, f1:0.9353
common TOken Level correction: precision:0.9797, recall:0.9828, f1:0.9812

数据简介

  • 该数据主要为'的地得'纠错;
  • 其中训练数据130753条, 验证数据5545条, 测试数据5545条;
  • 句子平均长度为36, 最长句子长度为414, 最短为5, 95%的为89, 75%的为46, 60%的为34;
  • 每个句子中字的平均错误数为2;
  • 全量数据在HF(包括train.json):https://huggingface.co/datasets/Macropodus/csc_public_de3

数据详情

################################################################################################################################
train.json
130753
avg_errors: 2.074751630937722
avg_length: 36.39140210932063
414
5
95%:  89
90%:  69
85%:  58
80%:  51
75%:  46
70%:  41
60%:  34
50%:  29
################################################################################################################################
dev.json
5545
avg_errors: 2.047069431920649
avg_length: 35.767177637511274
278
5
95%:  85
90%:  66
85%:  56
80%:  49
75%:  45
70%:  41
60%:  35
50%:  29
################################################################################################################################
tet.json
5545
avg_errors: 2.028494138863841
avg_length: 35.982326420198376
277
5
95%:  88
90%:  66
85%:  56
80%:  49
75%:  45
70%:  41
60%:  34
50%:  29

Cite

For citing this dataset, you can refer to the present GitHub project. For example, with BibTeX:

@software{macro-correct,
    url = {https://github.com/yongzhuo/macro-correct},
    author = {Yongzhuo Mo},
    title = {macro-correct},
    year = {2025}

About

dataset of csc, chinese spelling correct / check; 14万中文文本纠错数据集(中文拼写纠错), 主要是"地得的"

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published