[Track] DeepSeek V3/R1 accuracy #3486

Open
zhyncs opened this issue Feb 11, 2025 · 1 comment
zhyncs (Member) commented Feb 11, 2025

conclusion

The GSM8K and MMLU accuracies match the officially reported numbers.

server

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code
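Once the server is up, it can be smoke-tested through its OpenAI-compatible chat endpoint before running the benchmarks. A minimal sketch of building such a request (the default port 30000 and the exact endpoint path are assumptions; check the launch logs):

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions payload for the server above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding for reproducible evals
    }

payload = build_chat_request("deepseek-ai/DeepSeek-R1", "What is 8 * 7?")
print(json.dumps(payload, indent=2))
# POST the payload to the server's OpenAI-compatible endpoint, e.g.
#   curl http://localhost:30000/v1/chat/completions \
#        -H 'Content-Type: application/json' -d @payload.json
```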

gsm8k

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
Accuracy: 0.955
Invalid: 0.000
Latency: 109.212 s
Output throughput: 1244.611 token/s
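For context, GSM8K scoring harnesses typically extract the final number from each completion and compare it to the reference answer; a minimal sketch of that heuristic (assumed here, not necessarily identical to what benchmark/gsm8k/bench_sglang.py does):

```python
import re
from typing import Optional

def extract_answer(completion: str) -> Optional[str]:
    """Take the last number in the generated text as the prediction."""
    numbers = re.findall(r"-?\d+\.?\d*", completion.replace(",", ""))
    return numbers[-1] if numbers else None

def is_correct(completion: str, label: str) -> bool:
    """Compare the extracted prediction to the reference answer numerically."""
    pred = extract_answer(completion)
    return pred is not None and float(pred) == float(label)
```

Invalid rate in the log (0.000 here) would correspond to completions from which no number could be extracted.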

mmlu

bash benchmark/mmlu/download_data.sh
python3 benchmark/mmlu/bench_sglang.py --nsub 100 --ntrain 5 --parallel 2000
subject: abstract_algebra, #q:100, acc: 0.750
subject: anatomy, #q:135, acc: 0.844
subject: astronomy, #q:152, acc: 0.941
subject: business_ethics, #q:100, acc: 0.870
subject: clinical_knowledge, #q:265, acc: 0.921
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.630
subject: college_computer_science, #q:100, acc: 0.860
subject: college_mathematics, #q:100, acc: 0.770
subject: college_medicine, #q:173, acc: 0.884
subject: college_physics, #q:102, acc: 0.833
subject: computer_security, #q:100, acc: 0.880
subject: conceptual_physics, #q:235, acc: 0.928
subject: econometrics, #q:114, acc: 0.754
subject: electrical_engineering, #q:145, acc: 0.883
subject: elementary_mathematics, #q:378, acc: 0.942
subject: formal_logic, #q:126, acc: 0.794
subject: global_facts, #q:100, acc: 0.670
subject: high_school_biology, #q:310, acc: 0.955
subject: high_school_chemistry, #q:203, acc: 0.847
subject: high_school_computer_science, #q:100, acc: 0.950
subject: high_school_european_history, #q:165, acc: 0.891
subject: high_school_geography, #q:198, acc: 0.965
subject: high_school_government_and_politics, #q:193, acc: 0.990
subject: high_school_macroeconomics, #q:390, acc: 0.921
subject: high_school_mathematics, #q:270, acc: 0.756
subject: high_school_microeconomics, #q:238, acc: 0.966
subject: high_school_physics, #q:151, acc: 0.828
subject: high_school_psychology, #q:545, acc: 0.971
subject: high_school_statistics, #q:216, acc: 0.856
subject: high_school_us_history, #q:204, acc: 0.956
subject: high_school_world_history, #q:237, acc: 0.945
subject: human_aging, #q:223, acc: 0.852
subject: human_sexuality, #q:131, acc: 0.939
subject: international_law, #q:121, acc: 0.959
subject: jurisprudence, #q:108, acc: 0.917
subject: logical_fallacies, #q:163, acc: 0.920
subject: machine_learning, #q:112, acc: 0.786
subject: management, #q:103, acc: 0.932
subject: marketing, #q:234, acc: 0.949
subject: medical_genetics, #q:100, acc: 0.940
subject: miscellaneous, #q:783, acc: 0.957
subject: moral_disputes, #q:346, acc: 0.887
subject: moral_scenarios, #q:895, acc: 0.773
subject: nutrition, #q:306, acc: 0.915
subject: philosophy, #q:311, acc: 0.897
subject: prehistory, #q:324, acc: 0.935
subject: professional_accounting, #q:282, acc: 0.865
subject: professional_law, #q:1534, acc: 0.702
subject: professional_medicine, #q:272, acc: 0.949
subject: professional_psychology, #q:612, acc: 0.913
subject: public_relations, #q:110, acc: 0.836
subject: security_studies, #q:245, acc: 0.890
subject: sociology, #q:201, acc: 0.960
subject: us_foreign_policy, #q:100, acc: 0.930
subject: virology, #q:166, acc: 0.584
subject: world_religions, #q:171, acc: 0.924
Total latency: 274.759
Average accuracy: 0.871
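The final "Average accuracy" can be cross-checked from the per-subject rows. A sketch that parses lines in the format above and computes both the subject-level (macro) and question-weighted (micro) means; which of the two the script actually reports is an assumption to verify:

```python
import re

ROW = re.compile(r"subject: (\S+), #q:(\d+), acc: ([\d.]+)")

def summarize(log: str):
    """Return (macro, micro) accuracy from 'subject: ..., #q:N, acc: A' rows."""
    rows = [(int(n), float(a)) for _, n, a in ROW.findall(log)]
    macro = sum(a for _, a in rows) / len(rows)                    # mean over subjects
    micro = sum(n * a for n, a in rows) / sum(n for n, _ in rows)  # mean over questions
    return macro, micro

log = """subject: abstract_algebra, #q:100, acc: 0.750
subject: anatomy, #q:135, acc: 0.844"""
print(summarize(log))
```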
@yinfan98
Copy link
Contributor

yinfan98 commented Feb 11, 2025

DeepSeek-V3 accuracy on 8 × H20 GPUs, cc: @zhyncs

Server

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --mem-fraction-static 0.9

gsm8k

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
Accuracy: 0.950
Invalid: 0.000
Latency: 236.747 s
Output throughput: 587.916 token/s

mmlu

bash benchmark/mmlu/download_data.sh
python3 benchmark/mmlu/bench_sglang.py --nsub 100 --ntrain 5 --parallel 2000
subject: abstract_algebra, #q:100, acc: 0.820
subject: anatomy, #q:135, acc: 0.881
subject: astronomy, #q:152, acc: 0.934
subject: business_ethics, #q:100, acc: 0.870
subject: clinical_knowledge, #q:265, acc: 0.917
subject: college_biology, #q:144, acc: 0.965
subject: college_chemistry, #q:100, acc: 0.650
subject: college_computer_science, #q:100, acc: 0.830
subject: college_mathematics, #q:100, acc: 0.800
subject: college_medicine, #q:173, acc: 0.867
subject: college_physics, #q:102, acc: 0.814
subject: computer_security, #q:100, acc: 0.890
subject: conceptual_physics, #q:235, acc: 0.949
subject: econometrics, #q:114, acc: 0.807
subject: electrical_engineering, #q:145, acc: 0.876
subject: elementary_mathematics, #q:378, acc: 0.944
subject: formal_logic, #q:126, acc: 0.810
subject: global_facts, #q:100, acc: 0.730
subject: high_school_biology, #q:310, acc: 0.958
subject: high_school_chemistry, #q:203, acc: 0.897
subject: high_school_computer_science, #q:100, acc: 0.950
subject: high_school_european_history, #q:165, acc: 0.885
subject: high_school_geography, #q:198, acc: 0.960
subject: high_school_government_and_politics, #q:193, acc: 0.990
subject: high_school_macroeconomics, #q:390, acc: 0.931
subject: high_school_mathematics, #q:270, acc: 0.752
subject: high_school_microeconomics, #q:238, acc: 0.954
subject: high_school_physics, #q:151, acc: 0.834
subject: high_school_psychology, #q:545, acc: 0.961
subject: high_school_statistics, #q:216, acc: 0.861
subject: high_school_us_history, #q:204, acc: 0.961
subject: high_school_world_history, #q:237, acc: 0.949
subject: human_aging, #q:223, acc: 0.870
subject: human_sexuality, #q:131, acc: 0.924
subject: international_law, #q:121, acc: 0.975
subject: jurisprudence, #q:108, acc: 0.907
subject: logical_fallacies, #q:163, acc: 0.914
subject: machine_learning, #q:112, acc: 0.857
subject: management, #q:103, acc: 0.961
subject: marketing, #q:234, acc: 0.962
subject: medical_genetics, #q:100, acc: 0.960
subject: miscellaneous, #q:783, acc: 0.962
subject: moral_disputes, #q:346, acc: 0.864
subject: moral_scenarios, #q:895, acc: 0.806
subject: nutrition, #q:306, acc: 0.922
subject: philosophy, #q:311, acc: 0.929
subject: prehistory, #q:324, acc: 0.935
subject: professional_accounting, #q:282, acc: 0.869
subject: professional_law, #q:1534, acc: 0.720
subject: professional_medicine, #q:272, acc: 0.952
subject: professional_psychology, #q:612, acc: 0.907
subject: public_relations, #q:110, acc: 0.809
subject: security_studies, #q:245, acc: 0.869
subject: sociology, #q:201, acc: 0.945
subject: us_foreign_policy, #q:100, acc: 0.950
subject: virology, #q:166, acc: 0.578
subject: world_religions, #q:171, acc: 0.930
Total latency: 435.171
Average accuracy: 0.878
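Putting the two runs side by side (numbers copied verbatim from the comments above; the hardware differs between runs, so throughput is not directly comparable):

```python
# Headline numbers copied from the two runs above.
results = {
    "DeepSeek-R1": {"gsm8k": 0.955, "mmlu": 0.871, "output_tok_s": 1244.611},
    "DeepSeek-V3": {"gsm8k": 0.950, "mmlu": 0.878, "output_tok_s": 587.916},
}

for model, r in results.items():
    print(f"{model}: gsm8k={r['gsm8k']:.3f}  mmlu={r['mmlu']:.3f}  "
          f"output={r['output_tok_s']:.0f} tok/s")
```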
