[Track] DeepSeek V3/R1 accuracy #3486

Open
zhyncs opened this issue Feb 11, 2025 · 1 comment
zhyncs (Member) commented Feb 11, 2025

conclusion

The GSM8K and MMLU accuracies match the officially reported numbers.

server

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code
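Once the server is up, it can be smoke-tested through its OpenAI-compatible chat endpoint before running the benchmarks. A minimal sketch of building such a request (the default port 30000 and the exact endpoint path are assumptions; check the launch logs):

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions payload for the server above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding for reproducible evals
    }

payload = build_chat_request("deepseek-ai/DeepSeek-R1", "What is 8 * 7?")
print(json.dumps(payload, indent=2))
# POST the payload to the server's OpenAI-compatible endpoint, e.g.
#   curl http://localhost:30000/v1/chat/completions \
#        -H 'Content-Type: application/json' -d @payload.json
```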

gsm8k

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
Accuracy: 0.955
Invalid: 0.000
Latency: 109.212 s
Output throughput: 1244.611 token/s
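For context, GSM8K scoring harnesses typically extract the final number from each completion and compare it to the reference answer; a minimal sketch of that heuristic (assumed here, not necessarily identical to what benchmark/gsm8k/bench_sglang.py does):

```python
import re
from typing import Optional

def extract_answer(completion: str) -> Optional[str]:
    """Take the last number in the generated text as the prediction."""
    numbers = re.findall(r"-?\d+\.?\d*", completion.replace(",", ""))
    return numbers[-1] if numbers else None

def is_correct(completion: str, label: str) -> bool:
    """Compare the extracted prediction to the reference answer numerically."""
    pred = extract_answer(completion)
    return pred is not None and float(pred) == float(label)
```

Invalid rate in the log (0.000 here) would correspond to completions from which no number could be extracted.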

mmlu

bash benchmark/mmlu/download_data.sh
python3 benchmark/mmlu/bench_sglang.py --nsub 100 --ntrain 5 --parallel 2000
subject: abstract_algebra, #q:100, acc: 0.750
subject: anatomy, #q:135, acc: 0.844
subject: astronomy, #q:152, acc: 0.941
subject: business_ethics, #q:100, acc: 0.870
subject: clinical_knowledge, #q:265, acc: 0.921
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.630
subject: college_computer_science, #q:100, acc: 0.860
subject: college_mathematics, #q:100, acc: 0.770
subject: college_medicine, #q:173, acc: 0.884
subject: college_physics, #q:102, acc: 0.833
subject: computer_security, #q:100, acc: 0.880
subject: conceptual_physics, #q:235, acc: 0.928
subject: econometrics, #q:114, acc: 0.754
subject: electrical_engineering, #q:145, acc: 0.883
subject: elementary_mathematics, #q:378, acc: 0.942
subject: formal_logic, #q:126, acc: 0.794
subject: global_facts, #q:100, acc: 0.670
subject: high_school_biology, #q:310, acc: 0.955
subject: high_school_chemistry, #q:203, acc: 0.847
subject: high_school_computer_science, #q:100, acc: 0.950
subject: high_school_european_history, #q:165, acc: 0.891
subject: high_school_geography, #q:198, acc: 0.965
subject: high_school_government_and_politics, #q:193, acc: 0.990
subject: high_school_macroeconomics, #q:390, acc: 0.921
subject: high_school_mathematics, #q:270, acc: 0.756
subject: high_school_microeconomics, #q:238, acc: 0.966
subject: high_school_physics, #q:151, acc: 0.828
subject: high_school_psychology, #q:545, acc: 0.971
subject: high_school_statistics, #q:216, acc: 0.856
subject: high_school_us_history, #q:204, acc: 0.956
subject: high_school_world_history, #q:237, acc: 0.945
subject: human_aging, #q:223, acc: 0.852
subject: human_sexuality, #q:131, acc: 0.939
subject: international_law, #q:121, acc: 0.959
subject: jurisprudence, #q:108, acc: 0.917
subject: logical_fallacies, #q:163, acc: 0.920
subject: machine_learning, #q:112, acc: 0.786
subject: management, #q:103, acc: 0.932
subject: marketing, #q:234, acc: 0.949
subject: medical_genetics, #q:100, acc: 0.940
subject: miscellaneous, #q:783, acc: 0.957
subject: moral_disputes, #q:346, acc: 0.887
subject: moral_scenarios, #q:895, acc: 0.773
subject: nutrition, #q:306, acc: 0.915
subject: philosophy, #q:311, acc: 0.897
subject: prehistory, #q:324, acc: 0.935
subject: professional_accounting, #q:282, acc: 0.865
subject: professional_law, #q:1534, acc: 0.702
subject: professional_medicine, #q:272, acc: 0.949
subject: professional_psychology, #q:612, acc: 0.913
subject: public_relations, #q:110, acc: 0.836
subject: security_studies, #q:245, acc: 0.890
subject: sociology, #q:201, acc: 0.960
subject: us_foreign_policy, #q:100, acc: 0.930
subject: virology, #q:166, acc: 0.584
subject: world_religions, #q:171, acc: 0.924
Total latency: 274.759
Average accuracy: 0.871
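The final "Average accuracy" can be cross-checked from the per-subject rows. A sketch that parses lines in the format above and computes both the subject-level (macro) and question-weighted (micro) means; which of the two the script actually reports is an assumption to verify:

```python
import re

ROW = re.compile(r"subject: (\S+), #q:(\d+), acc: ([\d.]+)")

def summarize(log: str):
    """Return (macro, micro) accuracy from 'subject: ..., #q:N, acc: A' rows."""
    rows = [(int(n), float(a)) for _, n, a in ROW.findall(log)]
    macro = sum(a for _, a in rows) / len(rows)                    # mean over subjects
    micro = sum(n * a for n, a in rows) / sum(n for n, _ in rows)  # mean over questions
    return macro, micro

log = """subject: abstract_algebra, #q:100, acc: 0.750
subject: anatomy, #q:135, acc: 0.844"""
print(summarize(log))
```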
@yinfan98
Copy link
Contributor

yinfan98 commented Feb 11, 2025

DeepSeek-V3 accuracy on 8 × H20 GPUs, cc: @zhyncs

Server

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --mem-fraction-static 0.9

gsm8k

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
Accuracy: 0.950
Invalid: 0.000
Latency: 236.747 s
Output throughput: 587.916 token/s

mmlu

bash benchmark/mmlu/download_data.sh
python3 benchmark/mmlu/bench_sglang.py --nsub 100 --ntrain 5 --parallel 2000
subject: abstract_algebra, #q:100, acc: 0.820
subject: anatomy, #q:135, acc: 0.881
subject: astronomy, #q:152, acc: 0.934
subject: business_ethics, #q:100, acc: 0.870
subject: clinical_knowledge, #q:265, acc: 0.917
subject: college_biology, #q:144, acc: 0.965
subject: college_chemistry, #q:100, acc: 0.650
subject: college_computer_science, #q:100, acc: 0.830
subject: college_mathematics, #q:100, acc: 0.800
subject: college_medicine, #q:173, acc: 0.867
subject: college_physics, #q:102, acc: 0.814
subject: computer_security, #q:100, acc: 0.890
subject: conceptual_physics, #q:235, acc: 0.949
subject: econometrics, #q:114, acc: 0.807
subject: electrical_engineering, #q:145, acc: 0.876
subject: elementary_mathematics, #q:378, acc: 0.944
subject: formal_logic, #q:126, acc: 0.810
subject: global_facts, #q:100, acc: 0.730
subject: high_school_biology, #q:310, acc: 0.958
subject: high_school_chemistry, #q:203, acc: 0.897
subject: high_school_computer_science, #q:100, acc: 0.950
subject: high_school_european_history, #q:165, acc: 0.885
subject: high_school_geography, #q:198, acc: 0.960
subject: high_school_government_and_politics, #q:193, acc: 0.990
subject: high_school_macroeconomics, #q:390, acc: 0.931
subject: high_school_mathematics, #q:270, acc: 0.752
subject: high_school_microeconomics, #q:238, acc: 0.954
subject: high_school_physics, #q:151, acc: 0.834
subject: high_school_psychology, #q:545, acc: 0.961
subject: high_school_statistics, #q:216, acc: 0.861
subject: high_school_us_history, #q:204, acc: 0.961
subject: high_school_world_history, #q:237, acc: 0.949
subject: human_aging, #q:223, acc: 0.870
subject: human_sexuality, #q:131, acc: 0.924
subject: international_law, #q:121, acc: 0.975
subject: jurisprudence, #q:108, acc: 0.907
subject: logical_fallacies, #q:163, acc: 0.914
subject: machine_learning, #q:112, acc: 0.857
subject: management, #q:103, acc: 0.961
subject: marketing, #q:234, acc: 0.962
subject: medical_genetics, #q:100, acc: 0.960
subject: miscellaneous, #q:783, acc: 0.962
subject: moral_disputes, #q:346, acc: 0.864
subject: moral_scenarios, #q:895, acc: 0.806
subject: nutrition, #q:306, acc: 0.922
subject: philosophy, #q:311, acc: 0.929
subject: prehistory, #q:324, acc: 0.935
subject: professional_accounting, #q:282, acc: 0.869
subject: professional_law, #q:1534, acc: 0.720
subject: professional_medicine, #q:272, acc: 0.952
subject: professional_psychology, #q:612, acc: 0.907
subject: public_relations, #q:110, acc: 0.809
subject: security_studies, #q:245, acc: 0.869
subject: sociology, #q:201, acc: 0.945
subject: us_foreign_policy, #q:100, acc: 0.950
subject: virology, #q:166, acc: 0.578
subject: world_religions, #q:171, acc: 0.930
Total latency: 435.171
Average accuracy: 0.878
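Putting the two runs side by side (numbers copied verbatim from the comments above; the hardware differs between runs, so throughput is not directly comparable):

```python
# Headline numbers copied from the two runs above.
results = {
    "DeepSeek-R1": {"gsm8k": 0.955, "mmlu": 0.871, "output_tok_s": 1244.611},
    "DeepSeek-V3": {"gsm8k": 0.950, "mmlu": 0.878, "output_tok_s": 587.916},
}

for model, r in results.items():
    print(f"{model}: gsm8k={r['gsm8k']:.3f}  mmlu={r['mmlu']:.3f}  "
          f"output={r['output_tok_s']:.0f} tok/s")
```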
