
Update Evaluation Logic to Latest lm_eval (0.4.8) and Support Automatic Benchmark Evals w/o Validation Set #1348


Open · wants to merge 12 commits into main

Conversation

@Kyle1668 (Contributor) commented on Mar 21, 2025

I'm training a model on the entire dataset, without splitting it into train/val/test. I still want to evaluate on a set of benchmarks during training, one of which was introduced in a later version of lm_eval. This PR adds support for evaluating against the configured eval_tasks during training even when no validation split is defined, and updates to the latest version of lm_eval (0.4.8).
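At a high level, the change decouples the benchmark evals from the validation loop. Here is a minimal sketch of the intended control flow; the helper functions are hypothetical stand-ins, not the actual entry points in this PR:

```python
# Sketch only (stub helpers): illustrates the control flow this PR enables,
# not the actual functions in megatron/training.py.

def run_validation_loss(model, data_iterator):
    print("validation loss pass")  # stub

def run_lm_eval_harness(model, tasks, limit=None):
    print(f"running lm_eval on {tasks} (limit={limit})")  # stub

def evaluate_during_training(neox_args, model, valid_data_iterator):
    # Previously, a missing validation split skipped evaluation entirely.
    if valid_data_iterator is not None:
        run_validation_loss(model, valid_data_iterator)
    # With this PR, the configured lm_eval benchmarks run even when no
    # validation split is defined.
    if neox_args.eval_tasks:
        run_lm_eval_harness(model, neox_args.eval_tasks,
                            limit=neox_args.eval_task_limit)
```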

@@ -1307,6 +1445,14 @@ Text Generation arguments



- **eval_task_limit**: int
@Kyle1668 (Contributor, Author) commented:

This is the only new argument in this PR. The updates elsewhere to this file are from running configs/gen_docs.py.
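As a usage sketch, the new knob caps how many examples each benchmark task sees during in-training evals. The keys below mirror the YAML config fields; the task names and the limit value are illustrative:

```python
# Illustrative config fragment to merge into a full NeoX training config.
# `eval_tasks` already exists; `eval_task_limit` is the argument added here.
eval_config = {
    "eval_tasks": ["lambada_openai", "hellaswag"],  # example task names
    "eval_task_limit": 100,  # evaluate at most 100 examples per task
}
```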

@Kyle1668 changed the title from “[In-Progress] Update Evaluation Logic to Latest lm_eval (0.4,8)” to “Update Evaluation Logic to Latest lm_eval (0.4.8) and Support Automatic Benchmark Evals w/o Validation Set” on Mar 24, 2025
@@ -27,7 +28,10 @@
import torch
import torch.nn.functional as F

from lm_eval.models.utils import chunks
@Kyle1668 (Contributor, Author) commented:

Recent versions of lm_eval have changed the paths for many of these utility functions.
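For code that has to tolerate both old and new harness versions, a defensive import is one option. This is a sketch, not what the PR does (the PR simply switches to the new path), and the legacy location is an assumption:

```python
# lm_eval >= 0.4 exposes `chunks` from lm_eval.models.utils; the except
# branch assumes the pre-0.4 location was lm_eval.utils.
try:
    from lm_eval.models.utils import chunks
except ImportError:
    from lm_eval.utils import chunks
```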

@Kyle1668 (Contributor, Author) commented:

Local unit test results (no unexpected failing tests):

```
=========================================================================================================== test session starts ===========================================================================================================
platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/kyle/miniconda3/envs/neox/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.12.2', 'Platform': 'Linux-6.5.13-65-650-4141-22041-coreweave-amd64-85c45edc-x86_64-with-glibc2.31', 'Packages': {'pytest': '8.3.5', 'pluggy': '1.5.0'}, 'Plugins': {'xdist': '3.6.1', 'metadata': '3.1.1', 'forked': '1.6.0', 'cov': '6.0.0', 'html': '4.1.1'}}
rootdir: /mnt/ssd-1/kyle/gpt-neox/tests
configfile: pytest.ini
plugins: xdist-3.6.1, metadata-3.1.1, forked-1.6.0, cov-6.0.0, html-4.1.1
collected 37 items                                                                                                                                                                                                                        

tests/unit/test_arguments.py::test_main_constructor PASSED                                                                                                                                                                          [  2%]
tests/unit/test_arguments.py::test_constructor_from_ymls PASSED                                                                                                                                                                     [  5%]
tests/unit/test_arguments.py::test_constructor_from_dict PASSED                                                                                                                                                                     [  8%]
tests/unit/test_dependencies.py::test_fused_kernels XFAIL (Fused kernels require manual intervention to install)                                                                                                                    [ 10%]
tests/unit/test_format_conversion_scripts.py::test_gpt_neox_to_huggingface SKIPPED (Conversion test is skipped until we fix the CUDA + torch multiprocessing issue.)                                                                [ 13%]
tests/unit/test_launcher_scripts.py::test_preprocess_data[HFGPT2Tokenizer] PASSED                                                                                                                                                   [ 16%]
tests/unit/test_launcher_scripts.py::test_preprocess_data[HFTokenizer] PASSED                                                                                                                                                       [ 18%]
tests/unit/test_launcher_scripts.py::test_preprocess_data[GPT2BPETokenizer] PASSED                                                                                                                                                  [ 21%]
tests/unit/test_launcher_scripts.py::test_preprocess_data[CharLevelTokenizer] PASSED                                                                                                                                                [ 24%]
tests/unit/test_launcher_scripts.py::test_preprocess_data[TiktokenTokenizer] PASSED                                                                                                                                                 [ 27%]
tests/unit/test_launcher_scripts.py::test_preprocess_data[SPMTokenizer] XFAIL (Expected easy resolution: Need to provide a valid model file from somewhere)                                                                         [ 29%]
tests/unit/test_launcher_scripts.py::test_generate[None] SKIPPED (All model tests are skipped until we fix the CUDA + torch multiprocessing issue.)                                                                                 [ 32%]
tests/unit/test_launcher_scripts.py::test_generate[tests/data/sample_prompt.txt] SKIPPED (All model tests are skipped until we fix the CUDA + torch multiprocessing issue.)                                                         [ 35%]
tests/unit/test_launcher_scripts.py::test_evaluate SKIPPED (All model tests are skipped until we fix the CUDA + torch multiprocessing issue.)                                                                                       [ 37%]
tests/unit/test_launcher_scripts.py::test_finetuning SKIPPED (All model tests are skipped until we fix the CUDA + torch multiprocessing issue.)                                                                                     [ 40%]
tests/unit/test_launcher_scripts.py::test_train_launcher SKIPPED (All model tests are skipped until we fix the CUDA + torch multiprocessing issue.)                                                                                 [ 43%]
tests/unit/test_tokenizer.py::test_train_tokenizer PASSED                                                                                                                                                                           [ 45%]
tests/unit/test_url_accessibility.py::test_url_accessibility[pass] PASSED                                                                                                                                                           [ 48%]
tests/unit/test_url_accessibility.py::test_url_accessibility[enron] XFAIL                                                                                                                                                           [ 51%]
tests/unit/test_url_accessibility.py::test_url_accessibility[pile_subset] XFAIL                                                                                                                                                     [ 54%]
tests/unit/test_url_accessibility.py::test_url_accessibility[pile] XFAIL                                                                                                                                                            [ 56%]
tests/unit/test_url_accessibility.py::test_url_accessibility[github] XFAIL                                                                                                                                                          [ 59%]
tests/unit/test_url_accessibility.py::test_url_accessibility[arxiv] XFAIL                                                                                                                                                           [ 62%]
tests/unit/test_url_accessibility.py::test_url_accessibility[europarl] XFAIL                                                                                                                                                        [ 64%]
tests/unit/test_url_accessibility.py::test_url_accessibility[freelaw] XFAIL                                                                                                                                                         [ 67%]
tests/unit/test_url_accessibility.py::test_url_accessibility[nih] XFAIL                                                                                                                                                             [ 70%]
tests/unit/test_url_accessibility.py::test_url_accessibility[pubmed] XFAIL                                                                                                                                                          [ 72%]
tests/unit/test_url_accessibility.py::test_url_accessibility[books1] XFAIL                                                                                                                                                          [ 75%]
tests/unit/test_url_accessibility.py::test_url_accessibility[books3] XFAIL                                                                                                                                                          [ 78%]
tests/unit/test_url_accessibility.py::test_url_accessibility[hackernews] XFAIL                                                                                                                                                      [ 81%]
tests/unit/test_url_accessibility.py::test_url_accessibility[openwebtext2] XFAIL                                                                                                                                                    [ 83%]
tests/unit/test_url_accessibility.py::test_url_accessibility[stackexchange] XFAIL                                                                                                                                                   [ 86%]
tests/unit/test_url_accessibility.py::test_url_accessibility[ubuntu_irc] XFAIL                                                                                                                                                      [ 89%]
tests/unit/test_url_accessibility.py::test_url_accessibility[youtube_subtitles] XFAIL                                                                                                                                               [ 91%]
tests/unit/test_url_accessibility.py::test_url_accessibility[c4] XFAIL                                                                                                                                                              [ 94%]
tests/unit/test_url_accessibility.py::test_url_accessibility[c4_openwebtext] XFAIL                                                                                                                                                  [ 97%]
tests/unit/test_url_accessibility.py::test_url_accessibility[enwik8] PASSED                                                                                                                                                         [100%]

============================================================================================================ warnings summary =============================================================================================================
megatron/neox_arguments/arguments.py:24
  /mnt/ssd-1/kyle/gpt-neox/megatron/neox_arguments/arguments.py:24: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    from pkg_resources import packaging

<string>:8
  <string>:8: PytestDeprecationWarning: A private pytest class or function was used.

unit/test_arguments.py: 3 warnings
unit/test_dependencies.py: 1 warning
unit/test_format_conversion_scripts.py: 1 warning
unit/test_launcher_scripts.py: 11 warnings
unit/test_tokenizer.py: 1 warning
unit/test_url_accessibility.py: 20 warnings
  /home/kyle/miniconda3/envs/neox/lib/python3.12/site-packages/py/_process/forkedfunc.py:45: DeprecationWarning: This process (pid=570518) is multi-threaded, use of fork() may lead to deadlocks in the child.
    pid = os.fork()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================================= 11 passed, 6 skipped, 20 xfailed, 39 warnings in 10.77s =========================================================================================
```

@Kyle1668 (Contributor, Author) commented:

All tests run locally with `pytest tests -m cpu` pass.

All tests run with `pytest --forked --cov-report term --cov=megatron tests` pass, except the following:

```
FAILED tests/requirements/test_requirements.py::test_pyproject_matches_requirements[req_file1]
FAILED tests/requirements/test_requirements.py::test_pyproject_matches_requirements[req_file2]
FAILED tests/requirements/test_requirements.py::test_pyproject_matches_requirements[req_file3]
FAILED tests/requirements/test_requirements.py::test_pyproject_matches_requirements[req_file4]
FAILED tests/requirements/test_requirements.py::test_pyproject_matches_requirements[req_file5]
FAILED tests/requirements/test_requirements.py::test_pyproject_matches_requirements[req_file7]
FAILED tests/requirements/test_requirements.py::test_pyproject_matches_requirements[req_file9]
FAILED tests/requirements/test_requirements.py::test_pyproject_matches_requirements[req_file10]
================================================================== 8 failed, 31 passed, 91 skipped, 80 xfailed, 212 warnings in 334.13s (0:05:34) ===================================================================
```

We can leave resolving these tests for another PR. 

@Kyle1668 added the `dependencies` label (Pull requests that update a dependency file) on Mar 25, 2025