
Update Evaluation Logic to Latest lm_eval (0.4.8) and Support Automatic Benchmark Evals w/o Validation Set #1348


Open · wants to merge 12 commits into main

Conversation

@Kyle1668 (Contributor) commented on Mar 21, 2025

I'm training a model on the entire dataset, without splitting it into train/val/test. I still want to evaluate on a set of benchmarks during training, one of which was introduced in a later version of lm_eval. This PR adds support for evaluating against the configured eval_tasks during training even when no validation split is defined, and updates to the latest version of lm_eval (0.4.8).
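At a high level, the change decouples the benchmark evals from the validation loop. Here is a minimal sketch of the intended control flow; the helper functions are hypothetical stand-ins, not the actual entry points in this PR:

```python
# Sketch only (stub helpers): illustrates the control flow this PR enables,
# not the actual functions in megatron/training.py.

def run_validation_loss(model, data_iterator):
    print("validation loss pass")  # stub

def run_lm_eval_harness(model, tasks, limit=None):
    print(f"running lm_eval on {tasks} (limit={limit})")  # stub

def evaluate_during_training(neox_args, model, valid_data_iterator):
    # Previously, a missing validation split skipped evaluation entirely.
    if valid_data_iterator is not None:
        run_validation_loss(model, valid_data_iterator)
    # With this PR, the configured lm_eval benchmarks run even when no
    # validation split is defined.
    if neox_args.eval_tasks:
        run_lm_eval_harness(model, neox_args.eval_tasks,
                            limit=neox_args.eval_task_limit)
```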

@@ -1307,6 +1445,14 @@ Text Generation arguments



- **eval_task_limit**: int
@Kyle1668 (Contributor, Author) commented:

This is the only new argument in this PR. The updates elsewhere to this file are from running configs/gen_docs.py.
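As a usage sketch, the new knob caps how many examples each benchmark task sees during in-training evals. The keys below mirror the YAML config fields; the task names and the limit value are illustrative:

```python
# Illustrative config fragment to merge into a full NeoX training config.
# `eval_tasks` already exists; `eval_task_limit` is the argument added here.
eval_config = {
    "eval_tasks": ["lambada_openai", "hellaswag"],  # example task names
    "eval_task_limit": 100,  # evaluate at most 100 examples per task
}
```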

@Kyle1668 changed the title from “[In-Progress] Update Evaluation Logic to Latest lm_eval (0.4,8)” to “Update Evaluation Logic to Latest lm_eval (0.4.8) and Support Automatic Benchmark Evals w/o Validation Set” on Mar 24, 2025
@@ -27,7 +28,10 @@
import torch
import torch.nn.functional as F

from lm_eval.models.utils import chunks
@Kyle1668 (Contributor, Author) commented:

Recent versions of lm_eval have changed the paths for many of these utility functions.
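For code that has to tolerate both old and new harness versions, a defensive import is one option. This is a sketch, not what the PR does (the PR simply switches to the new path), and the legacy location is an assumption:

```python
# lm_eval >= 0.4 exposes `chunks` from lm_eval.models.utils; the except
# branch assumes the pre-0.4 location was lm_eval.utils.
try:
    from lm_eval.models.utils import chunks
except ImportError:
    from lm_eval.utils import chunks
```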

@Kyle1668 (Contributor, Author) commented:

Local unit test results (no unexpected failing tests):

```
=========================================================================================================== test session starts ===========================================================================================================
platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/kyle/miniconda3/envs/neox/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.12.2', 'Platform': 'Linux-6.5.13-65-650-4141-22041-coreweave-amd64-85c45edc-x86_64-with-glibc2.31', 'Packages': {'pytest': '8.3.5', 'pluggy': '1.5.0'}, 'Plugins': {'xdist': '3.6.1', 'metadata': '3.1.1', 'forked': '1.6.0', 'cov': '6.0.0', 'html': '4.1.1'}}
rootdir: /mnt/ssd-1/kyle/gpt-neox/tests
configfile: pytest.ini
plugins: xdist-3.6.1, metadata-3.1.1, forked-1.6.0, cov-6.0.0, html-4.1.1
collected 37 items                                                                                                                                                                                                                        

tests/unit/test_arguments.py::test_main_constructor PASSED                                                                                                                                                                          [  2%]
tests/unit/test_arguments.py::test_constructor_from_ymls PASSED                                                                                                                                                                     [  5%]
tests/unit/test_arguments.py::test_constructor_from_dict PASSED                                                                                                                                                                     [  8%]
tests/unit/test_dependencies.py::test_fused_kernels XFAIL (Fused kernels require manual intervention to install)                                                                                                                    [ 10%]
tests/unit/test_format_conversion_scripts.py::test_gpt_neox_to_huggingface SKIPPED (Conversion test is skipped until we fix the CUDA + torch multiprocessing issue.)                                                                [ 13%]
tests/unit/test_launcher_scripts.py::test_preprocess_data[HFGPT2Tokenizer] PASSED                                                                                                                                                   [ 16%]
tests/unit/test_launcher_scripts.py::test_preprocess_data[HFTokenizer] PASSED                                                                                                                                                       [ 18%]
tests/unit/test_launcher_scripts.py::test_preprocess_data[GPT2BPETokenizer] PASSED                                                                                                                                                  [ 21%]
tests/unit/test_launcher_scripts.py::test_preprocess_data[CharLevelTokenizer] PASSED                                                                                                                                                [ 24%]
tests/unit/test_launcher_scripts.py::test_preprocess_data[TiktokenTokenizer] PASSED                                                                                                                                                 [ 27%]
tests/unit/test_launcher_scripts.py::test_preprocess_data[SPMTokenizer] XFAIL (Expected easy resolution: Need to provide a valid model file from somewhere)                                                                         [ 29%]
tests/unit/test_launcher_scripts.py::test_generate[None] SKIPPED (All model tests are skipped until we fix the CUDA + torch multiprocessing issue.)                                                                                 [ 32%]
tests/unit/test_launcher_scripts.py::test_generate[tests/data/sample_prompt.txt] SKIPPED (All model tests are skipped until we fix the CUDA + torch multiprocessing issue.)                                                         [ 35%]
tests/unit/test_launcher_scripts.py::test_evaluate SKIPPED (All model tests are skipped until we fix the CUDA + torch multiprocessing issue.)                                                                                       [ 37%]
tests/unit/test_launcher_scripts.py::test_finetuning SKIPPED (All model tests are skipped until we fix the CUDA + torch multiprocessing issue.)                                                                                     [ 40%]
tests/unit/test_launcher_scripts.py::test_train_launcher SKIPPED (All model tests are skipped until we fix the CUDA + torch multiprocessing issue.)                                                                                 [ 43%]
tests/unit/test_tokenizer.py::test_train_tokenizer PASSED                                                                                                                                                                           [ 45%]
tests/unit/test_url_accessibility.py::test_url_accessibility[pass] PASSED                                                                                                                                                           [ 48%]
tests/unit/test_url_accessibility.py::test_url_accessibility[enron] XFAIL                                                                                                                                                           [ 51%]
tests/unit/test_url_accessibility.py::test_url_accessibility[pile_subset] XFAIL                                                                                                                                                     [ 54%]
tests/unit/test_url_accessibility.py::test_url_accessibility[pile] XFAIL                                                                                                                                                            [ 56%]
tests/unit/test_url_accessibility.py::test_url_accessibility[github] XFAIL                                                                                                                                                          [ 59%]
tests/unit/test_url_accessibility.py::test_url_accessibility[arxiv] XFAIL                                                                                                                                                           [ 62%]
tests/unit/test_url_accessibility.py::test_url_accessibility[europarl] XFAIL                                                                                                                                                        [ 64%]
tests/unit/test_url_accessibility.py::test_url_accessibility[freelaw] XFAIL                                                                                                                                                         [ 67%]
tests/unit/test_url_accessibility.py::test_url_accessibility[nih] XFAIL                                                                                                                                                             [ 70%]
tests/unit/test_url_accessibility.py::test_url_accessibility[pubmed] XFAIL                                                                                                                                                          [ 72%]
tests/unit/test_url_accessibility.py::test_url_accessibility[books1] XFAIL                                                                                                                                                          [ 75%]
tests/unit/test_url_accessibility.py::test_url_accessibility[books3] XFAIL                                                                                                                                                          [ 78%]
tests/unit/test_url_accessibility.py::test_url_accessibility[hackernews] XFAIL                                                                                                                                                      [ 81%]
tests/unit/test_url_accessibility.py::test_url_accessibility[openwebtext2] XFAIL                                                                                                                                                    [ 83%]
tests/unit/test_url_accessibility.py::test_url_accessibility[stackexchange] XFAIL                                                                                                                                                   [ 86%]
tests/unit/test_url_accessibility.py::test_url_accessibility[ubuntu_irc] XFAIL                                                                                                                                                      [ 89%]
tests/unit/test_url_accessibility.py::test_url_accessibility[youtube_subtitles] XFAIL                                                                                                                                               [ 91%]
tests/unit/test_url_accessibility.py::test_url_accessibility[c4] XFAIL                                                                                                                                                              [ 94%]
tests/unit/test_url_accessibility.py::test_url_accessibility[c4_openwebtext] XFAIL                                                                                                                                                  [ 97%]
tests/unit/test_url_accessibility.py::test_url_accessibility[enwik8] PASSED                                                                                                                                                         [100%]

============================================================================================================ warnings summary =============================================================================================================
megatron/neox_arguments/arguments.py:24
  /mnt/ssd-1/kyle/gpt-neox/megatron/neox_arguments/arguments.py:24: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    from pkg_resources import packaging

<string>:8
  <string>:8: PytestDeprecationWarning: A private pytest class or function was used.

unit/test_arguments.py: 3 warnings
unit/test_dependencies.py: 1 warning
unit/test_format_conversion_scripts.py: 1 warning
unit/test_launcher_scripts.py: 11 warnings
unit/test_tokenizer.py: 1 warning
unit/test_url_accessibility.py: 20 warnings
  /home/kyle/miniconda3/envs/neox/lib/python3.12/site-packages/py/_process/forkedfunc.py:45: DeprecationWarning: This process (pid=570518) is multi-threaded, use of fork() may lead to deadlocks in the child.
    pid = os.fork()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================================= 11 passed, 6 skipped, 20 xfailed, 39 warnings in 10.77s =========================================================================================
```

@Kyle1668 (Contributor, Author) commented:

All tests run locally with `pytest tests -m cpu` pass.

All tests run with `pytest --forked --cov-report term --cov=megatron tests` pass, except the following:

```
FAILED tests/requirements/test_requirements.py::test_pyproject_matches_requirements[req_file1]
FAILED tests/requirements/test_requirements.py::test_pyproject_matches_requirements[req_file2]
FAILED tests/requirements/test_requirements.py::test_pyproject_matches_requirements[req_file3]
FAILED tests/requirements/test_requirements.py::test_pyproject_matches_requirements[req_file4]
FAILED tests/requirements/test_requirements.py::test_pyproject_matches_requirements[req_file5]
FAILED tests/requirements/test_requirements.py::test_pyproject_matches_requirements[req_file7]
FAILED tests/requirements/test_requirements.py::test_pyproject_matches_requirements[req_file9]
FAILED tests/requirements/test_requirements.py::test_pyproject_matches_requirements[req_file10]
================================================================== 8 failed, 31 passed, 91 skipped, 80 xfailed, 212 warnings in 334.13s (0:05:34) ===================================================================
```

We can leave resolving these tests for another PR. 

@Kyle1668 added the `dependencies` label (Pull requests that update a dependency file) on Mar 25, 2025