Unexpected keyword argument 'image_size' when running benchmark #74

Open
noctarius opened this issue Aug 13, 2024 · 5 comments

Comments

@noctarius

Hey folks!

I saw that someone asked the same question on the mailing list yesterday, but nobody has answered, so I thought I'd bring it here since I'm running into the same issue.

When I try to run the benchmark, the process stops when trying to read the training data and complains about a "TypeError: Profile.update() got an unexpected keyword argument 'image_size'".

I generated the data with:

./benchmark.sh datagen --hosts <IP> --workload unet3d --accelerator-type a100 --num-parallel 8 --param dataset.num_files_train=3500 --param dataset.data_folder=unet3d_data

Running the benchmark then fails with:

HYDRA_FULL_ERROR=1 ./benchmark.sh run --hosts <IP> --workload unet3d --accelerator-type a100 --num-accelerators 1 --results-dir resultsdir --param dataset.num_files_train=3500 --param dataset.data_folder=unet3d_data
[INFO] 2024-08-13T16:49:38.250049 Running DLIO with 1 process(es) [/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py:100]
[INFO] Total amount of data each host will consume is 477.86366008222103 GB; each host has [30.648590087890625] GB memory [/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:121]
[INFO] 2024-08-13T16:49:43.983959 Max steps per epoch: 500 = 1 * 3500 / 7 / 1 (samples per file * num files / batch size / comm size) [/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py:321]
[INFO] 2024-08-13T16:49:43.998867 Starting epoch 1: 500 steps expected [/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:192]
[INFO] 2024-08-13T16:49:44.009325 Starting block 1 [/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:264]
Error executing job with overrides: ['workload=unet3d_a100', '++workload.workflow.generate_data=False', '++workload.workflow.train=True', '++workload.dataset.num_files_train=3500', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py", line 402, in <module>
    main()
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
            ^^^^^^^^^^
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
        ^^^^^^^^^^^^^^^^
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py", line 397, in main
    benchmark.run()
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py", line 343, in run
    steps = self._train(epoch)
            ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py", line 263, in _train
    for batch in dlp.iter(loader.next()):
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/data_loader/torch_data_loader.py", line 174, in next
    for batch in self._dataset:
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1344, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
    data.reraise()
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/_utils.py", line 706, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/data_loader/torch_data_loader.py", line 84, in __getitem__
    return self.reader.read_index(image_idx, step)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/reader/npz_reader.py", line 57, in read_index
    return super().read_index(image_idx, step)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/reader/reader_handler.py", line 114, in read_index
    self.get_sample(filename, sample_index)
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/reader/npz_reader.py", line 48, in get_sample
    dlp.update(image_size=image.nbytes)
TypeError: Profile.update() got an unexpected keyword argument 'image_size'

I'm running the benchmark in a venv with Python 3.12.3 on Ubuntu 24.04 LTS, kernel 6.8.0-1009-aws #9-Ubuntu SMP Fri May 17 14:39:23 UTC 2024.

Does anyone have an idea? It feels like the data format is wrong, but I'm not sure.

@noctarius
Author

I tried tag v1.0.1 and it still happens. I also updated all OS packages, but that made no difference. The parameter "image_size" doesn't exist in Profile.update(...), at least not in the referenced dlio_benchmark commits 🤔
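A mismatch like this can be confirmed by inspecting the method signature. The following is a minimal sketch: the `Profile` class here is a stand-in whose `update` signature is copied from the fallback stub in the patch posted later in this thread, not imported from dlio_benchmark, and `accepts_keyword` is a hypothetical helper name.

```python
import inspect


class Profile:
    """Stand-in for the no-op fallback profiler (illustrative copy only)."""

    def update(self, *, epoch=0, step=0, size=0, default=None):
        return


def accepts_keyword(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    keyword_kinds = (
        inspect.Parameter.KEYWORD_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs catch-all would accept any keyword as well.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


print(accepts_keyword(Profile.update, "size"))        # True
print(accepts_keyword(Profile.update, "image_size"))  # False
```

In a real checkout, the same helper could be pointed at whichever Profile class dlio_benchmark ends up using, assuming it is importable from the active environment.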

@noctarius
Author

diff --git a/dlio_benchmark/utils/utility.py b/dlio_benchmark/utils/utility.py
index 8872f2e..267ba19 100644
--- a/dlio_benchmark/utils/utility.py
+++ b/dlio_benchmark/utils/utility.py
@@ -49,7 +49,7 @@ except:
             return
         def __exit__(self, type, value, traceback):
             return
-        def update(self, *, epoch=0, step=0, size=0, default=None):
+        def update(self, *, epoch=0, step=0, size=0, default=None, image_size=0):
             return
     class dftracer(object):
         def __init__(self,):

This patch fixes the issue for me. I'm not sure whether "image_size" is supposed to be "size" or the other way around, but since the test isn't using a profiler, simply adding the parameter is the easiest fix.
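An alternative to adding each new keyword by hand would be a no-op stub that ignores whatever keywords it receives. This is only a sketch of that idea under the assumption that the fallback class is never expected to validate its arguments; it is not the upstream DLIO code, and the constructor arguments are hypothetical.

```python
class Profile:
    """No-op profiler stub that tolerates any keyword (illustrative only)."""

    def __init__(self, *args, **kwargs):
        # Accept and discard whatever the caller passes.
        pass

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Returning False lets exceptions from the with-block propagate.
        return False

    def update(self, **kwargs):
        # Ignore epoch, step, size, image_size, and any future keyword.
        return


dlp = Profile()
dlp.update(image_size=1024)   # no TypeError
dlp.update(epoch=1, step=2)   # still fine
```

The trade-off is that a catch-all signature also hides genuine typos in keyword names, so the explicit parameter in the patch above is the stricter choice.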

@zhenghh04
Contributor

You can run pip install -r requirements.txt to fix the issue.

@zhenghh04
Contributor

This is a current bug in DLIO that shows up when dftracer is not installed.

So when switching over to 1.0.1, please make sure to run pip install -r requirements.txt.
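To check whether the active venv already has dftracer before starting a run, something like the following could be used. It is a minimal sketch that assumes the package is importable under the top-level module name `dftracer`; `is_installed` is a hypothetical helper, not part of DLIO.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the profiler package is importable as "dftracer".
if is_installed("dftracer"):
    print("dftracer is installed")
else:
    print("dftracer not found; run: pip install -r requirements.txt")
```

Run it with the same Python interpreter that benchmark.sh uses, since a system-wide install is not visible from inside an isolated venv.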

@boni-weka

For unet3d runs, should 1.0 or 1.0.1 be used before running pip install -r requirements.txt?
