ERROR when training IQL with checkpoint=True on multiple GPUs #47
-
When I train bert.phase1 on multiple GPUs with checkpointing turned on and a batch size of 6500 on five 4090s, everything was fine at first: it took 15 hours to train on 170 million samples and produced a well-behaved model. Then the training program crashed with the following output:

Without checkpointing, I can only set the batch size up to 600 on my five 4090s, and training is then so slow that it is almost impossible to complete.
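For context, gradient checkpointing trades compute for memory: activations are recomputed during the backward pass instead of being stored, which is why much larger batch sizes fit but each step costs extra forward work. A minimal sketch of how it is enabled in PyTorch with `torch.utils.checkpoint.checkpoint_sequential` (a toy model for illustration, not the bert.phase1 code from this thread):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy stand-in model; the real training code is not shown in this thread.
model = nn.Sequential(*[nn.Linear(256, 256) for _ in range(8)])
x = torch.randn(64, 256, requires_grad=True)

# Split the sequential model into 4 segments. Only the segment boundaries'
# activations are kept; everything inside a segment is recomputed on backward,
# cutting activation memory at the cost of extra forward compute.
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()  # recomputes each segment's forward during backward
```

This memory/compute trade-off matches the numbers above: checkpointing allows a ~10x larger batch, but a run without it can still be faster per sample when memory is not the bottleneck.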
Replies: 2 comments
-
Sorry, I tried training without checkpointing; it's faster than with it.
-
Closing this.