Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process Terminated during Finetuning #30

Open
Jozdien opened this issue Sep 28, 2021 · 4 comments
Open

Process Terminated during Finetuning #30

Jozdien opened this issue Sep 28, 2021 · 4 comments

Comments

@Jozdien
Copy link

Jozdien commented Sep 28, 2021

I was trying to use the Audioset pretrained model for finetuning on a very small dataset to test it out on. At first the process would simply be killed with "Out of memory" in the log, but when I moved to a larger system, the process ran for longer before returning this error:

Traceback (most recent call last):
  File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 379, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 499, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
OSError: [Errno 28] No space left on device

Traceback (most recent call last):
  File "../../src/run.py", line 99, in <module>
    train(audio_model, train_loader, val_loader, args)
  File "/home/ubuntu/ast_conv/src/traintest.py", line 220, in train
    torch.save(audio_model.state_dict(), "%s/models/audio_model.%d.pth" % (exp_dir, epoch))
  File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 380, in save
    return
  File "/home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/serialization.py", line 259, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:298] . unexpected pos 322619584 vs 322619472
terminate called after throwing an instance of 'c10::Error'
  what():  [enforce fail at inline_container.cc:298] . unexpected pos 322619584 vs 322619472
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x47 (0x7f20ac5b47a7 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x24e10c0 (0x7f20f14190c0 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x24dc69c (0x7f20f141469c in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamWriter::writeRecord(std::string const&, void const*, unsigned long, bool) + 0x9a (0x7f20f1419afa in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamWriter::writeEndOfFile() + 0x173 (0x7f20f1419d83 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: caffe2::serialize::PyTorchStreamWriter::~PyTorchStreamWriter() + 0x1a5 (0x7f20f141a075 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xa7ffe3 (0x7f2103160fe3 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x4ff188 (0x7f2102be0188 in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x50048e (0x7f2102be148e in /home/ubuntu/ast_conv/venvast/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: python() [0x5cf938]
frame #10: python() [0x52cae8]
frame #11: python() [0x52cb32]
frame #12: python() [0x52cb32]
<omitting python frames>
frame #17: python() [0x654354]
frame #19: __libc_start_main + 0xe7 (0x7f21079dcbf7 in /lib/x86_64-linux-gnu/libc.so.6)

run.sh: line 46:  1703 Aborted                 (core dumped) CUDA_CACHE_DISABLE=1 python -W ignore ../../src/run.py --model ${model} --dataset ${dataset} --data-train ${tr_data} --data-val ${val_data} --exp-dir $exp_dir --label-csv ./data/class_labels_indices.csv --n_class 3 --lr $lr --n-epochs ${epoch} --batch-size $batch_size --save_model True --freqm $freqm --timem $timem --mixup ${mixup} --bal ${bal} --tstride $tstride --fstride $fstride --imagenet_pretrain $imagenetpretrain --audioset_pretrain $audiosetpretrain > $exp_dir/log.txt

As far as I can tell, that OSError could indicate that the filesize has been exceeded, not just that the total memory is overflowing. I haven't changed traintest.py except for adding an elif condition for the finetuning dataset. Did you run into this error while finetuning, or does it seem like something you understand the cause of?

@YuanGongND
Copy link
Owner

YuanGongND commented Oct 8, 2021

It doesn't seem to be the problem of the code. Can you check how much space you have on your disk? In traintest.py, we save the output predictions of each epoch, which might take a few GBs for the training process, depending on how large your test set is.

@Jozdien
Copy link
Author

Jozdien commented Oct 12, 2021

I don't think it's the disk space because I'm testing this on a very small dataset (<20 samples). Could it be possible that something is writing a large amount of data to one particular file, and the filesize limits vary between systems (very speculative)?

@YuanGongND
Copy link
Owner

I see. I would suggest running the ESC-50 recipe and see if the same error raises - it would be fast and easy to run; if you still see the error you can check your os; otherwise, you can check if your modification is correct or not.

traintest.py does save the prediction files, which could be large, but considering you have only 20 samples, it is unlikely the case.

@Jozdien
Copy link
Author

Jozdien commented Oct 15, 2021

Yep, got the same error, although this time far later in the training process (ESC-50 ran for about 16 epochs before halting, on my dataset I don't recall it making any progress on training). Just to know what kind of specifications I should use, what amount of disk space do you recommend using?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants