R Session Aborted when initializing training #1275
Comments
I'll add that it crashes on my machine too (same office), so it's not specific to @jarroyoe 's computer.
Matrix products: default
locale:
time zone: America/Los_Angeles
attached base packages:
other attached packages:
loaded via a namespace (and not attached): |
Looks like also related to #1273 (comment) |
Hi @jarroyoe and @danielrodonnell I don't have a reliable reproducible environment for the problem. But I made a speculative fix for the problem and merged it to main. Could you try installing torch from main and see if this fixes the issue?
Sorry for the disruption. Thanks! |
@dfalbel I'll give this a try. I am working in RStudio. |
Just to confirm you see: When loading torch. It's especially important that it downloads lantern-0.14.1.9000
Also, does the error only happen when training the model, or just:
triggers the error, like in #1273? |
Currently trying to make a MWE without |
I have updated the MWE, because of the batch size it wasn't really running any training:
It definitely runs on my Windows machine. So we need to identify what differs between our environments. |
Also, if you are still manually installing lantern, please make sure to use |
I made a MWE without luz, and this one actually runs:
library(torch)
x <- torch_rand(10,118,8)
y <- torch_rand(10)
res_lstm <- nn_module(
initialize = function(){
self$lstm <- nn_lstm(8,46,batch_first = TRUE)
self$lstm_connection <- nn_sequential(
nn_sigmoid(),
nn_linear(46,23),
nn_sigmoid(),
nn_linear(23,1))
},
forward = function(x){
lstm <- self$lstm_connection(
self$lstm(torch_flip(x,2))[[1]][,118,]
)
torch_squeeze(nn_sigmoid()(lstm))
}
)
model <- res_lstm()
optimizer <- optim_adam(params = model$parameters)
for(epoch in 1:100){
optimizer$zero_grad()
y_pred <- model(x)
loss <- torch_mean((y_pred - y)^2)
cat("Epoch: ", epoch, " Loss: ", loss$item(), "\n")
loss$backward()
optimizer$step()
} |
Do you have a CUDA compatible version? @danielrodonnell tried this and the MWE worked on CPU |
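To check whether the crash is specific to the GPU path, the LSTM forward pass from the MWE can be run on an explicitly chosen device. The following is a minimal sketch, not the maintainers' procedure: `pick_device` is a hypothetical helper, and the tensor and layer sizes are copied from the MWE above. On an affected machine the CUDA branch may abort the session, which is the point of the test.

```r
# Hypothetical helper (not part of torch): returns "cuda" when the flag is TRUE,
# otherwise "cpu". The default asks torch whether a CUDA device is usable.
pick_device <- function(use_cuda = torch::cuda_is_available()) {
  if (use_cuda) "cuda" else "cpu"
}

# Only run the forward pass when torch and its backend are installed.
if (requireNamespace("torch", quietly = TRUE) && torch::torch_is_installed()) {
  library(torch)
  device <- torch_device(pick_device())

  # Same shapes as the MWE: batch of 10, 118 time steps, 8 features.
  x <- torch_rand(10, 118, 8, device = device)
  lstm <- nn_lstm(8, 46, batch_first = TRUE)
  lstm$to(device = device)

  # nn_lstm returns a list; the first element is the output sequence.
  out <- lstm(x)[[1]]
  print(out$shape)
}
```

Calling `pick_device(FALSE)` forces the CPU path, which makes it easy to compare both devices in the same session.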
Yes the cuda compatible should be:
|
I tried this and my R session still aborts |
I was able to reproduce it locally now. After attaching a debugger I see: Unhandled exception at 0x00007FF868343401 (cudnn64_9.dll) in rsession-utf8.exe: Fatal program exit requested. I believe this is caused by a missing cuDNN installation. I'm installing it now from https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/windows-x86_64/cudnn-windows-x86_64-8.9.7.29_cuda12-archive.zip to see if that's the missing library. |
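One quick way to see whether a file such as `cudnn64_9.dll` is present in any directory on `PATH` is to scan those directories from R. This is only a sketch under assumptions: `find_library_file` is a hypothetical helper, and torch may load its libraries from its own installation directory rather than from `PATH`, so an empty result here does not prove the library is missing.

```r
# Hypothetical helper: return every full path where `file` exists
# in the given directories (by default, the entries of PATH).
find_library_file <- function(
    file,
    dirs = strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]) {
  dirs <- dirs[nzchar(dirs)]        # drop empty PATH entries
  candidates <- file.path(dirs, file)
  candidates[file.exists(candidates)]
}

# Look for the DLL named in the debugger output.
hits <- find_library_file("cudnn64_9.dll")
if (length(hits) == 0) {
  message("cudnn64_9.dll was not found in any PATH directory")
} else {
  print(hits)
}
```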
I can confirm that installing cuDNN from this URL:
(I actually had to bump from cuDNN 8 to cuDNN 9, see d0eb62c) fixed the crash. So it's definitely a cuDNN version mismatch. |
Thanks, @dfalbel , this seems to have done the trick! I think that's all our problems solved for now. Thanks again for being so responsive. |
Awesome! Glad it worked. I wish we could be more explicit on the error when cuDNN is missing :(
So it just tries to load a symbol that doesn't exist. |
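Until the library itself reports this more clearly, a user-side guard can at least fail with a readable message before training starts. The sketch below is an assumption-laden illustration, not torch's behaviour: `assert_gpu_ready` is a hypothetical helper built on `torch::cuda_is_available()` and `torch::backends_cudnn_is_available()`, and it is not confirmed that the cuDNN check catches the exact missing-symbol case described in this thread.

```r
# Hypothetical guard: stop with a clear message instead of letting the
# session abort later. The defaults are evaluated lazily, so the cuDNN
# check only runs when CUDA itself is reported as available.
assert_gpu_ready <- function(
    cuda_ok = torch::cuda_is_available(),
    cudnn_ok = torch::backends_cudnn_is_available()) {
  if (!cuda_ok) {
    stop("CUDA is not available; run on CPU or check the torch installation.",
         call. = FALSE)
  }
  if (!cudnn_ok) {
    stop("CUDA is available but cuDNN is not; check that a cuDNN build ",
         "matching this torch/lantern version is installed.",
         call. = FALSE)
  }
  invisible(TRUE)
}
```

In a training script, `assert_gpu_ready()` would be called once before moving the model to the GPU. Passing the flags explicitly, for example `assert_gpu_ready(TRUE, FALSE)`, exercises the error path without needing a GPU.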
I have a deep learning model I'm trying to train using luz. A MWE of the model and training goes as follows:
When I try to run this my R session aborts without exit code when trying to use libtorch 2.5.1, lantern 0.14.1, and the main branch of torch downloaded using remotes::install_github("mlverse/torch"). This script works well when using libtorch 2.0.1, lantern 0.12.0, and torch 0.12.0. Here's my R.version:
Could you help me figure out why this crashes?