
R Session Aborted when initializing training #1275

Closed
jarroyoe opened this issue Feb 12, 2025 · 18 comments

Comments

@jarroyoe

jarroyoe commented Feb 12, 2025

I have a deep learning model I'm trying to train using luz. An MWE of the model and training goes as follows:

library(luz)
library(torch)

ds <- tensor_dataset(torch_rand(10,118,8),torch_rand(10))

res_lstm <- nn_module(
    initialize = function(num_lags = 118){
        self$num_lags <- num_lags
        self$lstm <- nn_lstm(8,46,batch_first = TRUE)
        self$lstm_connection <- nn_sequential(
            nn_sigmoid(),
            nn_linear(46,23),
            nn_sigmoid(),
            nn_linear(23,1))
    },
    
    forward = function(x){
        lstm <- self$lstm_connection(
            self$lstm(torch_flip(x,2))[[1]][,self$num_lags,]
        )
        torch_squeeze(nn_sigmoid()(lstm))
    }
)

fitted <- res_lstm %>% 
  setup(loss = nn_mse_loss(), 
        optimizer = optim_adam) %>% 
  fit(ds, epochs = 2)

When I try to run this, my R session aborts without an exit code when using libtorch 2.5.1, lantern 0.14.1, and the main branch of torch installed with remotes::install_github("mlverse/torch"). The script works fine with libtorch 2.0.1, lantern 0.12.0, and torch 0.12.0. Here's my R.version:

platform       x86_64-w64-mingw32               
arch           x86_64                           
os             mingw32                          
crt            ucrt                             
system         x86_64, mingw32                  
status                                          
major          4                                
minor          3.2                              
year           2023                             
month          10                               
day            31                               
svn rev        85441                            
language       R                                
version.string R version 4.3.2 (2023-10-31 ucrt)
nickname       Eye Holes  

Could you help me figure out why this crashes?
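
For reference, here's a small diagnostic sketch (nothing beyond standard torch helpers plus base R's packageVersion()) that reports the versions and GPU visibility involved, without starting any training:

library(torch)

# package versions as seen by the current R session
packageVersion("torch")
packageVersion("luz")

# GPU / backend visibility; these calls don't allocate tensors or start training
cuda_is_available()
if (cuda_is_available()) cuda_device_count()
backends_cudnn_is_available()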

@danielrodonnell

danielrodonnell commented Feb 12, 2025

I'll add that it crashes on my machine too (same office), so it's not specific to @jarroyoe 's computer.

sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C LC_TIME=English_United States.utf8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] luz_0.4.0 torch_0.14.1

loaded via a namespace (and not attached):
[1] coro_1.1.0 R6_2.6.0 bit_4.5.0.1 magrittr_2.0.3 pkgconfig_2.0.3 bit64_4.6.0-1 generics_0.1.3
[8] lifecycle_1.0.4 ps_1.8.1 cli_3.6.3 processx_3.8.5 callr_3.7.6 vctrs_0.6.5 zeallot_0.1.0
[15] compiler_4.4.1 prettyunits_1.2.0 rstudioapi_0.16.0 tools_4.4.1 hms_1.1.3 Rcpp_1.0.14 crayon_1.5.3

@dfalbel
Member

dfalbel commented Feb 13, 2025

This also looks related to #1273 (comment).
I'm actively investigating it. Can you confirm that it only happens when running from RStudio?

@dfalbel
Member

dfalbel commented Feb 13, 2025

Hi @jarroyoe and @danielrodonnell

I don't have a reliable way to reproduce the problem, but I made a speculative fix and merged it to main. Could you try installing torch from main and see if it fixes the issue?

remotes::install_github("mlverse/main")

Sorry for the disruption. Thanks!

@danielrodonnell

@dfalbel I'll give this a try. I am working in RStudio.

@jarroyoe
Author

Hi @dfalbel, I tried reinstalling torch at HEAD and the error persists. I also tried running the script directly in R instead of RStudio and it still crashes.

@dfalbel
Member

dfalbel commented Feb 13, 2025

Just to confirm: when loading torch, do you see the following? It's especially important that it downloads lantern-0.14.1.9000:

ℹ Additional software needs to be downloaded and installed for torch to work correctly.
trying URL 'https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-2.5.1%2Bcpu.zip'
Content type 'application/zip' length 187685286 bytes (179.0 MB)
downloaded 179.0 MB

trying URL 'https://torch-cdn.mlverse.org/binaries/refs/heads/main/latest/lantern-0.14.1.9000+cpu-win64.zip'
Content type 'application/x-zip-compressed' length 2516096 bytes (2.4 MB)
downloaded 2.4 MB

Also, does the error happen only when training the model, or does just

torch_randn(10)

trigger the error, like in #1273?

@jarroyoe
Author

jarroyoe commented Feb 13, 2025

torch_randn(10) doesn't trigger the crash. That happened before because I was using an older version of lantern. Unfortunately, both @danielrodonnell and I have to download the binaries manually because of firewall issues.

I'm currently trying to make an MWE without luz.

@danielrodonnell

@jarroyoe @dfalbel

Just a minor correction in case it matters: I've gotten around the firewall issues and can now run install_torch() with the https links set in my .Rprofile. Probably not important either way.
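
In case it helps anyone else behind a similar firewall, here's a sketch of the kind of .Rprofile override I mean. It assumes the TORCH_URL and LANTERN_URL environment variables described in torch's installation docs; the URLs are the ones mentioned earlier in this thread and can also be replaced with paths to locally downloaded zip files:

# sketch of an .Rprofile override for restricted networks
# (assumption: torch's installer honors TORCH_URL / LANTERN_URL)
Sys.setenv(
  TORCH_URL   = "https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-2.5.1%2Bcpu.zip",
  LANTERN_URL = "https://torch-cdn.mlverse.org/binaries/refs/heads/main/latest/lantern-0.14.1.9000+cpu-win64.zip"
)
# then, in a fresh session:
# torch::install_torch()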

@dfalbel
Member

dfalbel commented Feb 13, 2025

I've updated the MWE; because of the batch size, the original wasn't actually running any training:

library(luz)
library(torch)

ds <- tensor_dataset(torch_rand(10,118,8),torch_rand(10))

res_lstm <- nn_module(
  initialize = function(num_lags = 118){
    self$preprocess <- function(x){
      device <- x$device
      processed_vector <- torch_zeros(c(dim(x)[1],18,8), device = device)
      processed_vector[,1:8,] <- x[,1:8,]
      start_indices <- seq(9, 108, 11)
      
      for (i in 1:10) {
        start_idx <- start_indices[i]
        window <- x[, start_idx:(start_idx + 10),]
        processed_vector[, i + 8, ] <- torch_mean(window, dim = 2)
      }
      
      return(processed_vector)
    }
    
    self$num_lags <- num_lags
    
    self$res <- nn_sequential(
      nn_flatten(),
      nn_dropout(0.2),
      nn_linear(144,184),
      nn_sigmoid(),
      nn_dropout(0.2),
      nn_linear(184,46),
      nn_sigmoid(),
      nn_dropout(0.2),
      nn_linear(46,23),
      nn_sigmoid(),
      nn_linear(23,1)
    )
    
    self$lstm <- nn_lstm(8,46,batch_first = TRUE)
    self$lstm_connection <- nn_sequential(
      nn_sigmoid(),
      nn_linear(46,23),
      nn_sigmoid(),
      nn_linear(23,1))
  },
  
  forward = function(x){
    res <- self$res(self$preprocess(x))
    lstm <- self$lstm_connection(
      self$lstm(torch_flip(x,2))[[1]][,self$num_lags,]
    )
    torch_squeeze(nn_sigmoid()(res + lstm))
  }
)

fitted <- res_lstm %>% 
  setup(
    loss = nn_mse_loss(), 
    optimizer = optim_adam
  ) %>% 
  fit(ds, epochs = 10, dataloader_options = list(batch_size = 5))

It definitely runs on my Windows machine, so we need to identify what differs between our environments.

@dfalbel
Member

dfalbel commented Feb 13, 2025

Also, if you are still manually installing lantern, please make sure to use https://torch-cdn.mlverse.org/binaries/refs/heads/main/latest/lantern-0.14.1.9000+cpu-win64.zip. The link is slightly different from yesterday's, and this build contains today's changes, which should help fix the crash.

@jarroyoe
Author

I made an MWE without luz, and this one actually runs:

library(torch)

x <- torch_rand(10,118,8)
y <- torch_rand(10)

res_lstm <- nn_module(
    initialize = function(){
        self$lstm <- nn_lstm(8,46,batch_first = TRUE)
        self$lstm_connection <- nn_sequential(
            nn_sigmoid(),
            nn_linear(46,23),
            nn_sigmoid(),
            nn_linear(23,1))
    },
    
    forward = function(x){
        lstm <- self$lstm_connection(
            self$lstm(torch_flip(x,2))[[1]][,118,]
        )
        torch_squeeze(nn_sigmoid()(lstm))
    }
)

model <- res_lstm()
optimizer <- optim_adam(params = model$parameters)

for(epoch in 1:100){
	optimizer$zero_grad()
	y_pred <- model(x)
	loss <- torch_mean((y_pred - y)^2)
	cat("Epoch: ", epoch, "   Loss: ", loss$item(), "\n")
	loss$backward()
	optimizer$step()
}

@jarroyoe
Author

Also, if you are still manually installing lantern, please make sure to use https://torch-cdn.mlverse.org/binaries/refs/heads/main/latest/lantern-0.14.1.9000+cpu-win64.zip. The link is slightly different from yesterday's, and this build contains today's changes, which should help fix the crash.

Do you have a CUDA-compatible version? @danielrodonnell tried this, and the MWE worked on CPU.

@dfalbel
Member

dfalbel commented Feb 13, 2025

Yes, the CUDA-compatible one should be:

https://torch-cdn.mlverse.org/binaries/refs/heads/main/latest/lantern-0.14.1.9000+cu124-win64.zip

@jarroyoe
Author

I've updated the MWE; because of the batch size, the original wasn't actually running any training: […]

It definitely runs on my Windows machine, so we need to identify what differs between our environments.

I tried this and my R session still aborts.

@dfalbel
Member

dfalbel commented Feb 13, 2025

I was able to reproduce it locally now. After attaching a debugger I see:

Unhandled exception at 0x00007FF868343401 (cudnn64_9.dll) in rsession-utf8.exe: Fatal program exit requested.

I believe this is caused by a missing cuDNN installation.

I'm installing it now from https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/windows-x86_64/cudnn-windows-x86_64-8.9.7.29_cuda12-archive.zip to see if that's the missing library.

@dfalbel
Member

dfalbel commented Feb 13, 2025

I can confirm that installing cuDNN from this URL fixed the crash:

https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/windows-x86_64/cudnn-windows-x86_64-9.7.1.26_cuda12-archive.zip

(I actually had to bump from cuDNN 8 to cuDNN 9 - d0eb62c.)

So it's definitely a cuDNN version mismatch.
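
For anyone hitting the same crash, here's a quick sanity check to run after installing cuDNN. It's a sketch; backends_cudnn_is_available() is mentioned below, and it assumes backends_cudnn_version() is also present in your torch build:

library(torch)

# confirm the GPU stack is picked up after installing cuDNN 9
cuda_is_available()
backends_cudnn_is_available()
# reports the cuDNN version torch was able to load
# (assumption: this helper exists in your torch build)
backends_cudnn_version()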

@danielrodonnell

Thanks, @dfalbel, this seems to have done the trick! I think that's all our problems solved for now. Thanks again for being so responsive.

@dfalbel
Member

dfalbel commented Feb 13, 2025

Awesome! Glad it worked. I wish we could be more explicit in the error when cuDNN is missing :(
It seems torch is not dynamically checking for cuDNN when, e.g., calling:

torch::backends_cudnn_is_available()

so it just tries to load a symbol that doesn't exist.

@dfalbel dfalbel closed this as completed Feb 13, 2025