
R Session Aborted when initializing training #1275

Closed
jarroyoe opened this issue Feb 12, 2025 · 18 comments

Comments

@jarroyoe

jarroyoe commented Feb 12, 2025

I have a deep learning model I'm trying to train using luz. An MWE of the model and training goes as follows:

library(luz)
library(torch)

ds <- tensor_dataset(torch_rand(10,118,8),torch_rand(10))

res_lstm <- nn_module(
    initialize = function(num_lags = 118){
        self$num_lags <- num_lags
        self$lstm <- nn_lstm(8,46,batch_first = TRUE)
        self$lstm_connection <- nn_sequential(
            nn_sigmoid(),
            nn_linear(46,23),
            nn_sigmoid(),
            nn_linear(23,1))
    },
    
    forward = function(x){
        lstm <- self$lstm_connection(
            self$lstm(torch_flip(x,2))[[1]][,self$num_lags,]
        )
        torch_squeeze(nn_sigmoid()(lstm))
    }
)

fitted <- res_lstm %>% 
  setup(loss = nn_mse_loss(), 
        optimizer = optim_adam) %>% 
  fit(ds, epochs = 2)

When I try to run this, my R session aborts without an exit code when using libtorch 2.5.1, lantern 0.14.1, and the main branch of torch installed with remotes::install_github("mlverse/torch"). The script works fine with libtorch 2.0.1, lantern 0.12.0, and torch 0.12.0. Here's my R.version:

platform       x86_64-w64-mingw32               
arch           x86_64                           
os             mingw32                          
crt            ucrt                             
system         x86_64, mingw32                  
status                                          
major          4                                
minor          3.2                              
year           2023                             
month          10                               
day            31                               
svn rev        85441                            
language       R                                
version.string R version 4.3.2 (2023-10-31 ucrt)
nickname       Eye Holes  

Could you help me figure out why this crashes?
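
For reference, here's a small diagnostic sketch (nothing beyond standard torch helpers plus base R's packageVersion()) that reports the versions and GPU visibility involved, without starting any training:

library(torch)

# package versions as seen by the current R session
packageVersion("torch")
packageVersion("luz")

# GPU / backend visibility; these calls don't allocate tensors or start training
cuda_is_available()
if (cuda_is_available()) cuda_device_count()
backends_cudnn_is_available()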

@danielrodonnell

danielrodonnell commented Feb 12, 2025

I'll add that it crashes on my machine too (same office), so it's not specific to @jarroyoe 's computer.

sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C LC_TIME=English_United States.utf8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] luz_0.4.0 torch_0.14.1

loaded via a namespace (and not attached):
[1] coro_1.1.0 R6_2.6.0 bit_4.5.0.1 magrittr_2.0.3 pkgconfig_2.0.3 bit64_4.6.0-1 generics_0.1.3
[8] lifecycle_1.0.4 ps_1.8.1 cli_3.6.3 processx_3.8.5 callr_3.7.6 vctrs_0.6.5 zeallot_0.1.0
[15] compiler_4.4.1 prettyunits_1.2.0 rstudioapi_0.16.0 tools_4.4.1 hms_1.1.3 Rcpp_1.0.14 crayon_1.5.3

@dfalbel
Member

dfalbel commented Feb 13, 2025

This also looks related to #1273 (comment).
I'm actively investigating it. Can you confirm that it only happens when running from RStudio?

@dfalbel
Member

dfalbel commented Feb 13, 2025

Hi @jarroyoe and @danielrodonnell

I don't have a reliable way to reproduce the problem, but I made a speculative fix and merged it to main. Could you try installing torch from main and see if it fixes the issue?

remotes::install_github("mlverse/main")

Sorry for the disruption. Thanks!

@danielrodonnell

@dfalbel I'll give this a try. I am working in RStudio.

@jarroyoe
Author

Hi @dfalbel, I tried reinstalling torch at HEAD and the error persists. I also tried running the script directly in R instead of RStudio and it still crashes.

@dfalbel
Member

dfalbel commented Feb 13, 2025

Just to confirm: when loading torch, do you see the following? It's especially important that it downloads lantern-0.14.1.9000:

ℹ Additional software needs to be downloaded and installed for torch to work correctly.
trying URL 'https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-2.5.1%2Bcpu.zip'
Content type 'application/zip' length 187685286 bytes (179.0 MB)
downloaded 179.0 MB

trying URL 'https://torch-cdn.mlverse.org/binaries/refs/heads/main/latest/lantern-0.14.1.9000+cpu-win64.zip'
Content type 'application/x-zip-compressed' length 2516096 bytes (2.4 MB)
downloaded 2.4 MB

Also, does the error happen only when training the model, or does just

torch_randn(10)

trigger the error, like in #1273?

@jarroyoe
Author

jarroyoe commented Feb 13, 2025

torch_randn(10) doesn't trigger the crash. That happened before because I was using an older version of lantern. Unfortunately, both @danielrodonnell and I have to download the binaries manually because of firewall issues.

I'm currently trying to make an MWE without luz.

@danielrodonnell

@jarroyoe @dfalbel

Just a minor correction in case it matters: I've gotten around the firewall issues and can now run install_torch() with the https links set in my .Rprofile. Probably not important either way.
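
In case it helps anyone else behind a similar firewall, here's a sketch of the kind of .Rprofile override I mean. It assumes the TORCH_URL and LANTERN_URL environment variables described in torch's installation docs; the URLs are the ones mentioned earlier in this thread and can also be replaced with paths to locally downloaded zip files:

# sketch of an .Rprofile override for restricted networks
# (assumption: torch's installer honors TORCH_URL / LANTERN_URL)
Sys.setenv(
  TORCH_URL   = "https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-2.5.1%2Bcpu.zip",
  LANTERN_URL = "https://torch-cdn.mlverse.org/binaries/refs/heads/main/latest/lantern-0.14.1.9000+cpu-win64.zip"
)
# then, in a fresh session:
# torch::install_torch()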

@dfalbel
Member

dfalbel commented Feb 13, 2025

I've updated the MWE; because of the batch size, the original wasn't actually running any training:

library(luz)
library(torch)

ds <- tensor_dataset(torch_rand(10,118,8),torch_rand(10))

res_lstm <- nn_module(
  initialize = function(num_lags = 118){
    self$preprocess <- function(x){
      device <- x$device
      processed_vector <- torch_zeros(c(dim(x)[1],18,8), device = device)
      processed_vector[,1:8,] <- x[,1:8,]
      start_indices <- seq(9, 108, 11)
      
      for (i in 1:10) {
        start_idx <- start_indices[i]
        window <- x[, start_idx:(start_idx + 10),]
        processed_vector[, i + 8, ] <- torch_mean(window, dim = 2)
      }
      
      return(processed_vector)
    }
    
    self$num_lags <- num_lags
    
    self$res <- nn_sequential(
      nn_flatten(),
      nn_dropout(0.2),
      nn_linear(144,184),
      nn_sigmoid(),
      nn_dropout(0.2),
      nn_linear(184,46),
      nn_sigmoid(),
      nn_dropout(0.2),
      nn_linear(46,23),
      nn_sigmoid(),
      nn_linear(23,1)
    )
    
    self$lstm <- nn_lstm(8,46,batch_first = TRUE)
    self$lstm_connection <- nn_sequential(
      nn_sigmoid(),
      nn_linear(46,23),
      nn_sigmoid(),
      nn_linear(23,1))
  },
  
  forward = function(x){
    res <- self$res(self$preprocess(x))
    lstm <- self$lstm_connection(
      self$lstm(torch_flip(x,2))[[1]][,self$num_lags,]
    )
    torch_squeeze(nn_sigmoid()(res + lstm))
  }
)

fitted <- res_lstm %>% 
  setup(
    loss = nn_mse_loss(), 
    optimizer = optim_adam
  ) %>% 
  fit(ds, epochs = 10, dataloader_options = list(batch_size = 5))

It definitely runs on my Windows machine, so we need to identify what differs between our environments.

@dfalbel
Member

dfalbel commented Feb 13, 2025

Also, if you are still manually installing lantern, please make sure to use https://torch-cdn.mlverse.org/binaries/refs/heads/main/latest/lantern-0.14.1.9000+cpu-win64.zip. The link is slightly different from yesterday's, and this build contains today's changes, which should help fix the crash.

@jarroyoe
Author

I made an MWE without luz, and this one actually runs:

library(torch)

x <- torch_rand(10,118,8)
y <- torch_rand(10)

res_lstm <- nn_module(
    initialize = function(){
        self$lstm <- nn_lstm(8,46,batch_first = TRUE)
        self$lstm_connection <- nn_sequential(
            nn_sigmoid(),
            nn_linear(46,23),
            nn_sigmoid(),
            nn_linear(23,1))
    },
    
    forward = function(x){
        lstm <- self$lstm_connection(
            self$lstm(torch_flip(x,2))[[1]][,118,]
        )
        torch_squeeze(nn_sigmoid()(lstm))
    }
)

model <- res_lstm()
optimizer <- optim_adam(params = model$parameters)

for(epoch in 1:100){
	optimizer$zero_grad()
	y_pred <- model(x)
	loss <- torch_mean((y_pred - y)^2)
	cat("Epoch: ", epoch, "   Loss: ", loss$item(), "\n")
	loss$backward()
	optimizer$step()
}

@jarroyoe
Author

Also, if you are still manually installing lantern, please make sure to use https://torch-cdn.mlverse.org/binaries/refs/heads/main/latest/lantern-0.14.1.9000+cpu-win64.zip. The link is slightly different from yesterday's, and this build contains today's changes, which should help fix the crash.

Do you have a CUDA-compatible version? @danielrodonnell tried this, and the MWE worked on CPU.

@dfalbel
Member

dfalbel commented Feb 13, 2025

Yes, the CUDA-compatible one should be:

https://torch-cdn.mlverse.org/binaries/refs/heads/main/latest/lantern-0.14.1.9000+cu124-win64.zip

@jarroyoe
Author

I've updated the MWE; because of the batch size, the original wasn't actually running any training: […]

It definitely runs on my Windows machine, so we need to identify what differs between our environments.

I tried this and my R session still aborts.

@dfalbel
Member

dfalbel commented Feb 13, 2025

I was able to reproduce it locally now. After attaching a debugger I see:

Unhandled exception at 0x00007FF868343401 (cudnn64_9.dll) in rsession-utf8.exe: Fatal program exit requested.

I believe this is caused by a missing cuDNN installation.

I'm installing it now from https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/windows-x86_64/cudnn-windows-x86_64-8.9.7.29_cuda12-archive.zip to see if that's the missing library.

@dfalbel
Member

dfalbel commented Feb 13, 2025

I can confirm that installing cuDNN from this URL fixed the crash:

https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/windows-x86_64/cudnn-windows-x86_64-9.7.1.26_cuda12-archive.zip

(I actually had to bump from cuDNN 8 to cuDNN 9 - d0eb62c.)

So it's definitely a cuDNN version mismatch.
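
For anyone hitting the same crash, here's a quick sanity check to run after installing cuDNN. It's a sketch; backends_cudnn_is_available() is mentioned below, and it assumes backends_cudnn_version() is also present in your torch build:

library(torch)

# confirm the GPU stack is picked up after installing cuDNN 9
cuda_is_available()
backends_cudnn_is_available()
# reports the cuDNN version torch was able to load
# (assumption: this helper exists in your torch build)
backends_cudnn_version()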

@danielrodonnell

Thanks, @dfalbel, this seems to have done the trick! I think that's all our problems solved for now. Thanks again for being so responsive.

@dfalbel
Member

dfalbel commented Feb 13, 2025

Awesome! Glad it worked. I wish we could be more explicit in the error when cuDNN is missing :(
It seems torch is not dynamically checking for cuDNN when, e.g., calling:

torch::backends_cudnn_is_available()

so it just tries to load a symbol that doesn't exist.

@dfalbel dfalbel closed this as completed Feb 13, 2025