
Slow train loop compared to python #694

Open

mohamed-180 opened this issue Sep 24, 2021 · 15 comments

@mohamed-180 (Contributor)

With a trivial example of approximating a trigonometric function in Python:

## Imports
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
from torch import tensor
from torch import nn
from torch.utils.data.dataset import TensorDataset
from torch.utils.data import DataLoader
from time import time
from tqdm import tqdm

## Data
x = torch.linspace(-6,6,1000).reshape(-1,1)
y = torch.cos(x)
ds = TensorDataset(x,y)
dls = DataLoader(ds, batch_size=64, shuffle=True)

## Model
model = nn.Sequential(nn.Linear(1,4), nn.Sigmoid(), nn.Linear(4,1))
opt = torch.optim.Adam(model.parameters(), lr = .01)

## Train Step
def train_step(model, dls):
    epoch_loss = 0
    for xtrn , ytrn in dls:
        loss = F.mse_loss(model(xtrn), ytrn)
        opt.zero_grad()
        loss.backward()
        opt.step()

        epoch_loss += loss.item()
    return epoch_loss

## Train Model
l = 0
for epoch in tqdm(range(100)):
    l += train_step(model, dls)
print(l/100)

and the same implementation in R:

library(torch)

x <- torch_linspace(-6,6,1000)$reshape(c(-1,1))
y <- torch_sin(x)

ds <- tensor_dataset(x,y)
dls <- dataloader(ds, 64L, shuffle = TRUE)

model <- nn_sequential(nn_linear(1,16), nn_sigmoid(), nn_linear(16,1))
opt <- optim_adam(model$parameters, lr=.01)

for (epoch in 1:100){
  l <- 0
  coro::loop(for (b in dls) {
    loss = nnf_mse_loss(model(b[[1]]) , b[[2]])
    opt$zero_grad()
    loss$backward()
    opt$step()
    l <- l + loss$item()
  })

  l
}

The difference in time is quite noticeable, and I don't know why!
[Screenshot: timing of the R training loop, 2021-09-24]

@mohamed-180 (Contributor, Author) commented Sep 24, 2021

After dropping the dataloader and, as a workaround, using manual shuffling and batching:

library(torch)

x <- torch_linspace(-6,6,1000)$reshape(c(-1,1))
y <- torch_sin(x)
model <- nn_sequential(nn_linear(1,16), nn_sigmoid(), nn_linear(16,1))
opt <- optim_adam(model$parameters, lr=.01)
#--------------------------
#  shuffling and batching 👇 
#--------------------------
ind <- torch_randperm(length(x)) + 1L # indexing must start from 1 not zero
ind <- split(ind,ceiling(seq_along(ind)/64))

for (epoch in 1:100){
  l <- 0
  for(b in ind) {
    loss = nnf_mse_loss(model(x[b]) , y[b])
    opt$zero_grad()
    loss$backward()
    opt$step()
    l <- l + loss$item()
  }
  l
}

training now takes about half the time 😮

[Screenshot: timing with manual shuffling and batching, 2021-09-24]

@dfalbel (Member) commented Sep 27, 2021

Yes, this is still somewhat expected. We still need to make performance improvements on the R side, especially around data loading and the optimizer code.

However, small examples are likely to show larger differences, because the code is probably spending more time in R/Python than in the efficient C++ libtorch code that both the Python and the R packages share.
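
One quick way to see that per-call overhead (a rough sketch, not from the original comment): time many tiny matrix multiplications against a few large ones. The tiny-op loop is dominated by per-call R dispatch and glue code, while the large-op loop spends almost all of its time inside libtorch's C++ kernels.

library(torch)

x_small <- torch_randn(8, 8)
x_large <- torch_randn(2048, 2048)

# many tiny ops: mostly per-call R overhead
system.time(for (i in 1:2000) torch_matmul(x_small, x_small))

# a few large ops: mostly compute inside libtorch's C++ kernels
system.time(for (i in 1:20) torch_matmul(x_large, x_large))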

@mohamed-180 (Contributor, Author)

That is true; the optimizer's step function takes up a lot of the time in the training loop 👇

[Screenshot: profile showing time spent in the optimizer step, 2021-09-25]

@cgorac commented Aug 23, 2022

I'm new to torch for R and haven't found a mailing list where it might be more appropriate to discuss this, so I'll leave my comment here.

I am also rather surprised at how slow training is in torch for R compared to Python. I have the following hello-world MNIST code in Python:

import torch

from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

DATADIR = "data"
BATCH_SIZE = 64
EPOCHS = 5

class Mnist(nn.Module):
    def __init__(self):
        super(Mnist, self).__init__()
        self.flatten = nn.Flatten()
        self.sequential = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.sequential(x)
        return logits

def train(dataloader, model, loss_fn, optimizer, device):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        pred = model(X)
        loss = loss_fn(pred, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

def test(dataloader, model, loss_fn, device):
    size = len(dataloader.dataset)
    model.eval()
    test_loss, correct = 0, 0
    num_batches = 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            
            pred = model(X)
            loss = loss_fn(pred, y).item()
            
            test_loss += loss
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
            num_batches += 1
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

if __name__ == "__main__":
    torch.manual_seed(1)
    
    device = "cuda" if torch.cuda.is_available() else "cpu"
    
    train_data = datasets.MNIST(
        root=DATADIR,
        train=True,
        download=True,
        transform=ToTensor()
    )

    test_data = datasets.MNIST(
        root=DATADIR,
        train=False,
        download=True,
        transform=ToTensor()
    )

    train_dataloader = DataLoader(train_data, batch_size = BATCH_SIZE)
    test_dataloader = DataLoader(test_data, batch_size = BATCH_SIZE)

    model = Mnist().to(device)

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.RMSprop(model.parameters())
    
    for t in range(EPOCHS):
        print(f"Epoch {t+1}")
        print("-------------------------------")
        train(train_dataloader, model, loss_fn, optimizer, device)
        test(test_dataloader, model, loss_fn, device)
        print("Done!")

and then what I think is an equivalent in R:

library(magrittr)
library(torch)
library(luz)
library(torchvision)

DATADIR <- "data"
BATCH_SIZE <- 64
EPOCHS <- 5

train_ds <- mnist_dataset(
  DATADIR,
  download = TRUE,
  transform = transform_to_tensor
)

test_ds <- mnist_dataset(
  DATADIR,
  train = FALSE,
  transform = transform_to_tensor
)

train_dl <- dataloader(train_ds, batch_size = BATCH_SIZE)
test_dl <- dataloader(test_ds, batch_size = BATCH_SIZE)

model <- nn_module(
  "Mnist",
  
  initialize = function() {
    self$sequential <- nn_sequential(
      nn_linear(28 * 28, 512),
      nn_relu(),
      nn_linear(512, 10)
    )
  },
  
  forward = function(x) {
    x %>%
      torch_flatten(start_dim = 2) %>%
      self$sequential()
  }
)

fitted <- model %>%
  setup(
    loss = nn_cross_entropy_loss(),
    optimizer = optim_rmsprop,
    metrics = list(
      luz_metric_accuracy()
    )
  ) %>%
  fit(train_dl, epochs = EPOCHS, valid_data = test_dl)

I did runs in both cases with the training/test data already downloaded. My GPU is a rather old Quadro P3000, but still adequate for this small problem. The Python code takes about 33 s to train, while the R code takes about 2020 s, so roughly 60x slower. Note that torch::cuda_is_available() returns TRUE, and judging from nvidia-smi output my GPU is busy while the R code runs, in the sense that I see increased GPU memory usage roughly comparable to when the Python code runs.

The amount of GPU computation should be exactly the same in both cases: the same amount of training/test data and the same network, thus the same number of coefficients to optimize; so I guess the difference comes from the R code in torch for R. I read the documentation and understood that, for example, the optimizers and other parts of torch are implemented in R, but I don't understand why that is, i.e. why the corresponding code from libtorch is not also wrapped for use in R.

For the record, C++ libtorch code completely equivalent to the Python code above takes 5 s to train. Considering that most of the performance-critical code in PyTorch is actually shared with libtorch, the PyTorch performance is rather disappointing too. On the other hand, admittedly, the C++ code takes twice as long to compile as to run on my laptop: the code itself is about 120 lines, but since libtorch is heavily templated, the preprocessed file actually sent to the compiler is about 320k lines long.

(As an additional note, the equivalent Keras R code takes about 15 s to train.)

@cmcclellan

I want to +1 this comment. I am very interested in moving to this package, but the training times I'm experiencing are orders of magnitude longer than with Keras/TensorFlow. That moves the torch package from being the go-to package to something I might dust off every once in a while for an edge case. I hope this can be optimized soon.

@sebffischer (Collaborator)

We have been trying to profile R torch to see where the performance difference from PyTorch comes from.

For this, we used the code below, which takes around 20 seconds. The equivalent in PyTorch runs in about 3 seconds.

library(torch)

p = 100
steps = 10000
n = 1000

X = torch_randn(n, p, device = "cuda")
beta = torch_randn(p, 1, device = "cuda")
Y = X$matmul(beta)

latent = 5000

net = nn_sequential(
  nn_linear(p, latent),
  nn_relu(),
  nn_linear(latent, 1)
)

net$cuda()

t1 = Sys.time()

prof = profvis::profvis({
  for (i in 1:steps) {
      Y_hat = net(X)
      loss = nnf_mse_loss(Y, Y_hat)
  }
}, simplify = FALSE)

t2 = Sys.time()

htmlwidgets::saveWidget(prof, "~/torch.html", selfcontained = TRUE)

print(paste0("Total time: ", t2 - t1))

Below is a screenshot of the flame graph, where the grey areas are time spent in garbage collection.
[Screenshot: profvis flame graph, 2024-08-22]

If this data is correct (and profvis isn't doing anything funny), around 3/4 of the time in this loop is spent on garbage collection. That means that without garbage collection, torch would take around 4-5 seconds, compared to PyTorch's 3 seconds.
Do you think these numbers are plausible, and if so, is there something we can do about it?
I would also love to help with this, but it seems to be too deep in the C code for me to understand what's going on.
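
As a rough cross-check of the profvis numbers (a sketch, not something from the original comment): base R's gc.time() reports cumulative time spent in garbage collection, so the difference before and after the loop gives the GC share without a profiler. The first call also switches GC timing on if it isn't already.

before <- gc.time()
# ... run the profiled loop from above here ...
after <- gc.time()
after - before  # user/system/elapsed seconds spent in GC during the loop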

@dfalbel (Member) commented Aug 22, 2024

I recommend taking a look at this section in the documentation:

https://torch.mlverse.org/docs/articles/memory-management#cuda

Basically, when you are in a situation of high memory pressure on the GPU, we need to force a GC at every iteration, because otherwise libtorch can't allocate GPU memory. Unfortunately, there's no way for us to tell R to trigger a lighter GC. You might have some success tuning these parameters, and maybe we should consider a different default set.

The problem is that R is not aware of how much memory each tensor uses, so it will not trigger GC when it should.

We could fix this, though it would be a non-trivial amount of work: by tracking all tensors that are sent to R and making sure we can delete them immediately when they go out of scope, we might be able to free the GPU memory even before R has garbage collected the R object itself.
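
For illustration, a minimal sketch of the situation as seen from user code (not torch's internal mechanism): GPU memory held by temporaries only comes back once a full R GC runs the external-pointer finalizers, which is why a full gc() ends up being forced on every iteration under memory pressure.

library(torch)

for (i in 1:100) {
  tmp <- torch_randn(1000, 1000, device = "cuda")  # temporary tensor on the GPU
  # ... use tmp ...
  rm(tmp)  # drops the R reference, but the GPU memory is not released yet
  gc()     # only a full R GC runs the finalizer that actually frees the memory
}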

@sebffischer (Collaborator)

Thanks, I will look into this! I would really be interested in trying to help improve this, because I think it is currently a real showstopper for some use cases.

@sebffischer (Collaborator)

This is maybe a naive idea, but could torch not do its own bookkeeping of the tensors and only garbage collect that list of tensors instead of all R objects? Maybe one could use the reference counts for that: https://developer.r-project.org/Refcnt.html

@dfalbel (Member) commented Aug 23, 2024

I'm not sure I completely follow the suggestion. Would we, instead of calling GC, walk through the object list looking for tensors and remove those with a refcount of 0? I'm not sure how to do this, but we probably can somehow.

Another idea that might work is to somehow tell the R allocator how much memory tensors are using, so that R would call GC more often at the expected points and we wouldn't need to force it. I think this is possible with ALTREP objects.
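
A crude R-level way to picture that second idea (purely an illustrative sketch of the effect, not something torch does, and wasteful because it duplicates the tensor's size in CPU memory): attach a throwaway allocation of comparable size to the wrapper object, so R's ordinary GC heuristics see the memory pressure.

# hypothetical helper, for illustration only
with_ballast <- function(t) {
  nbytes <- prod(dim(t)) * 4           # assuming a float32 tensor
  attr(t, "ballast") <- raw(nbytes)    # dummy R allocation the GC accounts for
  t
}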

@sebffischer (Collaborator) commented Aug 28, 2024

> I'm not sure I completely follow the suggestion. Would we, instead of calling GC, walk through the object list looking for tensors and remove those with a refcount of 0? I'm not sure how to do this, but we probably can somehow.

Yeah, that would be my idea. It is also possible to access the refcount, even from within R, but I am not sure whether I understand its behavior. E.g. below, shouldn't the refcount be 1?

x = torch::torch_randn(1)

.Internal(refcnt(x))
#> [1] 4

Created on 2024-08-28 with reprex v2.1.1

This excerpt from the link I sent also suggests that refcounts are not always properly decremented:

> Work in Progress
>
> Since reference counts are applied to all objects, including environments and promises, a check after applying a closure can show which bindings are no longer needed, and this allows reference counts on arguments to be decremented again. Thus, for example, after the call
>
> mean(x)
>
> this mechanism can restore the reference count of x to its value prior to the call (no modification of the mean source is needed). There are still a number of rough edges and interactions with the complex assignment process that need to be resolved, so this is not yet committed to the subversion sources. But it is likely these can be resolved and the result committed before too long.

Coming back to your comment:

> Another idea that might work is to somehow tell the R allocator how much memory tensors are using, so that R would call GC more often at the expected points and we wouldn't need to force it. I think this is possible with ALTREP objects.

This would still require calls to GC in situations where we only want to free torch tensors and not other R objects, right?

@sebffischer (Collaborator) commented Sep 6, 2024

Coming back to this: I think the already available torch::jit_trace looks very helpful for gaining speed improvements.
I have some questions about whether my understanding of what would have to be done to get rid of the GC calls (at least in the standard training loop) is correct.

Let's say I write a simple training loop

library(torch)
net = nn_sequential(
  nn_linear(20, 100),
  nn_relu(),
  nn_linear(100, 1)
)
x = torch_randn(100, 20)
beta = torch_randn(20, 1)
y = torch_matmul(x, beta)

opt = optim_adam(net$parameters)

net_jit = jit_trace(net, x)

for (i in 1:100) {
  opt$zero_grad()
  y_hat = net_jit(x)
  loss = nnf_mse_loss(y, y_hat)
  loss$backward()
  opt$step()
}

In each iteration of the loop, some temporary tensors are allocated. These allocations are freed via the finalizer of the external pointer, which is only called after garbage collection, which is slow and which we would like to avoid.
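
(A tiny base-R aside, not torch-specific, just to show why the GC matters here: finalizers registered on environments or external pointers only run when the garbage collector actually runs.)

e <- new.env()
reg.finalizer(e, function(x) message("finalizer ran"))
rm(e)            # the object is unreachable, but nothing happens yet
invisible(gc())  # the finalizer only runs now, during garbage collection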

In the for-loop, we have the following temporary tensor allocations in each iteration:

  1. The intermediate tensors that are created when calling net_jit(x), i.e. the output of the first linear layer and of the relu layer. Because we call net_jit(x) in the example and not net(x), these are not R objects but only live in TorchScript, correct? Because they are needed for the backward pass, they are still kept until $backward() is called, but they don't require calling any R finalizer.
  2. The loss. This is an external-pointer R object with a finalizer that needs to be called to reclaim the memory.
  3. The gradients created when calling loss$backward() (net$parameters[[1]]$grad etc.). Are these also R external pointers, just like loss, that require calling their finalizers?
    However, after calling $backward(), the intermediate tensors that were saved for the backward pass are freed.
  4. Any intermediate tensors that are created within opt$step().

In order to avoid having to call into the R GC, we would therefore have to:

  1. Free the loss after having called opt$step().
    --> Would it be possible to just offer a torch_free method that can be called manually (at one's own risk)? See the sketch after this list.
  2. Make opt$zero_grad() also free the gradients after "zeroing" them.
    --> Maybe this is already happening? I am not sure.
  3. Ensure that any intermediate tensors created within opt$step() are also freed.
    --> This seems manageable as well, especially if we initially only focus on the default optimizers; something along the lines of https://github.com/dfalbel/torchoptx would solve the problem, right?
    Or we just have to manually torch_free the temporary tensors that are created within opt$step().
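
To make point 1 concrete, here is a rough sketch of the earlier training loop using such a torch_free(); the function is hypothetical at this point and just mirrors the proposal above:

for (i in 1:100) {
  opt$zero_grad()
  y_hat = net_jit(x)
  loss = nnf_mse_loss(y, y_hat)
  loss$backward()
  opt$step()
  l <- loss$item()   # grab the scalar value before releasing the tensor
  torch_free(y_hat)  # hypothetical: release the prediction tensor right away
  torch_free(loss)   # hypothetical: release the loss tensor right away
}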

What do you think about these suggestions?
And also, is my understanding correct?

@sebffischer (Collaborator)

When I compare the jitted R torch code (https://github.com/sebffischer/torch-profiling/blob/master/2024-09-06/rtorch.R) with the equivalent PyTorch code, the R torch code runs around 3x slower than the Python version.
This is already much better than the 6-7x factor I saw earlier.

Still, 2.6 of the roughly 5.6 seconds the R code runs are due to the GC.

Below I attached an image of the profile; the whole file can also be found here in case you are interested.

Interestingly, the GC seems to be called during the (jitted) forward call of the network.

[Screenshot: profile of the jitted R torch code, 2024-09-06]

@dfalbel (Member) commented Sep 11, 2024

I like your suggestions; let's see how much we can improve with torch_free. I'm going to add that function.
I believe it will help us call into the GC less, because we won't be in such memory-pressured situations.

@sebffischer (Collaborator) commented Oct 22, 2024

Hey @dfalbel, I haven't yet tried out those functions, mostly because I am unable to install the dev version of the package (once a new release is made, I can explore them).

I have another idea that I would like to explore: while torch shines partly because it is so flexible, I think a lot of use cases are also covered by a standard training loop and a standard optimizer.

Would it be possible to implement such a standard training loop in C++?
Here I mean that we would:

  • Use the optimizer implementation from C++ (what you have already started here: https://github.com/dfalbel/torchoptx).
  • Conduct the forward and backward pass as well as the optimizer step in C++. Because we can jit the model, I think this should be possible, and because we do everything in C++, we also have no garbage-collector issue. Basically, only the data loading and the definition of the neural network architecture happen in R.
    My hope is that this would cover most use cases people care about while getting us very close to PyTorch performance.

If you would be interested in pursuing this, I would love to contribute. I have already found https://github.com/mlverse/lltm as a starting point, but I can't quite get it running. Further, I don't have a lot of experience in C++, and the whole build is non-trivial, as you mention yourself. Let me know whether you think this might be a promising path to explore.
I can also imagine this being its own package rather than necessarily part of mlverse/torch.
