Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GraphBolt] fix random generator for shuffle among all workers #6982

Merged
merged 1 commit into from
Jan 22, 2024

Conversation

Rhett-Ying
Copy link
Collaborator

@Rhett-Ying Rhett-Ying commented Jan 19, 2024

Description

image

Test accuracy 45.8261

Epoch: 01, Loss: 2.3362, Valid accuracy: 48.41%, Time 68.9185
Training~Epoch 02: 0it [00:00, ?it/s]rank None shuffled indices: tensor([404999, 305221, 507294,  ..., 354460, 285126, 516371])
rank None shuffled indices: tensor([404999, 305221, 507294,  ..., 354460, 285126, 516371])
rank None shuffled indices: tensor([404999, 305221, 507294,  ..., 354460, 285126, 516371])
rank None shuffled indices: tensor([404999, 305221, 507294,  ..., 354460, 285126, 516371])
rank None shuffled indices: tensor([404999, 305221, 507294,  ..., 354460, 285126, 516371])
rank None shuffled indices: tensor([404999, 305221, 507294,  ..., 354460, 285126, 516371])
rank None shuffled indices: tensor([404999, 305221, 507294,  ..., 354460, 285126, 516371])
rank None shuffled indices: tensor([404999, 305221, 507294,  ..., 354460, 285126, 516371])
rank None shuffled indices: tensor([404999, 305221, 507294,  ..., 354460, 285126, 516371])
rank None shuffled indices: tensor([404999, 305221, 507294,  ..., 354460, 285126, 516371])
rank None shuffled indices: tensor([404999, 305221, 507294,  ..., 354460, 285126, 516371])
rank None shuffled indices: tensor([404999, 305221, 507294,  ..., 354460, 285126, 516371])
rank None shuffled indices: tensor([404999, 305221, 507294,  ..., 354460, 285126, 516371])
rank None shuffled indices: tensor([404999, 305221, 507294,  ..., 354460, 285126, 516371])
rank None shuffled indices: tensor([404999, 305221, 507294,  ..., 354460, 285126, 516371])
rank None shuffled indices: tensor([404999, 305221, 507294,  ..., 354460, 285126, 516371])
Training~Epoch 02: 615it [02:25,  4.23it/s]
Evaluating the model on the validation set.
Inference: 16it [00:03,  4.22it/s]
Finish evaluating on validation set.
Epoch: 02, Loss: 1.5541, Valid accuracy: 48.49%, Time 145.3738
Training~Epoch 03: 0it [00:00, ?it/s]rank None shuffled indices: tensor([ 87977, 530066, 189900,  ..., 302993,  15348, 600035])
rank None shuffled indices: tensor([ 87977, 530066, 189900,  ..., 302993,  15348, 600035])
rank None shuffled indices: tensor([ 87977, 530066, 189900,  ..., 302993,  15348, 600035])
rank None shuffled indices: tensor([ 87977, 530066, 189900,  ..., 302993,  15348, 600035])
rank None shuffled indices: tensor([ 87977, 530066, 189900,  ..., 302993,  15348, 600035])
rank None shuffled indices: tensor([ 87977, 530066, 189900,  ..., 302993,  15348, 600035])
rank None shuffled indices: tensor([ 87977, 530066, 189900,  ..., 302993,  15348, 600035])

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature]])
  • I've leverage the tools to beautify the python and c++ code.
  • The PR is complete and small, read the Google eng practice (CL equals to PR) to understand more about small PR. In DGL, we consider PRs with less than 200 lines of core code change are small (example, test and documentation could be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@dgl-bot
Copy link
Collaborator

dgl-bot commented Jan 19, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot
Copy link
Collaborator

dgl-bot commented Jan 19, 2024

Commit ID: 0618b4bfc3fd55cb8e9386c9c2d5f71a400a563b

Build ID: 1

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin
Copy link
Collaborator

mfbalin commented Jan 19, 2024

Is there a particular reason we use np.rng and np.shuffle? If we used torch, the code could be enabled to work on the GPU later.

@Rhett-Ying
Copy link
Collaborator Author

Is there a particular reason we use np.rng and np.shuffle? If we used torch, the code could be enabled to work on the GPU later.

torch.Generator() cannot be pickled. and is this targeted on GPU? why do we need that?

@mfbalin
Copy link
Collaborator

mfbalin commented Jan 20, 2024

Is there a particular reason we use np.rng and np.shuffle? If we used torch, the code could be enabled to work on the GPU later.

torch.Generator() cannot be pickled. and is this targeted on GPU? why do we need that?

I see. @frozenbugs was also asking about whether we should consider enabling GPU for the ItemSampler. Now I know what are the limitations. Thanks!

There is still a way to utilize torch shuffle while utilizing np.rng. We store an np.rng object and whenever we need to shuffle, we get a random number from np.rng, and seed a new torch rng. Then we shuffle with torch shuffle with torch rng.

@Rhett-Ying
Copy link
Collaborator Author

I don't know whether moving ItemSampler to GPU is a good idea or best practice. Does torch.DataLoader ever think about it? Putting all workloads onto GPU may lose overlapping. The best practice may be combining CPU and GPU together instead of relying on single one of them.

@mfbalin
Copy link
Collaborator

mfbalin commented Jan 20, 2024

I don't know whether moving ItemSampler to GPU is a good idea or best practice. Does torch.DataLoader ever think about it? Putting all workloads onto GPU may lose overlapping. The best practice may be combining CPU and GPU together instead of relying on single one of them.

There is one more reason the solution I propose could be better than the numpy one. If we use numpy, it will keep its own thread pool separate from torch thread pool. It is best if we use torch for all operations if we can, even if it runs on the CPU and we don't do any refactoring to enable GPU execution.

@mfbalin
Copy link
Collaborator

mfbalin commented Jan 20, 2024

I benchmarked the two approaches and seems like numpy is faster on the CPU, at least in the Google Colab environment: https://colab.research.google.com/drive/1_5M2aLPrjLcHSrnTyT6GpB0FMyRhuCbO?usp=sharing

import torch
import numpy
import time

def numpy_way(len, shuffle, device, rng):
    indices = torch.arange(len)
    if shuffle:
        rng.shuffle(indices.numpy())
    return indices.to(device)

def torch_way(len, shuffle, device, rng):
    if shuffle:
        seed = rng.integers(2 ** 62).item()
        torch_rng = torch.Generator(device).manual_seed(seed)
        return torch.randperm(len, generator=torch_rng, device=device)
    else:
        return torch.arange(len, device=device)

rng = numpy.random.default_rng()

for log_len in range(15, 25):
    len = 2 ** log_len
    for device in ["cpu", "cuda"]:
        runtimes = []
        for shuffler in [numpy_way, torch_way]:
            for i in range(10):
                if i == 3: # Warmup for 3 iterations.
                    start = time.time()
                indices = shuffler(len, True, device, rng)
            if device == "cuda":
                torch.cuda.synchronize()
            runtimes.append(time.time() - start)
        print(f"len: {len}, device: {device}, numpy: {runtimes[0]:.4f}, torch: {runtimes[1]:.4f}")
len: 32768, device: cpu, numpy: 0.0046, torch: 0.0037
len: 32768, device: cuda, numpy: 0.0037, torch: 0.0017
len: 65536, device: cpu, numpy: 0.0059, torch: 0.0069
len: 65536, device: cuda, numpy: 0.0072, torch: 0.0020
len: 131072, device: cpu, numpy: 0.0128, torch: 0.0146
len: 131072, device: cuda, numpy: 0.0143, torch: 0.0036
len: 262144, device: cpu, numpy: 0.0285, torch: 0.0339
len: 262144, device: cuda, numpy: 0.0312, torch: 0.0056
len: 524288, device: cpu, numpy: 0.0593, torch: 0.0757
len: 524288, device: cuda, numpy: 0.0618, torch: 0.0111
len: 1048576, device: cpu, numpy: 0.1561, torch: 0.1794
len: 1048576, device: cuda, numpy: 0.1328, torch: 0.0212
len: 2097152, device: cpu, numpy: 0.3178, torch: 0.4349
len: 2097152, device: cuda, numpy: 0.3435, torch: 0.0391
len: 4194304, device: cpu, numpy: 0.8298, torch: 1.1266
len: 4194304, device: cuda, numpy: 0.9278, torch: 0.0763
len: 8388608, device: cpu, numpy: 2.9981, torch: 2.5587
len: 8388608, device: cuda, numpy: 1.8150, torch: 0.1607
len: 16777216, device: cpu, numpy: 5.9163, torch: 7.2578
len: 16777216, device: cuda, numpy: 5.5185, torch: 0.3070

@Rhett-Ying So we might wanna keep your existing solution because it is more performant, in the google colab environment. However, if we want to have the advantages I listed above, we could go with the torch solution.

@mfbalin
Copy link
Collaborator

mfbalin commented Jan 20, 2024

Actually, on my machine with AMD Ryzen 5950X CPU, torch seems to be the much more performant one:

len: 32768, device: cpu, numpy: 0.0017, torch: 0.0010
len: 32768, device: cuda, numpy: 0.0028, torch: 0.0050
len: 65536, device: cpu, numpy: 0.0053, torch: 0.0038
len: 65536, device: cuda, numpy: 0.0055, torch: 0.0039
len: 131072, device: cpu, numpy: 0.0119, torch: 0.0052
len: 131072, device: cuda, numpy: 0.0101, torch: 0.0060
len: 262144, device: cpu, numpy: 0.0174, torch: 0.0108
len: 262144, device: cuda, numpy: 0.0208, torch: 0.0050
len: 524288, device: cpu, numpy: 0.0380, torch: 0.0230
len: 524288, device: cuda, numpy: 0.0437, torch: 0.0086
len: 1048576, device: cpu, numpy: 0.0776, torch: 0.0530
len: 1048576, device: cuda, numpy: 0.0868, torch: 0.0156
len: 2097152, device: cpu, numpy: 0.1667, torch: 0.1479
len: 2097152, device: cuda, numpy: 0.2033, torch: 0.0585
len: 4194304, device: cpu, numpy: 0.4680, torch: 0.3354
len: 4194304, device: cuda, numpy: 0.5764, torch: 0.2689
len: 8388608, device: cpu, numpy: 1.2623, torch: 0.9207
len: 8388608, device: cuda, numpy: 1.4132, torch: 0.2697
len: 16777216, device: cpu, numpy: 3.0960, torch: 2.1247
len: 16777216, device: cuda, numpy: 3.4015, torch: 0.2903
len: 33554432, device: cpu, numpy: 6.9723, torch: 4.6462
len: 33554432, device: cuda, numpy: 7.4700, torch: 0.2341
len: 67108864, device: cpu, numpy: 14.5287, torch: 9.9723
len: 67108864, device: cuda, numpy: 15.6430, torch: 0.3678
len: 134217728, device: cpu, numpy: 30.6493, torch: 20.7330
len: 134217728, device: cuda, numpy: 32.3042, torch: 0.4654

@mfbalin
Copy link
Collaborator

mfbalin commented Jan 20, 2024

The reason why we might want to move the shuffling operation to the GPU is as you can see, it starts taking very long as the itemset size grows. That is why, it might be a good idea to have GPU option. Otherwise, GPUs will idle between epochs while shuffling.

@Rhett-Ying Rhett-Ying merged commit 03ca11f into dmlc:master Jan 22, 2024
2 checks passed
@Rhett-Ying Rhett-Ying deleted the gb_itemsampler_seed branch January 22, 2024 06:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants