Training equivariant transformer with OptimizedDistance #203

@FranklinHu1

Hello,

I am currently trying to train the equivariant transformer model using the OptimizedDistance module by replacing the call to Distance() with OptimizedDistance() in torchmd-net/torchmdnet/models/torchmd_et.py. I want to train on a system with periodic boundary conditions. However, when I try running the training, I get the following traceback:

Traceback (most recent call last):
  File "/home/frankhu/torchmd-net/torchmdnet/scripts/train.py", line 189, in <module>
    main()
  File "/home/frankhu/torchmd-net/torchmdnet/scripts/train.py", line 137, in main
    model = LNNP(args, prior_model=prior_models, mean=data.mean, std=data.std)
  File "/home/frankhu/torchmd-net/torchmdnet/module.py", line 29, in __init__
    self.model = create_model(self.hparams, prior_model, mean, std)
  File "/home/frankhu/torchmd-net/torchmdnet/models/model.py", line 70, in create_model
    representation_model = TorchMD_ET(
  File "/home/frankhu/torchmd-net/torchmdnet/models/torchmd_et.py", line 118, in __init__
    self.distance = OptimizedDistance(
  File "/home/frankhu/torchmd-net/torchmdnet/models/utils.py", line 199, in __init__
    from torchmdnet.neighbors import get_neighbor_pairs_kernel
  File "/home/frankhu/torchmd-net/torchmdnet/neighbors/__init__.py", line 15, in <module>
    compile_extension()
  File "/home/frankhu/torchmd-net/torchmdnet/neighbors/__init__.py", line 11, in compile_extension
    cpp_extension.load(
  File "/home/frankhu/mambaforge/envs/torchmd-net/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1269, in load
    return _jit_compile(
  File "/home/frankhu/mambaforge/envs/torchmd-net/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1453, in _jit_compile
    version = JIT_EXTENSION_VERSIONER.bump_version_if_changed(
  File "/home/frankhu/mambaforge/envs/torchmd-net/lib/python3.10/site-packages/torch/utils/_cpp_extension_versioner.py", line 45, in bump_version_if_changed
    hash_value = hash_source_files(hash_value, source_files)
  File "/home/frankhu/mambaforge/envs/torchmd-net/lib/python3.10/site-packages/torch/utils/_cpp_extension_versioner.py", line 15, in hash_source_files
    with open(filename) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/home/frankhu/torchmd-net/torchmdnet/neighbors/backwards.cu'
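
The traceback fails inside the JIT extension loader while it hashes its source files, before any training code runs. A minimal sketch for checking this outside of training is below. It assumes only that the sources live in torchmdnet/neighbors, as the traceback shows. The helper name `find_missing_sources` is hypothetical and not part of torchmd-net, and the file list contains only the one name reported in the error.

```python
from pathlib import Path


def find_missing_sources(source_dir, filenames):
    """Return the entries of `filenames` that are not regular files in `source_dir`."""
    base = Path(source_dir)
    return [name for name in filenames if not (base / name).is_file()]


# Path is relative to the repository root; adjust it to your checkout.
# Only the file named in the FileNotFoundError is listed here.
missing = find_missing_sources("torchmdnet/neighbors", ["backwards.cu"])
print("missing source files:", missing)
```

If the file is reported missing, the source list passed to `cpp_extension.load` in torchmdnet/neighbors/__init__.py and the files on disk are likely out of sync, for example because of a partially updated checkout.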

I saw that backwards.cu was removed in a previous commit, but the extension loader in torchmdnet/neighbors/__init__.py still seems to expect it, so model construction fails before training can start. For reference, here is the change I made within torchmd_et.py:

self.distance = OptimizedDistance(
    cutoff_lower,
    cutoff_upper,
    max_num_pairs=-max_num_neighbors,
    return_vecs=False,
    loop=False,
    strategy='brute',
    include_transpose=True,
    resize_to_fit=False,
    check_errors=False,
    box=torch.diag(torch.tensor(pbc_box))
)
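
To make the `box` argument concrete: `torch.diag(torch.tensor(pbc_box))` turns three edge lengths into a 3x3 matrix with those lengths on the diagonal, which describes an orthorhombic cell. The sketch below mirrors that in plain Python so it runs without torch. The helper names and the example lengths are hypothetical. The check that each edge is at least twice the upper cutoff is the usual minimum-image requirement; whether OptimizedDistance enforces it this way is an assumption here.

```python
def diagonal_box(lengths):
    """Build a 3x3 box matrix (list of rows) with the edge lengths on the diagonal."""
    if len(lengths) != 3:
        raise ValueError("expected three box edge lengths")
    return [
        [float(lengths[i]) if i == j else 0.0 for j in range(3)]
        for i in range(3)
    ]


def cutoff_fits_box(box, cutoff_upper):
    """Return True if every diagonal entry is at least twice the upper cutoff.

    This is the standard minimum-image condition, assumed here rather than
    taken from the torchmd-net source.
    """
    return all(box[i][i] >= 2.0 * cutoff_upper for i in range(3))


box = diagonal_box([30.0, 30.0, 30.0])  # hypothetical edge lengths
print(box)
print(cutoff_fits_box(box, cutoff_upper=5.0))
```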

I am running on a single NVIDIA H100 GPU. Any help or clarification would be greatly appreciated.

Thank you!
