
🐛[BUG]: RuntimeError: Input type (c10::Half) and bias type (float) should be the same #874

Closed
luke-conibear opened this issue May 7, 2025 · 13 comments · Fixed by #885
Assignees
Labels
2 - In Progress Currently a work in progress bug Something isn't working

Comments

@luke-conibear

luke-conibear commented May 7, 2025

Version

Latest from main branch

On which installation method(s) does this occur?

Source

Describe the issue

Following this PR, the CorrDiff example has an error in the generation (see traceback below).

The weights for both regression and diffusion are new following this PR too.

[2025-05-06 17:09:31,605][generate][INFO] - Using dataset: hrrr_mini
[2025-05-06 17:09:48,205][generate][INFO] - Patch-based training disabled
[2025-05-06 17:09:48,205][generate][INFO] - Loading residual network from "/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_diffusion_checkpoint_path/EDMPrecondSuperResolution.0.8000000.mdlus"...
[2025-05-06 17:09:49,114][generate][INFO] - Loading network from "/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_regression_checkpoint_path/UNet.0.2000128.mdlus"...
[2025-05-06 17:09:49,426][generate][INFO] - Generating images, saving results to /mnt/azureml/cr/j/.../cap/data-capability/wd/output_filename/sample.nc...
[2025-05-06 17:09:50,195][generate][INFO] - starting index: 0
/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/layers.py:701: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with amp.autocast(enabled=self.amp_mode):
Error executing job with overrides: ['++dataset.data_path=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_data_path/hrrr_mini_train.nc', '++dataset.stats_path=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_stats_path/stats.json', '++generation.io.reg_ckpt_filename=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_regression_checkpoint_path/UNet.0.2000128.mdlus', '++generation.io.res_ckpt_filename=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_diffusion_checkpoint_path/EDMPrecondSuperResolution.0.8000000.mdlus', '++generation.io.output_filename=/mnt/azureml/cr/j/.../cap/data-capability/wd/output_filename/sample.nc']
Traceback (most recent call last):
  File "/mnt/azureml/cr/j/.../exe/wd/generate.py", line 390, in <module>
    main()
  File "/usr/local/lib/python3.12/dist-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
            ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
        ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.12/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/azureml/cr/j/.../exe/wd/generate.py", line 344, in main
    image_out = generate_fn()
                ^^^^^^^^^^^^^
  File "/mnt/azureml/cr/j/.../exe/wd/generate.py", line 192, in generate_fn
    image_reg = regression_step(
                ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/utils/corrdiff/utils.py", line 84, in regression_step
    x = net(x=x_hat[0:1], img_lr=img_lr)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/unet.py", line 165, in forward
    F_x = self.model(
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/song_unet.py", line 703, in forward
    return super().forward(x, noise_labels, class_labels, augment_labels)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/song_unet.py", line 450, in forward
    x = block(x, emb)
        ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/layers.py", line 703, in forward
    x = self.proj(attn.reshape(*x.shape)).add_(x)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/layers.py", line 285, in forward
    x = torch.nn.functional.conv2d(x, w, padding=w_pad, bias=b)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Input type (c10::Half) and bias type (float) should be the same

Minimum reproducible example

Default CorrDiff example
@luke-conibear luke-conibear added bug Something isn't working ? - Needs Triage Need team to review and classify labels May 7, 2025
whn09 pushed a commit to whn09/physicsnemo that referenced this issue May 7, 2025
@whn09

whn09 commented May 7, 2025

Removing the line with amp.autocast(enabled=self.amp_mode): in layers.py solves this problem, though it may not be the best solution.
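
The failing call at the bottom of the traceback can be reproduced in isolation. The following is a minimal sketch, assuming only that the convolution receives a half-precision input (with a matching weight) while its bias is still float32. The tensor shapes and names are illustrative and are not taken from layers.py.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only; not the ones used in CorrDiff.
x = torch.randn(1, 4, 8, 8)   # input activations (float32)
w = torch.randn(4, 4, 1, 1)   # 1x1 convolution weight (float32)
b = torch.randn(4)            # bias, kept in float32


def conv_fails_on_dtype_mismatch() -> bool:
    """Return True if conv2d rejects a half input combined with a float32 bias."""
    try:
        # The weight follows the input dtype, the bias does not.
        F.conv2d(x.half(), w.half(), bias=b)
    except RuntimeError as err:
        print(f"RuntimeError: {err}")
        return True
    return False


mismatch_raised = conv_fails_on_dtype_mismatch()

# Keeping input, weight, and bias in float32 avoids the mismatch.
y = F.conv2d(x, w, bias=b)
print(mismatch_raised, y.dtype, tuple(y.shape))
```

This only shows how the dtype mismatch arises at the conv2d call; it says nothing about which upstream layer produced the half-precision tensor in the actual model.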

@CharlelieLrt CharlelieLrt self-assigned this May 7, 2025
@CharlelieLrt CharlelieLrt added 2 - In Progress Currently a work in progress and removed ? - Needs Triage Need team to review and classify labels May 7, 2025
@CharlelieLrt
Collaborator

CharlelieLrt commented May 7, 2025

@luke-conibear thank you for reporting. We are aware of issues with CorrDiff checkpoints, and those will be addressed by #871 once it is merged. For the time being, you can downgrade to the last release 1.0.1-rc until we have a fix.

@loliverhennigh @jialusui1102 for viz

@CharlelieLrt
Collaborator

CharlelieLrt commented May 7, 2025

@luke-conibear after discussion with @jialusui1102 it seems your problem is not due to checkpoints (downgrading to 1.0.1-rc should still fix your issue until we resolve this).

Could you please detail how you generated the checkpoints that you want to use in generate.py? Were they trained with the latest train.py, and which config file did you use?

Could you also confirm that you are using this config for generate.py, and whether you modified anything there?

@luke-conibear
Author

@CharlelieLrt this is not using old checkpoints. It is all new runs and checkpoints for all steps.

Yes, I used that exact config for generation. I used the default configs from the main branch without any changes.

The exact commands I submitted were:

  • Regression
python train.py --config-name=config_training_hrrr_mini_regression.yaml ++dataset.data_path=${{inputs.data_path}} ++dataset.stats_path=${{inputs.stats_path}} ++training.hp.total_batch_size=256 ++training.hp.batch_size_per_gpu=64 ++training.perf.dataloader_workers=1 ++training.io.checkpoint_dir=${{outputs.checkpoint_dir}} ++hydra.run.dir=${{outputs.output_dir}}
  • Diffusion
python train.py --config-name=config_training_hrrr_mini_diffusion.yaml ++dataset.data_path=${{inputs.data_path}} ++dataset.stats_path=${{inputs.stats_path}} ++training.hp.total_batch_size=256 ++training.hp.batch_size_per_gpu=64 ++training.perf.dataloader_workers=1 ++training.io.regression_checkpoint_path=${{inputs.regression_checkpoint_path}} ++training.io.checkpoint_dir=${{outputs.checkpoint_dir}} ++hydra.run.dir=${{outputs.output_dir}}
  • Generation
python generate.py --config-name=config_generate_hrrr_mini.yaml ++dataset.data_path=${{inputs.data_path}} ++dataset.stats_path=${{inputs.stats_path}} ++generation.io.reg_ckpt_filename=${{inputs.regression_checkpoint_path}} ++generation.io.res_ckpt_filename=${{inputs.diffusion_checkpoint_path}} ++generation.io.output_filename=${{outputs.output_filename}} ++hydra.run.dir=${{outputs.output_dir}}

Regression runs okay, in the same time as before the PR.
Diffusion runs, though the non-patched version now takes double the time to complete.
Generation has the error above.

@CharlelieLrt
Collaborator

CharlelieLrt commented May 9, 2025

@luke-conibear thank you for the details.

Generation has the error above.

This was due to keeping AMP enabled in inference, which shouldn't be the case. It should be fixed in #882. Let me know if you still encounter this issue.
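
To illustrate why leaving AMP enabled at inference changes activation dtypes, here is a minimal sketch using torch.autocast with its enabled flag. It runs on CPU with bfloat16 purely so that it works without a GPU; the issue itself involves CUDA and half precision. The layer and input are placeholders, not CorrDiff components.

```python
import torch

layer = torch.nn.Linear(4, 4)   # placeholder layer with float32 parameters
x = torch.randn(2, 4)           # float32 input


def output_dtype(amp_enabled: bool) -> torch.dtype:
    """Run the layer under autocast and report the dtype of its output."""
    with torch.no_grad(), torch.autocast(
        device_type="cpu", dtype=torch.bfloat16, enabled=amp_enabled
    ):
        return layer(x).dtype


dtype_amp_on = output_dtype(True)    # autocast active: reduced-precision output
dtype_amp_off = output_dtype(False)  # autocast disabled: output stays float32
print(dtype_amp_on, dtype_amp_off)
```

With autocast disabled, the output keeps the dtype of the float32 parameters, so a later layer with a float32 bias sees matching types.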

Diffusion runs, though the non-patched version now takes double the time to complete.

We were not able to reproduce this. At least the runtime per forward pass that we measured during training is consistent with both the regression model (since both regression and diffusion models share the same architecture, their forward pass runtimes should be comparable), and the diffusion model pre-PR.

Could you please share these details:

  • Are you referring to overall runtime or runtime per iteration, or only the forward pass?
  • Which commit are you using as a reference in your "double the time" comparison?
  • Are you using DDP, and if so how many GPUs are you using?

@luke-conibear
Author

@CharlelieLrt Thanks for the quick response.
Unfortunately, yes the generation RuntimeError issue is still there.


For the timing comment, I was confused in my comparisons. Sorry for wasting time there.

My mistake was that the previous run used 4 GPUs, while the new run used 2 GPUs. So double the time for half the GPUs makes sense.
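
The arithmetic behind this can be sketched as follows, under the assumption (not confirmed in this thread) that the training script accumulates gradients over micro-batches until total_batch_size samples have been processed per optimizer step. The batch sizes are the ones from the commands above; the function name is hypothetical.

```python
def accumulation_rounds(total_batch_size: int, batch_size_per_gpu: int, num_gpus: int) -> int:
    """Sequential micro-batches each GPU processes per optimizer step (assumed scheme)."""
    per_round = batch_size_per_gpu * num_gpus
    if total_batch_size % per_round != 0:
        raise ValueError("total_batch_size must be divisible by batch_size_per_gpu * num_gpus")
    return total_batch_size // per_round


rounds_4_gpus = accumulation_rounds(256, 64, 4)
rounds_2_gpus = accumulation_rounds(256, 64, 2)
print(rounds_4_gpus, rounds_2_gpus)  # prints: 1 2
```

Under that assumption, halving the GPU count doubles the sequential work per step, which is consistent with the observed doubling in runtime.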

@CharlelieLrt
Collaborator

CharlelieLrt commented May 13, 2025

@luke-conibear I am not able to reproduce the RuntimeError with the latest commit. To help me troubleshoot this, could you please:

  • Give me the commit hash that you are using for the entire pipeline (i.e. regression training, diffusion training, and generation). Please make sure that you use the same commit for all of them.
  • Give me the command that you use to run the train.py for both regression and diffusion training, including the config and any hydra override.
  • The number and type of GPUs that you are using for training regression and diffusion.
  • The command that you use to run the generate.py, including the config and any hydra override.
  • The number and types of GPUs that you are using for generation.

@luke-conibear
Author

Thanks for your help

  • I used this recent commit for all steps
  • Commands
    # Regression
    torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py --config-name=config_training_hrrr_mini_regression.yaml model=regression ++dataset.data_path=${{inputs.data_path}} ++dataset.stats_path=${{inputs.stats_path}} ++training.hp.total_batch_size=2560 ++training.hp.batch_size_per_gpu=640 ++training.perf.dataloader_workers=1 ++training.io.checkpoint_dir=${{outputs.checkpoint_dir}} ++hydra.run.dir=${{outputs.output_dir}}
    
    # Diffusion
    torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py --config-name=config_training_hrrr_mini_diffusion.yaml model=diffusion ++dataset.data_path=${{inputs.data_path}} ++dataset.stats_path=${{inputs.stats_path}} ++training.hp.total_batch_size=2560 ++training.hp.batch_size_per_gpu=640 ++training.perf.dataloader_workers=1 ++training.io.regression_checkpoint_path=${{inputs.regression_checkpoint_path}} ++training.io.checkpoint_dir=${{outputs.checkpoint_dir}} ++hydra.run.dir=${{outputs.output_dir}}
    
    # Generation
    python generate.py --config-name=config_generate_hrrr_mini.yaml generation=non_patched ++dataset.data_path=${{inputs.data_path}} ++dataset.stats_path=${{inputs.stats_path}} ++generation.io.reg_ckpt_filename=${{inputs.regression_checkpoint_path}} ++generation.io.res_ckpt_filename=${{inputs.diffusion_checkpoint_path}} ++generation.io.output_filename=${{outputs.output_filename}} ++hydra.run.dir=${{outputs.output_dir}} ++generation.has_lead_time=False ++generation.num_ensembles=2 ++generation.times=['2020-02-02T00:00:00']
  • Configs are default ones without any changes
  • All steps use Standard_NC80adis_H100_v5 on Azure ML. 2x GPUs for regression and diffusion. 1x GPU for generation.

@luke-conibear
Author

The above information is for non-patched diffusion, as I cannot get the patched version to work.

I've tried many config/hydra variants e.g., appending to the command

f"model=patched_diffusion ++training.hp.patch_shape_x={patch_shape_x} ++training.hp.patch_shape_y={patch_shape_y} ++training.hp.patch_num={patch_num} "

Though I always get this in the logs:

Patch-based training disabled

@CharlelieLrt
Collaborator

CharlelieLrt commented May 13, 2025

@luke-conibear thank you for the details, we will try to reproduce your error with the generate.py.

The above information is for non-patched diffusion, as I cannot get the patched version to work.

I've tried the exact command that you provided with the commit that you linked, and the patch-based diffusion training works without problems for me. What values did you use for patch_shape_x and patch_shape_y? I suspect that you used values >= 64. FYI, the HRRR-mini dataset has images that are 64x64, so if you request patches that are greater than or equal to 64, the patched training will be automatically disabled.

Note 1: currently patch_shape_x and patch_shape_y also need to be multiples of 32, so the only option to have patch-based diffusion training on HRRR-mini is to set patch_shape_x = 32 and patch_shape_y = 32.

Note 2: patch-based training is designed for much larger images. It should still work on the HRRR-mini dataset, but it is not the most relevant application of patch-based diffusion.
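
The two constraints above (patch size strictly smaller than the image, and a multiple of 32) can be sketched as a small helper. This is an illustration for square images and square patches only; the function name is hypothetical and is not part of physicsnemo.

```python
def valid_patch_sizes(img_size: int, multiple: int = 32) -> list[int]:
    """Patch sizes that are multiples of `multiple` and strictly smaller than the image."""
    return list(range(multiple, img_size, multiple))


hrrr_mini_sizes = valid_patch_sizes(64)    # HRRR-mini images are 64x64
larger_sizes = valid_patch_sizes(128)      # hypothetical larger image, for comparison
print(hrrr_mini_sizes, larger_sizes)       # prints: [32] [32, 64, 96]
```

For a 64x64 image the list contains only 32, which is why that is the single patch size that keeps patching enabled on HRRR-mini.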

@luke-conibear
Copy link
Author

@CharlelieLrt Okay, great, thanks a lot for the help.

Yes, you're right about the patch shape. I used 32 and patched diffusion works.

Then generation for patched diffusion has the same error as for non-patched.
Traceback below:

/usr/local/lib/python3.12/dist-packages/physicsnemo/utils/filesystem.py:75: SyntaxWarning: invalid escape sequence '\w'
  pattern = re.compile(f"{suffix}[\w-]+(/[\w-]+)?/[\w-]+@[A-Za-z0-9.]+/[\w/](.*)")
/usr/local/lib/python3.12/dist-packages/physicsnemo/launch/logging/launch.py:321: SyntaxWarning: invalid escape sequence '\.'
  key = re.sub("[^a-zA-Z0-9\.\-\s\/\_]+", "", key)
/usr/local/lib/python3.12/dist-packages/physicsnemo/utils/generative/deterministic_sampler.py:53: SyntaxWarning: invalid escape sequence '\s'
  """
/usr/local/lib/python3.12/dist-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'config_generate_hrrr_mini.yaml': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
/usr/local/lib/python3.12/dist-packages/physicsnemo/distributed/manager.py:415: UserWarning: Could not initialize using ENV, SLURM or OPENMPI methods. Assuming this is a single process job
  warn(
[2025-05-14 14:05:47,457][generate][INFO] - Using dataset: hrrr_mini
[2025-05-14 14:06:04,172][generate][INFO] - Patch-based training enabled
[2025-05-14 14:06:04,172][generate][INFO] - Loading residual network from "/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_diffusion_checkpoint_path/EDMPrecondSuperResolution.0.8000000.mdlus"...
[2025-05-14 14:06:04,955][generate][INFO] - Loading network from "/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_regression_checkpoint_path/UNet.0.2001920.mdlus"...
[2025-05-14 14:06:05,240][generate][INFO] - Generating images, saving results to /mnt/azureml/cr/j/.../cap/data-capability/wd/output_filename/sample.nc...
[2025-05-14 14:06:06,021][generate][INFO] - starting index: 0
/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/layers.py:701: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with amp.autocast(enabled=self.amp_mode):
Error executing job with overrides: ['generation=patched', '++dataset.data_path=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_data_path/hrrr_mini_train.nc', '++dataset.stats_path=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_stats_path/stats.json', '++generation.io.reg_ckpt_filename=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_regression_checkpoint_path/UNet.0.2001920.mdlus', '++generation.io.res_ckpt_filename=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_diffusion_checkpoint_path/EDMPrecondSuperResolution.0.8000000.mdlus', '++generation.io.output_filename=/mnt/azureml/cr/j/.../cap/data-capability/wd/output_filename/sample.nc', '++generation.has_lead_time=False', '++generation.num_ensembles=2', '++generation.times=[2020-02-02T00:00:00]', '++generation.patch_shape_x=32', '++generation.patch_shape_y=32']
Traceback (most recent call last):
  File "/mnt/azureml/cr/j/.../exe/wd/generate.py", line 396, in <module>
    main()
  File "/usr/local/lib/python3.12/dist-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
            ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
        ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.12/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/azureml/cr/j/.../exe/wd/generate.py", line 350, in main
    image_out = generate_fn()
                ^^^^^^^^^^^^^
  File "/mnt/azureml/cr/j/.../exe/wd/generate.py", line 198, in generate_fn
    image_reg = regression_step(
                ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/utils/corrdiff/utils.py", line 84, in regression_step
    x = net(x=x_hat[0:1], img_lr=img_lr)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/unet.py", line 165, in forward
    F_x = self.model(
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/song_unet.py", line 703, in forward
    return super().forward(x, noise_labels, class_labels, augment_labels)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/song_unet.py", line 450, in forward
    x = block(x, emb)
        ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/layers.py", line 703, in forward
    x = self.proj(attn.reshape(*x.shape)).add_(x)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/layers.py", line 285, in forward
    x = torch.nn.functional.conv2d(x, w, padding=w_pad, bias=b)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Input type (c10::Half) and bias type (float) should be the same

@CharlelieLrt
Collaborator

Yes, you're right about the patch shape. I used 32 and patched diffusion works.

Great to know! We will update the log messages to more clearly explain why patching is disabled in this case.

Regarding your runtime error in generate.py, @jialusui1102 identified the source of the problem (we were not properly disabling AMP in the models).
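
As a rough illustration of what disabling AMP across a model could look like, here is a minimal sketch that clears an amp_mode flag on every submodule that carries one. The attribute name amp_mode appears in the traceback above; the Block class and the disable_amp helper are hypothetical stand-ins and are not the change made in #885.

```python
import torch


class Block(torch.nn.Module):
    """Stand-in for a layer that carries an amp_mode flag (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.amp_mode = True
        self.proj = torch.nn.Linear(4, 4)


def disable_amp(model: torch.nn.Module) -> int:
    """Set amp_mode=False on every submodule that has it enabled; return the count changed."""
    changed = 0
    for module in model.modules():
        if getattr(module, "amp_mode", False):
            module.amp_mode = False
            changed += 1
    return changed


model = torch.nn.Sequential(Block(), torch.nn.ReLU(), Block())
num_changed = disable_amp(model)
remaining = [m for m in model.modules() if getattr(m, "amp_mode", False)]
print(num_changed, len(remaining))  # prints: 2 0
```

Walking model.modules() reaches nested layers as well, so a flag set deep inside the network is not missed when switching to inference.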

Both will be fixed once #885 is merged.

@luke-conibear
Author

@CharlelieLrt Thanks a lot for the great help here. I confirm this is fixed.

3 participants