🐛[BUG]: RuntimeError: Input type (c10::Half) and bias type (float) should be the same #874
@luke-conibear thank you for reporting. We are aware of issues with CorrDiff checkpoints, and those will be addressed by #871 once it is merged. In the meantime, you can downgrade to the last release, 1.0.1-rc, until we have a fix. @loliverhennigh @jialusui1102 for visibility.
@luke-conibear after discussion with @jialusui1102, it seems your problem is not due to checkpoints (downgrading to 1.0.1-rc should still fix your issue until we resolve this). Could you please detail how you generated the checkpoints that you want to use? Could you also confirm that you are using this config for the generation?
@CharlelieLrt this is not using old checkpoints; these are all new runs and new checkpoints for all steps. Yes, I used that exact config for generation, and I used the default configs from the CorrDiff example.
The exact commands I submitted were:
python train.py --config-name=config_training_hrrr_mini_regression.yaml ++dataset.data_path=${{inputs.data_path}} ++dataset.stats_path=${{inputs.stats_path}} ++training.hp.total_batch_size=256 ++training.hp.batch_size_per_gpu=64 ++training.perf.dataloader_workers=1 ++training.io.checkpoint_dir=${{outputs.checkpoint_dir}} ++hydra.run.dir=${{outputs.output_dir}}
python train.py --config-name=config_training_hrrr_mini_diffusion.yaml ++dataset.data_path=${{inputs.data_path}} ++dataset.stats_path=${{inputs.stats_path}} ++training.hp.total_batch_size=256 ++training.hp.batch_size_per_gpu=64 ++training.perf.dataloader_workers=1 ++training.io.regression_checkpoint_path=${{inputs.regression_checkpoint_path}} ++training.io.checkpoint_dir=${{outputs.checkpoint_dir}} ++hydra.run.dir=${{outputs.output_dir}}
python generate.py --config-name=config_generate_hrrr_mini.yaml ++dataset.data_path=${{inputs.data_path}} ++dataset.stats_path=${{inputs.stats_path}} ++generation.io.reg_ckpt_filename=${{inputs.regression_checkpoint_path}} ++generation.io.res_ckpt_filename=${{inputs.diffusion_checkpoint_path}} ++generation.io.output_filename=${{outputs.output_filename}} ++hydra.run.dir=${{outputs.output_dir}}

The regression run took the same time as before the PR.
@luke-conibear thank you for the details.
This was due to keeping AMP enabled in inference, which shouldn't be the case. It should be fixed in #882. Let me know if you still encounter this issue.
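For anyone who needs an interim workaround before that lands, here is a minimal sketch. This is not an official API; it relies on the `amp_mode` attribute visible in the traceback in this thread, and `net`, `x_hat`, and `img_lr` are placeholders for the loaded regression model and its inputs:

```python
import torch

# Sketch only: force the autocast gate off for inference so activations
# stay float32 and match the float32 biases in the checkpoint.
for module in net.modules():
    if hasattr(module, "amp_mode"):
        module.amp_mode = False

with torch.no_grad():
    image_reg = net(x=x_hat[0:1], img_lr=img_lr)
```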
Regarding the timing: we were not able to reproduce a slowdown. The runtime per forward pass that we measured during training is consistent both with the regression model (since the regression and diffusion models share the same architecture, their forward-pass runtimes should be comparable) and with the pre-PR diffusion model. Could you please share more details about your timing measurements?
@CharlelieLrt Thanks for the quick response. On the timing comment: I was confused in my comparisons, sorry for wasting time there. My mistake was that the previous run used 4 GPUs, while the new run used 2 GPUs, so double the time for half the GPUs makes sense.
@luke-conibear I am not able to reproduce the RuntimeError.
Thanks for your help
The above information is for non-patched diffusion, as I cannot get the patched version to work. I've tried many config/hydra variants, e.g., appending `model=patched_diffusion ++training.hp.patch_shape_x={patch_shape_x} ++training.hp.patch_shape_y={patch_shape_y} ++training.hp.patch_num={patch_num}` to the command, though the logs always show `Patch-based training disabled`.
@luke-conibear thank you for the details, we will try to reproduce your error with the patched version.
I've tried the exact command that you provided, with the commit that you linked, and the patch-based diffusion training works without problem for me. What values did you use for `patch_shape_x` and `patch_shape_y`? Note 1: currently, if the patch shape is not smaller than the image shape, patching is disabled. Note 2: patch-based training is designed for much larger images. It should still work on the HRRR-mini dataset, but it is not the most relevant application of patch-based diffusion.
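For context, here is an illustrative sketch of the kind of guard that produces the `Patch-based training disabled` message. This is not the actual physicsnemo code, and the shapes are example numbers, not HRRR-mini defaults:

```python
# Illustrative only: patching makes sense only when the patch is strictly
# smaller than the full image; otherwise full-image diffusion is used.
def patching_enabled(patch_shape: tuple[int, int], img_shape: tuple[int, int]) -> bool:
    return all(p < s for p, s in zip(patch_shape, img_shape))

print(patching_enabled((128, 128), (128, 128)))  # False -> patching disabled
print(patching_enabled((32, 32), (128, 128)))    # True  -> patching enabled
```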
@CharlelieLrt Okay, great, thanks a lot for the help. Yes, you're right about the patch shape: I used 32 and patched diffusion training works. Generation for patched diffusion then fails with the same error as for non-patched:

/usr/local/lib/python3.12/dist-packages/physicsnemo/utils/filesystem.py:75: SyntaxWarning: invalid escape sequence '\w'
pattern = re.compile(f"{suffix}[\w-]+(/[\w-]+)?/[\w-]+@[A-Za-z0-9.]+/[\w/](.*)")
/usr/local/lib/python3.12/dist-packages/physicsnemo/launch/logging/launch.py:321: SyntaxWarning: invalid escape sequence '\.'
key = re.sub("[^a-zA-Z0-9\.\-\s\/\_]+", "", key)
/usr/local/lib/python3.12/dist-packages/physicsnemo/utils/generative/deterministic_sampler.py:53: SyntaxWarning: invalid escape sequence '\s'
"""
/usr/local/lib/python3.12/dist-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'config_generate_hrrr_mini.yaml': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
warnings.warn(msg, UserWarning)
/usr/local/lib/python3.12/dist-packages/physicsnemo/distributed/manager.py:415: UserWarning: Could not initialize using ENV, SLURM or OPENMPI methods. Assuming this is a single process job
warn(
[2025-05-14 14:05:47,457][generate][INFO] - Using dataset: hrrr_mini
[2025-05-14 14:06:04,172][generate][INFO] - Patch-based training enabled
[2025-05-14 14:06:04,172][generate][INFO] - Loading residual network from "/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_diffusion_checkpoint_path/EDMPrecondSuperResolution.0.8000000.mdlus"...
[2025-05-14 14:06:04,955][generate][INFO] - Loading network from "/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_regression_checkpoint_path/UNet.0.2001920.mdlus"...
[2025-05-14 14:06:05,240][generate][INFO] - Generating images, saving results to /mnt/azureml/cr/j/.../cap/data-capability/wd/output_filename/sample.nc...
[2025-05-14 14:06:06,021][generate][INFO] - starting index: 0
/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/layers.py:701: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with amp.autocast(enabled=self.amp_mode):
Error executing job with overrides: ['generation=patched', '++dataset.data_path=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_data_path/hrrr_mini_train.nc', '++dataset.stats_path=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_stats_path/stats.json', '++generation.io.reg_ckpt_filename=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_regression_checkpoint_path/UNet.0.2001920.mdlus', '++generation.io.res_ckpt_filename=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_diffusion_checkpoint_path/EDMPrecondSuperResolution.0.8000000.mdlus', '++generation.io.output_filename=/mnt/azureml/cr/j/.../cap/data-capability/wd/output_filename/sample.nc', '++generation.has_lead_time=False', '++generation.num_ensembles=2', '++generation.times=[2020-02-02T00:00:00]', '++generation.patch_shape_x=32', '++generation.patch_shape_y=32']
Traceback (most recent call last):
File "/mnt/azureml/cr/j/.../exe/wd/generate.py", line 396, in <module>
main()
File "/usr/local/lib/python3.12/dist-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/usr/local/lib/python3.12/dist-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/azureml/cr/j/.../exe/wd/generate.py", line 350, in main
image_out = generate_fn()
^^^^^^^^^^^^^
File "/mnt/azureml/cr/j/.../exe/wd/generate.py", line 198, in generate_fn
image_reg = regression_step(
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/utils/corrdiff/utils.py", line 84, in regression_step
x = net(x=x_hat[0:1], img_lr=img_lr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/unet.py", line 165, in forward
F_x = self.model(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/song_unet.py", line 703, in forward
return super().forward(x, noise_labels, class_labels, augment_labels)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/song_unet.py", line 450, in forward
x = block(x, emb)
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/layers.py", line 703, in forward
x = self.proj(attn.reshape(*x.shape)).add_(x)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/layers.py", line 285, in forward
x = torch.nn.functional.conv2d(x, w, padding=w_pad, bias=b)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Input type (c10::Half) and bias type (float) should be the same
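For anyone landing here from a search: the failure at the bottom of this traceback is easy to reproduce in isolation, since `conv2d` requires the input, weight, and bias to share a dtype. A standalone sketch follows (it assumes a CUDA device and is unrelated to the physicsnemo code paths):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8, device="cuda", dtype=torch.half)  # half input, as under autocast
w = torch.randn(4, 3, 3, 3, device="cuda", dtype=torch.half)  # half weights
b = torch.randn(4, device="cuda", dtype=torch.float)          # float bias, as in the checkpoint

try:
    F.conv2d(x, w, bias=b, padding=1)
except RuntimeError as e:
    print(e)  # Input type (c10::Half) and bias type (float) should be the same

# Aligning the dtypes (casting the bias down, or the input up) resolves it.
out = F.conv2d(x, w, bias=b.half(), padding=1)
```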
Great to know! We will update the log messages to more clearly explain why patching is disabled in this case. Regarding your runtime error in generation: both issues will be fixed once #885 is merged.
@CharlelieLrt Thanks a lot for the great help here. I confirm this is fixed. |
Version
Latest from main branch
On which installation method(s) does this occur?
Source
Describe the issue
Following this PR, the CorrDiff example fails during generation (see the traceback above).
The weights for both regression and diffusion are new following this PR too.