Error in python tests/validate_network.py --device privateuseone:1 #70

Open
sukamenev opened this issue Apr 3, 2024 · 5 comments

@sukamenev

Tested on your original code:

Testing  resnet18
Accessing device #1:AMD Radeon R9 Fury Series (radeonsi, fiji, LLVM 17.0.6, DRM 3.54, 6.6.12-calculate) on rusticl
LLVM ERROR: Cannot select: 0x7f3c70430b30: f32 = and 0x7f3c70424cc0, Constant:i32<2147483647>
  0x7f3c70424cc0: f32 = bitcast 0x7f3c7042ae70
    0x7f3c7042ae70: i32 = llvm.amdgcn.wwm TargetConstant:i64<2662>, 0x7f3c70424b00
      0x7f3c70430970: i64 = TargetConstant<2662>
      0x7f3c70424b00: i32 = llvm.amdgcn.readlane TargetConstant:i64<2528>, 0x7f3c7042bc00, Constant:i32<63>
        0x7f3c704254a0: i64 = TargetConstant<2528>
        0x7f3c7042bc00: i32,ch,glue = CopyFromReg # D:1 0x7f3c70425350, Register:i32 %367, 0x7f3c70425350:1
          0x7f3c70424da0: i32 = Register %367
          0x7f3c70425350: ch,glue = inlineasm # D:1 0x7f3c70424e10, TargetExternalSymbol:i64'; 4', MDNode:ch<null>, TargetConstant:i64<1>, TargetConstant:i32<1769482>, Register:i32 %367, TargetConstant:i32<-2147483639>, Register:i32 %368, 0x7f3c70424e10:1
            0x7f3c70424f60: i64 = TargetExternalSymbol'; 4'
            0x7f3c704303c0: i64 = TargetConstant<1>
            0x7f3c70424a20: i32 = TargetConstant<1769482>
            0x7f3c70424da0: i32 = Register %367
            0x7f3c704252e0: i32 = TargetConstant<-2147483639>
            0x7f3c7042b500: i32 = Register %368
            0x7f3c70424e10: ch,glue = CopyToReg # D:1 0x7f3c70430a50:1, Register:i32 %368, 0x7f3c7042b5e0
              0x7f3c7042b500: i32 = Register %368
              0x7f3c7042b5e0: i32 = bitcast # D:1 0x7f3c70424b70
                0x7f3c70424b70: f32 = fadd # D:1 0x7f3c704309e0, 0x7f3c7042b730
                  0x7f3c704309e0: f32 = fadd # D:1 0x7f3c70430200, 0x7f3c70425040


                  0x7f3c7042b730: f32 = bitcast # D:1 0x7f3c7042b0a0

        0x7f3c7042bb20: i32 = Constant<63>
  0x7f3c70430ac0: i32 = Constant<2147483647>
In function: main
Emergency stop
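
For reference, the constant in the failing node, 2147483647, is 0x7FFFFFFF: every bit of a 32-bit word set except the sign bit. ANDing a float's bit pattern with that value is the usual bit-level way to take an absolute value, so the node LLVM cannot select looks like an `and` applied directly to an `f32` value. This is only a reading of the log above, not a confirmed diagnosis. A minimal Python sketch of what the mask does (the helper names are hypothetical):

```python
import struct

# The Constant:i32<2147483647> from the log, i.e. 0x7FFFFFFF.
SIGN_CLEAR_MASK = 2147483647


def float_bits(x: float) -> int:
    """Return the IEEE-754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit unsigned int as an IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def abs_via_mask(x: float) -> float:
    """Clear the sign bit by ANDing the bit pattern with the mask."""
    return bits_to_float(float_bits(x) & SIGN_CLEAR_MASK)


print(hex(SIGN_CLEAR_MASK))  # 0x7fffffff
print(abs_via_mask(-1.5))    # 1.5
```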
@artyom-beilis
Owner

Is it a 32- or 64-bit architecture? We need to track down which kernel fails.
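
One generic way to narrow this down, given that the LLVM error aborts the whole process: run each candidate operation in its own child process and see which one dies. Below is a rough sketch using only the standard library. The case names and snippets are placeholders for illustration, not actual tests from this repository.

```python
import subprocess
import sys


def run_isolated(snippet: str, timeout: float = 60.0) -> int:
    """Run a Python snippet in a child process and return its exit code."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        timeout=timeout,
    )
    return proc.returncode


def first_failure(cases):
    """Return the name of the first case whose child process exits non-zero."""
    for name, snippet in cases:
        if run_isolated(snippet) != 0:
            return name
    return None


# Placeholder cases: the second one simulates a hard abort of the process.
cases = [
    ("ok_case", "print('fine')"),
    ("abort_case", "import os; os.abort()"),
]
print(first_failure(cases))  # abort_case
```

In practice each snippet would exercise a single operator on the target device, so that a compiler abort only takes down the child process and the loop can report which case triggered it.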

@artyom-beilis
Owner

I also suggest trying AMD's official drivers, not only Mesa.

I recall that for the AMD 560 the closed-source drivers worked much better than the Mesa ones. Also check whether the ROCm drivers still work on Fiji; they are also better.

@sukamenev
Author

> Is it a 32- or 64-bit architecture? We need to track down which kernel fails.

My CPU has a 64-bit architecture.
As for the GPU, GCN 3 (Fiji), I don't know what its bit width is.

Quote from AMD docs:

Every instruction is described with either 32 bits or 64 bits of microcode.
• Vector Memory instructions are 64 bits.
• Exports are 64 bits.
• LDS and GDS are 64 bits.
• Scalar ALU instructions are 32 bits but can have an additional 32 bits of literal constant data.
• Vector ALU instructions can be 32 bits or 64 bits. The 32-bit versions can have an additional 32 bits of literal constant data.

@sukamenev
Author

With AMD OpenCL from amdgpu-pro I also get an error:

python tests/validate_network.py --device privateuseone:3
Testing  resnet18
Accessing device #3:Fiji on AMD Accelerated Parallel Processing
Traceback (most recent call last):
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim/tests/validate_network.py", line 280, in <module>
    main(r)
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim/tests/validate_network.py", line 221, in main
    train_on_images(m,batch,args.device,args.eval,iter_size = args.iter_size,opt_steps = args.opt,fwd=args.fwd)
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim/tests/validate_network.py", line 105, in train_on_images
    ref = step(model,data,labels,opt_steps,iter_size,fwd=fwd,test=test)
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim/tests/validate_network.py", line 85, in step
    loss.backward()
  File "/home/inetstar/Kamenev/programming/ZenDnn/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/inetstar/Kamenev/programming/ZenDnn/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: could not create a primitive descriptor iterator

@sukamenev
Author

> I also suggest trying AMD's official drivers, not only Mesa.
>
> I recall that for the AMD 560 the closed-source drivers worked much better than the Mesa ones. Also check whether the ROCm drivers still work on Fiji; they are also better.

Thank you! I got an 8-9% speed improvement with the amdgpu-pro OpenCL drivers.
