-
Notifications
You must be signed in to change notification settings - Fork 186
Fix read_volatile
intrinsic
#216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Note I have never touched this code before and we don't have tests, so I am very unsure of this change. |
ce65c5e
to
e72a113
Compare
d018a62
to
016c8b4
Compare
It would always generate PTX to load only a 1-bit integer (`i1`) from the provided pointer, regardless of the actual size of `T`. The alignment specified for the load was also based on the pointer's alignment, not the pointee's alignment. This commit changes it to: 1. Load the full `llvm_type_of(T)` instead of `i1`. 2. Use the correct alignment of `T` for the load instruction. 3. Removes an incorrect pointer cast that was based on the return ABI of `T` rather than the type of the memory being read. Possibly fixes Rust-GPU#208.
@brandonros if you could try this fix that would be great! |
on it now @LegNeato thank you for taking a look into this |
![]() Would you benefit if I made an easier to reproduce repo? I configured https://github.com/brandonros/ed25519-vanity-rs/tree/no-compact to point to your feature branch (hardcoded rev) I could inline one package if it was easier. Do you want another ptx compiled output? Any way I could get the "stack trace" on where it is doing this IllegalAddress read exception? |
Hmmm, not sure. That repro should be enough, thanks! I don't have more time to poke at it today but hopefully will poke more tomorrow. |
jedisct1/rust-ed25519-compact@master...brandonros:rust-ed25519-compact:master Can you give this a look? I did this as a wild guess I'm afraid. ![]() Can I do this atomic/fence part better/differently? I see you are fixing |
Ignore that one, that's the "working" library (rust-ed25519-compact) actually. The non-working one is the more popular curve25519-dalek. Here is the IllegalAddress one: let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input); // good
let point = curve25519_dalek::constants::ED25519_BASEPOINT_TABLE * &scalar; // good
let compressed_point = point.compress(); // problem in here
/*let public_key_bytes = compressed_point.to_bytes();
public_key_bytes*/ Let me figure out exactly which line in curve25519_dalek is giving us the problem, might give us a clue |
Update: drilling it into 1 specific line from curve25519-dalek library let mut input = [0u8; 32];
input.copy_from_slice(&hashed_private_key_bytes[0..32]);
let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input);
let point = curve25519_dalek::constants::ED25519_BASEPOINT_TABLE * &scalar;
let recip = point.Z.invert();
let x = &point.X * &recip;
let y = &point.Y * &recip;
let mut s: [u8; 32];
s = y.as_bytes();
s[31] ^= x.is_negative().unwrap_u8() << 7; // <<< it is this line that causes IllegalAddress will keep digging in |
If I convert this impl FieldElement {
/// Determine if this `FieldElement` is negative, in the sense
/// used in the ed25519 paper: `x` is negative if the low bit is
/// set.
///
/// # Return
///
/// If negative, return `Choice(1)`. Otherwise, return `Choice(0)`.
pub fn is_negative(&self) -> Choice {
let bytes = self.as_bytes();
(bytes[0] & 1).into()
} into this: let x_bytes = x.as_bytes();
let x_is_negative = x_bytes[0] & 1;
s[31] ^= x_is_negative << 7; one IllegalAddress problem goes away and I get further I ended up with this... I might be chasing a ghost/it might be a different issue (like some alignment issue or something that throws things off down the stack or something) else {
let mut input = [0u8; 32];
input.copy_from_slice(&hashed_private_key_bytes[0..32]);
let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input);
let point = curve25519_dalek::constants::ED25519_BASEPOINT_TABLE * &scalar;
let recip = point.Z.invert();
let x = &point.X * &recip;
let y = &point.Y * &recip;
let mut s: [u8; 32];
s = y.as_bytes();
let x_bytes = x.as_bytes();
let x_is_negative = x_bytes[0] & 1;
s[31] ^= x_is_negative << 7;
let compressed_point = curve25519_dalek::edwards::CompressedEdwardsY(s);
let public_key_bytes = compressed_point.to_bytes();
//public_key_bytes // if you try to return this, IllegalAddress
[0u8; 32] // if you return this, everything else above it passes?
} I'll leave it on the branch for you to repro if you want |
sorry for the spam, journey continues the function can return the Otherwise it's like a Drop or something that I've seen when playing with Rust and FFI before. Similar behavior.
I used to have do things like this |
Even trying this let public_key_bytes = compressed_point.to_bytes();
output.copy_from_slice(&public_key_bytes[0..32]); let mut public_key_bytes = [0u8; 32];
derrive_public_key(hashed_private_key_bytes, &mut public_key_bytes); or this let mut output = [0u8; 32];
output.copy_from_slice(&public_key_bytes[0..32]);
output does not seem to make it much better |
// ed25519 private key -> public key (first 32 bytes only)
cuda_std::println!("hashed_private_key_bytes: {:02x?}", hashed_private_key_bytes);
let mut public_key_bytes = [0u8; 32];
derrive_public_key(hashed_private_key_bytes, &mut public_key_bytes);
//cuda_std::println!("public_key_bytes: {}", public_key_bytes[0]); // if you uncomment this, IllegalAddress This is about as simple as I can boil it down to |
this was a pretty bad miss on my end... if the final result is not needed, it's most likely (obviously) dropped by the compiler let me add a println! in between each line/result and see if that helps |
just this, fine
by enabling the multiplication... it breaks even the "#1" working step... will continue to dive in, don't expect you to read all this, hoping it helps |
No, the stream of consciousness is good and helps! Thanks for sticking with it. I had originally looked at |
So I did something a bit strange/foolish to help dive into this: https://github.com/brandonros/nvptx-llvm-poc Just to remove Rust-CUDA from the mix, I made a small pipeline of Rust -> LLVM IR -> NVPTX to see if it could shed insight.
# Build with cargo
pushd kernel
cargo clean
cargo +nightly build --release --target nvptx64-nvidia-cuda
popd
# Find the generated .ll file (cargo puts it in a nested directory)
find kernel/target/nvptx64-nvidia-cuda/release/deps -name "kernel.ll" -exec cp {} output/kernel.ll \;
# Convert .ll to .ptx
llc -march=nvptx64 -mcpu=sm_75 output/kernel.ll -o output/kernel.ptx Does this help us? I am on a newer version of nightly... Shall I try this code without the precomputed table, let me see how that goes... |
Working through it... llvm-18 -> llvm-19 edit: as expected, linker hell... lots of |
I pushed another speculative fix to the same branch if you want to try it. I can't test it as I am on a mac laptop right now / not near an nvidia machine. |
will test now by the way, I spin up a temporary GPU for like $0.15/hr at https://vast.ai in case you didn't already know not trying to add more headache, I'll test now and report back, thanks |
no change in behavior, still broken on efd57d0
I moved this into my own code if we need. It's super strange that it's a static pointer reference. Does that help us? I did it so I could add |
Progress:
desired: not need to put an attribute and not have a bug Does this give us a clue? Want to see the ptx of the "global" "working" one? |
I tested "not this feature branch", just typical main Now with the It would be cool if |
If this is accurate, do we need to add some kind of check to make sure no more than 64kb is added to constant? Is that something we could do easily or not really? |
Yeah, I am aware of vast...just didn't want to shave some yaks. I'm not sure that explanation is right. Again, not totally sure but I assume LLVM understands what the limits are if they are static. Perhaps they need to be specified, hm. I'm not sure if the limit varies by card (I would expect so). Need to look more. The code I deleted in the second commit deleted forcing things into the I'm still getting a hold on this area of the code, so who knows! |
From looking around it might indeed be 64kb for all cards. I can't find a definitive source from Nvidia though. |
The performance is fantastic in global so, something I can for sure live with. One odd thing: https://github.com/dalek-cryptography/curve25519-dalek/compare/main...brandonros:curve25519-dalek:main?expand=1 I tried to clean up things on my end and not copy and paste entire repo forks into mine (did it to make quick edits for us) This... doesn't work. Same IllegalAddress error. I need to copy the entire precomputed table into my kernel crate and not use the one of the dependency, despite them both seemingly having this "global" attribute correctly added? edit: gah, I did the wrong u32 backend instead of u64... works fine... |
tested LegNeato@5bfda9a, did not fix it if you want mildly shaved yaks: https://github.com/brandonros/ed25519-vanity-rs/blob/master/scripts/vast.sh |
I guess I am going to have to really debug and run the code rather than just looking at it 🤣 |
So, I am an idiot. The previous fix I did didn't put it in the right address space, not sure what I was thinking 😞 . I'm pretty sure it works now: https://github.com/LegNeato/ed25519-vanity-rs |
This isn't strictly correct, as really we should be keeping track of all statics put into constant memory going over the size rather than just a singular one. But it is better than what was there. |
can we get just the global constant size limit section change on the branch and not the other first two changed you weren't sure of or do you want to push all 3 of this is the fix and the others are good to include / actual improvements? |
I'm going to split the |
This isn't fuly correct, as ideally we keep track of what we have put into constant memory and when it is filled up spill insead of only spilling when a static is big. But, this is materially better than what is there. Fixes Rust-GPU#208. See also the debugging and discussion in Rust-GPU#216
This isn't fully correct, as ideally we keep track of what we have put into constant memory and when it is filled up spill insead of only spilling when a static is big. But, this is materially better than what is there (which is a runtime error). An argument can be made to just _always_ use global memory and we don't have to worry about getting the packing right. Fixes Rust-GPU#208. See also the debugging and discussion in Rust-GPU#216
This isn't fully correct, as ideally we keep track of what we have put into constant memory and when it is filled up spill instdead of only spilling when a static is big. But, this is materially better than what is there (which is a runtime error). An argument can be made to just _always_ use global memory and we don't have to worry about getting the packing right. Fixes Rust-GPU#208. See also the debugging and discussion in Rust-GPU#216
* Allow address spaces to propagate to LLVM This looks like it was code that wasn't deleted after the refactor in decda87 * Spill large statics from constant to global memory This isn't fully correct, as ideally we keep track of what we have put into constant memory and when it is filled up spill instdead of only spilling when a static is big. But, this is materially better than what is there (which is a runtime error). An argument can be made to just _always_ use global memory and we don't have to worry about getting the packing right. Fixes #208. See also the debugging and discussion in #216 * Add `--use-constant-memory-space` flag, off by default * Make it clear that `#[cuda_std::address_space(constant)]` still works
It would always generate PTX to load only a 1-bit integer (
i1
) from the provided pointer, regardless of the actual size ofT
. The alignment specified for the load was also based on the pointer's alignment, not the pointee's alignment.This commit changes it to:
llvm_type_of(T)
instead ofi1
.T
for the load instruction.T
rather than the type of the memory being read.Possibly fixes #208.