Fix read_volatile intrinsic #216

Draft · wants to merge 4 commits into main
Conversation

LegNeato (Contributor) commented May 24, 2025

The `read_volatile` intrinsic would always generate PTX to load only a 1-bit integer (`i1`) from the provided pointer, regardless of the actual size of `T`. The alignment specified for the load was also based on the pointer's alignment, not the pointee's alignment.

This commit changes it to:

  1. Load the full `llvm_type_of(T)` instead of `i1`.
  2. Use the correct alignment of `T` for the load instruction.
  3. Remove an incorrect pointer cast that was based on the return ABI of `T` rather than the type of the memory being read.

Possibly fixes #208.
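For context, here is a host-side sketch (plain std Rust, not GPU code) of the contract `read_volatile` must uphold: the load has to cover every byte of `T` at `T`'s alignment, while the buggy `i1` load only recovered the lowest bit. The constant is arbitrary.

```rust
use std::ptr;

fn main() {
    let value: u64 = 0xDEAD_BEEF_CAFE_F00D;
    // A correct read_volatile returns every byte of the u64, not just bit 0.
    let loaded = unsafe { ptr::read_volatile(&value as *const u64) };
    assert_eq!(loaded, 0xDEAD_BEEF_CAFE_F00D);

    // An i1-sized load, as the broken PTX emitted, would only recover the
    // lowest bit; everything above it was being dropped.
    let lowest_bit = value & 1;
    assert_eq!(lowest_bit, 1);
    println!("full load: {loaded:#x}, i1-equivalent: {lowest_bit}");
}
```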

LegNeato (Contributor, Author) commented May 24, 2025

Note I have never touched this code before and we don't have tests, so I am very unsure of this change.

LegNeato force-pushed the fix208 branch 2 times, most recently from ce65c5e to e72a113 on May 24, 2025 15:35
LegNeato marked this pull request as draft on May 24, 2025 15:35
LegNeato force-pushed the fix208 branch 2 times, most recently from d018a62 to 016c8b4 on May 24, 2025 15:59
LegNeato (Contributor, Author):

@brandonros if you could try this fix that would be great!

brandonros:

on it now @LegNeato

thank you for taking a look at this

brandonros:

[screenshot]

Would it help if I made an easier-to-reproduce repo?

I configured https://github.com/brandonros/ed25519-vanity-rs/tree/no-compact to point to your feature branch (hardcoded rev)

I could inline one package if it was easier.

Do you want another ptx compiled output? Any way I could get the "stack trace" on where it is doing this IllegalAddress read exception?

LegNeato (Contributor, Author) commented May 24, 2025

Hmmm, not sure. That repro should be enough, thanks! I don't have more time to poke at it today but hopefully will poke more tomorrow.

brandonros:

jedisct1/rust-ed25519-compact@master...brandonros:rust-ed25519-compact:master

Can you give this a look? I did this as a wild guess I'm afraid.

[screenshot]

Can I do this atomic/fence part better/differently?

I see you are fixing read_volatile. Does write_volatile need the same fix by chance?

brandonros commented May 25, 2025

Ignore that one, that's the "working" library (rust-ed25519-compact) actually. The non-working one is the more popular curve25519-dalek.

Here is the IllegalAddress one:

let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input); // good
let point = curve25519_dalek::constants::ED25519_BASEPOINT_TABLE * &scalar; // good
let compressed_point = point.compress(); // problem in here
/*let public_key_bytes = compressed_point.to_bytes();
public_key_bytes*/

Let me figure out exactly which line in curve25519_dalek is giving us the problem, might give us a clue

brandonros:

Update:

drilling it into 1 specific line from curve25519-dalek library

let mut input = [0u8; 32];
input.copy_from_slice(&hashed_private_key_bytes[0..32]);
let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input);
let point = curve25519_dalek::constants::ED25519_BASEPOINT_TABLE * &scalar;
let recip = point.Z.invert();
let x = &point.X * &recip;
let y = &point.Y * &recip;
let mut s: [u8; 32];
s = y.as_bytes();
s[31] ^= x.is_negative().unwrap_u8() << 7; // <<< it is this line that causes IllegalAddress

will keep digging in

brandonros:

If I convert this

impl FieldElement {
    /// Determine if this `FieldElement` is negative, in the sense
    /// used in the ed25519 paper: `x` is negative if the low bit is
    /// set.
    ///
    /// # Return
    ///
    /// If negative, return `Choice(1)`.  Otherwise, return `Choice(0)`.
    pub fn is_negative(&self) -> Choice {
        let bytes = self.as_bytes();
        (bytes[0] & 1).into()
    }
}

into this:

let x_bytes = x.as_bytes();
let x_is_negative = x_bytes[0] & 1;
s[31] ^= x_is_negative << 7;

one IllegalAddress problem goes away and I get further

I ended up with this... I might be chasing a ghost, or it might be a different issue (like an alignment problem that throws things off further down the stack).

else {
    let mut input = [0u8; 32];
    input.copy_from_slice(&hashed_private_key_bytes[0..32]);
    let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input);
    let point = curve25519_dalek::constants::ED25519_BASEPOINT_TABLE * &scalar;
    let recip = point.Z.invert();
    let x = &point.X * &recip;
    let y = &point.Y * &recip;
    let mut s: [u8; 32];
    s = y.as_bytes();
    let x_bytes = x.as_bytes();
    let x_is_negative = x_bytes[0] & 1;
    s[31] ^= x_is_negative << 7;
    let compressed_point = curve25519_dalek::edwards::CompressedEdwardsY(s);
    let public_key_bytes = compressed_point.to_bytes();
    //public_key_bytes // if you try to return this, IllegalAddress
    [0u8; 32] // if you return this, everything else above it passes?
}

I'll leave it on the branch for you to repro if you want

brandonros:

sorry for the spam, journey continues

the function can return public_key_bytes no problem; it's when you turn around and try to read it that things break. That would make sense given your read_volatile fix, but I'm not quite seeing where anything is marked volatile in this mix. Did you see something that tipped you off to that? Is there some other stale read, potentially?

Otherwise it's like a Drop or something that I've seen when playing with Rust and FFI before. Similar behavior.

impl CompressedEdwardsY {
    /// View this `CompressedEdwardsY` as an array of bytes.
    pub const fn as_bytes(&self) -> &[u8; 32] {
        &self.0
    }

    /// Copy this `CompressedEdwardsY` to an array of bytes.
    pub const fn to_bytes(&self) -> [u8; 32] {
        self.0
    }
}

I used to have to do things like std::mem::ManuallyDrop::new(buf).as_slice().as_ptr(); let's see if I can play with it.

brandonros:

Even trying this

let public_key_bytes = compressed_point.to_bytes();
output.copy_from_slice(&public_key_bytes[0..32]);

let mut public_key_bytes = [0u8; 32];
derrive_public_key(hashed_private_key_bytes, &mut public_key_bytes);

or this

let mut output = [0u8; 32];
output.copy_from_slice(&public_key_bytes[0..32]);
output

does not seem to make it much better

brandonros commented May 25, 2025

    // ed25519 private key -> public key (first 32 bytes only)
    cuda_std::println!("hashed_private_key_bytes: {:02x?}", hashed_private_key_bytes);
    let mut public_key_bytes = [0u8; 32];
    derrive_public_key(hashed_private_key_bytes, &mut public_key_bytes);
    //cuda_std::println!("public_key_bytes: {}", public_key_bytes[0]); // if you uncomment this, IllegalAddress

This is about as simple as I can boil it down to

brandonros:

this was a pretty bad miss on my end...

if the final result is not needed, it's most likely (obviously) dropped by the compiler

let me add a println! in between each line/result and see if that helps
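A host-side sketch of the dead-code elimination at play here: `std::hint::black_box` (stable Rust) is one way to keep an otherwise-unused result alive, which is the same job the `println!` calls do on the GPU side. The `expensive` function is a made-up stand-in for the key derivation.

```rust
use std::hint::black_box;

// Stand-in for the key derivation: without an observer, the optimizer is
// free to delete this whole call when the result goes unused.
fn expensive(x: u64) -> u64 {
    x.wrapping_mul(0x9E37_79B9_7F4A_7C15).rotate_left(17)
}

fn main() {
    let result = expensive(42);
    // black_box tells the optimizer "assume this value is used", forcing
    // the computation to actually happen even with no println!.
    black_box(result);
    println!("result: {result:#x}");
}
```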

brandonros:

// 1
let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input);
cuda_std::println!("scalar: {:?}", scalar);

just this, fine

// 1
let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input);
cuda_std::println!("scalar: {:?}", scalar);

// 2
let point = curve25519_dalek::constants::ED25519_BASEPOINT_TABLE * &scalar;
cuda_std::println!("point: {:?}", point);
Found 1 CUDA devices
Starting device 0
[0] Loading module...
[0] Starting search loop...
hashed_private_key_bytes: [60, 95, b5, 5d, 3b, 22, 10, e2, 4e, ac, 07, 85, 73, f5, 53, 11, eb, 1e, c7, ea, 6b, f8, 03, cf, d3, 84, 22, 63, 60, 00, e7, 6c, 6d, 82, b7, 6f, 03, 46, d8, 9c, 5e, f1, 93, 46, 65, dc, 63, 73, 53, 0e, 7f, f0, 64, a8, 91, f8, 11, 55, cf, be, af, 27, 21, 07]

thread 'main' panicked at src/main.rs:123:32:
called `Result::unwrap()` on an `Err` value: IllegalAddress

by enabling the multiplication... it breaks even the "#1" working step...

will continue to dive in, don't expect you to read all this, hoping it helps

LegNeato (Contributor, Author):

No, the stream of consciousness is good and helps! Thanks for sticking with it. I had originally looked at ED25519_BASEPOINT_TABLE but I think I followed a red herring looking at blackbox, which led me to read_volatile.

brandonros commented May 25, 2025

So I did something a bit strange/foolish to help dive into this:

https://github.com/brandonros/nvptx-llvm-poc

Just to remove Rust-CUDA from the mix, I made a small pipeline of Rust -> LLVM IR -> NVPTX to see if it could shed some light.

llc: error: llc: output/kernel.ll:44:32: error: expected type
  %_8 = getelementptr inbounds nuw i8, ptr %point, i64 80
; Function Attrs: nounwind
define ptx_kernel void @ed25519_poc() unnamed_addr #2 {
start:
  %x_bytes = alloca [32 x i8], align 1
  %_16 = alloca [32 x i8], align 1
  %y = alloca [40 x i8], align 8
  %x = alloca [40 x i8], align 8
  %recip = alloca [40 x i8], align 8
  %point = alloca [160 x i8], align 8
  %scalar = alloca [32 x i8], align 1
  %input = alloca [32 x i8], align 1
  call void @llvm.memset.p0.i64(ptr noundef nonnull align 1 dereferenceable(32) %input, i8 0, i64 32, i1 false)
  call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %scalar)
; call curve25519_dalek::scalar::Scalar::from_bytes_mod_order
  call void @_ZN16curve25519_dalek6scalar6Scalar20from_bytes_mod_order17ha97de3501966aa39E(ptr noalias nocapture noundef nonnull sret([32 x i8]) align 1 dereferenceable(32) %scalar, ptr noalias nocapture noundef nonnull readonly align 1 dereferenceable(32) %input) #8
  call void @llvm.lifetime.start.p0(i64 160, ptr nonnull %point)
  %_4 = load ptr, ptr @_ZN16curve25519_dalek7backend6serial3u649constants23ED25519_BASEPOINT_TABLE17h1495e1294bfbdf04E, align 8, !nonnull !2, !align !3, !noundef !2
; call <&curve25519_dalek$$edwards$$EdwardsBasepointTable$u20$as$u20$core$$ops$$arith$$Mul$LT$$RF$curve25519_dalek$$scalar$$Scalar$GT$$GT$::mul
  call void @"_ZN138_$LT$$RF$curve25519_dalek$$edwards$$EdwardsBasepointTable$u20$as$u20$core$$ops$$arith$$Mul$LT$$RF$curve25519_dalek$$scalar$$Scalar$GT$$GT$3mul17hb25e470272c01c40E"(ptr noalias nocapture noundef nonnull sret([160 x i8]) align 8 dereferenceable(160) %point, ptr noalias noundef nonnull readonly align 8 dereferenceable(30720) %_4, ptr noalias noundef nonnull readonly align 1 dereferenceable(32) %scalar) #8
  call void @llvm.lifetime.start.p0(i64 40, ptr nonnull %recip)
  %_8 = getelementptr inbounds nuw i8, ptr %point, i64 80
; call curve25519_dalek::field::<impl curve25519_dalek$$backend$$serial$$u64$$field$$FieldElement51$GT$::invert
# Build with cargo
pushd kernel
cargo clean
cargo +nightly build --release --target nvptx64-nvidia-cuda
popd

# Find the generated .ll file (cargo puts it in a nested directory)
find kernel/target/nvptx64-nvidia-cuda/release/deps -name "kernel.ll" -exec cp {} output/kernel.ll \;

# Convert .ll to .ptx
llc -march=nvptx64 -mcpu=sm_75 output/kernel.ll -o output/kernel.ptx

Does this help us? I am on a newer version of nightly...

Shall I try this code without the precomputed table, let me see how that goes...

brandonros commented May 25, 2025

This is an LLVM version compatibility issue. The nuw (no unsigned wrap) flag in getelementptr inbounds nuw is from a newer LLVM version than what your llc supports.

Working through it... llvm-18 -> llvm-19

edit: as expected, linker hell... lots of .extern references... :D

LegNeato (Contributor, Author):

I pushed another speculative fix to the same branch if you want to try it. I can't test it as I am on a mac laptop right now / not near an nvidia machine.

brandonros:

will test now

by the way, I spin up a temporary GPU for like $0.15/hr at https://vast.ai in case you didn't already know

not trying to add more headache, I'll test now and report back, thanks

brandonros commented May 25, 2025

no change in behavior, still broken on efd57d0

use curve25519_dalek::backend::serial::u64::field::FieldElement51;
use curve25519_dalek::backend::serial::curve_models::AffineNielsPoint;
use curve25519_dalek::edwards::EdwardsBasepointTable;
use curve25519_dalek::window::LookupTable;

pub static ED25519_BASEPOINT_TABLE: &'static EdwardsBasepointTable =
    &ED25519_BASEPOINT_TABLE_INNER;

pub static ED25519_BASEPOINT_TABLE_INNER: EdwardsBasepointTable = EdwardsBasepointTable([
    LookupTable([
        AffineNielsPoint {
            y_plus_x: FieldElement51::from_limbs([

I moved this into my own code in case we need it. It's super strange that it's a static pointer reference. Does that help us?

I did it so I could add #[cuda_std::address_space(global/constant)] to see if that helps.

brandonros:

Progress:

#[cuda_std::address_space(global)]
pub static ED25519_BASEPOINT_TABLE: &'static EdwardsBasepointTable =
    &ED25519_BASEPOINT_TABLE_INNER;

#[cuda_std::address_space(global)]
pub static ED25519_BASEPOINT_TABLE_INNER: EdwardsBasepointTable = EdwardsBasepointTable([
    LookupTable([
        AffineNielsPoint {
            y_plus_x: FieldElement51::from_limbs([

global: bug goes away

constant: bug stays

desired: no attribute needed, and no bug

Does this give us a clue? Want to see the ptx of the "global" "working" one?

brandonros:

I tested "not this feature branch", just typical main

Now with the #[cuda_std::address_space(global)] attributes on this large strange static, even latest main works (without the changes of this branch)

It would be cool if constant worked, I think, but I'm sure that's lower priority, or I could be misunderstanding whether it is even possible.

brandonros:

Great question! When you had no #[address_space] attribute, the GPU compiler was making automatic decisions about where to place your data, and it was making the wrong choice for your use case.

Default placement without address space attributes. Without explicit annotations, the NVVM (NVIDIA's LLVM) compiler typically places:

- Static data with initializers → constant memory (address space 4)
- Large static data → sometimes global memory, but inconsistently
- Local variables → local memory (address space 5) or registers

What was happening to your Ed25519 tables. They were likely being placed in constant memory by default because:

- They're static variables with compile-time initializers
- The compiler sees them as "constants" that won't change
- Constant memory is the compiler's first choice for read-only static data

Why this caused IllegalAddress errors. Constant memory limitations:

- Size limit: only 64KB of constant memory per GPU
- Access patterns: optimized for all threads reading the same address simultaneously
- Addressing: a different addressing scheme than global memory

Your Ed25519 tables are large:

- Each FieldElement51 is 5 × 8 bytes = 40 bytes
- Each AffineNielsPoint is 3 × 40 bytes = 120 bytes
- Each LookupTable is 8 × 120 bytes = 960 bytes
- The full EdwardsBasepointTable is 32 × 960 bytes = ~30KB for one table

With nested structures on top of that, the compiler was probably trying to put everything in constant memory, quickly exceeding the 64KB limit.

The IllegalAddress error. The GPU hardware throws it when a thread's memory access violates the hardware's memory model rules, e.g. when the access:

- Exceeded constant memory size limits, or
- Used invalid addressing for the memory space, or
- Went through the wrong address space pointer

Why #[address_space(global)] fixed it. By explicitly marking the data for global memory:

- No size limits (well, limited by total GPU VRAM)
- The proper addressing scheme for large data structures
- Accessible by all threads with the right pointer type
- The hardware knows how to handle the memory access patterns

This is a common gotcha in GPU programming: the compiler's "helpful" automatic placement often doesn't match what you actually need for large data structures!

If this is accurate, do we need to add some kind of check to make sure no more than 64KB is placed in constant memory? Is that something we could do easily, or not really?
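As a sanity check on the size arithmetic, here is a hypothetical host-side mirror of the u64-backend types (names borrowed from curve25519-dalek, `#[repr(C)]` layout assumed, padding may differ in the real crate). Note a single table is ~30KB, matching the `dereferenceable(30720)` in the IR dump earlier and fitting under 64KB on its own, so it is the total across all constant-memory statics that matters:

```rust
// Hypothetical mirrors of curve25519-dalek's serial u64 backend types,
// just to check the byte counts quoted above on the host.
#[repr(C)]
struct FieldElement51([u64; 5]); // 5 limbs x 8 bytes = 40

#[allow(dead_code)]
#[repr(C)]
struct AffineNielsPoint {
    y_plus_x: FieldElement51,
    y_minus_x: FieldElement51,
    xy2d: FieldElement51,
} // 3 x 40 = 120

#[repr(C)]
struct LookupTable([AffineNielsPoint; 8]); // 8 x 120 = 960

#[repr(C)]
struct EdwardsBasepointTable([LookupTable; 32]); // 32 x 960 = 30720

const CONSTANT_MEMORY_LIMIT: usize = 64 * 1024; // 64KB, per the discussion

fn main() {
    use std::mem::size_of;
    assert_eq!(size_of::<FieldElement51>(), 40);
    assert_eq!(size_of::<AffineNielsPoint>(), 120);
    assert_eq!(size_of::<LookupTable>(), 960);
    assert_eq!(size_of::<EdwardsBasepointTable>(), 30_720);
    // One table fits in 64KB by itself; the budget blows only once the
    // compiler routes enough statics into constant memory in aggregate.
    assert!(size_of::<EdwardsBasepointTable>() < CONSTANT_MEMORY_LIMIT);
    println!("table size: {} bytes", size_of::<EdwardsBasepointTable>());
}
```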

LegNeato (Contributor, Author) commented May 25, 2025

Yeah, I am aware of vast...just didn't want to shave some yaks.

I'm not sure that explanation is right. Again, not totally sure but I assume LLVM understands what the limits are if they are static. Perhaps they need to be specified, hm. I'm not sure if the limit varies by card (I would expect so). Need to look more.

The code I deleted in the second commit had been forcing things into the AddressSpace::Data address space. Removing it allowed through the (I believe correct) AddressSpace(4), which this static apparently might be too big for.

I'm still getting a hold on this area of the code, so who knows!

LegNeato (Contributor, Author):

From looking around it might indeed be 64KB for all cards. I can't find a definitive source from Nvidia though.

brandonros commented May 25, 2025

The performance is fantastic in global, so that's something I can for sure live with.

One odd thing: https://github.com/dalek-cryptography/curve25519-dalek/compare/main...brandonros:curve25519-dalek:main?expand=1

I tried to clean up things on my end and not copy and paste entire repo forks into mine (did it to make quick edits for us)

This... doesn't work. Same IllegalAddress error.

I need to copy the entire precomputed table into my kernel crate and not use the one of the dependency, despite them both seemingly having this "global" attribute correctly added?

edit: gah, I used the wrong backend, u32 instead of u64... works fine...

brandonros:

tested LegNeato@5bfda9a, did not fix it

if you want mildly shaved yaks: https://github.com/brandonros/ed25519-vanity-rs/blob/master/scripts/vast.sh

LegNeato (Contributor, Author):

I guess I am going to have to really debug and run the code rather than just looking at it 🤣

LegNeato (Contributor, Author):

So, I am an idiot. The previous fix I did didn't put it in the right address space, not sure what I was thinking 😞 . I'm pretty sure it works now: https://github.com/LegNeato/ed25519-vanity-rs

LegNeato (Contributor, Author):

This isn't strictly correct, as really we should be tracking whether all statics put into constant memory together go over the size limit, rather than just checking each one individually. But it is better than what was there.
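The bookkeeping described here could be sketched like this (all names hypothetical, not the actual Rust-CUDA codegen API): keep a running total of bytes placed in constant memory and spill a static to global once it would push the total past the 64KB budget.

```rust
/// Which memory space a static lands in; mirrors the PR discussion, not
/// the real codegen's AddressSpace type.
#[derive(Debug, PartialEq, Clone, Copy)]
enum AddressSpace {
    Constant, // address space 4, 64KB total
    Global,   // address space 1, limited only by VRAM
}

struct ConstantBudget {
    used: usize,
    limit: usize,
}

impl ConstantBudget {
    fn new() -> Self {
        Self { used: 0, limit: 64 * 1024 }
    }

    /// Place one static: constant memory while the cumulative total still
    /// fits, global memory afterwards.
    fn place(&mut self, size: usize) -> AddressSpace {
        if self.used + size <= self.limit {
            self.used += size;
            AddressSpace::Constant
        } else {
            AddressSpace::Global
        }
    }
}

fn main() {
    let mut budget = ConstantBudget::new();
    // Three ~30KB basepoint-table-sized statics: two fit, the third spills.
    assert_eq!(budget.place(30_720), AddressSpace::Constant);
    assert_eq!(budget.place(30_720), AddressSpace::Constant);
    assert_eq!(budget.place(30_720), AddressSpace::Global);
    println!("constant bytes used: {}", budget.used);
}
```

This differs from the per-static threshold the commit actually adds, which spills only individually large statics regardless of what is already in constant memory.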

brandonros:

Can we get just the constant-memory size-limit change on the branch, without the first two changes you weren't sure of? Or do you want to push all 3, i.e. this is the fix and the others are good to include / actual improvements?

LegNeato (Contributor, Author):

I'm going to split the read_volatile changes from the address space changes.

LegNeato added a commit to LegNeato/Rust-CUDA that referenced this pull request May 26, 2025
This isn't fully correct, as ideally we keep track
of what we have put into constant memory and when
it is filled up spill instead of only spilling when
a static is big. But, this is materially better than
what is there.

Fixes Rust-GPU#208.

See also the debugging and discussion in
Rust-GPU#216
LegNeato added a commit to LegNeato/Rust-CUDA that referenced this pull request May 26, 2025
This isn't fully correct, as ideally we keep track
of what we have put into constant memory and when
it is filled up spill instead of only spilling when
a static is big. But, this is materially better than
what is there (which is a runtime error).

An argument can be made to just _always_ use global
memory and we don't have to worry about getting the
packing right.

Fixes Rust-GPU#208.

See also the debugging and discussion in
Rust-GPU#216
LegNeato added a commit to LegNeato/Rust-CUDA that referenced this pull request May 26, 2025
This isn't fully correct, as ideally we keep track
of what we have put into constant memory and when
it is filled up spill instead of only spilling when
a static is big. But, this is materially better than
what is there (which is a runtime error).

An argument can be made to just _always_ use global
memory and we don't have to worry about getting the
packing right.

Fixes Rust-GPU#208.

See also the debugging and discussion in
Rust-GPU#216
LegNeato (Contributor, Author):

@brandonros #217

LegNeato added a commit that referenced this pull request May 26, 2025
* Allow address spaces to propagate to LLVM

This looks like it was code that wasn't deleted after the refactor in
decda87

* Spill large statics from constant to global memory

This isn't fully correct, as ideally we keep track
of what we have put into constant memory and when
it is filled up spill instead of only spilling when
a static is big. But, this is materially better than
what is there (which is a runtime error).

An argument can be made to just _always_ use global
memory and we don't have to worry about getting the
packing right.

Fixes #208.

See also the debugging and discussion in
#216

* Add `--use-constant-memory-space` flag, off by default

* Make it clear that `#[cuda_std::address_space(constant)]` still works
Successfully merging this pull request may close these issues:

curve25519_dalek crate = runtime error