Fix read_volatile intrinsic #216

Draft · wants to merge 4 commits into main
Conversation

LegNeato (Contributor) commented May 24, 2025

The `read_volatile` intrinsic would always generate PTX to load only a 1-bit integer (`i1`) from the provided pointer, regardless of the actual size of `T`. The alignment specified for the load was also based on the pointer's alignment, not the pointee's alignment.

This commit changes it to:

  1. Load the full `llvm_type_of(T)` instead of `i1`.
  2. Use the correct alignment of `T` for the load instruction.
  3. Remove an incorrect pointer cast that was based on the return ABI of `T` rather than the type of the memory being read.

Possibly fixes #208.
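For context, here is a host-side sketch (plain std Rust, not GPU code) of the contract `read_volatile` must uphold: the load has to cover every byte of `T` at `T`'s alignment, while the buggy `i1` load only recovered the lowest bit. The constant is arbitrary.

```rust
use std::ptr;

fn main() {
    let value: u64 = 0xDEAD_BEEF_CAFE_F00D;
    // A correct read_volatile returns every byte of the u64, not just bit 0.
    let loaded = unsafe { ptr::read_volatile(&value as *const u64) };
    assert_eq!(loaded, 0xDEAD_BEEF_CAFE_F00D);

    // An i1-sized load, as the broken PTX emitted, would only recover the
    // lowest bit; everything above it was being dropped.
    let lowest_bit = value & 1;
    assert_eq!(lowest_bit, 1);
    println!("full load: {loaded:#x}, i1-equivalent: {lowest_bit}");
}
```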

LegNeato (Contributor, Author) commented May 24, 2025

Note I have never touched this code before and we don't have tests, so I am very unsure of this change.

LegNeato force-pushed the fix208 branch 2 times, most recently from ce65c5e to e72a113 on May 24, 2025 15:35
LegNeato marked this pull request as draft on May 24, 2025 15:35
LegNeato force-pushed the fix208 branch 2 times, most recently from d018a62 to 016c8b4 on May 24, 2025 15:59
LegNeato (Contributor, Author):

@brandonros if you could try this fix that would be great!

brandonros:

on it now @LegNeato

thank you for taking a look at this

brandonros:

[screenshot]

Would it help if I made an easier-to-reproduce repo?

I configured https://github.com/brandonros/ed25519-vanity-rs/tree/no-compact to point to your feature branch (hardcoded rev)

I could inline one package if it was easier.

Do you want another ptx compiled output? Any way I could get the "stack trace" on where it is doing this IllegalAddress read exception?

LegNeato (Contributor, Author) commented May 24, 2025

Hmmm, not sure. That repro should be enough, thanks! I don't have more time to poke at it today but hopefully will poke more tomorrow.

brandonros:

jedisct1/rust-ed25519-compact@master...brandonros:rust-ed25519-compact:master

Can you give this a look? I did this as a wild guess I'm afraid.

[screenshot]

Can I do this atomic/fence part better/differently?

I see you are fixing read_volatile. Does write_volatile need the same fix by chance?

brandonros commented May 25, 2025

Ignore that one, that's the "working" library (rust-ed25519-compact) actually. The non-working one is the more popular curve25519-dalek.

Here is the IllegalAddress one:

let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input); // good
let point = curve25519_dalek::constants::ED25519_BASEPOINT_TABLE * &scalar; // good
let compressed_point = point.compress(); // problem in here
/*let public_key_bytes = compressed_point.to_bytes();
public_key_bytes*/

Let me figure out exactly which line in curve25519_dalek is giving us the problem, might give us a clue

brandonros:

Update:

drilling it into 1 specific line from curve25519-dalek library

let mut input = [0u8; 32];
input.copy_from_slice(&hashed_private_key_bytes[0..32]);
let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input);
let point = curve25519_dalek::constants::ED25519_BASEPOINT_TABLE * &scalar;
let recip = point.Z.invert();
let x = &point.X * &recip;
let y = &point.Y * &recip;
let mut s: [u8; 32];
s = y.as_bytes();
s[31] ^= x.is_negative().unwrap_u8() << 7; // <<< it is this line that causes IllegalAddress

will keep digging in

brandonros:

If I convert this

impl FieldElement {
    /// Determine if this `FieldElement` is negative, in the sense
    /// used in the ed25519 paper: `x` is negative if the low bit is
    /// set.
    ///
    /// # Return
    ///
    /// If negative, return `Choice(1)`.  Otherwise, return `Choice(0)`.
    pub fn is_negative(&self) -> Choice {
        let bytes = self.as_bytes();
        (bytes[0] & 1).into()
    }
}

into this:

let x_bytes = x.as_bytes();
let x_is_negative = x_bytes[0] & 1;
s[31] ^= x_is_negative << 7;

one IllegalAddress problem goes away and I get further

I ended up with this... I might be chasing a ghost, or it might be a different issue (like an alignment problem that throws things off further down the stack).

else {
    let mut input = [0u8; 32];
    input.copy_from_slice(&hashed_private_key_bytes[0..32]);
    let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input);
    let point = curve25519_dalek::constants::ED25519_BASEPOINT_TABLE * &scalar;
    let recip = point.Z.invert();
    let x = &point.X * &recip;
    let y = &point.Y * &recip;
    let mut s: [u8; 32];
    s = y.as_bytes();
    let x_bytes = x.as_bytes();
    let x_is_negative = x_bytes[0] & 1;
    s[31] ^= x_is_negative << 7;
    let compressed_point = curve25519_dalek::edwards::CompressedEdwardsY(s);
    let public_key_bytes = compressed_point.to_bytes();
    //public_key_bytes // if you try to return this, IllegalAddress
    [0u8; 32] // if you return this, everything else above it passes?
}

I'll leave it on the branch for you to repro if you want

brandonros:

sorry for the spam, journey continues

the function can return public_key_bytes no problem; it's when you turn around and try to read it that things break. That would make sense given your read_volatile fix, but I'm not quite seeing where anything is marked volatile in this mix. Did you see something that tipped you off to that? Is there some other stale read, potentially?

Otherwise it's like a Drop or something that I've seen when playing with Rust and FFI before. Similar behavior.

impl CompressedEdwardsY {
    /// View this `CompressedEdwardsY` as an array of bytes.
    pub const fn as_bytes(&self) -> &[u8; 32] {
        &self.0
    }

    /// Copy this `CompressedEdwardsY` to an array of bytes.
    pub const fn to_bytes(&self) -> [u8; 32] {
        self.0
    }
}

I used to have to do things like std::mem::ManuallyDrop::new(buf).as_slice().as_ptr(); let's see if I can play with it.

brandonros:

Even trying this

let public_key_bytes = compressed_point.to_bytes();
output.copy_from_slice(&public_key_bytes[0..32]);

let mut public_key_bytes = [0u8; 32];
derrive_public_key(hashed_private_key_bytes, &mut public_key_bytes);

or this

let mut output = [0u8; 32];
output.copy_from_slice(&public_key_bytes[0..32]);
output

does not seem to make it much better

brandonros commented May 25, 2025

    // ed25519 private key -> public key (first 32 bytes only)
    cuda_std::println!("hashed_private_key_bytes: {:02x?}", hashed_private_key_bytes);
    let mut public_key_bytes = [0u8; 32];
    derrive_public_key(hashed_private_key_bytes, &mut public_key_bytes);
    //cuda_std::println!("public_key_bytes: {}", public_key_bytes[0]); // if you uncomment this, IllegalAddress

This is about as simple as I can boil it down to

brandonros:

this was a pretty bad miss on my end...

if the final result is not needed, it's most likely (obviously) dropped by the compiler

let me add a println! in between each line/result and see if that helps
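A host-side sketch of the dead-code elimination at play here: `std::hint::black_box` (stable Rust) is one way to keep an otherwise-unused result alive, which is the same job the `println!` calls do on the GPU side. The `expensive` function is a made-up stand-in for the key derivation.

```rust
use std::hint::black_box;

// Stand-in for the key derivation: without an observer, the optimizer is
// free to delete this whole call when the result goes unused.
fn expensive(x: u64) -> u64 {
    x.wrapping_mul(0x9E37_79B9_7F4A_7C15).rotate_left(17)
}

fn main() {
    let result = expensive(42);
    // black_box tells the optimizer "assume this value is used", forcing
    // the computation to actually happen even with no println!.
    black_box(result);
    println!("result: {result:#x}");
}
```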

brandonros:

// 1
let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input);
cuda_std::println!("scalar: {:?}", scalar);

just this, fine

// 1
let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input);
cuda_std::println!("scalar: {:?}", scalar);

// 2
let point = curve25519_dalek::constants::ED25519_BASEPOINT_TABLE * &scalar;
cuda_std::println!("point: {:?}", point);
Found 1 CUDA devices
Starting device 0
[0] Loading module...
[0] Starting search loop...
hashed_private_key_bytes: [60, 95, b5, 5d, 3b, 22, 10, e2, 4e, ac, 07, 85, 73, f5, 53, 11, eb, 1e, c7, ea, 6b, f8, 03, cf, d3, 84, 22, 63, 60, 00, e7, 6c, 6d, 82, b7, 6f, 03, 46, d8, 9c, 5e, f1, 93, 46, 65, dc, 63, 73, 53, 0e, 7f, f0, 64, a8, 91, f8, 11, 55, cf, be, af, 27, 21, 07]

thread 'main' panicked at src/main.rs:123:32:
called `Result::unwrap()` on an `Err` value: IllegalAddress

by enabling the multiplication... it breaks even the "#1" working step...

will continue to dive in, don't expect you to read all this, hoping it helps

LegNeato (Contributor, Author):

No, the stream of consciousness is good and helps! Thanks for sticking with it. I had originally looked at ED25519_BASEPOINT_TABLE but I think I followed a red herring looking at blackbox, which led me to read_volatile.

brandonros commented May 25, 2025

So I did something a bit strange/foolish to help dive into this:

https://github.com/brandonros/nvptx-llvm-poc

Just to remove Rust-CUDA from the mix, I made a small pipeline of Rust -> LLVM IR -> NVPTX to see if it could shed some light.

llc: error: llc: output/kernel.ll:44:32: error: expected type
  %_8 = getelementptr inbounds nuw i8, ptr %point, i64 80
; Function Attrs: nounwind
define ptx_kernel void @ed25519_poc() unnamed_addr #2 {
start:
  %x_bytes = alloca [32 x i8], align 1
  %_16 = alloca [32 x i8], align 1
  %y = alloca [40 x i8], align 8
  %x = alloca [40 x i8], align 8
  %recip = alloca [40 x i8], align 8
  %point = alloca [160 x i8], align 8
  %scalar = alloca [32 x i8], align 1
  %input = alloca [32 x i8], align 1
  call void @llvm.memset.p0.i64(ptr noundef nonnull align 1 dereferenceable(32) %input, i8 0, i64 32, i1 false)
  call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %scalar)
; call curve25519_dalek::scalar::Scalar::from_bytes_mod_order
  call void @_ZN16curve25519_dalek6scalar6Scalar20from_bytes_mod_order17ha97de3501966aa39E(ptr noalias nocapture noundef nonnull sret([32 x i8]) align 1 dereferenceable(32) %scalar, ptr noalias nocapture noundef nonnull readonly align 1 dereferenceable(32) %input) #8
  call void @llvm.lifetime.start.p0(i64 160, ptr nonnull %point)
  %_4 = load ptr, ptr @_ZN16curve25519_dalek7backend6serial3u649constants23ED25519_BASEPOINT_TABLE17h1495e1294bfbdf04E, align 8, !nonnull !2, !align !3, !noundef !2
; call <&curve25519_dalek$$edwards$$EdwardsBasepointTable$u20$as$u20$core$$ops$$arith$$Mul$LT$$RF$curve25519_dalek$$scalar$$Scalar$GT$$GT$::mul
  call void @"_ZN138_$LT$$RF$curve25519_dalek$$edwards$$EdwardsBasepointTable$u20$as$u20$core$$ops$$arith$$Mul$LT$$RF$curve25519_dalek$$scalar$$Scalar$GT$$GT$3mul17hb25e470272c01c40E"(ptr noalias nocapture noundef nonnull sret([160 x i8]) align 8 dereferenceable(160) %point, ptr noalias noundef nonnull readonly align 8 dereferenceable(30720) %_4, ptr noalias noundef nonnull readonly align 1 dereferenceable(32) %scalar) #8
  call void @llvm.lifetime.start.p0(i64 40, ptr nonnull %recip)
  %_8 = getelementptr inbounds nuw i8, ptr %point, i64 80
; call curve25519_dalek::field::<impl curve25519_dalek$$backend$$serial$$u64$$field$$FieldElement51$GT$::invert
# Build with cargo
pushd kernel
cargo clean
cargo +nightly build --release --target nvptx64-nvidia-cuda
popd

# Find the generated .ll file (cargo puts it in a nested directory)
find kernel/target/nvptx64-nvidia-cuda/release/deps -name "kernel.ll" -exec cp {} output/kernel.ll \;

# Convert .ll to .ptx
llc -march=nvptx64 -mcpu=sm_75 output/kernel.ll -o output/kernel.ptx

Does this help us? I am on a newer version of nightly...

Shall I try this code without the precomputed table, let me see how that goes...

brandonros commented May 25, 2025

This is an LLVM version compatibility issue. The nuw (no unsigned wrap) flag in getelementptr inbounds nuw is from a newer LLVM version than what your llc supports.

Working through it... llvm-18 -> llvm-19

edit: as expected, linker hell... lots of .extern references... :D

LegNeato (Contributor, Author):

I pushed another speculative fix to the same branch if you want to try it. I can't test it as I am on a mac laptop right now / not near an nvidia machine.

brandonros:

will test now

by the way, I spin up a temporary GPU for like $0.15/hr at https://vast.ai in case you didn't already know

not trying to add more headache, I'll test now and report back, thanks

brandonros commented May 25, 2025

no change in behavior, still broken on efd57d0

use curve25519_dalek::backend::serial::u64::field::FieldElement51;
use curve25519_dalek::backend::serial::curve_models::AffineNielsPoint;
use curve25519_dalek::edwards::EdwardsBasepointTable;
use curve25519_dalek::window::LookupTable;

pub static ED25519_BASEPOINT_TABLE: &'static EdwardsBasepointTable =
    &ED25519_BASEPOINT_TABLE_INNER;

pub static ED25519_BASEPOINT_TABLE_INNER: EdwardsBasepointTable = EdwardsBasepointTable([
    LookupTable([
        AffineNielsPoint {
            y_plus_x: FieldElement51::from_limbs([

I moved this into my own code in case we need it. It's super strange that it's a static pointer reference. Does that help us?

I did it so I could add #[cuda_std::address_space(global/constant)] to see if that helps.

brandonros:

Progress:

#[cuda_std::address_space(global)]
pub static ED25519_BASEPOINT_TABLE: &'static EdwardsBasepointTable =
    &ED25519_BASEPOINT_TABLE_INNER;

#[cuda_std::address_space(global)]
pub static ED25519_BASEPOINT_TABLE_INNER: EdwardsBasepointTable = EdwardsBasepointTable([
    LookupTable([
        AffineNielsPoint {
            y_plus_x: FieldElement51::from_limbs([

global: bug goes away

constant: bug stays

desired: no attribute needed, and no bug

Does this give us a clue? Want to see the ptx of the "global" "working" one?

brandonros:

I tested "not this feature branch", just typical main

Now with the #[cuda_std::address_space(global)] attributes on this large strange static, even latest main works (without the changes of this branch)

It would be cool if constant worked, I think, but I'm sure that's lower priority, or I could be misunderstanding whether it is even possible.

brandonros:

Great question! When you had no #[address_space] attribute, the GPU compiler was making automatic decisions about where to place your data, and it was making the wrong choice for your use case.

Default placement without address space attributes. Without explicit annotations, the NVVM (NVIDIA's LLVM) compiler typically places:

- Static data with initializers → constant memory (address space 4)
- Large static data → sometimes global memory, but inconsistently
- Local variables → local memory (address space 5) or registers

What was happening to your Ed25519 tables. They were likely being placed in constant memory by default because:

- They're static variables with compile-time initializers
- The compiler sees them as "constants" that won't change
- Constant memory is the compiler's first choice for read-only static data

Why this caused IllegalAddress errors. Constant memory limitations:

- Size limit: only 64KB of constant memory per GPU
- Access patterns: optimized for all threads reading the same address simultaneously
- Addressing: a different addressing scheme than global memory

Your Ed25519 tables are large:

- Each FieldElement51 is 5 × 8 bytes = 40 bytes
- Each AffineNielsPoint is 3 × 40 bytes = 120 bytes
- Each LookupTable is 8 × 120 bytes = 960 bytes
- The full EdwardsBasepointTable is 32 × 960 bytes = ~30KB for one table

With nested structures on top of that, the compiler was probably trying to put everything in constant memory, quickly exceeding the 64KB limit.

The IllegalAddress error. The GPU hardware throws it when a thread's memory access violates the hardware's memory model rules, e.g. when the access:

- Exceeded constant memory size limits, or
- Used invalid addressing for the memory space, or
- Went through the wrong address space pointer

Why #[address_space(global)] fixed it. By explicitly marking the data for global memory:

- No size limits (well, limited by total GPU VRAM)
- The proper addressing scheme for large data structures
- Accessible by all threads with the right pointer type
- The hardware knows how to handle the memory access patterns

This is a common gotcha in GPU programming: the compiler's "helpful" automatic placement often doesn't match what you actually need for large data structures!

If this is accurate, do we need to add some kind of check to make sure no more than 64KB is placed in constant memory? Is that something we could do easily, or not really?
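As a sanity check on the size arithmetic, here is a hypothetical host-side mirror of the u64-backend types (names borrowed from curve25519-dalek, `#[repr(C)]` layout assumed, padding may differ in the real crate). Note a single table is ~30KB, matching the `dereferenceable(30720)` in the IR dump earlier and fitting under 64KB on its own, so it is the total across all constant-memory statics that matters:

```rust
// Hypothetical mirrors of curve25519-dalek's serial u64 backend types,
// just to check the byte counts quoted above on the host.
#[repr(C)]
struct FieldElement51([u64; 5]); // 5 limbs x 8 bytes = 40

#[allow(dead_code)]
#[repr(C)]
struct AffineNielsPoint {
    y_plus_x: FieldElement51,
    y_minus_x: FieldElement51,
    xy2d: FieldElement51,
} // 3 x 40 = 120

#[repr(C)]
struct LookupTable([AffineNielsPoint; 8]); // 8 x 120 = 960

#[repr(C)]
struct EdwardsBasepointTable([LookupTable; 32]); // 32 x 960 = 30720

const CONSTANT_MEMORY_LIMIT: usize = 64 * 1024; // 64KB, per the discussion

fn main() {
    use std::mem::size_of;
    assert_eq!(size_of::<FieldElement51>(), 40);
    assert_eq!(size_of::<AffineNielsPoint>(), 120);
    assert_eq!(size_of::<LookupTable>(), 960);
    assert_eq!(size_of::<EdwardsBasepointTable>(), 30_720);
    // One table fits in 64KB by itself; the budget blows only once the
    // compiler routes enough statics into constant memory in aggregate.
    assert!(size_of::<EdwardsBasepointTable>() < CONSTANT_MEMORY_LIMIT);
    println!("table size: {} bytes", size_of::<EdwardsBasepointTable>());
}
```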

LegNeato (Contributor, Author) commented May 25, 2025

Yeah, I am aware of vast...just didn't want to shave some yaks.

I'm not sure that explanation is right. Again, not totally sure but I assume LLVM understands what the limits are if they are static. Perhaps they need to be specified, hm. I'm not sure if the limit varies by card (I would expect so). Need to look more.

The code I deleted in the second commit had been forcing things into the AddressSpace::Data address space. Removing it allowed through the (I believe correct) AddressSpace(4), which this static apparently might be too big for.

I'm still getting a hold on this area of the code, so who knows!

LegNeato (Contributor, Author):

From looking around it might indeed be 64KB for all cards. I can't find a definitive source from Nvidia though.

brandonros commented May 25, 2025

The performance is fantastic in global, so that's something I can for sure live with.

One odd thing: https://github.com/dalek-cryptography/curve25519-dalek/compare/main...brandonros:curve25519-dalek:main?expand=1

I tried to clean up things on my end and not copy and paste entire repo forks into mine (did it to make quick edits for us)

This... doesn't work. Same IllegalAddress error.

I need to copy the entire precomputed table into my kernel crate and not use the one of the dependency, despite them both seemingly having this "global" attribute correctly added?

edit: gah, I used the wrong backend, u32 instead of u64... works fine...

brandonros:

tested LegNeato@5bfda9a, did not fix it

if you want mildly shaved yaks: https://github.com/brandonros/ed25519-vanity-rs/blob/master/scripts/vast.sh

LegNeato (Contributor, Author):

I guess I am going to have to really debug and run the code rather than just looking at it 🤣

LegNeato (Contributor, Author):

So, I am an idiot. The previous fix I did didn't put it in the right address space, not sure what I was thinking 😞 . I'm pretty sure it works now: https://github.com/LegNeato/ed25519-vanity-rs

LegNeato (Contributor, Author):

This isn't strictly correct, as really we should be tracking whether all statics put into constant memory together go over the size limit, rather than just checking each one individually. But it is better than what was there.
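The bookkeeping described here could be sketched like this (all names hypothetical, not the actual Rust-CUDA codegen API): keep a running total of bytes placed in constant memory and spill a static to global once it would push the total past the 64KB budget.

```rust
/// Which memory space a static lands in; mirrors the PR discussion, not
/// the real codegen's AddressSpace type.
#[derive(Debug, PartialEq, Clone, Copy)]
enum AddressSpace {
    Constant, // address space 4, 64KB total
    Global,   // address space 1, limited only by VRAM
}

struct ConstantBudget {
    used: usize,
    limit: usize,
}

impl ConstantBudget {
    fn new() -> Self {
        Self { used: 0, limit: 64 * 1024 }
    }

    /// Place one static: constant memory while the cumulative total still
    /// fits, global memory afterwards.
    fn place(&mut self, size: usize) -> AddressSpace {
        if self.used + size <= self.limit {
            self.used += size;
            AddressSpace::Constant
        } else {
            AddressSpace::Global
        }
    }
}

fn main() {
    let mut budget = ConstantBudget::new();
    // Three ~30KB basepoint-table-sized statics: two fit, the third spills.
    assert_eq!(budget.place(30_720), AddressSpace::Constant);
    assert_eq!(budget.place(30_720), AddressSpace::Constant);
    assert_eq!(budget.place(30_720), AddressSpace::Global);
    println!("constant bytes used: {}", budget.used);
}
```

This differs from the per-static threshold the commit actually adds, which spills only individually large statics regardless of what is already in constant memory.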

brandonros:

Can we get just the constant-memory size-limit change on the branch, without the first two changes you weren't sure of? Or do you want to push all 3, i.e. this is the fix and the others are good to include / actual improvements?

LegNeato (Contributor, Author):

I'm going to split the read_volatile changes from the address space changes.

LegNeato added a commit to LegNeato/Rust-CUDA that referenced this pull request May 26, 2025
This isn't fully correct, as ideally we keep track
of what we have put into constant memory and when
it is filled up spill instead of only spilling when
a static is big. But, this is materially better than
what is there.

Fixes Rust-GPU#208.

See also the debugging and discussion in
Rust-GPU#216
LegNeato added a commit to LegNeato/Rust-CUDA that referenced this pull request May 26, 2025
This isn't fully correct, as ideally we keep track
of what we have put into constant memory and when
it is filled up spill instead of only spilling when
a static is big. But, this is materially better than
what is there (which is a runtime error).

An argument can be made to just _always_ use global
memory and we don't have to worry about getting the
packing right.

Fixes Rust-GPU#208.

See also the debugging and discussion in
Rust-GPU#216
LegNeato added a commit to LegNeato/Rust-CUDA that referenced this pull request May 26, 2025
This isn't fully correct, as ideally we keep track
of what we have put into constant memory and when
it is filled up spill instead of only spilling when
a static is big. But, this is materially better than
what is there (which is a runtime error).

An argument can be made to just _always_ use global
memory and we don't have to worry about getting the
packing right.

Fixes Rust-GPU#208.

See also the debugging and discussion in
Rust-GPU#216
LegNeato (Contributor, Author):

@brandonros #217

LegNeato added a commit that referenced this pull request May 26, 2025
* Allow address spaces to propagate to LLVM

This looks like it was code that wasn't deleted after the refactor in
decda87

* Spill large statics from constant to global memory

This isn't fully correct, as ideally we keep track
of what we have put into constant memory and when
it is filled up spill instead of only spilling when
a static is big. But, this is materially better than
what is there (which is a runtime error).

An argument can be made to just _always_ use global
memory and we don't have to worry about getting the
packing right.

Fixes #208.

See also the debugging and discussion in
#216

* Add `--use-constant-memory-space` flag, off by default

* Make it clear that `#[cuda_std::address_space(constant)]` still works
Successfully merging this pull request may close these issues:

curve25519_dalek crate = runtime error