Skip to content

refactor: replace byte-level operations with word-level operations #692

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

AnarchistHoneybun
Copy link
Contributor

finally got to this, I'm sorry for being so late ;-;
pre-refactor bench results:


running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s


running 8 tests
test kupyna256_10    ... bench:         364.55 ns/iter (+/- 31.58) = 27 MB/s
test kupyna256_100   ... bench:       5,095.60 ns/iter (+/- 636.62) = 19 MB/s
test kupyna256_1000  ... bench:      49,466.16 ns/iter (+/- 6,330.77) = 20 MB/s
test kupyna256_10000 ... bench:     478,002.24 ns/iter (+/- 59,204.41) = 20 MB/s
test kupyna512_10    ... bench:         659.55 ns/iter (+/- 78.15) = 15 MB/s
test kupyna512_100   ... bench:       9,032.33 ns/iter (+/- 3,979.37) = 11 MB/s
test kupyna512_1000  ... bench:      67,634.22 ns/iter (+/- 8,083.88) = 14 MB/s
test kupyna512_10000 ... bench:     911,271.90 ns/iter (+/- 268,785.03) = 10 MB/s

test result: ok. 0 passed; 0 failed; 0 ignored; 8 measured; 0 filtered out; finished in 106.87s

post-refactor bench results:

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/mod.rs (/home/vrin/rcrypto_hashes_contrib/target/release/deps/mod-fccbd874730b7904)

running 8 tests
test kupyna256_10    ... bench:         379.25 ns/iter (+/- 15.23) = 26 MB/s
test kupyna256_100   ... bench:       3,916.63 ns/iter (+/- 78.04) = 25 MB/s
test kupyna256_1000  ... bench:      57,448.10 ns/iter (+/- 7,546.75) = 17 MB/s
test kupyna256_10000 ... bench:     554,591.68 ns/iter (+/- 60,150.87) = 18 MB/s
test kupyna512_10    ... bench:         969.11 ns/iter (+/- 117.66) = 10 MB/s
test kupyna512_100   ... bench:       9,383.38 ns/iter (+/- 2,446.42) = 10 MB/s
test kupyna512_1000  ... bench:     116,995.27 ns/iter (+/- 6,493.42) = 8 MB/s
test kupyna512_10000 ... bench:     995,084.96 ns/iter (+/- 165,238.44) = 10 MB/s

test result: ok. 0 passed; 0 failed; 0 ignored; 8 measured; 0 filtered out; finished in 119.54s

not seeing any increase yet, which I think is because we're still having to split the values up when doing the row rotation and substitution operations. need to look into the math of that so I can handle it, will probably require a refactor in how our sbox and mds tables are stored. still, bulk of the work seems done

@AnarchistHoneybun
Copy link
Contributor Author

@newpavlov please lmk if this is a step in the right direction at least

@AnarchistHoneybun
Copy link
Contributor Author

is there any way to benchmark such that I can get stats for individual functions, so I can see where my bottleneck is, without wrapping every function inside a timing block

@newpavlov
Copy link
Member

newpavlov commented May 25, 2025

cargo flamegraph is a common tool for this. You also could use perf directly.

Copy link
Member

@newpavlov newpavlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good step forward even without noticeable performance improvements.

which I think is because we're still having to split the values up when doing the row rotation and substitution operations

Yeah, you should probably minimize number of conversion to and from bytes and operate over 64-bit integers whenever possible.

@@ -147,4 +137,4 @@ impl SerializableState for KupynaLongVarCore {
}

#[cfg(feature = "zeroize")]
impl ZeroizeOnDrop for KupynaLongVarCore {}
impl ZeroizeOnDrop for KupynaLongVarCore {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep the empty last line in all files.


// println!("prev vector:=");
// for v in prev_vector.iter() {
// println!("{:016X?}", v);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove these comments.

for nu in 0..ROUNDS {
state = add_constant_plus(state, nu as usize);
state = apply_s_box(state);
state = rotate_rows(state);
state = mix_columns(state);
}
matrix_to_block(state)
state
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth to change add_constant_plus, apply_s_box, rotate_rows, and mix_columns to work in-place, i.e. over &mut [u64; COLS].

@@ -14,6 +14,7 @@ rust-version = "1.85"

[dependencies]
digest = "=0.11.0-pre.10"
hex-literal = "1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this dependency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've set up a main.rs file to verify the code works as expected while making changes to the functions, this is the reason for both the hex-literal dependency and the print statements. will clean them all up once I'm done making large changes to the functions, it just helps to pinpoint error faster if some math goes wrong

# Conflicts:
#	kupyna/src/long.rs
#	kupyna/src/long_compress.rs
#	kupyna/src/short.rs
#	kupyna/src/short_compress.rs
@AnarchistHoneybun AnarchistHoneybun deleted the kupyna-perf branch May 26, 2025 10:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants