MulAcc causes significant speed regression (#296)
I tried confirming the speed regression with a quick & scrappy benchmark. I first cloned sprs locally and patched it to add back the previous implementation of `csr_mulacc_dense_colmaj`:

```diff
--- a/src/sparse/prod.rs
+++ b/src/sparse/prod.rs
@@ -295,6 +295,46 @@ pub fn csc_mulacc_dense_colmaj<'a, N, A, B, I, Iptr>(
     }
 }
 
+/// CSR-dense colmaj multiplication
+///
+/// Performs better if rhs has few columns.
+pub fn csr_mulacc_dense_colmaj_old<'a, N1, N2, NOut, I, Iptr>(
+    lhs: CsMatViewI<N1, I, Iptr>,
+    rhs: ArrayView<N2, Ix2>,
+    mut out: ArrayViewMut<'a, NOut, Ix2>,
+) where
+    N1: 'a + Num + Copy,
+    N2: 'a + Num + Copy,
+    NOut: 'a + Num + Copy,
+    N1: std::ops::Mul<N2, Output = NOut>,
+    I: 'a + SpIndex,
+    Iptr: 'a + SpIndex,
+{
+    if lhs.cols() != rhs.shape()[0] {
+        panic!("Dimension mismatch");
+    }
+    if lhs.rows() != out.shape()[0] {
+        panic!("Dimension mismatch");
+    }
+    if rhs.shape()[1] != out.shape()[1] {
+        panic!("Dimension mismatch");
+    }
+    if !lhs.is_csr() {
+        panic!("Storage mismatch");
+    }
+    let axis1 = Axis(1);
+    for (mut ocol, rcol) in out.axis_iter_mut(axis1).zip(rhs.axis_iter(axis1)) {
+        for (orow, lrow) in lhs.outer_iterator().enumerate() {
+            let mut prev = ocol[[orow]];
+            for (rrow, &lval) in lrow.iter() {
+                let rval = rcol[[rrow]];
+                prev = prev + lval * rval;
+            }
+            ocol[[orow]] = prev;
+        }
+    }
+}
+
 /// CSR-dense colmaj multiplication
 ///
 /// Performs better if rhs has few columns.
```

Then I modified `Cargo.toml` to move `rand` into the main dependencies and register the benchmark binary:

```diff
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -29,6 +29,7 @@ smallvec = "1.4.0"
 rayon = { version = "1.3.0", optional = true }
 num_cpus = { version = "1.13.0", optional = true }
 approx = { version = "0.5", optional = true }
+rand = { version = "0.8", default-features = false, features = ["small_rng"] }
 
 [dev-dependencies]
 num-derive = "0.3"
@@ -37,7 +38,6 @@ tempfile = "3.1.0"
 bincode = "1.2.0"
 tobj = "3.0"
 image = { version = "0.23.0", default-features = false, features = ["png"] }
-rand = { version = "0.8", default-features = false, features = ["small_rng"] }
 
 [[bench]]
 name = "suite"
@@ -51,6 +51,11 @@ harness = false
 name = "sorting"
 harness = false
 
+[[bin]]
+name = "scrappy_bench"
+path = "src/scrappy_bench.rs"
+
+
 [workspace]
 members = [
     "sprs-ldl",
```

Finally, I wrote the following simple benchmark program in `src/scrappy_bench.rs` to compare the two implementations:

```rust
extern crate ndarray;
extern crate rand;
extern crate sprs;

use rand::Rng;
use sprs::prod::{csr_mulacc_dense_colmaj, csr_mulacc_dense_colmaj_old};

fn generate_random_csr(shape: sprs::Shape) -> sprs::CsMat<f32> {
    let mut rng = rand::thread_rng();
    let (rows, cols) = shape;
    let nnz = rng.gen_range(1..rows * cols / 2);
    let mut mat = sprs::TriMat::<f32>::new(shape);
    for _ in 0..nnz {
        let r = rng.gen_range(0..rows);
        let c = rng.gen_range(0..cols);
        let v = rng.gen::<f32>();
        mat.add_triplet(r, c, v);
    }
    mat.to_csr()
}

fn generate_random_array(shape: sprs::Shape) -> ndarray::Array2<f32> {
    let mut rng = rand::thread_rng();
    let data = (0..shape.0 * shape.1).map(|_| rng.gen()).collect::<Vec<f32>>();
    ndarray::Array2::from_shape_vec(shape, data).unwrap()
}

fn main() {
    const N_RUNS_PER_LOOP: usize = 1000;
    const N_LOOP: usize = 10;
    const MAX_LEFT_DIM: usize = 200;
    const MAX_RIGHT_DIM: usize = 200;
    const MAX_INNER_DIM: usize = 5000;

    let mut rng = rand::thread_rng();
    for _ in 0..N_LOOP {
        let left_shape: sprs::Shape = (
            rng.gen_range(10..=MAX_LEFT_DIM),
            rng.gen_range(10..=MAX_INNER_DIM),
        );
        let right_shape: sprs::Shape = (left_shape.1, rng.gen_range(1..=MAX_RIGHT_DIM));
        let rand_csr = generate_random_csr(left_shape);
        let rand_vec = generate_random_array(right_shape);
        let mut out = ndarray::Array2::<f32>::zeros((left_shape.0, right_shape.1));

        let start_t = std::time::Instant::now();
        for _ in 0..N_RUNS_PER_LOOP {
            csr_mulacc_dense_colmaj(rand_csr.view(), rand_vec.view(), out.view_mut());
        }
        println!("New\t{}ms", start_t.elapsed().as_millis());

        let start_t = std::time::Instant::now();
        for _ in 0..N_RUNS_PER_LOOP {
            csr_mulacc_dense_colmaj_old(rand_csr.view(), rand_vec.view(), out.view_mut());
        }
        println!("Old\t{}ms", start_t.elapsed().as_millis());
        println!();
    }
}
```

The benchmark can then be run with `cargo run --release --bin scrappy_bench`.
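The core loop of the restored `csr_mulacc_dense_colmaj_old` can be illustrated with a dependency-free sketch. The `Csr` struct and function names below are hypothetical (not the sprs API); the point is the accumulation pattern: each output element is copied into a stack local, accumulated there, and written back once.

```rust
// Minimal hand-rolled CSR, standing in for sprs::CsMat in this sketch.
struct Csr {
    indptr: Vec<usize>,  // row start offsets into indices/data, len = rows + 1
    indices: Vec<usize>, // column index of each stored value
    data: Vec<f32>,      // stored nonzero values
}

// Multiply-accumulate one dense column, the way the old routine does per column:
// accumulate each row's partial dot product in a stack local, write back once.
fn csr_mulacc_col(lhs: &Csr, rcol: &[f32], ocol: &mut [f32]) {
    for orow in 0..ocol.len() {
        let mut prev = ocol[orow]; // copy to the stack, like `let mut prev = ocol[[orow]]`
        for k in lhs.indptr[orow]..lhs.indptr[orow + 1] {
            prev += lhs.data[k] * rcol[lhs.indices[k]];
        }
        ocol[orow] = prev; // single write-back per output element
    }
}

fn main() {
    // 2x3 matrix [[1, 0, 2], [0, 3, 0]] times the column [1, 1, 1].
    let m = Csr {
        indptr: vec![0, 2, 3],
        indices: vec![0, 2, 1],
        data: vec![1.0, 2.0, 3.0],
    };
    let mut out = vec![0.0f32; 2];
    csr_mulacc_col(&m, &[1.0, 1.0, 1.0], &mut out);
    assert_eq!(out, vec![3.0, 3.0]);
    println!("{:?}", out); // [3.0, 3.0]
}
```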
As you can see, the new implementation with `MulAcc` is significantly slower than the old one.

---
Do the numbers change significantly if you enable `lto`?

---
Yep, the difference seems to widen. With:

```diff
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -29,6 +29,7 @@ smallvec = "1.4.0"
 rayon = { version = "1.3.0", optional = true }
 num_cpus = { version = "1.13.0", optional = true }
 approx = { version = "0.5", optional = true }
+rand = { version = "0.8", default-features = false, features = ["small_rng"] }
 
 [dev-dependencies]
 num-derive = "0.3"
@@ -37,7 +38,6 @@ tempfile = "3.1.0"
 bincode = "1.2.0"
 tobj = "3.0"
 image = { version = "0.23.0", default-features = false, features = ["png"] }
-rand = { version = "0.8", default-features = false, features = ["small_rng"] }
 
 [[bench]]
 name = "suite"
@@ -51,6 +51,14 @@ harness = false
 name = "sorting"
 harness = false
 
+[[bin]]
+name = "scrappy_bench"
+path = "src/scrappy_bench.rs"
+
+[profile.release]
+lto = true
+
+
 [workspace]
 members = [
     "sprs-ldl",
```

I get:

---
So I had a bit more fun debugging the cause of the speed regression. This is a tricky one because there turned out to be two causes for the slowdown:

1. The new code accumulates through a mutable reference into the heap-allocated output array, instead of copying the value into a stack local first.
2. The default `MulAcc` implementation is based on `num_traits::MulAdd::mul_add` rather than a plain multiply-then-add.

For the former, I'm referring to the change here, where we used to copy the value of the output element onto the stack. For testing purposes, we can change it back to the old behavior at the cost of adding an explicit `Clone` bound:

```diff
diff --git a/src/sparse/prod.rs b/src/sparse/prod.rs
index aa7241b1..8c721474 100644
--- a/src/sparse/prod.rs
+++ b/src/sparse/prod.rs
@@ -303,7 +343,7 @@ pub fn csr_mulacc_dense_colmaj<'a, N, A, B, I, Iptr>(
     rhs: ArrayView<B, Ix2>,
     mut out: ArrayViewMut<'a, N, Ix2>,
 ) where
-    N: 'a + crate::MulAcc<A, B>,
+    N: 'a + crate::MulAcc<A, B> + Clone,
     I: 'a + SpIndex,
     Iptr: 'a + SpIndex,
 {
@@ -322,11 +362,12 @@ pub fn csr_mulacc_dense_colmaj<'a, N, A, B, I, Iptr>(
     let axis1 = Axis(1);
     for (mut ocol, rcol) in out.axis_iter_mut(axis1).zip(rhs.axis_iter(axis1)) {
         for (orow, lrow) in lhs.outer_iterator().enumerate() {
-            let oval = &mut ocol[[orow]];
+            let mut oval = ocol[[orow]].clone();
             for (rrow, lval) in lrow.iter() {
                 let rval = &rcol[[rrow]];
                 oval.mul_acc(lval, rval);
             }
+            ocol[[orow]] = oval;
         }
     }
 }
```

For the latter, we can swap the `mul_add`-based default implementation for a plain multiply-then-add:

```diff
--- a/src/mul_acc.rs
+++ b/src/mul_acc.rs
@@ -19,13 +19,13 @@ pub trait MulAcc<A = Self, B = A> {
 
 /// Default for types which supports `mul_add`
 impl<N, A, B> MulAcc<A, B> for N
 where
     N: Copy,
-    B: Copy,
-    A: num_traits::MulAdd<B, N, Output = N> + Copy,
+    B: std::ops::Add<N, Output = N> + Copy,
+    A: std::ops::Mul<B, Output = B> + Copy,
 {
     fn mul_acc(&mut self, a: &A, b: &B) {
-        *self = a.mul_add(*b, *self);
+        *self = (*a * *b) + *self;
     }
 }
```

With both patches, when I rerun the benchmark:

With only the second (`mul_acc`) patch:

With only the first patch but without the second, the speed gap is also obvious:

but if I turn on `lto`:

Ok, so where do we go from here?

---
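To make the `mul_acc` change concrete, here is a standalone sketch of the two default-impl strategies. The trait shape is simplified from sprs, and the `Fused`/`Plain` wrapper types are hypothetical; `mul_add` rounds once (a fused multiply-add), while the patched version multiplies and adds separately.

```rust
// Simplified version of the trait at the center of this issue.
pub trait MulAcc<A = Self, B = A> {
    fn mul_acc(&mut self, a: &A, b: &B);
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct Fused(f64);
impl MulAcc for Fused {
    // num_traits::MulAdd style: fused multiply-add, a * b + self, rounded once.
    fn mul_acc(&mut self, a: &Fused, b: &Fused) {
        self.0 = a.0.mul_add(b.0, self.0);
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct Plain(f64);
impl MulAcc for Plain {
    // The patched default: a separate multiply, then an add.
    fn mul_acc(&mut self, a: &Plain, b: &Plain) {
        self.0 = (a.0 * b.0) + self.0;
    }
}

fn main() {
    let (mut f, mut p) = (Fused(1.0), Plain(1.0));
    f.mul_acc(&Fused(2.0), &Fused(3.0));
    p.mul_acc(&Plain(2.0), &Plain(3.0));
    assert_eq!(f.0, 7.0);
    assert_eq!(p.0, 7.0);
    println!("{} {}", f.0, p.0); // 7 7
}
```

The two can differ in the last bit for some inputs, since the fused form avoids the intermediate rounding of the product.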
@tomtung Is there any reason, other than the potential efficiency improvement, for using `mul_add`?

For the second issue, I don't mind the explicit copy and the extra `Clone` bound.

It still surprises me that the repeated accesses to the heap are hurting cache performance, since the accesses are to the same location each time. I guess the alternating access between the input and output arrays is what hurts.

---
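The two accumulation styles under discussion can be isolated in a small sketch (hypothetical function names, plain slices instead of ndarray views). Both compute the same result; the open question in this thread is which one the optimizer handles better.

```rust
// Accumulate through a mutable reference into the output buffer on every
// iteration, as the current csr_mulacc_dense_colmaj does via `&mut ocol[[orow]]`.
fn acc_via_ref(out: &mut [f32], lvals: &[f32], rvals: &[f32]) {
    let oval = &mut out[0];
    for (l, r) in lvals.iter().zip(rvals) {
        *oval += l * r;
    }
}

// Copy the output element to a stack local, accumulate there, and write back
// once, as the old implementation (and the `Clone` patch above) does.
fn acc_via_local(out: &mut [f32], lvals: &[f32], rvals: &[f32]) {
    let mut oval = out[0];
    for (l, r) in lvals.iter().zip(rvals) {
        oval += l * r;
    }
    out[0] = oval;
}

fn main() {
    let (lv, rv) = ([1.0f32, 2.0, 3.0], [4.0f32, 5.0, 6.0]);
    let (mut a, mut b) = ([0.0f32], [0.0f32]);
    acc_via_ref(&mut a, &lv, &rv);
    acc_via_local(&mut b, &lv, &rv);
    assert_eq!(a, b); // both compute the dot product 32.0
    println!("{} {}", a[0], b[0]); // 32 32
}
```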
So I ran my own bench with my FEM matrix. In my environment I see hardly any difference between heap access, copy, and clone, but I do get an improvement from removing the `mul_add`.

---
I added benches that are exactly the same as yours and am getting the same result as with my FEM matrix: hardly any difference between heap access, clones, and copies in my environment. To see the changes I made, look at this commit in my fork.

---
Hi @aujxn, thanks for looking into this! Sorry about the delay; I've been a bit busy at work lately. I cloned your fork and ran the benchmark. The full result is attached below (with added blank lines for grouping). Some observations:
Curious if the result is different on your machine.
---

I guess since

---
I just found out that my implementation of an ML model has been training up to several times slower than before. After some debugging, it turns out that the cause was #276, where the `MulAcc` trait was introduced in d56bd72, and specifically where the `Copy` bound was removed in d2f0da7. For details please see tomtung/omikuji#32.

My intuitive guess is that the `MulAcc` trait got rid of the `Copy` bound at the cost of forcing the values to be passed as references (pointers). This might have made it hard for the compiler to do inlining and/or other optimizations. Since the `+=` operation is typically in the innermost part of the loop, any slowdown in such hot spots would result in a significant speed regression.

I don't have context on the introduction of `MulAcc`, so I'm not sure what's the right way to fix this. Let me know if there's something I could help with.
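To illustrate that guess concretely, here is a minimal sketch (hypothetical helper names, not sprs code) of the same accumulation written with by-value `Copy` operands versus operands behind references. The two are semantically identical; the concern is whether the compiler optimizes the reference-based form equally well once inlining happens.

```rust
// By-value form: operands are Copy and live in registers.
fn axpy_by_value<T>(acc: T, a: T, b: T) -> T
where
    T: Copy + std::ops::Mul<Output = T> + std::ops::Add<Output = T>,
{
    acc + a * b
}

// By-reference form, mirroring the shape of MulAcc::mul_acc(&mut self, &A, &B):
// the operands arrive behind references and must be dereferenced.
fn axpy_by_ref<T>(acc: &mut T, a: &T, b: &T)
where
    T: Copy + std::ops::Mul<Output = T> + std::ops::AddAssign,
{
    *acc += *a * *b;
}

fn main() {
    let v = axpy_by_value(1.0f32, 2.0, 3.0);
    let mut r = 1.0f32;
    axpy_by_ref(&mut r, &2.0, &3.0);
    assert_eq!(v, r);
    println!("{} {}", v, r); // 7 7
}
```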