-
Notifications
You must be signed in to change notification settings - Fork 321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Math: Optimize sofm_exp_fixed() HiFi version #9631
Conversation
src/math/exp_fcn_hifi.c
Outdated
@@ -357,20 +312,27 @@ int32_t sofm_exp_fixed(int32_t x) | |||
xs = x; | |||
while (xs >= SOFM_EXP_TWO_Q27 || xs <= SOFM_EXP_MINUS_TWO_Q27) { | |||
xs >>= 1; | |||
n++; | |||
n <<= 1; | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we not just count the number of bits to shift and shift by N directly? Maybe with a branch like
if (xs >= SOFM_EXP_TWO_Q27)
xs >>= ffs(xs) - ffs(SOFM_EXP_TWO_Q27);
else if (xs <= SOFM_EXP_MINUS_TWO_Q27)
xs >>= ...;
? The actual shift above is likely wrong, but that would be the idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a HiFi instruction to count bits for normalization, so maybe such instruction could be used. If counting bits is done in a loop like this it would not help.
There's also the Q27 vs. Q28 format issue, here we shift right one too much and in next we shift left. But I could not get measurable improvement from trying to optimize that, and I'm not sure I preserved right the equation so I left it out from this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it's max. 3 loops for worst case large input number. So, the overhead from counting bits other way needs to be small.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@singalsu on Xtensa __builtin_ffs()
uses the NSAU instruction to count those bits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I didn't know, I'll try with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@singalsu yep, we have definitions for all the ffs*()
and clz*()
variants in
Lines 133 to 139 in 53f2d96
#define ffs(i) __builtin_ffs(i) | |
#define ffsl(i) __builtin_ffsl(i) | |
#define ffsll(i) __builtin_ffsll(i) | |
#define clz(i) __builtin_clz(i) | |
#define clzl(i) __builtin_clzl(i) | |
#define clzll(i) __builtin_clzll(i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'm able to save another 0.4 MCPS with normalize HiFi instruction based input value scaling to replace the while loop. Also even more saving seems to be possible with allowing range -4 to +4 instead of -2 to +2 for the small value exponent function. I'll update this PR tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another great looking improvement! One question about using rand for the test case, plesae check inline.
int32_t ivalue, iexp_value; | ||
int i; | ||
|
||
srand((unsigned int)time(NULL)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit mixed feeling about this. Randomization can work, but then you'd have to have enough iterations to cover meaningful amount of the test space to provide repeatable and reliable results (see e.g. https://scotthannen.org/blog/2024/05/20/dont-use-random-numbers-in-tests.html).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed here testing of the other small values exp() function that uses random numbers (there was a discussion about it in review). Random numbers give good coverage over time if the random numbers are good (the failing input / output value is printed). But of course with 256 points in one run, some issue might be missed. But what would then be suitable number of points to trust that 32 bit inputs are properly tested in acceptable time. The cmocka test is run with xtensa simulator, so it could take quite long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still kept this. If it's a blocker, also easy to change to a fixed grid of test points. As you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
random is not reproducible but does offer a more varied surface area for the tests. @singalsu for the moment can we make this a static table as we dont yet have the infra to deal with test randomization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, knowing the exponent nature, I think I could test with a dense grid of 1 lsb at min and max values and some 100 points grid for intermediate values.
I think there's a mistake in random values of the other function test in this same source file so I should fix it too. The values tested are actually only integer decimal values +0.5 and not entire 32 bit range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have both? A fixed set of points and a few more random ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I can do that since I have already code for random.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed my mind again. After adding the linear grid test with 3 regions of exp() asymptotic approach of x-axis, curve up, asymptotic approach of y-axis I thought that having random added is low value, so I removed it.
bdd20f9
to
210bff1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless my eyes deceive, the last style issue is still there...
ed44aac
to
d8f21c1
Compare
There was no unit test for the function, so it is added. The pass criteria is tuned for current implementation to just pass in the three defined regions. It is useful to verify the next changes to optimize the function. Signed-off-by: Seppo Ingalsuo <[email protected]>
The unnecessary shift and multiply functions can be removed with use of normal C shift left and with use xtensa multiply, shift, and round intrinsics directly in the function. This change saves in TGL HiFi3 platform 1.3 MCPS in DRC processing mode. Signed-off-by: Seppo Ingalsuo <[email protected]>
d8f21c1
to
9b19183
Compare
double fstep = (b - a) / (step_count - 1); | ||
double fvalue = a + fstep * point; | ||
|
||
*value = (int32_t)round(fvalue * 134217728.0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of curiosity - that number has a meaning, right? I might be the only one who doesn't recognise it (ok, I'm sure it isn't Pi and it isn't e
, I'm certain enough about that :-) ) but maybe a comment would be helpful
No description provided.