benchmarking/profiling #12
Comments
Sounds like a decent benchmark. (Does PHP use ICU?)
See also these Ruby benchmarks. The fastest code in that benchmark (although it didn't seem to look at utf8proc or ICU) was unf, which seems to be a wrapper around this unf_ext C++ code.
The eprun library claims to be fast, albeit in pure Ruby. A hopeful sign is that eprun was designed by a Unicode expert and includes benchmark data extracted from random Wikipedia pages.
I just pushed a little benchmark program based on the eprun data files. I also pushed a corresponding benchmark of ICU (although it gives a slightly unfair advantage to ICU by preallocating a huge buffer for the output, whereas utf8proc figures out the output size dynamically). Results on my machine:
So, unless I am messing something up, ICU is significantly faster, albeit with a far more painful API. Just to make sure that ICU is not "cheating" by doing some kind of caching that helps it for repeated normalization of the same string (the above benchmark loops 100x), I tried normalizing a single long string formed by concatenating the above files a few dozen times, and got 0.679s for utf8proc and 0.104s for ICU. (Compiling with […]) It would be nice to also benchmark against GNU libunistring and perhaps unf_ext from above.
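To make the caching check above concrete, here is a minimal sketch of the two timing modes: calling the routine repeatedly on one buffer versus calling it once on a long concatenation. Everything in it is an assumption for illustration: `work_fn`, `dummy_work`, `time_repeated`, and `time_concatenated` are hypothetical names, and `dummy_work` merely sums bytes as a stand-in for whichever normalization call is being measured. It is not the benchmark program that was pushed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature for the routine under test. */
typedef size_t (*work_fn)(const char *buf, size_t len);

/* Placeholder workload (assumption): sums the bytes of the buffer.
 * In a real benchmark this would call the normalizer being measured. */
static size_t dummy_work(const char *buf, size_t len)
{
    size_t sum = 0;
    for (size_t i = 0; i < len; i++) sum += (unsigned char)buf[i];
    return sum;
}

/* Mode 1: run the routine `reps` times on the same buffer.
 * Returns elapsed CPU seconds; accumulates the results in *total. */
static double time_repeated(work_fn f, const char *buf, size_t len,
                            int reps, size_t *total)
{
    clock_t t0 = clock();
    *total = 0;
    for (int r = 0; r < reps; r++) *total += f(buf, len);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Mode 2: build one long buffer from `copies` copies of the input and run
 * the routine once on it. Only the call itself is timed, not the copying.
 * Returns elapsed CPU seconds, or -1.0 if allocation fails. */
static double time_concatenated(work_fn f, const char *buf, size_t len,
                                int copies, size_t *total)
{
    char *big = malloc(len * (size_t)copies + 1);
    if (!big) return -1.0;
    for (int c = 0; c < copies; c++)
        memcpy(big + (size_t)c * len, buf, len);

    clock_t t0 = clock();
    *total = f(big, len * (size_t)copies);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    free(big);
    return secs;
}
```

If a library's timings in the two modes differ by much more than the per-call overhead would explain, that would hint at caching across repeated calls on identical input.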
Note that
I couldn't get the Ruby eprun working on my machine, but I don't really understand Ruby.
In order to figure out the correct buffer size, utf8proc_map has to perform the canonical decomposition twice, so we have a factor of 2 penalty from that compared to the (somewhat artificial) way I am calling ICU. But this does not correspond to a factor of 2 overall, because decomposition is only part of the process. If I hack out this doubled decomposition, the benchmark numbers improve slightly to:
|
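For readers unfamiliar with the "decompose twice" cost, the following is a toy sketch of the general measure-then-fill pattern, not utf8proc's actual code. `toy_expand` is a hypothetical stand-in for decomposition (it expands each `e` into two bytes), and the `bytes_examined` counter exists only to show that the two-pass variant walks the input twice while the preallocated variant walks it once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Counts how many input bytes toy_expand has examined across all calls. */
static size_t bytes_examined = 0;

/* Hypothetical stand-in for a decomposition step: each 'e' becomes the two
 * bytes "e'", everything else is copied. Writes at most `cap` bytes to `out`
 * (which may be NULL when cap == 0) and returns the size that is needed. */
static size_t toy_expand(const char *in, size_t len, char *out, size_t cap)
{
    size_t need = 0;
    for (size_t i = 0; i < len; i++) {
        bytes_examined++;
        if (need < cap) out[need] = in[i];
        need++;
        if (in[i] == 'e') {
            if (need < cap) out[need] = '\'';
            need++;
        }
    }
    return need;
}

/* Two-pass variant: measure, allocate exactly, then expand again.
 * The caller owns the returned buffer and must free() it. */
static char *expand_two_pass(const char *in, size_t len, size_t *out_len)
{
    size_t need = toy_expand(in, len, NULL, 0);   /* pass 1: size only */
    char *out = malloc(need ? need : 1);
    if (!out) return NULL;
    *out_len = toy_expand(in, len, out, need);    /* pass 2: fill */
    return out;
}

/* One-pass variant: the caller supplies a buffer assumed to be big enough. */
static size_t expand_prealloc(const char *in, size_t len, char *out, size_t cap)
{
    return toy_expand(in, len, out, cap);
}
```

On a 5-byte input, `expand_two_pass` examines 10 bytes and `expand_prealloc` examines 5, which is the factor of 2 on this one stage that the comment describes.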
From |
GNU libunistring (benchmark added in a39c1a6), which has a similar API (it operates directly on UTF-8 data and does not require the output buffer to be preallocated), looks very comparable to utf8proc:
|
We could make […] One possibility would be to add a flag that assumes valid input, in which case a codepath with fewer checks is called. (This may be somewhat annoying to implement without a bunch of cut-and-paste, though some preprocessor hacks could be used: e.g. have a file with the […])
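As a rough illustration of the preprocessor idea, here is a sketch that generates a checked and an unchecked UTF-8 decoder from one body. Note the assumptions: the comment above talks about a separate file, whereas this sketch uses a function-generating macro so that it fits in one block; `DEFINE_DECODER`, `decode_checked`, and `decode_unchecked` are hypothetical names, and none of this is utf8proc's existing code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* DEFINE_DECODER(NAME, CHECK) defines NAME(s, len, cp), which decodes one
 * code point starting at s, stores it in *cp, and returns the number of
 * bytes consumed, or -1 on invalid input. CHECK is a compile-time constant:
 * with CHECK == 0 every validation branch is dead code that the compiler
 * can drop, so that variant must only be given valid UTF-8. */
#define DEFINE_DECODER(NAME, CHECK)                                          \
static int NAME(const uint8_t *s, size_t len, int32_t *cp)                   \
{                                                                            \
    int n, i;                                                                \
    int32_t c;                                                               \
    if (CHECK && len == 0) return -1;                                        \
    if (s[0] < 0x80) { *cp = s[0]; return 1; }                               \
    if ((s[0] & 0xE0) == 0xC0)      { n = 2; c = s[0] & 0x1F; }              \
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; }              \
    else if (!CHECK || (s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; }    \
    else return -1;                                                          \
    if (CHECK && (size_t)n > len) return -1;                                 \
    for (i = 1; i < n; i++) {                                                \
        if (CHECK && (s[i] & 0xC0) != 0x80) return -1;                       \
        c = (c << 6) | (s[i] & 0x3F);                                        \
    }                                                                        \
    if (CHECK) {                                                             \
        static const int32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};       \
        if (c < min_cp[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))   \
            return -1;                                                       \
    }                                                                        \
    *cp = c;                                                                 \
    return n;                                                                \
}

/* Safe default: rejects truncated, overlong, surrogate, and out-of-range
 * sequences as well as stray continuation bytes. */
DEFINE_DECODER(decode_checked, 1)

/* Fast path for input already known to be valid UTF-8. */
DEFINE_DECODER(decode_unchecked, 0)
```

A hypothetical "assume valid" option flag would then just select `decode_unchecked` at the call site, keeping a single copy of the decoding logic in the source.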
@stevengj I agree that having valid UTF-8 input could be used to make |
The author of utf8rewind pings me once in a while on Reddit to let me know he's been adding new features, so this would be another thing to compare against eventually: https://bitbucket.org/knight666/utf8rewind/overview It's MIT licensed and pretty small like utf8proc is, so we could borrow anything that turned out to be worth using.
In the Apache Arrow project, we are also evaluating using […]
One could probably speed up upper/lowercase conversions, e.g. by adding a specialized table just for this, but it's not clear from the issue whether that functionality is actually performance-critical or just a test case.
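One way a specialized case table could look, as a sketch under stated assumptions: a small direct-indexed array covering U+0000..U+00FF that is consulted before any general lookup. `toupper_fast`, `fast_upper`, and `generic_toupper` are hypothetical names, and `generic_toupper` is only a placeholder for the library's general property lookup (here it returns its argument unchanged). The table uses simple one-to-one uppercase mappings only.

```c
#include <assert.h>
#include <stdint.h>

#define FAST_LIMIT 0x100  /* table covers U+0000..U+00FF (assumption) */

static int32_t fast_upper[FAST_LIMIT];
static int fast_upper_ready = 0;

/* Fill the table with simple uppercase mappings for ASCII and Latin-1. */
static void init_fast_upper(void)
{
    int32_t c;
    for (c = 0; c < FAST_LIMIT; c++) fast_upper[c] = c;       /* identity */
    for (c = 0x61; c <= 0x7A; c++) fast_upper[c] = c - 0x20;  /* a-z */
    for (c = 0xE0; c <= 0xFE; c++)                            /* Latin-1 */
        if (c != 0xF7) fast_upper[c] = c - 0x20;  /* U+00F7 is a division sign */
    fast_upper[0xB5] = 0x039C;  /* micro sign -> Greek capital mu */
    fast_upper[0xFF] = 0x0178;  /* y with diaeresis -> capital form */
    /* U+00DF (sharp s) has no simple uppercase mapping and stays as is. */
    fast_upper_ready = 1;
}

/* Placeholder (assumption) for the general, slower lookup path. */
static int32_t generic_toupper(int32_t c)
{
    return c;
}

/* Uppercase one code point: table hit for small values, fallback otherwise. */
static int32_t toupper_fast(int32_t c)
{
    if (!fast_upper_ready) init_fast_upper();
    if (c >= 0 && c < FAST_LIMIT) return fast_upper[c];
    return generic_toupper(c);
}
```

Whether this pays off depends on how much of the workload is ASCII or Latin-1 text, which is the open question raised above. The lazy initialization shown here is also not thread-safe; a static initializer would avoid that.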
This thesis claims to have implemented similar functionality with a considerable speedup: https://bearworks.missouristate.edu/theses/2731/ However, the source code does not seem to be publicly available. The thesis advisor is tragically deceased, and I'm not sure about the contact information for the author.
This may be the thesis author: https://www.linkedin.com/in/jpdurham
It would be good to perform some benchmarking of utf8proc against ICU, and in general to perform some profiling to see if there are any easy targets for optimization.