-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python Tensorstore af-dist implementation #161
Comments
This may be a useful tool for finding good Tensorstore configuration settings: https://github.com/royerlab/czpeedy (via cubed-dev/cubed#513) |
Looks like that's just for write performance, but definitely worth having a look at 👍 |
I am certainly happy to investigate this issue! Before I start implementing afdist with TensorStore's Python API, I wonder if you built the project in Release mode (i.e.
If you are already building in Release mode, can you describe your informal testing? I would like to replicate the problem to debug it. ReferencesP.S. I found |
Thanks for the tip on Release mode @Will-Tyler, you're right I'd forgotten that. This is what I get when I run the benchmarks (added in #162):
If we focus on the user_time column, we can see there's a substantial difference between tensorstore and Zarr python there. (The 100k samples result is confusing when you look at wall_time because the disk cache was warm for ts_cpp, but not for zarr.) Looking at top for ts_cpp, I can see that tensorstore is running a bunch of background threads to do the fetching and decoding. Perhaps this is contributing to the overhead? It shouldn't be that much, though. |
Basically, I don't mind if we poke at this in C++ or Python - I'd just find it easier to understand what Tensorstore is doing if we go through the Python API. |
Cutting out the bits where we process the chunks, Here's the perf report for ts_py:
vs Zarr python (which is much quicker):
|
Hmm - so looks like the threaded async approach tensorstore is using is adding substantial overhead here. I'll dig a bit more. |
Tweaking the tensorstore context options a bit doesn't seem to help much. With just loading the chunks into memory we get:
which is a huge disparity, given tensorstore should be doing very little here. |
Hmmm, looking at strace output seems to confirm that tensorstore really is reading the chunk files once, although may be doing it with async APIs rather than sync file system calls. |
What is your test harness for this? What's the zarr spec? I can try to see if there are any C++ inefficiencies to remove here. |
Hi @laramiel 👋 Any help would be much appreciated! We have two different implementations, the C++ one that we started out with, and a Python one which we made to make playing around with it a bit easier. The easiest way to run this and compare the implementations is using the dataset we have created for benchmarking, and the benchmarking script. This should get you up and running:
For simplicity, you may want to factor out the actual processing of the chunks. You can do this by replacing the function count_genotypes_chunk with a no-op. These are the numbers I get on my (busy) machine when the two implementations are just iterating chunk-by-chunk and not doing any processing:
The overall CPU time cost is over 2X here for Tensorstore, which seems unnecessary? The benchmarking setup is a bit convoluted sorry, it's something that's evolved over time. Please do let me know if you hit any problems, I'd be more than happy to help! Thanks again for taking a look. |
@jeromekelleher I'm not seeing the c++ tensorstore implementation slowness on my desktop:
Will run the larger dataset overnight. |
Both of those have a very similar structure where they issue a read and then wait via |
That's interesting - @Will-Tyler noted something similar. I'll rerun and see what we get. Still not obvious why the Python version is using so much more CPU though. I see the argument for sync vs async, but I would have thought that this should only affect the overall wall-time rather than actual CPU cycles used (user time)? |
|
That's super - exactly what we'd hope to see. And excellent that TensorStore c++ is outperforming the Python code! |
Here's what I'm getting currently:
It must be something to do with the setup on my server I guess? |
Are you working off an SSD or spinning platters here @benjeffery? |
I've tested both and seen little difference. I've also just realised that you merged #164 yesterday afternoon - my previous python numbers were with code before that. 🤦 Here are the numbers with that included:
|
Good to know. So, must be something specific to my setup so - the machine and OS are quite old, so could be some important compile time optimisations are missing or something. Did you do anything funky with the Tensorstore build? |
Nope nothing at all - just installed cmake and ran the build as it is configured in the repo. |
I'm running
|
Well this really is bizarre. After upgrading to the latest version of the Tensorstore Python module (which I'm not compiling, using the distributed binarys) I get this:
So, the Python tensorstore version here is nearly a full minute slower than the Zarr version! Digging in with perf I get this for the Zarr python version: and this for Tensorstore Python version: So in both cases, the majority of the time is spent in Blosc, as you'd expect, but one takes 3X the time of the other. The processor here is old (Xeon E5-2680) and that probably explains the disparity with @benjeffery's results. I can think of two possibilities:
@benjeffery - could you run these through perf record also, so we can see what's being called under the hood on your machine? It would be nice if we could just say "oh well it doesn't work on old machines 🤷 " but this is the hardware I have available for doing these benchmarks, and they need a lot of storage and CPU time. I'm not keen on moving them somewhere else, so if we want to include Tensorstore we'll have to figure out a way of solving the issue. |
So it's possible that we're missing the default -msse2 flag when building blosc, which is a default option when building with CMake. I'll try and update those and see if it helps. Edit: Nah, -msse2 should be the default with all x86_64 arch. Edit: I noticed that you set the concurrency limits pretty low. Was there a reason for this?
https://google.github.io/tensorstore/kvstore/file/index.html |
I haven't had a chance to look at your benchmarks and results in detail, but one thing to note is that the blosc codec in numcodecs (https://github.com/zarr-developers/numcodecs/blob/02d3ceec395481f3d84eb8f2f770e66aca6361b0/numcodecs/blosc.pyx#L478) used by zarr-python uses multi-threaded blosc under certain conditions that may apply in your benchmark. Tensorstore never uses multi-threaded blosc but instead can decode multiple chunks in parallel; in the case of your benchmark there is only one chunk at a time, though. |
Thanks for the input @laramiel and @jbms!
That was just experimenting trying to rule out the overhead of having background threads. As we're just synchronously working chunk-by-chunk, there's nothing to gain from doing things in threads, and I was trying this to see if it made any difference. It doesn't.
This should all be factored out by measuring user time - true, having threads in blosc could affect the wall time, but it shouldn't make any difference to the overall compute cost. Wall-time and User-time + Sys-time are basically the same here, so I don't think threads are having any effect. |
Ah yes, interestingly you have My guess here is that Zarr is somehow handing memory to Blosc that meets the stricter alignment requirements for SSE/AVX than Tensorstore, and so we don't see this ```__memmove_XXX_unaligned_erms`` call in Zarr. The newer generation of CPUs is better able to handle stuff that's not aligned. It would be amazing if the penalty for not having aligned memory under SSE2 was this big, but I think it's the best guess right now? I see there are some calls to |
That is puzzling all right - 3X penalty just for unaligned memory seems huge. I'm not able to run the sched events in perf right now and have to context switch I'm afraid! |
Anecdotally, I noticed the same problem when running on some Azure VMs last week. I ran the Python Tensorstore version and it was way slower than Zarr python. I didn't have time to characterise properly I'm afraid, but it does indicate that this is more than just a problem with crusty old hardware and likely something that people will encounter in the real world. |
I am able to reproduce this locally per your instructions:
Running both the tensorstore and zarr-python benchmarks with google-perftools separately, the additional time with tensorstore seems to be as follows: (a) blosc_decompress: 11.12 seconds (tensorstore) / 8.15 seconds (zarr-python)
(b) An additional ~1 second is spent by tensorstore on an unnecessary memset of the output buffer used by blosc. This is pure overhead and we should be able to fix this relatively easily. (c) An additional ~1 second is spent by tensorstore on a memcpy to the destination array from the chunk cache. Currently tensorstore always decodes first to the chunk cache, but we do have some work in progress to bypass the chunk cache in cases like this where the chunk cache has a total_bytes_limit of 0 and the read covers the entire chunk. The additional memset and memcpy have a particularly strong impact on this benchmark because of the very high compression ratio --- 100 to 1. We will definitely look into these 3 causes, and try to get to the bottom of the blosc shuffle performance discrepancy, and remove the unnecessary memset and memcpy. I'll note that this benchmark seems to be obviously very useful for identifying sources of overhead like these, but because of the very high compression ratio (meaning the I/O is negligible) and the fact that it involves just reading a single complete chunk at a time, it essentially just tests chunk decoding and not any of the indexing or I/O functionality of the zarr implementation. If the benchmark matches the actual use case then that is the right thing to test, though. |
Fantastic, thanks for looking into this @jbms! I would love to include the Tensorstore benchmarks in the paper because cross language support and the potential to integrate with existing C++ based tools is a key part of our story. If you can help us with the Tensorstore overhead here we'd be very happy to include you (and anyone else involved) as authors on this paper. I think the benchmark is a useful one, in that it follows a very common access pattern for the data, and implements the core allele classification operation that will be used a great deal, e.g., in dataset quality control. Certainly the compression ratios are realistic, and what we observe for real data. |
This eliminates the unnecessary memset (but not the unnecessary copy) described here: sgkit-dev/vcf-zarr-publication#161 (comment) PiperOrigin-RevId: 691924460 Change-Id: I2cfe01659a6088278228763b7a36953125cf2380
Just noting here that I've upgraded from Python tensorstore 0.1.64 to 0.1.68 and it hasn't made much difference unfortunately. The changes may have shaved a few seconds off, but we're still looking at ~80 seconds using Tensorstore Python vs ~30 seconds with Zarr python. |
Running some more profiling, here's what we get from
Drilling in then on blosc_internal_bshuf_shuffle_bit_eightelem_scal we get
|
Here's the same for tensorstore Python (latest version)
Drilling in to blosc_internal_bshuf_shuffle_bit_eightelem_scal again (and selecting the hottest instruction):
|
I'm out of my depth here with the assembly level stuff, but it looks like the big difference is the Does this give you any insights as to what's going on @jbms? |
That tensorstore loop is not unrolled; the python loop is. https://godbolt.org/z/T1bh36bcK Edit: it appears that GCC doesn't unroll this loop at |
Thanks for producing those profiles, and nice work analyzing that @laramiel |
FWIW, it's not clear to me where numcodecs gets its compiler flags set, but my best guess is that is follows the Python values, which might look like
I think It's built on the manylinux2014 Docker image, which is based on GCC 10. That would predict that this loop is not unrolled according to @laramiel's analysis above though. |
See sgkit-dev/vcf-zarr-publication#161 The default bazel -c opt build mode is -O2, vs. the default cmake Release optimization mode of -O3. It turns out that gcc doesn't unroll the underlying blosc loop at -O3, so this sets the level to -O3 when building wheels. PiperOrigin-RevId: 698176524 Change-Id: Iea9c47e1cb96d40169d2b28af93cf53b33e6991c
I pushed v0.1.69 which should now be built with -O3. PTAL. |
Thanks @laramiel. With tensorstore 0.1.69 I am seeing about 11.54 seconds compared to 10.54 with zarr-python; tensorstore 0.1.68 is about 2 seconds slower. Per the previous profiling results, the remaining 1 second discrepancy is presumably due to the extra memcpy that we have not yet eliminated but plan to work on. |
You could try it with https://google.github.io/tensorstore/kvstore/file/index.html#json-Context.file_io_memmap and see if that helps. |
I've tried it out on my benchmarks @laramiel and unfortunately it's made no noticeable difference:
Annotating blosc function:
Version 0.1.69 installed from pip:
|
That assembly is unrolled and is substantially the same as the python version; there's a little difference between some instruction order, but nothing stands out to me there. |
One other possibility that occurs to me --- @jeromekelleher are you running the benchmark on a machine with NUMA? |
I am indeed @jbms, a dual socket Xeon. I'll paste in the numa details later. |
The NUMA-related issue I suspected would be that the output array from blosc decompression happens to be on a different NUMA node from the CPU executing the decompression. In zarr-python everything is single-threaded so it is unlikely to happen, while tensorstore uses multiple threads making it more plausible. However, looking at the code in tensorstore, I see that the output array gets allocated in the same thread that does the decompression, which I think makes a NUMA-related issue less likely. In any case, can you try running the benchmark pinned to a single numa node in order to rule out this possibility? |
I've tried pinning to numa node 0 and 1 (with numactl -m) and oddly node 0 seems to run a bit slower than node 1 (~82 seconds vs 76 seconds), but I think it's probably something to do with the cooling on one of the CPUs being a bit better than the other (I've observed frequency limiting messages from the kernel at various times). But it's nowhere near big enough to explain the difference to the Zarr python version (~30 seconds). Previous experience with trying to tune numa performance on this box would suggest a max of 5-10% perf difference between optimal numa usage vs not, so I doubt that's what's going on here. |
The C++ af-dist implementation added in #157 (and notes in #160) is surprisingly slow. In some informal tests it looks like it's about 3X slower than Python Zarr, just retrieving the chunks. There's no good reason for this, so I suspect we must be using Tensorstore badly.
To make it easier to experiment with Tensorstore configuration settings, I think it would be very helpful to have a Python implementation of the afdist operation using the Python tensorstore API. The Python API is properly documented, and the (unbelievably slow!) C++ compile times don't have to be endured.
CPU time should be dominated by decompression, so we really should expect to have performance parity with the Python Zarr approach.
@Will-Tyler - fancy taking this one on?
The text was updated successfully, but these errors were encountered: