Performance followup #1737
Comments
Is there any way you can post versions of those flamegraphs that are either interactive, or are widened so that I can see the symbol names?
Just updated my post with an archive upload, Sir; looks like GitHub does not allow them to be interactive.
Thank you. So about doubling etc. With REQ/REP there is a full round trip. The throughput etc. may be latency dominated. So adding more pairs should increase performance until you saturate some other resource (either the network bandwidth available, the CPU resources, or something else like memory/bus speed -- the latter two of these are very dependent on the internal details of NNG as well -- what locks we use, etc.) There are some hot locks that you are probably hitting on, although the flame graphs above don't suggest that they are the limiters. (The lock protecting the hash of socket IDs to socket structs for example.) There are shared resources as well -- both REQ and REP share a few things:
I don't think your flamegraphs uncover the time spent not doing work, because we are waiting for a response. That's the issue in the 4th item, and it might be dominant. A few ideas though.
So the latency and keeping the pipe full may lead to some surprises where multiple clients improve performance. More info as I think on it.
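To make the latency-dominated point concrete, here is a rough back-of-the-envelope model (the ~59 µs round trip is implied from the ~17K m/s per-pair figure reported later in this thread, not measured directly):

```
per-pair throughput ≈ 1 / RTT              (REQ/REP keeps one request in flight per pair)
~17K msg/s per pair  ⇒  RTT ≈ 1 / 17,000 s ≈ 59 µs
N independent pairs  ⇒  aggregate ≈ N / RTT, until CPU, locks,
                        or memory/bus bandwidth become the limiter
```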
Looking at inproc, I think it was designed for safety, and in particular has some guards to prevent recursive lock entry. As a result, we wind up not "completing" jobs inline that we might otherwise be able to, which forces us to go to the taskq to break the lock cycle. I do believe that there is some opportunity to improve this -- in particular it seems like some of the locking between the readers and writers could be broken by completing the aios outside of the lock. I have done this for some other transports but haven't paid much attention to inproc. Partly I don't think of people doing much performance critical work with inproc, so it hasn't gotten the attention that some of the other transports have (particularly for performance). I usually assume that folks who have performance critical needs and could use inproc would just use function calls. :-) If you're on UNIX / Linux, you might want to look at the socket:// transport I just integrated. It's easier to use than ipc (no filesystem paths; it uses socketpair()) but should have performance very close to ipc:// because it is backed almost identically. (There are two differences -- we don't write the extra flags byte for socket://, so that saves a tiny bit, and we use write() rather than sendmsg(), which probably won't be noticeable. I could change that if we found a reason to do so.)
I just had another look at those flame graphs. Your code is apparently allocating and freeing nng_aio structures, and that is responsible for a large portion of the work. You can avoid this in your application code. Create one or more nng_aio structures ahead of time, then use callback-driven aio. This will reduce latency, reduce context switches, and completely eliminate the overhead of allocation / free.
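For illustration only, a minimal sketch of the pre-allocated, callback-driven pattern on the REP side (the names `echo_state`/`echo_cb`, the `inproc://bench` URL, and the echo behavior are assumptions for the example, not taken from this issue):

```c
// Sketch only: one pre-allocated nng_aio drives a receive/send echo loop from
// its callback, so nothing is allocated or freed on the per-message path.
#include <nng/nng.h>
#include <nng/protocol/reqrep0/rep.h>
#include <stdbool.h>
#include <stdio.h>

struct echo_state {
    nng_socket sock;
    nng_aio   *aio;
    bool       sending; // false: a recv is outstanding, true: a send is outstanding
};

static void echo_cb(void *arg)
{
    struct echo_state *st = arg;
    nng_msg           *msg;

    if (nng_aio_result(st->aio) != 0) {
        // Reclaim any message still attached to the aio and re-arm the receive.
        // (Real code should stop on fatal errors such as NNG_ECLOSED.)
        if ((msg = nng_aio_get_msg(st->aio)) != NULL) {
            nng_msg_free(msg);
            nng_aio_set_msg(st->aio, NULL);
        }
        st->sending = false;
        nng_recv_aio(st->sock, st->aio);
        return;
    }

    if (!st->sending) {
        // Receive completed: send the same message back as the reply, reusing the aio.
        msg = nng_aio_get_msg(st->aio);
        nng_aio_set_msg(st->aio, msg);
        st->sending = true;
        nng_send_aio(st->sock, st->aio);
    } else {
        // Send completed: arm the next receive.
        st->sending = false;
        nng_recv_aio(st->sock, st->aio);
    }
}

int main(void)
{
    struct echo_state st = { .sending = false };

    if (nng_rep0_open(&st.sock) != 0 ||
        nng_aio_alloc(&st.aio, echo_cb, &st) != 0 ||
        nng_listen(st.sock, "inproc://bench", NULL, 0) != 0) {
        fprintf(stderr, "setup failed\n");
        return 1;
    }

    nng_recv_aio(st.sock, st.aio); // kick off the loop; the callback re-arms it

    nng_msleep(60000); // stand-in for the real application's lifetime
    nng_aio_free(st.aio);
    nng_close(st.sock);
    return 0;
}
```

The same pattern applies on the REQ side; with one aio (or a small pool of them) per pair, the allocation/free overhead seen in the flame graphs disappears from the hot path.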
Hi @gdamore,
Happy holidays and many thanks for addressing #1663 -- really appreciated; we're definitely on the right track, as I can confirm that the throughput roughly doubled. Now, however, I would like to follow up with some more analysis that bothers me.
I have the following 4 test case runs (each of them is run in a separate process and not in parallel, so that there is no side load on the system):

- `inproc`: 2 pairs.
- `inproc`: 4 pairs.
- `ipc`: 2 pairs.
- `ipc`: 4 pairs.

By "pair" I mean that each 2 participants talk privately to each other exclusively (so there is no case where one REP would have 2 or more REQs, i.e. 1-to-1 pairs only), but the pairs themselves execute in parallel within one test case (process).
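For concreteness, here is a minimal sketch of what one such 1-to-1 pair could look like (the `open_pair` helper, the URL scheme, and the use of socket-level REQ/REP are illustrative assumptions; the actual benchmark code is not shown in this issue):

```c
// Sketch only: one "pair" is a REP listener and a REQ dialer on a private
// address; N such pairs run in parallel within one benchmark process.
#include <nng/nng.h>
#include <nng/protocol/reqrep0/req.h>
#include <nng/protocol/reqrep0/rep.h>
#include <stdbool.h>
#include <stdio.h>

// open_pair is a hypothetical helper: it wires up the idx-th private pair.
static int open_pair(int idx, nng_socket *req, nng_socket *rep, bool use_ipc)
{
    char url[64];
    int  rv;

    // Each pair gets its own address, so no REP ever serves more than one REQ.
    if (use_ipc) {
        snprintf(url, sizeof(url), "ipc:///tmp/bench-pair-%d", idx);
    } else {
        snprintf(url, sizeof(url), "inproc://bench-pair-%d", idx);
    }

    if ((rv = nng_rep0_open(rep)) != 0 ||
        (rv = nng_req0_open(req)) != 0 ||
        (rv = nng_listen(*rep, url, NULL, 0)) != 0 ||
        (rv = nng_dial(*req, url, NULL, 0)) != 0) {
        return rv; // caller can report nng_strerror(rv)
    }
    return 0;
}
```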
Now, just from the definitions of the cases, my expectations would be:

- `inproc` substantially (or otherwise visibly) more performant than `ipc`.

Somehow, I observe results that contradict every expectation. Let's start with `inproc`:
- `inproc`: 2 pairs
- `inproc`: 4 pairs

Observations:
- (with the `inproc` implementation) overall throughput of ~35K m/s
- (with the `inproc` implementation) individual pair throughput of ~17K m/s
Now let's see how `ipc` is doing:

- `ipc`: 2 pairs
- `ipc`: 4 pairs

Observations:
So to summarize:

- `inproc`: performance of the whole system degrades with the number of pairs of actors (basically unusable).
- `ipc`: performance of the whole system grows with the number of pairs of actors (NOTE: with 16 pairs I am close to ~300K m/s).
- `ipc` starts to outperform `inproc` with 4 or more pairs (as a result of the above 2 items); I cannot explain this...

Each flame graph is uploaded in the form of an interactive SVG (fgs.zip); you can browse each of them in a regular web browser (e.g. Chromium or Firefox). You probably want to look into the `nni_taskq_thread` branch for further analysis. To me it appears that there is a lot of interlocking happening in NNG -- how else can we explain the rapid degradation in the `inproc` case, where seemingly independent pairs affect each other somehow?

I appreciate your efforts and look forward to improving NNG even further together; I'd be glad to test any improvements or other nuances for you.
Wish you happy holidays 🎅
**Environment Details**

- NNG version: `master`
- Kernel: `3.10.0-1160.80.1.el7.x86_64`