Performance followup #1737

Open
Alexander-Shukaev opened this issue Dec 18, 2023 · 5 comments

Alexander-Shukaev commented Dec 18, 2023

Hi @gdamore,

Happy holidays, and many thanks for addressing #1663; I really appreciate it. We're definitely on the right track, as I can confirm that the throughput roughly doubled. Now, however, I would like to follow up with some further analysis that bothers me.

I have the following 4 test-case runs (each is run in a separate process, and not in parallel, so that there is no side load on the system):

  • 2 pairs REQ/REP talking via inproc.
  • 4 pairs REQ/REP talking via inproc.
  • 2 pairs REQ/REP talking via ipc.
  • 4 pairs REQ/REP talking via ipc.

By "pair" I mean that each two participants talk privately and exclusively to each other (1-to-1 only; there is no case where one REP serves 2 or more REQs). The pairs themselves, however, execute in parallel within one test case (process); a minimal sketch of what one such pair looks like follows the list below. Now, just from the definitions of the cases, my expectations would be:

  • High throughput in all cases, for each pair.
  • Throughput may gradually but slowly decrease with 4 pairs (or more, i.e. with an increasing number of pairs of actors) running in one process, due to natural saturation of system resources (CPU, communication, etc.).
  • Throughput should not drop drastically with just 4 pairs (instead of 2) unless there is some "global" interlocking between pairs of (seemingly independent) actors inside NNG due to implementation details, which would obviously not be OK.
  • inproc should be substantially, or at least visibly, more performant than ipc.
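
For context, here is a minimal sketch of what one such pair looks like (an illustrative reconstruction against the NNG API, not my actual test harness; the URL handling and error handling are simplified):

    #include <nng/nng.h>
    #include <nng/protocol/reqrep0/rep.h>
    #include <nng/protocol/reqrep0/req.h>

    /* One "pair": a dedicated REP listener and a dedicated REQ dialer on a
     * private inproc (or ipc) address; nothing else ever connects to either
     * socket.  Each pair then runs its own 200K-message request/reply loop
     * in its own threads, in parallel with the other pairs of the test case. */
    static int
    open_pair(const char *url, nng_socket *req, nng_socket *rep)
    {
        int rv;
        if ((rv = nng_rep0_open(rep)) != 0 ||
            (rv = nng_listen(*rep, url, NULL, 0)) != 0 ||
            (rv = nng_req0_open(req)) != 0 ||
            (rv = nng_dial(*req, url, NULL, 0)) != 0) {
            return rv;
        }
        return 0;
    }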

Somehow, I observe results that contradict every expectation. Let's start with inproc:

INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:2
INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:1
INFO: Exchanged 200000 messages (451045 KiB) in 11.408 seconds: 17531.000 m/s (39537.000 KiB/s)
INFO: Exchanged 200000 messages (451416 KiB) in 11.592 seconds: 17253.000 m/s (38940.000 KiB/s)
INFO: Exchanged 400000 messages (902462 KiB) in 2 pairs: 34784 m/s (78477 KiB/s)
INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:1
INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:3
INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:2
INFO: inproc://transfixi_nng.g.Scaling.Test_Scaling:4
INFO: Exchanged 200000 messages (452012 KiB) in 21.176 seconds: 9444.000 m/s (21345.000 KiB/s)
INFO: Exchanged 200000 messages (451840 KiB) in 21.141 seconds: 9460.000 m/s (21373.000 KiB/s)
INFO: Exchanged 200000 messages (451219 KiB) in 21.252 seconds: 9411.000 m/s (21232.000 KiB/s)
INFO: Exchanged 200000 messages (451414 KiB) in 21.271 seconds: 9402.000 m/s (21222.000 KiB/s)
INFO: Exchanged 800000 messages (1806487 KiB) in 4 pairs: 37717 m/s (85171 KiB/s)
[Flame graphs attached: inproc 2 pairs, inproc 4 pairs]

Observations:

  • Very low (especially for an inproc implementation) overall throughput of ~35K m/s.
  • Very low (especially for an inproc implementation) individual pair throughput of ~17K m/s.
  • Changing from 2 pairs to 4 pairs halves the per-pair throughput and, as a result, doubles the time needed to finish exchanging 200K messages for each pair, as if there were some interference/relation between them...

Now let's see how ipc is doing:

INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/82991:1
INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/82991:2
INFO: Exchanged 200000 messages (452363 KiB) in 17.902 seconds: 11172.000 m/s (25269.000 KiB/s)
INFO: Exchanged 200000 messages (451481 KiB) in 18.008 seconds: 11106.000 m/s (25071.000 KiB/s)
INFO: Exchanged 400000 messages (903845 KiB) in 2 pairs: 22278 m/s (50340 KiB/s)
INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/109634:1
INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/109634:2
INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/109634:3
INFO: ipc:///bb/data/tmp/transfixi_nng.g/Scaling.Test_Scaling/109634:4
INFO: Exchanged 200000 messages (452072 KiB) in 9.987 seconds: 20026.000 m/s (45267.000 KiB/s)
INFO: Exchanged 200000 messages (451981 KiB) in 10.686 seconds: 18715.000 m/s (42295.000 KiB/s)
INFO: Exchanged 200000 messages (451480 KiB) in 10.758 seconds: 18591.000 m/s (41967.000 KiB/s)
INFO: Exchanged 200000 messages (451711 KiB) in 10.805 seconds: 18509.000 m/s (41805.000 KiB/s)
INFO: Exchanged 800000 messages (1807245 KiB) in 4 pairs: 75841 m/s (171333 KiB/s)
[Flame graphs attached: ipc 2 pairs, ipc 4 pairs]

Observations:

  • Very low overall throughput of ~22K m/s.
  • Very low individual pair throughput of ~11K m/s.
  • Changing from 2 pairs to 4 pairs nearly doubles the per-pair throughput (😨) and, as a result, roughly halves the time needed to finish exchanging 200K messages for each pair (essentially quadrupling overall throughput), as if there were some boosting/relation between them...

So to summarize:

  • inproc performance of the whole system degrades with the number of pairs of actors (basically unusable).
  • ipc performance of the whole system grows with the number of pairs of actors (NOTE: with 16 pairs I get close to ~300K m/s).
  • ipc starts to outperform inproc (as a result of the above 2 items) with 4 or more pairs; I cannot explain this...
  • Base performance of individual pairs is quite low for both (under 20K m/s). My expectation would be to reach 50K-100K m/s, and I'm quite sure that's doable, as that's what we get with a home-grown acking protocol on top of the TCP/IP stack.

Each flame graph is uploaded as an interactive SVG (fgs.zip); you can browse them, and drill into each branch, in a regular web browser (e.g. Chromium or Firefox). You probably want to look into the nni_taskq_thread branch for further analysis. To me it appears that there is a lot of interlocking happening inside NNG; how else can we explain the rapid degradation in the inproc case, where seemingly independent pairs somehow affect each other?

I appreciate your efforts and look forward to improving NNG even further together; I'd be glad to test any improvements or other tweaks for you.

Wish you happy holidays 🎅

Environment Details:

  • master
  • 3.10.0-1160.80.1.el7.x86_64
  • GNU 11.2.1 C/C++
  • Static library

gdamore commented Dec 18, 2023

Is there any way you can post versions of those flame graphs that are either interactive, or widened so that I can see the symbol names?

Alexander-Shukaev (Author) commented

Just updated my post with an archive upload, Sir; it looks like GitHub does not allow them to be interactive.


gdamore commented Dec 18, 2023

Thank you.

So, about the doubling etc. With REQ/REP there is a full round trip, so the throughput may be latency dominated. Adding more pairs should therefore increase performance until you saturate some other resource: the available network bandwidth, the CPU resources, or something else like memory/bus speed (the latter two are very dependent on the internal details of NNG as well, i.e. which locks we use, etc.).
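
As a rough back-of-the-envelope illustration using the figures reported above: ~17.5K msg/s per pair in the 2-pair inproc run implies an average of roughly 1/17,500 s ≈ 57 µs per request/reply round trip; if most of that time is spent waiting rather than computing, adding pairs can raise aggregate throughput even though no individual pair gets faster.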

There are some hot locks that you are probably hitting, although the flame graphs above don't suggest that they are the limiters (the lock protecting the hash of socket IDs to socket structs, for example).

There are shared resources as well -- both REQ and REP share a few things:

  1. The set of threads used for the backend poller (on Linux it's just one right now, but on Windows it's scaled to the number of cores)
  2. The set of threads used for completion callbacks (typically something like 8, or ncpu, depending on your compilation flags).
  3. On a single socket, contexts share the lock of the socket that owns them (almost all the interesting activity happens under that lock).
  4. There is typically just a single TCP connection/IPC connection between the peers. Keeping this full, but not overfull, is critical to obtaining best performance.

I don't think your flame graphs uncover the time spent not doing work (i.e., while we are waiting for a response). That's the issue in the 4th item, and it might be dominant.

A few ideas though.

  1. Consider multiple sockets instead of multiple contexts. Contexts are meant to make things easier, but you wind up hitting those common locks. (It's probably easy to open multiple client sockets, and hard to open multiple server sockets, so start with the clients; see the sketch after this list.)
  2. For inproc especially, the friction of opening another socket should be very low indeed.
  3. See if you can measure latency. What would be good to know is the end-to-end latency between two peers. For example, if the latency is 1 msec over the network (which would be high), that will limit you to 1000 msgs/sec for a single peer, but you could get up to 2000 with 2 peers. Do not discount the impact of latency. (For inproc the latency should be close to zero... measured in <1 us, probably on the order of 100 ns, but I haven't measured it.) You'll want to measure the processing time on the server vs. the total latency observed by the client, and subtract the two to get the actual network latency (a minimal timing sketch follows below).
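
A minimal sketch of the first idea, assuming all the clients dial the same listener (the client count, URL handling, and error handling here are illustrative):

    #include <nng/nng.h>
    #include <nng/protocol/reqrep0/req.h>

    #define NCLIENTS 4

    /* Instead of NCLIENTS contexts on one REQ socket (which all share that
     * socket's lock), open NCLIENTS independent REQ sockets dialing the same
     * REP listener; each client then gets its own socket-level state. */
    static int
    open_clients(const char *url, nng_socket clients[NCLIENTS])
    {
        for (int i = 0; i < NCLIENTS; i++) {
            int rv;
            if ((rv = nng_req0_open(&clients[i])) != 0 ||
                (rv = nng_dial(clients[i], url, NULL, 0)) != 0) {
                return rv;
            }
        }
        return 0;
    }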

So the latency and keeping the pipe full may lead to some surprises where multiple clients improve performance.
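
And a minimal sketch of the latency measurement from the third idea (client side only; it uses clock_gettime rather than any NNG timing facility, and a 1-byte payload):

    #include <stdio.h>
    #include <time.h>
    #include <nng/nng.h>

    /* Time n request/reply round trips on an already-connected REQ socket and
     * report the average; subtracting the server's own processing time from
     * this gives an estimate of the transport latency. */
    static void
    measure_rtt(nng_socket req, int n)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < n; i++) {
            char   buf[1] = { 0 };
            size_t sz     = sizeof(buf);
            nng_send(req, buf, sizeof(buf), 0); /* error handling elided */
            nng_recv(req, buf, &sz, 0);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("average round trip: %.1f us\n", us / n);
    }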

More info as I think on it.


gdamore commented Dec 30, 2023

Looking at inproc, I think it was designed for safety, and in particular it has some guards to prevent recursive lock entry. As a result, we wind up not "completing" jobs inline that we might otherwise be able to, which forces us to go to the taskq to break the lock cycle.

I do believe that there is some opportunity to improve this -- in particular, it seems like some of the locking between the readers and writers could be broken by completing the aios outside of the lock. I have done this for some other transports but haven't paid much attention to inproc.

Partly I don't think of people doing much performance critical work with inproc. So it hasn't gotten the attention that some of the other transports have (particularly for performance). I usually assume that folks who have performance critical needs and could use inproc would just use function calls. :-)

If you're on UNIX / Linux, you might want to look at the socket:// transport I just integrated. It's easier to use than ipc (no filesystem paths; it uses socketpair()) but should have performance very close to ipc:// because it is backed almost identically. (There are two differences -- we don't write the extra flags byte for socket://, so that saves a tiny bit, and we use write() rather than sendmsg(), which probably won't be noticeable. I could change that if we found a reason to do so.)


gdamore commented Nov 8, 2024

I just had another look at those flame graphs.

Your code is apparently allocating and freeing nng_aio structures, and that is responsible for a large portion of the work.

You can avoid this in your application code.

Create one or more nng_aio structures ahead of time, then use callback-driven aio. This will reduce latency, reduce context switches, and completely eliminate the overhead of allocation/free.

Use nng_recv_aio and nng_send_aio if you can. You'll need to use nng_msg structures, but that will also avoid some extra data copying.
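
A minimal sketch of that pattern for the REQ side, assuming a single preallocated nng_aio that alternates between the send and receive phases (the state struct and the empty request body are illustrative; error handling is abbreviated):

    #include <stdbool.h>
    #include <nng/nng.h>

    typedef struct {
        nng_socket sock;
        nng_aio   *aio;
        bool       sending; /* true while our request is in flight */
    } req_loop;

    /* Runs on an NNG taskq thread each time the aio completes; no nng_aio is
     * allocated or freed per message. */
    static void
    req_cb(void *arg)
    {
        req_loop *rl = arg;

        if (nng_aio_result(rl->aio) != 0) {
            return; /* socket closed or error: stop resubmitting */
        }
        if (rl->sending) {
            /* request went out; wait for the reply on the same aio */
            rl->sending = false;
            nng_recv_aio(rl->sock, rl->aio);
        } else {
            /* reply arrived; free it and fire the next request */
            nng_msg_free(nng_aio_get_msg(rl->aio));
            nng_msg *req;
            if (nng_msg_alloc(&req, 0) != 0) {
                return;
            }
            rl->sending = true;
            nng_aio_set_msg(rl->aio, req);
            nng_send_aio(rl->sock, rl->aio);
        }
    }

    static int
    start_req_loop(req_loop *rl, nng_socket sock)
    {
        int      rv;
        nng_msg *req;

        rl->sock = sock;
        if ((rv = nng_aio_alloc(&rl->aio, req_cb, rl)) != 0 ||
            (rv = nng_msg_alloc(&req, 0)) != 0) {
            return rv; /* aio is allocated once and reused for every exchange */
        }
        rl->sending = true;
        nng_aio_set_msg(rl->aio, req);
        nng_send_aio(rl->sock, rl->aio);
        return 0;
    }

The REP side can follow the same pattern with the phases reversed (receive first, then send the reply on the same aio).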
