Server performance & optimization #1052

WolfganP · 2020-07-11T23:43:53Z

WolfganP
Jul 11, 2020

Follows from #339 (comment) for better focus of the discussion.

So, as the previous issue started to explore multi-threading on the server for better use of resources, I first run a profiling of the app on debian.

Special build with:
qmake "CONFIG+=nosound headless noupcasename debug" "QMAKE_CXXFLAGS+=-pg" "QMAKE_LFLAGS+=-pg" -config debug Jamulus.pro && make clean && make -j

Then run as below, and connecting a couple of clients for a few seconds:
./jamulus --nogui --server --fastupdate

Once disconnecting the clients I gracefully killed the server
pkill -sigterm jamulus

And finally run gprof, with the results posted below:
gprof ./jamulus > gprof.txt

https://gist.github.com/WolfganP/46094fd993906321f1336494f8a5faed

It would be interesting to see those who observed high cpu usage run test sessions and collect profiling information as well to detect bottlenecks and potential code optimizations, before embarking on multi-threading analysis that may require major rewrites.

dingodoppelt · 2020-07-15T08:15:23Z

dingodoppelt
Jul 15, 2020

Hi,
I don't know if this is of much help since I don't know what I'm doing here ;) but I generated the data as instructed on my private cloud server.

https://gist.github.com/dingodoppelt/802c40b1cb13c75d96f38b9604fa22df

cheers, nils

0 replies

WolfganP · 2020-07-15T12:42:35Z

WolfganP
Jul 15, 2020
Author

Thanks @dingodoppelt Could you please describe the test session/environment? (ie, how many clients connected, which hardware/Operating System you were running the server on, whatever you feel noticeable)

0 replies

WolfganP · 2020-07-15T12:56:13Z

WolfganP
Jul 15, 2020
Author

@sthenos you mentioned in https://www.facebook.com/groups/507047599870191/?post_id=564455474129403&comment_id=564816464093304 that you're running the server in linux now.
Would it be possible for you to run a profiling session in any of the casual/preparation jam sessions with multiple clients to measure REAL server stress? (obviously not during the WJN main event :-)

0 replies

dingodoppelt · 2020-07-16T01:45:17Z

dingodoppelt
Jul 16, 2020

I tested with 12 clients connected from my machine with small network buffers enabled on 64 samples buffersize.
the server is a cloud hosted kvm virtual rootserver (1fire.hosting) with 2 vcpus, 1 gig of ram on ubuntu 20.04 lowlatency.
If i get the opportunity I'll repeat the test on my other server in a real life scenario.
cheers, nils

0 replies

storeilly · 2020-07-18T09:56:06Z

storeilly
Jul 18, 2020

Are you still interested in these data? I can run a few tests on ubuntu over the weekend.
Is there a particular release or tag we should checkout as I tried last week to compile but it froze my server.

0 replies

pljones · 2020-07-18T11:11:28Z

pljones
Jul 18, 2020
Maintainer

One quick comment @WolfganP -- for some reason you build command line (rather than a simple qmake) causes the TARGET = jamulus line to trigger... I don't understand why!

EDIT Dawn strikes... Yes, it does have noupcase in the CONFIG+=... I just couldn't see it...

So if you're on anything but Windows, you'll probably want to mv it back again, otherwise start up scripts, etc, won't work.

Final edit to note: jamulus.drealm.info is running with profiling. I'll leave it up over the weekend so it should amass a fair amount of data. I'll run the gprof on Monday. Obviously a bit more "real world", as I run with logging and recording enabled, so I'm expecting different number...

0 replies

pljones · 2020-07-19T10:21:30Z

pljones
Jul 19, 2020
Maintainer

A different view should come from the Rock and Clasical/Folk/Choir genre servers that I've just updated to r3_5_9 with profiling.

make distclean
qmake "CONFIG+=nosound headless debug" "QMAKE_CXXFLAGS+=-pg" "QMAKE_LFLAGS+=-pg" -config debug Jamulus.pro
make -j
make clean

They probably won't show much OPUS usage but this should show anything that's "weird" with server list server behaviour (although they only have about 20 registering servers, until Default).

I wasn't sure what CONFIG+=debug and -config debug added -- the code appeared to have symbols regardless.

0 replies

WolfganP · 2020-07-19T20:54:49Z

WolfganP
Jul 19, 2020
Author

@pljones yes, I added debug flags to qmake just to make sure all symbols are included and no stripping is applied.
Anyways, a good way to check if symbols were included in the final exec and gprof instrumentation applied is objdump --syms jamulus | grep -i mcount (mcount* being the snippets of code added for profiling instrumentation)

0 replies

pljones · 2020-07-19T21:13:54Z

pljones
Jul 19, 2020
Maintainer

Standard build:

peter@fs-peter:~$ objdump --syms git/Jamulus-wip/Jamulus | grep -i mcount
0000000000000000       F *UND*  0000000000000000              mcount@@GLIBC_2.2.5

This just changing the binary to "Jamulus", IIRC:

peter@fs-peter:~$ objdump --syms git/Jamulus/Jamulus | grep -i mcount
0000000000000000       F *UND*  0000000000000000              mcount@@GLIBC_2.2.5

This was
make distclean; qmake "CONFIG+=nosound headless debug" "QMAKE_CXXFLAGS+=-pg" "QMAKE_LFLAGS+=-pg" -config debug Jamulus.pro; make -j; make clean

peter@fs-peter:~$ objdump --syms git/Jamulus-stable/Jamulus | grep -i mcount
0000000000000000       F *UND*  0000000000000000              mcount@@GLIBC_2.2.5

Had a few people tonight noticing additional jitter. Not everyone... Those who noticed - myself included - had just upgraded to 3.5.9. No idea why... (I "fixed" it for the evening by upping my buffer size from 64 to 128.)

14 clients connected to the server and it's looking like this in top:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
32704 Jamulus  -21   0  320.3m  28.6m   9.1m S 10.8  0.2  80:09.58 Jamulus-server
 1066 root     -51   0    0.0m   0.0m   0.0m S  0.2  0.0  10:15.69 irq/130-enp2s0-
 1070 root     -51   0    0.0m   0.0m   0.0m S  0.1  0.0   3:27.74 irq/132-enp2s0-
 1071 root     -51   0    0.0m   0.0m   0.0m S  0.1  0.0   4:00.96 irq/133-enp2s0-

Mmm, I guess those enp2s0 IRQ handlers are a bit busy as that's the network interface... There are actually five, it seems:

peter@fs-peter:~$ ps axlww -L | grep enp2s0
1     0  1063     2  1063 -51   -      0     0 -      S    ?          0:00 [irq/129-enp2s0]
1     0  1066     2  1066 -51   -      0     0 -      S    ?         10:06 [irq/130-enp2s0-]
1     0  1068     2  1068 -51   -      0     0 -      S    ?          0:01 [irq/131-enp2s0-]
1     0  1070     2  1070 -51   -      0     0 -      S    ?          3:22 [irq/132-enp2s0-]
1     0  1071     2  1071 -51   -      0     0 -      S    ?          3:56 [irq/133-enp2s0-]

but it copes with only three in heavy demand. 129 and 131 seem left out.

Let's see what the gprof looks like in the morning :).

0 replies

pljones · 2020-07-20T06:28:10Z

pljones
Jul 20, 2020
Maintainer

OK, I decided to restart the central servers without profiling before I totally forget, so all the number are now in.
Jamulus-Central1 gprof.out
Jamulus-Central2 gprof.out
jamulus.drealm.info gprof.out

0 replies

WolfganP · 2020-07-20T17:55:09Z

WolfganP
Jul 20, 2020
Author

OK, I decided to restart the central servers without profiling before I totally forget, so all the number are now in.
Jamulus-Central1 gprof.out
Jamulus-Central2 gprof.out
jamulus.drealm.info gprof.out

Thx for the info @pljones, good to also have some performance info for the Central Server role.

Regarding the info on the audio server role, it seems it confirms the CPU impact of CServer::ProcessData and some Opus routines (I assume that as a result of the mix processing inside CServer::OnTimer), and that make sense (at least to me).
On the Opus codec front, I found this article worthy to read: https://freac.org/developer-blog-mainmenu-9/14-freac/257-introducing-superfast-conversions/ (code at https://github.com/enzo1982/superfast#superfast-codecs)

Another item I think needs some attention (or verify is already optimized) is the buffering of audio blocks to avoid unnecessary memcopies. But still reading the code :-)

0 replies

WolfganP · 2020-07-20T17:56:22Z

WolfganP
Jul 20, 2020
Author

Are you still interested in these data? I can run a few tests on ubuntu over the weekend.
Is there a particular release or tag we should checkout as I tried last week to compile but it froze my server.

Of course @storeilly the more information the better to compare performance on diff use cases and verify common patterns of CPU usage, and direct optimization efforts.

0 replies

storeilly · 2020-07-21T21:40:57Z

storeilly
Jul 21, 2020

Here is a short test on GCP n1-standard-2 (2 vCPUs, 7.5 GB memory) ubuntu 18.04.
A single user connected for a few seconds. I'm running the server overnight with two instances, one on a private central and another on jamulusclassical.fischvolk.de:22524 for more data

gprof.txt

0 replies

storeilly · 2020-07-22T16:18:00Z

storeilly
Jul 22, 2020

jamprof01.txt
jamprof02.txt

Overnights with 1 or 2 connections... Choir meeting later so will run again after that

0 replies

WolfganP · 2020-07-22T17:29:07Z

WolfganP
Jul 22, 2020
Author

Thanks @storeilly for the files, but those last 2 indicate a period of app usage extremely short, it doesn't even register significant stats to evaluate (even the cumulative times are in 0.00).

0 replies

gene96817 · 2020-12-04T19:20:02Z

gene96817
Dec 4, 2020

I am wondering if we are taking the wrong perspective on performance with cloud services. Each cloud service has different approaches to maximize utilization of their computing and networking resources. Jamulus is unique because we care about real-time performance. Most (the ideal) cloud apps care more about lots of computing in burst and less about real-time performance (or real-time to these apps are in the 100s of milliseconds). Task switching means buffering and we know buffering means latency. As we measure the load for additional clients, we should be looking at how buffering and latency changes.

0 replies

maallyn · 2020-12-04T19:22:34Z

maallyn
Dec 4, 2020

Folks:
I have a Linode dedicated (2 CPU) machine (newark-music.allyn.com) which is at the latest stable (3.6.1) compiled with config = nosound. I did not notice any configs for multi-thread or buffer size when I looked at the Jamulus.pro file.

I hope this can help. I am willing to spend the extra for a dedicated 4 cpu for a short while if you thank that will help.

0 replies

dingodoppelt · 2020-12-06T11:33:46Z

dingodoppelt
Dec 6, 2020

Folks:
I have a Linode dedicated (2 CPU) machine (newark-music.allyn.com) which is at the latest stable (3.6.1) compiled with config = nosound. I did not notice any configs for multi-thread or buffer size when I looked at the Jamulus.pro file.

hi there,
i couldn't fit more than 20 clients on your machine. are you running it with -F and -T parameters?

I have been playing around with sysbench, a tool for performance measurement and i found that cloud server performance is pretty good cpu-wise but awful for memory performance where my dedicated machine really shines. i ran this test on my home machine:

sysbench --threads=`nproc` memory run 
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 8
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 104857600 (19856820.40 per second)

102400.00 MiB transferred (19391.43 MiB/sec)


General statistics:
    total time:                          5.2791s
    total number of events:              104857600

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    1.84
         95th percentile:                        0.00
         sum:                                28463.16

Threads fairness:
    events (avg/stddev):           13107200.0000/0.00
    execution time (avg/stddev):   3.5579/0.08

and on my cloud server:

sysbench --threads=`nproc` memory run
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 4
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 21430802 (2142010.21 per second)

20928.52 MiB transferred (2091.81 MiB/sec)


General statistics:
    total time:                          10.0010s
    total number of events:              21430802

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                   10.76
         95th percentile:                        0.00
         sum:                                34558.68

Threads fairness:
    events (avg/stddev):           5357700.5000/266087.46
    execution time (avg/stddev):   8.6397/0.05

this doesn't look too good in comparison. maybe this is the bottleneck? do you think sysbench could be a reliable tool to measure server performance instead of trial and error or are there any other tools i could try?

0 replies

pljones · 2020-12-06T12:38:05Z

pljones
Dec 6, 2020
Maintainer

How much memory is used by a Jamulus server thread?

0 replies

brynalf · 2020-12-06T13:21:38Z

brynalf
Dec 6, 2020

On my windows system at home a server uses only about 60 MB memory.
It scales slightly with the number of attached clients. Example: 0 clients ==> 56 MB, 50 clients ==> 59 MB.
Edit: You probably were asking about a specific case above, so never mind my answer :)

0 replies

pljones · 2020-12-06T14:34:54Z

pljones
Dec 6, 2020
Maintainer

Edit: You probably were asking about a specific case above, so never mind my answer :)

I meant Jamulus memory usage, as you've given.

The test was about memory throughput, if I read it correctly. If Jamulus isn't memory constrained, then the test shown won't be representative of Jamulus performance.

0 replies

dingodoppelt · 2020-12-07T07:20:14Z

dingodoppelt
Dec 7, 2020

The test was about memory throughput, if I read it correctly. If Jamulus isn't memory constrained, then the test shown won't be > representative of Jamulus performance.

I just wondered if that might be the issue with the cloud servers. The CPU performance is fine and doesn't really deviate from what I measure on real hardware. The only thing I could find using sysbench was the restricted performance on memory throughput in comparison to real hardware so I figured this might be another thing to look at since cloud servers die long before the CPU is used up.

0 replies

pljones · 2020-12-07T07:31:55Z

pljones
Dec 7, 2020
Maintainer

The CPU performance is fine and doesn't really deviate from what I measure on real hardware.

On average, that may be true. Are you getting a reading for consistency of performance - i.e. how much the CPU performance deviates between maximum throughput and minimum? As noted above, it's that stability that Jamulus needs and which directly affects its capacity.

0 replies

corrados · 2020-12-07T20:35:42Z

corrados
Dec 7, 2020
Maintainer

@dingodoppelt I don't know if you are on facebook, but there is a report about successfully have 53 clients connected to a Jamulus server on a 4 CPU virtual server: https://www.facebook.com/groups/619274602254947/permalink/811257479723324: "Had 53 members of a youth orchestra this evening on Jamulus (and another 15-20 listening on Zoom). Took about 90 minutes of setup so we only got through a reading of Jingle Bells at the end but it was a great first step! AWS 4 vCPU server hit ~55%."

0 replies

dingodoppelt · 2020-12-07T20:46:57Z

dingodoppelt
Dec 7, 2020

@corrados : my server does this, too. but not with every client on small network buffers. I've played on servers with around 50 people, but you can never tell if everybody has small network buffers enabled. in my tests i connected every client with the same buffersize and small network buffers enabled. It only worked for me on dedicated hardware (namely WorldJam, Jazzlounge servers)

0 replies

sbaier1 · 2020-12-12T12:14:48Z

sbaier1
Dec 12, 2020

haven't done any testing / thorough research here yet, but just a heads up: there are also several kernel parameters for the UDP networking stack and general network parameters that could be tuned with sysctl that might have a positive effect:
(Is jamulus ever network-bound?)

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

0 replies

gene96817 · 2020-12-12T18:34:08Z

gene96817
Dec 12, 2020

@sbaier1 I am out in the Internet frontier (aka far from the small distances at Europe) and see the network behavior being a dominant contributor to latency dependent performance. I'd be interested in a discussion about what can be done to quench traffic and discard packets. These mechanisms might be a good way to improve the performance (at the expense of audio interruption, which would be happening anyway). Especially with regards to a different thread on buffer backups (they called it buffer-bloat), the only way to manage problems in the network with packet backup at some routers would be creating some code to detect backups and quench traffic. I have some musicians that will "tolerate" 20-70 ms latency rather than not have music. Actively managing the packet rate up at 40+ ms would greatly improve the experience. (Note, I am thinking that some of the buffer back-ups is the interaction between our UDP traffic and other people's TCP cross traffic.)

0 replies

ann0see · 2021-02-06T17:38:10Z

ann0see
Feb 6, 2021
Maintainer

There's a new PR for multi threading: #960

0 replies

menzels · 2021-02-06T19:32:33Z

menzels
Feb 6, 2021

thanks @ann0see for pointing me to this thread. i could not read all comments here but i want to add my findings to the thread.

i noticed difficulties fitting more than about 17 clients on my hetzner cloud vServer. even though i tried configurations from 2 to 16 cores.
so i took a closer look at the multithreading code on the jamulus server side and found that it did not distribute cpu load to more cores soon enough. specifically i am talking about the MixEncodeTransmitData function that uses about 60-70% of all cpu time in jamulus.
i changed this code to distribute the load evenly among cpu cores and found that it fixed my problem.

my guess is that the cpu cores on my cloud server are not as strong as the ones the current multi threading code was developed for, i see a comment mentioning amazon cloud servers there.
i think my change is a big improvement here, because it will just distribute the load evenly between cores, and does not need to know about the power of cores.

i only tested this change with up to 21 clients. and i see much better cpu usage between cores.

20 replies

GigiDietrich Mar 19, 2021

Oke thank you. I will :)

GigiDietrich Mar 19, 2021

Hi Menzels,

I just installed jamulus in the 3.6.2 version, but something is going wrong here.
With 9 clients I get good ping and seem to only be using 1 cpu.
The moment the 10th client joins the multithreading seems to kick in (which is great), but ping also jumps to 500+

For some reason jamulus is no longer in /usr/local/bin/Jamulus, but in /usr/bin/jamulus. is this normal?

menzels Mar 22, 2021

i tried to reproduce this on my server and did not succeed.
for me everything runs fine. we did multiple rehearsals now with this exact deb package, i also used your systemd unit script the last time.
i don't use the low latency kernel though.

yes jamulus is in /usr/bin/jamulus for me too. i think that is normal.

EriqJB Mar 22, 2021

Is it possible that your hosting provider limits access when you connect with 16 clients simultaneously? I've seen this behaviour on my IONOS server, ping then also rises to >500 ms, and after some time, access is even shut off completely. CPU also stays low all the time. Does your ping drop back to normal when you disconnect the 10th client?

GigiDietrich Mar 22, 2021

Yes this happens to me! 9 clients all is fine. Connect a 10th and ping jumps to 500. Disconnect 1 and everything is fine again. Did you manage to solve it?

EriqJB · 2021-03-22T11:58:52Z

EriqJB
Mar 22, 2021

I also started playing around with the performance of Jamulus, since I would like to host an event in April with 50+ singers. Looking good so far on a 4 vCPU core virtual server at IONOS. A problem I had with load testing seems to originate in the DDoS detection system of IONOS – they seem to shut down network access to the server for about 60 minutes if I open many Jamulus sessions in short time (all originating from the same machine).

Nevertheless, I would like to share my load driver script with you, which creates n Jamulus instances, connects to a server and sends pink noise there. It is inspired by @maallyn's script above, but also wires the Jamulus inputs and the sound source automatically. (If I do it manually, I find njconnect quite convenient.) This way, it can run on a headless server. On Ubuntu, I have to first run sudo apt-get install jackd2 jnoise.

Here's the start script, which has to be started with ./jamtest-start.sh <number_instances> <server_uri>:

#!/bin/bash
# echo "Starting jack …"
jack_control start
# echo "Jack started."
echo "Starting jnoise …"
jnoise &
echo "Jnoise started."
for i in $(eval echo {1..$1})
do
	echo Starting Jamulus instance jamtest$i …
	jamulus -j -n --connect $2 --clientname jamtest$i &> /dev/null &
	sleep 1 # We need to give Jamulus a second to start up
	echo Jamulus instance jamtest$i started
	echo Connection Jack …
	jack_connect jnoise:pink "Jamulus jamtest"$i:"input left"
	echo Jack connected
done

We can use jack_lsp to see the connections in Jack.

To stop the test, I use jamtest-stop.sh which contains

#!/bin/bash
killall jamulus
killall jnoise
jack_control stop

The script is a bit hacky and not very elaborated, please feel free to improve it.

One caveat: Jack has a hard-coded limit of 64 clients per machine (https://github.com/jackaudio/jack2/blob/develop/common/JackConstants.h). To circumvent this, one would have to build a custom version of Jack from source. (Or use more machines as load drivers.)

0 replies

Server performance & optimization #1052

Replies: 288 comments · 20 replies

WolfganP Jul 15, 2020 Author

WolfganP Jul 15, 2020 Author

pljones Jul 18, 2020 Maintainer

pljones Jul 19, 2020 Maintainer

WolfganP Jul 19, 2020 Author

pljones Jul 19, 2020 Maintainer

pljones Jul 20, 2020 Maintainer

WolfganP Jul 20, 2020 Author

WolfganP Jul 20, 2020 Author

WolfganP Jul 22, 2020 Author

pljones Dec 6, 2020 Maintainer

pljones Dec 6, 2020 Maintainer

pljones Dec 7, 2020 Maintainer

corrados Dec 7, 2020 Maintainer

ann0see Feb 6, 2021 Maintainer

Replies: 288 comments 20 replies

WolfganP
Jul 15, 2020
Author

WolfganP
Jul 15, 2020
Author

pljones
Jul 18, 2020
Maintainer

pljones
Jul 19, 2020
Maintainer

WolfganP
Jul 19, 2020
Author

pljones
Jul 19, 2020
Maintainer

pljones
Jul 20, 2020
Maintainer

WolfganP
Jul 20, 2020
Author

WolfganP
Jul 20, 2020
Author

WolfganP
Jul 22, 2020
Author

pljones
Dec 6, 2020
Maintainer

pljones
Dec 6, 2020
Maintainer

pljones
Dec 7, 2020
Maintainer

corrados
Dec 7, 2020
Maintainer

ann0see
Feb 6, 2021
Maintainer