Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increasing bandwidth of multi-device drivers #1499

Closed
knro opened this issue Jun 27, 2021 · 45 comments
Closed

Increasing bandwidth of multi-device drivers #1499

knro opened this issue Jun 27, 2021 · 45 comments

Comments

@knro
Copy link
Contributor

knro commented Jun 27, 2021

Currently some multi-device drivers (e.g. QHY) can run multiple cameras in a single instance connected to INDI server. The driver and server are connected via pipes (stdin/stdout). When a BLOB is sent from the driver to the server, a base64 encoded BLOB is sent via an INDI XML message. The server parses the BLOB and redirects it to interested clients.

A primary limitation of this method is the congestion of the pipes when multiple devices start sending BLOBs at the same time; this is especially evident with multiple large-sensor cameras begin streaming simultaneously. This is further exacerbated by a single-threaded INDI server.

There are a few possible solutions to this:

  1. Make each driver run only a single device.
  2. Use another method to transfer data (such as shared memory region) and make each driver handler runs in its own thread in INDI server.
  3. Make BLOB transfers using dedicated channel between drivers and server?
@knro knro added this to the 1.9.2 milestone Jul 6, 2021
@knro
Copy link
Contributor Author

knro commented Jul 6, 2021

@k-ross @jpaana @pawel-soja Can you please share your thoughts on this?

@k-ross
Copy link
Contributor

k-ross commented Jul 6, 2021

Are you sure the pipe is the bottleneck? What happens once all the feeds are aggregated into a single stream to send to the client connected to the indiserver? Wouldn't that be even more of a bottleneck? What kind of throughput do we actually need?

I did a quick test and measured the actual throughput of piping data through stdout, and I get approx. 2.6 Gbps, which is faster than most TCP/IP networks. This was on my Raspberry Pi, the slowest computer I own. :)

The sender:

#include <stdio.h>

int main(int argc, char*argv[])
{
	for(int i = 0; i < 10000000; i++)
	{
		printf("1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890\n");
	}
	printf("EOF\n");

	return 0;
}

The receiver:

#include <stdio.h>
#include <string.h>

int main(int argc, char*argv[])
{
	auto fp = popen("./sender", "r");

	char buf[1024];
	buf[0] = '\0';
	int bytesReceived = 0;
	
	while(!feof(fp))
	{
		fgets(buf, 1024, fp);
		bytesReceived += strlen(buf);
	}
	
	fclose(fp);
	
	printf("%d bytes received\n", bytesReceived);
	
	return 0;
}

The result, on a Raspberry Pi 4:

pi@telescope:~ $ time ./receiver
1010000008 bytes received

real	0m2.956s
user	0m3.397s
sys	0m2.181s

If we need more throughput than this, my suggestion is to offload the incoming streams to separate threads in the indiserver, since I suspect any bottleneck is probably the processing of the incoming data, and not the actual transfer of the data.

@knro
Copy link
Contributor Author

knro commented Jul 6, 2021

Thank you Kevin! Let's take an example:

  • Driver QHY <-----> INDI Server
    • QHY Camera 1
    • QHY Camera 2
  • Driver SBIG <----> INDI Server

Both of these drivers are connected to INDI servers via a bidirectional pipe (stdin/stdout). For QHY driver, each camera has its own thread when sending out the data to INDI server. If the pipe is not the bottleneck, then INDI server should be since it is single threaded.

As far as client goes, in KStars, I actually open a separate client to INDI server for each BLOB, so each blob is acting as a separate client to INDI server. I believe the next step is to introduce multi-threading for that each driver is handled in its own thread, and each client is handled in its own thread as well. IIRC, @pawel-soja suggested we migrate INDI server to C++ as well which would make this even easier, and perhaps even achieve more cross-platform compatibility.

@k-ross
Copy link
Contributor

k-ross commented Jul 6, 2021

Have you done any instrumentation to determine where the bottleneck is? Have you timed how long it takes in indiserver to read the data, vs timing how long it takes to process the data? Maybe the slowness is parsing the XML, doing base64 decoding (if you're doing that), or just moving the data around and making unnecessary copies. All could slow things down, and in a single threaded server, any one area of slowness will affect everything.

The good news with single threaded servers, they are usually a lot easier to profile, and find the bottlenecks. :)

@knro
Copy link
Contributor Author

knro commented Jul 6, 2021

You're right, let me create a benchmark we can check against.

@pludov
Copy link
Contributor

pludov commented Aug 11, 2021

Hi

I have been thinking to this long time ago :

It would be natural to exchange blobs as filedescriptor over unix domain socket. The fds exhange is a special kind of message on that domain. Each fd would refernce an anonymous memory (memfd_create). So local clients can receive raw data from driver without the server or the kernel having to process them at all. Also multiple client processes would efficiently share the memory space

AFAIK this method is used by kdbus for performance

https://indilib.org/forum/development/5465-idea-for-a-zero-copy-blob-transport-protocol-in-indi.html

i would be willing to help here...

@knro
Copy link
Contributor Author

knro commented Aug 11, 2021

That would be awesome @pludov especially if we can also implement multi-threading in INDI server along the way.

@k-ross
Copy link
Contributor

k-ross commented Aug 12, 2021

It would be natural to exchange blobs as filedescriptor over unix domain socket. The fds exhange is a special kind of message on that domain. Each fd would refernce an anonymous memory (memfd_create). So local clients can receive raw data from driver without the server or the kernel having to process them at all. Also multiple client processes would efficiently share the memory space

I'm not opposed to this idea, but I have some questions.

Why send the FD over a Unix domain socket? Why not include the FD in the XML response?

Does each blob get a new FD? Or is one shared memory space kept open and reused? If it's reused, how do we communicate locking back and forth between client and server, so server knows when client is done and can overwrite with new blob? I guess one approach is to allocate a large chunk of shared memory, and treat it like a ring buffer.

Would this be significantly faster than sending the blob in binary form, instead of base64, over a TCP socket? Sending the blob in binary form over a TCP socket would allow non-local clients to also enjoy a performance boost.

@pludov
Copy link
Contributor

pludov commented Aug 13, 2021

FD is a special information that pass from process to process, like a pointer to a buffer on kernel side. It has special handling from the kernel when exchanged between processes (over the AF_UNIX communication domain). you cannot just embed a number in the existing protocol (the fd actual value on the send will make no sens on the receiver).

Each blob will get its own FD, the emitter is responsible for allocating it. The inter process passing and sharing are handled by the kernel. Basically, the kernel keeps the buffer as long as it is referenced by at least one process. To access actual buffer data, processes need to "mmap" it. In the case of the indiserver, this mmap is not required, so the cost of a blob exchange within the indiserver would not be related to the size of the blob : all messages would be almost equals in processing time on its side.

FD exchange cannot be used over TCP connection (which are presumably not on the same host, so no way to share memory). We need to keep the actual base64 protocol to keep support for remote TCP. The Unix/shared buffer protocol would still benefit the indi_server and local client (no encoding/decoding here, faster message exchange, possibly no memory copy at all).

It would require a "converter" at the TCP/remote endpoint for the support of remote connection. This converter can be a per client process/thread, that way we would get a fast indiserver, pure Unix/shared buffer, and a per connection thread for actual base64 encoding (which should always be enough, since one cpu doing base64 will certainly always be able to saturate a tcp connection).

On the implementation side, I propose to implement in the following order:

  • first a protocol converter (tcp <=> unix/socket), to ensure every aspect is covered. I would test plugging two converter together. It would be a transparent indi proxy with no feature loss .
  • then a indiserver speaking 100% unix/socket, and using the converter for client and driver
  • then update drivers/indi library to natively speak the new protocol (connect to unix domain, send dgram & blobs...)

I would go with an event loop library (libev) for indiserver and its "tcp connector"

@knro
Copy link
Contributor Author

knro commented Aug 13, 2021

@pludov do you think you can work on some sort of proof of concept sample for this? I think if we can reduce the copying between drivers and indiserver on the same host that would bring tons of benefit. For remote clients & remote drivers (that are snooping on BLOBs), we should always fallback to TCP/Base64.

Now this could also be extended to LOCAL clients. Perhaps they can even be aware of this feature and can support, but base64 should also be there to maintain compatibility.

@pludov
Copy link
Contributor

pludov commented Aug 13, 2021

@knro : i'll create a tcp/unix message converter first because it will be required for any latter real testing. And the code will be the basis of the unix socket indiserver. Given my spare time availble, it will take some weesk...

@pludov
Copy link
Contributor

pludov commented Aug 15, 2021

I started working on that, starting with C++ convertion of indiserver, here: https://github.com/pludov/indi/tree/fast-blob-transfer

I'll add libev for eventloop and go on with implementing communication over unix datagram sockets

Any idea why I need to add libs/ to the lilxml include ? master is not building on my ubuntu 18 box otherwise...

#include "libs/lilxml.h"

@knro
Copy link
Contributor Author

knro commented Aug 16, 2021

Excellent progress! Looking forward to seeing the first operational C++ INDI server. The lilxml issue is minor and probably we just need to add an include in CMakeLists.txt later on.

@pludov
Copy link
Contributor

pludov commented Aug 17, 2021

I now have a locally working c++ indiserver based on libev. But the CI fails because libev is not installed on the build containers... I'd like to have this working to validate libev compatibility (esp on macos...) How can i fix this ?

@knro
Copy link
Contributor Author

knro commented Aug 17, 2021

Is it possible to edit the yaml file to apt-get install that?

@pludov
Copy link
Contributor

pludov commented Aug 18, 2021

The trick was to use the correct docker registry. Forks push to their owner's repo, but still uses container from the main repo for test. I fixed that.

I am almost finished with c++ convertion & libev integration. It is working for my simple test setup (no remote drv, ...) Would you consider a merge request for this part, before I go on with testing datagram implementation ?

@knro
Copy link
Contributor Author

knro commented Aug 18, 2021

A merge request is a bit early now. It is better to demonstrate the whole spectrum. I will checkout your branch and give you feedback once you let us know it's ready for first phase testing.

@pludov
Copy link
Contributor

pludov commented Aug 21, 2021

Not ready for testing but I now have a working driver=>indiserver connection over unix socket with blob exchanged as shared buffers :-)

It still inefficient as base64 encoding is done in the indiserver eventloop for all clients. I'll now to move that away from indiserver eventloop and allow local clients to receive blob as shared buffer.

Optionally, drivers will then need (light) adaptation to write directly their blob to shared buffer (avoiding copy here as well)

@knro
Copy link
Contributor Author

knro commented Aug 23, 2021

So this needs change in INDI::CCD? This shouldn't be hard to do. We also need to perhaps even include clients to gain from this, if possible. Looking forward to testing this and comparing the results.

Edit: I know INDI::CCD is just for cameras, but this is where the bulk of BLOBs transfers are happening right now. This along with INDI::StreamManager

@pludov
Copy link
Contributor

pludov commented Aug 23, 2021

I just pushed an implementation that can use shared buffer for blob delivery from driver to libindiclient (lots of FIXME but that path is OK).

So I did some quick benchmark :-)

I used the following "setup":

  • ccd_simulator in 8 bit video streaming 5120x4096 (just disabled ShowStarField to get more throughput out of it)
  • a simple libindiclient test program that dump the stream to a file
  • Ran my 4 core AMD Ryzen 5 system

indi 1.9.1:

cpu    time      process
232,2  4:27.11   indi_simulator_ccd
 95,7  0:22.29   streamer
 46,5  0:39.90   indiserver
374,4               total
FPS : 9.94 (Recorded: 9.15)

shared buffer version:

cpu    time      process
172,1   4:45.23 indi_simulator_
 14,3   0:03.48 streamer
  0,3   0:00.50 indiserver
186.7              total
FPS: 10.86 (Recorded: 10.75)

My interpretation:

  • the overall cpu gain for now is 50%
  • the win is clear on the client and server side
  • indi_simulator_ccd is the bottleneck in my scenario
  • The result FPS is not that impacted because the saturation is within indi_simulator_ccd and there are enough cores on my system to avoid perf impact on the server & client

Also that test does not measure the latency on the driver side. But with such a low cpu usage on indiserver, it will be able to keep up with a lot of drivers traffic :-)

I didn't look at streammanager or INDI::CCD in details for now. As you see there is already an advantage for drivers without more change (overall cpu is divided by 2), but there is probably something done in streammanager that consume a large amount of cpu for copy/convertion/... (provided i disabled all the star/noise rendering code... just producing horizontal bars here !)

@knro
Copy link
Contributor Author

knro commented Aug 23, 2021

Awesome progress! The CPU utilization is definitely much better but I expected we see a significant FPS boost from this as well since we avoided a lot of copying. What if you reduce the resolution? can you see more differences in FPS? how about with real cameras?

@pludov
Copy link
Contributor

pludov commented Aug 23, 2021 via email

@pludov
Copy link
Contributor

pludov commented Aug 24, 2021

The bottleneck in my scenario is GammaLut16::apply. It is however quite simple, I'll check the optimization level of the compiler...

@pludov
Copy link
Contributor

pludov commented Aug 25, 2021

Since indi_simulator is not able to saturate one indiserver even before shared buffer optimization, the apparent FPS gain was limited. So i tried to optimize effectiveness of ccd_simulator for my benchmark. It appears that lots of frames are lost because of they reach deliver while the previous image delivery is still ongoing. (previewThreadPool.try_start) I changed it for start which to my understanding create a queue of max 1 item. Makes some difference

indi 1.9.1 47M images (7680x6144) with previewThreadPool.start change:

208,3  6,9    indi_simulator_                                                                                                                                
99,0  0,7   streamer                                                                                                                                       
85,4  0,2    indiserver         
 => 7.5 FPS (was 5.5 without threadpool.start change)

shared buffer version, 47M images (7680x6144) with previewThreadPool.start change::

241,9  5,6  indi_simulator_                                                                                                                                
64,1  0,1    streamer                                                                                                                                       
0,3  0,0    indiserver    
 => 15.28 FPS (was  @7.9FPS without threadpool.start change)

In indi_simulator, the GammaLut16 function is always the slowest point. And then most of the remaining time is spent doing vector copy or initialization : lots of std::vector moves actually end up doing memcpy or memset indeed. Also my current shared buffer has one more copy for shared buffer initialization, that could be optimized by preallocating buffers

@knro
Copy link
Contributor Author

knro commented Aug 26, 2021

For that resolution, 15fps is impressive. This is also a good chance to optimize any bottlenecks or unnecessary copying in the pipeline. What's the next step?

@pludov
Copy link
Contributor

pludov commented Aug 26, 2021

I have the following point to address before other could seriously test:

  • Ensure base64 encoding on server side is outside the main loop
  • Ensure proper compilation on Linux CI
  • Fix indiapi.h (mostly ensure asc compatibility)
  • Properly disable the feature on Macos (I won't be able to properly dev/test that myself...)

At that point testing will be welcome. Also a good time to start a code review

The next step would be and move on to optimization of the driver / driver api side.
This will require discussions, as totally avoiding copy in this end may have impact on the exposed features/format (to propose configurations where data just fly through a blob without copy/endianess/bit convertion. It could be adding new streaming format...) We probably need to focus on some use cases to fully optimize.

But I think the indiserver should be merged before driver/driver api work occurs - as they are mostly independant

@knro
Copy link
Contributor Author

knro commented Aug 26, 2021

Sure, we can merge INDI server. Please submit a PR and we'll test it thoroughly under multiple scenarios. I presumed you tested it with chained servers?

Regarding MacOS, I can test on it, so no need to disable there if you need feedback. I have both Intel based and M1 Mac minis.

Regarding transfer formats, in #1200 I proposed a way to decouple capture vs transfer formats. For streaming, we send either motion jpeg if natively supported by the source, or raw 8bit or 24bit bitmap or convert the source to jpeg and send that.

@pludov
Copy link
Contributor

pludov commented Aug 29, 2021

I'll fix the four point above and test on my side (takes some time) before.
For MacOS, it won't build as is. It will need some adjustment (which includes are required, what constant exists or not there, ...).

By the way, I've found an existing bug as well about blob handling: #1528. Would be great to have some smoke test in CI about that.

@knro
Copy link
Contributor Author

knro commented Sep 7, 2021

@pludov Wanted to check if there has been any updates on this feature? anything to test now?

@pludov
Copy link
Contributor

pludov commented Sep 7, 2021 via email

@knro
Copy link
Contributor Author

knro commented Sep 28, 2021

@pludov Any good news about this?

@knro
Copy link
Contributor Author

knro commented Oct 20, 2021

Heya! Sorry to bother you again regarding this, but is there any update? Anything to test?

@pludov
Copy link
Contributor

pludov commented Oct 21, 2021 via email

@knro
Copy link
Contributor Author

knro commented Nov 6, 2021

Great, let me know if there is anything we can do to make this make it into 1.9.3. Can we break this into milestones that are easily trackable?

@knro knro modified the milestones: 1.9.2, v1.9.4 Nov 24, 2021
@knro
Copy link
Contributor Author

knro commented Nov 24, 2021

Alright, so I moved this to 1.9.4 milestone. @pludov do you think we can get something ready by then?

@knro
Copy link
Contributor Author

knro commented Dec 16, 2021

Hello @pludov I hope you're doing great. Just wanted to check in and see if there has been any progress here? What's the next step?

@pludov
Copy link
Contributor

pludov commented Dec 16, 2021 via email

@pludov
Copy link
Contributor

pludov commented Jan 17, 2022

I did move forward on that subject (you can see here: https://github.com/pludov/indi/ ) It's working on my test system with phd2 and mobindi. Still some todos here and here, especially in the qt-client.

To ensure feeback for blob delivery I had to add something to the protocol, because it's not possible at lower level: without protocol change/addition, sender cannot get knowledge of what the reader has received, and so may fill the socket buffers with blobs and trigger oom. This comes from the fact that the actual message size in byte occupied within the kernel socket buffer does not reflect the memory used. It's easy with big blobs attached to have GB of RAM used in a small socket buffer (especially if the receiver is busy). And lowering socket buffer to extremely low value (like < 100 bytes) would badly impact performances.

The idea I implemented is that a blob sender will not send a new blob while the previous one has not been acknowledge somehow.
I've implemented for that purpose pingRequest/pingReply messages on the indi protocol. this allows the sender to know that the receiver proceeded up to the given message.

This pingRequest/Reply is used when sending blobs. Before sending a blob (no2), the sender will ensure the previous one(no1) has been acknowledged:

  • send the blob no1
  • send a ping request
  • ... go back to normal business
  • wait for reply of previous ping request
  • send the blob no2

I think this mechanism is usefull only for using unix connection/shared blob. For existing connection it makes no difference as the blob size is largely bigger than buffering between client and server and no two blobs can be infly at the same time...

Also, I felt the need to add some automated integration tests to indiserver and baseclient. The idea here is to start one process (indiserver or indiclient) and instrument (mock) its connections to test various behaviours (at the xml/protocol level). You can see what it looks like in the /integs directory.
The pingRepquest/pingReply also is greatly usefull here, as it gives a way to ensure the exact timeline of a given test.

For example, when testing indiserver, ping helps in the following test :

  • we send enableBlob on the client side,
  • we send a blob value on the driver side,
  • we expect the blob to come up on the client side.
    If there is no synchronization, then there is not guaranty that the enableBlob will be handled by indiserver before the actual blob value. They may enter the process at the same time. In the end, the OS scheduler will decide in which order the receiving process will actually receive the data, and not necessarily in the sending order (from the select syscal, ...) We can try to workaround that be introducing delay, but it will always be subject to time variation and available resources, especially if automated in CI (leading to false breakage).

The alternative is to insert a pingRequest/reply to ensure the state of the server after the enableBlob. The test becomes:

  • we send enableBlob on the client side,
  • we send a pingRequest on the client side
  • we wait a pingRepy on the client side => it's now sure the enableBlob has been processed by the server
  • we send a blob value on the driver side,
  • we expect the blob to come up on the client side.

So I propose we add this "ping" mechanism to a new version of indi protocol to go on (with proper handling of previous version), for shared blobs and for integration tests. What do you think about it ?

@pludov
Copy link
Contributor

pludov commented Jan 20, 2022

Hi @knro : What's your opinion on the proposal above ? Any chance it would be accepted upstream ? We can discuss live if you prefer...

@knro
Copy link
Contributor Author

knro commented Jan 20, 2022

Thank you for the proposal, I need time to review it fully (just released INDI 1.9.4 yesterday and I should be able to look at it in depth this weekend).

@knro
Copy link
Contributor Author

knro commented Feb 19, 2022

Sorry, been busier than usual. Will get back to this soon. I think it's going to be extremely challenging to change INDI wire protocol, but not impossible if it is kept backward compatible.

@knro knro modified the milestones: 1.9.4, 1.9.6 May 21, 2022
@knro knro pinned this issue May 21, 2022
@knro
Copy link
Contributor Author

knro commented May 21, 2022

@pludov Sorry it took me such a long time to respond to this, but now that INDI 1.9.6 is ready for release, I've been thinking of how we can incorporate this in 1.9.7

Let's take this scenario:

  1. One driver
  2. One Server
  3. 1 local client with shared memory support.
  4. 1 remote client with shared memory support.
  5. 1 local client without shared memory support

From my understanding, once the driver has the data, it can signal to the server that data is ready to be read (perhaps by a property?). The server then checks the clients. Since it has 3 clients, it checks how it sends the BLOB to them.

  1. For first local client with shared memory support: No need to send blob, but only signal to it that it is ready (perhaps by same property mentioned above?)
  2. For second remote client, it doesn't matter if it has shared memory support, we need to send it via TCP sockets. So the server must read the shared memory and then sends it.
  3. Same as 2, server must read and send via TCP.

Driver cannot use shared memory yet (no queue?) until server and 1st client complete reading the shared memory. How is the driver informed about this so that it can start utilizing the memory again? Regarding the change in protocol, you propose a new message type and ? how is that exactly used? is it only between client and server, or also between server and driver?

@pludov
Copy link
Contributor

pludov commented May 22, 2022 via email

@knro
Copy link
Contributor Author

knro commented May 25, 2022

Thank you Ludovic for the detailed response. I think we should arrange a video meeting to go over this, perhaps even with the participation of other INDI developers so we can agree on some consensus moving forward. I believe this would bring significant improvements especially if the driver/server/client are all on the same physical machine. Thanks for clarifying my misunderstandings about this and I probably have more!

How about we arrange a meeting next week? how is your availability? You can send me a direct email to discuss the set up.

@knro
Copy link
Contributor Author

knro commented Jul 21, 2022

This is now fully implemented in INDI thanks to outstanding work by @pludov!

@knro knro closed this as completed Jul 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants