Issue Description
The X300 uses a thread owned by the chdr_ctrl_endpoint class to poll for control ACKs and asynchronous command responses (uhd/host/lib/rfnoc/chdr_ctrl_endpoint.cpp, line 121 in 748162e).
Calls to receive UDP packets eventually reach a function that uses recv(..., MSG_DONTWAIT) and then poll(..., timeout_ms) to check for packets and then wait for them if none are available (uhd/host/lib/include/uhdlib/transport/udp_common.hpp, line 101 in 748162e).
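For reference, here is a minimal sketch of that receive pattern (my own paraphrase, not the actual UHD code): a non-blocking recv followed by a poll that is supposed to do the waiting.

```cpp
// Minimal sketch of the recv/poll pattern described above (paraphrased, not
// the actual udp_common.hpp code).
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

ssize_t recv_with_timeout(int sockfd, void* buf, size_t len, int timeout_ms)
{
    // First try a non-blocking read; this returns immediately whether or not
    // a packet is waiting.
    ssize_t n = recv(sockfd, buf, len, MSG_DONTWAIT);
    if (n >= 0) {
        return n;
    }

    // Nothing there yet: let the kernel block this thread until the socket is
    // readable or timeout_ms expires. With timeout_ms == 0, poll() returns
    // immediately, so the caller never actually waits in the kernel.
    pollfd pfd{};
    pfd.fd     = sockfd;
    pfd.events = POLLIN;
    if (poll(&pfd, 1, timeout_ms) > 0 && (pfd.revents & POLLIN)) {
        return recv(sockfd, buf, len, MSG_DONTWAIT);
    }
    return -1; // timed out or error; no packet received
}
```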
However, the thread always passes a timeout of 0, which causes poll to return as quickly as possible. If no packets are received, the thread attempts to sleep (uhd/host/lib/rfnoc/chdr_ctrl_endpoint.cpp, line 150 in 748162e).
If the system is under load, the kernel may not be able to put this thread to sleep in the time allotted (I'm guessing; see Additional Information), which leaves the thread tying up a CPU core polling for UDP packets that arrive relatively infrequently.
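To make the failure mode concrete, here is how I picture the receive worker loop (names and the sleep interval are illustrative, not the actual chdr_ctrl_endpoint members): with a zero poll timeout, the only thing standing between this loop and a busy-poll is the short sleep, and if the scheduler can't honor that sleep, the loop spins.

```cpp
// Schematic receive worker loop (illustrative names; recv_with_timeout() is
// the helper sketched above).
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

void recv_worker_sketch(int sockfd, std::atomic<bool>& stop, std::mutex& mtx)
{
    char buf[8192];
    while (!stop) {
        {
            // The mutex is shared with threads that send commands to the device.
            std::lock_guard<std::mutex> lock(mtx);
            if (recv_with_timeout(sockfd, buf, sizeof(buf), /*timeout_ms=*/0) > 0) {
                continue; // handle the packet, then immediately look for the next one
            }
        }
        // No packet: try to yield. If the scheduler cannot actually put the
        // thread to sleep in time (e.g. when the system is loaded), control
        // comes back almost immediately and the loop degenerates into a
        // busy-poll on a zero-timeout poll().
        std::this_thread::sleep_for(std::chrono::microseconds(1));
    }
}
```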
The "right" solution probably consists of passing a non-zero timeout to poll so the kernel can block this thread until data arrives or the timeout expires. But currently, poll is called while holding a mutex shared with other threads that need to communicate with the device, including those sending commands the receiving thread needs to respond to. As a result, passing a non-zero timeout causes device initialization, for example, to take several minutes. The mutex is owned by this class.
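To spell out the conflict with hypothetical numbers (a 10 ms timeout, illustrative names): as long as the wait happens with the lock held, every command exchange can stall behind the receive thread's poll, and an initialization that performs many exchanges adds that up.

```cpp
// Why a non-zero poll timeout hurts while the mutex is held (hypothetical
// 10 ms value, illustrative names; recv_with_timeout() as sketched above).
#include <mutex>

void recv_worker_iteration(int sockfd, std::mutex& mtx)
{
    char buf[8192];
    // Held for the entire wait, so a thread that wants to send a command can
    // be blocked for up to the full timeout on every iteration.
    std::lock_guard<std::mutex> lock(mtx);
    recv_with_timeout(sockfd, buf, sizeof(buf), /*timeout_ms=*/10);
    // ... handle the packet if one arrived ...
}
```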
Comments throughout the code refer to a "threaded_io_service" that needs to be developed to solve this problem. Right now, the only workaround I have found is to patch UHD to sleep this thread for a longer interval. A sleep time of 100us has worked for me and doesn't seem to affect functional behavior, though I'm not 100% sure it has no side effects.
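The patch itself amounts to lengthening that sleep; applied to the loop sketched above, it is just the following line (the exact location in the real chdr_ctrl_endpoint.cpp differs):

```cpp
// Workaround: sleep noticeably longer when no packet was received, so the
// thread reliably yields the core even under load. 100 us is the value that
// worked for me; it is not a value taken from UHD.
std::this_thread::sleep_for(std::chrono::microseconds(100));
```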
Setup Details
UHD v4.1.0.3
X300
UBX160
CentOS 8.4.2105
Linux 4.18.0-305.19.1.el8_4.x86_64
Intel(R) Xeon(R) W-10885M CPU (8 cores)
Expected Behavior
The thread named uhd_ctrl_ep_<id> should consume very little CPU time when executing the benchmark_rate example program.
Actual Behaviour
The thread named uhd_ctrl_ep_<id> consumes up to 99% of CPU time when executing the benchmark_rate example program (and other applications).
Steps to reproduce the problem
Install a single UBX160 card in the X300
Connect host to X300 on SFP port 1
Run top -H
In another terminal, run benchmark_rate --args "addr=192.168.40.2" --rx_rate 200e6 --duration 60
Additional Information
I was able to reproduce the problem with lower sampling rates as well on the system described above. On another system (Ubuntu 20, Linux 5.11, Intel i9-9880H CPU, 16 cores) I was unable to reproduce the issue, hence my guess above that the kernel is unable to put the thread to sleep in time on the "smaller" system when it is under load.
Questions
Is the right answer to develop this "threaded_io_service"?
What exactly needs to be "threaded"? (One speculative reading is sketched below.)
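For what it's worth, here is a purely speculative sketch of one reading of "threaded": a dedicated I/O thread that blocks in the kernel with a real timeout and never touches the command mutex, handing received packets to consumers through a queue. None of these names exist in UHD; this is only meant to make the question concrete.

```cpp
// Speculative sketch only; not UHD code. Reuses recv_with_timeout() from the
// earlier sketch.
#include <atomic>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class io_thread_sketch
{
public:
    explicit io_thread_sketch(int sockfd) : _worker([this, sockfd] { run(sockfd); }) {}

    ~io_thread_sketch()
    {
        _stop = true;
        _worker.join();
    }

    // Consumers (e.g. the control endpoint) pop packets here instead of
    // polling the socket themselves while holding their own mutex.
    bool pop_packet(std::vector<char>& out)
    {
        std::lock_guard<std::mutex> lock(_queue_mutex);
        if (_queue.empty()) {
            return false;
        }
        out = std::move(_queue.front());
        _queue.pop();
        return true;
    }

private:
    void run(int sockfd)
    {
        std::vector<char> buf(8192);
        while (!_stop) {
            // Block in the kernel (100 ms here, arbitrary) instead of spinning;
            // no shared lock is held, so command senders are never delayed.
            ssize_t n = recv_with_timeout(sockfd, buf.data(), buf.size(), 100);
            if (n > 0) {
                std::lock_guard<std::mutex> lock(_queue_mutex);
                _queue.emplace(buf.begin(), buf.begin() + n);
            }
        }
    }

    std::atomic<bool> _stop{false};
    std::mutex _queue_mutex;
    std::queue<std::vector<char>> _queue;
    std::thread _worker;
};
```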
Actually, after going over the code once more, I believe this problem affects all RFNoC devices as the chdr_ctrl_endpoint is instantiated by the class that manages connections to nodes on an RFNoC graph.
This also makes more sense because, on the Ubuntu system I mentioned, I have seen this issue while streaming two channels of 160 MHz bandwidth from an N320 (XG firmware). It also supports my guess that the control recv_worker isn't able to sleep when the system is under heavy load. I don't know how much the wasted CPU time matters relative to the load that induces the problem, but it would be nice not to waste it either way.