Crashing under large load #50
Does the dashboard with metrics render fine when you connect erlangpl to your app? If so, what throughput is reported? We got reports from users about epl_st crashing while waiting for a response from which_children/1 - is that the kind of crash you're getting too?
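For reference, supervisor:which_children/1 is a plain OTP call that blocks until the supervisor process replies, which is why a busy or non-compliant supervisor can stall the caller. A minimal sketch of the call and the shape of its result; SupPid is a placeholder:

    %% Blocks until the supervisor process replies, so a busy or
    %% non-OTP-compliant process can stall the caller.
    Children = supervisor:which_children(SupPid),
    %% Each element is {Id, ChildPidOrUndefined, worker | supervisor, Modules}.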
The dashboard renders perfectly fine when I connect. This is the crash I'm receiving as well; I'll post my crash in a follow-up report. NB: I am also using
@michalslaski as promised, here's the log:
Pretty sure we can parallelize most of these list operations (except for the foldl). Let's try to make some progress on that end? I think it's timing out due to the list operations stalling.
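A minimal sketch of the kind of parallel map that could stand in for sequential lists:map/2 passes in epl_st; the module and function names here are hypothetical, not existing erlangpl code:

    %% Hypothetical helper: a naive parallel map. Each element is handled
    %% in its own process and results are collected back in order.
    -module(epl_pmap).
    -export([pmap/2]).

    pmap(Fun, List) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn_link(fun() -> Parent ! {Ref, Fun(Elem)} end),
                    Ref
                end || Elem <- List],
        [receive {Ref, Result} -> Result end || Ref <- Refs].

Note that this trades the single-process design for extra processes on the observed node, which is exactly the concern raised in the reply below.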
Progress update: increasing the timeout and
Alright, so two (non-mutually-exclusive) methods of attack:
Aha! I think I got to the root of the problem. If a process times out (
Thanks for progressing on this issue.

If the dashboard renders fine but reports throughput N/A, it means not a single trace report was collected from the epl_tracer process running on the observed node. This is typically due to a timeout set to 5 seconds, which is not long enough to collect statistics when the system is under heavy load.

The suggested optimisations are interesting, but I am not sure we should parallelise the list operations. Having just one epl_tracer process spawned on the observed node ensures we will not utilise more than one CPU core for processing trace events. Although we can't avoid creating some overhead, I think we should aim at limiting the impact on the observed node as much as possible. Spawning just one process is a very naive way of limiting the impact, but better than none. I'm sure we can find a way of optimising the processing of long lists. One way would be to use the new tracer backend introduced in Erlang 19, which allows implementing NIF callback functions.

Your crash report confirms that another timeout occurred while querying one of the supervisor processes. This happens if a supervisor process does not comply with the OTP conventions and, as a result, the which_children/1 function does not get a proper response. This could happen when the 'application' behaviour start/2 callback function does not return the pid of an OTP supervisor. See the similar issue #29.
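A sketch of the OTP-compliant shape being described, assuming a hypothetical my_app application module whose start/2 returns the pid of a real supervisor; the module names are placeholders, not code from this project:

    %% Sketch: an OTP-compliant application callback module. start/2 must
    %% return {ok, Pid} where Pid is a real OTP supervisor, otherwise
    %% supervisor:which_children/1 on that pid never gets a proper reply.
    -module(my_app).
    -behaviour(application).
    -export([start/2, stop/1]).

    start(_Type, _Args) ->
        %% my_app_sup:start_link() is expected to return {ok, SupPid}.
        my_app_sup:start_link().

    stop(_State) ->
        ok.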
@michalslaski I'm pretty sure my applications all return top-level supervisors. If I don't implement
Identical error when loading this in my large Phoenix application, except the
My application info:

iex(my_server@127.0.0.1)1> :application.info()[:running]
[
sasl: #PID<0.2025.0>,
iex: #PID<0.2019.0>,
my_server: #PID<0.1864.0>,
runtime_tools: #PID<0.1859.0>,
pid_file: #PID<0.1853.0>,
gettext: #PID<0.1848.0>,
mbu: :undefined,
fs: #PID<0.1841.0>,
permission_ex: :undefined,
comeonin: :undefined,
credo: #PID<0.1829.0>,
bunt: #PID<0.1826.0>,
inets: #PID<0.1815.0>,
cachex: #PID<0.1803.0>,
eternal: :undefined,
deppie: #PID<0.1797.0>,
bodyguard: :undefined,
ueberauth_identity: :undefined,
ueberauth: :undefined,
drab: #PID<0.1786.0>,
cowboy: #PID<0.1781.0>,
ranch: #PID<0.1776.0>,
ssl: #PID<0.1764.0>,
public_key: :undefined,
asn1: :undefined,
cowlib: :undefined,
phoenix_html: :undefined,
phoenix: #PID<0.1755.0>,
phoenix_pubsub: #PID<0.1751.0>,
poison: :undefined,
eex: :undefined,
postgrex: #PID<0.1741.0>,
db_connection: #PID<0.1734.0>,
connection: :undefined,
ecto_interval: :undefined,
quantum: #PID<0.1729.0>,
crontab: :undefined,
exceptional: :undefined,
phoenix_ecto: :undefined,
plug: #PID<0.1721.0>,
crypto: :undefined,
mime: :undefined,
ecto: #PID<0.1714.0>,
logger: #PID<0.1705.0>,
decimal: :undefined,
elixir: #PID<0.1698.0>,
compiler: :undefined,
poolboy: :undefined,
stdlib: :undefined,
kernel: #PID<0.1662.0>
]
My app uses about 1000 processes per machine for controlling and distributing tasks across a cluster. epl_st times out when creating the supervisor tree to send to the web GUI. Is there either a way I can specify only certain processes to be tracked, or (better) just extend that timeout and accept a bit of lag?
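For context on the timeout side of the question: a plain gen_server:call/2 waits 5000 ms by default, while gen_server:call/3 takes an explicit timeout, so extending the wait is a one-argument change wherever the blocking call is made. The 30000 value below is an arbitrary example, not the project's actual setting:

    %% Default: gen_server:call/2 exits with a timeout after 5000 ms.
    Reply1 = gen_server:call(Pid, Request),
    %% Extended: gen_server:call/3 takes an explicit timeout in milliseconds.
    Reply2 = gen_server:call(Pid, Request, 30000).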