Hanging/recurring methods #645
Comments
Sorry for the delayed response. @mariusz-jachimowicz-83 Were you able to show that these functions actually hang when the program tries to purposefully abort? I believe if you look at the context around the code, a closed ZooKeeper connection will make both of these exit cleanly. I agree though, we should switch to what you're suggesting. Just wondering whether this is an operational or a correctness problem.
First, thanks for letting us know @mariusz-jachimowicz-83. I think the primary problem here is that the threads will get stuck on trying to read these chunks, with no escape, as @mariusz-jachimowicz-83 has discovered. Since we're passing in the log component, we should check whether

I think trying to reboot the peer here is the wrong approach because there are a lot of non-peer users of subscribe-to-log (e.g. the dashboard) which call on these functions. I am open to increasing the sleep time between recurs (this shouldn't happen to peers because they try to write it before they join), as well as decreasing the log level from warn to info, and combining the two log statements. One additional option is that we could be more picky about what to do on different checked exception types. A node that doesn't exist may require a different reaction than being unable to connect to ZooKeeper or having a broken connection. I think the other changes above are probably enough though.
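To make the options above concrete, here is a minimal sketch of a retry loop with an escape hatch, written in Python for brevity rather than in Onyx's own Clojure. Everything in it is an assumption for illustration: the exception classes, `read_chunk_with_retry`, `read_fn`, and `is_closed` are hypothetical names, not Onyx or ZooKeeper APIs. It shows three of the ideas discussed: checking whether the log component has been closed before each attempt, reacting differently to a missing node than to a broken connection, and sleeping longer between recurs.

```python
import time


class NoNodeError(Exception):
    """Hypothetical: the requested log chunk does not exist (yet)."""


class ConnectionLostError(Exception):
    """Hypothetical: ZooKeeper is unreachable or the connection broke."""


class LogClosedError(Exception):
    """Hypothetical: the log component was shut down on purpose."""


def read_chunk_with_retry(read_fn, is_closed, base_sleep=0.5,
                          max_sleep=8.0, sleep=time.sleep):
    """Call read_fn until it succeeds or the log component is closed.

    read_fn   -- zero-argument callable that returns the chunk or raises
    is_closed -- zero-argument callable, True once the caller shut down
    sleep     -- injectable so the loop can be exercised without waiting
    """
    delay = base_sleep
    while True:
        # Escape hatch: never recur once the component has been closed.
        if is_closed():
            raise LogClosedError("log component closed; giving up")
        try:
            return read_fn()
        except NoNodeError:
            # The node may simply not be written yet: retry at a steady pace.
            sleep(base_sleep)
        except ConnectionLostError:
            # Connection trouble: back off, doubling up to max_sleep.
            sleep(delay)
            delay = min(delay * 2, max_sleep)
```

A caller such as the dashboard could catch `LogClosedError` and stop its subscriber thread instead of rebooting anything, which keeps the decision about what to do next with the user of the function.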
I believe
The cleanup-broken-connections in read-chunk rethrows an exception on all caught exceptions, and passes the rest through, so as far as I can tell it will never clean itself up. This may have changed from the way it used to work at some point.
Ah, seems likely. I definitely remember stepping through this code with JVisualVM at one point to make sure threads weren't leaking, but I guess that was a while back.
I am working on handling ZK in the dashboard correctly (onyx-platform/onyx-dashboard#63) so I am experimenting now. There are 2 main situations that I want to handle:
I need more time to analyze and experiment. I don't want to mess up.
This is best handled in onyx-dashboard.
There are 2 hanging (recurring) methods when there is a problem with ZK during subscription to the log