ZK connection problem handling #679
Comments
I am fine with improving the feedback from a bad ZooKeeper connection, but there are two problems with the approach in #680:
An enhancement patch for a more responsive ZooKeeper connection should change only a few lines of code.
@MichaelDrogalis Sure. I will try to satisfy those requirements. For now it's generally an experiment.
Generally, I am thinking about a more complex policy: a combination of two policies that gives a simple state machine where:
This is my idea of failing fast and being more responsive. It is still an experiment in my head.
It might be worth looking at what Storm does in this regard, since it's had a fair bit of battle testing. Their default config is here: https://github.com/apache/storm/blob/master/conf/defaults.yaml#L31. Assuming these look sane, I would rather use a policy similar to something battle tested than come up with retry policies without a good amount of testing (we should Jepsen-test this in any case).
@lbradstreet You can see that those settings are for failing fast, so a message about connection problems can show up in the log file in under 30s. Storm also has Nimbus, or something like Supervisor, that will try to reconnect/restart after that happens, I think, but I might be wrong about that. For now Onyx only has this policy that hangs for a couple of minutes. I am also very interested in learning Jepsen. I want to have it as a first-class citizen in my skills toolbelt.
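To make the "fail fast" versus "hangs for a couple of minutes" contrast concrete, here is a minimal sketch of how a capped exponential backoff schedule adds up. It does not call Curator and is not the code behind BoundedExponentialBackoffRetry; it only models a generic doubling schedule with a ceiling, ignores jitter and per-attempt connection timeouts, and all numbers (base sleep, cap, retry count) are hypothetical values chosen for illustration, not Onyx's or Storm's actual settings.

```java
/**
 * Illustrative model of a capped exponential backoff schedule.
 * Assumption: the sleep doubles per retry and is clamped to a maximum.
 * Real retry policies may add randomness, so treat results as upper bounds.
 */
final class BackoffSketch {

    /** Upper bound on the sleep before retry number retryCount (0-based). */
    static long cappedSleepMs(long baseSleepMs, long maxSleepMs, int retryCount) {
        // Clamp the shift so the multiplication cannot overflow for sane inputs.
        int shift = Math.min(retryCount + 1, 30);
        long uncapped = baseSleepMs * (1L << shift);
        return Math.min(maxSleepMs, uncapped);
    }

    /** Total time spent sleeping if every one of maxRetries attempts fails. */
    static long worstCaseTotalWaitMs(long baseSleepMs, long maxSleepMs, int maxRetries) {
        long total = 0;
        for (int retry = 0; retry < maxRetries; retry++) {
            total += cappedSleepMs(baseSleepMs, maxSleepMs, retry);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical "patient" settings: 1s base, 30s cap, 10 retries.
        long patient = worstCaseTotalWaitMs(1_000, 30_000, 10);
        // Hypothetical "fail fast" settings: 0.5s base, 2s cap, 3 retries.
        long failFast = worstCaseTotalWaitMs(500, 2_000, 3);

        System.out.println("patient:   " + patient + " ms");   // 210000 ms, i.e. 3.5 minutes
        System.out.println("fail fast: " + failFast + " ms");  // 5000 ms
    }
}
```

With the patient settings the caller waits minutes before the first error surfaces, while the fail-fast settings give up in seconds. Failing fast does mean something else (a supervisor, or the peer itself) has to decide whether to reconnect.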
Onyx has very good building blocks, so we can build a solution that handles many failure scenarios much better than competitors 😄
I'm definitely open to different defaults, but I will want to run through some scenarios and see what the peers will do in each.
We're significantly more risk-averse to changes in this part of the code base since it is critical to Onyx being able to run correctly. I would prefer to keep the policy simple, even if it's mildly less responsive.
@MichaelDrogalis Sure. It's very reasonable to have small incremental improvements rather than a big revolution. For now it's just a very early experiment for me.
I think that, generally, failures should be first-class citizens in this kind of system. Netflix treats them this way, so they even inject failures in production to check, and learn to answer, the question "Are we responding to failures correctly?". Why is this important? Because components/systems/subsystems that hang after failures, and badly handled failures == losing data + losing time == losing money.
Each component should be able to send a message into this component. This is just a loose thought of mine.
When I submit a job and there is a ZK connection problem, the developer sees
If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper.
but I have to wait a couple of minutes to get more feedback and any error messages. It's easy to see this behaviour. You can just set
:zookeeper/server? false
in the test-resources/test-config.edn file.
ZK connections are based on BoundedExponentialBackoffRetry, and this is one of the problems. It could be changed the same way as I did it for the dashboard in onyx-platform/onyx-dashboard#63.
This way we will get: