This repository was archived by the owner on Jan 6, 2023. It is now read-only.

ZK connection problem handling #679

Open
mariusz-jachimowicz-83 opened this issue Nov 17, 2016 · 10 comments

@mariusz-jachimowicz-83
Contributor

mariusz-jachimowicz-83 commented Nov 17, 2016

When I submit a job and there is a ZK connection problem, the developer sees "If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper.", but then has to wait a couple of minutes to get any further feedback or error messages.

It's easy to see this behaviour. You can just:

  • set :zookeeper/server? false in the test-resources/test-config.edn file
  • run any test
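
For reference, a minimal sketch of the relevant toggle (the surrounding :env-config key is an assumption about the file's layout, not copied from the repo):

```clojure
;; test-resources/test-config.edn — sketch, not the actual file contents:
;; flip :zookeeper/server? to false so no in-process ZooKeeper is started
;; and the connection attempt has nothing to reach.
{:env-config {:zookeeper/server? false
              ;; other env/peer config keys left unchanged
              }}
```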

ZK connections are based on BoundedExponentialBackoffRetry, and this is one of the problems.
It could be changed the same way as I did it for the dashboard in onyx-platform/onyx-dashboard#63.
This way we would get:

  • a message about the connection problem after the first 5s
  • messages showing that components are trying to reconnect
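
For illustration, here is a minimal sketch (not Onyx code) of how Curator's connection-state callbacks could surface that feedback quickly; the namespace, log wording, and retry values are assumptions:

```clojure
;; Sketch only: surface ZK connection problems early via Curator's
;; connection-state callbacks instead of waiting for the retry policy
;; to exhaust. Assumes curator-framework is on the classpath.
(ns example.zk-feedback
  (:import [org.apache.curator.framework CuratorFrameworkFactory CuratorFramework]
           [org.apache.curator.framework.state ConnectionStateListener ConnectionState]
           [org.apache.curator.retry BoundedExponentialBackoffRetry]))

(defn logging-state-listener
  "Prints a message whenever Curator's view of the ZK connection changes."
  []
  (reify ConnectionStateListener
    (stateChanged [_ _client new-state]
      (condp = new-state
        ConnectionState/SUSPENDED   (println "ZooKeeper connection suspended; retrying...")
        ConnectionState/LOST        (println "ZooKeeper connection lost; retrying...")
        ConnectionState/RECONNECTED (println "ZooKeeper connection re-established.")
        (println "ZooKeeper connection state:" (str new-state))))))

(defn connect
  "Builds a Curator client that reports connection trouble as soon as Curator
  notices it. The 1s/30s/5 retry values are illustrative, not Onyx's defaults."
  ^CuratorFramework [zk-addr]
  (let [client (CuratorFrameworkFactory/newClient
                zk-addr
                (BoundedExponentialBackoffRetry. 1000 30000 5))]
    (.addListener (.getConnectionStateListenable client) (logging-state-listener))
    (.start client)
    client))
```
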
@MichaelDrogalis
Contributor

I am fine with improving the feedback from a bad ZooKeeper connection, but there are two problems with the approach in #680:

  1. We shouldn't be handling the retry loop ourselves. Curator provides callbacks to perform an action in response to a retry. Curator is far more battle tested than anything we'll write. In particular, ensuring that all resources are cleaned up during a premature shutdown is non-trivial work.
  2. We're using a BoundedExponentialBackoffRetry policy because we intentionally don't want to retry forever, and we don't want to retry after a constant period of time. If the ZooKeeper cluster is experiencing high load, continually reconnecting after the same period of time might make things worse. This is the usual rationale for anything that uses exponential backoff. Secondly, we eventually want to give up on connecting if a connection can't be established in a reasonable period of time. Most of the time Onyx is run under a cluster manager such as Kubernetes or Mesos. We need to give Onyx the ability to quit trying when it's unsuccessful so that the cluster manager can move the process to another machine - one that perhaps does have a connection available to ZooKeeper.

An enhancement patch for a more responsive ZooKeeper connection should only need a few lines of code changed.
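
As an illustration of the callback approach, a minimal sketch (not Onyx code) that keeps Curator's BoundedExponentialBackoffRetry but logs each retry attempt; the names and log wording are assumptions:

```clojure
;; Sketch only: wrap Curator's own retry policy so every retry attempt is
;; logged, giving per-retry feedback without hand-rolling the retry loop.
(ns example.zk-retry
  (:import [org.apache.curator RetryPolicy]
           [org.apache.curator.retry BoundedExponentialBackoffRetry]))

(defn logging-retry-policy
  "Delegates every decision to the wrapped policy; only adds a log line."
  [^RetryPolicy wrapped]
  (reify RetryPolicy
    (allowRetry [_ retry-count elapsed-ms sleeper]
      (println (format "ZooKeeper operation failed; retry %d after %dms elapsed"
                       (inc retry-count) elapsed-ms))
      (.allowRetry wrapped retry-count elapsed-ms sleeper))))

;; Usage: pass this wherever a RetryPolicy is expected, e.g. to
;; CuratorFrameworkFactory/newClient.
(def retry-policy
  (logging-retry-policy (BoundedExponentialBackoffRetry. 1000 30000 5)))
```
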

@mariusz-jachimowicz-83
Contributor Author

@MichaelDrogalis Sure. I will try to satisfy those requirements. For now it's just an experiment, generally.

@mariusz-jachimowicz-83
Contributor Author

Generally I am thinking about a more complex policy: a combination of 2 policies, to be able to have a simple state machine where:

  1. when we have a connection to ZK, use Policy-FailFast
    Policy-FailFast = each write operation retries 3 times (3 x 5s). When the policy is exhausted, signal connection lost + switch into Policy-TryReconnect
  2. use Policy-TryReconnect
    Policy-TryReconnect =
  • signal to other components to stop/pause
  • try to reconnect for 3-5 minutes (every 5s)
  • when reconnected, switch into Policy-FailFast + signal to resume computations
  • when the policy is exhausted, restart the whole system or throw an exception so that K8s or something like that could allocate a node that has a connection to ZK

This is my idea for failing fast and being more responsive. It's just an experiment in my head.
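
A purely hypothetical sketch of those transitions, with made-up names (nothing here exists in Onyx), might look like:

```clojure
;; Hypothetical sketch of the two-policy state machine described above.
(def conn-state (atom :fail-fast))

(defn on-writes-exhausted!
  "Policy-FailFast gave up (e.g. 3 x 5s): signal lost connection, pause work,
  switch to Policy-TryReconnect."
  []
  (println "ZooKeeper connection lost; pausing components and trying to reconnect")
  (reset! conn-state :try-reconnect))

(defn on-reconnected!
  "Policy-TryReconnect succeeded: resume work, switch back to Policy-FailFast."
  []
  (println "ZooKeeper connection re-established; resuming components")
  (reset! conn-state :fail-fast))

(defn on-reconnect-exhausted!
  "Policy-TryReconnect gave up (e.g. after 3-5 minutes): exit so a cluster
  manager such as Kubernetes can reschedule the peer elsewhere."
  []
  (println "Could not reconnect to ZooKeeper; shutting down")
  (System/exit 1))
```
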

@lbradstreet
Member

It might be worth looking at what Storm does in this regard, since it's had a fair bit of battle testing. Their default config is here: https://github.com/apache/storm/blob/master/conf/defaults.yaml#L31

Assuming these look sane, I would rather use a policy similar to something battle-tested than come up with retry policies without a good amount of testing (we should run Jepsen in any case).

@mariusz-jachimowicz-83
Contributor Author

@lbradstreet You can see that those settings are for failing fast, so you get a message in the logfile about connection problems within 30s. Storm also has Nimbus, or something like the Supervisor, that will try to reconnect/restart after it happens, I think, but I might be wrong about that. Onyx only has this hang-for-a-couple-of-minutes policy for now.

I am also very interested in learning Jepsen. I want to have it as a first-class citizen in my skills toolbelt.

@mariusz-jachimowicz-83
Contributor Author

mariusz-jachimowicz-83 commented Nov 18, 2016

Onyx has very good building blocks, so we can build a solution that handles many failure scenarios much better than competitors 😄

@lbradstreet
Member

I'm definitely open to different defaults, but I will want to run through some scenarios and what the peers will do in each.

@MichaelDrogalis
Contributor

We're significantly more risk-averse to changes in this part of the code base since it is critical to Onyx being able to run correctly. I would prefer to keep the policy simple, even if it's mildly less responsive.

@mariusz-jachimowicz-83
Contributor Author

mariusz-jachimowicz-83 commented Nov 18, 2016

@MichaelDrogalis Sure. It's very reasonable to make small incremental improvements rather than a big revolution. It's just a very early experiment for me for now.

@mariusz-jachimowicz-83
Contributor Author

mariusz-jachimowicz-83 commented Nov 21, 2016

I think that, generally, failures should be a first-class citizen in this kind of system. Netflix treats them this way: they even inject failures in production to check and learn, to answer the question "Are we responding to failures correctly?". Why is this important? Because components/systems/subsystems that hang after failures, and badly handled failures, == losing data + losing time == losing money.
I am thinking about some FailureSupervisor component for each node (PeerGroup). This component could:

  • store info that some failure happened
    This could be displayed in the dashboard. So, for instance, I could see that there are problems that require modifying the physical architecture, changing the computation algorithm (my workflow), being aware of problems with third-party libraries, ....

  • respond to some failures automatically
    Send an email, restart, restart using a snapshot, turn off, propose allocating more peers/vpeers, maybe adjust some parameters (snapshotting or triggering frequency), force a restart of only a particular component...

Each component should be able to send a msg to this component. This is a loose thought of mine.
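
A purely hypothetical sketch of such a FailureSupervisor, here using core.async (all names are made up, nothing in this exists in Onyx):

```clojure
;; Sketch: components put failure events on a channel; a single supervisor
;; loop records them and reacts to the ones it knows how to handle.
(ns example.failure-supervisor
  (:require [clojure.core.async :as a]))

(defonce failure-ch (a/chan 100))

(defonce failures (atom []))   ; could back a dashboard view

(defn report-failure!
  "Called by any component that hits a failure."
  [component kind data]
  (a/>!! failure-ch {:component component
                     :kind      kind
                     :data      data
                     :at        (System/currentTimeMillis)}))

(defn start-supervisor!
  "Stores every failure and reacts to the kinds it recognises."
  []
  (a/go-loop []
    (when-let [failure (a/<! failure-ch)]
      (swap! failures conj failure)
      (case (:kind failure)
        :zk-connection-lost (println "Would restart / resume from snapshot here")
        :low-capacity       (println "Would propose allocating more peers here")
        (println "Recorded failure:" failure))
      (recur))))

;; Usage: (start-supervisor!)
;;        (report-failure! :zookeeper :zk-connection-lost {})
```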
