This repository was archived by the owner on Jan 6, 2023. It is now read-only.

ZK connection problem handling #679

Open
mariusz-jachimowicz-83 opened this issue Nov 17, 2016 · 10 comments

@mariusz-jachimowicz-83
Contributor

mariusz-jachimowicz-83 commented Nov 17, 2016

When I submit a job and there is a ZK connection problem, the developer sees "If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper.", but then has to wait a couple of minutes to get any further feedback or error messages.

It's easy to see this behaviour. You can just:

  • set :zookeeper/server? false in the test-resources/test-config.edn file
  • run any test
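
For reference, a minimal sketch of the relevant toggle (the surrounding :env-config key is an assumption about the file's layout, not copied from the repo):

```clojure
;; test-resources/test-config.edn — sketch, not the actual file contents:
;; flip :zookeeper/server? to false so no in-process ZooKeeper is started
;; and the connection attempt has nothing to reach.
{:env-config {:zookeeper/server? false
              ;; other env/peer config keys left unchanged
              }}
```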

ZK connections are based on BoundedExponentialBackoffRetry, and this is one of the problems.
It could be changed the same way as I did it for the dashboard in onyx-platform/onyx-dashboard#63.
This way we would get:

  • a message about the connection problem after the first 5s
  • messages showing that components are trying to reconnect
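
For illustration, here is a minimal sketch (not Onyx code) of how Curator's connection-state callbacks could surface that feedback quickly; the namespace, log wording, and retry values are assumptions:

```clojure
;; Sketch only: surface ZK connection problems early via Curator's
;; connection-state callbacks instead of waiting for the retry policy
;; to exhaust. Assumes curator-framework is on the classpath.
(ns example.zk-feedback
  (:import [org.apache.curator.framework CuratorFrameworkFactory CuratorFramework]
           [org.apache.curator.framework.state ConnectionStateListener ConnectionState]
           [org.apache.curator.retry BoundedExponentialBackoffRetry]))

(defn logging-state-listener
  "Prints a message whenever Curator's view of the ZK connection changes."
  []
  (reify ConnectionStateListener
    (stateChanged [_ _client new-state]
      (condp = new-state
        ConnectionState/SUSPENDED   (println "ZooKeeper connection suspended; retrying...")
        ConnectionState/LOST        (println "ZooKeeper connection lost; retrying...")
        ConnectionState/RECONNECTED (println "ZooKeeper connection re-established.")
        (println "ZooKeeper connection state:" (str new-state))))))

(defn connect
  "Builds a Curator client that reports connection trouble as soon as Curator
  notices it. The 1s/30s/5 retry values are illustrative, not Onyx's defaults."
  ^CuratorFramework [zk-addr]
  (let [client (CuratorFrameworkFactory/newClient
                zk-addr
                (BoundedExponentialBackoffRetry. 1000 30000 5))]
    (.addListener (.getConnectionStateListenable client) (logging-state-listener))
    (.start client)
    client))
```
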
@MichaelDrogalis
Contributor

I am fine with improving the feedback from a bad ZooKeeper connection, but there are two problems with the approach in #680:

  1. We shouldn't be handling the retry loop ourselves. Curator provides callbacks to perform an action in response to a retry. Curator is far more battle tested than anything we'll write. In particular, ensuring that all resources are cleaned up during a premature shutdown is non-trivial work.
  2. We're using a BoundedExponentialBackoffRetry policy because we intentionally don't want to retry forever, and we don't want to retry after a constant period of time. If the ZooKeeper cluster is experiencing high load, continually reconnecting after the same period of time might make things worse. This is the usual rationale for anything that uses exponential backoff. Secondly, we eventually want to give up on connecting if a connection can't be established in a reasonable period of time. Most of the time Onyx is run under a cluster manager such as Kubernetes or Mesos. We need to give Onyx the ability to quit trying when it's unsuccessful so that the cluster manager can move the process to another machine - one that perhaps does have a connection available to ZooKeeper.

An enhancement patch for a more responsive ZooKeeper connection should only need a few lines of code changed.
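
As an illustration of the callback approach, a minimal sketch (not Onyx code) that keeps Curator's BoundedExponentialBackoffRetry but logs each retry attempt; the names and log wording are assumptions:

```clojure
;; Sketch only: wrap Curator's own retry policy so every retry attempt is
;; logged, giving per-retry feedback without hand-rolling the retry loop.
(ns example.zk-retry
  (:import [org.apache.curator RetryPolicy]
           [org.apache.curator.retry BoundedExponentialBackoffRetry]))

(defn logging-retry-policy
  "Delegates every decision to the wrapped policy; only adds a log line."
  [^RetryPolicy wrapped]
  (reify RetryPolicy
    (allowRetry [_ retry-count elapsed-ms sleeper]
      (println (format "ZooKeeper operation failed; retry %d after %dms elapsed"
                       (inc retry-count) elapsed-ms))
      (.allowRetry wrapped retry-count elapsed-ms sleeper))))

;; Usage: pass this wherever a RetryPolicy is expected, e.g. to
;; CuratorFrameworkFactory/newClient.
(def retry-policy
  (logging-retry-policy (BoundedExponentialBackoffRetry. 1000 30000 5)))
```
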

@mariusz-jachimowicz-83
Contributor Author

@MichaelDrogalis Sure. I will try to satisfy those requirements. For now it's just an experiment, generally.

@mariusz-jachimowicz-83
Contributor Author

Generally I am thinking about a more complex policy: a combination of 2 policies, to be able to have a simple state machine where:

  1. when we have a connection to ZK, use Policy-FailFast
    Policy-FailFast = each write operation retries 3 times (3 x 5s). When the policy is exhausted, signal connection lost + switch into Policy-TryReconnect
  2. use Policy-TryReconnect
    Policy-TryReconnect =
  • signal to other components to stop/pause
  • try to reconnect for 3-5 minutes (every 5s)
  • when reconnected, switch into Policy-FailFast + signal to resume computations
  • when the policy is exhausted, restart the whole system or throw an exception so that K8s or something like that could allocate a node that has a connection to ZK

This is my idea for failing fast and being more responsive. It's just an experiment in my head.
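
A purely hypothetical sketch of those transitions, with made-up names (nothing here exists in Onyx), might look like:

```clojure
;; Hypothetical sketch of the two-policy state machine described above.
(def conn-state (atom :fail-fast))

(defn on-writes-exhausted!
  "Policy-FailFast gave up (e.g. 3 x 5s): signal lost connection, pause work,
  switch to Policy-TryReconnect."
  []
  (println "ZooKeeper connection lost; pausing components and trying to reconnect")
  (reset! conn-state :try-reconnect))

(defn on-reconnected!
  "Policy-TryReconnect succeeded: resume work, switch back to Policy-FailFast."
  []
  (println "ZooKeeper connection re-established; resuming components")
  (reset! conn-state :fail-fast))

(defn on-reconnect-exhausted!
  "Policy-TryReconnect gave up (e.g. after 3-5 minutes): exit so a cluster
  manager such as Kubernetes can reschedule the peer elsewhere."
  []
  (println "Could not reconnect to ZooKeeper; shutting down")
  (System/exit 1))
```
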

@lbradstreet
Member

It might be worth looking at what Storm does in this regard, since it's had a fair bit of battle testing. Their default config is here: https://github.com/apache/storm/blob/master/conf/defaults.yaml#L31

Assuming these look sane, I would rather use a policy similar to something battle-tested than come up with retry policies without a good amount of testing (we should run Jepsen in any case).

@mariusz-jachimowicz-83
Contributor Author

@lbradstreet You can see that those settings are for failing fast, so you get a message in the logfile about connection problems within 30s. Storm also has Nimbus, or something like the Supervisor, that will try to reconnect/restart after it happens, I think, but I might be wrong about that. Onyx only has this hang-for-a-couple-of-minutes policy for now.

I am also very interested in learning Jepsen. I want to have it as a first-class citizen in my skills toolbelt.

@mariusz-jachimowicz-83
Contributor Author

mariusz-jachimowicz-83 commented Nov 18, 2016

Onyx has very good building blocks, so we can build a solution that handles many failure scenarios much better than competitors 😄

@lbradstreet
Member

I'm definitely open to different defaults, but I will want to run through some scenarios and what the peers will do in each.

@MichaelDrogalis
Contributor

We're significantly more risk-averse to changes in this part of the code base since it is critical to Onyx being able to run correctly. I would prefer to keep the policy simple, even if it's mildly less responsive.

@mariusz-jachimowicz-83
Contributor Author

mariusz-jachimowicz-83 commented Nov 18, 2016

@MichaelDrogalis Sure. It's very reasonable to make small incremental improvements rather than a big revolution. It's just a very early experiment for me for now.

@mariusz-jachimowicz-83
Contributor Author

mariusz-jachimowicz-83 commented Nov 21, 2016

I think that, generally, failures should be a first-class citizen in this kind of system. Netflix treats them this way: they even inject failures in production to check and learn, to answer the question "Are we responding to failures correctly?". Why is this important? Because components/systems/subsystems that hang after failures, and badly handled failures, == losing data + losing time == losing money.
I am thinking about some FailureSupervisor component for each node (PeerGroup). This component could:

  • store info that some failure happened
    This could be displayed in the dashboard. So, for instance, I could see that there are problems that require modifying the physical architecture, changing the computation algorithm (my workflow), being aware of problems with third-party libraries, ....

  • respond to some failures automatically
    Send an email, restart, restart using a snapshot, turn off, propose allocating more peers/vpeers, maybe adjust some parameters (snapshotting or triggering frequency), force a restart of only a particular component...

Each component should be able to send a msg to this component. This is a loose thought of mine.
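
A purely hypothetical sketch of such a FailureSupervisor, here using core.async (all names are made up, nothing in this exists in Onyx):

```clojure
;; Sketch: components put failure events on a channel; a single supervisor
;; loop records them and reacts to the ones it knows how to handle.
(ns example.failure-supervisor
  (:require [clojure.core.async :as a]))

(defonce failure-ch (a/chan 100))

(defonce failures (atom []))   ; could back a dashboard view

(defn report-failure!
  "Called by any component that hits a failure."
  [component kind data]
  (a/>!! failure-ch {:component component
                     :kind      kind
                     :data      data
                     :at        (System/currentTimeMillis)}))

(defn start-supervisor!
  "Stores every failure and reacts to the kinds it recognises."
  []
  (a/go-loop []
    (when-let [failure (a/<! failure-ch)]
      (swap! failures conj failure)
      (case (:kind failure)
        :zk-connection-lost (println "Would restart / resume from snapshot here")
        :low-capacity       (println "Would propose allocating more peers here")
        (println "Recorded failure:" failure))
      (recur))))

;; Usage: (start-supervisor!)
;;        (report-failure! :zookeeper :zk-connection-lost {})
```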
