This repository was archived by the owner on Jan 6, 2023. It is now read-only.

Hanging/recurring methods #645

Closed
mariusz-jachimowicz-83 opened this issue Sep 8, 2016 · 7 comments

@mariusz-jachimowicz-83
Contributor

mariusz-jachimowicz-83 commented Sep 8, 2016

There are two hanging (recurring) functions when there is a problem with ZooKeeper during subscription to the log:

(defn find-job-scheduler [log]
  (loop []
    (if-let [chunk
             (try
               ;; curator has an internal retry policy
               (extensions/read-chunk log :job-scheduler nil)
               (catch Throwable e
                 ;; this leads to a large log file during long ZK outages,
                 ;; making it harder for a developer to see what is going on
                 (warn e)
                 (warn "Job scheduler couldn't be discovered. Backing off 500ms and trying again...")
                 nil))]
      (:job-scheduler chunk)
      ;; hangs here, producing a large log file; it could be better to
      ;; fail fast (Erlang strategy) and restart the peer quickly
      ;; rather than retrying forever
      (do (Thread/sleep 500)
          (recur)))))

;; the same problems as above
(defn find-messaging [log]
  (loop []
    (if-let [chunk
             (try
               (extensions/read-chunk log :messaging nil)
               (catch Throwable e
                 (warn e)
                 (warn "Messaging couldn't be discovered. Backing off 500ms and trying again...")
                 nil))]
      (:messaging chunk)
      (do (Thread/sleep 500)
          (recur)))))
@MichaelDrogalis
Contributor

Sorry for the delayed response.

@mariusz-jachimowicz-83 Were you able to show that these functions actually hang when the program tries to purposefully abort? I believe if you look at the context around the code, a closed ZooKeeper connection will make both of these exit out cleanly.

I agree though, we should switch to what you're suggesting. Just wondering if this is an operational or correctness problem.

@lbradstreet
Member

lbradstreet commented Sep 20, 2016

First, thanks for letting us know @mariusz-jachimowicz-83.

I think the primary problem here is that the threads will get stuck trying to read these chunks, with no escape, as @mariusz-jachimowicz-83 has discovered. Since we're passing in the log component, we should check whether (:kill-ch log) is closed via alts!! before recurring and trying again. This will allow everything to shut down properly when the log component is shut down (which happens when the peer group is shut down).
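A minimal, dependency-free sketch of that suggestion: instead of an unconditional (Thread/sleep 500), wait on a shutdown signal with a timeout, so the loop exits as soon as the log component is shut down. The real code would use (:kill-ch log) with core.async's alts!!; here a promise stands in for the kill channel, and read-chunk is passed in as a plain function so the sketch is self-contained.

```clojure
(defn find-job-scheduler*
  "Sketch only: :kill-signal is a promise standing in for (:kill-ch log)."
  [{:keys [kill-signal]} read-chunk]
  (loop []
    (if-let [chunk (try (read-chunk)
                        (catch Throwable _ nil))]
      (:job-scheduler chunk)
      ;; Block for up to 500ms; if kill-signal is delivered first,
      ;; give up immediately instead of recurring forever.
      (when (= ::timeout (deref kill-signal 500 ::timeout))
        (recur)))))
```

With a delivered kill-signal the loop returns nil on the first failed read instead of sleeping and retrying, which is exactly the clean-shutdown behavior being asked for.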

I think trying to reboot the peer here is the wrong approach because there are a lot of non-peer users of subscribe-to-log (e.g. the dashboard) which call these functions.

I am open to increasing the sleep time between recurs (this shouldn't happen to peers because they try to write it before they join), as well as decreasing the log level from warn to info, and combining the two log statements.

One additional option is that we could be more picky about what to do on different checked exception types. A node that doesn't exist may require a different reaction than being unable to connect to ZooKeeper or having a broken connection. I think the other changes above are probably enough though.
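That per-exception-type option could be sketched as a small classifier that maps a read-chunk failure to a reaction. This is hypothetical: the real cases would be ZooKeeper exception classes such as KeeperException$NoNodeException and KeeperException$ConnectionLossException, which are replaced here by JDK classes so the sketch stays self-contained.

```clojure
(defn classify-read-failure
  "Hypothetical helper: decide how to react to a failed chunk read.
  FileNotFoundException stands in for a missing ZK node;
  ConnectException stands in for a broken/unreachable connection."
  [^Throwable e]
  (cond
    (instance? java.io.FileNotFoundException e) :await-node ; node doesn't exist (yet)
    (instance? java.net.ConnectException e)     :retry      ; back off and retry
    :else                                       :rethrow))  ; unknown failure: fail fast
```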

@MichaelDrogalis
Contributor

I believe read-chunk returns nil when the ZooKeeper connection closes, so the thread ends up exiting.

@lbradstreet
Member

The cleanup-broken-connections wrapper in read-chunk rethrows all of the exceptions it catches, and passes the rest through, so as far as I can tell it will never clean itself up. This may have changed from the way it used to work at some point.
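A hypothetical sketch of the behavior described above: a wrapper that rethrows recognised connection failures (ConnectException standing in here for ZooKeeper's broken-connection exceptions) and lets every other throwable propagate untouched. Either way the subscriber's (catch Throwable ...) sees an exception, logs, returns nil, and recurs, so the thread never exits on its own.

```clojure
(defn wrap-broken-connections
  "Sketch of the rethrow-everything pattern: catch a recognised
  connection failure, wrap and rethrow it; anything else propagates."
  [f]
  (fn []
    (try (f)
      (catch java.net.ConnectException e
        ;; cleanup would happen here, followed by an unconditional rethrow
        (throw (ex-info "connection broken" {:cause :zk} e))))))
```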

@MichaelDrogalis
Contributor

Ah, seems likely. I definitely remember stepping through this code with JVisualVM at one point to make sure threads weren't leaking, but I guess that was a while back.

@mariusz-jachimowicz-83
Contributor Author

I am working on handling ZK in the dashboard correctly (onyx-platform/onyx-dashboard#63), so I am experimenting now. There are two main situations that I want to handle:

  • lost/reconnected connection
  • node not found (for instance, after a ZK restart)

I need more time to analyze and experiment. I don't want to mess anything up.

@lbradstreet
Member

This is best handled in onyx-dashboard.
