This repository was archived by the owner on Jan 6, 2023. It is now read-only.

Hanging/recurring methods #645

Closed
mariusz-jachimowicz-83 opened this issue Sep 8, 2016 · 7 comments

@mariusz-jachimowicz-83
Contributor

mariusz-jachimowicz-83 commented Sep 8, 2016

There are two hanging (recurring) functions when there is a problem with ZooKeeper during subscription to the log:

(defn find-job-scheduler [log]
  (loop []
    (if-let [chunk
             (try
               ;; curator has an internal retry policy
               (extensions/read-chunk log :job-scheduler nil)
               (catch Throwable e
                 ;; this leads to a large log file during long ZK outages,
                 ;; making it harder for a developer to see what is going on
                 (warn e)
                 (warn "Job scheduler couldn't be discovered. Backing off 500ms and trying again...")
                 nil))]
      (:job-scheduler chunk)
      ;; hangs here, producing a large log file; it could be better to
      ;; fail fast (Erlang strategy) and restart the peer quickly
      ;; rather than retrying forever
      (do (Thread/sleep 500)
          (recur)))))

;; the same problems as above
(defn find-messaging [log]
  (loop []
    (if-let [chunk
             (try
               (extensions/read-chunk log :messaging nil)
               (catch Throwable e
                 (warn e)
                 (warn "Messaging couldn't be discovered. Backing off 500ms and trying again...")
                 nil))]
      (:messaging chunk)
      (do (Thread/sleep 500)
          (recur)))))
@MichaelDrogalis
Contributor

Sorry for the delayed response.

@mariusz-jachimowicz-83 Were you able to show that these functions actually hang when the program tries to purposefully abort? I believe if you look at the context around the code, a closed ZooKeeper connection will make both of these exit out cleanly.

I agree though, we should switch to what you're suggesting. Just wondering if this is an operational or correctness problem.

@lbradstreet
Member

lbradstreet commented Sep 20, 2016

First, thanks for letting us know @mariusz-jachimowicz-83.

I think the primary problem here is that the threads will get stuck trying to read these chunks, with no escape, as @mariusz-jachimowicz-83 has discovered. Since we're passing in the log component, we should check whether (:kill-ch log) is closed via alts!! before recurring and trying again. This will allow everything to shut down properly when the log component is shut down (which happens when the peer group is shut down).
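A minimal, dependency-free sketch of that suggestion: instead of an unconditional (Thread/sleep 500), wait on a shutdown signal with a timeout, so the loop exits as soon as the log component is shut down. The real code would use (:kill-ch log) with core.async's alts!!; here a promise stands in for the kill channel, and read-chunk is passed in as a plain function so the sketch is self-contained.

```clojure
(defn find-job-scheduler*
  "Sketch only: :kill-signal is a promise standing in for (:kill-ch log)."
  [{:keys [kill-signal]} read-chunk]
  (loop []
    (if-let [chunk (try (read-chunk)
                        (catch Throwable _ nil))]
      (:job-scheduler chunk)
      ;; Block for up to 500ms; if kill-signal is delivered first,
      ;; give up immediately instead of recurring forever.
      (when (= ::timeout (deref kill-signal 500 ::timeout))
        (recur)))))
```

With a delivered kill-signal the loop returns nil on the first failed read instead of sleeping and retrying, which is exactly the clean-shutdown behavior being asked for.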

I think trying to reboot the peer here is the wrong approach because there are a lot of non-peer users of subscribe-to-log (e.g. the dashboard) which call these functions.

I am open to increasing the sleep time between recurs (this shouldn't happen to peers because they try to write it before they join), as well as decreasing the log level from warn to info, and combining the two log statements.

One additional option is that we could be more picky about what to do on different checked exception types. A node that doesn't exist may require a different reaction than being unable to connect to ZooKeeper or having a broken connection. I think the other changes above are probably enough though.
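That per-exception-type option could be sketched as a small classifier that maps a read-chunk failure to a reaction. This is hypothetical: the real cases would be ZooKeeper exception classes such as KeeperException$NoNodeException and KeeperException$ConnectionLossException, which are replaced here by JDK classes so the sketch stays self-contained.

```clojure
(defn classify-read-failure
  "Hypothetical helper: decide how to react to a failed chunk read.
  FileNotFoundException stands in for a missing ZK node;
  ConnectException stands in for a broken/unreachable connection."
  [^Throwable e]
  (cond
    (instance? java.io.FileNotFoundException e) :await-node ; node doesn't exist (yet)
    (instance? java.net.ConnectException e)     :retry      ; back off and retry
    :else                                       :rethrow))  ; unknown failure: fail fast
```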

@MichaelDrogalis
Contributor

I believe read-chunk returns nil when the ZooKeeper connection closes, so the thread ends up exiting.

@lbradstreet
Member

The cleanup-broken-connections wrapper in read-chunk rethrows all of the exceptions it catches, and passes the rest through, so as far as I can tell it will never clean itself up. This may have changed from the way it used to work at some point.
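A hypothetical sketch of the behavior described above: a wrapper that rethrows recognised connection failures (ConnectException standing in here for ZooKeeper's broken-connection exceptions) and lets every other throwable propagate untouched. Either way the subscriber's (catch Throwable ...) sees an exception, logs, returns nil, and recurs, so the thread never exits on its own.

```clojure
(defn wrap-broken-connections
  "Sketch of the rethrow-everything pattern: catch a recognised
  connection failure, wrap and rethrow it; anything else propagates."
  [f]
  (fn []
    (try (f)
      (catch java.net.ConnectException e
        ;; cleanup would happen here, followed by an unconditional rethrow
        (throw (ex-info "connection broken" {:cause :zk} e))))))
```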

@MichaelDrogalis
Contributor

Ah, seems likely. I definitely remember stepping through this code with JVisualVM at one point to make sure threads weren't leaking, but I guess that was a while back.

@mariusz-jachimowicz-83
Contributor Author

I am working on handling ZK in the dashboard correctly (onyx-platform/onyx-dashboard#63), so I am experimenting now. There are two main situations that I want to handle:

  • lost/reconnected connection
  • node not found (for instance, after a ZK restart)

I need more time to analyze and experiment. I don't want to mess anything up.

@lbradstreet
Member

This is best handled in onyx-dashboard.
