This repository has been archived by the owner on Apr 3, 2019. It is now read-only.

Server hanging at FUTEX_WAIT_PRIVATE #88

Open
kylehuang17 opened this issue Apr 3, 2015 · 4 comments

Comments

@kylehuang17

Hi, I use tornado-redis to asynchronously subscribe to chat messages, but after the server has been running for a while (minutes, sometimes hours or days), it hangs without any error. I attached strace to the process with $ strace -p <pid> and found it stuck in this state: futex(0x7ffff64b3e00, FUTEX_WAIT_PRIVATE, 2, NULL) = 0.
Can anyone tell me what is causing this issue, and how to solve it?
Some others have resolved similar situations by setting a connection timeout...
Would setting the stay_after be useful?

Thanks for your help.

@blodone

blodone commented Jul 1, 2015

I'm having similar problems, but it's not the whole server that hangs, just the Redis subscription: in particular, publishing from Redis to Tornado stops working.

@kylehuang17
Author

My situation is the same: only some of the processes hang. I still haven't found a solution, even after tuning the Linux network kernel parameters and the Redis parameters. So I changed the transport protocol from plain WebSocket to SockJS (which is similar to Socket.IO): it uses WebSocket by default and falls back to comet techniques in network environments where WebSocket doesn't work. Unlike my earlier WebSocket setup, tornado-redis now uses only one socket to subscribe to the messages of all channels (see the sketch below), and the issue never happens. My guess is that something strange happens when too many Redis subscriber clients exist.
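In rough code, the single-connection setup looks something like this sketch. It follows the subscribe/listen pattern from the tornado-redis README; the channel names, the sessions registry, and the session.send fan-out are only illustrative, not part of the library:

```python
import tornado.gen
import tornado.ioloop
import tornadoredis

# One shared subscriber connection for the whole process, instead of one
# tornado-redis client per connected browser.
subscriber = tornadoredis.Client()
subscriber.connect()

# Illustrative registry of connected SockJS/WebSocket sessions per channel.
sessions = {'chat.room1': set(), 'chat.room2': set()}

def fan_out(msg):
    # tornado-redis delivers pub/sub messages with .kind, .channel and .body
    if msg.kind == 'message':
        for session in sessions.get(msg.channel, ()):
            session.send(msg.body)  # hypothetical per-session send method

@tornado.gen.engine
def start_subscriber():
    # Subscribe once, to every channel, over the single connection.
    yield tornado.gen.Task(subscriber.subscribe, list(sessions.keys()))
    subscriber.listen(fan_out)

tornado.ioloop.IOLoop.instance().add_callback(start_subscriber)
```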

@blodone

blodone commented Jul 8, 2015

I use swampdragon, which has tornado-redis as a component but uses SockJS as the protocol.
I created a heartbeat checker that tests publishing through Tornado every 10 seconds, and if publishing stops working it restarts the whole Tornado server. It works, sort of... but I think it should be possible to make this reliable without restarting the whole thing and dealing with re-subscribing every client...
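Roughly, the watchdog looks like the sketch below. This is not the exact code; the heartbeat channel name, the thresholds, and the restart-by-exit strategy (relying on a process supervisor such as supervisord to bring tornado back up) are all assumptions:

```python
import os
import sys
import time

import tornado.gen
import tornado.ioloop
import tornadoredis

HEARTBEAT_CHANNEL = 'heartbeat'   # illustrative channel name
CHECK_INTERVAL_MS = 10 * 1000     # publish and check every 10 seconds
MAX_SILENCE_SECS = 30             # tolerate a couple of missed beats

last_beat = time.time()
publisher = tornadoredis.Client()
publisher.connect()
subscriber = tornadoredis.Client()
subscriber.connect()

def on_heartbeat(msg):
    # Record the time of the last beat that made it through pub/sub.
    global last_beat
    if msg.kind == 'message':
        last_beat = time.time()

@tornado.gen.engine
def start_listener():
    yield tornado.gen.Task(subscriber.subscribe, HEARTBEAT_CHANNEL)
    subscriber.listen(on_heartbeat)

def publish_and_check():
    # Publish a beat; if none has come back through pub/sub for too long,
    # exit and let the process supervisor restart the whole tornado server.
    publisher.publish(HEARTBEAT_CHANNEL, str(time.time()))
    if time.time() - last_beat > MAX_SILENCE_SECS:
        sys.stderr.write('redis pub/sub looks wedged, exiting for restart\n')
        os._exit(1)

start_listener()
tornado.ioloop.PeriodicCallback(publish_and_check, CHECK_INTERVAL_MS).start()
tornado.ioloop.IOLoop.instance().start()
```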

@Tinche

Tinche commented Sep 22, 2015

So we're getting this as well. It seems the tornado-redis client destructor is to blame. I've managed to attach to a blocked process with Pyrasite and get the following stack trace:

['  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main\n    "__main__", fname, loader, pkg_name)\n',
 '  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code\n    exec code in run_globals\n',
 '  File "/root/ngs_webserver/server.py", line 117, in <module>\n    main()\n',
 '  File "/root/ngs_webserver/server.py", line 113, in main\n    tornado.ioloop.IOLoop.instance().start()\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/zmq/eventloop/ioloop.py", line 151, in start\n    super(ZMQIOLoop, self).start()\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornado/ioloop.py", line 781, in start\n    self._run_callback(callback)\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornado/ioloop.py", line 565, in _run_callback\n    ret = callback()\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper\n    return fn(*args, **kwargs)\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 508, in wrapper\n    return callback(*args)\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornado/stack_context.py", line 314, in wrapped\n    ret = fn(*args, **kwargs)\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornado/websocket.py", line 530, in _on_frame_start\n    self.stream.read_bytes(2, self._on_frame_length_16)\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 284, in read_bytes\n    self._try_inline_read()\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 653, in _try_inline_read\n    self._read_from_buffer(pos)\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 720, in _read_from_buffer\n    self._run_read_callback(pos, False)\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 636, in _run_read_callback\n    self._run_callback(callback, self._consume(size))\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 537, in _run_callback\n    self.io_loop.add_callback(wrapper)\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornado/ioloop.py", line 888, in add_callback\n    stack_context.wrap(callback), *args, **kwargs))\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornadoredis/client.py", line 260, in __del__\n    connection.disconnect()\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornadoredis/connection.py", line 97, in disconnect\n    s.close()\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 388, in close\n    self._maybe_run_close_callback()\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 416, in _maybe_run_close_callback\n    self._run_callback(cb)\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 537, in _run_callback\n    self.io_loop.add_callback(wrapper)\n',
 '  File "/root/.venv/local/lib/python2.7/site-packages/tornado/ioloop.py", line 883, in add_callback\n    with self._callback_lock:\n']

So the client is trying to close its connection in its destructor, which gets executed while another callback is being added (and the io loop's non-reentrant callback lock has already been taken), hence the deadlock.
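To make the mechanism concrete, here is a stripped-down illustration (not the actual tornado or tornado-redis code) of why an add_callback re-entered from __del__ blocks forever on the loop's non-reentrant callback lock:

```python
import threading

# Stand-in for IOLoop._callback_lock, which is a plain (non-reentrant) lock.
callback_lock = threading.Lock()

def add_callback():
    with callback_lock:
        # While the lock is held, garbage collection can run the tornado-redis
        # Client.__del__, which calls connection.disconnect() -> stream.close()
        # -> IOLoop.add_callback() on the same loop...
        client_del()

def client_del():
    # ...and that second add_callback tries to take the same non-reentrant
    # lock the caller already holds, so the thread blocks forever.
    with callback_lock:
        pass

# Uncomment to reproduce the hang:
# add_callback()
```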
