Judgehost stuck: deadlock in runguard #3034

Open
@rry-je

Description of the problem

During GCPC, one of our judgehosts got stuck in a deadlock and stopped judging submissions.

Your environment

DOMjudge version: b32fc5c8c5a84160e6e26203228fbbe1ec8444e9 with 028995f9c00e7897ec863283986ef995661e38b9 cherry-picked on top of that
Operating system / Linux distribution and version: Debian GNU/Linux 12 (bookworm), 6.1.0-27-amd64
Webserver: nginx/1.22.1

Steps to reproduce

We don't have a reproducer. The problem is probably timing-dependent.

Expected behaviour

The judgehost should judge the submission and continue judging afterwards.

Actual behaviour

The judgehost stops judging and the web interface shows a warning.

Any other information that you want to share?

Here's a stacktrace from the runguard process:

(gdb) bt
#0  0x00007fa4dc0a50d6 in __lll_lock_wait_private () from target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fa4dc07da45 in ?? () from target:/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fa4dc135a7f in __fprintf_chk () from target:/lib/x86_64-linux-gnu/libc.so.6
#3  0x000055cc552c6621 in fprintf (__fmt=0x55cc552cb004 "%s: warning: ", __stream=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/stdio2.h:79
#4  warning (format=format@entry=0x55cc552cbc08 "timelimit exceeded (hard wall time): aborting command") at runguard.cc:203
#5  0x000055cc552c6d8d in terminate (sig=14) at runguard.cc:693
#6  <signal handler called>
#7  0x00007fa4dc07d9ff in ?? () from target:/lib/x86_64-linux-gnu/libc.so.6
#8  0x00007fa4dc135a7f in __fprintf_chk () from target:/lib/x86_64-linux-gnu/libc.so.6
#9  0x000055cc552c6621 in fprintf (__fmt=0x55cc552cb004 "%s: warning: ", __stream=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/stdio2.h:79
#10 warning (format=format@entry=0x55cc552ccf28 "timelimit exceeded (hard cpu time)") at runguard.cc:203
#11 0x000055cc552c62c1 in main (argc=<optimized out>, argv=<optimized out>) at runguard.cc:1517 

It seems that the signal handler terminate() for SIGALRM called fprintf() (via warning()) while main was itself in the middle of an fprintf() call (frames #7 to #10 in the trace). fprintf() seems to use locking, and it is not async-signal-safe. Because the handler runs on top of the interrupted main, and main cannot resume until the handler returns, the lock is never released and the process is stuck.

I think fixing this would require (at least) removing all output from the signal handler.
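
To illustrate what "no output from the signal handler" could look like, here is a minimal sketch, not a patch against runguard.cc. The names received_signal, record_signal(), install_handler() and report_pending_signal() are hypothetical and do not exist in runguard. The handler only stores the signal number in a volatile sig_atomic_t flag, and the message is printed later from normal context, where calling fprintf() is safe. Whatever else the real terminate() does is out of scope here.

```cpp
#include <csignal>
#include <cstdio>
#include <signal.h>

// Hypothetical flag, not taken from runguard.cc.
// volatile sig_atomic_t is the type that may be written from a handler.
static volatile std::sig_atomic_t received_signal = 0;

// The handler does no stdio at all: it only records which signal arrived.
extern "C" void record_signal(int sig)
{
    received_signal = sig;
}

// Install record_signal() for the given signal; returns true on success.
bool install_handler(int sig)
{
    struct sigaction sa {};
    sa.sa_handler = record_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(sig, &sa, nullptr) == 0;
}

// Called from normal (non-handler) context, e.g. the main loop.
// Prints the warning there and returns the signal seen, or 0 if none.
// Note: a signal arriving between the read and the reset below would be
// lost; a real fix would need to block signals around this or accept that.
int report_pending_signal()
{
    int sig = received_signal;
    if (sig == 0) {
        return 0;
    }
    received_signal = 0;
    if (sig == SIGALRM) {
        std::fprintf(stderr, "warning: timelimit exceeded (hard wall time): aborting command\n");
    } else {
        std::fprintf(stderr, "warning: received signal %d: aborting command\n", sig);
    }
    return sig;
}
```

If a message really has to be emitted from inside the handler, write(2) on STDERR_FILENO with a fixed string is on the POSIX list of async-signal-safe functions, whereas fprintf() is not.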
