Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

launch command worker earlier #470

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sni
Copy link
Contributor

@sni sni commented Jul 4, 2024

since the command worker forks from the main naemon process, it inherits all open files like ex.: pidfile, logfiles, etc... It will keep those references open, even if the main process rotates and reopens those files.

This patch closes query handler and pid file references after starting the command worker and also moves starting the command worker before initializing the neb modules, so it won't inherit open logfiles from neb modules.

references:

since the command worker forks the main naemon process, it inherits all open
files like ex.: pidfile, logfiles, etc... It will keep those references open, even
if the main process rotates and reopens those files.

This patch closes query handler and pid file references after starting the
command worker and also moves starting the command worker before initializing
the neb modules, so it won't inherit open logfiles from neb modules.

references:

- ConSol-Monitoring/omd#146

Signed-off-by: Sven Nierlein <[email protected]>
@sni sni force-pushed the launch_command_worker_earlier branch from ddc4d76 to e842ca7 Compare July 5, 2024 09:09
@nook24
Copy link
Member

nook24 commented Jul 7, 2024

The only issue I see with this is that the command file worker completely loses the ability to communicate. Currently, the command file worker is able to write to the naemon.log until this line:

free_memory(get_global_macros());

After that, it can only write to stdout to log any errors. Not optimal, but better than nothing, I guess.

Due to the change, the command file worker can't even write to stdout anymore, so any error messages are completely gone. The static int command_file_worker method makes some calls to nm_log, which now get completely blackholed.

This change also introduces a sort of "breaking change". Currently, whenever Naemon logs the message
Successfully launched command file worker with pid 12135, we know that Naemon has done all initial processing and is up and running.

I personal only use this when working or debugging Naemon directly, but it turns out that some scripts rely on this. For example, this one which is part of our repository:

def naemon_started_and_ready(context, timeout_s):
timeout_s = int(timeout_s or 30)
timeout = time.monotonic() + timeout_s
ready = False
while not ready and time.monotonic() < timeout:
try:
log = slurp_file('naemon.log')
# When we see this line in the log, we'll wait 1 more second and
# then Naemon should be ready, with signal handlers setup so that a
# test can SIGTERM it.
if 'Successfully launched command file worker with pid' in log:
ready = True
time.sleep(1)
break
except OSError:
pass
time.sleep(1)
assert ready, "Naemon did not start within %d seconds" % timeout_s

I think we should add a new message to the log that clearly indicates that Naemon is ready. Something like

Naemon initialization complete. Monitoring initiated.

@sni
Copy link
Contributor Author

sni commented Jul 7, 2024

i am not to happy yet either, maybe we keep the order as it was but send a USR1 signal to the command worker so it can rotate/reopen the logfile as well.

@sni
Copy link
Contributor Author

sni commented Jul 8, 2024

btw, the message Successfully launched command file worker with is logged from the main naemon process and should still be printed.

@nook24
Copy link
Member

nook24 commented Jul 8, 2024

maybe we keep the order as it was but send a USR1 signal to the command worker so it can rotate/reopen the logfile as well.

Currently the command file worker can only write to stdout. I guess this is due to it's a fork() and could otherwise break the log file if multiple processes wring to the log at the same time.

I think this is also the reason why the nm_core_worker is using printf instead of nm_log

btw, the message Successfully launched command file worker with is logged from the main naemon process and should still be printed.

Yes, this message gets still logged as before, but due due it is now before the broker modules get loaded, it does not indicate that Naemon is ready anymore.
To be fair, this message should never indicate that Naemon is ready so I think we should not be too concerned about this. But I would still recommend to add a new log message to make it possible to detect if Naemon is ready.

@sni sni added the WIP label Jul 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants