SSSD Not Restarted After Being Killed By Watchdog #7838

Open
markandrewj opened this issue Feb 12, 2025 · 2 comments

Comments


markandrewj commented Feb 12, 2025

I am investigating an ongoing issue. We are also working with vendor support, but we have not yet been able to find a solution.

SSSD Version: 2.9.4
OS: RHEL 7/8/9

SSSD is connected upstream to a Red Hat IdM (FreeIPA) cluster.

There seem to be two related issues.

  1. SSSD is being killed by the watchdog. We suspect that external load from backups is causing this, but we have not confirmed it.
  2. SSSD is not restarted after being killed by the watchdog.

When this happens, users become unable to log in via SSH. We have tried the following to resolve the issue, but SSSD continues to be killed by the watchdog without being restarted.

  • Upgrading SSSD to the latest version available for RHEL.
  • Increasing SSSD timeout.
  • Adding 'Restart=on-failure' to the SSSD systemd unit.
  • Looking for SELinux alerts and setting SELinux to permissive.
  • Disabling third-party security services.
  • Validating the configs.
  • Reviewing relevant logs.

As a temporary fix we added a cron job to restart the service, but this does not work reliably. I can collect logs or configs on request to further this investigation. I am seeking feedback on known issues, or on ways I can continue to look for the root cause.

Thank you in advance.

Member

alexey-tikhonov commented Feb 25, 2025

Enable 'debug_level = 9' in all relevant sections of 'sssd.conf' (the main '[sssd]' section and the sections of the components that are being terminated by the watchdog).
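
To double-check which sections actually carry the setting, something like the following rough sketch could help. It is not part of SSSD; the function name `sections_missing_debug` is made up for illustration, and it assumes the file parses as plain INI and that the level is written as a decimal number. A misspelled option name would show up here as a section without 'debug_level'.

```python
import configparser


def sections_missing_debug(conf_text, wanted_level=9):
    """Return the section names whose debug_level is absent or differs
    from wanted_level. Hypothetical helper, simple string comparison only."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)

    missing = []
    for section in parser.sections():
        # fallback=None lets us tell "not set" apart from "set to another value"
        level = parser.get(section, "debug_level", fallback=None)
        if level is None or level.strip() != str(wanted_level):
            missing.append(section)
    return missing
```

It could be fed the contents of the config file (reading it typically requires root), for example `sections_missing_debug(open("/etc/sssd/sssd.conf").read())`.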

Then inspect the logs to figure out what happens around the "Child [...] (...) was terminated by own WATCHDOG" message in sssd.log, both to understand what the component was blocked on and why it failed to restart.
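
With debug level 9 the logs get large, so a small script can help pull out the lines around each watchdog message. This is only a sketch: the regular expression is built from the message text quoted above, the exact layout inside the brackets and parentheses is an assumption, and `watchdog_context` is a hypothetical helper name.

```python
import re

# Pattern derived from the message quoted above; the contents of the
# parentheses are treated as opaque text (assumption about the format).
WATCHDOG_RE = re.compile(
    r"Child \[(?P<pid>\d+)\] \((?P<name>[^)]*)\) was terminated by own WATCHDOG"
)


def watchdog_context(lines, before=20, after=20):
    """Return one dict per watchdog message, holding the child pid, the
    child name as logged, and the surrounding log lines."""
    lines = list(lines)
    hits = []
    for i, line in enumerate(lines):
        match = WATCHDOG_RE.search(line)
        if match:
            start = max(0, i - before)
            hits.append({
                "pid": int(match["pid"]),
                "name": match["name"],
                "context": lines[start:i + after + 1],
            })
    return hits
```

The context taken from sssd.log shows what the monitor did after the termination; the log of the terminated component itself, around the same timestamp, is where to look for what it was blocked on.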

> Increasing SSSD timeout.

Where did you put the 'timeout' option?


wjcbsr commented Feb 25, 2025

We've already tried adding the "timeout" option to EVERY section, including the main [sssd] section. It was of no value whatsoever, either for examining the logs or for rectifying the problem.

[domain/corp.ads]
sudo_provider = ipa
timeout = 30
debug_level = 9
cache_credentials = True
krb5_store_password_if_offline = True
ipa_domain = corp.ads
id_provider = ipa
auth_provider = ipa
access_provider = ipa
ipa_hostname = bberesna-rhl9.corp.ads
chpass_provider = ipa
ldap_tls_cacert = /etc/ipa/ca.crt
ipa_server = abc.corp.ads,def.corp.ads,ghi.corp.ads

[sssd]
services = nss, sudo, pam, ssh
domains = corp.ads
config_file_version = 9
debug_level = 2
timeout = 30

[nss]
debug_level = 9
homedir_substring = /home
timeout = 30

[pam]
debug_level = 9
timeout = 30

[sudo]
debug_level = 9
timeout = 30

[autofs]
debug_level = 9
timeout = 30

[ssh]
debug_level = 9
timeout = 30

[pac]
debug_level = 9
timeout = 30

[ifp]
debug_level = 9
timeout = 30
