LDAP purge operation (cleanup_groups()) times out, so that sssd_be is terminated by internal watchdog #7851

Open
marco-kusa opened this issue Feb 24, 2025 · 12 comments

@marco-kusa

This is related to #7793, as discussed with Alexey Tikhonov. It was observed with the latest patched version (v3) for that bug.

Steps to reproduce:

  • grow database to >100MB (550MB in our case) with enough groups/users
  • wait for entry_cache_timeout, entry_cache_user_timeout and entry_cache_group_timeout to expire the entries
  • set ldap_purge_cache_timeout to a low value to trigger a db purge event
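
The reproduction steps above could be set up with a domain section along these lines. This is only a sketch: the section name is assumed from the domain shown in the log line below, and every value is a hypothetical placeholder rather than the reporter's actual configuration.

```ini
# Hypothetical values, for illustration only.
[domain/ad.dneg.com]
# Short expiry so that cached users and groups go stale quickly.
entry_cache_timeout = 60
entry_cache_user_timeout = 60
entry_cache_group_timeout = 60
# How often (in seconds) the cache cleanup task runs; 0 disables it.
ldap_purge_cache_timeout = 120
```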

Observed behaviour:

  • Each group in the db goes through a search operation:

[be[ad.dneg.com]] [cleanup_groups] (0x1000): Searching with: ...

each search operation takes 1/2 seconds

  • the operation never completes and sssd_be crashes / is restarted
  • the purge operation starts again, and it repeats

This is especially bad because, even if it didn't crash, the purge operation seems to be blocking any request to sssd?

I will open a case on the RH support portal and upload the related backend level 9 logs.

Thanks

@marco-kusa
Author

marco-kusa commented Feb 24, 2025

I should add that sssd_be sits at 100% CPU on a single thread while doing the searches, and all commands that depend on sssd are blocked (e.g. ps).

@alexey-tikhonov
Member

> and sssd_be crashes / is restarted

Is there a "Child [...] ('...':'...') was terminated by own WATCHDOG" message in /var/log/sssd.log and in the system journal that corresponds to this moment?

@alexey-tikhonov
Member

>   • Each group in the db goes through a search operation:
>     [be[ad.dneg.com]] [cleanup_groups] (0x1000): Searching with: ...
>     each search operation takes 1/2 seconds

Well, this again uses "(%s=%s)", SYSDB_MEMBEROF, ... search:

subfilter = talloc_asprintf(tmpctx, "(%s=%s)", SYSDB_MEMBEROF,

Probably deref (sysdb_asq_search()) can be used here as well...
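
To make the per-group pattern concrete, here is a minimal standalone sketch of building one "(%s=%s)" subfilter per group. It is not the SSSD source: plain snprintf() stands in for talloc_asprintf(), the expansion of SYSDB_MEMBEROF is assumed to be "memberOf", and build_memberof_filter() and count_searches() are hypothetical helper names. The point is only that the number of filters, and so the number of cache searches, grows with the number of groups.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed value: the thread shows only the macro name, not its expansion. */
#define SYSDB_MEMBEROF "memberOf"

/* Build the per-group subfilter "(attr=dn)" into buf.
 * Returns 0 on success, -1 if buf is too small or formatting fails. */
static int build_memberof_filter(char *buf, size_t len, const char *group_dn)
{
    assert(buf != NULL && group_dn != NULL);
    int n = snprintf(buf, len, "(%s=%s)", SYSDB_MEMBEROF, group_dn);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Build one filter per non-empty group DN and return how many were built.
 * In a real cleanup each filter would drive one cache search; this
 * sketch only counts them. */
static size_t count_searches(const char *const *group_dns, size_t ngroups)
{
    char filter[512];
    size_t searches = 0;

    for (size_t i = 0; i < ngroups; i++) {
        if (strlen(group_dns[i]) == 0) {
            continue; /* nothing to search for */
        }
        if (build_memberof_filter(filter, sizeof(filter), group_dns[i]) == 0) {
            searches++;
        }
    }
    return searches;
}
```

With one search per group, any per-search cost is multiplied by the group count, which is why a single query that resolves members by dereference looks attractive here.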

@marco-kusa
Author

marco-kusa commented Feb 26, 2025

any idea of why:

  • it crashes the process after some time
  • it blocks everything that requests data to sssd while it runs?

Thanks

ps. you can get full logs in the RH Case 04069463

@alexey-tikhonov
Member

> any idea of why:
> * it crashes the process after some time

See #7851 (comment)
If I guess right, it doesn't crash but is terminated by the internal watchdog.

> * it blocks everything that requests data to sssd while it runs?

Because cleanup_groups() is one blocking operation. I guess the original author(s) didn't expect it to run for so long.

@alexey-tikhonov
Member

alexey-tikhonov commented Feb 26, 2025

> ps. you can get full logs in the RH Case 04069463

Right, this is it - internal watchdog.

The timeout already seems to be large, but if you increase it by some margin, maybe 5 to 10%, the job should be able to finish.
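
The arithmetic behind that suggestion can be sketched as a toy model. This is not SSSD's watchdog implementation: the function names are hypothetical, the 500 ms per-search figure used in the example afterwards is just one reading of "1/2 seconds" from the report, and the search count and budget are made-up numbers. The model assumes only that a blocking job cannot check in with the watchdog until it returns.

```c
#include <assert.h>
#include <stdbool.h>

/* Total time, in milliseconds, that a blocking job of `searches`
 * sequential searches needs when each one takes `per_search_ms`. */
static long long job_duration_ms(long long searches, long long per_search_ms)
{
    assert(searches >= 0 && per_search_ms >= 0);
    return searches * per_search_ms;
}

/* A blocking job cannot check in with the watchdog until it returns,
 * so it survives only if its whole duration fits inside the budget. */
static bool job_survives(long long duration_ms, long long budget_ms)
{
    return duration_ms <= budget_ms;
}

/* Budget after raising it by `percent` percent (integer arithmetic). */
static long long raise_budget_ms(long long budget_ms, int percent)
{
    return budget_ms + (budget_ms * percent) / 100;
}
```

For example, 2100 searches at 500 ms each need 1,050,000 ms. Under a hypothetical 1,000,000 ms budget the job is killed before it finishes, while the same budget raised by 10% (1,100,000 ms) lets it complete. A small margin only helps when the job was already close to finishing, and the margin has to grow as the number of groups grows.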

@marco-kusa
Author

Right, so since we have control over the timeout, and we can get a fix for the db read speed backported as in the other issue, we should be good?

@alexey-tikhonov
Member

The issue is clear, but the patches to fix it are yet to be written (so there is nothing to backport at the moment).
It's similar to #7793, but the fix won't be exactly the same.

But if I understand correctly, you were using the purge to work around #7793, so hopefully it is not that critical for you once lookup latency is improved.

@alexey-tikhonov changed the title from "ldap_purge_cache_timeout crashes sssd_be (AD) with large db sizes (and loops)" to "LDAP purge operation (cleanup_groups()) times out, so that sssd_be is terminated by internal watchdog" on Feb 27, 2025
@marco-kusa
Author

Ideally we'd be able to run the purge, but yes, our priority is to get #7793 fixed. Does the backport mean that we'll be able to get #7793 fixed in RHEL 8.10 too? Thanks

@alexey-tikhonov
Member

alexey-tikhonov commented Feb 27, 2025

> does the backport mean that we'll be able to get #7793 fixed in RHEL 8.10 too?

What do you mean by "the backport" in this context?

Anyway, I can't give you any hard promises wrt specific product versions in general, and definitely not now, when the patches haven't even been reviewed upstream. But if the patches get accepted and no regressions are found, we'll certainly try our best to deliver them.

@marco-kusa
Author

marco-kusa commented Feb 27, 2025 via email

@marco-kusa
Author

Alexei,

If improving the performance of the purge is so problematic, it's acceptable for us to clear the database completely, but we don't want a hard service restart that would cause any requests made while the service is down to fail.

Essentially, if there were a database reset option that simply purged the whole database with a locking operation (like the current purge) and just kept the clients waiting while the db is reset, that would be fine.

Thanks
