LDAP purge operation (`cleanup_groups()`) times out, so that `sssd_be` is terminated by internal watchdog #7851

marco-kusa · 2025-02-24T22:08:27Z

This is related to #7793, as discussed with Alexey Tikhonov, using the latest patched version (v3) for that bug.

Steps to reproduce:

* grow the database to >100MB (550MB in our case) with enough groups/users
* wait for `entry_cache_timeout`, `entry_cache_user_timeout` and `entry_cache_group_timeout` to expire the entries
* set `ldap_purge_cache_timeout` to a low value to trigger a db purge event
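
As a rough illustration of the steps above, the relevant domain section of sssd.conf might look like the fragment below. The domain name is taken from the log line in this report; every numeric value is a hypothetical placeholder chosen only to make entries expire and the purge run quickly, not the reporter's actual configuration.

```ini
[domain/ad.dneg.com]
# Hypothetical short lifetimes so cached entries expire quickly
entry_cache_timeout = 60
entry_cache_user_timeout = 60
entry_cache_group_timeout = 60
# Hypothetical low value (seconds) so the cleanup task runs often
ldap_purge_cache_timeout = 300
```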

Observed behaviour:

Each group in the db goes through a search operation:

`[be[ad.dneg.com]] [cleanup_groups] (0x1000): Searching with: ...`

* each search operation takes 1/2 seconds
* the operation never completes and sssd_be crashes / is restarted
* the purge operation starts again, and it repeats

This is especially bad since, even if it didn't crash, the purge operation seems to block any request to sssd?

I will open a case on the RH support portal and upload the related backend logs (debug level 9).

Thanks


marco-kusa · 2025-02-24T22:14:33Z

I should add that sssd_be sits at 100% CPU on a single thread while doing the searches, and all commands that depend on sssd are blocked (e.g. `ps`).

alexey-tikhonov · 2025-02-26T15:04:56Z

> and sssd_be crashes / is restarted

Is there a "Child [...] ('...':'...') was terminated by own WATCHDOG" message in /var/log/sssd/sssd.log and in the system journal that corresponds to this moment?

alexey-tikhonov · 2025-02-26T15:15:44Z

> Each group in the db goes through a search operation:
> [be[ad.dneg.com]] [cleanup_groups] (0x1000): Searching with: ...
> each search operation takes 1/2 seconds

Well, this again uses "(%s=%s)", SYSDB_MEMBEROF, ... search:

sssd/src/providers/ldap/ldap_id_cleanup.c

Line 467 in e2408c2

subfilter = talloc_asprintf(tmpctx, "(%s=%s)", SYSDB_MEMBEROF,

Probably deref (sysdb_asq_search()) can be used here as well...

marco-kusa · 2025-02-26T15:27:07Z

Any idea why:

* it crashes the process after some time
* it blocks everything that requests data from sssd while it runs?

Thanks

P.S. You can get the full logs in RH Case 04069463.

alexey-tikhonov · 2025-02-26T15:33:08Z

> any idea of why:
> * it crashes the process after some time

See #7851 (comment)
If I guess right, it doesn't crash but is terminated by the internal watchdog.

> * it blocks everything that requests data from sssd while it runs?

Because `cleanup_groups()` is a single blocking operation. I guess the original author(s) didn't expect it to run for so long.

alexey-tikhonov · 2025-02-26T15:43:12Z

> ps. you can get full logs in the RH Case 04069463

Right, this is it: the internal watchdog.

The timeout seems to be big already, but if you increase it by some margin, maybe +5..10%, the job should be able to finish.
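
For illustration, a minimal sketch of that adjustment, assuming the timeout in question is the `timeout` option (seconds between internal heartbeats) in the domain section of sssd.conf. The domain name is taken from the log line quoted earlier; both numbers are hypothetical, since the currently configured value is not stated in this thread.

```ini
[domain/ad.dneg.com]
# Hypothetical: if the current value were 300 seconds,
# raising it by 10% would give 330.
timeout = 330
```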

marco-kusa · 2025-02-26T23:46:23Z

Right, so since we have control over the timeout, and we can get a fix for the db read speed backported like in the other issue, we should be good?

alexey-tikhonov · 2025-02-27T07:55:37Z

The issue is clear, but patches to fix it are yet to be written (so there is nothing to backport at the moment).
It's similar to #7793, but the fix won't be exactly the same.

But if I understand correctly, you were using the purge to work around #7793, so hopefully it is not that critical for you once lookup latency is improved.

marco-kusa · 2025-02-27T12:31:34Z

Ideally we'd be able to run the purge, but yes, our priority is to get #7793 fixed. Does the backport mean that we'll be able to get #7793 fixed in RHEL 8.10 too? Thanks

alexey-tikhonov · 2025-02-27T13:19:17Z

> does the backport mean that we'll be able to get #7793 fixed in RHEL 8.10 too?

What do you mean saying "the backport" in this context?

Anyway, I can't give you any hard promises wrt specific product versions in general. And definitely not now, when the patches haven't even been reviewed upstream. But if the patches get accepted and no regressions are found, sure thing, we'll try our best to deliver it.

marco-kusa · 2025-02-27T16:19:29Z

Alright, thanks, appreciate it!


marco-kusa · 2025-03-14T14:38:06Z

Alexey,

If improving the performance of the purge is that problematic, it's acceptable for us to clear the database completely. What we don't want is a hard service restart, which would cause any requests made while the service is down to fail.

Essentially, a database reset option that purged the whole database as a locking operation (like the current purge), and simply kept the clients waiting while the db is reset, would be fine.

Thanks

marco-kusa mentioned this issue Feb 25, 2025

Disk cache failure with large db sizes #7793

Closed

alexey-tikhonov changed the title ~~ldap_purge_cache_timeout crashes sssd_be (AD) with large db sizes (and loops)~~ LDAP purge operation (cleanup_groups()) times out, so that sssd_be is terminated by internal watchdog Feb 27, 2025
