Created attachment 1928235
debug sssd logs

Description of problem:
After severe load on the system, including an OOM state, sssd's D-Bus loses the ability to talk with its data providers:

* ... skipping repetitive backtrace ...
(2022-11-26 12:18:08): [be[default]] [dp_client_handshake_timeout] (0x0040): Client [sssd.pam] timed out before identification [0x55cc9111b620]!
* ... skipping repetitive backtrace ...
(2022-11-26 12:18:08): [be[default]] [dp_client_handshake_timeout] (0x0040): Client [sssd.nss] timed out before identification [0x55cc91118770]!

This state remains until the sssd service is restarted.

Version-Release number of selected component (if applicable):
RHEL 8.6, sssd-2.6.2-4.el8_6.x86_64

How reproducible:
Could not reproduce the state in the lab. In the client's environment, it happens every 24 hours while a custom cron script is running.

Steps to Reproduce:
1.
2.
3.

Actual results:
SSSD does not recover automatically.

Expected results:
SSSD continues to work as expected once the system is out of heavy load and/or OOM.

Additional info:
Debug logs are attached (nss trimmed to the last 2000 lines; full logs are in the case). The issue happened around 5:39-5:41 AM. It is difficult to narrow this down further because the system appears to have hung at the time. Additional data and a sosreport are in the attached case.
Upstream PR: https://github.com/SSSD/sssd/pull/6804
Pushed PR: https://github.com/SSSD/sssd/pull/6804

* `master`
    * cca9361d92501e0be34d264d370fe897a0c970af - sbus: arm watchdog for sbus_connect_init_send()
    * 75f2b35ad3b9256de905d05c5108400d35688554 - watchdog: add arm_watchdog() and disarm_watchdog() calls
* `sssd-2-8`
    * 55564defec8fdbb4d9df6b0124a8b18b31743230 - sbus: arm watchdog for sbus_connect_init_send()
    * 2cd5a6a2c8fd1826177d6bb51e7d4f4ad368bcfb - watchdog: add arm_watchdog() and disarm_watchdog() calls
* `sssd-2-9`
    * 27987c791bc452f53696a3a33f0d607ab040e78d - sbus: arm watchdog for sbus_connect_init_send()
    * f16e570838d1c6cd30b5883f364b0f437c314b1f - watchdog: add arm_watchdog() and disarm_watchdog() calls
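
For readers unfamiliar with the pattern named in the commit titles, here is a minimal, generic sketch of arming a watchdog around a connection step and disarming it afterwards. It is written in Python for brevity and is not SSSD's code: the `Watchdog` class, `guarded_connect()`, and the `on_expire` callback are hypothetical names, and the real `arm_watchdog()` / `disarm_watchdog()` implementations may work quite differently.

```python
import threading


class Watchdog:
    """Hypothetical watchdog: runs on_expire unless disarmed within timeout seconds."""

    def __init__(self, timeout, on_expire):
        self._timeout = timeout
        self._on_expire = on_expire
        self._timer = None

    def arm(self):
        # Re-arming replaces any timer that is still pending.
        self.disarm()
        self._timer = threading.Timer(self._timeout, self._on_expire)
        self._timer.daemon = True
        self._timer.start()

    def disarm(self):
        # Cancelling a timer that has already fired is harmless.
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None


def guarded_connect(connect, watchdog):
    """Run a connection step (hypothetical callable) under the watchdog."""
    watchdog.arm()
    try:
        return connect()
    finally:
        watchdog.disarm()
```

In this sketch, a connection step that finishes in time leaves no trace, while one that stalls triggers `on_expire`. A service would presumably use that hook to recover (for example by exiting so that a supervisor restarts it) instead of staying stuck until a manual restart.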