Bug 2218858

Summary: [sssd] SSSD enters failed state after heavy load in the system
Product: Red Hat Enterprise Linux 9
Component: sssd
Version: 9.2
Reporter: Alexey Tikhonov <atikhono>
Assignee: Sumit Bose <sbose>
QA Contact: shridhar <sgadekar>
Status: VERIFIED
Severity: high
Priority: unspecified
Target Milestone: rc
Keywords: Triaged, ZStream
CC: pbrezina, sgadekar, tscherf
Hardware: Unspecified
OS: Unspecified
Whiteboard: sync-to-jira
Fixed In Version: sssd-2.9.1-2.el9
Type: Bug
Clones: 2219353, 2219354
Bug Blocks: 2219354, 2219353

Description Alexey Tikhonov 2023-06-30 09:48:42 UTC
This bug was initially created as a copy of Bug #2149241

I am copying this bug because: to track fix for RHEL9



Description of problem:
After severe load on the system, including an OOM state, SSSD's internal D-Bus (sbus) loses the ability to talk to its data providers:

   *  ... skipping repetitive backtrace ...
(2022-11-26 12:18:08): [be[default]] [dp_client_handshake_timeout] (0x0040): Client [sssd.pam] timed out before identification [0x55cc9111b620]!
   *  ... skipping repetitive backtrace ...
(2022-11-26 12:18:08): [be[default]] [dp_client_handshake_timeout] (0x0040): Client [sssd.nss] timed out before identification [0x55cc91118770]!

This state remains until the sssd service is restarted.


Version-Release number of selected component (if applicable):
RHEL 8.6, sssd-2.6.2-4.el8_6.x86_64

How reproducible:
Could not reproduce the state in the lab. In the customer's environment it happens every 24 hours, when a custom cron script runs.

Steps to Reproduce:
1.
2.
3.

Actual results:

SSSD does not recover automatically.

Expected results:

SSSD continues to work as expected once the system is out of heavy load and/or the OOM condition.

Additional info:

Debug logs attached (nss log trimmed to the last 2000 lines; full logs are in the case). The issue happened around 5:39-5:41 AM; it is difficult to pin down more precisely because the system appears to have hung at the time.

Additional data and sosreport are in the attached case.

Comment 1 Alexey Tikhonov 2023-06-30 09:52:02 UTC
Upstream PR: https://github.com/SSSD/sssd/pull/6804

Comment 4 Alexey Tikhonov 2023-07-04 13:39:42 UTC
Pushed PR: https://github.com/SSSD/sssd/pull/6804

* `master`
    * cca9361d92501e0be34d264d370fe897a0c970af - sbus: arm watchdog for sbus_connect_init_send()
    * 75f2b35ad3b9256de905d05c5108400d35688554 - watchdog: add arm_watchdog() and disarm_watchdog() calls
* `sssd-2-8`
    * 55564defec8fdbb4d9df6b0124a8b18b31743230 - sbus: arm watchdog for sbus_connect_init_send()
    * 2cd5a6a2c8fd1826177d6bb51e7d4f4ad368bcfb - watchdog: add arm_watchdog() and disarm_watchdog() calls
* `sssd-2-9`
    * 27987c791bc452f53696a3a33f0d607ab040e78d - sbus: arm watchdog for sbus_connect_init_send()
    * f16e570838d1c6cd30b5883f364b0f437c314b1f - watchdog: add arm_watchdog() and disarm_watchdog() calls