Bug 2218858 - [sssd] SSSD enters failed state after heavy load in the system
Summary: [sssd] SSSD enters failed state after heavy load in the system
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: sssd
Version: 9.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Sumit Bose
QA Contact: shridhar
URL:
Whiteboard: sync-to-jira
Depends On:
Blocks: 2219354 2219353
 
Reported: 2023-06-30 09:48 UTC by Alexey Tikhonov
Modified: 2023-07-19 12:18 UTC
CC: 3 users

Fixed In Version: sssd-2.9.1-2.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2219353 2219354
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github SSSD sssd issues 6803 0 None open [sssd] SSSD enters failed state after heavy load in the system 2023-06-30 09:54:37 UTC
Github SSSD sssd pull 6804 0 None open sbus: arm watchdog for sbus_connect_init_send() 2023-06-30 09:54:37 UTC
Red Hat Issue Tracker RHELPLAN-161311 0 None None None 2023-06-30 09:49:35 UTC
Red Hat Issue Tracker SSSD-6381 0 None None None 2023-06-30 09:51:43 UTC

Description Alexey Tikhonov 2023-06-30 09:48:42 UTC
This bug was initially created as a copy of Bug #2149241

I am copying this bug because: to track the fix for RHEL 9



Description of problem:
After severe load on the system, including an OOM state, SSSD's D-Bus layer loses the ability to talk to its data providers:

   *  ... skipping repetitive backtrace ...
(2022-11-26 12:18:08): [be[default]] [dp_client_handshake_timeout] (0x0040): Client [sssd.pam] timed out before identification [0x55cc9111b620]!
   *  ... skipping repetitive backtrace ...
(2022-11-26 12:18:08): [be[default]] [dp_client_handshake_timeout] (0x0040): Client [sssd.nss] timed out before identification [0x55cc91118770]!

This state remains until the sssd service is restarted.
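
For context on the messages above: each client that connects to the backend's private bus must complete an identification handshake within a fixed window, and dp_client_handshake_timeout fires when it does not. Below is an illustrative sketch of that timeout pattern using tevent, the event library SSSD is built on; the struct, function names, and the one-second window are assumptions for illustration, not SSSD's actual code.

```c
/* Illustrative sketch only, NOT SSSD's code: the "timed out before
 * identification" pattern. When a client connects to the backend's private
 * bus, a timer is armed; if the client has not identified itself before the
 * timer fires, the connection is dropped and a message like the ones above
 * is logged. Names and the one-second window are assumptions.
 *
 * Build (assuming libtevent/libtalloc devel packages):
 *   gcc sketch.c -o sketch $(pkg-config --cflags --libs tevent talloc)
 */
#include <stdbool.h>
#include <stdio.h>
#include <talloc.h>
#include <tevent.h>

struct client_state {
    const char *name;       /* e.g. "sssd.pam" or "sssd.nss" */
    bool identified;        /* set once the handshake completes */
};

static void handshake_timeout(struct tevent_context *ev,
                              struct tevent_timer *te,
                              struct timeval now,
                              void *pvt)
{
    struct client_state *client = pvt;

    if (!client->identified) {
        fprintf(stderr, "Client [%s] timed out before identification!\n",
                client->name);
        /* Real code would close the client connection here. */
    }
}

int main(void)
{
    TALLOC_CTX *mem_ctx = talloc_new(NULL);
    struct tevent_context *ev = tevent_context_init(mem_ctx);
    struct client_state client = { .name = "sssd.pam", .identified = false };

    /* Arm a one-second identification window for the new client. */
    tevent_add_timer(ev, mem_ctx, tevent_timeval_current_ofs(1, 0),
                     handshake_timeout, &client);

    tevent_loop_once(ev);   /* nothing identifies, so the timer fires */

    talloc_free(mem_ctx);
    return 0;
}
```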


Version-Release number of selected component (if applicable):
RHEL 8.6, sssd-2.6.2-4.el8_6.x86_64

How reproducible:
Could not reproduce the state in the lab. In the customer's environment it happens every 24 hours, when a custom cron script runs.

Steps to Reproduce:
1.
2.
3.

Actual results:

SSSD does not recover automatically; the service must be restarted manually.

Expected results:

SSSD recovers and continues to work as expected once the system is no longer under heavy load and/or OOM pressure.

Additional info:

Debug logs are attached (the nss log is trimmed to the last 2000 lines; full logs are in the case). The issue happened around 5:39-5:41 AM; it is difficult to pinpoint more precisely because the system appears to have been hung at the time.

Additional data and a sosreport are available in the attached case.

Comment 1 Alexey Tikhonov 2023-06-30 09:52:02 UTC
Upstream PR: https://github.com/SSSD/sssd/pull/6804

Comment 4 Alexey Tikhonov 2023-07-04 13:39:42 UTC
Pushed PR: https://github.com/SSSD/sssd/pull/6804

* `master`
    * cca9361d92501e0be34d264d370fe897a0c970af - sbus: arm watchdog for sbus_connect_init_send()
    * 75f2b35ad3b9256de905d05c5108400d35688554 - watchdog: add arm_watchdog() and disarm_watchdog() calls
* `sssd-2-8`
    * 55564defec8fdbb4d9df6b0124a8b18b31743230 - sbus: arm watchdog for sbus_connect_init_send()
    * 2cd5a6a2c8fd1826177d6bb51e7d4f4ad368bcfb - watchdog: add arm_watchdog() and disarm_watchdog() calls
* `sssd-2-9`
    * 27987c791bc452f53696a3a33f0d607ab040e78d - sbus: arm watchdog for sbus_connect_init_send()
    * f16e570838d1c6cd30b5883f364b0f437c314b1f - watchdog: add arm_watchdog() and disarm_watchdog() calls
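
For reference, the fix follows the watchdog pattern the commit subjects describe: arm the watchdog before the sbus handshake starts and disarm it once the handshake completes, so a connection that stalls under heavy load terminates the process (which can then be restarted) instead of leaving it wedged until a manual restart. Below is a minimal, self-contained sketch of that pattern using a plain alarm(2)-based timer; it is a stand-in, not the upstream arm_watchdog()/disarm_watchdog() implementation, and do_handshake() plus the 10-second budget are invented for illustration.

```c
/* Minimal sketch of the watchdog pattern applied by the fix, NOT SSSD's
 * implementation: arm a timer before a step that can hang under load and
 * disarm it on completion. If the step stalls past its budget, the process
 * exits so a supervisor can restart it instead of wedging forever.
 * do_handshake() and the 10-second budget are illustrative assumptions. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void watchdog_fired(int sig)
{
    (void)sig;
    _exit(70);              /* async-signal-safe; supervisor restarts us */
}

static void arm_watchdog(unsigned int seconds)
{
    signal(SIGALRM, watchdog_fired);
    alarm(seconds);
}

static void disarm_watchdog(void)
{
    alarm(0);               /* cancel the pending SIGALRM */
}

static int do_handshake(void)
{
    sleep(2);               /* stand-in for the D-Bus handshake; imagine
                             * this blocking indefinitely under OOM load */
    return 0;
}

int main(void)
{
    int ret;

    arm_watchdog(10);       /* like arming before sbus_connect_init_send() */
    ret = do_handshake();
    disarm_watchdog();      /* handshake finished within the budget */

    printf("handshake result: %d\n", ret);
    return ret;
}
```

If do_handshake() hung past its budget, SIGALRM would fire and the process would exit non-zero, turning a permanent wedge into a restartable failure, which matches the expected result described above.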

