Bug 2149241 - [sssd] SSSD enters failed state after heavy load in the system
Summary: [sssd] SSSD enters failed state after heavy load in the system
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: sssd
Version: 8.6
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Sumit Bose
QA Contact: shridhar
URL:
Whiteboard: sync-to-jira
Depends On:
Blocks: 2219352 2219351
 
Reported: 2022-11-29 10:19 UTC by Aleksandr Sharov
Modified: 2023-07-24 08:30 UTC

Fixed In Version: sssd-2.9.1-2.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2219351 2219352
Environment:
Last Closed: 2023-04-11 13:40:51 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
debug sssd logs (16.82 MB, application/gzip)
2022-11-29 10:19 UTC, Aleksandr Sharov


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github SSSD sssd | issues 6803 | 0 | None | open | [sssd] SSSD enters failed state after heavy load in the system | 2023-06-30 08:19:26 UTC
Github SSSD sssd | pull 6804 | 0 | None | open | sbus: arm watchdog for sbus_connect_init_send() | 2023-06-30 09:54:25 UTC
Red Hat Issue Tracker | RHELPLAN-140762 | 0 | None | None | None | 2022-11-29 10:27:48 UTC
Red Hat Issue Tracker | SSSD-6386 | 0 | None | None | None | 2023-07-03 13:35:57 UTC

Description Aleksandr Sharov 2022-11-29 10:19:52 UTC
Created attachment 1928235 [details]
debug sssd logs

Description of problem:
After severe load on the system, including an OOM state, sssd's dbus loses the ability to talk with its data providers:

   *  ... skipping repetitive backtrace ...
(2022-11-26 12:18:08): [be[default]] [dp_client_handshake_timeout] (0x0040): Client [sssd.pam] timed out before identification [0x55cc9111b620]!
   *  ... skipping repetitive backtrace ...
(2022-11-26 12:18:08): [be[default]] [dp_client_handshake_timeout] (0x0040): Client [sssd.nss] timed out before identification [0x55cc91118770]!

This state remains until the sssd service is restarted.
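
To check whether a given backend log shows the same symptom, the quoted messages can be counted per client. The following is a minimal sketch, not an SSSD tool: the regular expression is modelled only on the two log lines quoted above, and the function name `count_handshake_timeouts` is hypothetical. Other SSSD versions or debug levels may format these lines differently.

```python
import re
from collections import Counter

# Assumption: the pattern is derived from the two log lines quoted in this
# report; it is not taken from SSSD documentation.
HANDSHAKE_TIMEOUT = re.compile(
    r"^\((?P<ts>[^)]+)\): .*\[dp_client_handshake_timeout\]"
    r".*Client \[(?P<client>[^\]]+)\] timed out before identification"
)


def count_handshake_timeouts(lines):
    """Return a Counter mapping client name -> number of handshake timeouts."""
    counts = Counter()
    for line in lines:
        match = HANDSHAKE_TIMEOUT.match(line)
        if match:
            counts[match.group("client")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "   *  ... skipping repetitive backtrace ...",
        "(2022-11-26 12:18:08): [be[default]] [dp_client_handshake_timeout] "
        "(0x0040): Client [sssd.pam] timed out before identification "
        "[0x55cc9111b620]!",
        "(2022-11-26 12:18:08): [be[default]] [dp_client_handshake_timeout] "
        "(0x0040): Client [sssd.nss] timed out before identification "
        "[0x55cc91118770]!",
    ]
    for client, count in sorted(count_handshake_timeouts(sample).items()):
        print(client, count)
```

A non-zero count for both `sssd.pam` and `sssd.nss` in the backend log would match the state described here; it does not by itself confirm the same root cause.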


Version-Release number of selected component (if applicable):
RHEL 8.6, sssd-2.6.2-4.el8_6.x86_64

How reproducible:
Could not reproduce the state in the lab. In the client's environment, it happens every 24 hours when a custom cron script runs.

Steps to Reproduce:
1.
2.
3.

Actual results:

SSSD does not recover automatically.

Expected results:

SSSD continues to work as expected after the system is out of heavy load and/or OOM.

Additional info:

Debug logs are attached (the nss log is trimmed to the last 2000 lines; full logs are in the case). The issue happened around 5:39-5:41 AM; it is difficult to narrow the time down further because the system appears to have been hung at the time.

Additional data and sosreport are in the attached case.

Comment 50 Alexey Tikhonov 2023-06-30 09:51:21 UTC
Upstream PR: https://github.com/SSSD/sssd/pull/6804

Comment 53 Alexey Tikhonov 2023-07-04 13:39:07 UTC
Pushed PR: https://github.com/SSSD/sssd/pull/6804

* `master`
    * cca9361d92501e0be34d264d370fe897a0c970af - sbus: arm watchdog for sbus_connect_init_send()
    * 75f2b35ad3b9256de905d05c5108400d35688554 - watchdog: add arm_watchdog() and disarm_watchdog() calls
* `sssd-2-8`
    * 55564defec8fdbb4d9df6b0124a8b18b31743230 - sbus: arm watchdog for sbus_connect_init_send()
    * 2cd5a6a2c8fd1826177d6bb51e7d4f4ad368bcfb - watchdog: add arm_watchdog() and disarm_watchdog() calls
* `sssd-2-9`
    * 27987c791bc452f53696a3a33f0d607ab040e78d - sbus: arm watchdog for sbus_connect_init_send()
    * f16e570838d1c6cd30b5883f364b0f437c314b1f - watchdog: add arm_watchdog() and disarm_watchdog() calls
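
The commit titles above suggest that the fix arms a watchdog around the sbus connection initialization so that a hang there no longer goes unnoticed. The sketch below only illustrates the general arm/disarm pattern in Python; it is not SSSD's C implementation, and the `Watchdog` class, `guarded_call` helper, and timeout values are hypothetical names chosen for this example.

```python
import threading


class Watchdog:
    """Generic arm/disarm watchdog (illustrative only, not SSSD code)."""

    def __init__(self, timeout_s, on_expire):
        self._timeout_s = timeout_s
        self._on_expire = on_expire
        self._timer = None

    def arm(self):
        # Start (or restart) the countdown before a step that may hang.
        self.disarm()
        self._timer = threading.Timer(self._timeout_s, self._on_expire)
        self._timer.daemon = True
        self._timer.start()

    def disarm(self):
        # Cancel the countdown once the guarded step has completed.
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None


def guarded_call(watchdog, step):
    """Run step() with the watchdog armed, always disarming afterwards."""
    watchdog.arm()
    try:
        return step()
    finally:
        watchdog.disarm()


if __name__ == "__main__":
    expired = threading.Event()
    wd = Watchdog(0.05, expired.set)
    # A step that returns promptly never triggers the expiry action.
    print(guarded_call(wd, lambda: "connected"))
```

In a long-running daemon the expiry action would typically terminate the stuck process so that a supervisor can restart it, rather than set an event as in this example.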

