RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1837545 - Users must be informed better when internal WATCHDOG terminates process.
Summary: Users must be informed better when internal WATCHDOG terminates process.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: sssd
Version: 7.8
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: ---
Assignee: Alexey Tikhonov
QA Contact: sssd-qe
URL:
Whiteboard: sync-to-jira review
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-05-19 15:59 UTC by Alexey Tikhonov
Modified: 2020-09-29 19:50 UTC (History)
9 users (show)

Fixed In Version: sssd-1.16.5-7.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-29 19:50:32 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github SSSD sssd issues 5146 0 None closed Users must be informed better when internal WATCHDOG terminates process. 2021-02-11 17:44:46 UTC
Red Hat Product Errata RHBA-2020:3904 0 None None None 2020-09-29 19:50:43 UTC

Description Alexey Tikhonov 2020-05-19 15:59:02 UTC
Currently it is very difficult to understand that one of SSSD processes was terminated by internal watchdog.

This often makes troubleshooting complicated.

It would be beneficial to have this information easily available in system journal and sssd logs.

Comment 2 Alexey Tikhonov 2020-05-19 16:01:29 UTC
Upstream PR: https://github.com/SSSD/sssd/issues/5146

Comment 4 Alexey Tikhonov 2020-05-21 09:16:06 UTC
    sssd-1-16
        f657f03 - WATCHDOG: log process termination to the journal
        225fe99 - DEBUG: changed "debug_prg_name" format
        902c893 - DEBUG: introduce new SSSDBG_TRACE_LDB level
        502138b - DEBUG: changed timestamp output format

Comment 11 Niranjan Mallapadi Raghavender 2020-05-28 17:01:10 UTC
version:
sssd-common-pac-1.16.4-21.el7_7.1.x86_64
python-sssdconfig-1.16.4-21.el7_7.1.noarch
sssd-1.16.4-21.el7_7.1.x86_64
sssd-client-1.16.4-21.el7_7.1.x86_64
sssd-krb5-common-1.16.4-21.el7_7.1.x86_64
sssd-ldap-1.16.4-21.el7_7.1.x86_64
sssd-proxy-1.16.4-21.el7_7.1.x86_64
sssd-ipa-1.16.4-21.el7_7.1.x86_64
sssd-common-1.16.4-21.el7_7.1.x86_64
sssd-krb5-1.16.4-21.el7_7.1.x86_64
sssd-ad-1.16.4-21.el7_7.1.x86_64


Reproducing the issue:
Create a ldap server (389-ds) where there 1500000 users (student1 to student1500000) and following largegroups created
1. allusers which contains 1500000 users as members 
2. lgroup1 which contains 15000 users as members
3. lgroup2 which contains 50000 users as members 


cat /etc/sssd/sssd.conf
[sssd]
config_file_version = 2
services = nss, pam
domains = example1
debug_level = 9

[domain/example1]
ldap_search_base = dc=example,dc=test
id_provider = ldap
auth_provider = ldap
ldap_user_home_directory = /home/%u
ldap_uri = ldaps://cloud-qe-09.idmqe.lab.eng.bos.redhat.com
ldap_tls_cacert = /etc/openldap/cacerts/cacert.pem
use_fully_qualified_names = True
timeout = 2
debug_level = 9

[nss]
debug_level = 9

4. Start sssd and note the process pid
root     23985     1  0 12:46 ?        00:00:00 /usr/sbin/sssd -i --logger=files
root     23986 23985  0 12:46 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root     23987 23985  0 12:46 ?        00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root     23988 23985  0 12:46 ?        00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root     23998 23843  0 12:47 pts/1    00:00:00 grep --color=auto sssd


5. First lookup group lgroup1 which contains 15000 users 
time getent group lgroup1@example1
....
....
student14982@example1,student14983@example1,student14984@example1,student14985@example1,student14986@example1,
student14987@example1,student14988@example1,student14989@example1,student14990@example1,student14991@example1,student14992@example1,
student14993@example1,student14994@example1

real    0m0.816s
user    0m0.002s
sys     0m0.007s

5. now lookup group lgroup2 which contains 50000 users 
[root@qe-blade-01 sssd]# time getent group lgroup2@example1

real    0m15.850s
user    0m0.001s
sys     0m0.001s

6. check sssd process 

[root@qe-blade-01 sssd]# ps -ef | grep sssd
root     23985     1  0 12:46 ?        00:00:00 /usr/sbin/sssd -i --logger=files
root     23987 23985  0 12:46 ?        00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root     23988 23985  0 12:46 ?        00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root     24001 23985  1 12:50 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root     24003 23843  0 12:50 pts/1    00:00:00 grep --color=auto sssd

As you can see sssd_be process was shutdown and restarted by watchdog. 

(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sysdb_cache_search_users] (0x2000): No such entry
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sdap_process_missing_member_2307] (0x0400): Adding a ghost entry
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sdap_process_group_send] (0x1000): All group members processed
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [be_ptask_destructor] (0x0400): Terminating periodic task [SUDO Smart Refresh]
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [be_ptask_destructor] (0x0400): Terminating periodic task [SUDO Full Refresh]
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [dp_terminate_active_requests] (0x0400): Terminating active data provider requests
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [dp_terminate_request] (0x4000): DP Request [Account #2]: Terminating.
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sdap_id_op_destroy] (0x4000): releasing operation connection
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [_dp_req_recv] (0x0400): DP Request [Account #2]: Receiving request data.


7. Update sssd to below versions:

sssd-common-pac-1.16.5-7.el7.x86_64
sssd-krb5-1.16.5-7.el7.x86_64
sssd-winbind-idmap-1.16.4-21.el7_7.1.x86_64
sssd-client-1.16.5-7.el7.x86_64
sssd-krb5-common-1.16.5-7.el7.x86_64
sssd-ipa-1.16.5-7.el7.x86_64
sssd-ldap-1.16.5-7.el7.x86_64
sssd-dbus-1.16.5-7.el7.x86_64
sssd-tools-1.16.5-7.el7.x86_64
sssd-kcm-1.16.5-7.el7.x86_64
python-sssdconfig-1.16.5-7.el7.noarch
sssd-common-1.16.5-7.el7.x86_64
sssd-ad-1.16.5-7.el7.x86_64
sssd-proxy-1.16.5-7.el7.x86_64
sssd-1.16.5-7.el7.x86_64

[root@auto-hv-02-guest10 sssd]# ps -ef | grep sssd
root     29984     1  1 12:57 ?        00:00:00 /usr/sbin/sssd -i --logger=files
root     29985 29984  3 12:57 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root     29986 29984  0 12:57 ?        00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root     29987 29984  0 12:57 ?        00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root     29989 29873  0 12:57 pts/0    00:00:00 grep --color=auto sssd

8. lookup group lgroup1@example1

time getent group lgroup1@example1
....
....
student14982@example1,student14983@example1,student14984@example1,student14985@example1,student14986@example1,
student14987@example1,student14988@example1,student14989@example1,student14990@example1,student14991@example1,student14992@example1,
student14993@example1,student14994@example1
real    0m0.008s
user    0m0.002s
sys     0m0.004s


9. lookup group lgroup2@example1


[root@auto-hv-02-guest10 sssd]# time getent group lgroup2@example1

real    0m3.955s
user    0m0.000s
sys     0m0.002s

10. we can see that sssd_be was shutdown and below message logged:

May 28 12:57:53 auto-hv-02-guest10.idmqe.lab.eng.bos.redhat.com polkitd[27670]: Unregistered Authentication Agent for unix-process:29978:12930963 (system bus name :1.43, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus)
May 28 12:58:26 auto-hv-02-guest10.idmqe.lab.eng.bos.redhat.com sssd[29984]: Child [29985] ('example1':'%BE_example1') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
May 28 12:58:26 auto-hv-02-guest10.idmqe.lab.eng.bos.redhat.com be[example1][29991]: Starting up


11. check the sssd_be process id which is changed as sssd_be process was restarted

[root@auto-hv-02-guest10 sssd]# ps -ef | grep sssd
root     29984     1  0 12:57 ?        00:00:00 /usr/sbin/sssd -i --logger=files
root     29986 29984  0 12:57 ?        00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root     29987 29984  0 12:57 ?        00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root     29991 29984  0 12:58 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root     30001 29873  0 12:59 pts/0    00:00:00 grep --color=auto sssd

12. sssd.log contains below information:

sssd.log:(2020-05-28 12:58:26): [sssd] [svc_child_info] (0x0020): Child [29985] ('example1':'%BE_example1') was terminated by own WATCHDOG

Comment 13 errata-xmlrpc 2020-09-29 19:50:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sssd bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3904


Note You need to log in before you can comment on or make changes to this bug.