Bug 1837545

Summary: Users must be informed better when internal WATCHDOG terminates process.
Product: Red Hat Enterprise Linux 7 Reporter: Alexey Tikhonov <atikhono>
Component: sssdAssignee: Alexey Tikhonov <atikhono>
Status: CLOSED ERRATA QA Contact: sssd-qe <sssd-qe>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.8CC: grajaiya, jhrozek, lslebodn, mniranja, mzidek, pbrezina, sgoveas, thalman, tscherf
Target Milestone: rcKeywords: Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: sync-to-jira review
Fixed In Version: sssd-1.16.5-7.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-09-29 19:50:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Alexey Tikhonov 2020-05-19 15:59:02 UTC
Currently it is very difficult to understand that one of SSSD processes was terminated by internal watchdog.

This often makes troubleshooting complicated.

It would be beneficial to have this information easily available in system journal and sssd logs.

Comment 2 Alexey Tikhonov 2020-05-19 16:01:29 UTC
Upstream PR: https://github.com/SSSD/sssd/issues/5146

Comment 4 Alexey Tikhonov 2020-05-21 09:16:06 UTC
    sssd-1-16
        f657f03 - WATCHDOG: log process termination to the journal
        225fe99 - DEBUG: changed "debug_prg_name" format
        902c893 - DEBUG: introduce new SSSDBG_TRACE_LDB level
        502138b - DEBUG: changed timestamp output format

Comment 11 Niranjan Mallapadi Raghavender 2020-05-28 17:01:10 UTC
version:
sssd-common-pac-1.16.4-21.el7_7.1.x86_64
python-sssdconfig-1.16.4-21.el7_7.1.noarch
sssd-1.16.4-21.el7_7.1.x86_64
sssd-client-1.16.4-21.el7_7.1.x86_64
sssd-krb5-common-1.16.4-21.el7_7.1.x86_64
sssd-ldap-1.16.4-21.el7_7.1.x86_64
sssd-proxy-1.16.4-21.el7_7.1.x86_64
sssd-ipa-1.16.4-21.el7_7.1.x86_64
sssd-common-1.16.4-21.el7_7.1.x86_64
sssd-krb5-1.16.4-21.el7_7.1.x86_64
sssd-ad-1.16.4-21.el7_7.1.x86_64


Reproducing the issue:
Create a ldap server (389-ds) where there 1500000 users (student1 to student1500000) and following largegroups created
1. allusers which contains 1500000 users as members 
2. lgroup1 which contains 15000 users as members
3. lgroup2 which contains 50000 users as members 


cat /etc/sssd/sssd.conf
[sssd]
config_file_version = 2
services = nss, pam
domains = example1
debug_level = 9

[domain/example1]
ldap_search_base = dc=example,dc=test
id_provider = ldap
auth_provider = ldap
ldap_user_home_directory = /home/%u
ldap_uri = ldaps://cloud-qe-09.idmqe.lab.eng.bos.redhat.com
ldap_tls_cacert = /etc/openldap/cacerts/cacert.pem
use_fully_qualified_names = True
timeout = 2
debug_level = 9

[nss]
debug_level = 9

4. Start sssd and note the process pid
root     23985     1  0 12:46 ?        00:00:00 /usr/sbin/sssd -i --logger=files
root     23986 23985  0 12:46 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root     23987 23985  0 12:46 ?        00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root     23988 23985  0 12:46 ?        00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root     23998 23843  0 12:47 pts/1    00:00:00 grep --color=auto sssd


5. First lookup group lgroup1 which contains 15000 users 
time getent group lgroup1@example1
....
....
student14982@example1,student14983@example1,student14984@example1,student14985@example1,student14986@example1,
student14987@example1,student14988@example1,student14989@example1,student14990@example1,student14991@example1,student14992@example1,
student14993@example1,student14994@example1

real    0m0.816s
user    0m0.002s
sys     0m0.007s

5. now lookup group lgroup2 which contains 50000 users 
[root@qe-blade-01 sssd]# time getent group lgroup2@example1

real    0m15.850s
user    0m0.001s
sys     0m0.001s

6. check sssd process 

[root@qe-blade-01 sssd]# ps -ef | grep sssd
root     23985     1  0 12:46 ?        00:00:00 /usr/sbin/sssd -i --logger=files
root     23987 23985  0 12:46 ?        00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root     23988 23985  0 12:46 ?        00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root     24001 23985  1 12:50 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root     24003 23843  0 12:50 pts/1    00:00:00 grep --color=auto sssd

As you can see sssd_be process was shutdown and restarted by watchdog. 

(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sysdb_cache_search_users] (0x2000): No such entry
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sdap_process_missing_member_2307] (0x0400): Adding a ghost entry
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sdap_process_group_send] (0x1000): All group members processed
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [be_ptask_destructor] (0x0400): Terminating periodic task [SUDO Smart Refresh]
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [be_ptask_destructor] (0x0400): Terminating periodic task [SUDO Full Refresh]
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [dp_terminate_active_requests] (0x0400): Terminating active data provider requests
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [dp_terminate_request] (0x4000): DP Request [Account #2]: Terminating.
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sdap_id_op_destroy] (0x4000): releasing operation connection
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [_dp_req_recv] (0x0400): DP Request [Account #2]: Receiving request data.


7. Update sssd to below versions:

sssd-common-pac-1.16.5-7.el7.x86_64
sssd-krb5-1.16.5-7.el7.x86_64
sssd-winbind-idmap-1.16.4-21.el7_7.1.x86_64
sssd-client-1.16.5-7.el7.x86_64
sssd-krb5-common-1.16.5-7.el7.x86_64
sssd-ipa-1.16.5-7.el7.x86_64
sssd-ldap-1.16.5-7.el7.x86_64
sssd-dbus-1.16.5-7.el7.x86_64
sssd-tools-1.16.5-7.el7.x86_64
sssd-kcm-1.16.5-7.el7.x86_64
python-sssdconfig-1.16.5-7.el7.noarch
sssd-common-1.16.5-7.el7.x86_64
sssd-ad-1.16.5-7.el7.x86_64
sssd-proxy-1.16.5-7.el7.x86_64
sssd-1.16.5-7.el7.x86_64

[root@auto-hv-02-guest10 sssd]# ps -ef | grep sssd
root     29984     1  1 12:57 ?        00:00:00 /usr/sbin/sssd -i --logger=files
root     29985 29984  3 12:57 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root     29986 29984  0 12:57 ?        00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root     29987 29984  0 12:57 ?        00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root     29989 29873  0 12:57 pts/0    00:00:00 grep --color=auto sssd

8. lookup group lgroup1@example1

time getent group lgroup1@example1
....
....
student14982@example1,student14983@example1,student14984@example1,student14985@example1,student14986@example1,
student14987@example1,student14988@example1,student14989@example1,student14990@example1,student14991@example1,student14992@example1,
student14993@example1,student14994@example1
real    0m0.008s
user    0m0.002s
sys     0m0.004s


9. lookup group lgroup2@example1


[root@auto-hv-02-guest10 sssd]# time getent group lgroup2@example1

real    0m3.955s
user    0m0.000s
sys     0m0.002s

10. we can see that sssd_be was shutdown and below message logged:

May 28 12:57:53 auto-hv-02-guest10.idmqe.lab.eng.bos.redhat.com polkitd[27670]: Unregistered Authentication Agent for unix-process:29978:12930963 (system bus name :1.43, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus)
May 28 12:58:26 auto-hv-02-guest10.idmqe.lab.eng.bos.redhat.com sssd[29984]: Child [29985] ('example1':'%BE_example1') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
May 28 12:58:26 auto-hv-02-guest10.idmqe.lab.eng.bos.redhat.com be[example1][29991]: Starting up


11. check the sssd_be process id which is changed as sssd_be process was restarted

[root@auto-hv-02-guest10 sssd]# ps -ef | grep sssd
root     29984     1  0 12:57 ?        00:00:00 /usr/sbin/sssd -i --logger=files
root     29986 29984  0 12:57 ?        00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root     29987 29984  0 12:57 ?        00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root     29991 29984  0 12:58 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root     30001 29873  0 12:59 pts/0    00:00:00 grep --color=auto sssd

12. sssd.log contains below information:

sssd.log:(2020-05-28 12:58:26): [sssd] [svc_child_info] (0x0020): Child [29985] ('example1':'%BE_example1') was terminated by own WATCHDOG

Comment 13 errata-xmlrpc 2020-09-29 19:50:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sssd bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3904