Bug 1845965

Summary: Users must be informed better when internal WATCHDOG terminates process.
Product: Red Hat Enterprise Linux 8 Reporter: Alexey Tikhonov <atikhono>
Component: sssdAssignee: Alexey Tikhonov <atikhono>
Status: CLOSED ERRATA QA Contact: sssd-qe <sssd-qe>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 8.2CC: grajaiya, jhrozek, lslebodn, mniranja, mzidek, pbrezina, sgoveas, thalman, tscherf
Target Milestone: rcKeywords: Triaged
Target Release: 8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: sssd-2.3.0-1.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-11-04 02:05:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Alexey Tikhonov 2020-06-10 13:45:36 UTC
This bug was initially created as a copy of Bug #1837545

I am copying this bug because: to track fix for RHEL8



Currently it is very difficult to understand that one of SSSD processes was terminated by internal watchdog.

This often makes troubleshooting complicated.

It would be beneficial to have this information easily available in system journal and sssd logs.

Comment 1 Alexey Tikhonov 2020-06-10 13:51:43 UTC
* `master`
    * 65369f293b06ce0fe5622502bb32596bb50c523a - WATCHDOG: log process termination to the journal
    * 00e7b1ada3d1c1071eac79b65c17cd2701c2ae6a - DEBUG: changed "debug_prg_name" format
    * b5604d072e93bca7fc0c408fcfbb88f41c4d50ca - DEBUG: introduce new SSSDBG_TRACE_LDB level
    * 7b25375155e7b7f640d30dcfdf5b55482a6102a7 - DEBUG: changed timestamp output format

Comment 4 Niranjan Mallapadi Raghavender 2020-07-13 11:04:26 UTC
python3-sssdconfig-2.3.0-4.el8.noarch
sssd-common-2.3.0-4.el8.x86_64
sssd-winbind-idmap-2.3.0-4.el8.x86_64
sssd-2.3.0-4.el8.x86_64
sssd-client-2.3.0-4.el8.x86_64
sssd-nfs-idmap-2.3.0-4.el8.x86_64
sssd-kcm-2.3.0-4.el8.x86_64
sssd-dbus-2.3.0-4.el8.x86_64
sssd-krb5-common-2.3.0-4.el8.x86_64
sssd-krb5-2.3.0-4.el8.x86_64
sssd-ad-2.3.0-4.el8.x86_64
sssd-ipa-2.3.0-4.el8.x86_64
sssd-tools-2.3.0-4.el8.x86_64
sssd-common-pac-2.3.0-4.el8.x86_64
sssd-ldap-2.3.0-4.el8.x86_64
sssd-proxy-2.3.0-4.el8.x86_64


Reproducing the issue:
Create a ldap server (389-ds) where there 1500000 users (student1 to student1500000) and following largegroups created
1. allusers which contains 1500000 users as members 
2. lgroup1 which contains 15000 users as members
3. lgroup2 which contains 50000 users as members 


[sssd]
config_file_version = 2
services = nss, pam
domains = example1

[domain/example1]
ldap_search_base = dc=example,dc=test
id_provider = ldap
auth_provider = ldap
ldap_user_home_directory = /home/%u
ldap_uri = ldaps://auto-hv-02-guest08.idmqe.lab.eng.bos.redhat.com
ldap_tls_cacert = /etc/openldap/cacerts/cacert.pem
use_fully_qualified_names = True
debug_level = 9

[nss]
debug_level = 9

4. Start sssd and note the process pid

[root@ibm-x3650m4-01-vm-11 mc]# ps -ef | grep sssd
root       34653       1  1 06:58 ?        00:00:00 /usr/sbin/sssd -i --logger=files
root       34656   34653  3 06:58 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain implicit_files --uid 0 --gid 0 --logger=files
root       34657   34653  1 06:58 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root       34658   34653  1 06:58 ?        00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root       34659   34653  0 06:58 ?        00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files


5. First lookup group lgroup1 which contains 15000 users 

....
....
student14982@example1,student14983@example1,student14984@example1,student14985@example1,student14986@example1,
student14987@example1,student14988@example1,student14989@example1,student14990@example1,student14991@example1,student14992@example1,
student14993@example1,student14994@example1

real    0m4.380s
user    0m0.006s
sys     0m0.008s


6. Check if the sssd_be process id same as in step 4:

root       34653       1  0 06:58 ?        00:00:00 /usr/sbin/sssd -i --logger=files
root       34656   34653  0 06:58 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain implicit_files --uid 0 --gid 0 --logger=files
root       34657   34653  5 06:58 ?        00:00:03 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root       34658   34653  1 06:58 ?        00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root       34659   34653  0 06:58 ?        00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files

7. now lookup group lgroup2 which contains 50000 users 
[root@qe-blade-01 sssd]# time getent group lgroup2@example1


real    0m19.577s
user    0m0.003s
sys     0m0.004s


8. check sssd process 

[root@ibm-x3650m4-01-vm-11 ~]# ps -ef | grep sssd

9. As we can see sssd_be process was shutdown and restarted by watchdog. 

root       34653       1  0 06:58 ?        00:00:00 /usr/sbin/sssd -i --logger=files
root       34656   34653  0 06:58 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain implicit_files --uid 0 --gid 0 --logger=files
root       34658   34653  0 06:58 ?        00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root       34659   34653  0 06:58 ?        00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root       34712   34653  0 07:00 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files



As we can see sssd_be process was shutdown and started as the process id changed. 


10. Below is the logs in sssd.log

(2020-07-13  7:00:22): [sssd] [sbus_dispatch_reconnect] (0x0400): Connection lost. Terminating active requests.
(2020-07-13  7:00:22): [sssd] [sbus_dispatch_reconnect] (0x4000): Remote client terminated the connection. Releasing data...
(2020-07-13  7:00:22): [sssd] [sbus_connection_free] (0x4000): Connection 0x55b2548db630 will be freed during next loop!
(2020-07-13  7:00:22): [sssd] [mt_svc_exit_handler] (0x1000): SIGCHLD handler of service example1 called
(2020-07-13  7:00:22): [sssd] [svc_child_info] (0x0020): Child [34657] ('example1':'%BE_example1') was terminated by own WATCHDOG
(2020-07-13  7:00:22): [sssd] [mt_svc_restart] (0x0400): Scheduling service example1 for restart 1
(2020-07-13  7:00:22): [sssd] [start_service] (0x0100): Queueing service example1 for startup

Comment 7 errata-xmlrpc 2020-11-04 02:05:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sssd bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4569