Bug 1837545
Summary: | Users must be informed better when internal WATCHDOG terminates process. | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Alexey Tikhonov <atikhono> |
Component: | sssd | Assignee: | Alexey Tikhonov <atikhono> |
Status: | CLOSED ERRATA | QA Contact: | sssd-qe <sssd-qe> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 7.8 | CC: | grajaiya, jhrozek, lslebodn, mniranja, mzidek, pbrezina, sgoveas, thalman, tscherf |
Target Milestone: | rc | Keywords: | Triaged |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | sync-to-jira review | ||
Fixed In Version: | sssd-1.16.5-7.el7 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-09-29 19:50:32 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Alexey Tikhonov
2020-05-19 15:59:02 UTC
Upstream PR: https://github.com/SSSD/sssd/issues/5146 sssd-1-16 f657f03 - WATCHDOG: log process termination to the journal 225fe99 - DEBUG: changed "debug_prg_name" format 902c893 - DEBUG: introduce new SSSDBG_TRACE_LDB level 502138b - DEBUG: changed timestamp output format version: sssd-common-pac-1.16.4-21.el7_7.1.x86_64 python-sssdconfig-1.16.4-21.el7_7.1.noarch sssd-1.16.4-21.el7_7.1.x86_64 sssd-client-1.16.4-21.el7_7.1.x86_64 sssd-krb5-common-1.16.4-21.el7_7.1.x86_64 sssd-ldap-1.16.4-21.el7_7.1.x86_64 sssd-proxy-1.16.4-21.el7_7.1.x86_64 sssd-ipa-1.16.4-21.el7_7.1.x86_64 sssd-common-1.16.4-21.el7_7.1.x86_64 sssd-krb5-1.16.4-21.el7_7.1.x86_64 sssd-ad-1.16.4-21.el7_7.1.x86_64 Reproducing the issue: Create a ldap server (389-ds) where there 1500000 users (student1 to student1500000) and following largegroups created 1. allusers which contains 1500000 users as members 2. lgroup1 which contains 15000 users as members 3. lgroup2 which contains 50000 users as members cat /etc/sssd/sssd.conf [sssd] config_file_version = 2 services = nss, pam domains = example1 debug_level = 9 [domain/example1] ldap_search_base = dc=example,dc=test id_provider = ldap auth_provider = ldap ldap_user_home_directory = /home/%u ldap_uri = ldaps://cloud-qe-09.idmqe.lab.eng.bos.redhat.com ldap_tls_cacert = /etc/openldap/cacerts/cacert.pem use_fully_qualified_names = True timeout = 2 debug_level = 9 [nss] debug_level = 9 4. Start sssd and note the process pid root 23985 1 0 12:46 ? 00:00:00 /usr/sbin/sssd -i --logger=files root 23986 23985 0 12:46 ? 00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files root 23987 23985 0 12:46 ? 00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files root 23988 23985 0 12:46 ? 00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files root 23998 23843 0 12:47 pts/1 00:00:00 grep --color=auto sssd 5. First lookup group lgroup1 which contains 15000 users time getent group lgroup1@example1 .... .... student14982@example1,student14983@example1,student14984@example1,student14985@example1,student14986@example1, student14987@example1,student14988@example1,student14989@example1,student14990@example1,student14991@example1,student14992@example1, student14993@example1,student14994@example1 real 0m0.816s user 0m0.002s sys 0m0.007s 5. now lookup group lgroup2 which contains 50000 users [root@qe-blade-01 sssd]# time getent group lgroup2@example1 real 0m15.850s user 0m0.001s sys 0m0.001s 6. check sssd process [root@qe-blade-01 sssd]# ps -ef | grep sssd root 23985 1 0 12:46 ? 00:00:00 /usr/sbin/sssd -i --logger=files root 23987 23985 0 12:46 ? 00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files root 23988 23985 0 12:46 ? 00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files root 24001 23985 1 12:50 ? 00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files root 24003 23843 0 12:50 pts/1 00:00:00 grep --color=auto sssd As you can see sssd_be process was shutdown and restarted by watchdog. (Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sysdb_cache_search_users] (0x2000): No such entry (Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sdap_process_missing_member_2307] (0x0400): Adding a ghost entry (Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sdap_process_group_send] (0x1000): All group members processed (Thu May 28 12:50:21 2020) [sssd[be[example1]]] [orderly_shutdown] (0x0010): SIGTERM: killing children (Thu May 28 12:50:21 2020) [sssd[be[example1]]] [be_ptask_destructor] (0x0400): Terminating periodic task [SUDO Smart Refresh] (Thu May 28 12:50:21 2020) [sssd[be[example1]]] [be_ptask_destructor] (0x0400): Terminating periodic task [SUDO Full Refresh] (Thu May 28 12:50:21 2020) [sssd[be[example1]]] [dp_terminate_active_requests] (0x0400): Terminating active data provider requests (Thu May 28 12:50:21 2020) [sssd[be[example1]]] [dp_terminate_request] (0x4000): DP Request [Account #2]: Terminating. (Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sdap_id_op_destroy] (0x4000): releasing operation connection (Thu May 28 12:50:21 2020) [sssd[be[example1]]] [_dp_req_recv] (0x0400): DP Request [Account #2]: Receiving request data. 7. Update sssd to below versions: sssd-common-pac-1.16.5-7.el7.x86_64 sssd-krb5-1.16.5-7.el7.x86_64 sssd-winbind-idmap-1.16.4-21.el7_7.1.x86_64 sssd-client-1.16.5-7.el7.x86_64 sssd-krb5-common-1.16.5-7.el7.x86_64 sssd-ipa-1.16.5-7.el7.x86_64 sssd-ldap-1.16.5-7.el7.x86_64 sssd-dbus-1.16.5-7.el7.x86_64 sssd-tools-1.16.5-7.el7.x86_64 sssd-kcm-1.16.5-7.el7.x86_64 python-sssdconfig-1.16.5-7.el7.noarch sssd-common-1.16.5-7.el7.x86_64 sssd-ad-1.16.5-7.el7.x86_64 sssd-proxy-1.16.5-7.el7.x86_64 sssd-1.16.5-7.el7.x86_64 [root@auto-hv-02-guest10 sssd]# ps -ef | grep sssd root 29984 1 1 12:57 ? 00:00:00 /usr/sbin/sssd -i --logger=files root 29985 29984 3 12:57 ? 00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files root 29986 29984 0 12:57 ? 00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files root 29987 29984 0 12:57 ? 00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files root 29989 29873 0 12:57 pts/0 00:00:00 grep --color=auto sssd 8. lookup group lgroup1@example1 time getent group lgroup1@example1 .... .... student14982@example1,student14983@example1,student14984@example1,student14985@example1,student14986@example1, student14987@example1,student14988@example1,student14989@example1,student14990@example1,student14991@example1,student14992@example1, student14993@example1,student14994@example1 real 0m0.008s user 0m0.002s sys 0m0.004s 9. lookup group lgroup2@example1 [root@auto-hv-02-guest10 sssd]# time getent group lgroup2@example1 real 0m3.955s user 0m0.000s sys 0m0.002s 10. we can see that sssd_be was shutdown and below message logged: May 28 12:57:53 auto-hv-02-guest10.idmqe.lab.eng.bos.redhat.com polkitd[27670]: Unregistered Authentication Agent for unix-process:29978:12930963 (system bus name :1.43, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus) May 28 12:58:26 auto-hv-02-guest10.idmqe.lab.eng.bos.redhat.com sssd[29984]: Child [29985] ('example1':'%BE_example1') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason. May 28 12:58:26 auto-hv-02-guest10.idmqe.lab.eng.bos.redhat.com be[example1][29991]: Starting up 11. check the sssd_be process id which is changed as sssd_be process was restarted [root@auto-hv-02-guest10 sssd]# ps -ef | grep sssd root 29984 1 0 12:57 ? 00:00:00 /usr/sbin/sssd -i --logger=files root 29986 29984 0 12:57 ? 00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files root 29987 29984 0 12:57 ? 00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files root 29991 29984 0 12:58 ? 00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files root 30001 29873 0 12:59 pts/0 00:00:00 grep --color=auto sssd 12. sssd.log contains below information: sssd.log:(2020-05-28 12:58:26): [sssd] [svc_child_info] (0x0020): Child [29985] ('example1':'%BE_example1') was terminated by own WATCHDOG Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (sssd bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3904 |