Bug 1834266
| Summary: | "off-by-one error" in watchdog implementation | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Alexey Tikhonov <atikhono> |
| Component: | sssd | Assignee: | Alexey Tikhonov <atikhono> |
| Status: | CLOSED ERRATA | QA Contact: | sssd-qe <sssd-qe> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 7.8 | CC: | grajaiya, jhrozek, lslebodn, mniranja, mzidek, pbrezina, sgoveas, thalman, tscherf |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | sync-to-jira | ||
| Fixed In Version: | sssd-1.16.5-5.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-09-29 19:50:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
master
653df69 - Watchdog: fixes "off-by-one" error
sssd-1-16
b46e2d4 - Watchdog: fixes "off-by-one" error
version:
sssd-common-pac-1.16.4-21.el7_7.1.x86_64
python-sssdconfig-1.16.4-21.el7_7.1.noarch
sssd-1.16.4-21.el7_7.1.x86_64
sssd-client-1.16.4-21.el7_7.1.x86_64
sssd-krb5-common-1.16.4-21.el7_7.1.x86_64
sssd-ldap-1.16.4-21.el7_7.1.x86_64
sssd-proxy-1.16.4-21.el7_7.1.x86_64
sssd-ipa-1.16.4-21.el7_7.1.x86_64
sssd-common-1.16.4-21.el7_7.1.x86_64
sssd-krb5-1.16.4-21.el7_7.1.x86_64
sssd-ad-1.16.4-21.el7_7.1.x86_64
Reproducing the issue:
Create a ldap server (389-ds) where there 1500000 users (student1 to student1500000) and following largegroups created
1. allusers which contains 1500000 users as members
2. lgroup1 which contains 15000 users as members
3. lgroup2 which contains 50000 users as members
cat /etc/sssd/sssd.conf
[sssd]
config_file_version = 2
services = nss, pam
domains = example1
debug_level = 9
[domain/example1]
ldap_search_base = dc=example,dc=test
id_provider = ldap
auth_provider = ldap
ldap_user_home_directory = /home/%u
ldap_uri = ldaps://cloud-qe-09.idmqe.lab.eng.bos.redhat.com
ldap_tls_cacert = /etc/openldap/cacerts/cacert.pem
use_fully_qualified_names = True
timeout = 2
debug_level = 9
[nss]
debug_level = 9
4. Start sssd and note the process pid
root 23985 1 0 12:46 ? 00:00:00 /usr/sbin/sssd -i --logger=files
root 23986 23985 0 12:46 ? 00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root 23987 23985 0 12:46 ? 00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root 23988 23985 0 12:46 ? 00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root 23998 23843 0 12:47 pts/1 00:00:00 grep --color=auto sssd
5. First lookup group lgroup1 which contains 15000 users
time getent group lgroup1@example1
....
....
student14982@example1,student14983@example1,student14984@example1,student14985@example1,student14986@example1,
student14987@example1,student14988@example1,student14989@example1,student14990@example1,student14991@example1,student14992@example1,
student14993@example1,student14994@example1
real 0m0.816s
user 0m0.002s
sys 0m0.007s
5. now lookup group lgroup2 which contains 50000 users
[root@qe-blade-01 sssd]# time getent group lgroup2@example1
real 0m15.850s
user 0m0.001s
sys 0m0.001s
6. check sssd process
[root@qe-blade-01 sssd]# ps -ef | grep sssd
root 23985 1 0 12:46 ? 00:00:00 /usr/sbin/sssd -i --logger=files
root 23987 23985 0 12:46 ? 00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root 23988 23985 0 12:46 ? 00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root 24001 23985 1 12:50 ? 00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root 24003 23843 0 12:50 pts/1 00:00:00 grep --color=auto sssd
As you can see sssd_be process was shutdown and restarted by watchdog.
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sysdb_cache_search_users] (0x2000): No such entry
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sdap_process_missing_member_2307] (0x0400): Adding a ghost entry
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sdap_process_group_send] (0x1000): All group members processed
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [be_ptask_destructor] (0x0400): Terminating periodic task [SUDO Smart Refresh]
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [be_ptask_destructor] (0x0400): Terminating periodic task [SUDO Full Refresh]
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [dp_terminate_active_requests] (0x0400): Terminating active data provider requests
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [dp_terminate_request] (0x4000): DP Request [Account #2]: Terminating.
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sdap_id_op_destroy] (0x4000): releasing operation connection
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [_dp_req_recv] (0x0400): DP Request [Account #2]: Receiving request data.
7. Update sssd to below versions:
sssd-common-pac-1.16.5-7.el7.x86_64
sssd-krb5-1.16.5-7.el7.x86_64
sssd-winbind-idmap-1.16.4-21.el7_7.1.x86_64
sssd-client-1.16.5-7.el7.x86_64
sssd-krb5-common-1.16.5-7.el7.x86_64
sssd-ipa-1.16.5-7.el7.x86_64
sssd-ldap-1.16.5-7.el7.x86_64
sssd-dbus-1.16.5-7.el7.x86_64
sssd-tools-1.16.5-7.el7.x86_64
sssd-kcm-1.16.5-7.el7.x86_64
python-sssdconfig-1.16.5-7.el7.noarch
sssd-common-1.16.5-7.el7.x86_64
sssd-ad-1.16.5-7.el7.x86_64
sssd-proxy-1.16.5-7.el7.x86_64
sssd-1.16.5-7.el7.x86_64
[root@auto-hv-02-guest10 sssd]# ps -ef | grep sssd
root 29984 1 1 12:57 ? 00:00:00 /usr/sbin/sssd -i --logger=files
root 29985 29984 3 12:57 ? 00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root 29986 29984 0 12:57 ? 00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root 29987 29984 0 12:57 ? 00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root 29989 29873 0 12:57 pts/0 00:00:00 grep --color=auto sssd
8. lookup group lgroup1@example1
time getent group lgroup1@example1
....
....
student14982@example1,student14983@example1,student14984@example1,student14985@example1,student14986@example1,
student14987@example1,student14988@example1,student14989@example1,student14990@example1,student14991@example1,student14992@example1,
student14993@example1,student14994@example1
real 0m0.008s
user 0m0.002s
sys 0m0.004s
9. lookup group lgroup2@example1
[root@auto-hv-02-guest10 sssd]# time getent group lgroup2@example1
real 0m3.955s
user 0m0.000s
sys 0m0.002s
10. we can see that sssd_be was shutdown and below message logged:
May 28 12:57:53 auto-hv-02-guest10.idmqe.lab.eng.bos.redhat.com polkitd[27670]: Unregistered Authentication Agent for unix-process:29978:12930963 (system bus name :1.43, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus)
May 28 12:58:26 auto-hv-02-guest10.idmqe.lab.eng.bos.redhat.com sssd[29984]: Child [29985] ('example1':'%BE_example1') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
May 28 12:58:26 auto-hv-02-guest10.idmqe.lab.eng.bos.redhat.com be[example1][29991]: Starting up
11. check the sssd_be process id which is changed as sssd_be process was restarted
[root@auto-hv-02-guest10 sssd]# ps -ef | grep sssd
root 29984 1 0 12:57 ? 00:00:00 /usr/sbin/sssd -i --logger=files
root 29986 29984 0 12:57 ? 00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root 29987 29984 0 12:57 ? 00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root 29991 29984 0 12:58 ? 00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root 30001 29873 0 12:59 pts/0 00:00:00 grep --color=auto sssd
12. sssd.log contains below information:
sssd.log:(2020-05-28 12:58:26): [sssd] [svc_child_info] (0x0020): Child [29985] ('example1':'%BE_example1') was terminated by own WATCHDOG
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (sssd bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3904 |
'man sssd.conf': timeout: "Note that after three missed heartbeats the process will terminate itself." But implementation is: #define WATCHDOG_MAX_TICKS 3 ... if (__sync_add_and_fetch(&watchdog_ctx.ticks, 1) > WATCHDOG_MAX_TICKS) { ... _exit(1); -- since after reset ticks start from 0 effectively this is 4 heartbeats.