Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1834266

Summary:	"off-by-one error" in watchdog implementation
Product:	Red Hat Enterprise Linux 7	Reporter:	Alexey Tikhonov <atikhono>
Component:	sssd	Assignee:	Alexey Tikhonov <atikhono>
Status:	CLOSED ERRATA	QA Contact:	sssd-qe <sssd-qe>
Severity:	low	Docs Contact:
Priority:	unspecified
Version:	7.8	CC:	grajaiya, jhrozek, lslebodn, mniranja, mzidek, pbrezina, sgoveas, thalman, tscherf
Target Milestone:	rc	Keywords:	Triaged
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	sync-to-jira
Fixed In Version:	sssd-1.16.5-5.el7	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-09-29 19:50:00 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Alexey Tikhonov 2020-05-11 12:16:38 UTC

'man sssd.conf': timeout: "Note that after three missed heartbeats the process will terminate itself."

But implementation is:

#define WATCHDOG_MAX_TICKS 3
...
    if (__sync_add_and_fetch(&watchdog_ctx.ticks, 1) > WATCHDOG_MAX_TICKS) {
        ...
        _exit(1);

-- since after reset ticks start from 0 effectively this is 4 heartbeats.

Comment 1 Alexey Tikhonov 2020-05-11 12:18:31 UTC

    master
        653df69 - Watchdog: fixes "off-by-one" error
    sssd-1-16
        b46e2d4 - Watchdog: fixes "off-by-one" error

Comment 7 Niranjan Mallapadi Raghavender 2020-05-28 17:01:32 UTC

version:
sssd-common-pac-1.16.4-21.el7_7.1.x86_64
python-sssdconfig-1.16.4-21.el7_7.1.noarch
sssd-1.16.4-21.el7_7.1.x86_64
sssd-client-1.16.4-21.el7_7.1.x86_64
sssd-krb5-common-1.16.4-21.el7_7.1.x86_64
sssd-ldap-1.16.4-21.el7_7.1.x86_64
sssd-proxy-1.16.4-21.el7_7.1.x86_64
sssd-ipa-1.16.4-21.el7_7.1.x86_64
sssd-common-1.16.4-21.el7_7.1.x86_64
sssd-krb5-1.16.4-21.el7_7.1.x86_64
sssd-ad-1.16.4-21.el7_7.1.x86_64


Reproducing the issue:
Create a ldap server (389-ds) where there 1500000 users (student1 to student1500000) and following largegroups created
1. allusers which contains 1500000 users as members 
2. lgroup1 which contains 15000 users as members
3. lgroup2 which contains 50000 users as members 


cat /etc/sssd/sssd.conf
[sssd]
config_file_version = 2
services = nss, pam
domains = example1
debug_level = 9

[domain/example1]
ldap_search_base = dc=example,dc=test
id_provider = ldap
auth_provider = ldap
ldap_user_home_directory = /home/%u
ldap_uri = ldaps://cloud-qe-09.idmqe.lab.eng.bos.redhat.com
ldap_tls_cacert = /etc/openldap/cacerts/cacert.pem
use_fully_qualified_names = True
timeout = 2
debug_level = 9

[nss]
debug_level = 9

4. Start sssd and note the process pid
root     23985     1  0 12:46 ?        00:00:00 /usr/sbin/sssd -i --logger=files
root     23986 23985  0 12:46 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root     23987 23985  0 12:46 ?        00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root     23988 23985  0 12:46 ?        00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root     23998 23843  0 12:47 pts/1    00:00:00 grep --color=auto sssd


5. First lookup group lgroup1 which contains 15000 users 
time getent group lgroup1@example1
....
....
student14982@example1,student14983@example1,student14984@example1,student14985@example1,student14986@example1,
student14987@example1,student14988@example1,student14989@example1,student14990@example1,student14991@example1,student14992@example1,
student14993@example1,student14994@example1

real    0m0.816s
user    0m0.002s
sys     0m0.007s

5. now lookup group lgroup2 which contains 50000 users 
[root@qe-blade-01 sssd]# time getent group lgroup2@example1

real    0m15.850s
user    0m0.001s
sys     0m0.001s

6. check sssd process 

[root@qe-blade-01 sssd]# ps -ef | grep sssd
root     23985     1  0 12:46 ?        00:00:00 /usr/sbin/sssd -i --logger=files
root     23987 23985  0 12:46 ?        00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root     23988 23985  0 12:46 ?        00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root     24001 23985  1 12:50 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root     24003 23843  0 12:50 pts/1    00:00:00 grep --color=auto sssd

As you can see sssd_be process was shutdown and restarted by watchdog. 

(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sysdb_cache_search_users] (0x2000): No such entry
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sdap_process_missing_member_2307] (0x0400): Adding a ghost entry
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sdap_process_group_send] (0x1000): All group members processed
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [be_ptask_destructor] (0x0400): Terminating periodic task [SUDO Smart Refresh]
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [be_ptask_destructor] (0x0400): Terminating periodic task [SUDO Full Refresh]
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [dp_terminate_active_requests] (0x0400): Terminating active data provider requests
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [dp_terminate_request] (0x4000): DP Request [Account #2]: Terminating.
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [sdap_id_op_destroy] (0x4000): releasing operation connection
(Thu May 28 12:50:21 2020) [sssd[be[example1]]] [_dp_req_recv] (0x0400): DP Request [Account #2]: Receiving request data.


7. Update sssd to below versions:

sssd-common-pac-1.16.5-7.el7.x86_64
sssd-krb5-1.16.5-7.el7.x86_64
sssd-winbind-idmap-1.16.4-21.el7_7.1.x86_64
sssd-client-1.16.5-7.el7.x86_64
sssd-krb5-common-1.16.5-7.el7.x86_64
sssd-ipa-1.16.5-7.el7.x86_64
sssd-ldap-1.16.5-7.el7.x86_64
sssd-dbus-1.16.5-7.el7.x86_64
sssd-tools-1.16.5-7.el7.x86_64
sssd-kcm-1.16.5-7.el7.x86_64
python-sssdconfig-1.16.5-7.el7.noarch
sssd-common-1.16.5-7.el7.x86_64
sssd-ad-1.16.5-7.el7.x86_64
sssd-proxy-1.16.5-7.el7.x86_64
sssd-1.16.5-7.el7.x86_64

[root@auto-hv-02-guest10 sssd]# ps -ef | grep sssd
root     29984     1  1 12:57 ?        00:00:00 /usr/sbin/sssd -i --logger=files
root     29985 29984  3 12:57 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root     29986 29984  0 12:57 ?        00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root     29987 29984  0 12:57 ?        00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root     29989 29873  0 12:57 pts/0    00:00:00 grep --color=auto sssd

8. lookup group lgroup1@example1

time getent group lgroup1@example1
....
....
student14982@example1,student14983@example1,student14984@example1,student14985@example1,student14986@example1,
student14987@example1,student14988@example1,student14989@example1,student14990@example1,student14991@example1,student14992@example1,
student14993@example1,student14994@example1
real    0m0.008s
user    0m0.002s
sys     0m0.004s


9. lookup group lgroup2@example1


[root@auto-hv-02-guest10 sssd]# time getent group lgroup2@example1

real    0m3.955s
user    0m0.000s
sys     0m0.002s

10. we can see that sssd_be was shutdown and below message logged:

May 28 12:57:53 auto-hv-02-guest10.idmqe.lab.eng.bos.redhat.com polkitd[27670]: Unregistered Authentication Agent for unix-process:29978:12930963 (system bus name :1.43, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus)
May 28 12:58:26 auto-hv-02-guest10.idmqe.lab.eng.bos.redhat.com sssd[29984]: Child [29985] ('example1':'%BE_example1') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
May 28 12:58:26 auto-hv-02-guest10.idmqe.lab.eng.bos.redhat.com be[example1][29991]: Starting up


11. check the sssd_be process id which is changed as sssd_be process was restarted

[root@auto-hv-02-guest10 sssd]# ps -ef | grep sssd
root     29984     1  0 12:57 ?        00:00:00 /usr/sbin/sssd -i --logger=files
root     29986 29984  0 12:57 ?        00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root     29987 29984  0 12:57 ?        00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root     29991 29984  0 12:58 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain example1 --uid 0 --gid 0 --logger=files
root     30001 29873  0 12:59 pts/0    00:00:00 grep --color=auto sssd

12. sssd.log contains below information:

sssd.log:(2020-05-28 12:58:26): [sssd] [svc_child_info] (0x0020): Child [29985] ('example1':'%BE_example1') was terminated by own WATCHDOG

Comment 9 errata-xmlrpc 2020-09-29 19:50:00 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sssd bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3904