Bug 1241723 - cleanallruv hangs shutdown if not all replicas online
Summary: cleanallruv hangs shutdown if not all replicas online
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: 389-ds-base
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Assignee: Noriko Hosoi
QA Contact: Viktor Ashirov
Depends On:
Reported: 2015-07-09 23:46 UTC by Noriko Hosoi
Modified: 2020-09-13 21:27 UTC
CC List: 5 users

Fixed In Version: 389-ds-base-
Doc Type: Bug Fix
Doc Text:
No doc text needed. (This fixes a regression introduced in this version.)
Clone Of:
Last Closed: 2015-11-19 11:43:02 UTC
Target Upstream Version:

Attachments

System ID Private Priority Status Summary Last Updated
Github 389ds 389-ds-base issues 1548 0 None None None 2020-09-13 21:27:37 UTC
Red Hat Product Errata RHBA-2015:2351 0 normal SHIPPED_LIVE 389-ds-base bug fix and enhancement update 2015-11-19 10:28:44 UTC

Description Noriko Hosoi 2015-07-09 23:46:15 UTC
There are race conditions in some of the cleanallruv code where we can go to sleep without checking whether the server is shutting down, for example when checking if replicas are online:

Thread 2 (Thread 0x7f2a4efe5700 (LWP 29721)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:218
#1  0x00007f2a86c73217 in pt_TimedWait (cv=cv@entry=0x7f2a8ab43e78, ml=0x7f2a8ab72e20, timeout=timeout@entry=320000) at ../../../nspr/pr/src/pthreads/ptsynch.c:260
#2  0x00007f2a86c736de in PR_WaitCondVar (cvar=0x7f2a8ab43e70, timeout=320000) at ../../../nspr/pr/src/pthreads/ptsynch.c:387
#3  0x00007f2a7e8282dc in check_agmts_are_alive (replica=0x7f2a8ab8ee40, rid=300, task=0x7f29fc0111f0) at ldap/servers/plugins/replication/repl5_replica_config.c:2275
#4  0x00007f2a7e827015 in replica_cleanallruv_thread (arg=0x7f29fc010bc0) at ldap/servers/plugins/replication/repl5_replica_config.c:1816
#5  0x00007f2a86c78b46 in _pt_root (arg=0x7f29fc014950) at ../../../nspr/pr/src/pthreads/ptthread.c:204
#6  0x00007f2a8661bd14 in start_thread (arg=0x7f2a4efe5700) at pthread_create.c:309
#7  0x00007f2a8613968d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7f2a88d35800 (LWP 29195)):
#0  0x00007f2a861329f3 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007f2a888df81f in DS_Sleep (ticks=100) at ldap/servers/slapd/util.c:1035
#2  0x00007f2a7e8277d0 in replica_cleanall_ruv_destructor (task=0x7f29fc0111f0) at ldap/servers/plugins/replication/repl5_replica_config.c:1993
#3  0x00007f2a888d4396 in destroy_task (when=0, arg=0x7f29fc0111f0) at ldap/servers/slapd/task.c:621
#4  0x00007f2a888d96c5 in task_shutdown () at ldap/servers/slapd/task.c:2539
#5  0x00007f2a88d86537 in slapd_daemon (ports=0x7fffac8b2e90) at ldap/servers/slapd/daemon.c:1387
#6  0x00007f2a88d8f05d in main (argc=7, argv=0x7fffac8b2fc8) at ldap/servers/slapd/main.c:1115

Comment 1 mreynolds 2015-07-10 11:13:42 UTC
Fixed upstream

Comment 3 Viktor Ashirov 2015-08-31 21:07:18 UTC
Hi Mark,
could you please provide steps to verify?


Comment 4 mreynolds 2015-09-01 14:06:50 UTC
Hi Viktor,

I'm afraid this one is very difficult to reproduce.  You have to stop the server at a very precise time in order to reproduce the issue.  Here are the steps, but it might take 10,000 tries to actually catch it.

[1]  Set up MMR: replicas A & B
[2]  Stop replica B
[3]  Issue a cleanallruv task for replica B's rid
[4]  Wait 3 minutes so the cleanallruv task starts to loop/backoff (waiting for replica B to come back online)
[5]  Stop replica A - the server should stop within a few seconds (not minutes)

The trick is to issue the "stop-dirsrv" after the check to see if the replica is online, but before we go to sleep waiting for the next interval to recheck it.  You have only a few milliseconds in which to issue the stop command, hence this will be very difficult to reproduce.


Comment 6 mreynolds 2015-09-18 17:28:20 UTC
Fixed upstream

Comment 8 Sankar Ramalingam 2015-09-23 11:13:44 UTC
[root@rhds10-vm2 MMR_WINSYNC]# /usr/sbin/stop-dirsrv M2
[root@rhds10-vm2 MMR_WINSYNC]# PORT="1189"; ldapdelete -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 "cn=1189_to_1626_on_`hostname`,cn=replica,cn=\"dc=passsync,dc=com\",cn=mapping tree,cn=config"
[root@rhds10-vm2 MMR_WINSYNC]# TASK_NAME=task123; SUFFIX="dc=passsync,dc=com";REPLICA_ID=1232 ; PORT="1189"; ldapmodify -a -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 << EOF
> dn: cn=$TASK_NAME,cn=cleanallruv,cn=tasks,cn=config
> cn: $TASK_NAME
> objectclass: extensibleObject
> replica-base-dn: $SUFFIX
> replica-id: $REPLICA_ID
> EOF
adding new entry "cn=task123,cn=cleanallruv,cn=tasks,cn=config"

[root@rhds10-vm2 MMR_WINSYNC]# sleep 180 ; time /usr/sbin/stop-dirsrv M1

real	0m2.337s
user	0m0.013s
sys	0m0.042s

It took less than 3 seconds to stop M1. Hence, marking the bug as Verified.

Build tested:
[root@rhds10-vm2 ~]# rpm -qa |grep -i 389-ds-base

Comment 9 errata-xmlrpc 2015-11-19 11:43:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

