Bug 1241723

Summary:	cleanallruv hangs shutdown if not all replicas online
Product:	Red Hat Enterprise Linux 7	Reporter:	Noriko Hosoi <nhosoi>
Component:	389-ds-base	Assignee:	Noriko Hosoi <nhosoi>
Status:	CLOSED ERRATA	QA Contact:	Viktor Ashirov <vashirov>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	7.0	CC:	mreynolds, nkinder, rmeggins, sramling, tlavigne
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	389-ds-base-1.3.4.0-16.el7	Doc Type:	Bug Fix
Doc Text:	No Doc needed. (Fixing a regression added to this version)	Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-11-19 11:43:02 UTC	Type:	---

Description Noriko Hosoi 2015-07-09 23:46:15 UTC

There are race conditions in some of the cleanallruv code where we can go to sleep without checking whether the server is shutting down, for example when checking if replicas are online:

Thread 2 (Thread 0x7f2a4efe5700 (LWP 29721)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:218
#1  0x00007f2a86c73217 in pt_TimedWait (cv=cv@entry=0x7f2a8ab43e78, ml=0x7f2a8ab72e20, timeout=timeout@entry=320000) at ../../../nspr/pr/src/pthreads/ptsynch.c:260
#2  0x00007f2a86c736de in PR_WaitCondVar (cvar=0x7f2a8ab43e70, timeout=320000) at ../../../nspr/pr/src/pthreads/ptsynch.c:387
#3  0x00007f2a7e8282dc in check_agmts_are_alive (replica=0x7f2a8ab8ee40, rid=300, task=0x7f29fc0111f0) at ldap/servers/plugins/replication/repl5_replica_config.c:2275
#4  0x00007f2a7e827015 in replica_cleanallruv_thread (arg=0x7f29fc010bc0) at ldap/servers/plugins/replication/repl5_replica_config.c:1816
#5  0x00007f2a86c78b46 in _pt_root (arg=0x7f29fc014950) at ../../../nspr/pr/src/pthreads/ptthread.c:204
#6  0x00007f2a8661bd14 in start_thread (arg=0x7f2a4efe5700) at pthread_create.c:309
#7  0x00007f2a8613968d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7f2a88d35800 (LWP 29195)):
#0  0x00007f2a861329f3 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007f2a888df81f in DS_Sleep (ticks=100) at ldap/servers/slapd/util.c:1035
#2  0x00007f2a7e8277d0 in replica_cleanall_ruv_destructor (task=0x7f29fc0111f0) at ldap/servers/plugins/replication/repl5_replica_config.c:1993
#3  0x00007f2a888d4396 in destroy_task (when=0, arg=0x7f29fc0111f0) at ldap/servers/slapd/task.c:621
#4  0x00007f2a888d96c5 in task_shutdown () at ldap/servers/slapd/task.c:2539
#5  0x00007f2a88d86537 in slapd_daemon (ports=0x7fffac8b2e90) at ldap/servers/slapd/daemon.c:1387
#6  0x00007f2a88d8f05d in main (argc=7, argv=0x7fffac8b2fc8) at ldap/servers/slapd/main.c:1115

Comment 1 mreynolds 2015-07-10 11:13:42 UTC

Fixed upstream

Comment 3 Viktor Ashirov 2015-08-31 21:07:18 UTC

Hi Mark,
could you please provide steps to verify?

Thanks!

Comment 4 mreynolds 2015-09-01 14:06:50 UTC

Hi Viktor,

I'm afraid this one is very difficult to reproduce.  You have to stop the server at a very precise time in order to reproduce the issue.  Here are the steps, but it might take 10,000 tries to actually catch it.

[1]  Set up MMR: replicas A & B
[2]  Stop replica B
[3]  Issue a cleanallruv task for replica B's rid
[4]  Wait 3 minutes so the cleanallruv task starts to loop/back off (waiting for replica B to come back online).
[5]  Stop replica A - the server should stop within a few seconds (not minutes)

The trick is to issue "stop-dirsrv" after the check to see if the replica is online, but before we go to sleep waiting for the next interval to recheck whether it is online.  There is only a window of a few milliseconds in which to issue the stop command, so this will be very difficult to reproduce.

Mark

Comment 6 mreynolds 2015-09-18 17:28:20 UTC

Fixed upstream

Comment 8 Sankar Ramalingam 2015-09-23 11:13:44 UTC

[root@rhds10-vm2 MMR_WINSYNC]# /usr/sbin/stop-dirsrv M2
[root@rhds10-vm2 MMR_WINSYNC]# PORT="1189"; ldapdelete -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 "cn=1189_to_1626_on_`hostname`,cn=replica,cn=\"dc=passsync,dc=com\",cn=mapping tree,cn=config"
[root@rhds10-vm2 MMR_WINSYNC]# TASK_NAME=task123; SUFFIX="dc=passsync,dc=com";REPLICA_ID=1232 ; PORT="1189"; ldapmodify -a -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 << EOF
> dn: cn=$TASK_NAME,cn=cleanallruv,cn=tasks,cn=config
> cn: $TASK_NAME
> objectclass: extensibleObject
> replica-base-dn: $SUFFIX
> replica-id: $REPLICA_ID
> EOF
adding new entry "cn=task123,cn=cleanallruv,cn=tasks,cn=config"

[root@rhds10-vm2 MMR_WINSYNC]# sleep 180 ; time /usr/sbin/stop-dirsrv M1

real	0m2.337s
user	0m0.013s
sys	0m0.042s

It took less than 3 secs to stop M1. Hence, marking the bug as Verified.

Build tested:
[root@rhds10-vm2 ~]# rpm -qa |grep -i 389-ds-base
389-ds-base-libs-1.3.4.0-18.el7.x86_64
389-ds-base-1.3.4.0-18.el7.x86_64

Comment 9 errata-xmlrpc 2015-11-19 11:43:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2351.html