Bug 1241723 - cleanallruv hangs shutdown if not all replicas online
Summary: cleanallruv hangs shutdown if not all replicas online
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: 389-ds-base
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Assignee: Noriko Hosoi
QA Contact: Viktor Ashirov
Depends On:
Reported: 2015-07-09 23:46 UTC by Noriko Hosoi
Modified: 2020-09-13 21:27 UTC
CC List: 5 users

Fixed In Version: 389-ds-base-
Doc Type: Bug Fix
Doc Text:
No doc text needed. (This fixes a regression introduced in this version.)
Clone Of:
Last Closed: 2015-11-19 11:43:02 UTC
Target Upstream Version:

Attachments

System ID Private Priority Status Summary Last Updated
Github 389ds 389-ds-base issues 1548 0 None None None 2020-09-13 21:27:37 UTC
Red Hat Product Errata RHBA-2015:2351 0 normal SHIPPED_LIVE 389-ds-base bug fix and enhancement update 2015-11-19 10:28:44 UTC

Description Noriko Hosoi 2015-07-09 23:46:15 UTC
There are race conditions in some of the cleanallruv code where we can go to sleep without checking whether the server is shutting down, for example when checking if replicas are online:

Thread 2 (Thread 0x7f2a4efe5700 (LWP 29721)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:218
#1  0x00007f2a86c73217 in pt_TimedWait (cv=cv@entry=0x7f2a8ab43e78, ml=0x7f2a8ab72e20, timeout=timeout@entry=320000) at ../../../nspr/pr/src/pthreads/ptsynch.c:260
#2  0x00007f2a86c736de in PR_WaitCondVar (cvar=0x7f2a8ab43e70, timeout=320000) at ../../../nspr/pr/src/pthreads/ptsynch.c:387
#3  0x00007f2a7e8282dc in check_agmts_are_alive (replica=0x7f2a8ab8ee40, rid=300, task=0x7f29fc0111f0) at ldap/servers/plugins/replication/repl5_replica_config.c:2275
#4  0x00007f2a7e827015 in replica_cleanallruv_thread (arg=0x7f29fc010bc0) at ldap/servers/plugins/replication/repl5_replica_config.c:1816
#5  0x00007f2a86c78b46 in _pt_root (arg=0x7f29fc014950) at ../../../nspr/pr/src/pthreads/ptthread.c:204
#6  0x00007f2a8661bd14 in start_thread (arg=0x7f2a4efe5700) at pthread_create.c:309
#7  0x00007f2a8613968d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7f2a88d35800 (LWP 29195)):
#0  0x00007f2a861329f3 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007f2a888df81f in DS_Sleep (ticks=100) at ldap/servers/slapd/util.c:1035
#2  0x00007f2a7e8277d0 in replica_cleanall_ruv_destructor (task=0x7f29fc0111f0) at ldap/servers/plugins/replication/repl5_replica_config.c:1993
#3  0x00007f2a888d4396 in destroy_task (when=0, arg=0x7f29fc0111f0) at ldap/servers/slapd/task.c:621
#4  0x00007f2a888d96c5 in task_shutdown () at ldap/servers/slapd/task.c:2539
#5  0x00007f2a88d86537 in slapd_daemon (ports=0x7fffac8b2e90) at ldap/servers/slapd/daemon.c:1387
#6  0x00007f2a88d8f05d in main (argc=7, argv=0x7fffac8b2fc8) at ldap/servers/slapd/main.c:1115

Comment 1 mreynolds 2015-07-10 11:13:42 UTC
Fixed upstream

Comment 3 Viktor Ashirov 2015-08-31 21:07:18 UTC
Hi Mark,
could you please provide steps to verify?


Comment 4 mreynolds 2015-09-01 14:06:50 UTC
Hi Viktor,

I'm afraid this one is very difficult to reproduce.  You have to stop the server at a very precise time in order to reproduce the issue.  Here are the steps, but it might take 10,000 tries to actually catch it.

[1]  Set up MMR: replicas A & B
[2]  Stop replica B
[3]  Issue a cleanallruv task for replica B's rid
[4]  Wait 3 minutes so the cleanallruv task starts to loop/backoff (waiting for replica B to come back online)
[5]  Stop replica A - the server should stop within a few seconds (not minutes)

The trick is to issue the "stop-dirsrv" after the check to see if the replica is online, but before we go to sleep waiting for the next interval to recheck it.  You have only a few milliseconds in which to issue the stop command, hence this will be very difficult to reproduce.


Comment 6 mreynolds 2015-09-18 17:28:20 UTC
Fixed upstream

Comment 8 Sankar Ramalingam 2015-09-23 11:13:44 UTC
[root@rhds10-vm2 MMR_WINSYNC]# /usr/sbin/stop-dirsrv M2
[root@rhds10-vm2 MMR_WINSYNC]# PORT="1189"; ldapdelete -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 "cn=1189_to_1626_on_`hostname`,cn=replica,cn=\"dc=passsync,dc=com\",cn=mapping tree,cn=config"
[root@rhds10-vm2 MMR_WINSYNC]# TASK_NAME=task123; SUFFIX="dc=passsync,dc=com";REPLICA_ID=1232 ; PORT="1189"; ldapmodify -a -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 << EOF
> dn: cn=$TASK_NAME,cn=cleanallruv,cn=tasks,cn=config
> cn: $TASK_NAME
> objectclass: extensibleObject
> replica-base-dn: $SUFFIX
> replica-id: $REPLICA_ID
> EOF
adding new entry "cn=task123,cn=cleanallruv,cn=tasks,cn=config"

[root@rhds10-vm2 MMR_WINSYNC]# sleep 180 ; time /usr/sbin/stop-dirsrv M1

real	0m2.337s
user	0m0.013s
sys	0m0.042s

It took less than 3 seconds to stop M1. Hence, marking the bug as Verified.

Build tested:
[root@rhds10-vm2 ~]# rpm -qa |grep -i 389-ds-base

Comment 9 errata-xmlrpc 2015-11-19 11:43:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

