There are race conditions in the cleanallruv code where a thread can go to sleep without checking whether the server is shutting down, for example when checking whether the replicas are online:
Thread 2 (Thread 0x7f2a4efe5700 (LWP 29721)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:218
#1 0x00007f2a86c73217 in pt_TimedWait (cv=cv@entry=0x7f2a8ab43e78, ml=0x7f2a8ab72e20, timeout=timeout@entry=320000) at ../../../nspr/pr/src/pthreads/ptsynch.c:260
#2 0x00007f2a86c736de in PR_WaitCondVar (cvar=0x7f2a8ab43e70, timeout=320000) at ../../../nspr/pr/src/pthreads/ptsynch.c:387
#3 0x00007f2a7e8282dc in check_agmts_are_alive (replica=0x7f2a8ab8ee40, rid=300, task=0x7f29fc0111f0) at ldap/servers/plugins/replication/repl5_replica_config.c:2275
#4 0x00007f2a7e827015 in replica_cleanallruv_thread (arg=0x7f29fc010bc0) at ldap/servers/plugins/replication/repl5_replica_config.c:1816
#5 0x00007f2a86c78b46 in _pt_root (arg=0x7f29fc014950) at ../../../nspr/pr/src/pthreads/ptthread.c:204
#6 0x00007f2a8661bd14 in start_thread (arg=0x7f2a4efe5700) at pthread_create.c:309
#7 0x00007f2a8613968d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Thread 1 (Thread 0x7f2a88d35800 (LWP 29195)):
#0 0x00007f2a861329f3 in select () at ../sysdeps/unix/syscall-template.S:82
#1 0x00007f2a888df81f in DS_Sleep (ticks=100) at ldap/servers/slapd/util.c:1035
#2 0x00007f2a7e8277d0 in replica_cleanall_ruv_destructor (task=0x7f29fc0111f0) at ldap/servers/plugins/replication/repl5_replica_config.c:1993
#3 0x00007f2a888d4396 in destroy_task (when=0, arg=0x7f29fc0111f0) at ldap/servers/slapd/task.c:621
#4 0x00007f2a888d96c5 in task_shutdown () at ldap/servers/slapd/task.c:2539
#5 0x00007f2a88d86537 in slapd_daemon (ports=0x7fffac8b2e90) at ldap/servers/slapd/daemon.c:1387
#6 0x00007f2a88d8f05d in main (argc=7, argv=0x7fffac8b2fc8) at ldap/servers/slapd/main.c:1115
Could you please provide steps to verify?
I'm afraid this one is very difficult to reproduce. You have to stop the server at a very precise time in order to reproduce the issue. Here are the steps, but it might take 10,000 tries to actually catch it.
 Set up MMR: replicas A & B
 Stop replica B
 Issue a cleanallruv task for replica B's rid
 Wait 3 minutes so the cleanallruv task starts to loop/back off (waiting for replica B to come back online)
 Stop replica A - the server should stop within a few seconds (not minutes)
The trick is to issue "stop-dirsrv" after the "check to see if the replica is online" but before we go to sleep waiting for the next interval to recheck if the server is online. There are only a few milliseconds in which to issue the stop command, so this will be very difficult to reproduce.
[root@rhds10-vm2 MMR_WINSYNC]# /usr/sbin/stop-dirsrv M2
[root@rhds10-vm2 MMR_WINSYNC]# PORT="1189"; ldapdelete -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 "cn=1189_to_1626_on_`hostname`,cn=replica,cn=\"dc=passsync,dc=com\",cn=mapping tree,cn=config"
[root@rhds10-vm2 MMR_WINSYNC]# TASK_NAME=task123; SUFFIX="dc=passsync,dc=com";REPLICA_ID=1232 ; PORT="1189"; ldapmodify -a -x -p $PORT -h localhost -D "cn=Directory Manager" -w Secret123 << EOF
> dn: cn=$TASK_NAME,cn=cleanallruv,cn=tasks,cn=config
> cn: $TASK_NAME
> objectclass: extensibleObject
> replica-base-dn: $SUFFIX
> replica-id: $REPLICA_ID
adding new entry "cn=task123,cn=cleanallruv,cn=tasks,cn=config"
[root@rhds10-vm2 MMR_WINSYNC]# sleep 180 ; time /usr/sbin/stop-dirsrv M1
It took less than 3 secs to stop M1. Hence, marking the bug as Verified.
[root@rhds10-vm2 ~]# rpm -qa |grep -i 389-ds-base
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.