Bug 450046

Summary:

changelog writes can hang due to leftover semaphore

Product:

Red Hat Directory Server

Reporter:

Ulf Weltman <ulf.weltman>

Component:

Replication - General

Assignee:

Nathan Kinder <nkinder>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Chandrasekar Kannan <ckannan>

Severity:

medium

Docs Contact:

Priority:

low

Version:

8.0

CC:

benl, nhosoi, rmeggins, yzhang

Target Milestone:

---

Target Release:

---

Hardware:

All

OS:

All

Whiteboard:

Fixed In Version:

8.1

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2009-04-29 23:04:16 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

249650, 493682

Attachments:

Description	Flags
CVS Diffs	none

Description Ulf Weltman 2008-06-04 22:35:28 UTC

If the semaphore that controls changelog write concurrency happens to be at 0
when ns-slapd exits uncleanly, then unless system is rebooted changelog writes
will hang after ns-slapd is started back up.  Upon startup the semaphore should
be removed if it exists to ensure its counter is reset; add the PR_SEM_EXCL flag
when creating the semaphore and if it fails with
PR_GetError()==PR_FILE_EXISTS_ERROR then delete the semaphore and retry.

Comment 1 Ulf Weltman 2008-07-02 23:18:00 UTC

This was happening when I was testing a failover cluster so I was killing
ns-slapd a lot.  To reproduce you can run ns-slapd in a debugger, put a
breakpoint in _cl5WriteOperation() right after the PR_WaitSemaphore() call, do
an LDAP update so that it enters the function, and then kill the process when it
hits breakpoint.  Repeat three times.  Now any changelog updates should hang...

Or you can just create the semaphore with a separate small program using
PR_OpenSemaphore() or sem_open() directly and initialize with value of 3.  The
semaphore name should be your changelog directory plus the replica name and an
extension of .sema.  On HP-UX:

/var/opt/dirsrv/slapd-example/changelogdb/47fe7102-1dd211b2-8068b513-45fb0000.sema

Comment 2 Nathan Kinder 2008-11-14 01:22:11 UTC

Created attachment 323522 [details]
CVS Diffs

This fix makes the changelog code attempt to create the semaphore with exclusive access.  If this fails due to the semaphore being left around from a previous unclean exit of the server, we delete the semaphore and re-create it.

Comment 3 Nathan Kinder 2008-11-14 02:07:11 UTC

Checked into ldapserver (HEAD).  Thanks to Noriko for her review.

Checking in ldap/servers/plugins/replication/cl5_api.c;
/cvs/dirsec/ldapserver/ldap/servers/plugins/replication/cl5_api.c,v  <--  cl5_api.c
new revision: 1.24; previous revision: 1.23
done

Comment 4 Yi Zhang 2009-04-08 00:55:35 UTC

Test design: [based on discussion with Noriko]

1. setup single master replication (this means changelog is enabled on master)
2. on master machine, using ldclt pumping data for 8 hours
3. while the data is bing pumping into master machine, kill DS server on master machine every 5 minutes, and then do normal restart. 

-- after 8 hours, we can check changelog on the master machine, if the modify time of this changelog db file is not uptoday, then it means the relication is hanging at some point, otherwise, the test is pass.

i will write script to do it and post result later

Comment 5 Yi Zhang 2009-04-08 05:30:42 UTC

test is running now
 [of course, we have to setup replication first]
script to kill/start server every 5 minutes
[root@mv32a-vm ~]# cat k.sh 
c=0

while true
do
	echo "[`date`] cycle :[$c]"
	echo "[`date`] kill slapd"
	ps -elf | grep ns-slapd | cut -d" " -f7 | xargs kill -9
	echo "[`date`] start slapd"
	service dirsrv start
	((c=c+1))
	echo "c=$c"
	echo "[`date`] sleep 5 minutes (300 sec)"
	sleep 300
done

ldclt cmd to inject data
[root@mv32a-vm ~]# cat test.sh 
ldclt -h localhost -p 389 \
      -D "cn=directory manager" -w redhat123  \
      -e add \
        -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" \
	-fcn=tuserXXXXXXXX \
        -r 1000 -R 99999999 \
        -e person -e incr -e noloop \
      -V -q -n 20 -W 10

=== i will post result in the morning ====

Comment 6 Yi Zhang 2009-04-08 18:30:08 UTC

test passed. bug closed

improved script is below:

cat kill.sh
c=0
while true
do
	echo "[`date`] cycle :[$c]"
	echo "[`date`] kill slapd"
	ps -elf | grep ns-slapd | cut -d" " -f7 | xargs kill -9
	sleep 2
	echo "[`date`] start slapd"
	service dirsrv start
	((c=c+1)) 
	echo "[`date`] sleep 1 minutes "
	sleep 60
done
-------------------------------------
cat test.sh 
while true
do
	header=u.$RANDOM
	echo "header=[$header]"
ldclt -h localhost -p 389 \
      -D "cn=directory manager" -w redhat123  \
      -e bindeach \
      -e add \
        -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" \
	-fcn=q${header}_XXXXX \
        -r 1 -R 99999 \
        -e person -e random \
       -E 10000000 -V -q -n 20 -W 2
done

-----------------------
cat count.sh 
while true
do
   countMaster=`/usr/lib/mozldap/ldapsearch -p 389 -D "cn=directory manager" -w redhat123 -s sub -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" "cn=*" dn | grep "dn" | wc`
   countRepl=`/usr/lib/mozldap/ldapsearch -p 12537 -D "cn=directory manager" -w redhat123 -s sub -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" "cn=*" dn | grep "dn" | wc`
   echo "[`date`] ============================================"
   echo "   count on master  : $countMaster"
   echo "   count on replica : $countRepl"
   sleep 30
done

------------------------
and finally, check the 
/var/lib/dirsrv/slapd-mv32a-vm/changelogdb
with 
ls -l --time=atime --full-time

ensure the access time is uptoday

Comment 7 Chandrasekar Kannan 2009-04-29 23:04:16 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0455.html