450046 – changelog writes can hang due to leftover semaphore

Bug 450046 - changelog writes can hang due to leftover semaphore

Summary: changelog writes can hang due to leftover semaphore

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Directory Server
Classification:	Red Hat
Component:	Replication - General
Sub Component:
Version:	8.0
Hardware:	All
OS:	All
Priority:	low
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Nathan Kinder
QA Contact:	Chandrasekar Kannan
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	249650 FDS1.2.0
TreeView+	depends on / blocked

Reported:	2008-06-04 22:35 UTC by Ulf Weltman
Modified:	2015-01-04 23:32 UTC (History)
CC List:	4 users (show)
Fixed In Version:	8.1
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-04-29 23:04:16 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
CVS Diffs (2.88 KB, patch) 2008-11-14 01:22 UTC, Nathan Kinder	no flags	Details \| Diff
View All

Description Ulf Weltman 2008-06-04 22:35:28 UTC

If the semaphore that controls changelog write concurrency happens to be at 0
when ns-slapd exits uncleanly, then unless system is rebooted changelog writes
will hang after ns-slapd is started back up.  Upon startup the semaphore should
be removed if it exists to ensure its counter is reset; add the PR_SEM_EXCL flag
when creating the semaphore and if it fails with
PR_GetError()==PR_FILE_EXISTS_ERROR then delete the semaphore and retry.

Comment 1 Ulf Weltman 2008-07-02 23:18:00 UTC

This was happening when I was testing a failover cluster so I was killing
ns-slapd a lot.  To reproduce you can run ns-slapd in a debugger, put a
breakpoint in _cl5WriteOperation() right after the PR_WaitSemaphore() call, do
an LDAP update so that it enters the function, and then kill the process when it
hits breakpoint.  Repeat three times.  Now any changelog updates should hang...

Or you can just create the semaphore with a separate small program using
PR_OpenSemaphore() or sem_open() directly and initialize with value of 3.  The
semaphore name should be your changelog directory plus the replica name and an
extension of .sema.  On HP-UX:

/var/opt/dirsrv/slapd-example/changelogdb/47fe7102-1dd211b2-8068b513-45fb0000.sema

Comment 2 Nathan Kinder 2008-11-14 01:22:11 UTC

Created attachment 323522 [details]
CVS Diffs

This fix makes the changelog code attempt to create the semaphore with exclusive access.  If this fails due to the semaphore being left around from a previous unclean exit of the server, we delete the semaphore and re-create it.

Comment 3 Nathan Kinder 2008-11-14 02:07:11 UTC

Checked into ldapserver (HEAD).  Thanks to Noriko for her review.

Checking in ldap/servers/plugins/replication/cl5_api.c;
/cvs/dirsec/ldapserver/ldap/servers/plugins/replication/cl5_api.c,v  <--  cl5_api.c
new revision: 1.24; previous revision: 1.23
done

Comment 4 Yi Zhang 2009-04-08 00:55:35 UTC

Test design: [based on discussion with Noriko]

1. setup single master replication (this means changelog is enabled on master)
2. on master machine, using ldclt pumping data for 8 hours
3. while the data is bing pumping into master machine, kill DS server on master machine every 5 minutes, and then do normal restart. 

-- after 8 hours, we can check changelog on the master machine, if the modify time of this changelog db file is not uptoday, then it means the relication is hanging at some point, otherwise, the test is pass.

i will write script to do it and post result later

Comment 5 Yi Zhang 2009-04-08 05:30:42 UTC

test is running now
 [of course, we have to setup replication first]
script to kill/start server every 5 minutes
[root@mv32a-vm ~]# cat k.sh 
c=0

while true
do
	echo "[`date`] cycle :[$c]"
	echo "[`date`] kill slapd"
	ps -elf | grep ns-slapd | cut -d" " -f7 | xargs kill -9
	echo "[`date`] start slapd"
	service dirsrv start
	((c=c+1))
	echo "c=$c"
	echo "[`date`] sleep 5 minutes (300 sec)"
	sleep 300
done

ldclt cmd to inject data
[root@mv32a-vm ~]# cat test.sh 
ldclt -h localhost -p 389 \
      -D "cn=directory manager" -w redhat123  \
      -e add \
        -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" \
	-fcn=tuserXXXXXXXX \
        -r 1000 -R 99999999 \
        -e person -e incr -e noloop \
      -V -q -n 20 -W 10

=== i will post result in the morning ====

Comment 6 Yi Zhang 2009-04-08 18:30:08 UTC

test passed. bug closed

improved script is below:

cat kill.sh
c=0
while true
do
	echo "[`date`] cycle :[$c]"
	echo "[`date`] kill slapd"
	ps -elf | grep ns-slapd | cut -d" " -f7 | xargs kill -9
	sleep 2
	echo "[`date`] start slapd"
	service dirsrv start
	((c=c+1)) 
	echo "[`date`] sleep 1 minutes "
	sleep 60
done
-------------------------------------
cat test.sh 
while true
do
	header=u.$RANDOM
	echo "header=[$header]"
ldclt -h localhost -p 389 \
      -D "cn=directory manager" -w redhat123  \
      -e bindeach \
      -e add \
        -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" \
	-fcn=q${header}_XXXXX \
        -r 1 -R 99999 \
        -e person -e random \
       -E 10000000 -V -q -n 20 -W 2
done

-----------------------
cat count.sh 
while true
do
   countMaster=`/usr/lib/mozldap/ldapsearch -p 389 -D "cn=directory manager" -w redhat123 -s sub -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" "cn=*" dn | grep "dn" | wc`
   countRepl=`/usr/lib/mozldap/ldapsearch -p 12537 -D "cn=directory manager" -w redhat123 -s sub -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" "cn=*" dn | grep "dn" | wc`
   echo "[`date`] ============================================"
   echo "   count on master  : $countMaster"
   echo "   count on replica : $countRepl"
   sleep 30
done

------------------------
and finally, check the 
/var/lib/dirsrv/slapd-mv32a-vm/changelogdb
with 
ls -l --time=atime --full-time

ensure the access time is uptoday

Comment 7 Chandrasekar Kannan 2009-04-29 23:04:16 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0455.html

Note You need to log in before you can comment on or make changes to this bug.