Bug 450046 - changelog writes can hang due to leftover semaphore
Status: CLOSED CURRENTRELEASE
Product: Red Hat Directory Server
Classification: Red Hat
Component: Replication - General
Version: 8.0
Hardware: All
OS: All
Priority: low
Severity: medium
Assigned To: Nathan Kinder
QA Contact: Chandrasekar Kannan
Depends On:
Blocks: 249650 FDS1.2.0
Reported: 2008-06-04 18:35 EDT by Ulf Weltman
Modified: 2015-01-04 18:32 EST

Fixed In Version: 8.1
Doc Type: Bug Fix
Last Closed: 2009-04-29 19:04:16 EDT


Attachments
CVS Diffs (2.88 KB, patch)
2008-11-13 20:22 EST, Nathan Kinder

Description Ulf Weltman 2008-06-04 18:35:28 EDT
If the semaphore that controls changelog write concurrency happens to be at 0
when ns-slapd exits uncleanly, changelog writes will hang after ns-slapd is
started back up, unless the system is rebooted in between.  On startup the
semaphore should be removed if it exists, to ensure its counter is reset: add
the PR_SEM_EXCL flag when creating the semaphore, and if creation fails with
PR_GetError()==PR_FILE_EXISTS_ERROR then delete the semaphore and retry.
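
The create-exclusively, delete, and retry sequence described above can be
sketched as follows.  This is only an illustration, not the cl5_api.c code: it
uses the POSIX calls sem_open() and sem_unlink() in place of the NSPR calls
named in the description, EEXIST plays the role of PR_FILE_EXISTS_ERROR, and
the helper name open_fresh_semaphore() and the 0600 mode are assumptions made
for the example.

```c
#include <errno.h>
#include <fcntl.h>      /* O_CREAT, O_EXCL */
#include <semaphore.h>

/*
 * Hypothetical helper (not from cl5_api.c): create a named semaphore with a
 * known starting count, discarding any stale one left by an unclean exit.
 * Returns the semaphore, or SEM_FAILED with errno set.
 */
static sem_t *open_fresh_semaphore(const char *name, unsigned int initial)
{
    /* O_EXCL makes creation fail with EEXIST if the name already exists. */
    sem_t *sem = sem_open(name, O_CREAT | O_EXCL, 0600, initial);

    if (sem == SEM_FAILED && errno == EEXIST) {
        /* Leftover semaphore: its count cannot be trusted, so remove it. */
        if (sem_unlink(name) != 0 && errno != ENOENT) {
            return SEM_FAILED;
        }
        /* Create it again so it starts at the intended count. */
        sem = sem_open(name, O_CREAT | O_EXCL, 0600, initial);
    }
    return sem;
}
```

A caller would invoke this once at startup, before any writer waits on the
semaphore, so that a count left over from a previous process is never reused.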
Comment 1 Ulf Weltman 2008-07-02 19:18:00 EDT
This was happening when I was testing a failover cluster, so I was killing
ns-slapd a lot.  To reproduce, run ns-slapd in a debugger, put a breakpoint in
_cl5WriteOperation() right after the PR_WaitSemaphore() call, do an LDAP update
so that it enters the function, and then kill the process when it hits the
breakpoint.  Repeat three times.  Now any changelog updates should hang...

Or you can just create the semaphore with a separate small program, using
PR_OpenSemaphore() or sem_open() directly, and initialize it with a value of 3.
The semaphore name should be your changelog directory plus the replica name and
an extension of .sema.  On HP-UX:

/var/opt/dirsrv/slapd-example/changelogdb/47fe7102-1dd211b2-8068b513-45fb0000.sema
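
A small stand-alone sketch of the leftover state, using POSIX sem_open()
rather than NSPR.  The helper names are hypothetical, the count of 3 is taken
from the "repeat three times" step above, and the semaphore name passed in is
up to the caller (POSIX naming rules differ between platforms, so the HP-UX
path above may not be usable as-is elsewhere).  The first function takes every
permit and closes without giving any back, as a killed writer would; the second
shows that a later non-exclusive open finds the count still at 0.

```c
#include <errno.h>
#include <fcntl.h>      /* O_CREAT, O_EXCL */
#include <semaphore.h>

/*
 * Hypothetical reproduction helper: create the named semaphore with
 * `initial` permits, take them all, and close without posting back.
 * Returns 0 on success, -1 on error.
 */
static int leave_drained_semaphore(const char *name, unsigned int initial)
{
    sem_unlink(name);   /* start clean; a missing name is not an error here */
    sem_t *sem = sem_open(name, O_CREAT | O_EXCL, 0600, initial);
    if (sem == SEM_FAILED) {
        return -1;
    }
    for (unsigned int i = 0; i < initial; i++) {
        if (sem_wait(sem) != 0) {
            sem_close(sem);
            return -1;
        }
    }
    /* Close only: the name and its count of 0 stay behind. */
    return sem_close(sem);
}

/*
 * Returns 1 if a non-exclusive open of `name` finds no permit available
 * (a blocking wait would hang), 0 if a permit was available, -1 on error.
 */
static int next_writer_would_block(const char *name, unsigned int initial)
{
    /* Without O_EXCL an existing semaphore is reused; `initial` is ignored. */
    sem_t *sem = sem_open(name, O_CREAT, 0600, initial);
    if (sem == SEM_FAILED) {
        return -1;
    }
    int rc = sem_trywait(sem);
    int result = (rc == 0) ? 0 : (errno == EAGAIN ? 1 : -1);
    if (rc == 0) {
        sem_post(sem);  /* give the permit back */
    }
    sem_close(sem);
    return result;
}
```

Calling leave_drained_semaphore() and then next_writer_would_block() with the
same name reports that a writer would block until the name is removed with
sem_unlink() or the system is rebooted.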
Comment 2 Nathan Kinder 2008-11-13 20:22:11 EST
Created attachment 323522
CVS Diffs

This fix makes the changelog code attempt to create the semaphore with exclusive access.  If this fails due to the semaphore being left around from a previous unclean exit of the server, we delete the semaphore and re-create it.
Comment 3 Nathan Kinder 2008-11-13 21:07:11 EST
Checked into ldapserver (HEAD).  Thanks to Noriko for her review.

Checking in ldap/servers/plugins/replication/cl5_api.c;
/cvs/dirsec/ldapserver/ldap/servers/plugins/replication/cl5_api.c,v  <--  cl5_api.c
new revision: 1.24; previous revision: 1.23
done
Comment 4 Yi Zhang 2009-04-07 20:55:35 EDT
Test design (based on discussion with Noriko):

1. Set up single-master replication (this means the changelog is enabled on the master).
2. On the master machine, use ldclt to pump data for 8 hours.
3. While data is being pumped into the master, kill the DS server on the master every 5 minutes, then do a normal restart.

-- After 8 hours, check the changelog on the master machine. If the modify time of the changelog db file is not up to date, replication hung at some point; otherwise, the test passes.

I will write a script to do this and post the result later.
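
The final check in the test design (is the changelog db file's modify time
recent?) could be automated along these lines.  This is a sketch only: the
helper name modified_recently() and the idea of passing a maximum age in
seconds are assumptions, not part of the test that was actually run.

```c
#include <sys/stat.h>
#include <time.h>

/*
 * Hypothetical check: returns 1 if `path` was modified within the last
 * `max_age_sec` seconds, 0 if it is older, -1 if stat() fails.
 */
static int modified_recently(const char *path, time_t max_age_sec)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        return -1;
    }
    time_t now = time(NULL);
    return (now - st.st_mtime) <= max_age_sec ? 1 : 0;
}
```

With ldclt still writing, a result of 0 for the changelog db file would
suggest that changelog writes have stopped.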
Comment 5 Yi Zhang 2009-04-08 01:30:42 EDT
The test is running now
 [of course, we have to set up replication first].
Script to kill/start the server every 5 minutes:
[root@mv32a-vm ~]# cat k.sh 
c=0

while true
do
	echo "[`date`] cycle :[$c]"
	echo "[`date`] kill slapd"
	# kill by exact process name; cutting columns out of ps output is fragile
	pkill -9 -x ns-slapd
	echo "[`date`] start slapd"
	service dirsrv start
	((c=c+1))
	echo "c=$c"
	echo "[`date`] sleep 5 minutes (300 sec)"
	sleep 300
done

ldclt cmd to inject data
[root@mv32a-vm ~]# cat test.sh 
ldclt -h localhost -p 389 \
      -D "cn=directory manager" -w redhat123  \
      -e add \
        -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" \
	-fcn=tuserXXXXXXXX \
        -r 1000 -R 99999999 \
        -e person -e incr -e noloop \
      -V -q -n 20 -W 10

=== I will post the result in the morning ====
Comment 6 Yi Zhang 2009-04-08 14:30:08 EDT
test passed. bug closed

improved script is below:

cat kill.sh
c=0
while true
do
	echo "[`date`] cycle :[$c]"
	echo "[`date`] kill slapd"
	# kill by exact process name; cutting columns out of ps output is fragile
	pkill -9 -x ns-slapd
	sleep 2
	echo "[`date`] start slapd"
	service dirsrv start
	((c=c+1)) 
	echo "[`date`] sleep 1 minute (60 sec)"
	sleep 60
done
-------------------------------------
cat test.sh 
while true
do
	header=u.$RANDOM
	echo "header=[$header]"
ldclt -h localhost -p 389 \
      -D "cn=directory manager" -w redhat123  \
      -e bindeach \
      -e add \
        -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" \
	-fcn=q${header}_XXXXX \
        -r 1 -R 99999 \
        -e person -e random \
       -E 10000000 -V -q -n 20 -W 2
done

-----------------------
cat count.sh 
while true
do
   # grep -c prints a single entry count (plain wc prints lines, words and bytes)
   countMaster=`/usr/lib/mozldap/ldapsearch -p 389 -D "cn=directory manager" -w redhat123 -s sub -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" "cn=*" dn | grep -c "^dn:"`
   countRepl=`/usr/lib/mozldap/ldapsearch -p 12537 -D "cn=directory manager" -w redhat123 -s sub -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" "cn=*" dn | grep -c "^dn:"`
   echo "[`date`] ============================================"
   echo "   count on master  : $countMaster"
   echo "   count on replica : $countRepl"
   sleep 30
done

------------------------
Finally, check
/var/lib/dirsrv/slapd-mv32a-vm/changelogdb
with
ls -l --time=atime --full-time

and ensure the access time is up to date.
Comment 7 Chandrasekar Kannan 2009-04-29 19:04:16 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0455.html
