If the semaphore that controls changelog write concurrency happens to be at 0 when ns-slapd exits uncleanly, changelog writes will hang after ns-slapd is started back up, and will keep hanging until the system is rebooted. On startup the semaphore should be removed if it exists, to ensure its counter is reset: add the PR_SEM_EXCL flag when creating the semaphore, and if creation fails with PR_GetError() == PR_FILE_EXISTS_ERROR, delete the semaphore and retry.
This was happening while I was testing a failover cluster, so I was killing ns-slapd a lot. To reproduce, run ns-slapd in a debugger, put a breakpoint in _cl5WriteOperation() right after the PR_WaitSemaphore() call, do an LDAP update so that it enters the function, and kill the process when it hits the breakpoint. Repeat three times. Now any changelog updates should hang.

Alternatively, create the semaphore with a separate small program using PR_OpenSemaphore() or sem_open() directly and initialize it with a value of 3. The semaphore name should be your changelog directory plus the replica name and an extension of .sema. On HP-UX:

/var/opt/dirsrv/slapd-example/changelogdb/47fe7102-1dd211b2-8068b513-45fb0000.sema
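For anyone scripting the reproduction, the naming rule above can be sketched as a small shell helper. The function name sema_name is hypothetical and not part of the server or NSPR; it only joins the changelog directory, the replica name and the .sema extension as described.

```shell
# Hypothetical helper (not part of ns-slapd): build the semaphore name as
# <changelog directory>/<replica name>.sema, per the description above.
sema_name() {
    # $1 = changelog directory (one trailing slash is tolerated)
    # $2 = replica name
    printf '%s/%s.sema\n' "${1%/}" "$2"
}

# Using the HP-UX values from this report:
sema_name /var/opt/dirsrv/slapd-example/changelogdb 47fe7102-1dd211b2-8068b513-45fb0000
# prints /var/opt/dirsrv/slapd-example/changelogdb/47fe7102-1dd211b2-8068b513-45fb0000.sema
```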
Created attachment 323522
CVS Diffs

This fix makes the changelog code attempt to create the semaphore with exclusive access. If this fails because the semaphore was left around from a previous unclean exit of the server, we delete the semaphore and re-create it.
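The control flow of the fix can be illustrated with a shell analogy. This is not the patch itself, which is C code in cl5_api.c using NSPR. Here a plain file stands in for the semaphore, shell noclobber stands in for the exclusive-create flag, and the function names create_exclusive and create_fresh are made up for this sketch.

```shell
# Analogy only: a plain file holding a number stands in for the semaphore.
# create_exclusive fails if the file already exists, much like creating
# a semaphore with an exclusive flag fails when one is left over.
create_exclusive() {
    # $1 = path, $2 = initial counter value
    ( set -o noclobber; printf '%s\n' "$2" > "$1" ) 2>/dev/null
}

# Create the counter fresh: if a stale one exists, delete it and retry once.
create_fresh() {
    create_exclusive "$1" "$2" && return 0
    [ -e "$1" ] || return 1     # failed for a reason other than "already exists"
    rm -f "$1"                  # stale leftover from an unclean exit
    create_exclusive "$1" "$2"
}
```

The point of the exclusive create is that a leftover object is detected rather than silently reused with its old counter, so the retry always starts from the intended initial value.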
Checked into ldapserver (HEAD). Thanks to Noriko for her review.

Checking in ldap/servers/plugins/replication/cl5_api.c;
/cvs/dirsec/ldapserver/ldap/servers/plugins/replication/cl5_api.c,v <-- cl5_api.c
new revision: 1.24; previous revision: 1.23
done
Test design (based on discussion with Noriko):

1. Set up single-master replication (this means the changelog is enabled on the master).
2. On the master machine, use ldclt to pump data in for 8 hours.
3. While the data is being pumped into the master, kill the DS server on the master machine every 5 minutes, and then do a normal restart.

After 8 hours, check the changelog on the master machine. If the modification time of the changelog db file is not up to date, replication hung at some point; otherwise the test passes.

I will write a script to do this and post the result later.
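The pass criterion above (the changelog db file was modified recently) could be checked with a helper along these lines. This is a sketch, not the script used for the test: the function name is_fresh is hypothetical, the age threshold is up to the tester, and it assumes GNU stat, where -c %Y prints the modification time in seconds since the epoch.

```shell
# Hypothetical helper: succeed if FILE was modified within the last
# MAX_AGE seconds. Assumes GNU stat (-c %Y = mtime as epoch seconds).
is_fresh() {
    file=$1
    max_age=$2
    now=$(date +%s)
    mtime=$(stat -c %Y "$file") || return 2   # file missing or unreadable
    [ $((now - mtime)) -le "$max_age" ]
}

# Example use (the file path is a placeholder for the changelog db file):
#   is_fresh "$CHANGELOG_DB_FILE" 600 && echo "changelog still being written"
```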
The test is running now (of course, replication has to be set up first).

Script to kill/start the server every 5 minutes:

[root@mv32a-vm ~]# cat k.sh
c=0
while true
do
    echo "[`date`] cycle :[$c]"
    echo "[`date`] kill slapd"
    ps -elf | grep ns-slapd | cut -d" " -f7 | xargs kill -9
    echo "[`date`] start slapd"
    service dirsrv start
    ((c=c+1))
    echo "c=$c"
    echo "[`date`] sleep 5 minutes (300 sec)"
    sleep 300
done

ldclt command to inject data:

[root@mv32a-vm ~]# cat test.sh
ldclt -h localhost -p 389 \
    -D "cn=directory manager" -w redhat123 \
    -e add \
    -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" \
    -fcn=tuserXXXXXXXX \
    -r 1000 -R 99999999 \
    -e person -e incr -e noloop \
    -V -q -n 20 -W 10

I will post the result in the morning.
Test passed; bug closed. The improved scripts are below:

cat kill.sh
c=0
while true
do
    echo "[`date`] cycle :[$c]"
    echo "[`date`] kill slapd"
    ps -elf | grep ns-slapd | cut -d" " -f7 | xargs kill -9
    sleep 2
    echo "[`date`] start slapd"
    service dirsrv start
    ((c=c+1))
    echo "[`date`] sleep 1 minutes "
    sleep 60
done
-------------------------------------
cat test.sh
while true
do
    header=u.$RANDOM
    echo "header=[$header]"
    ldclt -h localhost -p 389 \
        -D "cn=directory manager" -w redhat123 \
        -e bindeach \
        -e add \
        -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" \
        -fcn=q${header}_XXXXX \
        -r 1 -R 99999 \
        -e person -e random \
        -E 10000000 -V -q -n 20 -W 2
done
-----------------------
cat count.sh
while true
do
    countMaster=`/usr/lib/mozldap/ldapsearch -p 389 -D "cn=directory manager" -w redhat123 -s sub -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" "cn=*" dn | grep "dn" | wc`
    countRepl=`/usr/lib/mozldap/ldapsearch -p 12537 -D "cn=directory manager" -w redhat123 -s sub -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" "cn=*" dn | grep "dn" | wc`
    echo "[`date`] ============================================"
    echo " count on master : $countMaster"
    echo " count on replica : $countRepl"
    sleep 30
done
------------------------
Finally, check /var/lib/dirsrv/slapd-mv32a-vm/changelogdb with

    ls -l --time=atime --full-time

and ensure the access time is up to date.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0455.html