Bug 450046
Summary: | changelog writes can hang due to leftover semaphore | |
---|---|---|---
Product: | Red Hat Directory Server | Reporter: | Ulf Weltman <ulf.weltman>
Component: | Replication - General | Assignee: | Nathan Kinder <nkinder>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Chandrasekar Kannan <ckannan>
Severity: | medium | Priority: | low
Version: | 8.0 | CC: | benl, nhosoi, rmeggins, yzhang
Hardware: | All | OS: | All
Fixed In Version: | 8.1 | Doc Type: | Bug Fix
Last Closed: | 2009-04-29 23:04:16 UTC | Bug Blocks: | 249650, 493682
Description
Ulf Weltman
2008-06-04 22:35:28 UTC
This was happening when I was testing a failover cluster, so I was killing ns-slapd a lot.

To reproduce, run ns-slapd in a debugger, put a breakpoint in _cl5WriteOperation() right after the PR_WaitSemaphore() call, do an LDAP update so that it enters the function, and then kill the process when it hits the breakpoint. Repeat three times. Now any changelog updates should hang.

Alternatively, create the semaphore with a separate small program using PR_OpenSemaphore() or sem_open() directly and initialize it with a value of 3. The semaphore name should be your changelog directory plus the replica name and an extension of .sema. On HP-UX:

/var/opt/dirsrv/slapd-example/changelogdb/47fe7102-1dd211b2-8068b513-45fb0000.sema

Created attachment 323522 [details]
CVS Diffs
This fix makes the changelog code attempt to create the semaphore with exclusive access. If this fails due to the semaphore being left around from a previous unclean exit of the server, we delete the semaphore and re-create it.
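The create-exclusively, delete, re-create sequence described above can be sketched with POSIX sem_open(). This is only an illustration under stated assumptions, not the code from cl5_api.c: the helper name open_fresh_semaphore() and its parameters are hypothetical, and POSIX semaphore names on Linux take the form "/name" rather than the full changelog path shown in the description.

```c
#include <errno.h>
#include <fcntl.h>
#include <semaphore.h>
#include <sys/stat.h>

/*
 * Hypothetical helper (not from cl5_api.c): return a named semaphore that
 * is guaranteed to have been freshly created with the given initial value.
 * Returns SEM_FAILED on error, like sem_open().
 */
sem_t *open_fresh_semaphore(const char *name, unsigned int value)
{
    /* Exclusive create: fails with EEXIST if a semaphore of this name exists. */
    sem_t *sem = sem_open(name, O_CREAT | O_EXCL, 0600, value);

    if (sem == SEM_FAILED && errno == EEXIST) {
        /* Assumed to be left over from an unclean exit: delete it ... */
        if (sem_unlink(name) != 0 && errno != ENOENT) {
            return SEM_FAILED;
        }
        /* ... and create it again, still exclusively. */
        sem = sem_open(name, O_CREAT | O_EXCL, 0600, value);
    }
    return sem;
}
```

A caller would invoke it once at startup, for example open_fresh_semaphore("/example.sema", 1), and treat SEM_FAILED as a fatal error. The sketch assumes only one server process owns the semaphore name; if several processes shared it, unlinking on EEXIST would not be safe.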
Checked into ldapserver (HEAD). Thanks to Noriko for her review.

Checking in ldap/servers/plugins/replication/cl5_api.c;
/cvs/dirsec/ldapserver/ldap/servers/plugins/replication/cl5_api.c,v  <--  cl5_api.c
new revision: 1.24; previous revision: 1.23
done

Test design: [based on discussion with Noriko]

1. Set up single-master replication (this means the changelog is enabled on the master).
2. On the master machine, use ldclt to pump data for 8 hours.
3. While the data is being pumped into the master machine, kill the DS server on the master machine every 5 minutes, and then do a normal restart.

After 8 hours, we can check the changelog on the master machine. If the modify time of the changelog db file is not up to date, it means replication hung at some point; otherwise, the test passes.

I will write a script to do it and post the result later.

Test is running now [of course, we have to set up replication first].

Script to kill/start the server every 5 minutes:

[root@mv32a-vm ~]# cat k.sh
c=0
while true
do
    echo "[`date`] cycle :[$c]"
    echo "[`date`] kill slapd"
    # Force-kill ns-slapd; field 7 of the ps line is taken to be the PID
    ps -elf | grep ns-slapd | cut -d" " -f7 | xargs kill -9
    echo "[`date`] start slapd"
    service dirsrv start
    ((c=c+1))
    echo "c=$c"
    echo "[`date`] sleep 5 minutes (300 sec)"
    sleep 300
done

ldclt command to inject data:

[root@mv32a-vm ~]# cat test.sh
ldclt -h localhost -p 389 \
    -D "cn=directory manager" -w redhat123 \
    -e add \
    -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" \
    -fcn=tuserXXXXXXXX \
    -r 1000 -R 99999999 \
    -e person -e incr -e noloop \
    -V -q -n 20 -W 10

=== I will post the result in the morning ====

Test passed.
Bug closed.

Improved script is below:

cat kill.sh
c=0
while true
do
    echo "[`date`] cycle :[$c]"
    echo "[`date`] kill slapd"
    # Force-kill ns-slapd; field 7 of the ps line is taken to be the PID
    ps -elf | grep ns-slapd | cut -d" " -f7 | xargs kill -9
    sleep 2
    echo "[`date`] start slapd"
    service dirsrv start
    ((c=c+1))
    echo "[`date`] sleep 1 minutes "
    sleep 60
done
-------------------------------------
cat test.sh
while true
do
    # Random prefix so each ldclt run adds entries with new names
    header=u.$RANDOM
    echo "header=[$header]"
    ldclt -h localhost -p 389 \
        -D "cn=directory manager" -w redhat123 \
        -e bindeach \
        -e add \
        -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" \
        -fcn=q${header}_XXXXX \
        -r 1 -R 99999 \
        -e person -e random \
        -E 10000000 -V -q -n 20 -W 2
done
-----------------------
cat count.sh
while true
do
    # Count entries on the master (port 389) and on the replica (port 12537)
    countMaster=`/usr/lib/mozldap/ldapsearch -p 389 -D "cn=directory manager" -w redhat123 -s sub -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" "cn=*" dn | grep "dn" | wc`
    countRepl=`/usr/lib/mozldap/ldapsearch -p 12537 -D "cn=directory manager" -w redhat123 -s sub -b "ou=people,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com" "cn=*" dn | grep "dn" | wc`
    echo "[`date`] ============================================"
    echo " count on master : $countMaster"
    echo " count on replica : $countRepl"
    sleep 30
done
------------------------

And finally, check /var/lib/dirsrv/slapd-mv32a-vm/changelogdb with

ls -l --time=atime --full-time

and ensure the access time is up to date.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0455.html