Bug 888197

Summary: wdmd keeps reporting "test failed rem ..." after the sanlock service is force-killed and restarted
Product: Red Hat Enterprise Linux 6
Component: sanlock
Version: 6.4
Reporter: Luwen Su <lsu>
Assignee: David Teigland <teigland>
Status: CLOSED NOTABUG
Severity: high
Priority: high
CC: ajia, cluster-maint, dyuan, jdenemar, mprivozn, mzhan, rwu
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-01-02 16:22:13 UTC

Description Luwen Su 2012-12-18 09:23:22 UTC
Description of problem:
After force-killing the sanlock service and starting it again, the first thing that needs it (for example libvirt, configured to use sanlock) causes wdmd to keep logging "test failed rem ..." messages. On low-memory machines this leads to an automatic host reboot: in my tests, two 8G machines rebooted, while a 12G machine recovered after hanging for a long while.

Version-Release number of selected component (if applicable):
sanlock-2.6-2.el6.x86_64
libvirt-0.10.2-12.el6.x86_64
kernel-2.6.32-345.el6.x86_64

How reproducible:
100%


Steps to Reproduce:
1. Configure libvirt to use sanlock

# tail -5 /etc/libvirt/qemu-sanlock.conf
user = "sanlock"
group = "sanlock"
host_id = 1
auto_disk_leases = 1
disk_lease_dir = "/var/lib/libvirt/sanlock"

# tail -1 /etc/libvirt/qemu.conf 
lock_manager = "sanlock"

# getsebool -a | grep sanlock
sanlock_use_fusefs --> off
sanlock_use_nfs --> on
sanlock_use_samba --> off
virt_use_sanlock --> on


2. Kill and restart the sanlock service
# ps aux | grep sanlock
root      1740  0.0  0.0 103244   824 pts/0    S+   17:11   0:00 grep sanlock
root      1773  0.0  0.0  13548  3316 ?        SLs  Dec17   0:00 wdmd -G sanlock
sanlock   1795  0.0  0.3 342108 23068 ?        SLsl Dec17   0:16 sanlock daemon -U sanlock -G sanlock
root      1796  0.0  0.0  23076   288 ?        S    Dec17   0:00 sanlock daemon -U sanlock -G sanlock
root      1810  0.0  0.0  18832   396 ?        Ss   Dec17   0:00 fence_sanlockd -w

# kill -9 1795
# service sanlock start
# service libvirtd restart


3. Check the result

// Run any unrelated virsh command
# virsh nodeinfo
// It hangs there, and many log lines like the following appear:
wdmd[1819]: test failed rem 59 now 229186 ping 229175 close 229185 renewal 229106 expire 229186 client 1841 sanlock___LIBVIRT__DISKS__:1

Note that the log messages appear even without running a virsh command; the command is just a direct way to check for the hang. The repeated failures eventually lead to an automatic host reboot: in my tests, the two machines with 8G of memory rebooted, while the 12G machine recovered after hanging for a long while. (See the diagnostic sketch after these steps.)

4. If the virsh command still works, force-kill sanlock and restart it again.
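
Not part of the original report: a minimal diagnostic sketch for inspecting the daemon while reproducing, assuming the stock sanlock command-line client (the lockspace name comes from the logs above).

// Ask the daemon for its current state and held leases; if it was
// killed, this hangs or errors in the same way the virsh command does
# sanlock client status

// Dump the daemon's internal debug log
# sanlock client log_dump

// Watch wdmd activity in the system log while reproducing
# tail -f /var/log/messages | grep wdmd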

Actual results:
The virsh command hangs, and the host eventually reboots.


Expected results:
sanlock should come back cleanly after the restart, without wdmd rebooting the host.

Additional info:
A part of /var/log/message log:
wdmd[1819]: test warning now 229185 ping 229175 close 0 renewal 229106 expire 229186 client 1841 sanlock___LIBVIRT__DISKS__:1
Dec 17 10:36:20 localhost wdmd[1819]: /dev/watchdog closed unclean
Dec 17 10:36:20 localhost kernel: iTCO_wdt: Unexpected close, not stopping watchdog!
Dec 17 10:36:21 localhost wdmd[1819]: test failed rem 59 now 229186 ping 229175 close 229185 renewal 229106 expire 229186 client 1841 sanlock___LIBVIRT__DISKS__:1
Dec 17 10:36:22 localhost wdmd[1819]: test failed rem 58 now 229187 ping 229175 close 229185 renewal 229106 expire 229186 client 1841 sanlock___LIBVIRT__DISKS__:1
Dec 17 10:36:23 localhost wdmd[1819]: test failed rem 57 now 229188 ping 229175 close 229185 renewal 229106 expire 229186 client 1841 sanlock___LIBVIRT__DISKS__:1

........snip

Dec 17 10:37:14 localhost wdmd[1819]: test failed rem 6 now 229239 ping 229175 close 229185 renewal 229106 expire 229186 client 1841 sanlock___LIBVIRT__DISKS__:1
Dec 17 10:37:15 localhost wdmd[1819]: test failed rem 5 now 229240 ping 229175 close 229185 renewal 229106 expire 

------------------------The reboot happened around this point---------------

Dec 17 10:38:09 localhost kernel: imklog 5.8.10, log source = /proc/kmsg started.
Dec 17 10:38:09 localhost rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1666" x-info="http://www.rsyslog.com"] start
Dec 17 10:38:09 localhost kernel: Initializing cgroup subsys cpuset
Dec 17 10:38:09 localhost kernel: Initializing cgroup subsys cpu
Dec 17 10:38:09 localhost kernel: Linux version 2.6.32-345.el6.x86_64 (mockbuild.bos.redhat.com) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) ) #1 SMP Wed Nov 28 21:10:19 EST 2012

Comment 2 David Teigland 2012-12-18 15:09:22 UTC
libvirt is using sanlock when you restart it, so the reboot is expected.
If you don't want a reboot, then you need to cleanly shut down libvirt
before restarting sanlock.
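
Not in the original comment: a sketch of the shutdown order described above, assuming the standard RHEL 6 init scripts for both services.

// Stop libvirtd first so its sanlock leases are released
# service libvirtd stop
// With no lease holders left, sanlock can be restarted safely
# service sanlock restart
# service libvirtd start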

(Also, you should not be running fence_sanlockd, that is *only* for use with the fence_sanlock agent in the cluster product.)
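
Not in the original comment: assuming the standard fence_sanlockd init script shipped with the fence-sanlock package, the agent can be stopped and disabled with the usual tools:

# service fence_sanlockd stop
# chkconfig fence_sanlockd off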

Comment 3 David Teigland 2012-12-18 15:16:14 UTC
Sorry, after thinking about this again, comment 2 was probably not entirely correct. If a pid holding a lease (like libvirt) restarts, sanlock should simply release the leases; it should not cause a wdmd reboot. To get a wdmd reboot, access to the lease storage (the __LIBVIRT__DISKS__ file) must have been lost. If you include the sanlock errors/warnings from /var/log/messages or /var/log/sanlock.log, then I can probably explain what happened.

Comment 4 David Teigland 2012-12-18 15:25:13 UTC
Ah, I think I see now -- you appear to be doing kill -9 on the sanlock daemon while it's being used.  That correctly causes wdmd to reboot the machine.

Comment 5 Luwen Su 2012-12-19 10:13:09 UTC
(In reply to comment #4)
> Ah, I think I see now -- you appear to be doing kill -9 on the sanlock
> daemon while it's being used.  That correctly causes wdmd to reboot the
> machine.

Thanks for your reply, David.
Hmm, looks like this can be closed as NOTABUG, though it is not very pleasant for usage and testing.
Are there any plans to improve this? Some users like me just want to restart the sanlock service after changing the config files, but when the sanlock service is blocked by something else, force-killing and restarting it is a fast and simple method, even though I know it is not a recommended action :)

Comment 6 David Teigland 2012-12-19 15:10:09 UTC
To shut down sanlock without causing a wdmd reboot, you can run the following command: "sanlock client shutdown -f 1"
This will cause sanlock to kill any pids that are holding leases, release those leases, and then exit.
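
Not part of the original comment: a sketch of a full restart cycle built around that command; the service names assume the stock RHEL 6 init scripts.

// Tell sanlock to kill lease holders, release the leases, and exit
# sanlock client shutdown -f 1
// Then bring the services back up
# service sanlock start
# service libvirtd restart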

Comment 7 RHEL Program Management 2012-12-24 06:49:43 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.