Description of problem:
After force-killing the sanlock service and starting it again, configuring or restarting something that needs to use it (such as libvirt) causes wdmd to keep logging "test failed rem ..." messages. This leads the host to reboot automatically on low-memory boxes; in my testing, two 8 GB machines rebooted, while a 12 GB machine recovered after hanging for a long while.
Version-Release number of selected component (if applicable):
sanlock-2.6-2.el6.x86_64
libvirt-0.10.2-12.el6.x86_64
kernel-2.6.32-345.el6.x86_64
How reproducible:
100%
Steps to Reproduce:
1. Configure libvirt to use sanlock (a quick check of the resulting lease file is sketched after the config output below)
#tail -5 /etc/libvirt/qemu-sanlock.conf
user = "sanlock"
group = "sanlock"
host_id = 1
auto_disk_leases = 1
disk_lease_dir = "/var/lib/libvirt/sanlock"
# tail -1 /etc/libvirt/qemu.conf
lock_manager = "sanlock"
# getsebool -a | grep sanlock
sanlock_use_fusefs --> off
sanlock_use_nfs --> on
sanlock_use_samba --> off
virt_use_sanlock --> on
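As a quick check after restarting libvirtd, the lease directory configured above should contain the lease file that libvirt's sanlock plugin creates when auto_disk_leases = 1. This is only a sketch; the file name __LIBVIRT__DISKS__ is an assumption based on the resource name seen in the wdmd logs below.
# service libvirtd restart
# ls -l /var/lib/libvirt/sanlock/
// expect a __LIBVIRT__DISKS__ lease file to be present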
2. Kill the sanlock service and start it again
# ps aux | grep sanlock
root 1740 0.0 0.0 103244 824 pts/0 S+ 17:11 0:00 grep sanlock
root 1773 0.0 0.0 13548 3316 ? SLs Dec17 0:00 wdmd -G sanlock
sanlock 1795 0.0 0.3 342108 23068 ? SLsl Dec17 0:16 sanlock daemon -U sanlock -G sanlock
root 1796 0.0 0.0 23076 288 ? S Dec17 0:00 sanlock daemon -U sanlock -G sanlock
root 1810 0.0 0.0 18832 396 ? Ss Dec17 0:00 fence_sanlockd -w
#kill -9 1795
#service sanlock start
#service libvirtd restart
3. Check it
// use an unrelated virsh command
#virsh nodeinfo
// it hangs here, and many log messages like the one below appear
wdmd[1819]: test failed rem 59 now 229186 ping 229175 close 229185 renewal 229106 expire 229186 client 1841 sanlock___LIBVIRT__DISKS__:1
BTW, the log messages appear even without running a virsh command; the command is just a direct way to check whether the host hangs. The repeated requests eventually make the host reboot automatically. In my testing, two machines with 8 GB of memory hit this situation; another with 12 GB recovered after hanging for a long while.
4. If the virsh command works fine, just force-kill sanlock and restart it again.
Actual results:
virsh hangs and then the host reboots automatically
Expected results:
Should work well
Additional info:
A part of /var/log/message log:
wdmd[1819]: test warning now 229185 ping 229175 close 0 renewal 229106 expire 229186 client 1841 sanlock___LIBVIRT__DISKS__:1
Dec 17 10:36:20 localhost wdmd[1819]: /dev/watchdog closed unclean
Dec 17 10:36:20 localhost kernel: iTCO_wdt: Unexpected close, not stopping watchdog!
Dec 17 10:36:21 localhost wdmd[1819]: test failed rem 59 now 229186 ping 229175 close 229185 renewal 229106 expire 229186 client 1841 sanlock___LIBVIRT__DISKS__:1
Dec 17 10:36:22 localhost wdmd[1819]: test failed rem 58 now 229187 ping 229175 close 229185 renewal 229106 expire 229186 client 1841 sanlock___LIBVIRT__DISKS__:1
Dec 17 10:36:23 localhost wdmd[1819]: test failed rem 57 now 229188 ping 229175 close 229185 renewal 229106 expire 229186 client 1841 sanlock___LIBVIRT__DISKS__:1
........snip
Dec 17 10:37:14 localhost wdmd[1819]: test failed rem 6 now 229239 ping 229175 close 229185 renewal 229106 expire 229186 client 1841 sanlock___LIBVIRT__DISKS__:1
Dec 17 10:37:15 localhost wdmd[1819]: test failed rem 5 now 229240 ping 229175 close 229185 renewal 229106 expire
------------------------ The reboot happens around this point ---------------
Dec 17 10:38:09 localhost kernel: imklog 5.8.10, log source = /proc/kmsg started.
Dec 17 10:38:09 localhost rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1666" x-info="http://www.rsyslog.com"] start
Dec 17 10:38:09 localhost kernel: Initializing cgroup subsys cpuset
Dec 17 10:38:09 localhost kernel: Initializing cgroup subsys cpu
Dec 17 10:38:09 localhost kernel: Linux version 2.6.32-345.el6.x86_64 (mockbuild.bos.redhat.com) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) ) #1 SMP Wed Nov 28 21:10:19 EST 2012
libvirt is using sanlock when you restart it, so the reboot is expected.
If you don't want a reboot, then you need to cleanly shut down libvirt
before restarting sanlock.
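A minimal sketch of that ordering, using the service names already shown above (this assumes the stock RHEL 6 init scripts and that no guests need to keep running while sanlock is down):
# service libvirtd stop
# service sanlock restart
# service libvirtd start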
(Also, you should not be running fence_sanlockd, that is *only* for use with the fence_sanlock agent in the cluster product.)
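If fence_sanlockd is not needed (i.e. the fence_sanlock agent is not in use), it can be switched off with the standard RHEL 6 service tools. This is a sketch; it assumes fence_sanlockd was started from an init script of the same name:
# service fence_sanlockd stop
# chkconfig fence_sanlockd off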
Sorry, after thinking about this again, comment 2 was probably not entirely correct. If a pid holding a lease (like libvirt) restarts, sanlock should simply release the leases; it should not cause a wdmd reboot. To get a wdmd reboot, access to the lease storage (the __LIBVIRT__DISKS__ file) must have been lost. If you include the sanlock errors/warnings from /var/log/messages or /var/log/sanlock.log, I can probably explain what happened.
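A short sketch of how to collect that information (the sanlock client subcommands below are standard in sanlock 2.x; the log paths are the defaults mentioned above):
# grep -E 'sanlock|wdmd' /var/log/messages
# sanlock client status
# sanlock client log_dump
# cat /var/log/sanlock.log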
(In reply to comment #4)
> Ah, I think I see now -- you appear to be doing kill -9 on the sanlock
> daemon while it's being used. That correctly causes wdmd to reboot the
> machine.
Thanks for your reply, David.
Hmm... looks like this can be closed as not a bug, though it's not very convenient for use and testing...
Are there any plans to improve this? Some users like me just want to restart the sanlock service after changing the config files, and when the sanlock service is held up by something else, force-killing and restarting it is a fast and simple method, though I know it's not a recommended action :)
To shut down sanlock without causing a wdmd reboot, you can run the following command: "sanlock client shutdown -f 1"
This will cause sanlock to kill any pids that are holding leases, release those leases, and then exit.
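For example, a restart after editing the config files could then look like this (a sketch, assuming the init scripts shown earlier in this bug and that any leases held by libvirt may be dropped):
# sanlock client shutdown -f 1
# service sanlock start
# service libvirtd restart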
Comment 7 - RHEL Program Management - 2012-12-24 06:49:43 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.