Bug 878119

Summary: wdmd/sanlock incompatibility with modern F17 kernel(s)
Product: [Fedora] Fedora
Component: sanlock
Version: 18
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: jrd <jrd>
Assignee: Federico Simoncelli <fsimonce>
QA Contact: Haim <hateya>
CC: abaron, acathrow, bazulay, cfeist, dyasny, fsimonce, iheim, mgoldboi, teigland, yeylon, ykaul
Whiteboard: storage
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-03-15 00:09:53 UTC

Description jrd 2012-11-19 17:09:07 UTC
Description of problem:

[This is probably not really a vdsm bug, but I couldn't identify a better category]

I was trying to test something else, using oVirt installed on a collection of F17 machines. I discovered that on kernel versions newer than 3.3.4-5, wdmd fails to start, which causes sanlock to fail, which in turn leaves vdsm unable to act as a storage controller.

On a newer kernel, such as 3.6.6-1, attempting to start wdmd causes the following to appear in /var/log/messages:

Nov 19 10:46:36 f17z systemd-wdmd[9900]: Starting wdmd: [  OK  ]
Nov 19 10:46:36 f17z wdmd[9921]: could not set RR|RESET_ON_FORK priority 99 err 1
Nov 19 10:46:36 f17z wdmd[9921]: /dev/watchdog failed to set timeout
Nov 19 10:46:36 f17z wdmd[9921]: /dev/watchdog disarmed

After that wdmd is not running, causing sanlock to fail.  

Running the same node with kernel 3.3.4-5 allows wdmd to start, and everything else works.

The same appears to be true with all the 3.6 series kernels, but I cannot claim to have tested them exhaustively.

Version-Release number of selected component (if applicable):

[root@f17z ~]# rpm -qa kernel sanlock vdsm
vdsm-4.10.0-10.fc17.x86_64
kernel-3.3.4-5.fc17.x86_64
kernel-3.6.6-1.fc17.x86_64
sanlock-2.4-2.fc17.x86_64


How reproducible:

Always

Steps to Reproduce:
1.  Install vdsm on a vanilla F17 machine
2.  Configure a storage domain with that machine as controller (local or remote)
  
Actual results:

wdmd/sanlock won't start on 3.6+ kernels

Expected results:

wdmd/sanlock should start :-)

Additional info:

Comment 1 Federico Simoncelli 2012-11-20 12:30:47 UTC
We have had several issues caused by SELinux, so my first suggestion is to update the selinux-policy package to the latest version (3.10.0-160.fc17):

https://koji.fedoraproject.org/koji/buildinfo?buildID=366128

You can also try temporarily disabling SELinux; if wdmd then starts, it means that you're hitting something new in SELinux.

The error:

/dev/watchdog failed to set timeout

might also be related to your watchdog driver (I found that some laptops have a watchdog that seems to reject the timeout configuration).

Please report:

# ls -l /dev/watchdog*

and try to find out which watchdog driver is loaded.
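
One way to look for the loaded watchdog driver is through sysfs. The following is a minimal sketch, not something taken from wdmd or sanlock: it assumes the kernel exposes watchdog devices under /sys/class/watchdog and that the driver registered a parent device, neither of which is guaranteed on every kernel or driver. The function name list_watchdog_drivers is made up for illustration.

```shell
# Print each watchdog device known to the kernel's watchdog core together
# with the driver bound to it. The sysfs directory is a parameter so the
# function can be pointed at a scratch tree; /sys/class/watchdog is the
# assumed default location.
list_watchdog_drivers() {
    sysfs_root="${1:-/sys/class/watchdog}"
    for dev in "$sysfs_root"/watchdog*; do
        [ -e "$dev" ] || continue      # the glob matched nothing
        name=$(basename "$dev")
        # device/driver is a symlink to the bound driver, when one exists
        if [ -L "$dev/device/driver" ]; then
            driver=$(basename "$(readlink "$dev/device/driver")")
        else
            driver="unknown"
        fi
        printf '%s %s\n' "$name" "$driver"
    done
}
```

Calling list_watchdog_drivers with no arguments reads the default sysfs path. A device reported as "unknown" has no driver link in sysfs, in which case the lsmod output is the next place to look.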

Comment 2 jrd 2012-11-20 13:30:55 UTC
I have two machines on which the problem manifests: one is a Dell OptiPlex 755, the other is an Intel Piketon SDP. Both work with kernel 3.3.4-5 and fail with kernel 3.6.6-1. Both machines have SELinux set to permissive mode.

[root@f17z ~]# rpm -qa selinux-policy
selinux-policy-3.10.0-159.fc17.noarch

yum upgrade selinux-policy yields no updates; perhaps I need to subscribe to a newer channel?

But remember that with no changes to selinux-policy, it works on the 3.3 kernel and fails on 3.6. Do you suspect that SELinux itself is behaving differently w/r/t /dev/watchdog in the newer kernel?

On the Piketon box, running 3.6:

[root@f17z ~]# ls -l /dev/watchdog*
crw-------. 1 root root  10, 130 Nov 20 08:23 /dev/watchdog
crw-------. 1 root root 253,   0 Nov 20 08:23 /dev/watchdog0
crw-------. 1 root root 253,   1 Nov 20 08:23 /dev/watchdog1

On the same box, running 3.3:

[root@f17z ~]# ls -l /dev/watchdog*
crw-------. 1 root root 10, 130 Nov 20 08:25 /dev/watchdog

So something different is happening w/r/t watchdog device initialization.

I poked around in the lsmod output, but nothing about a watchdog driver jumped out at me. What are good candidates for me to look for?

Comment 3 Federico Simoncelli 2012-11-20 14:32:21 UTC
After further debugging we discovered that the culprit is the iTCO_wdt module:

 iTCO_wdt               17948  0 
 iTCO_vendor_support    13419  1 iTCO_wdt

It exposes two watchdog devices (one of which is unusable):

[root@f17z ~]# ls -l /dev/watchdog*
crw-------. 1 root root  10, 130 Nov 20 08:23 /dev/watchdog
crw-------. 1 root root 253,   0 Nov 20 08:23 /dev/watchdog0
crw-------. 1 root root 253,   1 Nov 20 08:23 /dev/watchdog1

wdmd is in fact able to use /dev/watchdog1.

One workaround (while we wait for iTCO_wdt to be fixed) is to blacklist iTCO_wdt/iTCO_vendor_support and use the softdog module.
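
The blacklist workaround could be set up roughly as follows. This is a sketch under assumptions: the file names blacklist-iTCO.conf and softdog.conf and the function name write_softdog_workaround are made up for illustration, and it assumes the system reads /etc/modprobe.d and /etc/modules-load.d at boot (the latter is handled by systemd).

```shell
# Write the modprobe blacklist and the module-load request for the
# workaround. Both directories are parameters so the function can be tried
# against a scratch directory before touching /etc.
write_softdog_workaround() {
    modprobe_dir="${1:-/etc/modprobe.d}"
    modules_load_dir="${2:-/etc/modules-load.d}"

    # Keep the iTCO modules from being auto-loaded at boot.
    printf 'blacklist iTCO_wdt\nblacklist iTCO_vendor_support\n' \
        > "$modprobe_dir/blacklist-iTCO.conf"

    # Ask for softdog to be loaded at boot instead.
    printf 'softdog\n' > "$modules_load_dir/softdog.conf"
}
```

Run as root with no arguments, this writes under /etc; a reboot is then the simplest way to make sure the iTCO modules are gone and softdog owns /dev/watchdog before wdmd starts.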

I will also go ahead and add an option to sanlock to select the preferred watchdog device (so that it will eventually be possible to select /dev/watchdog1).

Comment 4 Federico Simoncelli 2013-02-21 10:40:19 UTC
This has been fixed in sanlock-2.6-7.fc18:

* Sun Jan 13 2013 Federico Simoncelli <fsimonce> 2.6-6
- wdmd: dynamically select working watchdog device

Comment 5 Fedora Update System 2013-02-21 10:45:09 UTC
sanlock-2.6-7.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/sanlock-2.6-7.fc18

Comment 6 Fedora Update System 2013-02-23 00:56:37 UTC
Package sanlock-2.6-7.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing sanlock-2.6-7.fc18'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-2857/sanlock-2.6-7.fc18
then log in and leave karma (feedback).

Comment 8 Fedora Update System 2013-03-15 00:09:56 UTC
sanlock-2.6-7.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.