Original upstream c/s: http://xenbits.xensource.com/xen-unstable.hg/rev/21529
Discussion: http://lists.xensource.com/archives/html/xen-devel/2010-06/msg00077.html

Identify/backport further changes. Test with Fedora 16 guest.
Changeset 21809:1f7c2418e58c is an unrelated commit that also changed whitespace in "common/keyhandler.c" (found with "hg blame"). Searching the commit log for "watchdog" from the current tip (23900:3d1664cc9e45) backwards returned nothing interesting. So I think c/s 21529 minus the userspace hunks should suffice.
(In reply to comment #0)
> Test with Fedora 16 guest.

That won't be trivial, see bug 740657 and bug 742896.
Created attachment 526066 [details]
Watchdog timers for domains

Each domain is allowed to set, reset and disable its timers; when any timer runs out, the domain is killed.

Backport c/s 21529. Also backport a minuscule whitespace change from c/s 21809.

SCHEDOP_watchdog is 6; we don't have SCHEDOP_shutdown_code (5).

In domain_create(), upstream introduced a new INIT_watchdog state (for cleanup), but they also have INIT_rangeset and INIT_xsm before INIT_evtchn (the first we have). We don't need those because watchdog_domain_init() always succeeds, just like rangeset_domain_initialise().

In domain_create() and complete_domain_destroy(), the order of destructor calls is a mess.

Upstream:

  domain_create():
    1. arch_domain_destroy()
    2. rangeset_domain_destroy()
    3. watchdog_domain_destroy()
  This is the inverse of the constructors' order, so OK.

  complete_domain_destroy():
    1. arch_domain_destroy()
    2. watchdog_domain_destroy()
    3. rangeset_domain_destroy()
  This suggests that watchdog_domain_destroy() and rangeset_domain_destroy() can be reordered.

RHEL-5:

  domain_create():
    1. arch_domain_destroy()
    2. rangeset_domain_destroy()
    3. watchdog_domain_destroy()
  This matches upstream and is the reverse of the construction order, so OK.

  complete_domain_destroy():
    1. watchdog_domain_destroy()
    2. rangeset_domain_destroy()
    3. arch_domain_destroy()
  Here the order of rangeset_domain_destroy() and arch_domain_destroy() is the reverse of what I'd like.

Anyway, I'm adding watchdog_domain_destroy() before rangeset_domain_destroy(), like in upstream complete_domain_destroy().
(In reply to comment #3)
> In domain_create(), upstream introduced a new INIT_watchdog state (for
> cleanup), but they also have INIT_rangeset and INIT_xsm before INIT_evtchn
> (the first we have). We don't need those because watchdog_domain_init()
> always succeeds, just like rangeset_domain_initialise().

... more precisely, there's nothing before them that could fail (and still stay inside the function).
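For readers unfamiliar with the INIT_* cleanup idiom discussed above, here is a minimal, self-contained sketch of the pattern. It is not the domain_create() code: the flag names, create_model() and its fail_at parameter are made up for illustration. It only shows why a stage that cannot fail, and has nothing fallible before it, needs no flag of its own, while fallible stages are torn down in reverse order.

```c
#include <stddef.h>

/* Hypothetical stage flags, standing in for INIT_evtchn, INIT_watchdog, etc. */
enum { INIT_first = 1u << 0, INIT_second = 1u << 1 };

/*
 * Model of a constructor with staged cleanup.
 * fail_at: 0 = no failure, 1 = first stage fails,
 *          2 = second stage fails, 3 = a later step fails.
 * Each destructor that runs appends one character to 'order'.
 * Returns the number of destructors that ran.
 */
static size_t create_model(int fail_at, char *order, size_t cap)
{
    unsigned int init_status = 0;
    size_t n = 0;

    if (fail_at == 1)
        goto fail;
    init_status |= INIT_first;

    if (fail_at == 2)
        goto fail;
    init_status |= INIT_second;

    if (fail_at == 3)
        goto fail;
    return 0;

fail:
    /* Tear down in reverse order of construction, and only what was built. */
    if ((init_status & INIT_second) && n < cap)
        order[n++] = '2';
    if ((init_status & INIT_first) && n < cap)
        order[n++] = '1';
    return n;
}
```

With fail_at == 3 the model runs the second destructor, then the first; with fail_at == 1 it runs none.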
Looks like there is "interesting" behavior (instant crash) when passing (id=0,timeout=0). Should probably return EINVAL in that case.
(In reply to comment #5)
> Looks like there is "interesting" behavior (instant crash) when passing
> (id=0,timeout=0). Should probably return EINVAL in that case.

Do you mean that the requesting domU will immediately time out and be shut down? (That is, not crash.) I think whoever can issue hypercalls in the guest should be free to do that. (Otherwise I'd have to bring this change to upstream, and I'm not sure it's worth our collective time.) Likewise, I believe a 1-second watchdog would not necessarily receive a kick from the domU before it expires.
Yes, it's just that timeout==0 is documented as removing a watchdog. I agree anyway, don't worry much about it.
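To make the (id, timeout) cases in this exchange concrete, here is a small sketch of how the arguments select an operation, as I read the interface comment in the upstream public header. The enum and wd_classify() are hypothetical names; this models the argument rules only and is not the hypervisor code.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum wd_action { WD_CREATE, WD_KICK, WD_DESTROY };

/* Decide what a SCHEDOP_watchdog call asks for, based on its arguments. */
static enum wd_action wd_classify(uint32_t id, uint32_t timeout)
{
    if (id == 0)
        return WD_CREATE;   /* with timeout == 0 the new timer expires at once */
    return timeout ? WD_KICK : WD_DESTROY;
}
```

Note that (id=0, timeout=0) falls under "create", which is why it results in an immediate timeout rather than a removal.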
"With id != 0 and timeout == 0, destroy domain watchdog timer"
After installing a Fedora 15 guest and upgrading it to Fedora 16 (kernel: 3.1.0-0.rc8.git0.1.fc16.x86_64), I tried to test the watchdog with the watchdog(8) daemon. Unfortunately that daemon seems to be completely dead: it doesn't do anything at all (it logs nothing after the startup messages, even though VERBOSE=yes is set in "/etc/sysconfig/watchdog") and doesn't even keep /dev/watchdog open. I experimented with "interval", "test-binary", "test-output" and "realtime" in "/etc/watchdog.conf"; nothing worked. (I had to look up the systemctl (systemd) commands to work with watchdog(8): the service name is "watchdog.service".)

So instead of watchdog(8), I used plain "cat >/dev/watchdog" after "modprobe xen-wdt timeout=15", and based my testing on <http://embeddedfreak.wordpress.com/2010/08/23/howto-use-linux-watchdog/>:

- press Enter -> means a kick
- press ^D -> closes the chardev unexpectedly (guest syslog complains too); the watchdog is kicked one final time, but it is kept running
- input "V" (+Enter), then ^D -> clean closure, watchdog is stopped

If the watchdog times out (by not kicking it after starting "cat", or by closing it without "V", or by additionally passing "nowayout=1" to the module at load time and closing the device in any way; I tested all of these), then the domain is forced off. (I.e. no controlled shutdown; I was wrong in comment 6 in that regard.) The hypervisor also logs

(XEN) Watchdog timer fired for domain 17

Removing the module stops the watchdog, except when nowayout=1 was passed to the module (tested both cases).

I started to test the ioctl() interface with "Documentation/watchdog/src/watchdog-test.c". I think there's a cosmetic bug in the xen_wdt driver (i.e. not in the hypervisor part): when the test tool is invoked as "./watchdog-test -d", it exercises WDIOC_SETOPTIONS/WDIOS_DISABLECARD, prints "Watchdog card disabled.", and then *closes* /dev/watchdog. That ioctl stops the watchdog all right via the hypercall and zeroes out the guest's "wdt", but in response to the closure (since "expect_release" is not set) the xen_wdt_release() function prints a critical message to the console ("unexpected close, not stopping watchdog!") and tries to kick it again. (At this time the guest's wdt.id is 0, and so xen_wdt_kick() returns -ENXIO, but xen_wdt_release() ignores it.) The watchdog is stopped for real and the domain is not killed, but the KERN_CRIT message is misleading.

Additionally, the WDIOC_SETOPTIONS/WDIOS_DISABLECARD ioctl manages to turn off the watchdog, even if nowayout=1 was passed.

Anyhow, I think the hypervisor code is good enough. I also tested "xm debug-key q" while the watchdog was ticking; it printed

(XEN) General information for domain 22:
[...]
(XEN) watchdog 0 expires in 13 seconds
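The cosmetic issue on the close path can be restated as a tiny userspace model. This is only a sketch of the behavior described in this comment, not the xen_wdt driver source: struct wd_model, model_kick() and model_release() are hypothetical names, and the real driver's fields and locking are left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the guest-side watchdog state. */
struct wd_model {
    unsigned int id;        /* 0 means "no watchdog set up" */
    bool expect_release;    /* set when the magic 'V' was written */
};

/* Kicking a watchdog that does not exist fails with -ENXIO. */
static int model_kick(const struct wd_model *w)
{
    return w->id ? 0 : -ENXIO;
}

/*
 * Close path. Returns true if the "unexpected close" message would be
 * logged. The kick result is deliberately ignored, as described above.
 */
static bool model_release(struct wd_model *w)
{
    if (w->expect_release) {
        w->id = 0;                  /* clean close: watchdog stopped */
        w->expect_release = false;
        return false;
    }
    (void)model_kick(w);
    return true;
}
```

In this model, a state that was already disabled (id == 0, expect_release unset) still reports the "unexpected close" message on release, which matches the misleading KERN_CRIT line seen with "./watchdog-test -d".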
Created attachment 526487 [details] Watchdog timers for domains (v2) v1->v2: Since RHEL-5 xend and libvirt don't know about SHUTDOWN_watchdog, mask it as SHUTDOWN_crash for domUs. For dom0, keep SHUTDOWN_watchdog, as the hypervisor reboots the machine in that case.
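The v2 masking rule can be sketched as follows. The numeric values are the ones I know from the upstream public header and are assumed to be unchanged in the backport; watchdog_shutdown_reason() is a hypothetical helper, not a function from the patch.

```c
/* Values as in upstream xen/include/public/sched.h (assumed here). */
#define SHUTDOWN_crash    3
#define SHUTDOWN_watchdog 4

/* Hypothetical helper: reason code reported when a domain's watchdog fires. */
static int watchdog_shutdown_reason(unsigned int domain_id)
{
    /*
     * dom0 keeps SHUTDOWN_watchdog so that the hypervisor reboots the
     * machine; domUs are reported as crashed, because RHEL-5 xend and
     * libvirt do not know about SHUTDOWN_watchdog.
     */
    return domain_id == 0 ? SHUTDOWN_watchdog : SHUTDOWN_crash;
}
```

A side effect of this choice is that the guest's on_crash setting decides what happens after a watchdog timeout.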
"domain pause vs. watchdog timer" http://lists.xensource.com/archives/html/xen-devel/2011-10/msg00943.html
"return -EINVAL when trying to kick/kill a nonexistent domain watchdog" http://lists.xensource.com/archives/html/xen-devel/2011-10/msg01033.html
Patch(es) available in kernel-2.6.18-296.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
Back to POST, as we're going to take in one small follow-up fix to correct a compile warning on ia64 introduced by the initial patch.
Patch(es) available in kernel-2.6.18-298.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
Hi Laszlo, According to comment 9 and comment 10, you have tested the feature with F16 guest, but I'm still not clear how to test it. Could you give us some instructions for testing this feature properly?
First, add on_crash="restart" to the guest config. Then, in the guest do the following:

  modprobe xen-wdt timeout=15
  cat >/dev/watchdog

Then:

* press Enter once every few seconds for 20-30 seconds. The guest will keep running.
* type "V"+Enter, then ^D. The guest will keep running.
* after you have checked that the guest has kept running (again, 20-30 seconds are enough), reboot.
* redo modprobe + cat.
* press ^D. You have 15 seconds to check syslog for messages, then the guest will reboot. "xm dmesg" will also report the crash on the host.

After reboot:

  modprobe xen-wdt timeout=15 nowayout=1
  cat >/dev/watchdog

* press Enter once every few seconds. The guest will keep running.
* type "V"+Enter, then ^D. The guest will reboot after 15 seconds. "xm dmesg" will also report the crash on the host.

Now turn off the guest, and repeat with on_crash="destroy" and on_crash="preserve".
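If typing into "cat" is inconvenient, the same keystrokes can be driven from a small program. The following is a minimal sketch that relies only on the write-based behavior described in this bug (any write is a kick, "V" before close requests a clean stop). wd_kick() and wd_close() are hypothetical helper names, and the caller is assumed to have opened /dev/watchdog for writing.

```c
#include <unistd.h>

/* Kick the watchdog: same effect as pressing Enter in "cat >/dev/watchdog". */
static int wd_kick(int fd)
{
    return write(fd, "\n", 1) == 1 ? 0 : -1;
}

/*
 * Close the device. With clean != 0, write the magic 'V' first, which
 * corresponds to typing "V"+Enter and then ^D; with clean == 0 this is
 * the plain ^D case (unexpected close).
 */
static int wd_close(int fd, int clean)
{
    if (clean && write(fd, "V", 1) != 1) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A test loop would open the device, call wd_kick() every few seconds for 20-30 seconds, and finish with wd_close(fd, 1) or wd_close(fd, 0) depending on the case being exercised.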
Verified with kernel-xen-2.6.18-300.el5.

Tested with Fedora 16 x86_64 (both PV and HVM); all the behaviors of on_crash = "preserve|restart|destroy" work well.

For the on_crash = "restart" case:

(1) load xen-wdt with nowayout=0 (default)

  # modprobe xen-wdt timeout=15
  # cat >/dev/watchdog

  [1] press Enter once every few seconds, guest kept running.
  [2] don't type anything, guest rebooted.
  [3] type "V" + Enter then "^D", guest kept running.
  [4] type "^D", guest complained "wdt: unexpected close, not stopping watchdog!" and rebooted.

(2) load xen-wdt with nowayout=1

  # modprobe xen-wdt timeout=15 nowayout=1
  # cat >/dev/watchdog

  [1] press Enter once every few seconds, guest kept running.
  [2] don't type anything, guest rebooted.
  [3] type "V" + Enter then "^D", guest complained "wdt: unexpected close, not stopping watchdog!" and rebooted.
  [4] type "^D", guest complained "wdt: unexpected close, not stopping watchdog!" and rebooted.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-0150.html