Problem Statement:

The current implementation of watchdogd is not sufficient for environments where a predictable and rapid system halt is required. This is because watchdogd attempts an orderly shutdown via its do_shutdown method, which does things like sending email to administrators, killing core processes, and unmounting filesystems. At each step in the process, do_shutdown pets the watchdog device, extending the timeout. In addition, watchdogd calls a variable number of user binaries/scripts, resetting the watchdog timer after each one. This means the current watchdogd is non-deterministic in how long it takes to shut the system down; the maximum time is a multiple of the watchdog device timeout (default 1 minute), so predictable node death is on the order of 5+ minutes. Changing watchdogd to not pet the watchdog between check-script invocations and to not call do_shutdown would be a significant change to how that daemon operates, and we don't want to disrupt users who are happy with its current behavior.

Solution:

For the RHEL HA and sanlock (RHEV storage) use cases, we need a daemon that pets the watchdog device and checks status from external scripts, but on script failure does not attempt an orderly shutdown (do_shutdown). Instead, this daemon would immediately stop petting the watchdog and attempt to call 'poweroff --force'. The maximum time to system shutdown would then be the watchdog device timeout value, and other systems could be built around knowledge of this maximum. This daemon should also provide an API so that additional outside daemons can register with it and pet it in turn, providing a secondary interface (besides external scripts/binaries) for reporting health status. This use case would satisfy the libSAM component of Corosync in RHEL HA.

Requirements:

* simple/small C daemon that pets the watchdog at user-specified intervals
* provides a C API for external processes to register with it, so that these external processes can pet this new daemon
* executes a set of scripts/binaries in a well-defined location (/etc/blah.d) to report the health status of the system
* a failure of any externally registered process, or of any check script, causes the daemon to stop petting the watchdog and issue 'poweroff --force'
* might want a configurable 'number of failures before system halt' option, in case some use cases call for a double check before killing the system

The intent is not to initially replace the watchdogd package for general usage; this new daemon is meant for the targeted use cases above, and general support could eventually be added if desired. It is also not intended for watchdogd and this daemon to run in parallel, as only one process can own the watchdog device at a time. The name of this package is TBD and $subject will be modified when a suitable name is found. This bz is also a blocker for bug # 509056.
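A minimal sketch of the proposed loop, assuming a hypothetical check runner and a 10-second pet interval (the script path reuses the /etc/blah.d placeholder from above; the real paths, interval, and client registration API are TBD):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical health check: the real daemon would run every script in
     * the (placeholder) /etc/blah.d directory and also poll processes
     * registered through the C API; a single illustrative script stands in
     * for both here. Returns nonzero on any failure. */
    static int run_checks(void)
    {
        return system("/etc/blah.d/check-health");  /* hypothetical script name */
    }

    int main(void)
    {
        int wd = open("/dev/watchdog", O_WRONLY);   /* standard wd device node */
        if (wd < 0)
            return 1;

        for (;;) {
            if (run_checks() != 0) {
                /* On failure: stop petting and force the fastest halt. If
                 * poweroff hangs, the wd device fires at most one device
                 * timeout (default 60s) after the last pet below. */
                execlp("poweroff", "poweroff", "--force", (char *)NULL);
                return 1;           /* reached only if execlp fails */
            }
            write(wd, "\0", 1);     /* pet the watchdog */
            sleep(10);              /* hypothetical user-specified interval */
        }
    }

The 'number of failures before system halt' option above would simply count consecutive nonzero results before taking the failure branch.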
The following are notes from an email discussion with dteigland re: watchdogd behavior and why it is not ideal for sanlock or RHEL HA usage:

> Every second the watchdog daemon does
>
>   result = test();
>   if (result)
>       do_shutdown();  /* its own version of "shutdown now" */
>   keep_alive();       /* pet wd device */
>
> In our case, test() calls "sanlock wdtest" which does:
>   if our host_id renewal was less than 30 seconds ago, return 0, else return 1
>
> Notably, the test result does *not* influence the petting of the wd device
> (keep_alive). What a non-zero test result does cause is the wd daemon to
> immediately begin shutting down the machine. While doing the shutdown,
> the wd daemon tries its best to keep the wd device alive by continuing to
> pet it. If the shutdown process hangs at some point, then the wd device
> will come to its rescue and do the reset. Specifically, do_shutdown()
> roughly does:
>
>   send email if configured
>   syslog("shutting down because of error");
>   keep_alive();
>   sleep(10);  /* make sure log is written and mail is sent */
>   keep_alive();
>   kill(1, SIGTSTP);
>   kill(-1, SIGSTOP);
>   go through each pid, calling kill(pid, SIGTERM);
>   kill(-1, SIGCONT);
>   sleep(5);
>   kill(-1, SIGSTOP);
>   go through each pid, calling kill(pid, SIGKILL);
>   kill(-1, SIGCONT);
>   keep_alive();
>   more logging to wtmp
>   acct(NULL);
>   keep_alive();
>   swapoff();
>   quotactl();
>   unmount filesystems
>   shut down network interfaces
>   reboot(RB_AUTOBOOT);
>   reboot(RB_HALT_SYSTEM);
>   sleep waiting for hard reset from wd device (not supposed to get here)
>
> What this amounts to is not terribly deterministic: the system is shut
> down any time from 10 seconds after the first test failure, to 60 seconds
> after the last keep_alive() in do_shutdown(). If we allow 60 seconds
> between each keep_alive() in do_shutdown(), the latest possible time the
> machine becomes safe is 300 seconds after the test failure. (This is a
> rough analysis, I may be missing some things, and any extra keep_alive()
> in the shutdown path will extend the safe time.)
>
> When using a watchdog device as a replacement for fencing, we're not
> mainly interested in the system shutdown process, which seems to be the
> main point of the wd daemon. We're mainly interested in making things as
> deterministic as possible within a short timespan.
>
> What we're after is a known, fixed time when we can safely assume the
> machine is reset/off (and therefore not modifying storage). The recovery
> logic I'd been expecting to use is: if the machine only pets its wd
> device after it successfully renews its lease (visible to everyone), then
> another machine can assume the wd device has fired at last_renewal+120,
> given the standard 60 sec wd device timeout.
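For reference, the test() described above reduces to a single timestamp comparison; a minimal sketch, assuming last_renewal holds the time of the most recent successful host_id renewal:

    #include <time.h>

    /* Sketch of the "sanlock wdtest" check described above: report healthy
     * (0) if the host_id lease was renewed within the last 30 seconds,
     * otherwise report failure (1). */
    static int wdtest(time_t last_renewal)
    {
        return (time(NULL) - last_renewal < 30) ? 0 : 1;
    }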
If a test fails, the daemon should do nothing and continue running the tests. If all tests go back to returning success, it can resume petting the wd device. It should not try to poweroff/reboot/etc itself; it should only pet the wd device according to the test results.

One problem multiple tests can create is independently staggered "windows" of not petting the wd device. Even if none of the tests individually exceeds the wd timeout, the overlap of multiple test windows can eventually trigger spurious resets. There are some strategies to avoid this in an old email that I will try to reference.
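A minimal sketch of that behavior, assuming a stubbed aggregate check and a hypothetical 10-second pet interval; unlike the poweroff-based sketch in the problem statement above, there is no shutdown path at all:

    #include <fcntl.h>
    #include <unistd.h>

    /* Stubbed aggregate of all check scripts/registered clients so the
     * sketch compiles; the real daemon would run each test here. */
    static int all_tests_pass(void)
    {
        return 1;
    }

    int main(void)
    {
        int wd = open("/dev/watchdog", O_WRONLY);
        if (wd < 0)
            return 1;

        for (;;) {
            if (all_tests_pass())
                write(wd, "\0", 1); /* pet only while every test passes */
            /* On failure: no reboot/poweroff, just keep running the tests.
             * If they recover before the device timeout expires, petting
             * resumes and the reset is avoided; otherwise only the
             * hardware wd device resets the machine. */
            sleep(10);              /* hypothetical pet interval */
        }
    }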
(In reply to comment #3)
> If a test fails, the daemon should do nothing and continue running the tests.
> If all tests go back to returning success, it can resume petting the wd device.

ack

> It should not try to poweroff/reboot/etc itself; it should only pet the wd
> device according to the test results.

Why not? Running 'poweroff --force' simply kills the host quicker. External hosts/devices would still need to assume the full minute timeout of the watchdog device for safety, but at least running 'poweroff --force' syncs the disks before the machine dies.
A test can return a failure (causing the daemon to not pet the wd), and then return success again (causing the daemon to pet the wd). That's obviously not possible if you reboot at the failure. Also, sanlock wants to do its own recovery in that time period before the wd device does a reset; it takes the reset time and works backward to determine when to begin its stages of recovery.

Always using only the hardware wd device also makes us independent of whatever the software might do now or change to do in the future. If the reboot succeeds, the wd device won't fire. If we're using the wd device in place of fencing, I think our goal should be to always ensure that the wd device does fire.

Another possible advantage to only being reset by the wd device is that it allows other software to continue running all the way up to the device reset. This means we can build extra verification into the system. A program like sanlock continues writing to shared storage until the wd device resets the machine. If the wd device doesn't reset (misconfiguration or bug), other nodes will see those writes and know that the wd device has not fired.

Basically, I always want the wd device to fire if that's what we're depending on.
(In reply to comment #5)
> A test can return a failure (causing the daemon to not pet the wd), and then
> return success again (causing the daemon to pet the wd). That's obviously not
> possible if you reboot at the failure. Also, sanlock wants to do its own
> recovery in that time period before the wd device does a reset; it takes the
> reset time and works backward to determine when to begin its stages of
> recovery.

Ok, points taken. So sanlock would know that once it fails it would have 1 minute (or slightly less) to do its internal recovery actions? Can sanlock guarantee that it will always complete its necessary recovery prior to the system halt? What if the system is being killed because it is thrashing? In that case it's probably true that sanlock would not be able to complete recovery. So sanlock recovery is 'nice' but not 'required'? (Fine with that, just trying to understand.)

> Always using only the hardware wd device also makes us independent of whatever
> the software might do now or change to do in the future. If the reboot
> succeeds, the wd device won't fire. If we're using the wd device in place of
> fencing, I think our goal should be to always ensure that the wd device does
> fire.

Well, you're not relying on the wd device strictly speaking. You're relying on SOMETHING taking the host down in a short, deterministic timeframe. How the shutdown is accomplished is not all that important.

> Another possible advantage to only being reset by the wd device is that it
> allows other software to continue running all the way up to the device reset.
> This means we can build extra verification into the system. A program like
> sanlock continues writing to shared storage until the wd device resets the
> machine. If the wd device doesn't reset (misconfiguration or bug), other nodes
> will see those writes and know that the wd device has not fired.
>
> Basically, I always want the wd device to fire if that's what we're depending
> on.

Not opposed to this philosophy, so ack.
> So sanlock would know that once it fails it would have 1
> minute (or slightly less) to do its internal recovery actions?

Right.

> Can sanlock guarantee that it will always complete its necessary recovery prior
> to the system halt?

No, but the goal is for sanlock to successfully kill all the necessary vm's in time. If it can, then it will return success to the wd, and the system can avoid being reset at all.

> So sanlock recovery is 'nice' but not 'required'?

Right. The common example is loss of storage access. If this happens, sanlock will enter recovery mode where it stops petting the wd device (or tells the daemon to stop petting it), attempts to stop/kill all the necessary vm's, and if it can do that before the wd fires, then the system is safe, the wd can continue being petted, and the system can avoid being reset.
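A sketch of that recovery sequence from sanlock's side. The wd_client_* calls and helpers below are purely illustrative stand-ins for the new daemon's registration API, not an existing interface:

    #include <stdbool.h>

    /* Illustrative stand-ins for the daemon's client API and sanlock's
     * recovery actions; stubbed so the sketch is self-contained. */
    static void wd_client_fail(void) { /* tell the daemon: stop petting for us */ }
    static void wd_client_ok(void)   { /* tell the daemon: resume petting */ }
    static bool kill_all_vms(void)   { return true; /* stop/kill vm's using the lease */ }

    /* Called when storage access (and therefore host_id renewal) is lost. */
    void on_storage_lost(void)
    {
        /* Enter recovery: the wd device will fire at most one timeout
         * (default 60s) after the last pet, so work backward from that. */
        wd_client_fail();

        if (kill_all_vms()) {
            /* All vm's stopped before the wd timeout: the host is safe,
             * petting resumes, and the reset is avoided entirely. */
            wd_client_ok();
        }
        /* Otherwise do nothing and let the wd device reset the machine. */
    }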
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. It has been proposed for the next release. If you would like it considered as an exception in the current release, please ask your support representative.
wdmd is already a part of the sanlock package