| Summary: | NewPkg: Daemon to provide watchdog device multiplexing for RHEL HA and RHEV Storage use cases | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Perry Myers <pmyers> |
| Component: | distribution | Assignee: | David Teigland <teigland> |
| Status: | CLOSED NOTABUG | QA Contact: | Dean Jansa <djansa> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | ||
| Version: | 6.1 | CC: | abaron, cfeist, cluster-maint, syeghiay |
| Target Milestone: | rc | Keywords: | FutureFeature |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-02-14 16:11:58 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | |||
| Bug Blocks: | 509056, 845337, 858964, 877098, 877112, 999680, 999681, 999683 | ||
Description
Perry Myers
2011-01-18 17:49:55 UTC
The following are notes from an email discussion with dteigland re: watchdogd behavior and why it is not ideal for sanlock or RHEL HA usage:

> Every second the watchdog daemon does
>
>     result = test();
>     if (result)
>         do_shutdown();  /* its own version of "shutdown now" */
>     keep_alive();       /* pet wd device */
>
> In our case, test() calls "sanlock wdtest", which does:
>
>     if our host_id renewal was less than 30 seconds ago, return 0, else return 1
>
> Notably, the test result does *not* influence the petting of the wd device
> (keep_alive). What a non-zero test result does cause is for the wd daemon to
> immediately begin shutting down the machine. While doing the shutdown,
> the wd daemon tries its best to continue keeping the wd device alive by
> continuing to pet it. If the shutdown process hangs at some point, then
> the wd device will come to its rescue and do the reset. Specifically,
> do_shutdown() roughly does:
>
>     send email if configured
>     syslog("shutting down because of error");
>     keep_alive();
>     sleep(10);  /* make sure log is written and mail is sent */
>     keep_alive();
>     kill(1, SIGTSTP);
>     kill(-1, SIGSTOP);
>     go through each pid, calling kill(pid, SIGTERM);
>     kill(-1, SIGCONT);
>     sleep(5);
>     kill(-1, SIGSTOP);
>     go through each pid, calling kill(pid, SIGKILL);
>     kill(-1, SIGCONT);
>     keep_alive();
>     more logging to wtmp
>     acct(NULL);
>     keep_alive();
>     swapoff();
>     quotactl();
>     unmount filesystems
>     shut down network interfaces
>     reboot(RB_AUTOBOOT);
>     reboot(RB_HALT_SYSTEM);
>     sleep waiting for hard reset from wd device (not supposed to get here)
>
> What this amounts to is not terribly deterministic: the system is shut
> down any time from 10 seconds after the first test failure to 60 seconds
> after the last keep_alive() in do_shutdown(). If we allow 60 seconds
> between each keep_alive() in do_shutdown(), the latest possible time the
> machine becomes safe is 300 seconds after the test failure. (This is a
> rough analysis; I may be missing some things, and any extra keep_alive in
> the shutdown path will extend the safe time.)
>
> When using a watchdog device as a replacement for fencing, we're not
> mainly interested in the system shutdown process, which seems to be the
> main point of the wd daemon. We're mainly interested in making things as
> deterministic as possible within a short timespan.
>
> What we're after is a known, fixed time when we can safely assume the
> machine is reset/off (and therefore not modifying storage). The recovery
> logic I'd been expecting to use is: if the machine only pets its wd
> device after it successfully renews its lease (visible to everyone), then
> another machine can assume the wd device has fired at last_renewal+120,
> given the standard 60 sec wd device timeout.

Comment 3

If a test fails, the daemon should do nothing and continue running the tests.
If all tests go back to returning success, it can resume petting the wd
device. It should not try to poweroff/reboot/etc itself; it should only pet
the wd device according to the test results.

One problem multiple tests can create is independently staggered "windows" of
not petting the wd device. Even if none of the tests individually exceeds the
wd timeout, the overlap of multiple test windows can eventually trigger
spurious resets. There are some strategies to avoid this in an old email that
I will try to reference.
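[Editorial note] As a minimal illustration of the behavior comment 3 asks for (pet the device only while tests pass, never reboot from the daemon itself), here is a hedged C sketch against the standard Linux watchdog device interface. `run_tests()` is a hypothetical stand-in for the daemon's test clients such as "sanlock wdtest"; the real wdmd is organized differently.

```c
/*
 * Sketch of the loop comment 3 describes: pet /dev/watchdog only while
 * every test passes, do nothing (but keep testing) while any test fails,
 * and never attempt a reboot or shutdown from the daemon itself.
 */
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

extern int run_tests(void);     /* hypothetical: 0 = all tests passing */

int main(void)
{
    int timeout = 60;           /* the standard 60 sec device timeout */
    int fd = open("/dev/watchdog", O_WRONLY);

    if (fd < 0)
        return 1;
    ioctl(fd, WDIOC_SETTIMEOUT, &timeout);

    for (;;) {
        if (run_tests() == 0)
            ioctl(fd, WDIOC_KEEPALIVE, 0);  /* pet only on success */
        /* On failure: no reboot, no shutdown. If the failure persists
           past the timeout, the device itself resets the machine. */
        sleep(1);
    }
}
```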
Comment 4

(In reply to comment #3)
> If a test fails, the daemon should do nothing and continue running the
> tests. If all tests go back to returning success, it can resume petting
> the wd device.

ack

> It should not try to poweroff/reboot/etc itself; it should only pet the wd
> device according to the test results.

Why not? Running poweroff --force simply kills the host quicker. External
hosts/devices would still need to assume the full one-minute timeout of the
watchdog device for safety, but at least running poweroff --force syncs the
disks before the machine dies.

Comment 5

A test can return a failure (causing the daemon to not pet the wd), and then
return success again (causing the daemon to pet the wd). That's obviously not
possible if you reboot at the failure. Also, sanlock wants to do its own
recovery in the time period before the wd device does a reset; it takes the
reset time and works backward to determine when to begin its stages of
recovery.

Always using only the hardware wd device also makes us independent of whatever
the software might do now or change to do in the future. If the reboot
succeeds, the wd device won't fire. If we're using the wd device in place of
fencing, I think our goal should be to always ensure that the wd device does
fire.

Another possible advantage to only being reset by the wd device is that it
allows other software to continue running all the way up to the device reset.
This means we can build extra verification into the system. A program like
sanlock continues writing to shared storage until the wd device resets the
machine. If the wd device doesn't reset (misconfiguration or bug), other
nodes will see those writes and know that the wd device has not fired.

Basically, I always want the wd device to fire if that's what we're depending
on.

Comment 6

(In reply to comment #5)
> A test can return a failure (causing the daemon to not pet the wd), and
> then return success again (causing the daemon to pet the wd). That's
> obviously not possible if you reboot at the failure. Also, sanlock wants
> to do its own recovery in the time period before the wd device does a
> reset; it takes the reset time and works backward to determine when to
> begin its stages of recovery.

Ok, points taken. So sanlock would know that once it fails it would have one
minute (or slightly less) to do its internal recovery actions? Can sanlock
guarantee that it will always complete its necessary recovery prior to the
system halt? What if the system is being killed because it is thrashing? In
that case it's probably true that sanlock would not be able to complete
recovery. So sanlock recovery is "nice" but not "required"? (Fine with that,
just trying to understand.)

> Always using only the hardware wd device also makes us independent of
> whatever the software might do now or change to do in the future. If the
> reboot succeeds, the wd device won't fire. If we're using the wd device in
> place of fencing, I think our goal should be to always ensure that the wd
> device does fire.

Well, you're not relying on the wd device strictly speaking. You're relying
on SOMETHING taking the host down in a short, deterministic timeframe. How
the shutdown is accomplished is not all that important.

> Another possible advantage to only being reset by the wd device is that it
> allows other software to continue running all the way up to the device
> reset. This means we can build extra verification into the system. A
> program like sanlock continues writing to shared storage until the wd
> device resets the machine. If the wd device doesn't reset (misconfiguration
> or bug), other nodes will see those writes and know that the wd device has
> not fired.
>
> Basically, I always want the wd device to fire if that's what we're
> depending on.

Not opposed to this philosophy, so ack.
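[Editorial note] To make the timing discussed in the description and in comment 6 concrete, here is a hedged arithmetic sketch in C. TEST_WINDOW and WD_TIMEOUT come from the description (a test that passes while the renewal is under 30 seconds old, a 60 second device timeout); the 30 second MARGIN is an assumption, added only so the total matches the description's last_renewal+120, and all names are illustrative rather than sanlock's actual API.

```c
/*
 * Deadline arithmetic sketch. TEST_WINDOW and WD_TIMEOUT follow the
 * description; MARGIN is assumed slack (loop interval, renewal visibility
 * lag) chosen so the total matches the stated last_renewal+120.
 */
#define TEST_WINDOW 30   /* test passes while the renewal is this fresh */
#define WD_TIMEOUT  60   /* standard watchdog device timeout */
#define MARGIN      30   /* assumed slack; not stated in the source */

/* Earliest time another host may safely assume this host's wd device has
   fired: the last pet happens no later than last_renewal + TEST_WINDOW,
   and the device fires WD_TIMEOUT after the last pet. */
long host_dead_time(long last_renewal)
{
    return last_renewal + TEST_WINDOW + WD_TIMEOUT + MARGIN;   /* +120 */
}

/* Working backward, as comment 5 says sanlock does: local recovery
   (killing VMs) must begin early enough to finish before the reset. */
long recovery_start(long expected_fire, long kill_grace)
{
    return expected_fire - kill_grace;
}
```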
Comment 7

> So sanlock would know that once it fails it would have one minute (or
> slightly less) to do its internal recovery actions?

Right.

> Can sanlock guarantee that it will always complete its necessary recovery
> prior to the system halt?

No, but the goal is for sanlock to successfully kill all the necessary VMs in
time. If it can, then it will return success to the wd, and the system can
avoid being reset at all.

> So sanlock recovery is "nice" but not "required"?

Right. The common example is loss of storage access. If this happens, sanlock
will enter recovery mode where it stops petting the wd device (or tells the
daemon to stop petting it) and attempts to stop/kill all the necessary VMs.
If it can do that before the wd fires, then the system is safe, the wd can
continue being petted, and the system can avoid being reset.

This request was evaluated by Red Hat Product Management for inclusion in the
current release of Red Hat Enterprise Linux. Because the affected component is
not scheduled to be updated in the current release, Red Hat is unfortunately
unable to address this request at this time. It has been proposed for the next
release. If you would like it considered as an exception in the current
release, please ask your support representative.

wdmd is already a part of the sanlock package.
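[Editorial note] Returning to comment 7's recovery sequence (on storage loss, stop petting, try to kill the lease-holding VMs, and resume petting if they all die before the device fires), here is a rough C sketch under stated assumptions. `renewal_ok()`, `vm_pids()`, and the single-shot SIGTERM policy are all hypothetical, not sanlock's real interfaces.

```c
/*
 * Rough sketch of comment 7's storage-loss recovery. renewal_ok() and
 * vm_pids() are hypothetical helpers, not sanlock's actual API.
 */
#include <signal.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

extern bool renewal_ok(void);              /* hypothetical: lease fresh? */
extern int  vm_pids(int *pids, int max);   /* hypothetical: VMs on lease */

static bool all_vms_dead(void)
{
    int pids[64];
    int n = vm_pids(pids, 64);

    for (int i = 0; i < n; i++)
        if (kill(pids[i], 0) == 0)         /* pid still exists */
            return false;
    return true;
}

/* Called once per second from the main loop with the open wd fd. */
void recovery_step(int wd_fd)
{
    if (renewal_ok()) {
        ioctl(wd_fd, WDIOC_KEEPALIVE, 0);  /* normal case: keep petting */
        return;
    }

    /* Storage lost: stop petting and try to kill the necessary VMs
       (a real implementation would escalate SIGTERM to SIGKILL). */
    int pids[64];
    int n = vm_pids(pids, 64);
    for (int i = 0; i < n; i++)
        kill(pids[i], SIGTERM);

    /* If every VM is gone before the device fires, the host is safe
       again and petting can resume, avoiding the reset. */
    if (all_vms_dead())
        ioctl(wd_fd, WDIOC_KEEPALIVE, 0);
}
```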