Red Hat Bugzilla – Bug 461948
Provide a method to wait for kdump to complete from fencing
Last modified: 2016-04-26 10:38:26 EDT
Description of problem:
We would like to see a way to delay fencing of a node if the node is in the kdump environment.
Quasi-similar functionality exists today in the form of "post_fail_delay". Unfortunately, this feature is a hard-stop, and does not allow:
* early termination if the dump completes quicker than the timeout, or
* extension of the timeout if the dump has not completed.
This request has two parts:
(1) Kernel (kdump environment) side
* Provide method to periodically send network packets to notify cluster of kdump status.
* (optional) Provide method to indicate kdump completion to the cluster.
(2) Cluster (fence) side
* Provide a method to listen for kdump "I am still dumping" packets
* Extend timeout when these packets are received
* If timeout exceeded, proceed with fencing.
* (optional) Proceed with fencing immediately if kdump completion packet is received.
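The status packets described above could be as simple as a fixed header plus a message type. A rough sketch (the magic value, version field, and message types here are made up for illustration, not an actual wire format):

```python
import struct

# Hypothetical packet layout: 4-byte magic, 4-byte version, 4-byte type.
MAGIC = 0x4B444D50          # arbitrary marker to reject stray datagrams
MSG_DUMPING = 1             # "I am still dumping"
MSG_DONE = 2                # optional: dump completed

def pack_msg(msg_type, version=1):
    """Build a 12-byte status packet in network byte order."""
    return struct.pack("!III", MAGIC, version, msg_type)

def unpack_msg(data):
    """Parse a status packet; raise if the magic doesn't match."""
    magic, version, msg_type = struct.unpack("!III", data[:12])
    if magic != MAGIC:
        raise ValueError("not a kdump status packet")
    return version, msg_type
```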
This feature should be *disabled by default*, as it delays recovery processing in a high availability cluster. Delaying cluster recovery restricts or interrupts access to SANs and cluster file systems. Additionally, it negatively affects application availability (in HA failover environments). Customers wishing to use this feature must carefully weigh the benefits of obtaining crash dumps against the benefits of faster cluster recovery.
The listener on the cluster side could be implemented as a fence agent which listens for the packets (rather than complicating fenced). If the listener is implemented as a fence agent, no core cluster infrastructure changes should be necessary and this option could be enabled/disabled in the cluster configuration at run-time.
* It can take 15-20 seconds before the kdump environment boots to the point of being able to send "don't taze me bro" packets (vgoyal).
* The cluster's default timeout is 10 seconds in RHEL5.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux major release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux major release. This request is not yet committed for inclusion.
This is a duplicate of bug 309991, I don't know which to close.
Same suggestion as the RHEL5 equivalent, bz 309991 -- use SAN fencing. And if someone wants a form of automatic reboot, call "reboot" at the end of the kdump routine, or set up a watchdog to monitor the kdump and reboot when it's done.
Modifying Lon's original description a bit based on irc discussion:
(1) kdump side
kdump environment is required to have all the same networks configured in the same way as the normal cluster environment
kdump environment starts a new daemon fence_dump_ack which broadcasts DONTTAZEMEBRO packets on all networks every N seconds, as long as kdump continues to run properly.
fence_dump_ack must not be starved by the heavy disk I/O.
I don't know what the complicating factors would be on this side.
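Something like this, roughly, for the kdump-side sender (the port, payload, and interval are assumptions for illustration, and this sketch broadcasts on one socket rather than on every configured network):

```python
import socket
import time

PORT = 7410        # hypothetical predefined port
INTERVAL = 10      # "every N seconds" from the design notes

def send_alive(sock, addr=("<broadcast>", PORT), payload=b"DONTTAZEMEBRO"):
    """Send one status datagram."""
    sock.sendto(payload, addr)

def run(dump_in_progress):
    """Broadcast until dump_in_progress() reports the dump has ended."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    try:
        while dump_in_progress():
            send_alive(sock)
            time.sleep(INTERVAL)
    finally:
        sock.close()
```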
(2) Cluster (fence) side
new fence agent fence_dump_check listens for broadcast messages on the cluster network
Upon receiving a dump_ack message, fence_dump_check compares the source IP to the IP that the victim node name resolves to; if they match, it returns 0 (success) and the fencing is complete.
fence_dump_check waits up to M seconds for a matching dump_ack; if none arrives, it returns 1 (failure), the fence operation fails, and fenced moves on to the next configured fence agent.
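The cluster-side check boils down to a timed receive loop. A simplified sketch (port and payload handling are assumptions, not the shipped agent):

```python
import socket
import time

def dump_check(victim_ip, port, timeout):
    """Wait up to `timeout` seconds for a datagram from victim_ip.
    Returns 0 (fence success) if one arrives, 1 (failure) otherwise."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    deadline = time.monotonic() + timeout
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return 1            # no dump_ack: fenced tries the next agent
            sock.settimeout(remaining)
            try:
                _data, (src, _sport) = sock.recvfrom(64)
            except socket.timeout:
                return 1
            if src == victim_ip:
                return 0            # victim is dumping: fencing is complete
    finally:
        sock.close()
```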
Honestly, this doesn't seem as complicated as I feared it might be, although there could be some nasty situations that come up in implementation and testing.
One complicating factor will be non-crash fencing situations, where the failed/victim node is not kdumping, e.g. startup fencing, fencing due to network disruption or partition. In each of these cases we'll be running fence_dump_check for M seconds for each victim that's not actually dumping before getting to the "real" fence agent. SAN fencing still seems like a better solution in general, but if that's not available, this may work passably.
Questions for the kdump experts here...
1. How do we ensure that kdump network environment mirrors the real environment?
2. What can/can't we run as part of the kdump pre script? Will running a daemon that sends out broadcast packets on all interfaces work or will that be problematic?
Question for dct:
Your described implementation of fence_dump_ack means that if kdump is successful the machine is never rebooted via a power fence. Would it be useful for fence_dump_ack to return success so that the cluster can continue but also spawn a separate thread/process that waits for the final broadcast message from the dumping node so that it can call the secondary fence agent that will actually power cycle the node?
Sorry I've never used kdump, but doesn't it reboot the machine when the dump completes?
Spawning something on the fencing node to monitor the remote kdump and power cycle the machine when it finishes sounds really ugly. The machine doing the kdump should really be responsible for rebooting itself one way or another, even if that means using a watchdog for the worst situations.
Agreed. Assuming kdump reboots the machine then we don't need what I described above. Just the simple solution you outlined.
If you want fencing to reboot the node after kdump completes, you can simply have two different sorts of packets which fence_dump_check can wait for:
- WAIT - sent periodically when kdump is running
- DONE - sent after kdump completes
If fence_dump_check is not configured to honor WAIT packets, then it will exit immediately (success) after the first WAIT packet is received. At that point, the node will have entered the kdump environment, replacing the old kernel.
If fence_dump_check is configured to honor 'WAIT' packets (e.g. fence_dump_check wait="1" or whatever), then WAIT can extend wait time. We don't return until (a) timeout or (b) DONE is received. If users decide to use kdump in this way, other fencing agents at the same level may be used to reboot the node after the DONE packet is received without relying on further modifications to or requirements on the kdump environment.
Also, we could have a meta attribute passed in to fencing agents generated by fenced - for example, the number of remaining fencing agents for this method/level - making the requirement to configure a special attribute for fence_dump_check obsolete.
E.g. if meta_remaining > 0, then honor WAIT packets, else don't.
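In agent terms, the WAIT/DONE handling could look something like this (message names, port handling, and the honor_wait flag are illustrative assumptions, not the shipped agent):

```python
import socket
import time

def dump_check_wait(victim_ip, port, timeout, honor_wait=False):
    """honor_wait=False: succeed on the first packet from the victim.
    honor_wait=True: WAIT packets extend the deadline; only DONE (or
    the timeout expiring) ends the wait."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    deadline = time.monotonic() + timeout
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return 1                    # timeout: fencing proceeds
            sock.settimeout(remaining)
            try:
                data, (src, _sport) = sock.recvfrom(64)
            except socket.timeout:
                return 1
            if src != victim_ip:
                continue
            if not honor_wait or data == b"DONE":
                return 0                    # success: next agent can power cycle
            if data == b"WAIT":
                deadline = time.monotonic() + timeout   # extend the window
    finally:
        sock.close()
```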
If we really need to insert a reboot into the process, just have fence_dump_ack call reboot when it sees the dump finish... but I still think kdump itself may prefer to be in charge of that.
(BTW, I've been thinking for a while that fence_dump_ack is looking an awful lot like a watchdog monitor. I wonder if we could just configure the watchdog program to: monitor the kdump, reboot if the dump stalls, reboot after the dump completes, and periodically call fence_dump_ack, which would just send the broadcast.)
This feature would require substantial work if it was to be done.
*** Bug 309991 has been marked as a duplicate of this bug. ***
Andrew pointed me at this:
It's not the same design we had in mind, but it's a lot less code to write.
This one creates a special user in the kdump environment and copies in a SSH key. The cluster then uses ssh to connect to the kdump environment and check whether the host is dumping or not.
There are some issues with the implementation:
1) the stonith module appears to wait forever if it fails to connect
2) the stonith module is a stonith module and not usable by fenced at this point
I have not checked whether the patch to mkdumprd still works; the patch was submitted to linux-ha-dev in 2008, shortly after the cluster summit in Prague.
That's not a bad idea, although I'm a little hesitant to start an sshd server while we're kdumping. Nominally the kdump kernel/initrd is going to be operating in a very restrictive memory environment (typically 128MB). If we're doing dump filtering (which is memory intensive) and we need to service ssh operations in parallel, we might be looking at out-of-memory conditions, or at least some failed allocations. It would be better if the dumping system could initiate an action to prevent it from being fenced. That way we can serialize the fence-prevention and dump-capture operations and save on memory use.
That said, it won't hurt to test this patch out and see how it goes.
Notes from Lon in a discussion he had with Neil/Subhendu:
Issues on the ssh implementation that Andrew noted.
It is very sensitive to:
- ssh key synchronization
- UID changes
- key changes
Also, adding sshd greatly increases:
- memory & dump image footprints
Now, it turns out this model may be preferable - that is, have the cluster connect to the dumping machine rather than wait for a special packet. Specifically, we can likely use nc to implement a simple server on a predefined IP port. This will save a lot on memory and on-disk footprint for the dump ramdisk, because nc is a built-in for busybox.
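The cluster-side half of that model is just a connect test: the kdump initrd runs a trivial TCP listener (e.g. busybox nc -l on a predefined port), and the agent checks whether a connection is accepted. A sketch, with the port number assumed:

```python
import socket

DUMP_PORT = 7410    # hypothetical predefined port

def node_is_dumping(host, port=DUMP_PORT, timeout=3.0):
    """True if the host accepts a TCP connection on the dump port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True     # listener answered: kdump environment is up
    except OSError:
        return False        # refused or timed out: not (yet) dumping
```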
Nobody is available to work on this; kicking it down the road.
Pushed new fence_kdump agent to fence-agents repo, both master and RHEL6 branch.
Created attachment 516371 [details]
Using and testing the fence_kdump agent
Here is a write-up about how to use/test the new fence_kdump agent. I've described how you can test fence_kdump and fence_kdump_send independently as well as how to test it in a cluster with the kdump service enabled. Please send questions and comments.
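For orientation, the usual shape of such a configuration is a kdump method tried before the real power fencing method, so fenced falls through to power cycling if no "still dumping" message arrives. A hypothetical cluster.conf excerpt (see the attached write-up for the authoritative syntax; device names and the fence_apc attributes here are placeholders):

```xml
<clusternode name="node1" nodeid="1">
  <fence>
    <method name="kdump">
      <device name="kdump-check"/>
    </method>
    <method name="power">
      <device name="apc" port="1"/>
    </method>
  </fence>
</clusternode>
<fencedevices>
  <fencedevice name="kdump-check" agent="fence_kdump"/>
  <fencedevice name="apc" agent="fence_apc" ipaddr="..." login="..." passwd="..."/>
</fencedevices>
```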
Created attachment 516744 [details]
Using and testing the fence_kdump agent
Updated the usage/testing write-up to describe how to change the behavior of fence_kdump_send.
This would be handy on RHEL5 too.
Dave, the issue with fencing (power fencing) when a crashed node is running kdump is that, half the time, the node is rebooted before the vmcore is complete.
Kdump supposedly reboots the machine, but that doesn't always happen...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.