Bug 461948 - Provide a method to wait for kdump to complete from fencing
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: fence-agents
Version: 6.2
Hardware: All Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Ryan O'Hara
QA Contact: Cluster QE
Keywords: FutureFeature
Depends On:
Blocks: 309991 585266 585332 918795 1081175
Reported: 2008-09-11 12:14 EDT by Lon Hohberger
Modified: 2016-04-26 10:38 EDT
Fixed In Version: fence-agents-3.1.5-5.el6
Doc Type: Enhancement
Clones: 918795
Last Closed: 2011-12-06 07:22:19 EST

Attachments:
Using and testing the fence_kdump agent (7.51 KB, text/plain), 2011-08-02 12:58 EDT, Ryan O'Hara
Using and testing the fence_kdump agent (8.22 KB, text/plain), 2011-08-04 12:17 EDT, Ryan O'Hara

Description Lon Hohberger 2008-09-11 12:14:33 EDT
Description of problem:

We would like to see a way to delay fencing of a node if the node is in the kdump environment.

Quasi-similar functionality exists today in the form of "post_fail_delay".  Unfortunately, this feature is a hard-stop, and does not allow:
 * early termination if the dump completes quicker than the timeout, or
 * extension of the timeout if the dump has not completed.

This request has two parts:

(1) Kernel (kdump environment) side

 * Provide method to periodically send network packets to notify cluster of kdump status.
 * (optional) Provide method to indicate kdump completion to the cluster.

(2) Cluster (fence) side

 * Provide a method to listen for kdump "I am still dumping" packets
   * Extend timeout when these packets are received
   * If timeout exceeded, proceed with fencing.
   * (optional) Proceed with fencing immediately if kdump completion packet is received.

Additional info:

This feature should be *disabled by default*, as it delays recovery processing in a high availability cluster.  Delaying cluster recovery restricts or interrupts access to SANs and cluster file systems.  Additionally, it negatively affects application availability (in HA failover environments).  Customers wishing to use this feature must carefully weigh the benefits of obtaining crash dumps versus the benefits of faster cluster recovery.

The listener on the cluster side could be implemented as a fence agent which listens for the packets (rather than complicating fenced).  If the listener is implemented as a fence agent, no core cluster infrastructure changes should be necessary and this option could be enabled/disabled in the cluster configuration at run-time.

Details:

 * It can take 15-20 seconds before the kdump environment boots to the point of being able to send "don't taze me bro" packets (vgoyal).
 * The cluster's default timeout is 10 seconds in RHEL5.
Comment 1 RHEL Product and Program Management 2009-02-05 18:34:11 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.
Comment 2 David Teigland 2009-07-31 12:50:24 EDT
This is a duplicate of bug 309991; I don't know which to close.
Comment 4 David Teigland 2009-09-23 12:42:33 EDT
Same suggestion as the RHEL5 equivalent, bz 309991 -- use SAN fencing.  And if
someone wants a form of automatic reboot, call "reboot" at the end of the
kdump routine, or set up a watchdog to monitor the kdump and reboot when it's
done.
Comment 5 David Teigland 2009-09-23 15:10:39 EDT
Modifying Lon's original description a bit based on irc discussion:

(1) kdump side

The kdump environment is required to have all the same networks configured in the same way as the normal cluster environment.

The kdump environment starts a new daemon, fence_dump_ack, which broadcasts DONTTAZEMEBRO packets on all networks every N seconds, as long as kdump continues to run properly.

fence_dump_ack must not be starved by the heavy disk I/O.

I don't know what the complicating factors would be on this side.

(2) Cluster (fence) side

A new fence agent, fence_dump_check, listens for broadcast messages on the cluster network.

On receiving a dump_ack message, fence_dump_check compares the source IP to the IP of the victim node name. If they match, it returns 0 (success) and the fencing is complete.

fence_dump_check waits up to M seconds to receive a matching dump_ack. If none arrives, it returns 1 (failure), the fence operation fails, and fenced moves on to the next configured fence agent.

Honestly, this doesn't seem as complicated as I feared it might be, although there could be some nasty situations that come up in implementation and testing.
One complicating factor will be non-crash fencing situations, where the failed/victim node is not kdumping, e.g. startup fencing, fencing due to network disruption or partition.  In each of these cases we'll be running fence_dump_check for M seconds for each victim that's not actually dumping before getting to the "real" fence agent.   SAN fencing still seems like a better solution in general, but if that's not available, this may work passably.
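
A rough Python sketch of the fence_dump_ack / fence_dump_check exchange described above, for illustration only. The UDP transport, the payload bytes, and the function signatures are assumptions; the comment does not specify a wire format, and this is not necessarily how a real agent would be written.

```python
import select
import socket
import time

# Payload name borrowed from the comment; the actual bytes are an assumption.
DUMP_ACK = b"DONTTAZEMEBRO"


def send_dump_ack(dest_ip, port):
    """kdump side: send one dump_ack datagram (dest_ip may be a broadcast address)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(DUMP_ACK, (dest_ip, port))


def fence_dump_check(sock, victim_ip, wait_seconds):
    """Cluster side: return 0 if victim_ip sends a dump_ack within wait_seconds, else 1."""
    deadline = time.monotonic() + wait_seconds
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return 1  # M seconds elapsed: fail so fenced tries the next agent
        readable, _, _ = select.select([sock], [], [], remaining)
        if not readable:
            return 1
        data, (src_ip, _src_port) = sock.recvfrom(1024)
        # Ignore acks from other nodes; only the victim's ack counts.
        if data == DUMP_ACK and src_ip == victim_ip:
            return 0
```

An agent built along these lines would bind a UDP socket on the cluster network, call fence_dump_check(sock, victim_ip, M), and exit with the returned code. In the non-crash cases mentioned above, the call simply burns the full M seconds before returning 1.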
Comment 6 Perry Myers 2009-09-23 15:22:59 EDT
Questions for the kdump experts here...

1. How do we ensure that kdump network environment mirrors the real environment?
2. What can/can't we run as part of the kdump pre script?  Will running a daemon that sends out broadcast packets on all interfaces work or will that be problematic?

Question for dct:
Your described implementation of fence_dump_check means that if kdump is successful the machine is never rebooted via a power fence.  Would it be useful for fence_dump_check to return success so that the cluster can continue but also spawn a separate thread/process that waits for the final broadcast message from the dumping node so that it can call the secondary fence agent that will actually power cycle the node?
Comment 7 David Teigland 2009-09-23 16:40:14 EDT
Sorry I've never used kdump, but doesn't it reboot the machine when the dump completes?

Spawning something on the fencing node to monitor the remote kdump and power cycle the machine when it finishes sounds really ugly.  The machine doing the kdump should really be responsible for rebooting itself one way or another, even if that means using a watchdog for the worst situations.
Comment 8 Perry Myers 2009-09-23 17:05:13 EDT
Agreed.  Assuming kdump reboots the machine then we don't need what I described above.  Just the simple solution you outlined.
Comment 9 Lon Hohberger 2009-09-24 11:22:05 EDT
If you want fencing to reboot the node after kdump completes, you can simply have two different sorts of packets which fence_dump_check can wait for:

- WAIT - sent periodically when kdump is running
- DONE - sent after kdump completes

If fence_dump_check is not configured to honor WAIT packets, then fence_dump_check will exit immediately (success) after the first WAIT packet is received.  At that point, the node will have entered the kdump environment, replacing the old kernel.

If fence_dump_check is configured to honor 'WAIT' packets (e.g. fence_dump_check wait="1" or whatever), then WAIT can extend wait time.  We don't return until (a) timeout or (b) DONE is received.  If users decide to use kdump in this way, other fencing agents at the same level may be used to reboot the node after the DONE packet is received without relying on further modifications to or requirements on the kdump environment.
Comment 10 Lon Hohberger 2009-09-24 11:25:22 EDT
Also, we could have a meta attribute passed in to fencing agents generated by fenced - for example, the number of remaining fencing agents for this method/level - making the requirement to configure a special attribute for fence_dump_check obsolete.

E.g. if meta_remaining > 0, then honor WAIT packets, else don't.
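
The WAIT/DONE handling and the meta_remaining idea can be sketched as a small simulation in Python. This is only an illustration under stated assumptions: packets are modelled as (arrival_time, kind) tuples rather than read from the network, the function name is hypothetical, and returning 1 (failure) on timeout is an assumption, since the comments above do not say what the exit code should be in that case.

```python
def simulate_dump_check(packets, timeout, meta_remaining):
    """Replay a timeline of packets and return (exit_code, time_of_return).

    packets        -- list of (arrival_time, kind), sorted by time;
                      kind is "WAIT" or "DONE"
    timeout        -- seconds to wait for the next packet
    meta_remaining -- fence agents left in this method/level (hypothetical
                      meta attribute from the comment above)
    """
    honor_wait = meta_remaining > 0
    deadline = timeout
    for arrival, kind in packets:
        if arrival > deadline:
            break  # the timeout expired before this packet arrived
        if kind == "DONE":
            return 0, arrival  # dump finished
        if kind == "WAIT":
            if not honor_wait:
                return 0, arrival  # node is in kdump; that is enough
            deadline = arrival + timeout  # still dumping: extend the wait
    return 1, deadline  # assumed: timing out counts as failure


timeline = [(18, "WAIT"), (28, "WAIT"), (35, "DONE")]
print(simulate_dump_check(timeline, 20, 0))  # exits on the first WAIT
print(simulate_dump_check(timeline, 20, 1))  # waits for DONE
```

With meta_remaining set to 0 the agent returns as soon as the node is known to be in the kdump environment; with a later agent configured at the same level it holds off until DONE, so that agent can reboot the node afterwards.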
Comment 11 David Teigland 2009-09-24 13:00:25 EDT
If we really need to insert a reboot into the process, just have fence_dump_ack call reboot when it sees the dump finish... but I still think kdump itself may prefer to be in charge of that.

(BTW, I've been thinking for a while that fence_dump_ack is looking an awful lot like a watchdog monitor.  I wonder if we could just configure the watchdog program to: monitor the kdump, reboot if the dump stalls, reboot after the dump completes, periodically call fence_dump_ack which would just send the broadcast.)
Comment 12 David Teigland 2010-01-11 12:21:06 EST
This feature would require substantial work if it were to be done.
Comment 13 Lon Hohberger 2010-01-27 10:40:56 EST
*** Bug 309991 has been marked as a duplicate of this bug. ***
Comment 16 Lon Hohberger 2010-04-30 10:42:08 EDT
Andrew pointed me at this:

http://www.gossamer-threads.com/lists/linuxha/dev/51968

It's not the same design we had in mind, but it's a lot less code to write.

This one creates a special user in the kdump environment and copies in a SSH key.  The cluster then uses ssh to connect to the kdump environment and check whether the host is dumping or not.

There are some issues with the implementation:
1) the stonith module appears to wait forever if it fails to connect
2) the stonith module is a stonith module and not usable by fenced at this point

I have not checked whether the patch to mkdumprd still works; the patch was submitted to linux-ha-dev in 2008 shortly after the cluster summit in Prague.
Comment 17 Neil Horman 2010-04-30 11:43:14 EDT
That's not a bad idea, although I'm a little hesitant to start an sshd server while we're kdumping.  Nominally the kdump kernel/initrd is going to be operating in a very restrictive memory environment (typically 128 MB).  If we're doing dump filtering (which is memory intensive), and we need to service ssh operations in parallel, we might be looking at out-of-memory conditions, or at least some failed allocations.  It would be better if the dumping system could initiate an action to prevent it from being fenced.  That way we can serialize the fence prevention and dump capture operations, and save on memory use.

That said, it won't hurt to test this patch out and see how it goes.
Comment 20 Perry Myers 2010-07-15 17:03:24 EDT
Notes from Lon in a discussion he had with Neil/Subhendu:

Issues on the ssh implementation that Andrew noted.
It is very sensitive to:
- ssh key synchronization
- UID changes
- key changes

Also, adding sshd greatly increases:
- memory & dump image footprints

Now, it turns out the model may be preferable - that is, have the cluster
connect to the dumping machine rather than wait for a special packet.

Specifically, we can likely simply use nc to implement a simple server
on a predefined IP port.  This will save a lot on memory and on-disk
footprint for dump ramdisk because nc is a built-in for busybox.
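
A minimal Python sketch of the cluster-side probe for this "connect to the dumping machine" model, assuming the kdump ramdisk runs a trivial TCP listener (nc or similar) on an agreed port. The function name, the timeout, and treating any successful TCP connect as "the node is dumping" are all assumptions made for illustration.

```python
import socket


def is_dumping(host, port, timeout=5.0):
    """Return True if something accepts a TCP connection on host:port.

    Unlike the ssh-based stonith module, the connect is bounded by
    `timeout`, so the check cannot wait forever on an unreachable node.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat as "not dumping".
        return False
```

A fence agent using this would return success when is_dumping() is true and failure otherwise, letting fenced move on to the next configured agent.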
Comment 21 David Teigland 2010-11-22 11:57:06 EST
Nobody to work on this; kick it down the road.
Comment 26 Ryan O'Hara 2011-08-01 06:46:29 EDT
Pushed new fence_kdump agent to fence-agents repo, both master and RHEL6 branch.
Comment 27 Ryan O'Hara 2011-08-02 12:58:36 EDT
Created attachment 516371
Using and testing the fence_kdump agent

Here is a write-up about how to use/test the new fence_kdump agent. I've described how you can test fence_kdump and fence_kdump_send independently as well as how to test it in a cluster with the kdump service enabled. Please send questions and comments.
Comment 30 Ryan O'Hara 2011-08-04 12:17:57 EDT
Created attachment 516744
Using and testing the fence_kdump agent

Updated the usage/testing write-up to describe how to change the behavior of fence_kdump_send.
Comment 33 Alan Brown 2011-11-04 12:24:57 EDT
This would be handy on RHEL5 too.

Dave, the issue with fencing (power fencing) when a crashed node is running kdump is that half the time the node is rebooted before vmcore is completed.

Kdump supposedly reboots the machine, but that doesn't always happen...
Comment 34 errata-xmlrpc 2011-12-06 07:22:19 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1599.html
