Red Hat Bugzilla – Bug 309991
Add coordination between Kdump and Cluster Fencing for long kernel panic dumps
Last modified: 2016-04-26 10:48:21 EDT
With large memory configurations, some machines take a long time to dump state
when a panic occurs. The cluster software may well force a reboot as a fence
operation before the dump completes. This causes the loss of data needed to
diagnose the root problem.
Cluster fencing needs a mechanism to hold off fencing until the dump completes
or assurance from the failed node that it will not re-awaken and cause data
corruption of shared information.
I've added, as part of bz 269761, the ability to run an arbitrary script from
the kdump initrd prior to capturing a vmcore. My thought was that we could use
this ability to fork a process that spoke to the cluster suite peer daemons in
such a way as to stall the fencing process. This obviously requires that the
fencing suite contain some utility to drive the communication appropriately,
which can then be added to kdump via /etc/kdump.conf. Thoughts, Jim?
Is there any status on this feature request Jim / Rob? What are the next steps?
This is now targeted for RHEL 5.3
But, when a node is crashing, you want it DOWN. That is the reason for fencing
in the first place. Are you really willing to risk your data?
Please read the initial comment on this bug. The reason this bug was opened was
because the fencing functionality of our cluster suite was power-cycling a
crashed box before kdump could complete running on it and collect the system
vmcore for a post-mortem analysis. I agree that you want to fence nodes that
are crashed, but you also want to figure out why they crashed in the first
place, and you can't do that if the cluster reboots your crashed system before it
can record its memory image. Including a utility/script with the cluster suite
to prevent fencing for use by the kdump_pre directive is, in my view, a good way
to do that. As to whether or not it makes sense in a given environment is up
I think the best way to deal with this problem is to use storage fencing rather than power fencing.
If there really is a need to do power fencing and delay it until kdump is done, then I like the idea of using a kdump hook to start a program that will broadcast periodic status messages on the progress of the kdump. The other cluster nodes would monitor this and delay fencing. I'm not sure if fencing would be considered successful once kdump was done without doing anything else (as done by stonith plugin below), or if we'd want to do the power fencing after kdump completed.
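To make the status-message idea concrete, here is a minimal Python sketch of the monitoring side only. Everything in it is an assumption for illustration: the JSON message layout, the names `encode_status` and `KdumpMonitor`, and the 30-second staleness window are hypothetical and not part of kdump or any cluster suite. It also leaves open the question raised above: once a node reports that it has finished, the sketch simply stops delaying and leaves the decision about what to do next to the caller.

```python
import json
import time


def encode_status(node, percent_done, finished=False):
    """Build one status datagram as UTF-8 JSON bytes (hypothetical format)."""
    message = {"node": node, "percent": percent_done, "finished": finished}
    return json.dumps(message).encode("utf-8")


class KdumpMonitor:
    """Tracks status messages from dumping nodes and advises on fencing delay."""

    def __init__(self, stale_after=30.0, clock=time.monotonic):
        self.stale_after = stale_after  # seconds without a message before we give up
        self.clock = clock              # injectable so the logic can be tested
        self.last_seen = {}             # node name -> time of last status message
        self.finished = set()           # nodes that reported a completed dump

    def handle(self, payload):
        """Record one received status message and return it as a dict."""
        message = json.loads(payload.decode("utf-8"))
        node = message["node"]
        self.last_seen[node] = self.clock()
        if message.get("finished"):
            self.finished.add(node)
        return message

    def should_delay_fencing(self, node):
        """True only while the node is still reporting dump progress."""
        if node in self.finished:
            return False  # dump is done; the caller decides what happens next
        seen = self.last_seen.get(node)
        if seen is None:
            return False  # never heard from the node, so do not hold off
        return (self.clock() - seen) < self.stale_after
```

A fencing daemon would feed every received datagram to `handle()` and consult `should_delay_fencing()` before power-cycling the node. If the messages stop arriving, the delay expires on its own, so a hung kdump cannot hold off fencing indefinitely.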
A third option is for remaining nodes to log into the kdumping node to monitor its progress. NTT implemented this as a pacemaker stonith plugin:
This is a duplicate of bug 461948; I don't know which to close.
From bz 461948:
Not a duplicate. One is for RHEL5 and one is for RHEL6.
Closing as a duplicate of 461948 since bug 461948 has a more complete set of design ideas.
*** This bug has been marked as a duplicate of bug 461948 ***
Instead of being closed, this bug should be marked as blocked by 461948, since the two tickets are for different versions of the OS. It stands to reason that this has to be developed in RHEL 6 before deciding whether to backport it to RHEL 5, but I don't agree that this ticket should be closed until a decision is made as to whether this will be fixed in RHEL 5 at all.
Calvin, you're right.
Isn't the node in kdumping state already effectively fenced?
It loads a completely new kernel and mounts only local storage (or, more generally, whatever is defined in kdump.conf). In this state it can hardly touch any shared resource unless you set it up that way.
I'm thinking of these conditions:
* kdump has a way to notify the cluster nodes that it has booted
* cluster fencing is configurable so it waits up to XX secs or until kdump echo, whichever comes first
* kdump echo is configurable and off by default
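The "wait up to XX secs or until kdump echo, whichever comes first" condition could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the UDP transport, the `KDUMP <node>` payload, and the function name `wait_for_kdump_echo` are hypothetical and do not describe an existing kdump or fencing interface.

```python
import socket
import time


def wait_for_kdump_echo(sock, expected_node, timeout_secs):
    """Return True if expected_node's echo arrives before the timeout, else False.

    sock is a bound UDP socket. The payload format "KDUMP <node>" is an
    assumption made for this sketch.
    """
    deadline = time.monotonic() + timeout_secs
    wanted = "KDUMP " + expected_node
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # timeout came first: proceed with normal fencing
        sock.settimeout(remaining)
        try:
            data, _addr = sock.recvfrom(1024)
        except socket.timeout:
            return False
        # Ignore datagrams from other nodes and keep waiting for ours.
        if data.decode("utf-8", errors="replace").strip() == wanted:
            return True  # echo came first: the node has booted into kdump
```

With the echo off by default, as proposed above, the fencing side would only call this when the administrator has enabled it and chosen the timeout. A `False` result means fencing goes ahead exactly as it does today.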
Then the customer who wants to capture kdumps is aware of what he's doing, since he must configure fencing (and know the consequences) as well as enable and configure kdump. Moreover, recovery is not limited by kdump in any way other than the configurable timeout until kdump starts.
Kdump restarts the machine at the end allowing normal operations to continue (cluster rejoin, migrate back the services, etc.). It can also hang and not restart the machine, but that's the risk you are willing to take when you enable it.
What do you think?
I think you are right that once we have exec'd the new kernel we should probably not have access to any cluster resources, but I wonder if that is always the case. For example, while I expect that the kdump kernel would not have support for gfs/gfs2, it probably needs access to an ext3 fs in order to write out the dump. Is it at all possible that the kexec'd kernel still has access to a cluster resource, like an ext3 fs that is managed by RHCS? If so, this could be a problem.
Overall though, the reason that people want kdump is to determine root cause on failures. Many customers have strict policies on when to return systems to service after failures and want to have as much information as they can about the failure in order to understand the risk and exposure of the bug they may have just experienced. I would expect most customers running RHCS in a high-availability type of situation would want to do so.
It seems like there should be some guidelines for configurations and some limitations documented for the combination of kdump and RHCS.
There was already a design in the other bugzilla. I figure the design would be used in both places.