Red Hat Bugzilla – Bug 144386
cman should log the reason for fence actions before executing them
Last modified: 2009-04-16 16:29:59 EDT
Description of problem:
Here's another item that I think is important for the support and
usability of this product. The cluster manager should log the reason
for fence actions before they are executed, much like Gulm does in the
6.0 version (e.g. Missed n heartbeats, gonna exec fence_apc on ...).
A couple of instances where this is particularly confusing are:
- One node doesn't get into the cman quorum before the others do and
the others create a fence domain. You suddenly have a rebooting node
(using fence_apc) and don't really know why. All you see in the log
of one of the quorate members is:
fenced: fencing node "link-12"
Why? I can already hear the support calls.
Something like, "post_join_delay (6) timeout exceeded and node-XX has
not joined cluster. Fencing node-XX" would result in less head
scratching, I think.
- When a node starts missing heartbeats, that should be logged as well.
A message should also precede the fence action, such as
"deadnode_timeout (21) exceeded. Fencing node-XX".
There are probably other scenarios in which a fence action is
taken. In short, the logs should state why each fence action is
taken so that it can be diagnosed.
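To make the request concrete, here is a minimal sketch of the "log the reason, then fence" ordering described above. It is not cman or fenced code. The names `REASONS`, `fence_reason`, `fence_with_reason` and `run_agent` are hypothetical, and the message wording is taken from the examples in this report.

```python
import logging

log = logging.getLogger("fenced-sketch")  # hypothetical logger name

# Hypothetical message templates, worded after the examples in this report.
REASONS = {
    "post_join_delay": "post_join_delay ({secs}) timeout exceeded and {node} has "
                       "not joined cluster. Fencing {node}",
    "deadnode_timeout": "deadnode_timeout ({secs}) exceeded. Fencing {node}",
}


def fence_reason(kind, secs, node):
    """Build the explanation that should appear in the log for a fence action."""
    return REASONS[kind].format(secs=secs, node=node)


def fence_with_reason(kind, secs, node, run_agent):
    """Log why the node is being fenced, then invoke the fence agent.

    run_agent is a stand-in callable for whatever actually performs the
    fencing (for example, something that would exec fence_apc).
    """
    log.warning(fence_reason(kind, secs, node))
    return run_agent(node)
```

The point of the ordering is that the reason reaches the log even if the fence agent hangs or the machine doing the logging goes down right afterwards.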
Version-Release number of selected component (if applicable):
Steps to Reproduce:
There are probably two requests here (since cman has no clue about
1. better cman logging (reason for a node being evicted)
2. better fenced logging (reason for fencing)
Patrick can do the former; I'll do the latter.
Made CMAN much chattier.
Checking in src/cnxman-socket.h;
new revision: 1.8; previous revision: 1.7
Checking in src/cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v <-- cnxman.c
new revision: 1.43; previous revision: 1.42
Checking in src/membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v <-- membership.c
new revision: 1.45; previous revision: 1.44
Logging as much info as I think fenced can provide wrt an explanation
for fencing. You now get something like:
kernel: CMAN: removing node va16 from the cluster : Missed too many
va15 fenced: va16 not a cluster member after 0 sec post_fail_delay
va15 fenced: fencing node "va16"
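With messages like these in place, someone reading the logs can pair each fence action with the lines that explain it. The following is a rough sketch of that lookup, assuming the message formats shown above. The function name `fence_reasons` and the regular expression are illustrative and not part of cman or fenced.

```python
import re

# Matches the fence action line, e.g.: fenced: fencing node "va16"
FENCING = re.compile(r'fenced: fencing node "(?P<node>[^"]+)"')


def fence_reasons(lines):
    """Map each fenced node to the earlier log lines that mention it."""
    reasons = {}
    earlier = []
    for line in lines:
        match = FENCING.search(line)
        if match:
            node = match.group("node")
            # Keep only earlier lines that name this node as a whole word.
            mentions = re.compile(r"\b" + re.escape(node) + r"\b")
            reasons[node] = [l for l in earlier if mentions.search(l)]
        else:
            earlier.append(line)
    return reasons
```

Fed the three sample lines above, this would associate va16 with the CMAN removal message and the post_fail_delay message that precede the fence action.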