Bug 144386 - cman should log the reason for fence actions before executing them
Status: CLOSED CURRENTRELEASE
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cman
Version: 4
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: David Teigland
QA Contact: Cluster QE
Reported: 2005-01-06 13:17 EST by Derek Anderson
Modified: 2009-04-16 16:29 EDT
Doc Type: Bug Fix
Last Closed: 2005-03-31 16:09:31 EST
Description Derek Anderson 2005-01-06 13:17:30 EST
Description of problem:
Here's another item that I think is important for the support and
usability of this product.  The cluster manager should log the reason
for fence actions before they are executed, much like Gulm does in the
6.0 version (e.g. Missed n heartbeats, gonna exec fence_apc on ...).

A couple of instances where this is particularly confusing are:
- One node doesn't get into the cman quorum before the others do and
the others create a fence domain.  You suddenly have a rebooting node
(using fence_apc) and don't really know why.  All you see in the log
of one of the quorate members is: 

fenced[2522]: fencing node "link-12"

Why?  I can already hear the support calls.
Something like, "post_join_delay (6) timeout exceeded and node-XX has
not joined cluster.  Fencing node-XX" would result in less head
scratching, I think.

- When a node starts missing heartbeats, that should be logged as well,
followed by a message before the fence action such as "deadnode_timeout
(21) exceeded.  Fencing node-XX".

So there are probably other scenarios in which a fence action is
taken.  The long and short of it is the logs should reflect why every
fence action is taken so it can be diagnosed.
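To illustrate the request, here is a minimal sketch in Python (not the actual cman or fenced code, which is written in C) of logging the reason before the fence action runs. The function names, the reason keys, the logger name, and the message wording are all hypothetical; the wording is simply taken from the examples suggested above.

```python
import logging

log = logging.getLogger("fenced-sketch")  # hypothetical logger name

# Hypothetical reason templates, worded after the examples in this report.
REASONS = {
    "post_join_delay": "post_join_delay ({value}) timeout exceeded and "
                       "{node} has not joined cluster.",
    "deadnode_timeout": "deadnode_timeout ({value}) exceeded.",
}


def fence_reason_message(node, reason, value):
    """Build the explanatory line that should precede a fence action."""
    return REASONS[reason].format(node=node, value=value) + "  Fencing " + node


def fence_node(node, reason, value, run_agent):
    """Log why the node is being fenced, then invoke the fence agent."""
    message = fence_reason_message(node, reason, value)
    log.warning(message)  # the reason is logged first ...
    run_agent(node)       # ... then the caller-supplied agent runs
    return message
```

The point of the ordering in `fence_node` is that the explanation reaches the log even if the agent call hangs or the logging node itself goes down shortly afterwards.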

Thanks.

Version-Release number of selected component (if applicable):
6.1

Comment 1 David Teigland 2005-01-06 21:37:25 EST
There are probably two requests here (since cman has no clue about
fencing):
1. better cman logging (reason for a node being evicted)
2. better fenced logging (reason for fencing)
Patrick can do the former; I'll do the latter.
Comment 2 Christine Caulfield 2005-01-07 10:31:11 EST
Made CMAN much chattier.

Checking in src/cnxman-socket.h;
/cvs/cluster/cluster/cman-kernel/src/cnxman-socket.h,v  <--  cnxman-socket.h
new revision: 1.8; previous revision: 1.7
done
Checking in src/cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.43; previous revision: 1.42
done
Checking in src/membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.45; previous revision: 1.44
done
Comment 3 David Teigland 2005-01-16 23:39:54 EST
Logging as much info as I think fenced can provide wrt an explanation
for fencing.  You now get something like:

kernel: CMAN: removing node va16 from the cluster : Missed too many heartbeats
va15 fenced[2515]: va16 not a cluster member after 0 sec post_fail_delay
va15 fenced[2515]: fencing node "va16"
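
For diagnosis, the new messages can be paired up mechanically. The following is a rough Python sketch that collects the reasons logged for a node before its "fencing node" line. The regular expressions are inferred only from the three sample lines above (with the wrapped kernel message treated as a single line), not from the CMAN or fenced sources, and the function name is hypothetical.

```python
import re

# Patterns inferred from the sample log lines in this comment; assumptions,
# not a specification of the real log format.
REMOVED = re.compile(r"CMAN: removing node (\S+) from the cluster : (.+)")
NOT_MEMBER = re.compile(
    r"fenced\[\d+\]: (\S+) not a cluster member after (\d+) sec (\w+)")
FENCING = re.compile(r'fenced\[\d+\]: fencing node "([^"]+)"')


def explain_fencing(lines):
    """Map each fenced node to the reasons logged before the fence action."""
    reasons = {}  # node -> reasons seen so far
    fenced = {}   # node -> reasons known at the time it was fenced
    for line in lines:
        removed = REMOVED.search(line)
        not_member = NOT_MEMBER.search(line)
        fencing = FENCING.search(line)
        if removed:
            node, why = removed.group(1), removed.group(2).strip()
            reasons.setdefault(node, []).append(why)
        elif not_member:
            node, secs, param = not_member.groups()
            reasons.setdefault(node, []).append(
                f"not a cluster member after {secs} sec {param}")
        elif fencing:
            node = fencing.group(1)
            fenced[node] = list(reasons.get(node, []))
    return fenced
```

A node that maps to an empty list was fenced without any logged explanation, which is the situation this bug was opened about.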
Comment 4 Corey Marthaler 2005-03-31 16:09:31 EST
fix verified.
