Description of problem:
Here's another item that I think is important for the support and usability of this product. The cluster manager should log the reason for fence actions before they are executed, much like Gulm does in the 6.0 version (e.g. "Missed n heartbeats, gonna exec fence_apc on ...").

A couple of instances where this is particularly confusing:

- One node doesn't get into the cman quorum before the others do, and the others create a fence domain. You suddenly have a rebooting node (using fence_apc) and don't really know why. All you see in the log of one of the quorate members is:

  fenced[2522]: fencing node "link-12"

  Why? I can already hear the support calls. Something like "post_join_delay (6) timeout exceeded and node-XX has not joined cluster. Fencing node-XX" would result in less head scratching.

- When a node starts missing heartbeats, that should be logged as well, along with a message before the fence action like "deadnode_timeout (21) exceeded. Fencing node-XX".

There are probably other scenarios in which a fence action is taken. The long and short of it is that the logs should reflect why every fence action is taken, so it can be diagnosed. Thanks.

Version-Release number of selected component (if applicable): 6.1
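To make the request concrete, here is a minimal sketch of reason-first fence logging. This is not fenced's actual code: fence_with_reason() and run_fence_agent() are hypothetical names, and the message formats simply follow the examples above.

#include <stdarg.h>
#include <syslog.h>

/* placeholder for the real agent invocation (e.g. exec'ing fence_apc) */
static void run_fence_agent(const char *node)
{
    (void)node;
}

/* Log the reason first, so the log explains the action that follows. */
static void fence_with_reason(const char *node, const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vsyslog(LOG_INFO, fmt, ap);
    va_end(ap);

    syslog(LOG_INFO, "fencing node \"%s\"", node);
    run_fence_agent(node);
}

A call site matching the first message requested above would then look like:

    fence_with_reason("node-XX",
                      "post_join_delay (%d) timeout exceeded and %s has not joined cluster",
                      6, "node-XX");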
There are probably two requests here (since cman has no clue about fencing):

1. better cman logging (the reason for a node being evicted)
2. better fenced logging (the reason for fencing)

Patrick can do the former; I'll do the latter.
Made CMAN much chattier.

Checking in src/cnxman-socket.h;
/cvs/cluster/cluster/cman-kernel/src/cnxman-socket.h,v  <--  cnxman-socket.h
new revision: 1.8; previous revision: 1.7
done
Checking in src/cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.43; previous revision: 1.42
done
Checking in src/membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.45; previous revision: 1.44
done
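A sketch of the kind of change involved, not the actual membership.c diff: carry a reason string through the node-removal path so the kernel log explains the eviction. struct cluster_node is reduced here to the one field the sketch needs, and remove_node_with_reason() is an illustrative name; the "CMAN:" prefix and message shape match the output shown in the next comment.

#include <linux/kernel.h>

struct cluster_node {
    char name[65];    /* reduced for the sketch */
    /* ... */
};

static void remove_node_with_reason(struct cluster_node *node, const char *reason)
{
    printk(KERN_INFO "CMAN: removing node %s from the cluster : %s\n",
           node->name, reason);
    /* ...the existing eviction/state-transition logic would follow... */
}

So that, for example, remove_node_with_reason(node, "Missed too many heartbeats") produces the first line of the log below.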
Logging as much info as I think fenced can provide wrt an explanation for fencing. You now get something like:

kernel: CMAN: removing node va16 from the cluster : Missed too many heartbeats
va15 fenced[2515]: va16 not a cluster member after 0 sec post_fail_delay
va15 fenced[2515]: fencing node "va16"
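The second line above implies a post_fail_delay wait before the fence. A sketch of that check, with illustrative names only (is_cluster_member() is a hypothetical stand-in for fenced's real membership query):

#include <syslog.h>
#include <unistd.h>

/* hypothetical stand-in: the real check would consult cman membership */
static int is_cluster_member(const char *node)
{
    (void)node;
    return 0;
}

/* Give the failed node post_fail_delay seconds to reappear; if it is
 * still not a member, log why before fencing proceeds. */
static void check_post_fail_delay(const char *victim, int post_fail_delay)
{
    sleep(post_fail_delay);

    if (!is_cluster_member(victim))
        syslog(LOG_INFO, "%s not a cluster member after %d sec post_fail_delay",
               victim, post_fail_delay);
}

With post_fail_delay set to 0, this yields exactly the "after 0 sec" line shown above.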
Fix verified.