Description of problem:
Here's another item that I think is important for the support and usability of this product. The cluster manager should log the reason for fence actions before they are executed, much like Gulm does in the 6.0 version (e.g. "Missed n heartbeats, gonna exec fence_apc on ...").

A couple of instances where this is particularly confusing:

- One node doesn't get into the cman quorum before the others do, and the others create a fence domain. You suddenly have a rebooting node (using fence_apc) and don't really know why. All you see in the log of one of the quorate members is:

  fenced[2522]: fencing node "link-12"

  Why? I can already hear the support calls. Something like "post_join_delay (6) timeout exceeded and node-XX has not joined cluster. Fencing node-XX" would result in less head scratching.

- When a node starts missing heartbeats, that should be logged as well, along with a message before the fence action like "deadnode_timeout (21) exceeded. Fencing node-XX".

There are probably other scenarios in which a fence action is taken. The long and short of it is that the logs should reflect why every fence action is taken, so it can be diagnosed. Thanks.

Version-Release number of selected component (if applicable): 6.1
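To make the request concrete, here is a minimal sketch of reason-first fence logging. This is not fenced's actual code: fence_with_reason() and run_fence_agent() are hypothetical names, and the message formats simply follow the examples above.

#include <stdarg.h>
#include <syslog.h>

/* placeholder for the real agent invocation (e.g. exec'ing fence_apc) */
static void run_fence_agent(const char *node)
{
    (void)node;
}

/* Log the reason first, so the log explains the action that follows. */
static void fence_with_reason(const char *node, const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vsyslog(LOG_INFO, fmt, ap);
    va_end(ap);

    syslog(LOG_INFO, "fencing node \"%s\"", node);
    run_fence_agent(node);
}

A call site matching the first message requested above would then look like:

    fence_with_reason("node-XX",
                      "post_join_delay (%d) timeout exceeded and %s has not joined cluster",
                      6, "node-XX");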
There are probably two requests here (since cman has no clue about fencing):

1. better cman logging (the reason for a node being evicted)
2. better fenced logging (the reason for fencing)

Patrick can do the former; I'll do the latter.
Made CMAN much chattier.

Checking in src/cnxman-socket.h;
/cvs/cluster/cluster/cman-kernel/src/cnxman-socket.h,v  <--  cnxman-socket.h
new revision: 1.8; previous revision: 1.7
done
Checking in src/cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.43; previous revision: 1.42
done
Checking in src/membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.45; previous revision: 1.44
done
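A sketch of the kind of change involved, not the actual membership.c diff: carry a reason string through the node-removal path so the kernel log explains the eviction. struct cluster_node is reduced here to the one field the sketch needs, and remove_node_with_reason() is an illustrative name; the "CMAN:" prefix and message shape match the output shown in the next comment.

#include <linux/kernel.h>

struct cluster_node {
    char name[65];    /* reduced for the sketch */
    /* ... */
};

static void remove_node_with_reason(struct cluster_node *node, const char *reason)
{
    printk(KERN_INFO "CMAN: removing node %s from the cluster : %s\n",
           node->name, reason);
    /* ...the existing eviction/state-transition logic would follow... */
}

So that, for example, remove_node_with_reason(node, "Missed too many heartbeats") produces the first line of the log below.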
Logging as much info as I think fenced can provide wrt an explanation for fencing. You now get something like:

kernel: CMAN: removing node va16 from the cluster : Missed too many heartbeats
va15 fenced[2515]: va16 not a cluster member after 0 sec post_fail_delay
va15 fenced[2515]: fencing node "va16"
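The second line above implies a post_fail_delay wait before the fence. A sketch of that check, with illustrative names only (is_cluster_member() is a hypothetical stand-in for fenced's real membership query):

#include <syslog.h>
#include <unistd.h>

/* hypothetical stand-in: the real check would consult cman membership */
static int is_cluster_member(const char *node)
{
    (void)node;
    return 0;
}

/* Give the failed node post_fail_delay seconds to reappear; if it is
 * still not a member, log why before fencing proceeds. */
static void check_post_fail_delay(const char *victim, int post_fail_delay)
{
    sleep(post_fail_delay);

    if (!is_cluster_member(victim))
        syslog(LOG_INFO, "%s not a cluster member after %d sec post_fail_delay",
               victim, post_fail_delay);
}

With post_fail_delay set to 0, this yields exactly the "after 0 sec" line shown above.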
Fix verified.