From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.7.8) Gecko/20050524 Fedora/1.0.4-4 Firefox/1.0.4

Description of problem:
There are a couple of things that can cause this message. The easiest is simply to down the interface that cman is using and watch the messages scroll up until the node gets fenced. In some more extreme circumstances it can prevent reboot of the machine (though I don't seem to be able to reproduce this with more recent kernels). In any case it's a tatty message. cman should either wait quietly to be fenced or quit if all its channels of communication have been cut.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. start cman
2. ifconfig eth0 down
3. watch messages

Additional info:
Normally this is not a problem; if it gets to the point where it prevents a reboot, that's usually a configuration error (cman not being shut down by the init scripts).
I think it's CMAN being shut down with no network connectivity that causes the problem. The simulation is a bonded interface losing all connectivity simultaneously, followed by a non-powercycle fence. Non-powercycle fence events are generally non-recoverable. That is, the node can't rejoin the cluster by itself -- it requires manual intervention of some form, because it could still have things waiting to be flushed (which are only prevented by the fact that the node has been fenced off...).

Here's how to get around this:
(a) Instead of typing "reboot", try "reboot -fn"
(b) Press the power button and hold it for 5 seconds, release, then press it again for 1 second.
(c) Press the reset button ;)
I've checked in a fix to the STABLE branch. If you can, please let me know how you get on with it.

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.12.4.1.2.1; previous revision: 1.42.2.12.4.1
done
Also committed to RHEL4 branch

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.13; previous revision: 1.42.2.12
done
I'm still seeing these messages while running revolver (see bz165160). Is there still a case that can cause this message to occur?
errno -101 is ENETUNREACH. I've only seen it when the network interface is downed, but I suppose it can also happen if the route or IP address is changed such that cman can't send a packet to its broadcast address. So the message isn't going to go away completely, because the condition that causes it is external to CMAN. What this bug was (originally) about was the looping and the prevention of a clean reboot. cman now shuts itself down. Even without this it would be fenced out of the cluster by the other nodes because the heartbeat messages are not reaching the network.
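For anyone decoding the number in the log: on Linux, errno 101 is ENETUNREACH ("Network is unreachable"), and kernel code reports it negated. The sketch below is only an illustration of the behaviour described in this comment, not the cnxman.c change itself. The names `describe_kernel_errno`, `heartbeat_loop` and the `send` callable are hypothetical; the loop reports the failure once and stops rather than retrying and re-logging indefinitely.

```python
import errno
import os


def describe_kernel_errno(code):
    """Map a kernel-style negative errno (e.g. -101) to (name, message)."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


def heartbeat_loop(send, max_beats):
    """Send up to max_beats heartbeats; stop on the first unreachable network.

    `send` is a hypothetical callable that raises OSError on failure.
    Returns the number of heartbeats sent successfully.
    """
    sent = 0
    for _ in range(max_beats):
        try:
            send()
        except OSError as exc:
            if exc.errno == errno.ENETUNREACH:
                # Report once and give up instead of looping on the same error.
                name, message = describe_kernel_errno(-exc.errno)
                print(f"send failed: {name} ({message}); shutting down")
                break
            raise  # any other error is not handled in this sketch
        sent += 1
    return sent
```

Passing the sender in as a callable keeps the sketch testable without actually downing an interface; a real sender would be whatever performs the broadcast.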
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-734.html