Bug 163587 - kernel: CMANsendmsg failed: -101
Summary: kernel: CMANsendmsg failed: -101
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cman
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: low
Target Milestone: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2005-07-19 07:39 UTC by Christine Caulfield
Modified: 2009-04-16 20:00 UTC (History)

Fixed In Version: RHBA-2005-734
Clone Of:
Environment:
Last Closed: 2005-10-07 16:46:48 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2005:734 0 normal SHIPPED_LIVE cman-kernel bug fix update 2005-10-07 04:00:00 UTC

Description Christine Caulfield 2005-07-19 07:39:55 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.7.8) Gecko/20050524 Fedora/1.0.4-4 Firefox/1.0.4

Description of problem:
There are a couple of things that can cause this message. The easiest is to simply down the interface that cman is using and watch the messages scroll up until the node gets fenced.

In some more extreme circumstances it can prevent reboot of the machine (though I don't seem to be able to reproduce this with more recent kernels).

In any case it's a tatty message. cman should either wait quietly to be fenced or quit if all its channels of communication have been cut.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. start cman
2. ifconfig eth0 down
3. watch messages
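
For step 3, a minimal sketch of counting the failures instead of watching them scroll past. It assumes only that the log lines contain the text from this bug's summary ("CMANsendmsg failed: -101"); the function name and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Pattern taken from the message in this bug's summary. Nothing is assumed
# about the syslog prefix (timestamp, hostname) in front of it.
SENDMSG_FAILED = re.compile(r"CMANsendmsg failed: (-?\d+)")


def count_sendmsg_failures(lines):
    """Return a Counter mapping each reported error code to its count."""
    counts = Counter()
    for line in lines:
        match = SENDMSG_FAILED.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real log.
    sample = [
        "kernel: CMANsendmsg failed: -101",
        "kernel: some unrelated message",
        "kernel: CMANsendmsg failed: -101",
    ]
    print(count_sendmsg_failures(sample))
```

Feeding it the lines of the system log should give a rough idea of how fast the message repeats before the node is fenced.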
  

Additional info:

Normally this is not a problem; if it gets to the point where it prevents a reboot, that's usually a configuration error (cman not being shut down by the init scripts).

Comment 1 Lon Hohberger 2005-07-19 20:24:21 UTC
I think it's CMAN being shut down with no network connectivity, which causes the
problem.  The simulation is a bonded interface losing all connectivity
simultaneously, followed by a non-powercycle-fence.

Non-powercycle-fence events are generally non-recoverable.  That is, the node
can't rejoin the cluster by itself -- it requires manual intervention of some
form, because it could still have things waiting to be flushed (which are only
prevented by the fact that the node has been fenced off...).

Here's how to get around this:
(a) Instead of typing "reboot", try "reboot -fn"

(b) Press the power button and hold it for 5 seconds, release, then press it
again for 1 second.

(c) Press the reset button ;)


Comment 2 Christine Caulfield 2005-07-20 14:42:59 UTC
I've checked in a fix to the STABLE branch. If you can, please let me know how
you get on with it.

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.12.4.1.2.1; previous revision: 1.42.2.12.4.1
done


Comment 3 Christine Caulfield 2005-08-02 15:14:21 UTC
Also committed to RHEL4 branch

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.13; previous revision: 1.42.2.12
done


Comment 4 Corey Marthaler 2005-08-31 19:42:45 UTC
I'm still seeing these messages running revolver (see bz165160), is there still
a case which can cause this message to occur?

Comment 5 Christine Caulfield 2005-09-01 07:19:49 UTC
errno -101 is ENETUNREACH. I've only seen it when the network interface is downed,
but I suppose it can also happen if the route or IP address is changed such that
cman can't send a packet to its broadcast address.
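
As a quick way to check what a code like -101 means on a given machine, here is a small sketch using Python's standard errno table. The helper name is made up for illustration, and the numeric values are platform-specific (101 is the Linux value).

```python
import errno
import os


def decode_kernel_errno(code):
    """Map a kernel-style negative return value to (symbolic name, message)."""
    # Kernel code reports failures as -errno; the lookup tables use the
    # positive number.
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    print(decode_kernel_errno(-101))
```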

So, the message isn't going to go away completely because the condition that
causes it to happen is external to CMAN. 

This bug was originally about the looping and the prevention of a clean reboot.
cman now shuts itself down. Even without this it would be fenced out of the
cluster by the other nodes because the heartbeat messages are not reaching the
network.
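
To illustrate the behaviour described above (stop sending rather than loop on the error), here is a rough userspace sketch. It is not the change checked into cnxman.c, which is kernel C; the function names, the fake socket class, and the address and port are all hypothetical.

```python
import errno


class FakeDownedSocket:
    """Stand-in for a socket whose interface has been downed (illustrative only)."""

    def sendto(self, payload, address):
        raise OSError(errno.ENETUNREACH, "Network is unreachable")


def send_heartbeat(sock, payload, address):
    """Send one heartbeat; return False if the caller should shut down."""
    try:
        sock.sendto(payload, address)
        return True
    except OSError as exc:
        if exc.errno == errno.ENETUNREACH:
            # No route out: retrying would only repeat the same log message.
            return False
        raise  # other errors are out of scope for this sketch


def heartbeat_loop(sock, payload, address, max_beats):
    """Send up to max_beats heartbeats; return how many went out before stopping."""
    sent = 0
    for _ in range(max_beats):
        if not send_heartbeat(sock, payload, address):
            break  # shut down quietly instead of looping on the error
        sent += 1
    return sent


if __name__ == "__main__":
    # Placeholder broadcast address and port, not cman's real configuration.
    print(heartbeat_loop(FakeDownedSocket(), b"hb", ("192.0.2.255", 9999), 5))
```

The point of the design is that the error is treated as a reason to stop, leaving the other nodes to fence this one, instead of something to retry indefinitely.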


Comment 7 Red Hat Bugzilla 2005-10-07 16:46:48 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-734.html


