163587 – kernel: CMANsendmsg failed: -101

Summary: kernel: CMANsendmsg failed: -101

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	cman
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	low
Target Milestone:	---
Assignee:	Christine Caulfield
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-07-19 07:39 UTC by Christine Caulfield
Modified:	2009-04-16 20:00 UTC
CC List:	1 user
Fixed In Version:	RHBA-2005-734
Clone Of:
Environment:
Last Closed:	2005-10-07 16:46:48 UTC
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2005:734	0	normal	SHIPPED_LIVE	cman-kernel bug fix update	2005-10-07 04:00:00 UTC

Description Christine Caulfield 2005-07-19 07:39:55 UTC

Description of problem:
There are a couple of things that can cause this message. The easiest is simply to bring down the interface that cman is using and watch the messages scroll up until the node gets fenced.

In some more extreme circumstances it can prevent reboot of the machine (though I don't seem to be able to reproduce this with more recent kernels).

In any case it's a tatty message. cman should either wait quietly to be fenced or quit if all its channels of communication have been cut.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. start cman
2. ifconfig eth0 down
3. watch messages
  

Additional info:

Normally this is not a problem. If it gets to the point where it prevents a reboot, that's usually a configuration error (cman not being shut down by the init scripts).

Comment 1 Lon Hohberger 2005-07-19 20:24:21 UTC

I think it's CMAN being shut down with no network connectivity, which causes the
problem.  The simulation is a bonded interface losing all connectivity
simultaneously, followed by a non-powercycle-fence.

Non-powercycle-fence events are generally non-recoverable.  That is, the node
can't rejoin the cluster by itself -- it requires manual intervention of some
form, because it could still have things waiting to be flushed (which are only
prevented by the fact that the node has been fenced off...).

Here's how to get around this:
(a) Instead of typing "reboot", try "reboot -fn"

(b) Press the power button and hold it for 5 seconds, release, then press it
again for 1 second.

(c) Press the reset button ;)

Comment 2 Christine Caulfield 2005-07-20 14:42:59 UTC

I've checked in a fix to the STABLE branch. If you can, please let me know how
you get on with it.

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.12.4.1.2.1; previous revision: 1.42.2.12.4.1
done

Comment 3 Christine Caulfield 2005-08-02 15:14:21 UTC

Also committed to RHEL4 branch

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.13; previous revision: 1.42.2.12
done

Comment 4 Corey Marthaler 2005-08-31 19:42:45 UTC

I'm still seeing these messages running revolver (see bz165160), is there still
a case which can cause this message to occur?

Comment 5 Christine Caulfield 2005-09-01 07:19:49 UTC

errno -101 is ENETUNREACH. I've only seen it when the network interface is downed,
but I suppose it can also happen if the route or IP address is changed such that
cman can't send a packet to its broadcast address.
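
For anyone who wants to check what a code like -101 means on their own machine, here is a minimal sketch in Python. It is not part of cman; the function name describe_kernel_errno is made up for illustration, and it assumes the kernel reports errors as negated errno values, as in the message above. Errno numbering differs between platforms and architectures, so the lookup reflects the system the script runs on.

```python
import errno
import os


def describe_kernel_errno(code: int) -> str:
    """Map a kernel-style negative return code (e.g. -101) to 'NAME: message'.

    Hypothetical helper for illustration only; it uses the errno table of
    the machine it runs on.
    """
    number = abs(code)
    # errno.errorcode maps numeric values to symbolic names such as 'ENETUNREACH'.
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


if __name__ == "__main__":
    print(describe_kernel_errno(-101))
```

On x86 Linux with glibc, 101 maps to ENETUNREACH ("Network is unreachable").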

So, the message isn't going to go away completely because the condition that
causes it to happen is external to CMAN. 

This bug was originally about the message looping and preventing a clean reboot.
cman now shuts itself down. Even without that change, the node would be fenced out
of the cluster by the other nodes because the heartbeat messages are not reaching
the network.
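
As a rough illustration of the "shut down instead of looping" behaviour described above, here is a small Python sketch. It is not the code in cnxman.c, and the names run_heartbeats and send_once are hypothetical. It only shows the general pattern: treat ENETUNREACH as fatal, log it once and stop, rather than retrying and logging on every attempt.

```python
import errno


def run_heartbeats(send_once, beats):
    """Call send_once() up to `beats` times, stopping at the first ENETUNREACH.

    Returns (sent_count, shut_down). Both the function and its parameters
    are assumptions made for this sketch, not cman's real interface.
    """
    sent = 0
    for _ in range(beats):
        try:
            send_once()
        except OSError as exc:
            if exc.errno == errno.ENETUNREACH:
                # Log once and give up instead of repeating the message forever.
                print("sendmsg failed: network unreachable, shutting down")
                return sent, True
            raise  # any other error is left to the caller
        sent += 1
    return sent, False
```

In this sketch the caller decides what "shutting down" means; the loop itself only guarantees that the failure is reported a single time.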

Comment 7 Red Hat Bugzilla 2005-10-07 16:46:48 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-734.html
