From Bugzilla Helper: User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7) Gecko/20040803 Firefox/0.9.3

Description of problem:
I have found a problem with the clumembd daemon where the heartbeat message is rejected by other nodes, causing the node to be powered off. If you have an Ethernet interface with an alias and are using multicast, the source address may contain either the main IP address or the alias address. If it contains the alias address, the message is rejected by all other nodes because it now carries the wrong IP address. The software correctly binds a socket to the main interface, and at first the correct IP address is sent. Some time later, messages sent on that same socket start carrying the alias address. I have extracted the relevant parts from my log file showing the output of the debugging lines I inserted into the code.

Computer has interfaces:
bond0   addr 10.10.197.11
bond0:0 addr 10.10.197.6

Multicast setup:
clumembd[2]: <debug> add_interface fd:4 name:bond0
clumembd[2]: <debug> Interface IP is 10.10.197.11
clumembd[2]: <debug> Setting up multicast 225.0.0.11 on 10.10.197.11
clumembd[2]: <debug> Multicast send fd:5 (10.10.197.11)
clumembd[2]: <debug> Multicast receive fd:6

Sending and receiving a message (correct behaviour):
clumembd[2]: <debug> sending multicast message fd:5 ,nodeid:1 ,addr:225.0.0.11,token:0x0002881d4119638e
clumembd[2]: <debug> update_seen new msg nodeid:1 token:0x0002881d4119638e

After a while you get (sinp = source address, nsp = expected address):
clumembd[2]: <debug> sending multicast message fd:5 ,nodeid:1 ,addr:225.0.0.11,token:0x0002881d4119638e
clumembd[2]: <debug> update_seen new msg nodeid:1 token:0x0002881d4119638e
clumembd[2]: <debug> IP/NodeID mismatch: Probably another cluster on our subnet... msg from nodeid:1 sinp:10.10.197.6 nsp:10.10.197.11

The source address now carries bond0:0's address where it previously carried bond0's address. The socket has not changed.
Version-Release number of selected component (if applicable): clumanager-1.2.16-1

How reproducible: Always

Steps to Reproduce:
1. Set up a two-node cluster using bonded Ethernet ports
2. Set up a second address on the bonded interface of one node
3. Start clumanager
4. Wait. It fails sooner if you fail over a service with another IP alias between the computers.

Actual Results: The node without the IP alias eventually decides the other node has failed and tries to kill it.

Expected Results: Nothing; the cluster should not fail for no good reason.

Additional info:
I think I have found the fix for this problem. In clumembd.c, add_multicast() sets up the send socket with:

    mreq.imr_multiaddr.s_addr = hb_if->hb_mcast_addr.sin_addr.s_addr;
    mreq.imr_interface.s_addr = hb_if->hb_ip_addr.sin_addr.s_addr;
    if (setsockopt(hb_if->hb_mcast_send, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                   &mreq, sizeof(mreq)) == -1) {

From what I can tell, IP_ADD_MEMBERSHIP binds the socket to the interface named by mreq.imr_interface.s_addr only for receiving messages. If you send on this socket, the kernel is free to choose any interface it likes to actually send the message. The solution is to use IP_MULTICAST_IF to bind the socket to the interface holding the IP address you want to send from. The replacement lines in the code are:

    if (setsockopt(hb_if->hb_mcast_send, IPPROTO_IP, IP_MULTICAST_IF,
                   &hb_if->hb_ip_addr.sin_addr,
                   sizeof(hb_if->hb_ip_addr.sin_addr)) == -1) {

Have sent in the clumembd.c code, as I have added a lot more clulog(LOG_DEBUG, ...) lines. Do what you want with it. Royce
Created attachment 102676 [details] clumemdb.c file
The attachment didn't work, but it seems like a simple fix.
Created attachment 102678 [details] Fix specified in comment #2
Note: Patch untested at this point.
Sorry, the attached file was a zip of the complete clumembd.c file, not a patch file. But yes, the fix is simple and you don't need the extra debug lines.
Will include patch in next errata (assuming it doesn't break anything else =) ).
Created attachment 103169 [details] Patch that actually builds and doesn't cause clumembd to break. Please test this on your configuration.
1.2.18pre1 patch (unsupported; test only, etc.) http://people.redhat.com/lhh/clumanager-1.2.16-1.2.18pre1.patch This includes the fix for this bug and a few others.
Have been running the patched clumembd on my configuration for a few days now and there have been no problems. Will try the 1.2.18pre1 patch. Royce Brown
Have now been running 1.2.18pre1 patch for several weeks and have had no problems.
Thanks for the feedback. When we release a new erratum, we'll be sure to point to it in this bugzilla.
Closing based on verification by the reporter. This will be in RHEL3-U4, clumanager-1.2.22-2.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-491.html
Fixing product name. Clumanager on RHEL3 was part of RHCS3, not RHEL3