Bug 129710

Summary: Heartbeat fails with multiple IP addresses on Ethernet device
Product: [Retired] Red Hat Cluster Suite
Reporter: Royce Brown <rbrown>
Component: clumanager
Assignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA
Severity: high
Priority: medium
Version: 3
CC: cluster-maint
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2004-11-09 17:51:46 UTC
Bug Blocks: 131576    
Attachments:
Description: clumemdb.c file (Flags: none)
Description: Fix specified in comment #2 (Flags: none)
Description: Patch that actually builds and doesn't cause clumembd to break. (Flags: none)

Description Royce Brown 2004-08-11 22:44:47 UTC
From Bugzilla Helper:

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7)
Gecko/20040803 Firefox/0.9.3

Description of problem:
I have found a problem with the clumembd daemon where the heartbeat
message is rejected by other nodes, causing the node to be powered off.

If you have an Ethernet interface with an alias and are using multicast,
the source address may contain either the main IP address or the alias
address. If it contains the alias address, the message is rejected
by all other nodes because it carries an IP address they do not expect.

The software correctly binds a socket to the main interface, and at
first the correct IP address is sent.
Some time later, messages sent on this same socket start carrying the
alias address.

I have extracted the relevant parts of my log file, showing the output
from the debugging lines I inserted into the code.

Computer has  
   Interfaces: bond0   addr 10.10.197.11
               bond0:0 addr 10.10.197.6

         Multicast set up
clumembd[2]: <debug> add_interface fd:4 name:bond0
clumembd[2]: <debug> Interface IP is 10.10.197.11
clumembd[2]: <debug> Setting up multicast 225.0.0.11 on 10.10.197.11
clumembd[2]: <debug> Multicast send fd:5 (10.10.197.11)
clumembd[2]: <debug> Multicast receive fd:6

         Sending and receiving message (correct behaviour)
clumembd[2]: <debug> sending multicast message fd:5 ,nodeid:1
            ,addr:225.0.0.11,token:0x0002881d4119638e
clumembd[2]: <debug> update_seen new msg nodeid:1 token:0x0002881d4119638e
         
After a while you get the following (sinp = source address, nsp = expected address):
  
clumembd[2]: <debug> sending multicast message fd:5 ,nodeid:1
             ,addr:225.0.0.11,token:0x0002881d4119638e
clumembd[2]: <debug> update_seen new msg nodeid:1 token:0x0002881d4119638e
clumembd[2]: <debug> IP/NodeID mismatch: Probably another cluster on
our subnet... msg from nodeid:1 sinp:10.10.197.6 nsp:10.10.197.11


The source address is now bond0:0's address, where earlier it was
bond0's address.
The socket has not changed.
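
For readers following the log, the rejection on the receiving side can be pictured roughly as below. This is only a minimal sketch under assumptions: the function names `source_matches_expected` and `check_heartbeat_source` are hypothetical and are not taken from clumembd; only the `sinp`/`nsp` naming follows the debug output above.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not the actual clumembd code): returns true when the
 * source address of a received heartbeat (sinp) equals the address the
 * receiver has on record for the claimed node (nsp).
 */
static bool source_matches_expected(const struct sockaddr_in *sinp,
                                    const struct sockaddr_in *nsp)
{
    assert(sinp != NULL && nsp != NULL);
    /* Both fields are in network byte order, so a direct compare is enough. */
    return sinp->sin_addr.s_addr == nsp->sin_addr.s_addr;
}

/* Convenience wrapper for dotted-quad strings; returns -1 on a parse error. */
static int check_heartbeat_source(const char *source_ip,
                                  const char *expected_ip)
{
    struct sockaddr_in sinp = { .sin_family = AF_INET };
    struct sockaddr_in nsp = { .sin_family = AF_INET };

    if (inet_pton(AF_INET, source_ip, &sinp.sin_addr) != 1 ||
        inet_pton(AF_INET, expected_ip, &nsp.sin_addr) != 1)
        return -1;
    return source_matches_expected(&sinp, &nsp) ? 1 : 0;
}
```

With the addresses from the log, a heartbeat arriving from 10.10.197.6 for a node expected at 10.10.197.11 fails this comparison, which corresponds to the "IP/NodeID mismatch" message.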


Version-Release number of selected component (if applicable):
clumanager-1.2.16-1

How reproducible:
Always

Steps to Reproduce:
1. Set up a two-node cluster using bonded Ethernet ports.
2. Set up a second address on the bonded interface of one node.
3. Start clumanager.
4. Wait. It fails sooner if you fail over a service with another IP alias
between the computers.
    

Actual Results:  The node without the IP alias eventually thinks the
other node has failed and tries to kill it.

Expected Results:  Nothing; the cluster should not fail for no good reason.

Additional info:

Comment 1 Royce Brown 2004-08-12 21:44:48 UTC
I think I have found the fix for this problem.
In clumembd.c, add_multicast sets up the send socket with:

mreq.imr_multiaddr.s_addr = hb_if->hb_mcast_addr.sin_addr.s_addr;
mreq.imr_interface.s_addr = hb_if->hb_ip_addr.sin_addr.s_addr;
if (setsockopt(hb_if->hb_mcast_send, IPPROTO_IP,IP_ADD_MEMBERSHIP,
           &mreq,sizeof(mreq)) == -1) {

From what I can tell, IP_ADD_MEMBERSHIP binds the socket to the
interface given in mreq.imr_interface.s_addr only for receiving messages.
If you send on this socket, the kernel can choose any interface it
likes to actually send the message.

The solution is to use IP_MULTICAST_IF to bind the socket to the interface
that holds the IP address you want to send from.
The replacement lines in the code are:

if (setsockopt(
   hb_if->hb_mcast_send, IPPROTO_IP, IP_MULTICAST_IF, 
   &hb_if->hb_ip_addr.sin_addr,sizeof(hb_if->hb_ip_addr.sin_addr))
   == -1) {
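
As a standalone illustration of the same idea, here is a minimal sketch, assuming a plain IPv4 UDP socket on Linux. The helper names `open_mcast_send_socket` and `get_mcast_send_addr` are hypothetical and do not appear in clumembd; only the `setsockopt` call with `IP_MULTICAST_IF` mirrors the replacement lines above.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper, not the clumembd source: open a UDP socket whose
 * outgoing multicast traffic is pinned to the interface that owns local_ip.
 * Returns the socket descriptor, or -1 on error.
 */
static int open_mcast_send_socket(const char *local_ip)
{
    struct in_addr ifaddr;
    int fd;

    assert(local_ip != NULL);
    if (inet_pton(AF_INET, local_ip, &ifaddr) != 1)
        return -1;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd == -1) {
        perror("socket");
        return -1;
    }

    /* Select the outgoing interface for multicast sends on this socket. */
    if (setsockopt(fd, IPPROTO_IP, IP_MULTICAST_IF,
                   &ifaddr, sizeof(ifaddr)) == -1) {
        perror("setsockopt(IP_MULTICAST_IF)");
        close(fd);
        return -1;
    }
    return fd;
}

/* Read back the interface address currently selected for multicast sends. */
static int get_mcast_send_addr(int fd, struct in_addr *out)
{
    socklen_t len = sizeof(*out);

    return getsockopt(fd, IPPROTO_IP, IP_MULTICAST_IF, out, &len);
}
```

In the configuration from this report, the address passed in would be the main bond0 address (10.10.197.11) rather than the bond0:0 alias. Reading the option back with `getsockopt` is just a cheap way to confirm which address the socket has recorded.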

I have sent in my copy of clumembd.c, since I have added a lot more clulog
(LOG_DEBUG ...  lines to it. Do what you want with it.

Royce 

Comment 2 Royce Brown 2004-08-12 21:46:15 UTC
Created attachment 102676 [details]
clumemdb.c file

Comment 3 Lon Hohberger 2004-08-12 22:05:08 UTC
The attachment didn't work, but it seems like a simple fix.

Comment 4 Lon Hohberger 2004-08-12 22:09:28 UTC
Created attachment 102678 [details]
Fix specified in comment #2

Comment 5 Lon Hohberger 2004-08-12 22:10:31 UTC
Note: Patch untested at this point.

Comment 6 Royce Brown 2004-08-12 22:47:08 UTC
Sorry, the attached file was a zip file of the complete clumembd.c file;
it was not a patch file. But yes, the fix is simple and you don't need
the extra debug lines.

Comment 7 Lon Hohberger 2004-08-26 15:43:28 UTC
Will include patch in next errata (assuming it doesn't break anything
else =) ).

Comment 8 Lon Hohberger 2004-08-27 16:44:45 UTC
Created attachment 103169 [details]
Patch that actually builds and doesn't cause clumembd to break.

Please test this on your configuration.

Comment 9 Lon Hohberger 2004-09-02 15:57:50 UTC
1.2.18pre1 patch (unsupported; test only, etc.)

http://people.redhat.com/lhh/clumanager-1.2.16-1.2.18pre1.patch

This includes the fix for this bug and a few others.

Comment 10 Royce Brown 2004-09-02 19:57:41 UTC
I have been running the patched clumembd on my configuration for a few days
now and there have been no problems. I will try the 1.2.18pre1 patch.
Royce Brown

Comment 11 Royce Brown 2004-10-10 21:52:54 UTC
Have now been running 1.2.18pre1 patch for several weeks and have had
no problems.

Comment 12 Lon Hohberger 2004-10-11 13:55:54 UTC
Thanks for the feedback.  When we release a new erratum, we'll be sure
to point to it in this bugzilla.


Comment 13 Derek Anderson 2004-11-09 17:51:46 UTC
Closing based on verification by the reporter.  This will be in
RHEL3-U4, clumanager-1.2.22-2.

Comment 14 John Flanagan 2004-12-21 03:40:13 UTC
An advisory has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-491.html


Comment 15 Lon Hohberger 2007-12-21 15:09:55 UTC
Fixing product name.  Clumanager on RHEL3 was part of RHCS3, not RHEL3.