Bug 129710 - Heartbeat fails with multiple IP addresses on Ethernet device
Summary: Heartbeat fails with multiple IP addresses on Ethernet device
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: clumanager   
Version: 3
Hardware: i386 Linux
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact:
Depends On:
Blocks: 131576
Reported: 2004-08-11 22:44 UTC by Royce Brown
Modified: 2009-04-16 20:15 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Last Closed: 2004-11-09 17:51:46 UTC

Attachments
clumembd.c file (14.02 KB, text/plain)
2004-08-12 21:46 UTC, Royce Brown
Fix specified in comment #2 (570 bytes, patch)
2004-08-12 22:09 UTC, Lon Hohberger
Patch that actually builds and doesn't cause clumembd to break. (1.21 KB, patch)
2004-08-27 16:44 UTC, Lon Hohberger

External Trackers
Tracker: Red Hat Product Errata
ID: RHBA-2004:491
Priority: high
Status: SHIPPED_LIVE
Summary: Updated clumanager and redhat-config-cluster packages
Last Updated: 2004-12-20 05:00:00 UTC

Description Royce Brown 2004-08-11 22:44:47 UTC
From Bugzilla Helper:

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7)
Gecko/20040803 Firefox/0.9.3

Description of problem:
I have found a problem with the clumembd daemon where a node's heartbeat
message is rejected by the other nodes, causing the node to be powered off.

If you have an Ethernet interface with an alias and are using multicast,
the source address of an outgoing packet may be either the main IP address
or the alias address. If it is the alias address, the message is rejected
by all other nodes because it carries the wrong IP address.

The software correctly binds a socket to the main interface, and at
first the correct IP address is sent.
Some time later, messages sent on the same socket start carrying the
alias address.

I have extracted the relevant parts of my log file, showing the output
of the debugging lines I inserted into the code.

Computer has  
   Interfaces: bond0   addr
               bond0:0 addr

         Multicast set up
clumembd[2]: <debug> add_interface fd:4 name:bond0
clumembd[2]: <debug> Interface IP is
clumembd[2]: <debug> Setting up multicast on
clumembd[2]: <debug> Multicast send fd:5 (
clumembd[2]: <debug> Multicast receive fd:6

	   Sending and receiving message (Correct behaviour)
clumembd[2]: <debug> sending multicast message fd:5 ,nodeid:1
clumembd[2]: <debug> update_seen new msg nodeid:1 token:0x0002881d4119638e
After a while you get the following (sinp = source address, nsp = expected address):
clumembd[2]: <debug> sending multicast message fd:5 ,nodeid:1
clumembd[2]: <debug> update_seen new msg nodeid:1 token:0x0002881d4119638e
clumembd[2]: <debug> IP/NodeID mismatch: Probably another cluster on
our subnet... msg from nodeid:1 sinp: nsp:

The source address is now the bond0:0 address, where it previously was bond0's.
The socket has not changed.
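
For illustration, the receiver-side check behind the "IP/NodeID mismatch" line can be sketched roughly as below. This is not the clumembd source: make_addr and source_matches_node are hypothetical helpers, and the addresses mentioned in the note after the code are made-up documentation addresses standing in for bond0 and bond0:0. The sketch only shows why a heartbeat sourced from the alias address gets rejected.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical helper: build an IPv4 sockaddr_in from dotted-quad text.
 * Returns 0 on success, -1 if the text is not a valid IPv4 address. */
int make_addr(const char *text, struct sockaddr_in *out)
{
    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    return inet_pton(AF_INET, text, &out->sin_addr) == 1 ? 0 : -1;
}

/* Hypothetical receiver-side check (not taken from clumembd).
 * sinp: source address of the received packet, e.g. from recvfrom().
 * nsp:  address configured for the node ID carried in the message.
 * Returns 1 when they match, 0 on an IP/NodeID mismatch. */
int source_matches_node(const struct sockaddr_in *sinp,
                        const struct sockaddr_in *nsp)
{
    return sinp->sin_addr.s_addr == nsp->sin_addr.s_addr;
}
```

Assuming bond0 is 192.0.2.10 and bond0:0 is 192.0.2.11, a packet sourced from 192.0.2.11 fails this comparison against the expected 192.0.2.10 even though it came from the right machine.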

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Set up a two-node cluster using bonded Ethernet ports.
2. Set up a second address on the bonded interface of one node.
3. Start clumanager.
4. Wait. It fails sooner if you fail over a service with another IP alias
between the computers.

Actual Results:  The node without the IP alias eventually thinks the
other node has failed and tries to kill it.

Expected Results:  Nothing; the cluster should not fail for no good reason.

Additional info:

Comment 1 Royce Brown 2004-08-12 21:44:48 UTC
I think I have found the fix for this problem.
In clumembd.c, add_multicast sets up the send socket with:

mreq.imr_multiaddr.s_addr = hb_if->hb_mcast_addr.sin_addr.s_addr;
mreq.imr_interface.s_addr = hb_if->hb_ip_addr.sin_addr.s_addr;
if (setsockopt(hb_if->hb_mcast_send, IPPROTO_IP,IP_ADD_MEMBERSHIP,
           &mreq,sizeof(mreq)) == -1) {

From what I can tell, IP_ADD_MEMBERSHIP binds the socket to the
interface given in mreq.imr_interface.s_addr only for receiving messages.
If you send on this socket, the kernel can choose any interface it
likes to actually send the message.
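
A minimal sketch of that observation, assuming Linux socket semantics: joining a group with IP_ADD_MEMBERSHIP should leave the socket's outgoing multicast interface (readable through IP_MULTICAST_IF) unset. join_group and get_send_interface are hypothetical helper names, not functions from clumembd.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* glibc: exposes struct ip_mreq under strict -std= modes */
#endif
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: join `group` on the interface that owns `ifaddr`.
 * This is a receive-side membership only. Returns 0 on success, -1 on error. */
int join_group(int fd, struct in_addr group, struct in_addr ifaddr)
{
    struct ip_mreq mreq;

    memset(&mreq, 0, sizeof(mreq));
    mreq.imr_multiaddr = group;
    mreq.imr_interface = ifaddr;
    return setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
}

/* Hypothetical helper: read the address of the socket's outgoing multicast
 * interface into `out`. Returns 0 on success, -1 on error. */
int get_send_interface(int fd, struct in_addr *out)
{
    socklen_t len = sizeof(*out);

    memset(out, 0, sizeof(*out));
    return getsockopt(fd, IPPROTO_IP, IP_MULTICAST_IF, out, &len);
}
```

On a fresh UDP socket, get_send_interface is expected to report 0.0.0.0 both before and after a join_group call, which is consistent with the kernel being free to pick the outgoing interface.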

The solution is to use IP_MULTICAST_IF to bind the socket to the interface
containing the IP address you want to send from.
The replacement lines in the code are:

if (setsockopt(
   hb_if->hb_mcast_send, IPPROTO_IP, IP_MULTICAST_IF, 
   == -1) {

I have sent in my clumembd.c, as I have added a lot more
clulog(LOG_DEBUG ...) lines. Do what you want with it.
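
A sketch of the approach described above, not the patch attached to this bug. The helper takes the interface address directly; in clumembd that would presumably be hb_if->hb_ip_addr.sin_addr, which is an assumption here because the replacement lines as posted are incomplete. It also reads the option back to confirm it took effect. set_send_interface and send_interface_is are hypothetical names.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: pin outgoing multicast traffic on `fd` to the
 * interface that owns `ifaddr`. Returns 0 on success, -1 on error. */
int set_send_interface(int fd, struct in_addr ifaddr)
{
    return setsockopt(fd, IPPROTO_IP, IP_MULTICAST_IF,
                      &ifaddr, sizeof(ifaddr));
}

/* Hypothetical helper: check the current outgoing multicast interface.
 * Returns 1 if it equals `expected`, 0 if it differs, -1 on error. */
int send_interface_is(int fd, struct in_addr expected)
{
    struct in_addr cur;
    socklen_t len = sizeof(cur);

    memset(&cur, 0, sizeof(cur));
    if (getsockopt(fd, IPPROTO_IP, IP_MULTICAST_IF, &cur, &len) == -1)
        return -1;
    return cur.s_addr == expected.s_addr;
}
```

Calling set_send_interface once on the send socket, right after it is created and before the first heartbeat goes out, should be enough; send_interface_is could back a debug log line that confirms the setting.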


Comment 2 Royce Brown 2004-08-12 21:46:15 UTC
Created attachment 102676
clumembd.c file

Comment 3 Lon Hohberger 2004-08-12 22:05:08 UTC
The attachment didn't work, but it seems like a simple fix.

Comment 4 Lon Hohberger 2004-08-12 22:09:28 UTC
Created attachment 102678
Fix specified in comment #2

Comment 5 Lon Hohberger 2004-08-12 22:10:31 UTC
Note: Patch untested at this point.

Comment 6 Royce Brown 2004-08-12 22:47:08 UTC
Sorry, the attached file was a zip file of the complete clumembd.c file; it
was not a patch file. But yes, the fix is simple and you don't need
the extra debug lines.

Comment 7 Lon Hohberger 2004-08-26 15:43:28 UTC
Will include patch in next errata (assuming it doesn't break anything
else =) ).

Comment 8 Lon Hohberger 2004-08-27 16:44:45 UTC
Created attachment 103169
Patch that actually builds and doesn't cause clumembd to break.

Please test this on your configuration.

Comment 9 Lon Hohberger 2004-09-02 15:57:50 UTC
1.2.18pre1 patch (unsupported; test only, etc.)


This includes the fix for this bug and a few others.

Comment 10 Royce Brown 2004-09-02 19:57:41 UTC
I have been running the patched clumembd on my configuration for a few days now
and there have been no problems. I will try the 1.2.18pre1 patch.
Royce Brown

Comment 11 Royce Brown 2004-10-10 21:52:54 UTC
Have now been running 1.2.18pre1 patch for several weeks and have had
no problems.

Comment 12 Lon Hohberger 2004-10-11 13:55:54 UTC
Thanks for the feedback.  When we release a new erratum, we'll be sure
to point to it in this bugzilla.

Comment 13 Derek Anderson 2004-11-09 17:51:46 UTC
Closing based on verification by the reporter.  This will be in
RHEL3-U4, clumanager-1.2.22-2.

Comment 14 John Flanagan 2004-12-21 03:40:13 UTC
An advisory has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.


Comment 15 Lon Hohberger 2007-12-21 15:09:55 UTC
Fixing product name.  Clumanager on RHEL3 was part of RHCS3, not RHEL3.
