Red Hat Bugzilla – Bug 16276
Heartbeat problems on sparc/rh6.2 with piranha-0.4.16-3
Last modified: 2007-04-18 12:28:13 EDT
After much fiddling I have been unable to get piranha to work on two Sparc
systems running Red Hat 6.2 and the latest piranha rpms. The symptom is
that pulse is issuing the following error:
pulse: recvfrom() failed: Invalid argument
as soon as the backup node starts sending heartbeats.
I am using a config file generated with the piranha gui with no
This is a simple two-node failover cluster. The primary is called "frodo"
and is at 192.168.1.3; the secondary is called "bilbo" and is at
192.168.1.4. The failover service is a web server running on
| 192.168.1.3 | | 192.168.1.4 |
| +-----+ |
+---------| hub |---------+
The config file and the output of "pulse -v -n" from both the primary and
secondary system are attached to this report.
Created attachment 2518 [details]
lvs.cf used on failover cluster
Created attachment 2519 [details]
Output of "pulse -v -n" on the primary server
Created attachment 2520 [details]
output of "pulse -v -n" on the secondary (backup) server
Note that because of the heartbeat failure, the second node will activate the
web server and floating ip address because it believes the primary is dead.
This leads to confusion and mayhem.
[Just to bring this entry up-to-date with the exchange on the discussion HA
Lars Kellogg-Stedman wrote:
> I'm not sure how much piranha has been tested on sparc systems;
Actually, this could be a problem. It did not register in me that
you are using sparcs. I HIGHLY suspect that your problem is
sparc-only. We do not see it at all on intel or alpha. This is
probably why you are the only piranha heartbeat problem report I
have ever seen.
Red Hat doesn't support piranha on sparc, except perhaps as a
real-server in an LVS setup. The HA product is intel-only, and I
have no sparc systems in house to test this on. I suspect that
I'm going to find it very difficult to address this problem
(especially since it is unlikely I can get the equipment or be
given the OK to spend official time to debug a non-supported
platform). I'm also concerned about how many other sparc-related
bugs might be lurking in the mud once we dig. All data exchange
should already be occurring in network byte order.
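
For readers unfamiliar with the byte-order point above, here is a minimal sketch of what "network byte order" means for a 32-bit heartbeat value. The marker value and the helper names put_magic() and get_magic() are made up for illustration and are not taken from the pulse source; only htonl() and ntohl() are real library calls.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical heartbeat marker; NOT the value pulse actually sends. */
#define EXAMPLE_MAGIC 0xDEADBEEFu

/* Store a 32-bit value in a buffer in network (big-endian) byte order. */
static void put_magic(unsigned char buf[4], uint32_t value)
{
    uint32_t wire = htonl(value);   /* host order -> network order */

    assert(buf != NULL);
    memcpy(buf, &wire, sizeof(wire));
}

/* Read the value back, converting from network order to host order. */
static uint32_t get_magic(const unsigned char buf[4])
{
    uint32_t wire;

    assert(buf != NULL);
    memcpy(&wire, buf, sizeof(wire));
    return ntohl(wire);             /* network order -> host order */
}
```

With helpers like these, the bytes on the wire are the same whether the sender is a big-endian sparc or a little-endian intel box, so byte order alone should not explain a failure that shows up on only one platform.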
> if for any reason you need access to my test systems that can be arranged
> (caveat: ssh only, unreliable cable-modem connection, limited disk space,
> and slow, slow, slow).
I am willing to help as much as I can. I dare say though that a fair amount
of the effort may be in your hands, as this falls into the "find the
time where I can" category, and there's no guarantee that piranha's
next release won't be sparc-broken again.
No matter how you look at this, it will take time.
When Bryce returns from LWE, I'll talk to him and see what we can do.
Is there any possibility that you could debug the situation and simply
forward the patches here for review and possible inclusion?
OK. If you perform an "strace -o" to an output file while the pulse program runs
and fails, and send me the resulting capture, I have been told that we have a
sparc guy here that can decipher that in moments and tell us what's wrong.
Can you provide this?
Created attachment 2533 [details]
Output of "strace -o /tmp/trace pulse -v -n" on primary system
I've attached the strace output. The first recvfrom() yielding an "invalid
argument" error is on line 154.
OK, I'll forward it on.
The invalid argument usually means that a receive was attempted specifying
an ip address or port that is not available for use. It can happen when the
environment outside of piranha is in error. Here are some things
to try while we wait for a response to the strace...
1. Use a different heartbeat port than 1050. Try something below 1024.
2. Make SURE SURE that the config files are the same on both systems. I have
seen this message when the files had mismatched ip address or devices or
ports (can't remember which). For example, make sure the primary and
backup ip addresses are the same in both config files rather than reversed.
3. Make sure the VIP and IP addresses are not already in use. Try turning off
the systems and making sure that something somewhere doesn't answer a
4. Make sure the network is correct. Make sure that name translations for
the ip addresses are correct (use localhost files perhaps instead of dns).
Make sure that the ip addresses you show match the addresses on eth0: for
> 1. Use a different heartbeat port than 1050. Try something below
Tried this. No change.
> 2. Make SURE SURE that the config files are the same on both systems.
> seen this message when the files had mismatched ip address or
> ports (can't remember which). For example, make sure the primary
> backup ip addresses are the same in both config files rather than
These two systems are kept in sync with hourly runs of rsync. No hand editing,
at all, is performed on the secondary system. The config files are
byte-for-byte identical, up to and including their MD5 checksums.
> 3. Make sure the VIP and IP addresses are not already in use. Try
> the systems and making sure that something somewhere doesn't
There are only five other systems on this network with static addresses. .1 is
the router, .2 is the wireless station, and all the others are above .20. The
remaining systems are configured with DHCP, which hands out addresses starting
at .50. There's just nothing else out there! I have verified that with the
systems off, none of the ip addresses answer to pings.
> 4. Make sure the network is correct. Make sure that name translations
> the ip addresses are correct (use localhost files perhaps instead
> Make sure that the ip addresses you show match the addresses on
Everything is configured correctly. The systems have had static hosts files
since the outset, and since you suggested taking the hostnames out of the config
file and using straight IP addresses this should be a non-issue. I can telnet
and ssh to and from the systems by name or ip from each other or from other
systems on the network. I can retrieve remote URLs on either system.
Just for kicks, here's ifconfig output:
On frodo, the primary, 192.168.1.3:
eth0 Link encap:Ethernet HWaddr 08:00:20:12:B0:DE
inet addr:192.168.1.3 Bcast:192.168.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
On bilbo, the secondary, 192.168.1.4:
eth0 Link encap:Ethernet HWaddr 08:00:20:12:AE:B4
inet addr:192.168.1.4 Bcast:192.168.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
And here's /etc/hosts:
127.0.0.1 localhost.localdomain localhost
192.168.1.1 gateway.house.larsshack.org gateway
192.168.1.3 frodo.house.larsshack.org frodo
192.168.1.4 bilbo.house.larsshack.org bilbo
192.168.1.5 hobbits.house.larsshack.org hobbits
The only other physical interface on either system is the loopback interface.
There are no interface aliases other than the one configured by piranha.
I've spotted a possible problem in the pulse code, in the run() function.
Here's the call to recvfrom():
rc = recvfrom(sock, &magic, sizeof(magic), 0,
(struct sockaddr *)&sender, &size);
And here's a portion of the recvfrom(2) man page:
The argument fromlen is a value-result parameter, initial-
ized to the size of the buffer associated with from, and
modified on return to indicate the actual size of the
address stored there.
The variable "size" is never initialized, and this appears to be causing the
problem. Adding the following line immediately prior to the recvfrom() call
makes things work just fine:
size = sizeof(struct sockaddr);
With this code in place, the error goes away, and pulse works as expected.
This doesn't appear to be an architecture-specific bug. It appears to be the
sort of situation in which you're just lucky it was working anywhere :).
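
For anyone who hits the same error, here is a small self-contained sketch of the corrected pattern. It is not the pulse source: the function names recv_heartbeat() and loopback_selftest() are made up for illustration, and it assumes the sender address is an IPv4 struct sockaddr_in. The point is only that the length argument is set to the size of the address buffer before every recvfrom() call.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Receive one 32-bit heartbeat value; returns whatever recvfrom() returns. */
static ssize_t recv_heartbeat(int sock, uint32_t *magic,
                              struct sockaddr_in *sender)
{
    /* Value-result argument: must hold the buffer size before EVERY call. */
    socklen_t size = sizeof(*sender);

    assert(magic != NULL && sender != NULL);
    return recvfrom(sock, magic, sizeof(*magic), 0,
                    (struct sockaddr *)sender, &size);
}

/*
 * Hypothetical self-test: bind a UDP socket to an ephemeral loopback port,
 * send one datagram to itself, and read it back with recv_heartbeat().
 * Returns 0 on success, -1 on any failure.
 */
static int loopback_selftest(void)
{
    struct sockaddr_in addr, sender;
    socklen_t len = sizeof(addr);
    uint32_t out = htonl(0x12345678u);  /* arbitrary test value */
    uint32_t in = 0;
    int ok = -1;
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    if (sock < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                  /* let the kernel pick a port */

    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
        getsockname(sock, (struct sockaddr *)&addr, &len) == 0 &&
        sendto(sock, &out, sizeof(out), 0,
               (struct sockaddr *)&addr, sizeof(addr)) == (ssize_t)sizeof(out) &&
        recv_heartbeat(sock, &in, &sender) == (ssize_t)sizeof(in) &&
        in == out && sender.sin_port == addr.sin_port)
        ok = 0;

    close(sock);
    return ok;
}
```

If "size" is left uninitialized, it holds whatever happened to be on the stack. That would explain why the same code can appear to work on one platform and fail with "Invalid argument" on another.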
Created attachment 2584 [details]
Patch for "size" initialization
I totally agree with you, it was pretty unattractive. As you say, how did this
ever work originally?
Thanks for finding it! Your patch, and credit, are in the code now. I'll post
updated RPMs soon.
Note: (unsupported) sparc RPMs have been posted to