Bug 16276 - Heartbeat problems on sparc/rh6.2 with piranha-0.4.16-3
Status: CLOSED ERRATA
Product: Red Hat High Availability Server
Classification: Retired
Component: piranha
Version: 1.0
Hardware: sparc Linux
Priority: medium, Severity: medium
Assigned To: Phil Copeland
Reported: 2000-08-15 14:19 EDT by lars
Modified: 2007-04-18 12:28 EDT (History)
Doc Type: Bug Fix
Last Closed: 2000-08-16 22:41:25 EDT


Attachments (Terms of Use)
lvs.cf used on failover cluster (433 bytes, text/plain)
2000-08-15 14:20 EDT, lars
Output of "pulse -v -n" on the primary server (3.82 KB, text/plain)
2000-08-15 14:21 EDT, lars
output of "pulse -v -n" on the secondary (backup) server (1.43 KB, text/plain)
2000-08-15 14:22 EDT, lars
Output of "strace -o /tmp/trace pulse -v -n" on primary system (24.36 KB, text/plain)
2000-08-15 22:22 EDT, lars
Patch for "size" initialization (526 bytes, patch)
2000-08-16 22:41 EDT, lars

Description Red Hat Bugzilla 2000-08-15 14:19:45 EDT
After much fiddling I have been unable to get piranha to work on two Sparc
systems running Red Hat 6.2 and the latest piranha rpms.  The symptom is
that pulse is issuing the following error:

  pulse: recvfrom() failed: Invalid argument

This happens as soon as the backup node starts sending heartbeats.
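
For reference, "Invalid argument" is the standard text for errno EINVAL. The sketch below assumes that pulse builds its message from errno with strerror(), which this report does not confirm; format_failure() is a hypothetical helper, not a function from the piranha source.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from piranha): formats a failed call in the same
 * shape as the message quoted above, using strerror() for the errno text. */
int format_failure(char *buf, size_t len, const char *call, int err)
{
    return snprintf(buf, len, "pulse: %s failed: %s", call, strerror(err));
}
```

Passing "recvfrom()" and EINVAL to this helper should reproduce the line above, which suggests the kernel rejected one of the arguments to recvfrom() rather than the network dropping the packet.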

I am using a config file generated with the piranha gui with no
hand-editing.

This is a simple two-node failover cluster.  The primary is called "frodo"
and is at 192.168.1.3; the secondary is called "bilbo" and is at
192.168.1.4.  The failover service is a web server running on
192.168.1.5:80:

  +-------------+           +-------------+
  | 192.168.1.3 |           | 192.168.1.4 |
  +-------------+           +-------------+
         |         +-----+         |
         +---------| hub |---------+
                   +-----+

The config file and the output of "pulse -v -n" from both the primary and
secondary system are attached to this report.
Comment 1 Red Hat Bugzilla 2000-08-15 14:20:15 EDT
Created attachment 2518 [details]
lvs.cf used on failover cluster
Comment 2 Red Hat Bugzilla 2000-08-15 14:21:28 EDT
Created attachment 2519 [details]
Output of "pulse -v -n" on the primary server
Comment 3 Red Hat Bugzilla 2000-08-15 14:22:02 EDT
Created attachment 2520 [details]
output of "pulse -v -n" on the secondary (backup) server
Comment 4 Red Hat Bugzilla 2000-08-15 14:22:56 EDT
Note that because of the heartbeat failure, the second node will activate the
web server and floating ip address because it believes the primary is dead. 
This leads to confusion and mayhem.
Comment 5 Red Hat Bugzilla 2000-08-15 15:16:04 EDT
[Just to bring this entry up to date with the exchange on the HA discussion
mailing list ...]


Lars Kellogg-Stedman wrote:
> 
> I'm not sure how much piranha has been tested on sparc systems; 

None.

Actually, this could be a problem. It did not register with me that
you are using sparcs. I HIGHLY suspect that your problem is
sparc-only. We do not see it at all on intel or alpha. This is
probably why you are the only piranha heartbeat problem report I
have ever seen.

Red Hat doesn't support piranha on sparc, except perhaps as a
real-server in an LVS setup. The HA product is intel-only, and I
have no sparc systems in house to test this on. I suspect that
I'm going to find it very difficult to address this problem
(especially since it is unlikely I can get the equipment or be
given the OK to spend official time to debug a non-supported
platform).  I'm also concerned about how many other sparc-related
bugs might be lurking in the mud once we dig. All data exchange
should already be occurring in network byte order.

.
.
.


> if for any reason you need access to my test systems that can be arranged
> (caveat: ssh only, unreliable cable-modem connection, limited disk space,
> and slow, slow, slow).

I am willing to help as much as I can. I dare say though that a fair amount
of the effort may be in your hands, as this falls into the "find the
time where I can" category, and there's no guarantee that piranha's
next release won't be sparc-broken again.

----

No matter how you look at this, it will take time.
When Bryce returns from LWE, I'll talk to him and see what we can do.
Is there any possibility that you could debug the situation and simply
forward the patches here for review and possible inclusion?


Comment 6 Red Hat Bugzilla 2000-08-15 20:45:51 EDT
OK. If you perform an "strace -o" to an output file while the pulse program runs
and fails, and send me the resulting capture, I have been told that we have a
sparc guy here that can decipher that in moments and tell us what's wrong.

Can you provide this?

Comment 7 Red Hat Bugzilla 2000-08-15 22:22:27 EDT
Created attachment 2533 [details]
Output of "strace -o /tmp/trace pulse -v -n" on primary system
Comment 8 Red Hat Bugzilla 2000-08-15 22:26:21 EDT
I've attached the strace output.  The first recvfrom() yielding an "invalid
argument" error is on line 154.
Comment 9 Red Hat Bugzilla 2000-08-16 17:59:08 EDT
OK, I'll forward it on.

The invalid argument usually means that a receive was attempted specifying
an ip address or port that is not available for use. It can happen when the
environment outside of piranha is in error. Here are some things
to try while we wait for a response to the strace...

1. Use a different heartbeat port than 1050. Try something below 1024.

2. Make SURE SURE that the config files are the same on both systems. I have
   seen this message when the files had mismatched ip address or devices or
   ports (can't remember which). For example, make sure the primary and
   backup ip addresses are the same in both config files rather than reversed.

3. Make sure the VIP and IP addresses are not already in use. Try turning off
   the systems and making sure that something somewhere doesn't answer a
   ping.

4. Make sure the network is correct. Make sure that name translations for
   the ip addresses are correct (use localhost files perhaps instead of dns).
   Make sure that the ip addresses you show match the addresses on eth0: for
   those systems.

Comment 10 Red Hat Bugzilla 2000-08-16 19:51:55 EDT
           1. Use a different heartbeat port than 1050. Try something below
              1024.

Tried this.  No change.

           2. Make SURE SURE that the config files are the same on both
              systems. I have seen this message when the files had mismatched
              ip address or devices or ports (can't remember which). For
              example, make sure the primary and backup ip addresses are the
              same in both config files rather than reversed.

These two systems are kept in sync with hourly runs of rsync.  No hand editing,
at all, is performed on the secondary system.  The config files are
byte-for-byte identical, up to and including their MD5 checksums.

           3. Make sure the VIP and IP addresses are not already in use. Try
              turning off the systems and making sure that something somewhere
              doesn't answer a ping.

There are only five other systems on this network with static addresses.  .1 is
the router, .2 is the wireless station, and all the others are above .20.  The
remaining systems are configured with DHCP, which hands out addresses starting
at .50.  There's just nothing else out there!   I have verified that with the
systems off, none of the ip addresses answer to pings.

           4. Make sure the network is correct. Make sure that name
              translations for the ip addresses are correct (use localhost
              files perhaps instead of dns). Make sure that the ip addresses
              you show match the addresses on eth0: for those systems.

Everything is configured correctly.  The systems have had static hosts files
since the outset, and since you suggested taking the hostnames out of the config
file and using straight IP addresses this should be a non-issue.  I can telnet
and ssh to and from the systems by name or ip from each other or from other
systems on the network.  I can retrieve remote URLs on either system.

Just for kicks, here's ifconfig output:

On frodo, the primary, 192.168.1.3:

eth0      Link encap:Ethernet  HWaddr 08:00:20:12:B0:DE  
          inet addr:192.168.1.3  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1


On bilbo, the secondary, 192.168.1.4:

eth0      Link encap:Ethernet  HWaddr 08:00:20:12:AE:B4  
          inet addr:192.168.1.4  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

And here's /etc/hosts:

127.0.0.1               localhost.localdomain localhost
192.168.1.1             gateway.house.larsshack.org gateway
192.168.1.3             frodo.house.larsshack.org frodo
192.168.1.4             bilbo.house.larsshack.org bilbo
192.168.1.5             hobbits.house.larsshack.org hobbits

The only other physical interface on either system is the loopback interface. 
There are no interface aliases other than the one configured by piranha.
Comment 11 Red Hat Bugzilla 2000-08-16 22:31:16 EDT
I've spotted a possible problem in the pulse code, in the run() function. 
Here's the call to recvfrom():

                    rc = recvfrom(sock, &magic, sizeof(magic), 0,
                    (struct sockaddr *)&sender, &size);

And here's a portion of the recvfrom(2) man page:

       The argument fromlen is a value-result parameter, initialized to the
       size of the buffer associated with from, and modified on return to
       indicate the actual size of the address stored there.

The variable "size" is never initialized, and this appears to be causing the
problem.  Adding the following line immediately prior to the recvfrom() call
makes things work just fine:

  size = sizeof(struct sockaddr);

With this code in place, the error goes away, and pulse works as expected.
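
A minimal, self-contained sketch of the pattern described above, assuming a UDP socket bound to the loopback interface. The helper heartbeat_roundtrip(), the test value, and the two-second timeout are all hypothetical and are not taken from the pulse source; the only part that mirrors the fix is setting "size" before the recvfrom() call.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not taken from pulse: sends one 32-bit "magic" value
 * to itself over UDP loopback and reads it back with recvfrom().
 * Returns the number of bytes received, or -1 on failure. On success,
 * *size_out holds the sender address length reported by the kernel. */
ssize_t heartbeat_roundtrip(socklen_t *size_out)
{
    struct sockaddr_in addr, sender;
    socklen_t addr_len = sizeof(addr);
    socklen_t size;
    struct timeval timeout = { 2, 0 };         /* do not block forever */
    const uint32_t sent = htonl(0x12345678u);  /* arbitrary test value */
    uint32_t magic = 0;
    ssize_t rc;
    int sock;

    assert(size_out != NULL);

    sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                         /* kernel picks a free port */

    if (setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO,
                   &timeout, sizeof(timeout)) != 0 ||
        bind(sock, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        getsockname(sock, (struct sockaddr *)&addr, &addr_len) != 0 ||
        sendto(sock, &sent, sizeof(sent), 0,
               (struct sockaddr *)&addr, sizeof(addr)) != (ssize_t)sizeof(sent)) {
        close(sock);
        return -1;
    }

    /* The fix: fromlen is value-result, so it must hold the size of the
     * address buffer before the call. The kernel overwrites it on return. */
    size = sizeof(sender);
    rc = recvfrom(sock, &magic, sizeof(magic), 0,
                  (struct sockaddr *)&sender, &size);
    close(sock);

    if (rc != (ssize_t)sizeof(magic) || magic != sent)
        return -1;

    *size_out = size;
    return rc;
}
```

On a host with a working loopback interface, the call should return sizeof(uint32_t) and leave sizeof(struct sockaddr_in) in *size_out. If "size" is left uninitialized instead, the kernel sees whatever value happens to be on the stack, so the call may succeed on one platform and be rejected on another.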

This doesn't appear to be an architecture-specific bug.  It appears to be the
sort of situation in which you're just lucky it was working anywhere :).

-- Lars
Comment 12 Red Hat Bugzilla 2000-08-16 22:41:24 EDT
Created attachment 2584 [details]
Patch for "size" initialization
Comment 13 Red Hat Bugzilla 2000-08-17 14:49:26 EDT
I totally agree with you, it was pretty unattractive. As you say, how did this
ever work originally?

Thanks for finding it! Your patch, and credit, are in the code now. I'll post
updated RPMs soon.



Comment 14 Red Hat Bugzilla 2000-09-13 20:44:52 EDT
Note: (unsupported) sparc RPMs have been posted to
ftp://people.redhat.com/kbarrett/HA/experimental/
