Bug 83454
Summary: | Backup lvs box messages log shows partner dead and starts lvs, then stops it again inside the same second of time. | |
---|---|---|---
Product: | [Retired] Red Hat High Availability Server | Reporter: | Peter Baitz <baitzph>
Component: | piranha | Assignee: | Mike McLean <mikem>
Status: | CLOSED NOTABUG | QA Contact: | Brock Organ <borgan>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 1.0 | |
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | i686 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2003-03-07 17:14:45 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Peter Baitz
2003-02-04 16:39:55 UTC
This original bugzilla post was made for our "email cluster". However, this Sunday, Feb 9, before 8 AM, we saw identical behavior from our "LDAP cluster", which is running Red Hat 7.3 plus all errata updates and the same kernel and piranha rpms, etc. as the original post. The only difference is that one cluster is routing sendmail/webmail and the other ldap/ldaps. This AM we saw the backup piranha/lvs box think its partner was dead and activate lvs, then either immediately or within a couple of minutes deactivate lvs. PROBLEM IS, WHILE LVS DAEMON AND NANNY DAEMONS DO GET DEACTIVATED AND NO LONGER RUNNING, THE VIRTUAL ETHERNETS STAY UP AND CONFUSE THE PARTNER. Stopping the NAT and LVS virtual devices on the backup Piranha/lvs box solves the issue for the moment. Also, our LDAP cluster boxes were both up over 50 days, so we rebooted them this AM. Rebooting and refreshing the boxes helped our email cluster stop having this problem so far this week, and I assume for another 50 days.

~~~~~~~~~~~~~~messages on Piranha4 which was running as backup~~~~~~
/var/log/messages:Feb 9 05:22:15 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb 9 05:23:03 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb 9 05:23:51 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb 9 05:23:51 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb 9 05:25:51 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb 9 05:26:26 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb 9 05:26:56 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb 9 05:27:40 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb 9 05:28:46 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb 9 05:29:20 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb 9 06:34:14 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb 9 06:34:14 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb 9 07:58:21 piranha4 pulse[1415]: partner dead: activating lvs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Red Hat, please review this issue, which has occurred on two Piranha clusters, and let me know what the fix is. As a side note, I noted that under Red Hat 7.3 (initscripts-6.67-1) I cannot do an "ifdown eth0:1" or "ifdown eth1:1" to force the LVS and NAT virtual devices down, like I used to under Red Hat 7.2 (initscripts-6.43-1). Can you guys please fix this? I wind up doing a network restart instead for now. The reason I mention this is the "partner dead / partner alive" issue reported above, under which I wind up having to force the virtual devices to go away on the piranha backup box.

As for the initscripts problem, please file a separate bug against initscripts. Looking into it, but please be aware that we did not ship piranha with 7.3.

Addendum to our Piranha "LDAP cluster" issue posted "2003-02-09 09:11". We found out our network engineers did some work this weekend which coincided with the Piranha backup box attempting to start up lvs and virtual ethernets. The backup box did subsequently see the partner alive again and stopped lvs (and nannies), but the problem is the VIRTUAL ETHERNETS (like LVS eth0:1 and NAT eth1:1, etc.) DID NOT GET STOPPED. As such, routing was confused between the two lvs boxes.
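A minimal sketch of how one might clear such stale alias devices by hand, assuming the eth0:1 (LVS VIP) and eth1:1 (NAT) names used in the report; the device names are site-specific and this snippet is not taken from the original thread:

~~~
# Show any alias devices still configured on the backup director
# (old net-tools ifconfig lists aliases such as eth0:1 on their own entries).
/sbin/ifconfig | grep -A1 '^eth[0-9]:'

# Bring the stale aliases down directly instead of restarting the whole
# network service (the report notes "ifdown eth0:1" fails with initscripts-6.67-1).
/sbin/ifconfig eth0:1 down
/sbin/ifconfig eth1:1 down
~~~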
And to point out the same issue on the Piranha "email cluster": we believe it might have been an erroneous 100/half-duplex setting on our switch, which was subsequently corrected. So we have reasons for the failovers, but no good reason why, when lvs was stopped on the backup, the virtual ethernets were not properly stopped. Anyway, we rebooted all piranha boxes (all of which were up over 50 days at the times of these issues) to make sure uptimes were minimized, in case that caused the virtual ethernets to stay up on the backup. Thus far, all is well.

I realize 7.3 does not come with Piranha in the distribution, but the kernels used with 7.3 and 7.2 have the ipvs module compiled in, so it should work fine, and it has for 6 months, with the exceptions of what I've reported above. Some folks on the lvs-list indicated they think it is a R.H. kernel issue. I don't know if I agree. What do you say?

I tested our newest RH7.3 + Piranha/LVS system (which consists of two Piranha boxes with pulse) for failover and recovery behavior. (This is NOT our email or LDAP cluster; it is totally new.)

BOX GOES DOWN BEHAVIOR: One of the Piranha boxes goes down, or pulse is killed on it. The Piranha box still up and running takes over. When the downed box comes back up it just runs pulse, allowing the already-live box to continue services.

NETWORK DROPS BEHAVIOR: The public network connection is broken from one of the Piranha boxes. The box that was running only pulse activates its lvs and virtual ethernets. Then the network is restored. Now the two boxes negotiate and find out which is the actual backup, and the actual backup relinquishes control by stopping lvs and stopping the virtual ethernets (eth0:1, eth1:1, etc.), and just runs pulse.

Comment related to this bug report: Thus I can only think that something about the longer uptime (50+ days) contributes to the virtual ethernets erroneously staying up even though pulse stops lvs on one box.

Ok, I believe you can close this report. I discovered that a couple of old Virtual Ethernet device scripts, ifcfg-eth0:1-test and ifcfg-eth1:1-test (despite being configured NOT to run at boot), were in fact running anyway, at least intermittently. So there might be a boot script logic issue or something, since only ifcfg-eth0 and ifcfg-eth1 were configured to run on bootup. I believe this would have caused our issue, and I've deleted those VIP scripts, which we do not need since Piranha's pulse/lvs daemons dynamically handle the VIP devices.

Closing based on reporter's comment.
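A small follow-up sketch, not part of the original report, showing how one could audit /etc/sysconfig/network-scripts for leftover alias configuration files like the ifcfg-eth0:1-test scripts the reporter found; the backup destination directory is illustrative only:

~~~
cd /etc/sysconfig/network-scripts

# List any alias config files (ifcfg-ethX:Y, including leftover "-test" copies).
ls ifcfg-eth*:* 2>/dev/null

# Check whether any of them are set to come up at boot or on a network restart.
grep -H '^ONBOOT' ifcfg-eth*:* 2>/dev/null

# Move stray alias configs aside; per the reporter, pulse/lvs manages the VIP
# devices dynamically, so static alias scripts for those addresses are not needed.
# mv ifcfg-eth0:1-test ifcfg-eth1:1-test /root/ifcfg-backup/
~~~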