From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Description of problem:
Using Red Hat Linux 7.2 with all errata updates, currently at kernel-smp-2.4.18-18.7.x, ipvsadm-1.21-4.i386.rpm, piranha-0.7.0-3.i386.rpm, scsi_reserve-0.7-6.i386.rpm, and scsi_reserve-devel-0.7-6.i386.rpm. Running on two Dell 1650 dual-Pentium servers with dual-port eepro100 cards. We've run for months with no indication of this problem using pulse for heartbeat/high availability. Recently the two Piranha/LVS systems had been up for 52 days and 72 days respectively. We noted the following two occurrences in the messages log on the backup box:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 31 18:55:54 piranha2 pulse[13171]: partner dead: activating lvs
Jan 31 18:55:54 piranha2 pulse[13171]: partner active: deactivating lvs
Feb  2 23:04:08 piranha2 pulse[23987]: partner dead: activating lvs
Feb  2 23:04:08 piranha2 pulse[23987]: partner active: deactivating lvs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We wonder why, within the SAME SECOND OF TIME, pulse thinks the partner is dead, then alive again, and starts and stops lvs. We believe lvs and its nannies are not actually being stopped, or not fully stopped, and that LVS and NAT IP address conflicts between the primary and secondary therefore cause routing issues. When we kill and restart the pulse daemon on the backup LVS box, all is well again, and the primary starts routing properly with no further work done to either system. The two entries above are the first time I've seen this issue; we've used Piranha for 6 months with pulse for heartbeat, and 6 months prior without pulse.

Version-Release number of selected component (if applicable):

How reproducible:
Didn't try

Steps to Reproduce:
I don't believe this issue is forcibly reproducible. It seems spurious, but it has occurred twice in 3 days after 6 months with no such issue seen in the messages logs.
Expected Results:
pulse should never think the partner is dead and try to start and stop the lvs daemon within the SAME SECOND OF TIME.

Additional info:
messages log found on our backup lvs box:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 31 18:55:54 piranha2 pulse[13171]: partner dead: activating lvs
Jan 31 18:55:54 piranha2 pulse[13171]: partner active: deactivating lvs
Feb  2 23:04:08 piranha2 pulse[23987]: partner dead: activating lvs
Feb  2 23:04:08 piran2 pulse[23987]: partner active: deactivating lvs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

lvs.cf file header (lvs.cf is identical on both lvs boxes; the IP addresses have been altered to protect the innocent):

serial_no = 168
primary = 188.166.18.25
service = lvs
backup_active = 1
backup = 188.166.18.180
heartbeat = 1
heartbeat_port = 539
keepalive = 6
deadtime = 18
network = nat
nat_router = 188.166.17.22 eth1:1
nat_nmask = 255.255.255.0
reservation_conflict_action = preempt
debug_level = NONE

Our daemons look like this running on the primary:

/usr/sbin/pulse
 \_ /usr/sbin/lvs --nofork -c /etc/sysconfig/ha/lvs.cf
     \_ /usr/sbin/nanny -c -h 188.166.17.56 -p 80 -s GET / ..
     \_ /usr/sbin/nanny -c -h 188.166.17.57 -p 80 -s GET / ..
     \_ /usr/sbin/nanny -c -h 188.166.17.37 -p 25 -e /usr/ ..
     \_ /usr/sbin/nanny -c -h 188.166.17.38 -p 25 -e /usr/ ..
     \_ /usr/sbin/nanny -c -h 188.166.17.27 -p 443 -a 15 ..
     \_ /usr/sbin/nanny -c -h 188.166.17.28 -p 443 -a 15 ..

Only /usr/sbin/pulse normally runs on the backup box. The eth0:1 (LVS public device) and eth1:1 (NAT router private device) aliases normally run only on the primary, and show up on the backup box only during a normal failover.
The NAT router table looks like this:

[piranha2 root]# ipvsadm -Ln
IP Virtual Server version 1.0.4 (size=65536)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  188.166.18.26:25 wlc
  -> 188.166.17.38:25            Masq    1      1          40
  -> 188.166.17.37:25            Masq    1      1          41
TCP  188.166.18.26:80 wlc
  -> 188.166.17.57:80            Masq    1      0          0
  -> 188.166.17.56:80            Masq    1      0          1
TCP  188.166.18.26:443 wlc persistent 900
  -> 188.166.17.27:443           Masq    1      1          10
  -> 188.166.17.28:443           Masq    1      0          3

The "iptables -L -t nat" output looks like this:

[root@piranha2 root]# iptables -L -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source           destination

Chain POSTROUTING (policy ACCEPT)
target     prot opt source           destination
MASQUERADE all  --  188.166.17.0/24  anywhere

Chain OUTPUT (policy ACCEPT)
target     prot opt source           destination

The backup box had been up for over 50 days and the primary for over 70 days. We rebooted them after the log messages shown above, and in the last couple of days have not seen this kind of log entry again. We also found a 100/half-duplex setting on the switch port to which the primary lvs box is connected, even though mii-tool showed all eth devices running 100/full as far as the Linux kernel was concerned. We corrected the 100/half setting on the switch port just today. Could this have anything to do with it?
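Since a switch/NIC duplex mismatch can silently drop heartbeat packets, a small script to flag half-duplex links can be useful. The sketch below assumes the typical `mii-tool` output format ("eth0: negotiated 100baseTx-FD, link ok"); the `duplex_of` helper is our own name, not part of any package.

```shell
#!/bin/sh
# Extract the speed/duplex token (e.g. "100baseTx-FD") from one line
# of mii-tool output, so half-duplex links can be flagged.
duplex_of() {
    # $1 is a line like: "eth0: negotiated 100baseTx-FD, link ok"
    echo "$1" | sed -n 's/.*negotiated \([^,]*\),.*/\1/p'
}

line="eth0: negotiated 100baseTx-FD, link ok"
mode=$(duplex_of "$line")
case "$mode" in
    *-FD) echo "$mode: full duplex" ;;
    *-HD) echo "$mode: HALF duplex - check the switch port" ;;
    *)    echo "could not parse: $line" ;;
esac
```

On a live box this could be fed with `mii-tool eth0 eth1` output line by line; note that mii-tool only reports what the NIC negotiated, so (as seen above) the switch port can still be misconfigured independently.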
This original bugzilla post was for our "email cluster". However, this Sunday Feb 9 before 8 AM we saw identical behavior from our "LDAP cluster", which is running Red Hat 7.3 plus all errata updates and the same kernel and piranha RPMs, etc. as in the original post. The only difference is that one cluster routes sendmail/webmail and the other ldap/ldaps. This AM we saw the backup piranha/lvs box think its partner was dead and activate lvs, then either immediately or within a couple of minutes deactivate lvs. THE PROBLEM IS, WHILE THE LVS DAEMON AND NANNY DAEMONS DO GET DEACTIVATED AND ARE NO LONGER RUNNING, THE VIRTUAL ETHERNETS STAY UP AND CONFUSE THE PARTNER. Stopping the NAT and LVS virtual devices on the backup Piranha/lvs box solves the issue for the moment. Also, our LDAP cluster boxes had both been up over 50 days, so we rebooted them this AM. Rebooting and refreshing the boxes has helped: our email cluster has stopped having this problem so far this week, and I assume will for another 50 days.

~~~~~~~~~~~~~~messages on Piranha4 which was running as backup~~~~~~
/var/log/messages:Feb  9 05:22:15 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb  9 05:23:03 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb  9 05:23:51 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb  9 05:23:51 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb  9 05:25:51 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb  9 05:26:26 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb  9 05:26:56 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb  9 05:27:40 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb  9 05:28:46 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb  9 05:29:20 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb  9 06:34:14 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb  9 06:34:14 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb  9 07:58:21 piranha4 pulse[1415]: partner dead: activating lvs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Red Hat, please review this issue, which has occurred on two Piranha clusters, and let me know what the fix is.
As a side note, I noted that under Red Hat 7.3 (initscripts-6.67-1) I cannot do an "ifdown eth0:1" or "ifdown eth1:1" to force the LVS and NAT virtual devices down, as I used to under Red Hat 7.2 (initscripts-6.43-1). Can you please fix this? For now I wind up doing a full network restart instead. I mention this because of the "partner dead / partner alive" issue reported above: when it hits, I have to force the virtual devices off the piranha backup box by hand.
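As a possible interim workaround (a sketch only, not verified on 7.3 here), an IP alias device can usually be taken down directly with ifconfig rather than through the initscripts, which avoids the full network restart. The `down_alias` function name is our own, not part of any package:

```shell
#!/bin/sh
# Sketch of a workaround: take an IP alias device (eth0:1 style) down
# directly with ifconfig, bypassing ifdown from initscripts-6.67-1.
down_alias() {
    case "$1" in
        eth*:*) ifconfig "$1" down ;;
        *)      echo "refusing: $1 is not an alias device" >&2
                return 1 ;;
    esac
}
```

Usage would be e.g. `down_alias eth0:1`; the guard refuses plain devices like eth0 so the script cannot accidentally kill the real interface.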
As for the initscripts problem, please file a separate bug against initscripts.
Looking into it, but please be aware that we did not ship piranha with 7.3.
Addendum to our Piranha "LDAP cluster" issue posted "2003-02-09 09:11". We found out our network engineers did some work this weekend, which coincided with the Piranha backup box attempting to start up lvs and the virtual ethernets. The backup box did subsequently see the partner alive again and stopped lvs (and the nannies), but the problem is that the VIRTUAL ETHERNETS (the LVS eth0:1, NAT eth1:1, etc.) DID NOT GET STOPPED. As a result, routing was confused between the two lvs boxes. As for the same issue on the Piranha "email cluster", we believe the cause might have been the erroneous 100/half-duplex setting on our switch, which has since been corrected. So we have explanations for the failovers, but no good explanation for why stopping lvs on the backup did not also stop the virtual ethernets. Anyway, we rebooted all the piranha boxes (all of which were up over 50 days at the time of these issues) to minimize uptimes, in case long uptime caused the virtual ethernets not to be stopped on the backup. Thus far, all is well.
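Until the root cause is found, the backup box could be watched for leftover alias devices, since on a healthy backup no eth0:1/eth1:1 aliases should exist. A minimal sketch (the `stale_aliases` helper is our own name, not part of piranha) that filters a list of interface names down to alias devices:

```shell
#!/bin/sh
# Sketch: filter a list of interface names (one per line on stdin) down
# to IP alias devices (eth0:1 style), so a cron job on the backup box
# could flag stale LVS/NAT aliases that pulse failed to remove.
stale_aliases() {
    grep '^eth[0-9]*:[0-9]*$'
}
```

On a live box this could be fed with something like `ifconfig -s | awk 'NR>1 {print $1}' | stale_aliases` (assuming the netstat-style `-s` output of the net-tools ifconfig); any output on the backup would warrant investigation.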
I realize 7.3 does not come with Piranha in the distribution, but the kernels used with 7.3 and 7.2 both have the ipvs module compiled in, so it should work fine, and it has for 6 months, with the exceptions I've reported above. Some folks on the lvs-list indicated they think it is a Red Hat kernel issue. I don't know if I agree. What do you say?
I tested our newest RH7.3 + Piranha/LVS system (which consists of two Piranha boxes with pulse) for failover and recovery behavior. (This is NOT our email or LDAP cluster; it is totally new.)

BOX GOES DOWN BEHAVIOR
One of the Piranha boxes goes down, or pulse is killed on it. The Piranha box still up and running takes over. When the downed box comes back up, it just runs pulse, allowing the already-live box to continue services.

NETWORK DROPS BEHAVIOR
The public network connection is broken from one of the Piranha boxes. The one which was running only pulse activates its lvs and virtual ethernets. Then the network is restored. Now the two boxes negotiate and find out which is the actual backup, and the actual backup relinquishes control by stopping lvs and stopping the virtual ethernets (eth0:1, eth1:1, etc.), going back to just running pulse.

Comment related to this bug report: Thus I can only think that something about longer uptimes (50+ days) contributes to the virtual ethernets erroneously staying up even though pulse stops lvs on one box.
OK, I believe you can close this report. I discovered that a couple of old virtual ethernet device scripts, ifcfg-eth0:1-test and ifcfg-eth1:1-test, were in fact running anyway (despite being configured NOT to run at boot), at least intermittently. So there might be a boot-script logic issue or something, since only ifcfg-eth0 and ifcfg-eth1 were configured to run at bootup. I believe this would have caused our issue, and I've deleted those VIP scripts, which we do not need since Piranha's pulse/lvs daemons handle the VIP devices dynamically.
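For anyone hitting the same thing, leftover alias config files like the ones described above can be audited with a short script. This is a sketch only; `audit_alias_cfgs` is our own helper name, and the directory is a parameter so it can be pointed at /etc/sysconfig/network-scripts on a real box:

```shell
#!/bin/sh
# Sketch: list alias ifcfg files (ifcfg-eth0:1 etc.) in a given
# network-scripts directory together with their ONBOOT setting, to
# spot leftovers such as the ifcfg-eth0:1-test scripts described above.
audit_alias_cfgs() {
    dir="$1"
    for f in "$dir"/ifcfg-*:*; do
        [ -f "$f" ] || continue   # skip if the glob matched nothing
        onboot=$(sed -n 's/^ONBOOT=//p' "$f")
        echo "$(basename "$f") ONBOOT=${onboot:-unset}"
    done
}
```

Since pulse/lvs manage the VIP devices themselves, any alias ifcfg file this reports on a Piranha box is a candidate for deletion regardless of its ONBOOT value.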
Closing based on reporter's comment.