Bug 83454
Summary: | Backup lvs box messages log shows partner dead and starts lvs, then stops it again inside the same second of time. | |
---|---|---|---
Product: | [Retired] Red Hat High Availability Server | Reporter: | Peter Baitz <baitzph>
Component: | piranha | Assignee: | Mike McLean <mikem>
Status: | CLOSED NOTABUG | QA Contact: | Brock Organ <borgan>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 1.0 | |
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | i686 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2003-03-07 17:14:45 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Peter Baitz
2003-02-04 16:39:55 UTC
This original bugzilla post was made for our "email cluster". However, this Sunday, Feb 9, before 8 AM, we saw identical behavior from our "LDAP cluster", which is running Red Hat 7.3 plus all errata updates and the same kernel and piranha rpms, etc. as the original post. The only difference is that one cluster is routing sendmail/webmail and the other ldap/ldaps. This AM we saw the backup piranha/lvs box think its partner was dead and activate lvs, then either immediately or within a couple of minutes deactivate lvs. PROBLEM IS, WHILE LVS DAEMON AND NANNY DAEMONS DO GET DEACTIVATED AND NO LONGER RUNNING, THE VIRTUAL ETHERNETS STAY UP AND CONFUSE THE PARTNER. Stopping the NAT and LVS virtual devices on the backup Piranha/lvs box solves the issue for the moment. Also, our LDAP cluster boxes were both up over 50 days, so we rebooted them this AM. Rebooting and refreshing the boxes helped our email cluster stop having this problem so far this week, and I assume for another 50 days.

~~~~~~~~~~~~~~messages on Piranha4 which was running as backup~~~~~~
/var/log/messages:Feb 9 05:22:15 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb 9 05:23:03 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb 9 05:23:51 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb 9 05:23:51 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb 9 05:25:51 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb 9 05:26:26 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb 9 05:26:56 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb 9 05:27:40 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb 9 05:28:46 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb 9 05:29:20 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb 9 06:34:14 piranha4 pulse[14578]: partner dead: activating lvs
/var/log/messages:Feb 9 06:34:14 piranha4 pulse[14578]: partner active: deactivating lvs
/var/log/messages:Feb 9 07:58:21 piranha4 pulse[1415]: partner dead: activating lvs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Red Hat, please review this issue, which has occurred on two Piranha clusters, and let me know what the fix is. As a side note, I noted that under Red Hat 7.3 (initscripts-6.67-1) I cannot do an "ifdown eth0:1" or "ifdown eth1:1" to force the LVS and NAT virtual devices down, like I used to under Red Hat 7.2 (initscripts-6.43-1). Can you guys please fix this? I wind up doing a network restart instead for now. The reason I mention this is the "partner dead / partner alive" issue reported above, under which I wind up having to force the virtual devices to go away on the piranha backup box.

As for the initscripts problem, please file a separate bug against initscripts. Looking into it, but please be aware that we did not ship piranha with 7.3.

Addendum to our Piranha "LDAP cluster" issue posted "2003-02-09 09:11". We found out our network engineers did some work this weekend which coincided with the Piranha backup box attempting to start up lvs and virtual ethernets. The backup box did subsequently see the partner alive again and stopped lvs (and nannies), but the problem is the VIRTUAL ETHERNETS (like LVS eth0:1 and NAT eth1:1, etc.) DID NOT GET STOPPED. As such, routing was confused between the two lvs boxes.
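A minimal sketch of how one might clear such stale alias devices by hand, assuming the eth0:1 (LVS VIP) and eth1:1 (NAT) names used in the report; the device names are site-specific and this snippet is not taken from the original thread:

~~~
# Show any alias devices still configured on the backup director
# (old net-tools ifconfig lists aliases such as eth0:1 on their own entries).
/sbin/ifconfig | grep -A1 '^eth[0-9]:'

# Bring the stale aliases down directly instead of restarting the whole
# network service (the report notes "ifdown eth0:1" fails with initscripts-6.67-1).
/sbin/ifconfig eth0:1 down
/sbin/ifconfig eth1:1 down
~~~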
And to point out the same issue on the Piranha "email cluster": we believe it might have been an erroneous 100/half-duplex setting on our switch, which was subsequently corrected. So we have reasons for the failovers, but no good reason why, when lvs was stopped on the backup, the virtual ethernets were not properly stopped. Anyway, we rebooted all piranha boxes (all of which were up over 50 days at the times of these issues) to make sure uptimes were minimized, in case that caused the virtual ethernets to stay up on the backup. Thus far, all is well.

I realize 7.3 does not come with Piranha in the distribution, but the kernels used with 7.3 and 7.2 have the ipvs module compiled in, so it should work fine, and it has for 6 months, with the exceptions of what I've reported above. Some folks on the lvs-list indicated they think it is a R.H. kernel issue. I don't know if I agree. What do you say?

I tested our newest RH7.3 + Piranha/LVS system (which consists of two Piranha boxes with pulse) for failover and recovery behavior. (This is NOT our email or LDAP cluster; it is totally new.)

BOX GOES DOWN BEHAVIOR: One of the Piranha boxes goes down, or pulse is killed on it. The Piranha box still up and running takes over. When the downed box comes back up it just runs pulse, allowing the already-live box to continue services.

NETWORK DROPS BEHAVIOR: The public network connection is broken from one of the Piranha boxes. The box that was running only pulse activates its lvs and virtual ethernets. Then the network is restored. Now the two boxes negotiate and find out which is the actual backup, and the actual backup relinquishes control by stopping lvs and stopping the virtual ethernets (eth0:1, eth1:1, etc.), and just runs pulse.

Comment related to this bug report: Thus I can only think that something about the longer uptime (50+ days) contributes to the virtual ethernets erroneously staying up even though pulse stops lvs on one box.

Ok, I believe you can close this report. I discovered that a couple of old Virtual Ethernet device scripts, ifcfg-eth0:1-test and ifcfg-eth1:1-test (despite being configured NOT to run at boot), were in fact running anyway, at least intermittently. So there might be a boot script logic issue or something, since only ifcfg-eth0 and ifcfg-eth1 were configured to run on bootup. I believe this would have caused our issue, and I've deleted those VIP scripts, which we do not need since Piranha's pulse/lvs daemons dynamically handle the VIP devices.

Closing based on reporter's comment.
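A small follow-up sketch, not part of the original report, showing how one could audit /etc/sysconfig/network-scripts for leftover alias configuration files like the ifcfg-eth0:1-test scripts the reporter found; the backup destination directory is illustrative only:

~~~
cd /etc/sysconfig/network-scripts

# List any alias config files (ifcfg-ethX:Y, including leftover "-test" copies).
ls ifcfg-eth*:* 2>/dev/null

# Check whether any of them are set to come up at boot or on a network restart.
grep -H '^ONBOOT' ifcfg-eth*:* 2>/dev/null

# Move stray alias configs aside; per the reporter, pulse/lvs manages the VIP
# devices dynamically, so static alias scripts for those addresses are not needed.
# mv ifcfg-eth0:1-test ifcfg-eth1:1-test /root/ifcfg-backup/
~~~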