| Summary: | pulse leaves load balancer in a half-active state after brief network outage | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Stuart Auchterlonie <stuart.auchterlonie> |
| Component: | piranha | Assignee: | Ryan O'Hara <rohara> |
| Status: | CLOSED NOTABUG | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | low | ||
| Version: | 5.6 | CC: | cluster-maint, fdinitto, sauchter, uwe.knop |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-03-13 14:34:37 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Stuart Auchterlonie
2012-01-10 15:07:40 UTC
(In reply to comment #0)

> Description of problem:
>
> I have an active/passive load balancer controlled using pulse. The load
> balancers are on different sites connected by 1G ethernet. There are 17
> virtual services defined.
>
> From the following log:
>
> (passive node) ---
> Jan 10 11:19:43 gm-sh-lb-01 pulse[14130]: partner dead: activating lvs
> Jan 10 11:19:43 gm-sh-lb-01 pulse[14130]: partner active: deactivating LVS
> Jan 10 11:19:48 gm-sh-lb-01 pulse[12787]: gratuitous lvs arps finished
>
> (active node) ---
> Jan 10 11:19:43 gm-th-lb-01 pulse[5910]: partner active: resending arps
> Jan 10 11:19:43 gm-th-lb-01 pulse[5910]: partner active: resending arps
> Jan 10 11:19:48 gm-th-lb-01 pulse[17407]: gratuitous lvs arps finished
> Jan 10 11:19:48 gm-th-lb-01 pulse[17411]: gratuitous lvs arps finished
>
> we can see that the passive load balancer detected a failure of the active
> load balancer and started to go active (CORRECT!)
>
> Whilst it is setting up the VIPs and going active, it sees the active node
> return, and starts the deactivation process. However, it appears that the
> activation was not completely cancelled, as the gratuitous ARPs are sent
> *AFTER* it deactivates.
>
> Analysing the resulting state:
> - 9 of the VIPs were present on the passive node and were not cleaned up
>   on deactivation.
> - the ARP caches on the network switches confirm that for those VIPs the
>   traffic was being sent to the passive node.
>
> This resulted in a service outage for those virtual services.
>
> Version-Release number of selected component (if applicable):
>
> piranha-0.8.4-16.el5
>
> How reproducible:
>
> The network outage needs to be long enough to trigger the passive node to
> go active, but short enough to allow the active node to announce itself
> back to the passive node, whilst the passive node is going active.

How long is the network outage in your case?

> Steps to Reproduce:
> 1. Create network break between nodes
> 2. Repair network break while passive node is going active.
> 3.

Can you provide a detailed summary of how you are reproducing this problem?

> Actual results:
>
> Traffic is left being sent to a load balancer that has stopped servicing
> the virtual services, resulting in an outage.
>
> Expected results:
>
> All traffic is sent to the active load balancer and none to the passive
> load balancer.
>
> Additional info:

(In reply to comment #0)

> Description of problem:
>
> I have an active/passive load balancer controlled using pulse. The load
> balancers are on different sites connected by 1G ethernet. There are 17
> virtual services defined.

Also, can you define what you mean by "different sites"? Thanks.

(In reply to comment #1)

>> How reproducible:
>>
>> The network outage needs to be long enough to trigger the passive node to
>> go active, but short enough to allow the active node to announce itself
>> back to the passive node, whilst the passive node is going active.
>
> How long is the network outage in your case?

After looking at the logs from the network core, I cannot say for certain,
but it is sufficiently quick that the "partner dead" and "partner active"
messages occur in the same second. You can see this in the logs from the
original report.

>> Steps to Reproduce:
>> 1. Create network break between nodes
>> 2. Repair network break while passive node is going active.
>> 3.
>
> Can you provide a detailed summary of how you are reproducing this problem?

It's exceedingly difficult to reproduce, as it appears to be very timing
dependent. The other problem is that this is a live service and I can't go
creating a network outage between the load balancers.

I suspect it may be easier to reproduce if you put lots of services on the
load balancer. That way the time from starting to activate the VIPs to
finally pushing the gratuitous ARPs will increase significantly.
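(Until the race itself is fixed, the half-active state described above can at least be detected from the backup director. The following is a minimal sketch, not part of piranha: the `VIPS` list and `DEVICE` value are placeholders that would have to match the virtual servers defined in lvs.cf, and it relies only on the iproute `ip` utility. It reports any VIPs still configured on the supposedly passive node, together with the exact `ip addr del` command to remove each one.)

```python
#!/usr/bin/env python
"""Report VIPs that are still configured on the backup director.

Minimal sketch: VIPS and DEVICE are placeholders and must be adjusted
to match the virtual servers defined in lvs.cf.
"""
import subprocess

VIPS = ["192.0.2.10", "192.0.2.11"]    # example VIP addresses (placeholders)
DEVICE = "eth0"                        # assumed interface carrying the VIPs

def configured_addresses(device):
    """Map IPv4 address -> address/prefix currently configured on device."""
    out = subprocess.check_output(
        ["ip", "-4", "-o", "addr", "show", "dev", device])
    addrs = {}
    for line in out.decode().splitlines():
        # one-line format: "2: eth0    inet 192.0.2.10/32 scope global ..."
        fields = line.split()
        if "inet" in fields:
            cidr = fields[fields.index("inet") + 1]
            addrs[cidr.split("/")[0]] = cidr
    return addrs

def main():
    present = configured_addresses(DEVICE)
    stale = [vip for vip in VIPS if vip in present]
    if not stale:
        print("No configured VIPs found on %s; backup node looks clean." % DEVICE)
        return
    print("Stale VIPs still configured on this (backup) node:")
    for vip in stale:
        print("  %s   (remove with: ip addr del %s dev %s)"
              % (vip, present[vip], DEVICE))

if __name__ == "__main__":
    main()
```

Running something like this after a failover event would have flagged the nine leftover VIPs described above without having to inspect the ARP caches on the network switches.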
>> Description of problem:
>>
>> I have an active/passive load balancer controlled using pulse. The load
>> balancers are on different sites connected by 1G ethernet. There are 17
>> virtual services defined.
>
> Also, can you define what you mean by "different sites"? Thanks.

They are in different data centres about 2 miles apart, connected by dark
fibre running at 1Gb/s. Ping response times between the two load balancers
are as follows:

10 packets transmitted, 10 received, 0% packet loss, time 9003ms
rtt min/avg/max/mdev = 0.253/0.281/0.319/0.025 ms

For reference, pinging from one of the load balancers to another server
directly connected to the same switch (i.e. local), the ping times are as
follows:

10 packets transmitted, 10 received, 0% packet loss, time 9006ms
rtt min/avg/max/mdev = 0.263/0.290/0.355/0.035 ms

So despite being in different data centres, the connectivity is no different
to being locally connected.

Regards,
Stuart

Are you using the monitor_links option?

This request was evaluated by Red Hat Product Management for inclusion in a
Red Hat Enterprise Linux release. Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a
Red Hat Enterprise Linux release for currently deployed products. This
request is not yet committed for inclusion in a release.
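(For reference, the width of the window in which this race can occur is bounded by the heartbeat timing in lvs.cf. The sketch below is illustrative only: it assumes the stock /etc/sysconfig/ha/lvs.cf path and plain `name = value` lines for keepalive, deadtime and monitor_links, and the fallback values are those commonly seen in example configurations rather than anything taken from pulse's source. It just prints the configured values and the minimum outage length that will start the backup's activation sequence.)

```python
#!/usr/bin/env python
"""Print the pulse heartbeat timing from lvs.cf.

Illustrative sketch only: assumes the default piranha config path and
that keepalive/deadtime/monitor_links appear as simple "name = value"
lines; the fallbacks below mirror common example configs, not pulse
internals.
"""
import re

LVS_CF = "/etc/sysconfig/ha/lvs.cf"    # default piranha configuration file

# Fallbacks used when a directive is absent from lvs.cf (assumed values).
FALLBACKS = {"keepalive": 6, "deadtime": 18, "monitor_links": 0}

def read_settings(path):
    """Return the heartbeat-related directives found in lvs.cf."""
    settings = dict(FALLBACKS)
    with open(path) as cf:
        for line in cf:
            m = re.match(r"\s*(keepalive|deadtime|monitor_links)\s*=\s*(\d+)", line)
            if m:
                settings[m.group(1)] = int(m.group(2))
    return settings

if __name__ == "__main__":
    s = read_settings(LVS_CF)
    print("keepalive     = %d s (heartbeat interval)" % s["keepalive"])
    print("deadtime      = %d s (partner declared dead after this much silence)" % s["deadtime"])
    print("monitor_links = %d" % s["monitor_links"])
    print("An outage of roughly %d s or more starts the backup's activation;" % s["deadtime"])
    print("the race in this bug needs the partner to reappear while that")
    print("activation (VIP setup plus gratuitous ARPs) is still in progress.")
```

With 17 virtual services, the activation phase on the backup is long enough that a partner reappearing just after the deadtime expires can plausibly land inside the window the reporter describes.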