Bug 441716
Summary: | Fake ARP dropped after migration leading to loss of network connectivity | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Bill Braswell <bbraswel> |
Component: | kernel-xen | Assignee: | Herbert Xu <herbert.xu> |
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> |
Severity: | low | Docs Contact: | |
Priority: | urgent | ||
Version: | 5.2 | CC: | bdevouge, casmith, ddutile, hklein, ijc, jplans, mmatsuya, nstrug, pbonzini, rene.schaffrath+rhbz, rlerch, sdodson, sputhenp, sysadmin, tao, xen-maint |
Target Milestone: | rc | Keywords: | ZStream |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | GSSApproved | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2009-01-20 19:37:04 UTC | Type: | --- |
Bug Depends On: | 251527 | ||
Bug Blocks: | 409971, 447684, 448753 | ||
Comment 2
Chris Lalancette
2008-04-10 13:13:31 UTC
The problem with the original patch is that it put in a Xen-specific solution to a generic networking problem. That is why it wasn't taken, since we already had the generic upstream patch. Unfortunately, it now appears that the Xen netfront driver in RHEL5 does not send the gratuitous ARP packet at the correct time, which is a separate issue that was not identified in time. So what we will do is fix the second issue as part of RHEL 5.3 and zstream.
Created attachment 302697 [details]
[XEN] netfront: Send fake arp when link gets carrier
As it is the Xen netfront driver will transmit a fake ARP when the link gets an
IP address and when the link is brought up administratively. However, this
overlooks the case when the first two events occur without a link carrier.
Thus to be sure that the packet makes it out we also need to attempt a transmit
when the carrier comes up which can be detected through the NETDEV_CHANGE
event.
This is what this patch does.
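The three trigger points described in this patch can be modelled in a few lines of Python. This is only an illustrative simulation, not the netfront code: the class `FakeNetfront`, the helper `arps_after_migration`, and the string event name `ADDR_ASSIGNED` are hypothetical stand-ins, and the model assumes that every `NETDEV_CHANGE` means the carrier came up.

```python
class FakeNetfront:
    """Toy model of a netfront interface (hypothetical, not kernel code)."""

    def __init__(self):
        self.admin_up = False
        self.has_ip = False
        self.carrier = False
        self.arps_sent = 0

    def _try_send_fake_arp(self):
        # A packet can only leave the interface when it is administratively
        # up, has an address to announce, and the link has carrier.
        if self.admin_up and self.has_ip and self.carrier:
            self.arps_sent += 1

    def handle_event(self, event, handle_change=True):
        if event == "NETDEV_UP":
            self.admin_up = True
            self._try_send_fake_arp()
        elif event == "ADDR_ASSIGNED":
            self.has_ip = True
            self._try_send_fake_arp()
        elif event == "NETDEV_CHANGE":
            # Simplification: treat every change event as "carrier came up".
            self.carrier = True
            if handle_change:
                self._try_send_fake_arp()


def arps_after_migration(handle_change):
    """Replay the problematic ordering: address and admin-up arrive first,
    carrier arrives last. Returns how many fake ARPs were sent."""
    dev = FakeNetfront()
    for event in ("ADDR_ASSIGNED", "NETDEV_UP", "NETDEV_CHANGE"):
        dev.handle_event(event, handle_change=handle_change)
    return dev.arps_sent
```

With `handle_change=False` the model sends no ARP at all for this ordering, which corresponds to the lost connectivity in the summary; with `handle_change=True` one ARP goes out once the carrier is present.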
Created attachment 303426 [details]
Send fake ARP when link gets carrier
The patch posted didn't really work because of a few silly cut-n-paste errors.
Here's a patch that actually does work, for me anyway.
in kernel-2.6.18-93.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Created attachment 310783 [details]
Delay carrier on until dom0 has carrier.
I've looked at it further and in fact both upstream and RHEL5 are still subject
to a race between domU and dom0. The problem is that when the ARP packet in
domU is sent dom0's backend interface may not be ready yet. In particular, the
backend's carrier flag may not have come up yet which means that the bridge
won't let it transmit any packets since it is yet to enter the forwarding
state.
Here is a patch that attempts to delay domU from sending any packets until dom0
is ready.
Note that even this isn't perfect since dom0's bridge might have further
unpredictable delays in it. But in the long term all this should get folded in
to dom0 anyway since only it is in a position to do this reliably. As it is
this patch implicitly adds a delay in domU too (through the link watch layer)
which should counter-act the delay in dom0 (Ick!).
Created attachment 310792 [details]
Only delay grat ARP
Unfortunately delaying the carrier on event doesn't quite work since the rest
of the driver requires it to function. This patch simply delays the fake ARP
instead which works for me. But please test this on your machines since I
never saw the race on my machines anyway.
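The timing argument behind delaying only the fake ARP can be sketched as a small timeline model. This is a rough illustration under stated assumptions, not the patch itself: the function names and the abstract time units are made up, and the dom0 backend is reduced to a single "ready" instant after which the bridge forwards packets.

```python
def arp_delivered(send_time, backend_ready_time):
    """A packet passes the dom0 bridge only if the backend is already
    forwarding at the moment the packet is sent (assumed model)."""
    return send_time >= backend_ready_time


def simulate(carrier_time, arp_delay, backend_ready_time):
    """domU gets carrier at carrier_time and sends its fake ARP
    arp_delay time units later. Returns True if the ARP gets out."""
    send_time = carrier_time + arp_delay
    return arp_delivered(send_time, backend_ready_time)
```

For example, `simulate(0, 0, 2)` loses the ARP because dom0 is not ready yet, while `simulate(0, 3, 2)` delivers it. `simulate(0, 3, 5)` still fails, which matches the caveat that a fixed delay in domU cannot cover unpredictable delays in dom0's bridge.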
Created attachment 312200 [details]
[BRIDGE]: eliminate workqueue for carrier check
This upstream patch eliminates the delay. So if your testing confirms my guess
then all we have to do is merge this patch. My testing on Hari's machines also
shows the previous patch isn't necessary (I wasn't able to reproduce the race
locally).
Author: Stephen Hemminger <shemminger>
Date: Thu Feb 22 01:10:18 2007 -0800
[BRIDGE]: eliminate workqueue for carrier check
Having a work queue for checking carrier leads to lots of race issues.
Simpler to just get the cost when data structure is created and
update on change.
Signed-off-by: Stephen Hemminger <shemminger>
Signed-off-by: David S. Miller <davem>
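To see why a work queue for the carrier check opens a race window, here is a minimal Python model comparing a deferred check with an immediate one. The classes `DeferredBridge` and `ImmediateBridge` are hypothetical and heavily simplified: a port is either forwarding or not, and the real bridge's STP states and path cost are ignored.

```python
from collections import deque


class DeferredBridge:
    """Carrier changes are queued and applied later by a worker."""

    def __init__(self):
        self.forwarding = False
        self.pending = deque()

    def carrier_changed(self, up):
        # Only record the change; port state is updated by run_worker().
        self.pending.append(up)

    def run_worker(self):
        while self.pending:
            self.forwarding = self.pending.popleft()

    def can_transmit(self):
        return self.forwarding


class ImmediateBridge:
    """Carrier changes update the port state synchronously."""

    def __init__(self):
        self.forwarding = False

    def carrier_changed(self, up):
        self.forwarding = up

    def can_transmit(self):
        return self.forwarding


def race_window_exists(bridge):
    """True if a packet sent right after carrier-up would be dropped."""
    bridge.carrier_changed(True)
    return not bridge.can_transmit()
```

In this model `race_window_exists(DeferredBridge())` is true until the worker has run, whereas `race_window_exists(ImmediateBridge())` is false, which is the delay this upstream patch is expected to remove.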
Created attachment 313096 [details]
[BRIDGE]: eliminate workqueue for carrier check
Just as before, this patch is only needed in dom0. The two previous domU
patches are still required.
This is a back-port of two upstream patches:
commit 269def7c505b4d229f9ad49bf88543d1e605533e
Author: Stephen Hemminger <shemminger>
Date: Thu Feb 22 01:10:18 2007 -0800
[BRIDGE]: eliminate workqueue for carrier check
Having a work queue for checking carrier leads to lots of race issues.
Simpler to just get the cost when data structure is created and
update on change.
Signed-off-by: Stephen Hemminger <shemminger>
Signed-off-by: David S. Miller <davem>
and
commit de79059ecd7cd650f3788ece978a64586921d1f1
Author: Aji Srinivas <emc.com>
Date: Wed Mar 7 16:10:53 2007 -0800
[BRIDGE]: adding new device to bridge should enable if up
One change introduced by the workqueue removal patch is that adding an
interface that is up to a bridge which is also up does not ever call
br_stp_enable_port(), leaving the port in DISABLED state until we do
ifconfig down and up or link events occur.
The following patch to the br_add_if function fixes it.
This is a regression introduced in 2.6.21.
Submitted-by: Aji_Srinivas
Signed-off-by: Stephen Hemminger <shemminger>
Signed-off-by: David S. Miller <davem>
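The regression described in the second commit can be illustrated with a small Python model. This is a sketch only: the `Bridge` class, the `enable_on_add` switch, and the exact condition (bridge up, device up, device has carrier) are assumptions made for illustration and are not copied from the br_add_if patch.

```python
class Bridge:
    """Toy bridge that tracks which ports have been enabled."""

    def __init__(self, up):
        self.up = up
        self.enabled_ports = set()

    def add_if(self, name, dev_up, dev_carrier, enable_on_add):
        # Without a work queue nothing re-checks the port later, so a
        # port that is already usable must be enabled at add time.
        if enable_on_add and self.up and dev_up and dev_carrier:
            self.enabled_ports.add(name)


def port_enabled_after_add(enable_on_add):
    """Add an interface that is already up to a bridge that is also up,
    and report whether the port ended up enabled."""
    br = Bridge(up=True)
    br.add_if("vif1.0", dev_up=True, dev_carrier=True,
              enable_on_add=enable_on_add)
    return "vif1.0" in br.enabled_ports
```

With `enable_on_add=False` the port stays disabled until some later link event, as the commit message describes; with `enable_on_add=True` it is usable as soon as it is added. The interface name `vif1.0` is just an example.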
Created attachment 313197 [details]
[BRIDGE]: eliminate workqueue for carrier check
Cool, I just noticed that this patch actually fixes an unrelated (and critical)
bug too in RHEL5. So we can push this regardless of the Xen issue.
I think this has been an elusive problem and some earlier fixes ended up not solving the issue completely or in all cases. But Herbert can answer better. Putting this into needinfo for Herbert.
This bug has been marked for inclusion in the Red Hat Enterprise Linux 5.3 Release Notes. To aid in the development of relevant and accurate release notes, please fill out the "Release Notes" field above with the following 4 pieces of information:
Cause: What actions or circumstances cause this bug to present.
Consequence: What happens when the bug presents.
Fix: What was done to fix the bug.
Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2009-0225.html
*** Bug 622574 has been marked as a duplicate of this bug. ***