Bug 441716 - Fake ARP dropped after migration leading to loss of network connectivity
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.2
Hardware: All Linux
Priority: urgent  Severity: low
Target Milestone: rc
Target Release: ---
Assigned To: Herbert Xu
QA Contact: Martin Jenner
Whiteboard: GSSApproved
Keywords: ZStream
Duplicates: 622574
Depends On: 251527
Blocks: 409971 447684 448753
Reported: 2008-04-09 12:52 EDT by Bill Braswell
Modified: 2010-10-22 19:54 EDT

Doc Type: Bug Fix
Last Closed: 2009-01-20 14:37:04 EST


Attachments
[XEN] netfront: Send fake arp when link gets carrier (1.58 KB, patch)
2008-04-17 00:11 EDT, Herbert Xu
Send fake ARP when link gets carrier (1.52 KB, patch)
2008-04-23 00:59 EDT, Herbert Xu
Delay carrier on until dom0 has carrier. (759 bytes, patch)
2008-07-02 08:18 EDT, Herbert Xu
Only delay grat ARP (679 bytes, patch)
2008-07-02 09:52 EDT, Herbert Xu
[BRIDGE]: eliminate workqueue for carrier check (4.92 KB, patch)
2008-07-19 08:13 EDT, Herbert Xu
[BRIDGE]: eliminate workqueue for carrier check (4.38 KB, patch)
2008-07-31 09:20 EDT, Herbert Xu
[BRIDGE]: eliminate workqueue for carrier check (4.47 KB, patch)
2008-08-01 09:30 EDT, Herbert Xu

Comment 2 Chris Lalancette 2008-04-10 09:13:31 EDT
Matsuya-san,
     We already put a different patch into 5.2 that should significantly reduce
the ARP delay.  Please have them test with the latest 5.2 snapshots and see if
they get better results (you should see approximately 1-2 seconds of downtime
after migration).  Also, I'm not sure which new feature you are referring to;
live migration has been supported since 5.0, so this is just a bug that has
not been fixed until now.

Thanks,
Chris Lalancette
Comment 3 Herbert Xu 2008-04-10 13:44:17 EDT
The problem with the original patch is that it put in a Xen-specific solution
to a generic networking problem.  That is why it wasn't taken, since we already
had the generic upstream patch.

Now, unfortunately, it appears that the Xen netfront driver in RHEL5 does not
send the gratuitous ARP packet at the correct time.  This is a separate issue
that was not identified in time.

So what we will do is fix the second issue as part of RHEL 5.3 and zstream.
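
For readers unfamiliar with the packet in question, here is a minimal
user-space sketch of a gratuitous ARP frame in its common "request" form
(sender IP equal to target IP, sent to the Ethernet broadcast address).  The
function name, MAC and IP address are illustrative assumptions; the fake ARP
that netfront builds in the kernel may use a different opcode or addresses.

```python
import socket
import struct

ETH_P_ARP = 0x0806        # EtherType for ARP
ETH_P_IP = 0x0800         # protocol type carried in the ARP body (IPv4)
ARPOP_REQUEST = 1         # gratuitous ARP is commonly sent as a request
BROADCAST_MAC = b"\xff" * 6


def build_gratuitous_arp(mac: bytes, ip: str) -> bytes:
    """Return an Ethernet frame announcing that `ip` is reachable at `mac`.

    Hypothetical helper for illustration only; it builds bytes and does not
    transmit anything.
    """
    ip_bytes = socket.inet_aton(ip)
    eth_header = struct.pack("!6s6sH", BROADCAST_MAC, mac, ETH_P_ARP)
    arp_body = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                  # hardware type: Ethernet
        ETH_P_IP,           # protocol type: IPv4
        6,                  # hardware address length
        4,                  # protocol address length
        ARPOP_REQUEST,
        mac, ip_bytes,      # sender hardware / protocol address
        b"\x00" * 6,        # target hardware address (unknown)
        ip_bytes,           # target IP == sender IP makes it gratuitous
    )
    return eth_header + arp_body


frame = build_gratuitous_arp(bytes.fromhex("00163e000001"), "192.0.2.10")
```

Switches and bridges that see this frame relearn which port the MAC address
sits behind, which is why it matters after a guest has moved to another host.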
Comment 5 Herbert Xu 2008-04-17 00:11:33 EDT
Created attachment 302697 [details]
[XEN] netfront: Send fake arp when link gets carrier

As it stands, the Xen netfront driver transmits a fake ARP when the link gets
an IP address and when the link is brought up administratively.  However, this
overlooks the case where both of those events occur without a link carrier.
Thus, to be sure that the packet makes it out, we also need to attempt a
transmit when the carrier comes up, which can be detected through the
NETDEV_CHANGE event.

This is what this patch does.
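
The event ordering described above can be sketched with a toy user-space
model.  This is not the patch itself: the class, the `handle_change` switch
and the `INETADDR_UP` label are assumptions made for illustration, and only
the NETDEV_UP and NETDEV_CHANGE names come from the kernel.

```python
from dataclasses import dataclass, field

NETDEV_UP = "NETDEV_UP"            # link brought up administratively
NETDEV_CHANGE = "NETDEV_CHANGE"    # link state changed (here: carrier on)
INETADDR_UP = "INETADDR_UP"        # hypothetical label: IP address assigned


@dataclass
class ToyNetfront:
    """Toy stand-in for the driver's notifier handling, not kernel code."""
    handle_change: bool = True     # False models the pre-patch behaviour
    admin_up: bool = False
    has_ip: bool = False
    carrier: bool = False
    sent: list = field(default_factory=list)

    def _try_send_fake_arp(self, reason: str) -> None:
        # The packet only makes it out if the link is up, has an address,
        # and has a carrier at the moment of the attempt.
        if self.admin_up and self.has_ip and self.carrier:
            self.sent.append(reason)

    def notify(self, event: str) -> None:
        if event == NETDEV_UP:
            self.admin_up = True
            self._try_send_fake_arp(event)
        elif event == INETADDR_UP:
            self.has_ip = True
            self._try_send_fake_arp(event)
        elif event == NETDEV_CHANGE:
            self.carrier = True    # simplification: every change = carrier on
            if self.handle_change:
                self._try_send_fake_arp(event)


# Both "up" events arrive before the carrier does.
events = (NETDEV_UP, INETADDR_UP, NETDEV_CHANGE)

before = ToyNetfront(handle_change=False)
after = ToyNetfront(handle_change=True)
for ev in events:
    before.notify(ev)
    after.notify(ev)
```

In this model `before.sent` stays empty because neither attempt had a carrier,
while `after.sent` records one transmit triggered by the carrier change.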
Comment 6 Herbert Xu 2008-04-23 00:59:56 EDT
Created attachment 303426 [details]
Send fake ARP when link gets carrier

The patch posted didn't really work because of a few silly cut-n-paste errors. 
Here's a patch that actually does work, for me anyway.
Comment 16 Don Zickus 2008-05-20 15:20:48 EDT
Included in kernel-2.6.18-93.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 21 Herbert Xu 2008-07-02 08:18:33 EDT
Created attachment 310783 [details]
Delay carrier on until dom0 has carrier.

I've looked at it further and in fact both upstream and RHEL5 are still subject
to a race between domU and dom0.  The problem is that when domU sends the ARP
packet, dom0's backend interface may not be ready yet.  In particular, the
backend's carrier flag may not have come up yet, which means that the bridge
won't let it transmit any packets because the port has yet to enter the
forwarding state.

Here is a patch that attempts to delay domU from sending any packets until dom0
is ready.

Note that even this isn't perfect since dom0's bridge might have further
unpredictable delays in it.  But in the long term all this should get folded
into dom0 anyway since only it is in a position to do this reliably.  As it is,
this patch implicitly adds a delay in domU too (through the link watch layer),
which should counteract the delay in dom0 (Ick!).
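
The race can be reduced to a comparison of two points in time.  The sketch
below is a toy model under stated assumptions: the function name, the
parameters and the numbers are all hypothetical, and the real delays in dom0
are, as noted above, unpredictable.

```python
def arp_delivered(send_time: float,
                  backend_carrier_time: float,
                  bridge_delay: float) -> bool:
    """Return True if a packet sent by domU at `send_time` gets forwarded.

    Toy model: the bridge port in dom0 starts forwarding `bridge_delay`
    seconds after the backend interface gets its carrier.  Anything domU
    sends before that moment is dropped.
    """
    forwarding_time = backend_carrier_time + bridge_delay
    return send_time >= forwarding_time


# domU sends immediately; the backend carrier and the bridge lag behind.
immediate = arp_delivered(send_time=0.0, backend_carrier_time=0.1,
                          bridge_delay=1.0)

# domU waits long enough to cover both delays in dom0.
delayed = arp_delivered(send_time=2.0, backend_carrier_time=0.1,
                        bridge_delay=1.0)

# With no bridge-side delay, waiting for the backend carrier is sufficient.
no_bridge_delay = arp_delivered(send_time=0.1, backend_carrier_time=0.1,
                                bridge_delay=0.0)
```

The model also shows why a fixed delay in domU is only a workaround: it helps
only when it happens to exceed whatever delay dom0 actually has.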
Comment 22 Herbert Xu 2008-07-02 09:52:54 EDT
Created attachment 310792 [details]
Only delay grat ARP

Unfortunately, delaying the carrier-on event doesn't quite work since the rest
of the driver requires it to function.  This patch simply delays the fake ARP
instead, which works for me.  But please test this on your machines since I
never saw the race on my machines anyway.
Comment 31 Herbert Xu 2008-07-19 08:13:25 EDT
Created attachment 312200 [details]
[BRIDGE]: eliminate workqueue for carrier check

This upstream patch eliminates the delay.  So if your testing confirms my guess
then all we have to do is merge this patch.  My testing on Hari's machines also
shows the previous patch isn't necessary (I wasn't able to reproduce the race
locally).

Author: Stephen Hemminger <shemminger@linux-foundation.org>
Date:	Thu Feb 22 01:10:18 2007 -0800

    [BRIDGE]: eliminate workqueue for carrier check

    Having a work queue for checking carrier leads to lots of race issues.
    Simpler to just get the cost when data structure is created and
    update on change.

    Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
Comment 45 Herbert Xu 2008-07-31 09:20:31 EDT
Created attachment 313096 [details]
[BRIDGE]: eliminate workqueue for carrier check

Just as before, this patch is only needed in dom0.  The two previous domU
patches are still required.

This is a back-port of two upstream patches:

commit 269def7c505b4d229f9ad49bf88543d1e605533e
Author: Stephen Hemminger <shemminger@linux-foundation.org>
Date:	Thu Feb 22 01:10:18 2007 -0800

    [BRIDGE]: eliminate workqueue for carrier check

    Having a work queue for checking carrier leads to lots of race issues.
    Simpler to just get the cost when data structure is created and
    update on change.

    Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

and

commit de79059ecd7cd650f3788ece978a64586921d1f1
Author: Aji Srinivas <Aji_Srinivas@emc.com>
Date:	Wed Mar 7 16:10:53 2007 -0800

    [BRIDGE]: adding new device to bridge should enable if up

    One change introduced by the workqueue removal patch is that adding an
    interface that is up to a bridge which is also up does not ever call
    br_stp_enable_port(), leaving the port in DISABLED state until we do
    ifconfig down and up or link events occur.

    The following patch to the br_add_if function fixes it.
    This is a regression introduced in 2.6.21.

    Submitted-by: Aji_Srinivas@emc.com
    Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
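
The regression described in the second commit message can be illustrated with
a toy model.  This is a sketch under stated assumptions, not the bridge code:
the class, its `enable_on_add` switch and the exact enable condition are
hypothetical, and only the names br_add_if and br_stp_enable_port and the
DISABLED state come from the commit message.

```python
DISABLED = "DISABLED"
ENABLED = "ENABLED"   # stands in for whatever state br_stp_enable_port() sets


class ToyBridge:
    """Toy bridge that only tracks per-port state."""

    def __init__(self, up: bool, enable_on_add: bool):
        self.up = up
        self.enable_on_add = enable_on_add   # True models the follow-up fix
        self.ports = {}

    def add_if(self, name: str, dev_up: bool, carrier: bool) -> None:
        """Model of br_add_if(): attach a device to the bridge."""
        state = DISABLED
        # Assumed fix: enable the port right away when everything is
        # already up, instead of waiting for a later event.
        if self.enable_on_add and self.up and dev_up and carrier:
            state = ENABLED
        self.ports[name] = state

    def link_event(self, name: str, dev_up: bool, carrier: bool) -> None:
        """Model of a later up/down or carrier event on the port."""
        if self.up and dev_up and carrier:
            self.ports[name] = ENABLED
        else:
            self.ports[name] = DISABLED


# Without the fix, a device that is already up stays DISABLED after add.
regressed = ToyBridge(up=True, enable_on_add=False)
regressed.add_if("vif1.0", dev_up=True, carrier=True)
state_before_event = regressed.ports["vif1.0"]
regressed.link_event("vif1.0", dev_up=True, carrier=True)
state_after_event = regressed.ports["vif1.0"]

# With the fix, the same device is enabled as soon as it is added.
fixed = ToyBridge(up=True, enable_on_add=True)
fixed.add_if("vif1.0", dev_up=True, carrier=True)
```

The interface name `vif1.0` is only an example of a Xen backend device name.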
Comment 50 Herbert Xu 2008-08-01 09:30:16 EDT
Created attachment 313197 [details]
[BRIDGE]: eliminate workqueue for carrier check

Cool, I just noticed that this patch also fixes an unrelated (and critical)
bug in RHEL5.  So we can push this regardless of the Xen issue.
Comment 57 Bill Burns 2008-08-08 08:24:37 EDT
I think this has been an elusive problem and some earlier fixes ended up not solving the issue completely or in all cases. But Herbert can answer better.
Putting this into needinfo for Herbert.
Comment 68 Ryan Lerch 2008-11-06 19:14:57 EST
This bug has been marked for inclusion in the Red Hat Enterprise Linux 5.3
Release Notes.

To aid in the development of relevant and accurate release notes, please fill
out the "Release Notes" field above with the following 4 pieces of information:


Cause:   What actions or circumstances cause this bug to present.

Consequence:  What happens when the bug presents.

Fix:   What was done to fix the bug.

Result:  What now happens when the actions or circumstances above occur. (NB:
this is not the same as 'the bug doesn't present anymore')
Comment 74 errata-xmlrpc 2009-01-20 14:37:04 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html
Comment 75 Paolo Bonzini 2010-08-10 23:38:57 EDT
*** Bug 622574 has been marked as a duplicate of this bug. ***