After migration, the Xen netfront driver sends a gratuitous ARP so that ARP caches are refreshed and network downtime is minimised. Unfortunately, carrier detection is delayed in the kernel (by up to a minute?), so the ARP is dropped. The result is a network blackout until either the ARP caches expire or the guest generates an ARP for some other reason.
This was fixed in the upstream Xen kernel by:
http://xenbits.xensource.com/xen-unstable.hg?rev/42b29f084c31
or in the upstream mainline kernel by:
http://lkml.org/lkml/2007/5/8/179
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=572a103ded0ad880f75ce83e99f0512fbb80b5b0
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=294cc44b7e48a6e7732499eebcf409b231460d8e
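For readers unfamiliar with the mechanism, here is a minimal sketch of what a gratuitous ARP frame looks like on the wire. It is illustrative only and is not the netfront code: the function name `build_gratuitous_arp`, the example MAC and the example IP are all made up, and it builds the request form (operation 1), which may not be the form netfront actually emits. The defining property is that the sender and target protocol addresses are the same and the frame is broadcast, so peers update their cache entry for that IP.

```python
import struct

ETH_P_ARP = 0x0806          # EtherType for ARP
ETH_P_IP = 0x0800           # protocol type carried in the ARP header (IPv4)
BROADCAST_MAC = b"\xff" * 6


def build_gratuitous_arp(mac: bytes, ip: bytes) -> bytes:
    """Build a broadcast Ethernet frame carrying a gratuitous ARP request.

    mac: 6-byte hardware address of the announcing interface.
    ip:  4-byte IPv4 address being announced.
    (Hypothetical helper for illustration; not taken from the kernel.)
    """
    if len(mac) != 6 or len(ip) != 4:
        raise ValueError("mac must be 6 bytes and ip must be 4 bytes")

    # Ethernet header: destination (broadcast), source, EtherType.
    eth_header = BROADCAST_MAC + mac + struct.pack("!H", ETH_P_ARP)

    # ARP fixed header: hardware type 1 (Ethernet), protocol type IPv4,
    # hardware length 6, protocol length 4, operation 1 (request).
    arp_header = struct.pack("!HHBBH", 1, ETH_P_IP, 6, 4, 1)

    # Sender MAC/IP, then target MAC (zeroed) and target IP.
    # Sender IP == target IP is what makes the ARP "gratuitous".
    arp_body = mac + ip + b"\x00" * 6 + ip

    return eth_header + arp_header + arp_body


if __name__ == "__main__":
    # Example values only: a MAC in the Xen OUI range and a documentation IP.
    frame = build_gratuitous_arp(bytes.fromhex("00163e000001"),
                                 bytes([192, 0, 2, 10]))
    print(len(frame), frame.hex())
```

If the interface has no carrier at the moment such a frame is queued, it never reaches the wire, which is the failure described above.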
Created attachment 288931: xen-unstable 13763:8132bf3ddbef ported to 2.6.18-53.el5
Created attachment 288941: xen-unstable 14280:42b29f084c31 ported to 2.6.18-53.el5
Created attachment 291769: [NET] link_watch: Always schedule urgent events. I have rolled up the commits d9568ba91b1fdd1ea4fdbf9fcc76b867cca6c1d5 and db0ccffed91e234cad99a35f07d5a322f410baa2 into one and backported it to RHEL5.
Assigning and setting flags.
This is causing problems for at least one customer who performs failback in a clustered configuration. We would like this in 5.2. "This is really killing us because it makes zero-downtime failback impossible - we are seeing 30-60s loss of connectivity until the ARP cache expires."
This needs a matching bug for 4.6, as we are seeing it in 4.6 DomUs. - Nick
Nick: BZ 429930 is the RHEL4 clone of this bug, and I've attached the equivalent RHEL4 patch for it. We're in the process of live migration testing RHEL5.2-ish and RHEL4.7-ish kernels with the respective patches. Once verified, I'll post the RHEL4 patch (for 4.7). If it is needed for 4.6.z, please raise flags for that additional effort. - Don
The patch is in 2.6.18-74.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5
We (our customer) have just tested the updated kernels and cannot confirm that this issue is fixed. The network blackout is shorter, about 15 seconds (compared to 1-3 minutes before updating the kernel), but we expected it to be much shorter (1 second?).
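To make blackout figures like these comparable between test runs, one option is to log fixed-interval probes (for example pings from another host) during the migration and compute the longest gap between successful replies. The sketch below assumes a hypothetical input format, a time-ordered list of (send time in seconds, reply received) pairs; the function name `longest_blackout` is made up and this is not the method used for the numbers above. The result is an upper bound on the outage, accurate to within about one probe interval on each side.

```python
from typing import Iterable, Optional, Tuple


def longest_blackout(probes: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest outage, in seconds, seen in a probe log.

    probes: (send_time_seconds, got_reply) pairs in time order.
    An outage is measured from the last successful probe before a run of
    failures to the first successful probe after it. Failures at the very
    start or end of the log are ignored because they are not bounded by a
    success on both sides.
    """
    longest = 0.0
    last_ok: Optional[float] = None
    saw_failure = False

    for sent_at, ok in probes:
        if ok:
            if last_ok is not None and saw_failure:
                longest = max(longest, sent_at - last_ok)
            last_ok = sent_at
            saw_failure = False
        else:
            saw_failure = True

    return longest


if __name__ == "__main__":
    # Synthetic example: one probe per second, replies lost from t=3 to t=17.
    log = [(float(t), not (3 <= t <= 17)) for t in range(21)]
    print(longest_blackout(log))  # 16.0 (last reply at t=2, next at t=18)
```

With a one-second probe interval, a reported gap of 16 seconds corresponds to a real outage of somewhere between roughly 14 and 16 seconds, so a shorter interval is needed to verify a sub-second target.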
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html