Red Hat Bugzilla – Bug 479657
Losing network after live migration.
Last modified: 2010-10-23 02:54:21 EDT
Description of problem:
After a live migration of a Xen guest, there is 100% packet loss for approximately 1 minute.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. ping <domU>, ping <from domU to other host>
2. xm migrate -l <domU> <to_host>
Actual results:
100% packet loss for approximately 1 minute from both ends.
Expected results:
0% packet loss.
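The reproduction steps above could be scripted roughly as follows. This is only a sketch: the function names `loss_pct` and `measure_migration_loss`, the 120-packet count, and the 5-second baseline are assumptions made for illustration, and the three arguments stand in for the `<domU>` and `<to_host>` placeholders used in the steps.

```shell
#!/bin/sh
# Extract the loss percentage from a ping summary line read on stdin.
loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Ping the guest in the background while it is live-migrated, then print
# the packet loss that ping reported.
# $1 = guest name, $2 = guest address, $3 = destination host (placeholders).
measure_migration_loss() {
    guest_name=$1
    guest_addr=$2
    target_host=$3
    log=$(mktemp)
    ping -c 120 "$guest_addr" > "$log" 2>&1 &
    ping_pid=$!
    sleep 5                      # collect a short baseline before migrating
    xm migrate -l "$guest_name" "$target_host"
    wait "$ping_pid"
    loss_pct < "$log"
    rm -f "$log"
}

# Run only when called with exactly three arguments.
if [ "$#" -eq 3 ]; then
    measure_migration_loss "$@"
fi
```

Run from a third machine (or from dom0) as, for example, `sh measure.sh <domU> <domU address> <to_host>`; the script name is hypothetical.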
I don't know if this is a duplicate of Bug 441716; however, that report says it should be fixed in RHEL 5.3.
The fixes to ensure ARP flush require that both the host & guest be running RHEL-5.3 kernels. Your guest kernel -92.1.17 is thus too old: at least one fix made after that release affects the guest:
* Thu Aug 28 2008 Don Zickus <email@example.com> [2.6.18-107.el5]
- [xen] xennet: coordinate ARP with backend network status (Herbert Xu ) 
Please upgrade your guest to latest kernels too & try migration again.
OK so now I've upgraded guest kernel:
[root@s0157 ~]# uname -a
Linux s0157.sss.se.scania.com 2.6.18-120.el5xen #1 SMP Fri Oct 17 18:17:26 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
Have done a live migration but still get 100% packet loss.
I'm running 2.6.18-128.el5xen on both host and PV guest and am experiencing packet loss during live migrations still. It's not 100%, which is good, but it's still enough to lose connections. 52 packets transmitted, 37 received, 28% packet loss
Mark, any chance you can let me into the boxes to have a look? Thanks!
Herbert, once again, these are "disconnected" systems. Fortunately it's easier at this .gov to get any logs, config files, or any other kinds of reports you need. This is another VLAN environment (without bonds), but I'm bridging to the host's VLAN interface, so it's always up, running, and tagged.
OK, please do a packet dump on the host & guest for all the relevant interfaces to see whether we're generating a gratuitous ARP and if so, how far it got. You can test save/resume to make it easier. Thanks!
It seems like the guest never loses its network link (ethtool eth0) and therefore never sends the fake ARP. We are running in a bridged environment.
Could you please show me the tcpdump commands you used and their respective output when you resumed the guest? Thanks!
Created attachment 331687 [details]
tcpdump from Dom0 showing that no fake arp is sent.
Created attachment 331688 [details]
tcpdump from the guest showing that no fake arp is sent.
(In reply to comment #10)
> Created an attachment (id=331687) [details]
> tcpdump from Dom0 showing that no fake arp is sent.
I'm working with Christian at Scania on this problem.
Christian, when you said that you're using bridging, I presume the bridging is in dom0, right? In any case the fake ARP logic in domU is wired directly into the netfront driver so it should always send it if a suspend/resume occurs.
Jimmy, can you give me the actual commands you used to get those dumps? Also, what commands did you use to suspend and resume?
I used "tcpdump -ni br603 -w pause-unpause" from Dom0 and "tcpdump -ni eth0 -w guest-pause-unpause" in the DomU.
To pause I did 'xm pause s0220', waited a while and then 'xm unpause s0220'. Dumping on the two Dom0s while doing a migration with the cluster tools (clusvcm? -M vm:s0220 -m xen4-1) showed no fake ARPs either.
(In reply to comment #13)
> Christian, when you said that you're using bridging, I presume the bridging is
> in dom0, right? In any case the fake ARP logic in domU is wired directly into
> the netfront driver so it should always send it if a suspend/resume occurs.
> Jimmy, can you give me the actual commands you used to get those dumps? Also,
> what commands did you use to suspend and resume?
Yes, the bridging and VLAN tagging are done in dom0.
Aha, please use save/restore instead of pause/unpause, since the latter does not change link state and isn't a good simulation of live migration.
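A save/restore cycle of the kind suggested here might look like the sketch below. `xm save` and `xm restore` are the standard xm subcommands; the helper names, the `GUEST` and `CHECKPOINT_DIR` variables, and the `/var/lib/xen/save` default directory are assumptions for illustration.

```shell
#!/bin/sh
# Build the checkpoint file path for a guest.
# CHECKPOINT_DIR is an assumed variable; the default directory is a guess.
checkpoint_path() {
    printf '%s/%s.chk\n' "${CHECKPOINT_DIR:-/var/lib/xen/save}" "$1"
}

# Save the guest to disk, then restore it from the same file. Unlike
# pause/unpause, this changes the link state in the guest.
save_restore_cycle() {
    file=$(checkpoint_path "$1")
    xm save "$1" "$file" && xm restore "$file"
}

# Run only when a guest name is supplied, e.g. GUEST=s0220 sh cycle.sh
if [ -n "${GUEST:-}" ]; then
    save_restore_cycle "$GUEST"
fi
```

Start the packet captures before running the cycle so that any ARP sent on restore is recorded.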
(In reply to comment #16)
> Aha, please use save/restore instead of pause/unpause since the latter do not
> change link state and isn't a good simulation of live migration.
We have now tried save/restore instead and can see the fake ARP at the vif28.0 interface in dom0.
However, we cannot see it at the bridge vlanbr610, at the VLAN interface (xenbr0.610), or at bond0 in dom0.
Below is the output of brctl show:
root@xen9-2:/sssjtz# brctl show
bridge name     bridge id               STP enabled     interfaces
vlanbr610       8000.001f2956eb38       no              vif28.0
xenbr0          8000.001f2956eb38       no              bond0
Johan (working with Christian at Scania)
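To see how far the fake ARP travels, one capture per hop can be run in parallel, as in the sketch below. The interface names come from this comment; the function names, the `HOPS` and `CAPTURE` variables, and the capture file naming are assumptions. On bond0 the frames may still carry a VLAN tag, in which case the filter there would probably need to be `vlan and arp` rather than `arp`.

```shell
#!/bin/sh
# Interfaces from the report, ordered from the guest's backend outwards.
HOPS="${HOPS:-vif28.0 vlanbr610 xenbr0.610 bond0}"

# Name the capture file for one interface.
capture_file() {
    printf 'arp-%s.pcap\n' "$1"
}

# Start one ARP-only tcpdump per interface and remember the PIDs.
start_captures() {
    pids=""
    for ifc in $HOPS; do
        tcpdump -n -i "$ifc" -w "$(capture_file "$ifc")" arp &
        pids="$pids $!"
    done
}

# Stop all captures and wait for tcpdump to flush its files.
stop_captures() {
    kill $pids 2>/dev/null
    wait
}

# Run only when explicitly requested: CAPTURE=yes sh hops.sh
if [ "${CAPTURE:-no}" = "yes" ]; then
    start_captures
    echo "Capturing; restore or migrate the guest now, then press Enter."
    read -r reply
    stop_captures
    # Print the number of ARP packets seen at each hop.
    for ifc in $HOPS; do
        printf '%s: ' "$ifc"
        tcpdump -n -r "$(capture_file "$ifc")" 2>/dev/null | wc -l
    done
fi
```

A hop that reports zero packets while the previous one reports some is where the ARP is being dropped.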
Johan, could you please get a timestamped version of the kernel logs (easiest is to strace klogd with -tt), plus a timestamped capture of the packet arriving at vif28.0? I'd like to confirm that the carrier event on vif28.0 was processed before the fake ARP was received. Thanks!
Ah! For live migration to function properly we need to set the bridge forward delay (using brctl or /sys/class/net/...) to 0. Otherwise the default is to wait 15 seconds.
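A minimal sketch of that setting follows, using the sysfs file when it is writable and `brctl setfd` otherwise. The bridge name `vlanbr610` is taken from the brctl output earlier in this report; the function names and the `BRIDGE` and `APPLY` variables are assumptions. The change is not persistent across reboots or bridge re-creation, so it would also need to go into whatever script creates the bridge.

```shell
#!/bin/sh
# Bridge name from the brctl output in this report; adjust for your host.
BRIDGE="${BRIDGE:-vlanbr610}"

# Print the sysfs file that holds the forward delay for a bridge.
fd_path() {
    printf '/sys/class/net/%s/bridge/forward_delay\n' "$1"
}

# Set the forward delay to 0, preferring sysfs and falling back to brctl.
set_fd_zero() {
    path=$(fd_path "$1")
    if [ -w "$path" ]; then
        echo 0 > "$path"
    else
        brctl setfd "$1" 0
    fi
}

# Act only when explicitly asked: APPLY=yes sh setfd.sh (as root in dom0)
if [ "${APPLY:-no}" = "yes" ]; then
    set_fd_zero "$BRIDGE"
    cat "$(fd_path "$BRIDGE")"   # show the resulting value
fi
```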
I think we've now documented this sufficiently here:
Therefore, I'm going to close this as CURRENTRELEASE.