Xen live migration is broken for HVM (Windows) instances with PV drivers. The script that runs on the target side to set up networking assumes the guest has been using an emulated network device even if the instance actually has a PV network device set up. This has already been run past Paolo Bonzini, who said to have a BZ opened on it:

08:18:42 >> bonzini<< godfather: seems wrong indeed...
08:18:48 >> bonzini<< godfather: open bz for xen

Comments from the customer:

The migrated instance never comes up. The problem is that the arguments with which xend launches 'qemu-dm' are incorrect. Our network config is PV w/routing. There are NO bridges in our configuration. Yet 'xend' insists on specifying network params to create the tap/bridge config.

The problem seems to be that the configuration transmitted from the source host to the dest host does not contain the information that the vif type is 'front'. The 'xend' code then says, "Oh, I must be ioemu!" and sets up incorrect args. The pertinent code appears to be:

-- /usr/lib64/python2.4/site-packages/xen/xend/image.py --
        if name == 'vif':
            type = sxp.child_value(info, 'type')
            if type is None:
                type = "ioemu"
            if type != 'ioemu':
                continue
            nics += 1
            mac = sxp.child_value(info, 'mac')
            if mac == None:
                mac = randomMAC()
            bridge = sxp.child_value(info, 'bridge', 'xenbr0')
            model = sxp.child_value(info, 'model', 'rtl8139')
            ret.append("-net")
            ret.append("nic,vlan=%d,macaddr=%s,model=%s" %
                       (nics, mac, model))
            ret.append("-net")
            ret.append("tap,vlan=%d,bridge=%s" % (nics, bridge))
    return ret
--

type is *definitely* 'None'. I believe that the code on the source host does not correctly specify the 'front' type when it is sending the configuration across to the target host.

Reproducer available: the following lab is set up.

Caveat 1: Please try not to reboot 10.65.208.84. It has a few other tests going on.
Caveat 2: The winxp guest is verified working from both Xen Dom0s. However, it is an FV guest.

1. NFS server at 10.65.210.68
   ssh: root/redhat
   exports /var/lib/xen/images, which has a Windows XP vm called winxp
2. Xen Dom0 at 10.65.208.84
   ssh: root/redhat
   vnc: 10.65.208.84:2/redhat
   mounts /var/lib/xen/images from the nfs server
3. Xen Dom0 at 10.65.210.208
   ssh: root/redhat
   vnc: 10.65.210.208:1/redhat
   mounts /var/lib/xen/images from the nfs server
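The fall-through described above can be demonstrated in isolation. Below is a minimal sketch (not the real xend code) that mimics sxp.child_value and the vif branch of the quoted argument builder: a vif entry with no 'type' child defaults to ioemu and produces emulated -net arguments, while a vif with type=netfront is skipped.

```python
# Minimal sketch mimicking xend's vif handling (NOT the real xend code).
# An sxp entry is modelled as a list: ('vif', ('mac', '...'), ('type', '...'), ...)

def child_value(sxpr, name, default=None):
    """Return the value of the first child named `name`, like sxp.child_value."""
    for child in sxpr[1:]:
        if isinstance(child, tuple) and child[0] == name:
            return child[1]
    return default

def vif_args(devices):
    """Build qemu-dm -net arguments the way the quoted image.py code does."""
    ret, nics = [], 0
    for info in devices:
        if info[0] != 'vif':
            continue
        vif_type = child_value(info, 'type')
        if vif_type is None:
            vif_type = "ioemu"       # the problematic default
        if vif_type != 'ioemu':
            continue                 # PV-only vifs need no emulated NIC
        nics += 1
        mac = child_value(info, 'mac')
        bridge = child_value(info, 'bridge', 'xenbr0')
        model = child_value(info, 'model', 'rtl8139')
        ret += ["-net", "nic,vlan=%d,macaddr=%s,model=%s" % (nics, mac, model),
                "-net", "tap,vlan=%d,bridge=%s" % (nics, bridge)]
    return ret

# A vif that lost its 'type' in transit is treated as emulated:
print(vif_args([('vif', ('mac', '00:16:36:61:5d:bf'))]))
# A vif that kept type=netfront is correctly skipped:
print(vif_args([('vif', ('type', 'netfront'), ('mac', '00:16:36:61:5d:bf'))]))
```

The two calls show exactly the customer's symptom: drop the type field and the restore side fabricates tap/bridge arguments even for a bridge-less PV setup.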
I'm passing this to the Xen component, since it looks like qemu-dm is incorrectly invoked.
Bill, you're writing about lab machines at:

1. NFS server at 10.65.210.68
   ssh: root/redhat
   exports /var/lib/xen/images, which has a Windows XP vm called winxp
2. Xen Dom0 at 10.65.208.84
   ssh: root/redhat
   vnc: 10.65.208.84:2/redhat
   mounts /var/lib/xen/images from the nfs server
3. Xen Dom0 at 10.65.210.208
   ssh: root/redhat
   vnc: 10.65.210.208:1/redhat
   mounts /var/lib/xen/images from the nfs server

Unfortunately, although I can log into machine 2 (.84), I can't access the image for winxp. I can't even ping machines 1 (.68) and 3 (.208). Can you give me access to an environment where I can reproduce it?

Thanks,
Michal
Well, I did testing on this one and it doesn't really seem to have any issues. Here are the qemu-dm arguments.

Guest started:
/usr/lib64/xen/bin/qemu-dm -d 10 -m 1024 -boot c -serial pty -vcpus 1 -acpi -k en-us -domain-name WinXP-32fv -net nic,vlan=1,macaddr=00:16:36:61:5d:bf,model=rtl8139 -net tap,vlan=1,bridge=xenbr0 -vnc 127.0.0.1:10 -vncunused

After migration to host B (non-live):
/usr/lib64/xen/bin/qemu-dm -d 18 -m 1024 -boot c -serial pty -vcpus 1 -acpi -k en-us -domain-name WinXP-32fv -net nic,vlan=1,macaddr=00:16:36:61:5d:bf,model=rtl8139 -net tap,vlan=1,bridge=xenbr0 -vnc 0.0.0.0:18 -vncunused -loadvm /var/lib/xen/qemu-save-18.img

After migration to host A again (live):
/usr/lib64/xen/bin/qemu-dm -d 19 -m 1024 -boot c -serial pty -vcpus 1 -acpi -k en-us -domain-name WinXP-32fv -net nic,vlan=1,macaddr=00:16:36:61:5d:bf,model=rtl8139 -net tap,vlan=1,bridge=xenbr0 -vnc 0.0.0.0:19 -vncunused -loadvm /var/lib/xen/qemu-save-19.img

I pinged the guest the whole time and it kept working fine. From the guest I tried to ping another machine and that worked fine as well, so I can't really reproduce it when the guest is using PV drivers.

Bill, could the customer please retest using the packages from:
http://people.redhat.com/mrezanin/xen/

Thanks,
Michal
You probably cannot see the bug because you're using type=ioemu. With type=netfront (anything but ioemu behaves the same) and bridges, you should already be able to see the wrong qemu-dm command line. I suggest you do try with type=netfront and with the bridge, since that's much easier to set up, but if the bug does not reproduce that way, please try without the bridge as well.
Well, I did try setting type=netfront, and these are the results.

Guest started with:
/usr/lib64/xen/bin/qemu-dm -d 12 -m 1024 -boot c -serial pty -vcpus 1 -acpi -k en-us -domain-name WinXP-32fv -vnc 127.0.0.1:12 -vncunused

After migration:
/usr/lib64/xen/bin/qemu-dm -d 20 -m 1024 -boot c -serial pty -vcpus 1 -acpi -k en-us -domain-name WinXP-32fv -net nic,vlan=1,macaddr=00:16:36:61:5d:bf,model=rtl8139 -net tap,vlan=1,bridge=xenbr0 -vnc 0.0.0.0:20 -vncunused -loadvm /var/lib/xen/qemu-save-20.img

So this is not right. The problem seems to be in the save handling, since only:

(device (vif (backend 0) (script vif-bridge) (bridge xenbr0) (mac 00:16:36:61:5d:bf) (vifname vif4.0)))

is being saved to the save file, and there's no evidence of the type, so I'm going to work on this one.

Michal
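To illustrate the diagnosis above, here is a simplified sketch (not the actual xend save path; the field whitelist is an assumption for illustration) of how a save routine that serializes only a fixed set of vif fields silently drops the type, so the restored sxp looks exactly like the one quoted:

```python
# Simplified sketch of a save path that loses the vif type (NOT the real xend code).

SAVED_VIF_FIELDS = ('backend', 'script', 'bridge', 'mac', 'vifname')  # note: no 'type'

def save_vif(config):
    """Serialize a vif config dict into an sxp-like list, keeping only listed fields."""
    sxpr = ['vif']
    for field in SAVED_VIF_FIELDS:
        if field in config:
            sxpr.append([field, config[field]])
    return sxpr

config = {'backend': 0, 'script': 'vif-bridge', 'bridge': 'xenbr0',
          'mac': '00:16:36:61:5d:bf', 'vifname': 'vif4.0',
          'type': 'netfront'}        # set on the source host...

saved = save_vif(config)
# ...but absent from what reaches the target host:
print(saved)
```

The serialized result matches the quoted save-file contents field for field, with 'type' missing, which is why the restore side then falls back to ioemu.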
Created attachment 436325 [details]
Patch to pass vif type to save handling

Well, the issue is that when the vif type is not explicitly set, it falls back to None on save, and on restore the None value is treated as ioemu, which is wrong. With this patch applied, migrations and restores work fine: restoring images saved with the patch applied works, and migrating from newer versions (with this patch applied) to older versions of Xen works as well. The attached patch fixes this. It's a backport of upstream c/s 15972.

Michal
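The effect of the fix can be sketched as a round trip. This is a simplified model, not the actual c/s 15972 backport: once the save side records the vif type, the restore side sees 'netfront' instead of None and no longer builds emulated NIC arguments.

```python
# Simplified round-trip sketch of the fix (NOT the actual backported patch).

def save_vif(config, include_type):
    """Serialize a vif config; include_type=True models the patched save path."""
    fields = ['bridge', 'mac']
    if include_type:
        fields.append('type')
    return ['vif'] + [[f, config[f]] for f in fields if f in config]

def restore_type(sxpr):
    """What the restore side sees: the vif type, defaulting to ioemu when missing."""
    for child in sxpr[1:]:
        if child[0] == 'type':
            return child[1]
    return 'ioemu'

config = {'bridge': 'xenbr0', 'mac': '00:16:36:61:5d:bf', 'type': 'netfront'}

# Unpatched: the type is lost, so restore falls back to ioemu and would emit -net args.
print(restore_type(save_vif(config, include_type=False)))
# Patched: the type survives, so restore skips the emulated NIC setup.
print(restore_type(save_vif(config, include_type=True)))
```

Note that an image saved without the type field still restores (it just falls back to ioemu), which matches the comment's point about compatibility with older Xen versions.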
Test details:
vif type=netfront
WinXP guest with PV drivers

Could reproduce this bug with xen-3.0.3-114.el5 and verified this bug with xen-3.0.3-115.el5. Details as follows:

Guest started in host A with:
/usr/lib64/xen/bin/qemu-dm -d 11 -m 512 -boot c -vcpus 4 -acpi -domain-name vm1 -vnc 0.0.0.0:11 -vncunused

After migration (live) to host B:
/usr/lib64/xen/bin/qemu-dm -d 4 -m 512 -boot c -vcpus 4 -acpi -domain-name vm1 -vnc 0.0.0.0:4 -vncunused -loadvm /var/lib/xen/qemu-save-4.img

After migration back (non-live) to host A:
/usr/lib64/xen/bin/qemu-dm -d 12 -m 512 -boot c -vcpus 4 -acpi -domain-name vm1 -vnc 0.0.0.0:12 -vncunused -loadvm /var/lib/xen/qemu-save-12.img

So moving the bug to VERIFIED.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html