Bug 1298552 - migration error for VMs with rhel6 machine types
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.2
Hardware: x86_64 Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Dr. David Alan Gilbert
QA Contact: Virtualization Bugs
Docs Contact:
Depends On:
Blocks: RHEV3.6Upgrade 1293566 1302742
Reported: 2016-01-14 07:26 EST by Michal Skrivanek
Modified: 2017-03-15 11:36 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1293566
Environment:
Last Closed: 2016-01-20 06:14:26 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Michal Skrivanek 2016-01-14 07:26:59 EST
This was originally a RHEV bug seen during the upgrade from el6 to el7, when VMs using the rhel-6.5.0 machine type are run on 7.2 hosts.

Perhaps a gpxe change in bug 1231931 is related, as this used to work just fine.


+++ This bug was initially created as a clone of Bug #1293566 +++

Description of problem:
Migration failed with error - libvirtError: internal error: early end of file from monitor: possible problem:
2015-12-22T07:29:04.742599Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2015-12-22T07:29:04.742812Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config
2015-12-22T07:29:05.023584Z qemu-kvm: Length mismatch: 0000:00:03.0/virtio-net-pci.rom: 0x20000 in != 0x40000: Invalid argument
2015-12-22T07:29:05.023620Z qemu-kvm: error while loading state for instance 0x0 of device 'ram'
2015-12-22T07:29:05.023722Z qemu-kvm: load of migration failed: Invalid argument

- Migration fails from the latest RHEL 7.2 (vdsm-4.17.13-1.el7ev.noarch) to
rhev-h 7.2 (20151218.2.el7ev, vdsm-4.17.13-1.el7ev.noarch) and vice versa.

- Note: both servers were upgraded from vdsm-4.16.30/31, and before the upgrade migration between the servers was successful. (Attaching a screenshot of the event log showing a successful migration between the servers just before the upgrade.)

Version-Release number of selected component (if applicable):
rhev-m 3.6.1.3-0.1.el6
vdsm-4.17.13-1.el7ev.noarch(both)
libvirt-1.2.17-13.el7_2.2.x86_64(both)
qemu-kvm-rhev-2.3.0-31.el7_2.4.x86_64(both)

Steps to Reproduce:
1. Try to migrate between rhel 7.2 latest vdsm 3.6.1.3 and rhev-h 7.2 latest vdsm 3.6.1.3 in both directions.


Actual results:
Migration fails with libvirt and NUMA errors in the vdsm logs.

Expected results:
Should work as expected.

--- Additional comment from Michael Burman on 2015-12-22 09:14 CET ---

Note: the migration fails in both directions, so both servers act as source and destination.

--- Additional comment from Michael Burman on 2015-12-22 09:19 CET ---

screenshots(event log UI) of the successful migration just before rhev-h upgrade.

--- Additional comment from Yaniv Kaul on 2015-12-27 10:09:47 CET ---

Does it happen without upgrade? Is it reproducible? Anything interesting in the VM configuration?

--- Additional comment from Michael Burman on 2015-12-27 14:11:51 CET ---

(In reply to Yaniv Kaul from comment #3)
> Does it happen without upgrade? Is it reproducible? Anything interesting in
> the VM configuration?

I saw this issue only as reported and described above (vdsm 3.5 > 3.6.1).
I didn't see it on 3.5.6/3.5.7, nor on 3.6.1/3.6.2 without an upgrade involved.
Nothing special in my VM configurations.

--- Additional comment from  on 2015-12-28 13:28:00 CET ---

I am also running into this bug, just without the NUMA warnings, when trying to migrate a VM from CentOS 6.7 to CentOS 7.2.

Dec 28 13:18:10 onode030231 journal: Domain id=10 name='vm-dtaffin-25796' uuid=86191e76-5765-4e77-b909-5d29150797b9 is tainted: hook-script
Dec 28 13:18:10 onode030231 systemd-machined: New machine qemu-vm-dtaffin-25796.
Dec 28 13:18:10 onode030231 systemd: Started Virtual Machine qemu-vm-dtaffin-25796.
Dec 28 13:18:10 onode030231 systemd: Starting Virtual Machine qemu-vm-dtaffin-25796.
Dec 28 13:18:10 onode030231 kvm: 2 guests now active
Dec 28 13:18:11 onode030231 kernel: int312: port 3(vnet1) entered disabled state
Dec 28 13:18:11 onode030231 kernel: device vnet1 left promiscuous mode
Dec 28 13:18:11 onode030231 kernel: int312: port 3(vnet1) entered disabled state
Dec 28 13:18:11 onode030231 journal: internal error: End of file from monitor
Dec 28 13:18:11 onode030231 journal: internal error: early end of file from monitor: possible problem:#0122015-12-28T12:18:11.038163Z qemu-kvm: Length mismatch: 0000:00:03.0/virtio-net-pci.rom: 0x20000 in != 0x40000: Invalid argument#0122015-12-28T12:18:11.038230Z qemu-kvm: error while loading state for instance 0x0 of device 'ram'#0122015-12-28T12:18:11.038334Z qemu-kvm: load of migration failed: Invalid argument
Dec 28 13:18:11 onode030231 kvm: 1 guest now active
Dec 28 13:18:11 onode030231 systemd-machined: Machine qemu-vm-dtaffin-25796 terminated.
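
The journal line above packs three qemu-kvm messages into one, separated by `#012`. A minimal sketch for splitting it back apart, assuming `#012` is the syslog daemon's octal escape for a newline (an assumption about the logging setup, not something stated in this report); the helper name `decode_syslog_escapes` is hypothetical:

```python
import re


def decode_syslog_escapes(line):
    """Expand '#ooo' octal escapes (e.g. '#012' for newline) and split on newlines.

    Assumption: the syslog daemon escaped control characters as '#' plus
    three octal digits before writing the line.
    """
    decoded = re.sub(r"#([0-7]{3})", lambda m: chr(int(m.group(1), 8)), line)
    return decoded.split("\n")


# Excerpt of the journal line from the destination host.
sample = (
    "internal error: early end of file from monitor: possible problem:"
    "#0122015-12-28T12:18:11.038163Z qemu-kvm: Length mismatch: "
    "0000:00:03.0/virtio-net-pci.rom: 0x20000 in != 0x40000: Invalid argument"
    "#0122015-12-28T12:18:11.038230Z qemu-kvm: error while loading state "
    "for instance 0x0 of device 'ram'"
)

for part in decode_syslog_escapes(sample):
    print(part)
```

Each printed line is then directly comparable with the unescaped libvirt log excerpts elsewhere in this bug.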


Destination 7.2 host:
qemu-kvm-ev-2.3.0-31.el7_2.3.1.x86_64
vdsm-4.17.13-0.el7.centos.noarch
libvirt-daemon-1.2.17-13.el7.x86_64

Source 6.7 host:
qemu-img-rhev-0.12.1.2-2.479.el6_7.2.x86_64
vdsm-4.16.27-0.el6.x86_64
libvirt-0.10.2-54.el6_7.3.x86_64

engine:
ovirt-engine-3.6.1.3-1.el6.noarch

--- Additional comment from  on 2015-12-29 11:32:38 CET ---

Just as additional information:

The same issue occurs when trying to migrate a VM between two CentOS 7.2 hosts.

both running identical versions:
qemu-kvm-ev-2.3.0-31.el7_2.3.1.x86_64
vdsm-4.17.13-0.el7.centos.noarch
libvirt-daemon-1.2.17-13.el7.x86_64

In case it matters: SELinux enforced on all hosts.

--- Additional comment from Michal Skrivanek on 2016-01-13 17:36:53 CET ---

(In reply to Michael Burman from comment #4)
> (In reply to Yaniv Kaul from comment #3)
> > Does it happen without upgrade? Is it reproducible? Anything interesting in
> > the VM configuration?
> 
> I saw this issue only as reported and described above.(vdsm 3.5 > 3.6.1)
> Didn't saw it on 3.5.6/3.5.7 and not on 3.6.1/3.6.2 without involving
> upgrade. 
> Nothing special on my VMs configurations.

So both hosts were upgraded; what about the engine? If yes, is it at cluster level 3.5 or 3.6?
Were the VMs still running? No stop & start or anything?

--- Additional comment from Michal Skrivanek on 2016-01-13 17:39:19 CET ---

(In reply to dominique.taffin from comment #6)

Similar question to you. From your details it seems you are running that VM at cluster level 3.5; it was migrated from 6.7 to 7.2 and that failed (correct?); then a separate VM between two 7.2 hosts (correct?). When and where did you launch that VM? Any previous migrations?

--- Additional comment from Michael Burman on 2016-01-13 17:54:24 CET ---

Hi Michal,

Yes, both hosts were upgraded, as well as the engine ^^ to rhev-m 3.6.1.3-0.1.el6.

It was part of a whole upgrade cycle in a very mixed environment, please note it's been a while since reported.

First the engine was upgraded (from 3.5.7), then I upgraded my 2 servers to 3.6 vdsm, and I think that I upgraded my cluster level to 3.6 (but I can't really be sure; this setup no longer exists in the reported state, and maybe the cluster level was left at 3.5).

VMs were still running(no stop/start), 1 VM on each host.

--- Additional comment from  on 2016-01-14 08:36:12 CET ---

Hello,

(In reply to Michal Skrivanek from comment #8)

> similar question to you. But from your details it seems like you are running
> that VM in 3.5 cluster level. migrated from 6.7 to 7.2 and it failed
> (correct?); then a separate vm between two 7.2 hosts(correct?) - when and
> where did you launch that vm? any previous migrations?


correct. 

Background: we have a large infrastructure with several thousand VMs running on 3.5.7, cluster level 3.5. We need to migrate those step by step, without downtime, to oVirt 3.6.x.

our migration step is: 
- update engine to latest 3.6.x
- move some CentOS 6 hosts of an old cluster (running in 3.5 level) to maintenance, reinstall them using CentOS 7.2 and 3.6.x ovirt packages.
- put CentOS 7.2 hosts in new cluster, migrate some VMs from old cluster to new one.
- repeat steps until all VMs / hosts are in new cluster.


Using the latest qemu-kvm-ev version we are now able to migrate VMs that have been launched on CentOS 7 between CentOS 7 hosts, but are still not able to migrate between CentOS 6 and CentOS 7 hosts, meaning we are blocked.

Please let me know what information I can provide in order to assist you.

--- Additional comment from Michal Skrivanek on 2016-01-14 10:09:43 CET ---

(In reply to dominique.taffin from comment #10)
 > background: We do have a large infrastructure with several thousand VMs
> runinng on 3.5.7, cluster level 3.5. We do need to migrate those step by
> step without downtime to oVirt 3.6.x.

That's quite a few. Did you consider automation via the REST API, or is everything manual only?
 
> our migration step is: 
> - update engine to latest 3.6.x
> - move some CentOS 6 hosts of an old cluster (running in 3.5 level) to
> maintenance, reinstall them using CentOS 7.2 and 3.6.x ovirt packages.
> - put CentOS 7.2 hosts in new cluster, migrate some VMs from old cluster to
> new one.

can you please confirm that cluster settings are exactly the same between both?
It needs to match not only the actual cluster level, but all other properties as well

> Using the latest qemu-kvm-ev version we are now able to migrate VMs that
> have been launched on CentOS 7 between CentOS 7 hosts, but are still not

So, unlike Michael's case, migration between 7.2 and 7.2 works OK? Is that cross-cluster (3.5->3.5) or within a cluster (3.5)?

--- Additional comment from  on 2016-01-14 10:19:41 CET ---

(In reply to Michal Skrivanek from comment #11)
> 
> that's quite a few - did you consider automation via REST API or everything
> manual only?
Mainly manual over several weeks as we do need to move host by host.

 
> can you please confirm that cluster settings are exactly the same between
> both?
> It needs to match not only the actual cluster level, but all other
> properties as well
I will recheck to verify and come back to you on this. AFAIK everything is identical.

 
> so unlike Michael's case migration between 7.2 and 7.2 works ok? is that
> cross-cluster(3.5->3.5) or within cluster(3.5)?
migration within cluster(3.5 level / 7.2 hosts). Cross Cluster 3.5/7.2 not tested.

--- Additional comment from  on 2016-01-14 10:29:12 CET ---

Verified again, all cluster settings are identical.

here a current libvirt log entry for an example VM that fails:

2016-01-14 09:25:24.171+0000: starting up libvirt version: 1.2.17, package: 13.el7 (CentOS BuildSystem <http://bugs.centos.org>, 2015-11-20-16:24:10, worker1.bsys.centos.org), qemu version: 2.3.0 (qemu-kvm-ev-2.3.0-31.el7_2.4.1)
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=spice /usr/libexec/qemu-kvm -name vm-dtaffin-25796 -S -machine rhel6.5.0,accel=kvm,usb=off -cpu Westmere -m 1024 -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -uuid 86191e76-5765-4e77-b909-5d29150797b9 -smbios type=1,manufacturer=oVirt,product=oVirt Node,version=6-7.el6.centos.12.3,serial=32393735-3733-5A43-3332-303235575250,uuid=86191e76-5765-4e77-b909-5d29150797b9 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-vm-dtaffin-25796/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2016-01-14T09:25:23,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x5 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x6 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/00000002-0002-0002-0002-00000000037c/2dfc3bc7-ec09-4efa-82fb-0615b1f7c1d0/images/2d526a9a-43c4-4d5b-99bf-3460d2aceb01/d8ddaf83-3fec-439e-931b-a5d89eb1b05d,if=none,id=drive-virtio-disk0,format=raw,serial=2d526a9a-43c4-4d5b-99bf-3460d2aceb01,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2 -netdev tap,fd=29,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:03:00:10,bus=pci.0,addr=0x3,bootindex=1 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/86191e76-5765-4e77-b909-5d29150797b9.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev 
socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/86191e76-5765-4e77-b909-5d29150797b9.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice port=5902,tls-port=5903,addr=10.76.98.160,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -device qxl-vga,id=video0,ram_size=67108864,vram_size=33554432,vgamem_mb=16,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -incoming tcp:[::]:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
Domain id=7 is tainted: hook-script
2016-01-14T09:25:24.603034Z qemu-kvm: Length mismatch: 0000:00:03.0/virtio-net-pci.rom: 0x20000 in != 0x40000: Invalid argument
2016-01-14T09:25:24.603108Z qemu-kvm: error while loading state for instance 0x0 of device 'ram'
2016-01-14T09:25:24.603193Z qemu-kvm: load of migration failed: Invalid argument
2016-01-14 09:25:24.625+0000: shutting down

--- Additional comment from Michal Skrivanek on 2016-01-14 10:55:53 CET ---

(In reply to Michael Burman from comment #9)
> Hi Michal,
> 
> Yes, both hosts were upgraded, as well the engine ^^ to rhev-m
> 3.6.1.3-0.1.el6.
> 
> It was part of a whole upgrade cycle in a very mixed environment, please
> note it's been a while since reported.
> 
> First the engine was upgraded(from 3.5.7), then i upgraded my 2 servers to
> 3.6 vdsm, and i think that i upgraded my cluster level to 3.6(but i can't
> really be sure, this setup no longer exists in the reported status and maybe
> the cluster level left on 3.5)
> 
> VMs were still running(no stop/start), 1 VM on each host.

I've reviewed the logs and I wonder if it's the same issue or not. In your case the vms started with the new machine type (i.e. in upgraded cluster level 3.6) and were not running. E.g. vm-n2 was shut down as 3.5 VM and then started as a 3.6 VM properly (not via migration)
Also, your hosts have different TZ set so it's a bit difficult to troubleshoot logs.
That said, the last migration on vm-n2 should not have failed.

--- Additional comment from Michal Skrivanek on 2016-01-14 11:01:15 CET ---

(In reply to dominique.taffin from comment #13)
> Verified again, all cluster settings are identical.
> 
> here a current libvirt log entry for an example VM that fails:
...
> 2016-01-14T09:25:24.603034Z qemu-kvm: Length mismatch:
> 0000:00:03.0/virtio-net-pci.rom: 0x20000 in != 0x40000: Invalid argument
> 2016-01-14T09:25:24.603108Z qemu-kvm: error while loading state for instance
> 0x0 of device 'ram'
> 2016-01-14T09:25:24.603193Z qemu-kvm: load of migration failed: Invalid
> argument
> 2016-01-14 09:25:24.625+0000: shutting down

After comparing with comment #14 it looks similar, but there is a difference in machine type. Michael's VM is 3.6 and yours is 3.5 (which is correct and consistent with what you described).
We need to retest qemu migration support. I suppose it happens when the VM has a NIC, right? Can you quickly test a VM without any? That would be helpful.
Thanks a lot!

--- Additional comment from  on 2016-01-14 11:08:02 CET ---

(In reply to Michal Skrivanek from comment #15)
> We need to retest qemu migration support. I suppose it happens when the VM
> has a NIC, right? Can you quickly test a VM without any? That would be
> helpful

All of our VMs have at least 1 NIC; depending on customer request, also 2 NICs per VM. I will deploy an identical VM and remove the NIC. Please note that we also use PXE as the primary boot target, as all OS deployment is done via PXE. All our KVM NICs are VirtIO.

--- Additional comment from  on 2016-01-14 11:20:40 CET ---

migration without NIC is working.

I noted that the location and filename of the PXE files on the hypervisor are different. But I assume it does not matter, as the newer qemu-kvm-ev should be built with the correct paths.
CentOS 6: /usr/share/gpxe/virtio-net.rom
CentOS 7: /usr/share/qemu-kvm/rhel6-virtio.rom (/usr/share/ipxe/1af41000.rom)

libvirt log for the successful migration (without NIC):

2016-01-14 10:15:37.592+0000: starting up libvirt version: 1.2.17, package: 13.el7 (CentOS BuildSystem <http://bugs.centos.org>, 2015-11-20-16:24:10, worker1.bsys.centos.org), qemu version: 2.3.0 (qemu-kvm-ev-2.3.0-31.el7_2.4.1)
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=spice /usr/libexec/qemu-kvm -name vm-dtaffin-26037 -S -machine rhel6.5.0,accel=kvm,usb=off -cpu Westmere -m 1024 -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -uuid d28b8835-c360-418b-b45d-5842df1765e6 -smbios type=1,manufacturer=oVirt,product=oVirt Node,version=6-7.el6.centos.12.3,serial=32393735-3733-5A43-3332-303235575245,uuid=d28b8835-c360-418b-b45d-5842df1765e6 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-vm-dtaffin-26037/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2016-01-14T10:15:37,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x5 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x6 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/00000002-0002-0002-0002-00000000037c/0bb94892-4574-4d7f-a514-478999af10a0/images/a2545f6b-141a-4769-9151-39276e76ba16/235662a0-2845-43f7-b61a-c9e613bca557,if=none,id=drive-virtio-disk0,format=raw,serial=a2545f6b-141a-4769-9151-39276e76ba16,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/d28b8835-c360-418b-b45d-5842df1765e6.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/d28b8835-c360-418b-b45d-5842df1765e6.org.qemu.guest_agent.0,server,nowait -device 
virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice port=5902,tls-port=5903,addr=10.76.98.160,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -device qxl-vga,id=video0,ram_size=67108864,vram_size=33554432,vgamem_mb=16,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -incoming tcp:[::]:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
Domain id=8 is tainted: hook-script
copying E and F segments from pc.bios to pc.ram
copying C and D segments from pc.rom to pc.ram


best,
 Dominique

--- Additional comment from Michal Skrivanek on 2016-01-14 11:25:44 CET ---

(In reply to Michael Burman from comment #9)

Meital, we would need a local reproducer ASAP. Thanks.

--- Additional comment from  on 2016-01-14 11:27:59 CET ---

I think I found it:


The PXE ROMs have different md5sums.

I copied the PXE ROM file from CentOS 6 to the CentOS 7 machine:

CentOS 6 Source: /usr/share/gpxe/virtio-net.rom

CentOS 7.2 Destinations (yes, no symlink, but 2 copies for testing):
/usr/share/qemu-kvm/rhel6-virtio.rom 
/usr/share/ipxe/1af41000.rom

and the migration seems to work. I will have to test it some more with other VMs.
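
The ROM comparison described above can be sketched as a small script that reports the size and md5sum of each ROM path mentioned in this comment. This is only an illustration: the helper name `rom_fingerprint` is hypothetical, and paths that do not exist on a given host are simply reported as missing.

```python
import hashlib
from pathlib import Path


def rom_fingerprint(path):
    """Return (size_in_bytes, md5_hex) for a ROM file, or None if it is missing."""
    p = Path(path)
    if not p.is_file():
        return None
    data = p.read_bytes()
    return len(data), hashlib.md5(data).hexdigest()


# Paths taken from this comment; which ones exist depends on the host OS.
ROM_PATHS = [
    "/usr/share/gpxe/virtio-net.rom",        # CentOS 6 source
    "/usr/share/qemu-kvm/rhel6-virtio.rom",  # CentOS 7.2 destination
    "/usr/share/ipxe/1af41000.rom",          # CentOS 7.2 destination
]

if __name__ == "__main__":
    for path in ROM_PATHS:
        print(path, rom_fingerprint(path))
```

Running it on both the source and the destination host and comparing the output shows whether the files differ in size as well as in checksum.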

--- Additional comment from Francesco Romani on 2016-01-14 12:45:44 CET ---

(In reply to dominique.taffin from comment #17)
> migration without NIC is working.

Makes sense, because:

(In reply to Michael Burman from comment #0)
> Migration failed with error - libvirtError: internal error: early end of
> file from monitor: possible problem:
> 2015-12-22T07:29:04.742599Z qemu-kvm: warning: CPU(s) not present in any
> NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> 2015-12-22T07:29:04.742812Z qemu-kvm: warning: All CPU(s) up to maxcpus

This is a warning we need to check, but should not be critical

> should be described in NUMA config
> 2015-12-22T07:29:05.023584Z qemu-kvm: Length mismatch:
> 0000:00:03.0/virtio-net-pci.rom: 0x20000 in != 0x40000: Invalid argument
> 2015-12-22T07:29:05.023620Z qemu-kvm: error while loading state for instance
> 0x0 of device 'ram'
> 2015-12-22T07:29:05.023722Z qemu-kvm: load of migration failed: Invalid
> argument

This really looks like a qemu issue, in the upgrade path.

What Vdsm needs to guarantee is that the configuration of the VMs is consistent and correct.
I will now carefully check that Vdsm/Engine did the right thing and gave consistent configuration. If this is the case, we'll need to move the bug down the stack, to qemu.
Comment 1 Michal Skrivanek 2016-01-14 07:34:32 EST
> Perhaps a gpxe change in bug 1231931 is related, as this used to work just fine.
Sorry, that's not the correct gpxe bug; it must be something else then.
Comment 2 huiqingding 2016-01-15 04:28:29 EST
Tested migration between two RHEL 7.2 hosts; the kernel and qemu-kvm versions are:
kernel-3.10.0-327.8.1.el7.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.5.x86_64
ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch

1. boot vm in src host:
# /usr/libexec/qemu-kvm  -S -machine rhel6.5.0,accel=kvm,usb=off -cpu Westmere -smbios type=1 -no-user-config -nodefaults -net none -monitor stdio

2. boot vm in dst host:
# /usr/libexec/qemu-kvm  -S -machine rhel6.5.0,accel=kvm,usb=off -cpu Westmere -smbios type=1 -no-user-config -nodefaults -net none -monitor stdio -incoming tcp:0:5800

3. do migration from src host to dst host:
(qemu) migrate -d tcp:dst_ip:5800

after step 3, qemu-kvm in dst host quits with error:
copying E and F segments from pc.bios to pc.ram
copying C and D segments from pc.rom to pc.ram
ERROR: invalid runstate transition: 'inmigrate' -> 'prelaunch'
Aborted (core dumped)

(gdb) bt
#0  0x00007ffff04e45f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff04e5ce8 in __GI_abort () at abort.c:90
#2  0x00005555556d4bc7 in runstate_set (new_state=<optimized out>) at vl.c:648
#3  0x0000555555796401 in process_incoming_migration_co (opaque=0x555556a38000) at migration/migration.c:259
#4  0x00005555557ec70a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at coroutine-ucontext.c:80
#5  0x00007ffff04f6110 in ?? () from /lib64/libc.so.6
#6  0x00007fffffffd500 in ?? ()
#7  0x0000000000000000 in ?? ()

Also test -machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off, hit the same problem.
Comment 3 Dr. David Alan Gilbert 2016-01-15 04:46:52 EST
huiqingding:
  That's a separate issue; that's because you're trying to migrate a source which is in -S and hasn't been started yet.

Michal: OK, we need to work out the history of those ROM sizes. We can't just change it back since that will break migration for other combinations.
Comment 4 Dr. David Alan Gilbert 2016-01-15 08:02:18 EST
> 2015-12-22T07:29:05.023584Z qemu-kvm: Length mismatch:
> 0000:00:03.0/virtio-net-pci.rom: 0x20000 in != 0x40000: Invalid argument
> 2015-12-22T07:29:05.023620Z qemu-kvm: error while loading state for instance
> 0x0 of device 'ram'

0x20000 = 128k
0x40000 = 256k
so I think that's saying it's trying to load a 128k ROM image from the migration stream, but it's expecting a 256k ROM.
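
As a worked version of the arithmetic above, the following sketch pulls the two sizes out of the "Length mismatch" message and converts them to KiB. The regular expression is written against the log lines quoted in this bug only; the helper name `parse_length_mismatch` is hypothetical.

```python
import re

# Matches e.g. "Length mismatch: <block>: 0x20000 in != 0x40000"
MISMATCH = re.compile(
    r"Length mismatch: (?P<block>\S+): "
    r"(?P<incoming>0x[0-9a-fA-F]+) in != (?P<expected>0x[0-9a-fA-F]+)"
)


def parse_length_mismatch(line):
    """Return the RAM block name and both sizes in KiB, or None if no match."""
    m = MISMATCH.search(line)
    if m is None:
        return None
    return {
        "block": m.group("block"),
        # Size carried in the incoming migration stream.
        "incoming_kib": int(m.group("incoming"), 16) // 1024,
        # Size the destination qemu-kvm expected.
        "expected_kib": int(m.group("expected"), 16) // 1024,
    }


line = (
    "2015-12-22T07:29:05.023584Z qemu-kvm: Length mismatch: "
    "0000:00:03.0/virtio-net-pci.rom: 0x20000 in != 0x40000: Invalid argument"
)
print(parse_length_mismatch(line))
```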


The RHEL6 /usr/share/gpxe/virtio-net.rom comes from gpxe-roms-qemu, and both the 0.9.7-6.10.el6 package from 2015 and a nice new 0.9.7-6.15 have that as 53760 bytes.

On RHEL7:
/usr/share/qemu-kvm/rhel6-virtio.rom comes from the qemu-kvm-rhev package itself; and is 53248 bytes in qemu-kvm-rhev-2.3.0-31.el7_2.5 and in the qemu-kvm-ev-2.3.0-31.el7_2.3.1 that I downloaded.

So those ROM sizes look OK, and neither of them is anywhere near 128k or 256k.


strace -e open /usr/libexec/qemu-kvm -M rhel6.5.0 -device virtio-net -S -nographic -nodefaults

shows:
open("/usr/share/qemu-kvm/rhel6-virtio.rom", O_RDONLY) = 10

so it is opening the ROM file we thought it should.

Hmm, we need to reproduce this fully.

Dave
Comment 5 Dr. David Alan Gilbert 2016-01-15 08:26:09 EST
I've dumped migration streams as well; both rhel7 and rhel6 are showing 64k RAM blocks for the virtio-net-pci ROM, so I can't see where the sizes are coming from.
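
One way to relate the file sizes to the block sizes, sketched under a loudly stated assumption: that QEMU rounds an option ROM's file size up to the next power of two when it sizes the ROM's RAM block. That rounding rule is not stated anywhere in this bug; if it holds, both 53760-byte and 53248-byte files would give the 64k blocks seen in the dumped streams, and a 128k or 256k block would imply a noticeably larger ROM file on the affected host. The function names are hypothetical.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (n must be a positive integer)."""
    if n < 1:
        raise ValueError("size must be positive")
    return 1 << (n - 1).bit_length()


def file_size_range_for_block(block_size):
    """Range of file sizes (inclusive) that would round up to block_size.

    Assumption: block_size is itself a power of two greater than 1.
    """
    return block_size // 2 + 1, block_size


# ROM file sizes reported in comment 4.
for size in (53760, 53248):
    print(size, "->", hex(next_power_of_two(size)))

# Block sizes from the "Length mismatch" error.
for block in (0x20000, 0x40000):
    low, high = file_size_range_for_block(block)
    print(hex(block), "would need a file of", low, "to", high, "bytes")
```

Under that assumption, checking the actual byte size of the ROM files on a host that reproduces the failure would be a quick way to see which side is carrying an unexpected ROM.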
Comment 11 Dr. David Alan Gilbert 2016-01-20 06:14:26 EST
The only case where we've got someone with a machine that can reproduce this has a broken package install, and therefore it's not a qemu migration bug.

Given that we don't understand how they got the broken installation, it's still possible there's a bug somewhere, and there's still plenty unexplained.

So marking as INSUFFICIENT_DATA - please reopen if we can reproduce it or we find someone else who hits it that we can see what's going on in their filesystem.
Comment 12 meital avital 2016-02-08 02:52:18 EST
Please see bug - https://bugzilla.redhat.com/show_bug.cgi?id=1293566#c50
Comment 13 Karen Noel 2016-04-28 16:09:39 EDT
(In reply to meital avital from comment #12)
> Please see bug - https://bugzilla.redhat.com/show_bug.cgi?id=1293566#c50

I think the needinfo? was intended for Dave.
Comment 14 Dr. David Alan Gilbert 2016-05-03 13:02:45 EDT
Yes, this bug was fixed; it was a packaging error with the iPXE roms on the RHEV isos.
