Bug 1293566 - Migration failed with error - libvirtError: internal error: early end of file from monitor: possible problem:
Summary: Migration failed with error - libvirtError: internal error: early end of file...
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 3.6.1.3
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-3.6.3
Assignee: Michal Skrivanek
QA Contact: Israel Pinto
URL:
Whiteboard: virt
Depends On: 1298552
Blocks: RHEV3.6Upgrade 1302742
 
Reported: 2015-12-22 08:12 UTC by Michael Burman
Modified: 2020-02-14 17:43 UTC
CC List: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1298552 1302742
Environment:
Last Closed: 2016-01-28 13:55:33 UTC
oVirt Team: Virt
Embargoed:
gklein: ovirt-3.6.z?
gklein: blocker?
mgoldboi: planning_ack+
mburman: devel_ack?
mburman: testing_ack?


Attachments
rhev-h logs (386.04 KB, application/x-gzip)
2015-12-22 08:12 UTC, Michael Burman
no flags Details
rhel 7.2 logs (4.95 MB, application/x-gzip)
2015-12-22 08:14 UTC, Michael Burman
no flags Details
screenshots (147.02 KB, application/x-gzip)
2015-12-22 08:19 UTC, Michael Burman
no flags Details

Description Michael Burman 2015-12-22 08:12:56 UTC
Created attachment 1108584 [details]
rhev-h logs

Description of problem:
Migration failed with error - libvirtError: internal error: early end of file from monitor: possible problem:
2015-12-22T07:29:04.742599Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2015-12-22T07:29:04.742812Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config
2015-12-22T07:29:05.023584Z qemu-kvm: Length mismatch: 0000:00:03.0/virtio-net-pci.rom: 0x20000 in != 0x40000: Invalid argument
2015-12-22T07:29:05.023620Z qemu-kvm: error while loading state for instance 0x0 of device 'ram'
2015-12-22T07:29:05.023722Z qemu-kvm: load of migration failed: Invalid argument

- The migration fails from the latest RHEL 7.2 (vdsm-4.17.13-1.el7ev.noarch) to
rhev-h 7.2 (20151218.2.el7ev, vdsm-4.17.13-1.el7ev.noarch) and vice versa.

- Note, both servers were upgraded from vdsm-4.16.30/31, and before the upgrade migration between the servers was successful (attaching a screenshot with the event log of a successful migration between the servers just before the upgrade).

Version-Release number of selected component (if applicable):
rhev-m 3.6.1.3-0.1.el6
vdsm-4.17.13-1.el7ev.noarch(both)
libvirt-1.2.17-13.el7_2.2.x86_64(both)
qemu-kvm-rhev-2.3.0-31.el7_2.4.x86_64(both)

Steps to Reproduce:
1. Try to migrate between RHEL 7.2 with the latest 3.6.1.3 vdsm and rhev-h 7.2 with the latest 3.6.1.3 vdsm, in both directions.


Actual results:
Migration fails with libvirt and NUMA errors in the vdsm logs.

Expected results:
Migration should succeed.

Comment 1 Michael Burman 2015-12-22 08:14:23 UTC
Created attachment 1108585 [details]
rhel 7.2 logs

Note, the migration fails in both directions, so each server acts as both source and destination.

Comment 2 Michael Burman 2015-12-22 08:19:40 UTC
Created attachment 1108586 [details]
screenshots

Screenshots (event log UI) of the successful migration just before the rhev-h upgrade.

Comment 3 Yaniv Kaul 2015-12-27 09:09:47 UTC
Does it happen without upgrade? Is it reproducible? Anything interesting in the VM configuration?

Comment 4 Michael Burman 2015-12-27 13:11:51 UTC
(In reply to Yaniv Kaul from comment #3)
> Does it happen without upgrade? Is it reproducible? Anything interesting in
> the VM configuration?

I saw this issue only as reported and described above (vdsm 3.5 > 3.6.1).
I didn't see it on 3.5.6/3.5.7 nor on 3.6.1/3.6.2 without an upgrade involved.
Nothing special about my VM configurations.

Comment 5 dominique.taffin 2015-12-28 12:28:00 UTC
I'm also running into this bug, just without the NUMA reference, when trying to migrate a VM from CentOS 6.7 to CentOS 7.2.

Dec 28 13:18:10 onode030231 journal: Domain id=10 name='vm-dtaffin-25796' uuid=86191e76-5765-4e77-b909-5d29150797b9 is tainted: hook-script
Dec 28 13:18:10 onode030231 systemd-machined: New machine qemu-vm-dtaffin-25796.
Dec 28 13:18:10 onode030231 systemd: Started Virtual Machine qemu-vm-dtaffin-25796.
Dec 28 13:18:10 onode030231 systemd: Starting Virtual Machine qemu-vm-dtaffin-25796.
Dec 28 13:18:10 onode030231 kvm: 2 guests now active
Dec 28 13:18:11 onode030231 kernel: int312: port 3(vnet1) entered disabled state
Dec 28 13:18:11 onode030231 kernel: device vnet1 left promiscuous mode
Dec 28 13:18:11 onode030231 kernel: int312: port 3(vnet1) entered disabled state
Dec 28 13:18:11 onode030231 journal: internal error: End of file from monitor
Dec 28 13:18:11 onode030231 journal: internal error: early end of file from monitor: possible problem:#0122015-12-28T12:18:11.038163Z qemu-kvm: Length mismatch: 0000:00:03.0/virtio-net-pci.rom: 0x20000 in != 0x40000: Invalid argument#0122015-12-28T12:18:11.038230Z qemu-kvm: error while loading state for instance 0x0 of device 'ram'#0122015-12-28T12:18:11.038334Z qemu-kvm: load of migration failed: Invalid argument
Dec 28 13:18:11 onode030231 kvm: 1 guest now active
Dec 28 13:18:11 onode030231 systemd-machined: Machine qemu-vm-dtaffin-25796 terminated.


Destination 7.2 host:
qemu-kvm-ev-2.3.0-31.el7_2.3.1.x86_64
vdsm-4.17.13-0.el7.centos.noarch
libvirt-daemon-1.2.17-13.el7.x86_64

Source 6.7 host:
qemu-img-rhev-0.12.1.2-2.479.el6_7.2.x86_64
vdsm-4.16.27-0.el6.x86_64
libvirt-0.10.2-54.el6_7.3.x86_64

engine:
ovirt-engine-3.6.1.3-1.el6.noarch

Comment 6 dominique.taffin 2015-12-29 10:32:38 UTC
Just as additional information:

same issue occurs when trying to migrate VM between two CentOS 7.2 hosts.

both running identical versions:
qemu-kvm-ev-2.3.0-31.el7_2.3.1.x86_64
vdsm-4.17.13-0.el7.centos.noarch
libvirt-daemon-1.2.17-13.el7.x86_64

In case it matters: SELinux enforced on all hosts.

Comment 7 Michal Skrivanek 2016-01-13 16:36:53 UTC
(In reply to Michael Burman from comment #4)
> (In reply to Yaniv Kaul from comment #3)
> > Does it happen without upgrade? Is it reproducible? Anything interesting in
> > the VM configuration?
> 
> I saw this issue only as reported and described above.(vdsm 3.5 > 3.6.1)
> Didn't saw it on 3.5.6/3.5.7 and not on 3.6.1/3.6.2 without involving
> upgrade. 
> Nothing special on my VMs configurations.

so both hosts were upgraded - what about the engine? If yes, is it in cluster level 3.5 or 3.6?
Were the VMs still running? No stop & start or anything?

Comment 8 Michal Skrivanek 2016-01-13 16:39:19 UTC
(In reply to dominique.taffin from comment #6)

similar question to you. From your details it seems you are running that VM at cluster level 3.5. It was migrated from 6.7 to 7.2 and failed (correct?); then a separate VM between two 7.2 hosts (correct?) - when and where did you launch that VM? Any previous migrations?

Comment 9 Michael Burman 2016-01-13 16:54:24 UTC
Hi Michal,

Yes, both hosts were upgraded, as was the engine ^^ to rhev-m 3.6.1.3-0.1.el6.

It was part of a whole upgrade cycle in a very mixed environment; please note it's been a while since this was reported.

First the engine was upgraded (from 3.5.7), then I upgraded my 2 servers to 3.6 vdsm, and I think I upgraded my cluster level to 3.6 (but I can't really be sure - this setup no longer exists in the reported state and the cluster level may have been left on 3.5).

VMs were still running (no stop/start), 1 VM on each host.

Comment 10 dominique.taffin 2016-01-14 07:36:12 UTC
Hello,

(In reply to Michal Skrivanek from comment #8)

> similar question to you. But from your details it seems like you are running
> that VM in 3.5 cluster level. migrated from 6.7 to 7.2 and it failed
> (correct?); then a separate vm between two 7.2 hosts(correct?) - when and
> where did you launch that vm? any previous migrations?


correct. 

background: We have a large infrastructure with several thousand VMs running on 3.5.7, cluster level 3.5. We need to migrate those step by step, without downtime, to oVirt 3.6.x.

our migration steps are:
- update the engine to the latest 3.6.x
- move some CentOS 6 hosts of an old cluster (running at level 3.5) to maintenance, reinstall them using CentOS 7.2 and the 3.6.x oVirt packages.
- put the CentOS 7.2 hosts in a new cluster, migrate some VMs from the old cluster to the new one.
- repeat until all VMs / hosts are in the new cluster.


Using the latest qemu-kvm-ev version we are now able to migrate VMs that were launched on CentOS 7 between CentOS 7 hosts, but we are still not able to migrate between CentOS 6 and CentOS 7 hosts, which means we are blocked.

Please let me know what information I can provide in order to assist you.

Comment 11 Michal Skrivanek 2016-01-14 09:09:43 UTC
(In reply to dominique.taffin from comment #10)
 > background: We do have a large infrastructure with several thousand VMs
> runinng on 3.5.7, cluster level 3.5. We do need to migrate those step by
> step without downtime to oVirt 3.6.x.

that's quite a few - did you consider automating this via the REST API, or is everything done manually?
 
> our migration step is: 
> - update engine to latest 3.6.x
> - move some CentOS 6 hosts of an old cluster (running in 3.5 level) to
> maintenance, reinstall them using CentOS 7.2 and 3.6.x ovirt packages.
> - put CentOS 7.2 hosts in new cluster, migrate some VMs from old cluster to
> new one.

can you please confirm that the cluster settings are exactly the same between both?
They need to match not only in the actual cluster level, but in all the other properties as well.

> Using the latest qemu-kvm-ev version we are now able to migrate VMs that
> have been launched on CentOS 7 between CentOS 7 hosts, but are still not

so unlike Michael's case, migration between 7.2 and 7.2 works OK? Is that cross-cluster (3.5 -> 3.5) or within a cluster (3.5)?

Comment 12 dominique.taffin 2016-01-14 09:19:41 UTC
(In reply to Michal Skrivanek from comment #11)
> 
> that's quite a few - did you consider automation via REST API or everything
> manual only?
Mainly manual, over several weeks, as we need to move host by host.

 
> can you please confirm that cluster settings are exactly the same between
> both?
> It needs to match not only the actual cluster level, but all other
> properties as well
I will recheck to verify and come back to you on this. AFAIK everything is identical.

 
> so unlike Michael's case migration between 7.2 and 7.2 works ok? is that
> cross-cluster(3.5->3.5) or within cluster(3.5)?
Migration within a cluster (3.5 level / 7.2 hosts). Cross-cluster 3.5/7.2 was not tested.

Comment 13 dominique.taffin 2016-01-14 09:29:12 UTC
Verified again, all cluster settings are identical.

here is a current libvirt log entry for an example VM that fails:

2016-01-14 09:25:24.171+0000: starting up libvirt version: 1.2.17, package: 13.el7 (CentOS BuildSystem <http://bugs.centos.org>, 2015-11-20-16:24:10, worker1.bsys.centos.org), qemu version: 2.3.0 (qemu-kvm-ev-2.3.0-31.el7_2.4.1)
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=spice /usr/libexec/qemu-kvm -name vm-dtaffin-25796 -S -machine rhel6.5.0,accel=kvm,usb=off -cpu Westmere -m 1024 -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -uuid 86191e76-5765-4e77-b909-5d29150797b9 -smbios type=1,manufacturer=oVirt,product=oVirt Node,version=6-7.el6.centos.12.3,serial=32393735-3733-5A43-3332-303235575250,uuid=86191e76-5765-4e77-b909-5d29150797b9 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-vm-dtaffin-25796/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2016-01-14T09:25:23,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x5 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x6 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/00000002-0002-0002-0002-00000000037c/2dfc3bc7-ec09-4efa-82fb-0615b1f7c1d0/images/2d526a9a-43c4-4d5b-99bf-3460d2aceb01/d8ddaf83-3fec-439e-931b-a5d89eb1b05d,if=none,id=drive-virtio-disk0,format=raw,serial=2d526a9a-43c4-4d5b-99bf-3460d2aceb01,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2 -netdev tap,fd=29,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:03:00:10,bus=pci.0,addr=0x3,bootindex=1 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/86191e76-5765-4e77-b909-5d29150797b9.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/86191e76-5765-4e77-b909-5d29150797b9.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice port=5902,tls-port=5903,addr=10.76.98.160,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -device qxl-vga,id=video0,ram_size=67108864,vram_size=33554432,vgamem_mb=16,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -incoming tcp:[::]:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
Domain id=7 is tainted: hook-script
2016-01-14T09:25:24.603034Z qemu-kvm: Length mismatch: 0000:00:03.0/virtio-net-pci.rom: 0x20000 in != 0x40000: Invalid argument
2016-01-14T09:25:24.603108Z qemu-kvm: error while loading state for instance 0x0 of device 'ram'
2016-01-14T09:25:24.603193Z qemu-kvm: load of migration failed: Invalid argument
2016-01-14 09:25:24.625+0000: shutting down

Comment 14 Michal Skrivanek 2016-01-14 09:55:53 UTC
(In reply to Michael Burman from comment #9)
> Hi Michal,
> 
> Yes, both hosts were upgraded, as well the engine ^^ to rhev-m
> 3.6.1.3-0.1.el6.
> 
> It was part of a whole upgrade cycle in a very mixed environment, please
> note it's been a while since reported.
> 
> First the engine was upgraded(from 3.5.7), then i upgraded my 2 servers to
> 3.6 vdsm, and i think that i upgraded my cluster level to 3.6(but i can't
> really be sure, this setup no longer exists in the reported status and maybe
> the cluster level left on 3.5)
> 
> VMs were still running(no stop/start), 1 VM on each host.

I've reviewed the logs and I wonder whether it's the same issue or not. In your case the VMs were started with the new machine type (i.e. in the upgraded cluster level 3.6) and were not kept running across the upgrade. E.g. vm-n2 was shut down as a 3.5 VM and then properly started as a 3.6 VM (not via migration).
Also, your hosts have different time zones set, so it's a bit difficult to correlate the logs.
That said, the last migration of vm-n2 should not have failed.

Comment 15 Michal Skrivanek 2016-01-14 10:01:15 UTC
(In reply to dominique.taffin from comment #13)
> Verified again, all cluster settings are identical.
> 
> here a current libvirt log entry for an example VM that fails:
...
> 2016-01-14T09:25:24.603034Z qemu-kvm: Length mismatch:
> 0000:00:03.0/virtio-net-pci.rom: 0x20000 in != 0x40000: Invalid argument
> 2016-01-14T09:25:24.603108Z qemu-kvm: error while loading state for instance
> 0x0 of device 'ram'
> 2016-01-14T09:25:24.603193Z qemu-kvm: load of migration failed: Invalid
> argument
> 2016-01-14 09:25:24.625+0000: shutting down

After comparing with comment #14 it looks similar, but there is a difference in machine type: Michael's VM is 3.6 and yours is 3.5 (which is correct/consistent with what you described).
We need to retest qemu migration support. I suppose it happens when the VM has a NIC, right? Can you quickly test a VM without one? That would be helpful.
Thanks a lot!

Comment 16 dominique.taffin 2016-01-14 10:08:02 UTC
(In reply to Michal Skrivanek from comment #15)
> We need to retest qemu migration support. I suppose it happens when the VM
> has a NIC, right? Can you quickly test a VM without any? That would be
> helpful

All of our VMs have at least 1 NIC; depending on customer request, also 2 NICs per VM. I will deploy an identical VM and remove the NIC. Please note that we also use PXE as the primary boot target, as all OS deployment is done via PXE. All our KVM NICs are VirtIO.

Comment 17 dominique.taffin 2016-01-14 10:20:40 UTC
Migration without a NIC is working.

I noted that the location and file name of the PXE ROM files differ between the hypervisors, but I assume this does not matter, as the newer qemu-kvm-ev should be built with the correct paths.
CentOS 6: /usr/share/gpxe/virtio-net.rom
CentOS 7: /usr/share/qemu-kvm/rhel6-virtio.rom (/usr/share/ipxe/1af41000.rom)

libvirt log for the successful migration (without NIC):

2016-01-14 10:15:37.592+0000: starting up libvirt version: 1.2.17, package: 13.el7 (CentOS BuildSystem <http://bugs.centos.org>, 2015-11-20-16:24:10, worker1.bsys.centos.org), qemu version: 2.3.0 (qemu-kvm-ev-2.3.0-31.el7_2.4.1)
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=spice /usr/libexec/qemu-kvm -name vm-dtaffin-26037 -S -machine rhel6.5.0,accel=kvm,usb=off -cpu Westmere -m 1024 -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -uuid d28b8835-c360-418b-b45d-5842df1765e6 -smbios type=1,manufacturer=oVirt,product=oVirt Node,version=6-7.el6.centos.12.3,serial=32393735-3733-5A43-3332-303235575245,uuid=d28b8835-c360-418b-b45d-5842df1765e6 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-vm-dtaffin-26037/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2016-01-14T10:15:37,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x5 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x6 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/00000002-0002-0002-0002-00000000037c/0bb94892-4574-4d7f-a514-478999af10a0/images/a2545f6b-141a-4769-9151-39276e76ba16/235662a0-2845-43f7-b61a-c9e613bca557,if=none,id=drive-virtio-disk0,format=raw,serial=a2545f6b-141a-4769-9151-39276e76ba16,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/d28b8835-c360-418b-b45d-5842df1765e6.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/d28b8835-c360-418b-b45d-5842df1765e6.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice port=5902,tls-port=5903,addr=10.76.98.160,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -device qxl-vga,id=video0,ram_size=67108864,vram_size=33554432,vgamem_mb=16,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -incoming tcp:[::]:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
Domain id=8 is tainted: hook-script
copying E and F segments from pc.bios to pc.ram
copying C and D segments from pc.rom to pc.ram


best,
 Dominique

Comment 18 Michal Skrivanek 2016-01-14 10:25:44 UTC
(In reply to Michael Burman from comment #9)

Meital, we would need a local reproducer ASAP. Thanks.

Comment 19 dominique.taffin 2016-01-14 10:27:59 UTC
I think I found it:


The PXE ROMs have different md5sums.

I copied the PXE ROM file from CentOS 6 to the CentOS 7 machine:

CentOS 6 Source: /usr/share/gpxe/virtio-net.rom

CentOS 7.2 Destinations (yes, no symlink, but 2 copies for testing):
/usr/share/qemu-kvm/rhel6-virtio.rom 
/usr/share/ipxe/1af41000.rom

and the migration seems to work. I will have to test it some more with other VMs.
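For the record, this is roughly what I did (the host name is a placeholder, and overwriting the packaged ROM files like this was only meant as a test, not as a proper fix):

On the CentOS 6 source host:
# md5sum /usr/share/gpxe/virtio-net.rom
# scp /usr/share/gpxe/virtio-net.rom root@centos7-host:/tmp/virtio-net.rom

On the CentOS 7.2 destination host:
# md5sum /usr/share/qemu-kvm/rhel6-virtio.rom /usr/share/ipxe/1af41000.rom
# cp /tmp/virtio-net.rom /usr/share/qemu-kvm/rhel6-virtio.rom
# cp /tmp/virtio-net.rom /usr/share/ipxe/1af41000.rom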

Comment 20 Francesco Romani 2016-01-14 11:45:44 UTC
(In reply to dominique.taffin from comment #17)
> migration without NIC is working.

Makes sense, because:

(In reply to Michael Burman from comment #0)
> Migration failed with error - libvirtError: internal error: early end of
> file from monitor: possible problem:
> 2015-12-22T07:29:04.742599Z qemu-kvm: warning: CPU(s) not present in any
> NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> 2015-12-22T07:29:04.742812Z qemu-kvm: warning: All CPU(s) up to maxcpus

This is a warning we need to check, but should not be critical

> should be described in NUMA config
> 2015-12-22T07:29:05.023584Z qemu-kvm: Length mismatch:
> 0000:00:03.0/virtio-net-pci.rom: 0x20000 in != 0x40000: Invalid argument
> 2015-12-22T07:29:05.023620Z qemu-kvm: error while loading state for instance
> 0x0 of device 'ram'
> 2015-12-22T07:29:05.023722Z qemu-kvm: load of migration failed: Invalid
> argument

This really looks like a qemu issue, in the upgrade path.

What Vdsm needs to guarantee is that the configuration of the VMs is consistent and correct.
I will now carefully check that Vdsm/Engine did the right thing and gave consistent configuration. If this is the case, we'll need to move the bug down the stack, to qemu.
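For reference, a quick way to compare what libvirt actually received on each host (paths are the libvirt defaults, the VM name is a placeholder) is to look at the per-domain qemu log on both sides and, while the VM is still running on the source, the live XML:

# grep 'qemu-kvm -name' /var/log/libvirt/qemu/vm-name.log    (on both source and destination)
# virsh -r dumpxml vm-name                                   (read-only, on the source)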

Comment 21 Michal Skrivanek 2016-01-14 12:33:35 UTC
(In reply to dominique.taffin from comment #19)

can you please try with the older gpxe on the el6 host and the original state on the el7 host? e.g. gpxe-0.9.7-6.12.el6 (I suppose the VM would need to be started with this gpxe in place on that el6 host first).
Just to check which side is to blame.

Comment 22 dominique.taffin 2016-01-14 12:36:12 UTC
(In reply to Michal Skrivanek from comment #21)
> can you please try with older gpxe on the el6 host and the original state on
> el7 host? e.g. gpxe-0.9.7-6.12.el6 (I suppose the VM would need to be
> started with this gpxe in place on that el6 host first)
> Just to check which side is to blame

I will, but it might take up until Monday noon before I can report back.

Comment 23 Francesco Romani 2016-01-14 12:41:29 UTC
(In reply to Michael Burman from comment #0)
> Created attachment 1108584 [details]
> rhev-h logs
> 
> Description of problem:
> Migration failed with error - libvirtError: internal error: early end of
> file from monitor: possible problem:
> 2015-12-22T07:29:04.742599Z qemu-kvm: warning: CPU(s) not present in any
> NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> 2015-12-22T07:29:04.742812Z qemu-kvm: warning: All CPU(s) up to maxcpus
> should be described in NUMA config

In order to debug this ^^^^^^^^^^^^^^^^

could you please share more vdsm logs, showing how this VM (vmId '22fa763b-3ea5-473f-8621-3eefeb51c350') was created? Specifically, I'd like to see the logs around the VM.create verb.

A Vdsm log snippet which shows both VM creation and (failed) migration would be very nice.
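Something simple like this on the source host should pull out the relevant entries (vdsm logs live under /var/log/vdsm/ and rotate, so the older compressed files may need to be searched as well):

# grep -n '22fa763b-3ea5-473f-8621-3eefeb51c350' /var/log/vdsm/vdsm.log*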

Comment 24 Francesco Romani 2016-01-14 12:56:44 UTC
(In reply to Francesco Romani from comment #23)
> (In reply to Michael Burman from comment #0)
> > Created attachment 1108584 [details]
> > rhev-h logs
> > 
> > Description of problem:
> > Migration failed with error - libvirtError: internal error: early end of
> > file from monitor: possible problem:
> > 2015-12-22T07:29:04.742599Z qemu-kvm: warning: CPU(s) not present in any
> > NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > 2015-12-22T07:29:04.742812Z qemu-kvm: warning: All CPU(s) up to maxcpus
> > should be described in NUMA config
> 
> In order to debug this ^^^^^^^^^^^^^^^^
> 
> could you please share more vdsm logs, showing how this VM (vmId':
> u'22fa763b-3ea5-473f-8621-3eefeb51c350) was created? Specifically, I'd like
> to see the logs regarding VM.create verb.
> 
> A Vdsm log snippet which shows both VM creation and (failed) migration would
> be very nice.

Sorry, I missed this bit in the migration XML:

<vcpu placement='static' current='1'>16</vcpu>

It looks like the VM was configured to have just one CPU (current) out of a maximum of 16 - is this right?

Comment 25 dominique.taffin 2016-01-14 14:06:29 UTC
(In reply to Michal Skrivanek from comment #21)
> can you please try with older gpxe on the el6 host and the original state on
> el7 host? e.g. gpxe-0.9.7-6.12.el6 (I suppose the VM would need to be
> started with this gpxe in place on that el6 host first)
> Just to check which side is to blame

OK, I reinstalled all vdsm/qemu/... packages on the CentOS 6 and CentOS 7 hosts to ensure my copied virtio-net ROM image is gone, and deployed a new VM running with the stock PXE ROM image.


Current setup 1st host: CentOS 6.7 host with:
qemu-img-rhev-0.12.1.2-2.479.el6_7.2.x86_64
qemu-kvm-rhev-0.12.1.2-2.479.el6_7.2.x86_64
gpxe-roms-qemu-0.9.7-6.14.el6.noarch
qemu-kvm-rhev-tools-0.12.1.2-2.479.el6_7.2.x86_64
vdsm-4.16.27-0.el6.x86_64

and:
md5sum /usr/share/gpxe/virtio-net.rom 
bab6408c84e62746fdc06fe9baa47919  /usr/share/gpxe/virtio-net.rom



Current setup 2nd host: CentOS 7.2 with:
qemu-kvm-ev-2.3.0-31.el7_2.4.1.x86_64
qemu-kvm-tools-ev-2.3.0-31.el7_2.4.1.x86_64
qemu-img-ev-2.3.0-31.el7_2.4.1.x86_64
qemu-kvm-common-ev-2.3.0-31.el7_2.4.1.x86_64
libvirt-daemon-driver-qemu-1.2.17-13.el7.x86_64
ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch

and:
md5sum  /usr/share/qemu-kvm/rhel6-virtio.rom
281bb91bcb083a32b5db5059f51ead24  /usr/share/qemu-kvm/rhel6-virtio.rom
md5sum  /usr/share/ipxe/1af41000.rom
281bb91bcb083a32b5db5059f51ead24  /usr/share/ipxe/1af41000.rom



Trying to migrate a freshly powered-on VM from CentOS 6 to CentOS 7.
Migration fails as before with:

2016-01-14 14:00:18.078+0000: starting up libvirt version: 1.2.17, package: 13.el7 (CentOS BuildSystem <http://bugs.centos.org>, 2015-11-20-16:24:10, worker1.bsys.centos.org), qemu version: 2.3.0 (qemu-kvm-ev-2.3.0-31.el7_2.4.1)
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=spice /usr/libexec/qemu-kvm -name vm-dtaffin-26051 -S -machine rhel6.5.0,accel=kvm,usb=off -cpu Westmere -m 1024 -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -uuid 35ef0ad2-0bc2-45aa-86b6-2e85c28259df -smbios type=1,manufacturer=oVirt,product=oVirt Node,version=6-7.el6.centos.12.3,serial=32393735-3733-5A43-3332-303235575245,uuid=35ef0ad2-0bc2-45aa-86b6-2e85c28259df -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-vm-dtaffin-26051/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2016-01-14T14:00:17,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x5 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x6 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/00000002-0002-0002-0002-00000000037c/0bb94892-4574-4d7f-a514-478999af10a0/images/1c4c1f5e-4eb6-4e99-83e2-5d89b7d8dda9/c8757869-c9dc-4d17-81a3-2f44abd47f15,if=none,id=drive-virtio-disk0,format=raw,serial=1c4c1f5e-4eb6-4e99-83e2-5d89b7d8dda9,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2 -netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:03:00:26,bus=pci.0,addr=0x3,bootindex=1 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/35ef0ad2-0bc2-45aa-86b6-2e85c28259df.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/35ef0ad2-0bc2-45aa-86b6-2e85c28259df.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice port=5900,tls-port=5901,addr=10.76.98.163,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -device qxl-vga,id=video0,ram_size=67108864,vram_size=33554432,vgamem_mb=16,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -incoming tcp:[::]:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
Domain id=2 is tainted: hook-script
2016-01-14T14:00:18.621239Z qemu-kvm: Length mismatch: 0000:00:03.0/virtio-net-pci.rom: 0x10000 in != 0x40000: Invalid argument
2016-01-14T14:00:18.621321Z qemu-kvm: error while loading state for instance 0x0 of device 'ram'
2016-01-14T14:00:18.621415Z qemu-kvm: load of migration failed: Invalid argument
2016-01-14 14:00:18.646+0000: shutting down

Comment 26 Michal Skrivanek 2016-01-15 10:58:48 UTC
Dominique, while the QEMU folks are investigating, a potential workaround might be to hot-unplug the NIC and plug it back in after migration. It may be tedious to do this for many VMs, but it's worth a try and could be useful for the important ones.
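If you want to script it, the v3 REST API should let you do the unplug/replug around the migration; treat the exact URLs below as an assumption from memory rather than verified, and the credentials/IDs/host names as placeholders:

# curl -k -u admin@internal:PASSWORD -H 'Content-Type: application/xml' -d '<action/>' https://engine.example.com/api/vms/VM_ID/nics/NIC_ID/deactivate
(migrate the VM)
# curl -k -u admin@internal:PASSWORD -H 'Content-Type: application/xml' -d '<action/>' https://engine.example.com/api/vms/VM_ID/nics/NIC_ID/activate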

Comment 27 Dr. David Alan Gilbert 2016-01-15 18:43:27 UTC
(In reply to dominique.taffin from comment #19)
> I think I found it:
> 
> 
> The PXE ROMs have different md5sums.
> 
> I copied the PXE ROM file from CentOS 6 to the CentOS 7 machine:
> 
> CentOS 6 Source: /usr/share/gpxe/virtio-net.rom
> 
> CentOS 7.2 Destinations (yes, no symlink, but 2 copies for testing):
> /usr/share/qemu-kvm/rhel6-virtio.rom 
> /usr/share/ipxe/1af41000.rom

Hi Dominique,
  I'm confused by that line, those should be different files.  Can you show me (from your RHEL7 box):
  ls -l /usr/share/qemu-kvm/rhel6-virtio.rom
  ls -l /usr/share/ipxe/1af41000.rom
  rpm -qf /usr/share/qemu-kvm/rhel6-virtio.rom
  rpm -qf /usr/share/ipxe/1af41000.rom

  the rhel6-virtio.rom should be a nice tiny 53kb, the 1af41000 should be 256k.
If somehow the rhel6-virtio.rom has grown, then that would explain what's going on.

Dave

> and the migration seems to work. I will have to test it some more with other
> VMs.

Comment 28 Michael Burman 2016-01-17 06:55:19 UTC
(In reply to Francesco Romani from comment #23)
> (In reply to Michael Burman from comment #0)
> > Created attachment 1108584 [details]
> > rhev-h logs
> > 
> > Description of problem:
> > Migration failed with error - libvirtError: internal error: early end of
> > file from monitor: possible problem:
> > 2015-12-22T07:29:04.742599Z qemu-kvm: warning: CPU(s) not present in any
> > NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > 2015-12-22T07:29:04.742812Z qemu-kvm: warning: All CPU(s) up to maxcpus
> > should be described in NUMA config
> 
> In order to debug this ^^^^^^^^^^^^^^^^
> 
> could you please share more vdsm logs, showing how this VM (vmId':
> u'22fa763b-3ea5-473f-8621-3eefeb51c350) was created? Specifically, I'd like
> to see the logs regarding VM.create verb.
> 
> A Vdsm log snippet which shows both VM creation and (failed) migration would
> be very nice.

Hi Francesco, 
I can't share more vdsm logs. It's been too long since this was reported, and the setup and logs are no longer available.

Comment 29 dominique.taffin 2016-01-18 06:50:11 UTC
Hello,

(In reply to Dr. David Alan Gilbert from comment #27)
> Hi Dominique,
>   I'm confused by that line, those should be different files.  Can you show
> me (from your RHEL7 box):
>   ls -l /usr/share/qemu-kvm/rhel6-virtio.rom
>   ls -l /usr/share/ipxe/1af41000.rom
>   rpm -qf /usr/share/qemu-kvm/rhel6-virtio.rom
>   rpm -qf /usr/share/ipxe/1af41000.rom
> 
>   the rhel6-virtio.rom should be a nice tiny 53kb, the 1af41000 should be
> 256k.
> If somehow the rhel6-virtio.rom has grown, then that would explain what's
> going on.
> 

sure, here the result:

# cat /etc/redhat-release 
CentOS Linux release 7.2.1511 (Core) 

# ls -l /usr/share/qemu-kvm/rhel6-virtio.rom
lrwxrwxrwx. 1 root root 28 14. Jan 14:49 /usr/share/qemu-kvm/rhel6-virtio.rom -> /usr/share/ipxe/1af41000.rom

# ls -l /usr/share/ipxe/1af41000.rom
-rw-r--r--. 1 root root 262144 20. Nov 07:28 /usr/share/ipxe/1af41000.rom

# rpm -qf /usr/share/qemu-kvm/rhel6-virtio.rom
qemu-kvm-ev-2.3.0-31.el7_2.4.1.x86_64

# rpm -qf /usr/share/ipxe/1af41000.rom
ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch

Comment 30 Dr. David Alan Gilbert 2016-01-18 09:09:56 UTC
(In reply to dominique.taffin from comment #29)
> Hello,
> 
> (In reply to Dr. David Alan Gilbert from comment #27)
> > Hi Dominique,
> >   I'm confused by that line, those should be different files.  Can you show
> > me (from your RHEL7 box):
> >   ls -l /usr/share/qemu-kvm/rhel6-virtio.rom
> >   ls -l /usr/share/ipxe/1af41000.rom
> >   rpm -qf /usr/share/qemu-kvm/rhel6-virtio.rom
> >   rpm -qf /usr/share/ipxe/1af41000.rom
> > 
> >   the rhel6-virtio.rom should be a nice tiny 53kb, the 1af41000 should be
> > 256k.
> > If somehow the rhel6-virtio.rom has grown, then that would explain what's
> > going on.
> > 
> 
> sure, here the result:

Thanks,
 
> # cat /etc/redhat-release 
> CentOS Linux release 7.2.1511 (Core) 
> 
> # ls -l /usr/share/qemu-kvm/rhel6-virtio.rom
> lrwxrwxrwx. 1 root root 28 14. Jan 14:49
> /usr/share/qemu-kvm/rhel6-virtio.rom -> /usr/share/ipxe/1af41000.rom
> 
> # ls -l /usr/share/ipxe/1af41000.rom
> -rw-r--r--. 1 root root 262144 20. Nov 07:28 /usr/share/ipxe/1af41000.rom
> 
> # rpm -qf /usr/share/qemu-kvm/rhel6-virtio.rom
> qemu-kvm-ev-2.3.0-31.el7_2.4.1.x86_64
> 
> # rpm -qf /usr/share/ipxe/1af41000.rom
> ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch

Well, that at least half explains the problem - /usr/share/qemu-kvm/rhel6-virtio.rom should *NOT* be a link; now we just have to figure out how it ended up that way.
I just downloaded the qemu-kvm-ev from:
http://cbs.centos.org/kojifiles/packages/qemu-kvm-ev/2.3.0/31.el7_2.4.1/x86_64/
and did:

rpm2cpio http://cbs.centos.org/kojifiles/packages/qemu-kvm-ev/2.3.0/31.el7_2.4.1/x86_64/qemu-kvm-ev-2.3.0-31.el7_2.4.1.x86_64.rpm | cpio -t -v

and it shows:

-rwxr-xr-x   1 root     root        53248 Dec 18 12:13 ./usr/share/qemu-kvm/rhel6-virtio.rom

so that's OK.

Can you confirm:
  1) Exactly how you installed this host,
  2) Which repo you got the qemu-kvm-ev from (I think  yum info qemu-kvm-ev should show you)

Dave

Comment 31 dominique.taffin 2016-01-18 09:19:37 UTC
(In reply to Dr. David Alan Gilbert from comment #30)
> Can you confirm:
>   1) Exactly how you installed this host,
>   2) Which repo you got the qemu-kvm-ev from (I think  yum info qemu-kvm-ev
> should show you)

regarding 1)
 CentOS minimal installation over PXE with repo configuration (OS, Updates, EPEL, oVirt repo - all company-internal mirrors), manually added to the oVirt engine - which then installs the relevant packages like qemu-kvm-ev.

regarding 2)
all the packages are mirrored from the official ovirt download site (http://resources.ovirt.org/pub/ovirt-3.6/) to a local repository, including dependent gluster packages from gluster.org

We do not have any custom-built packages, just "official" ones.


best,
 Dominique

Comment 32 Dr. David Alan Gilbert 2016-01-18 09:31:43 UTC
(In reply to dominique.taffin from comment #31)
> (In reply to Dr. David Alan Gilbert from comment #30)
> > Can you confirm:
> >   1) Exactly how you installed this host,
> >   2) Which repo you got the qemu-kvm-ev from (I think  yum info qemu-kvm-ev
> > should show you)
> 
> regarding 1)
>  CentOS minimal installation over PXE with repo configuration (OS, Updated,
> EPEL, oVirt repo - all company internal mirrors), manually added to oVirt
> engine - which then installes the relevant packages like qemu-kvm-ev.

OK

> regarding 2)
> all the packages are mirrored from the official ovirt download site
> (http://resources.ovirt.org/pub/ovirt-3.6/) to a local repository, including
> dependent gluster packages from gluster.org

OK, thanks; I've checked the packages on there as well, and they look fine too.

> We do not have any custom build packages, just "official" once.

Thanks for the info; we'll keep trying to figure out how that's happening.

> best,
>  Dominique

Comment 33 Dr. David Alan Gilbert 2016-01-18 11:35:31 UTC
I tried installing rhev-h (20151218.1.iso ) and the file looks right, and I also then upgraded that (20151218.2) and it still looks right.  I did the upgrade via boot off CD.

Comment 34 Dr. David Alan Gilbert 2016-01-18 20:05:35 UTC
(In reply to Dr. David Alan Gilbert from comment #33)
> I tried installing rhev-h (20151218.1.iso ) and the file looks right, and I
> also then upgraded that (20151218.2) and it still looks right.  I did the
> upgrade via boot off CD.

I also tried an upgrade from rhev-m; still looks right.  What I've not tried is rhel6->rhel7 host upgrade - someone who knows the rhev-m side better than me needs to try and follow that to recreate it.

Comment 35 Michael Burman 2016-01-19 08:21:56 UTC
I tried to reproduce this report and maybe succeeded, but this time it fails with a different error.

- Before the upgrade (3.5) migration was working; after the upgrade to 3.6 (tested on cluster 3.5 and on cluster 3.6 after the upgrade), migration fails. Not sure if it is the same issue, but these are the same reproduction steps as described in the report ^^

- Please contact me for setup details; I will leave the setup in place for one day. Thanks.

- Red Hat Enterprise Virtualization Hypervisor release 7.2 (20160105.1.el7ev)
ovirt-node-3.2.3-30.el7.noarch
vdsm-4.16.32-1.el7ev.x86_64
libvirt-1.2.17-13.el7_2.2.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.4.x86_64  
kernel - 3.10.0-327.3.1.el7.x86_64      >>

- RHEV Hypervisor - 7.2 - 20160113.0.el7ev
ovirt-node-3.6.1-3.0.el7ev.noarch
vdsm-4.17.17-0.el7ev
libvirt-1.2.17-13.el7_2.2.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.4.x86_64
3.10.0-327.4.4.el7.x86_64


- Red Hat Enterprise Linux Server release 7.2 (Maipo) 
vdsm-4.16.32-1.el7ev.x86_64
libvirt-1.2.17-13.el7_2.2.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.4.x86_64
kernel - 3.10.0-327.el7.x86_64          >> 

- Red Hat Enterprise Linux Server release 7.2 (Maipo)
3.10.0 - 327.8.1.el7.x86_64
vdsm-4.17.17-0.el7ev
libvirt-1.2.17-13.el7_2.2.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.6.x86_64


vdsm log error from the source (rhev-h):

Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 211, in _recover
    self._destServer.destroy(self._vm.id)
AttributeError: 'SourceThread' object has no attribute '_destServer'
Thread-499::DEBUG::2016-01-19 08:19:14,160::__init__::206::jsonrpc.Notification::(emit) Sending event {"params": {"notify_time": 4301361260, "0974fb9c-131f-4ee4-a428-1d8172e489a2": {"status": "Migration Source"}}, "jsonrpc": "2.0", "method": "|virt|VM_status|0974fb9c-131f-4ee4-a428-1d8172e489a2"}
Thread-499::ERROR::2016-01-19 08:19:14,160::migration::310::virt.vm::(run) vmId=`0974fb9c-131f-4ee4-a428-1d8172e489a2`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 278, in run
    self._setupVdsConnection()
  File "/usr/share/vdsm/virt/migration.py", line 143, in _setupVdsConnection
    client = self._createClient(port)
  File "/usr/share/vdsm/virt/migration.py", line 130, in _createClient
    self.remoteHost, int(port), sslctx)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 1267, in create_connected_socket
    sock.connect((host, port))
  File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 181, in connect
    self.socket.connect(addr)
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
gaierror: [Errno -2] Name or service not known

Comment 36 Michal Skrivanek 2016-01-19 10:37:43 UTC
(In reply to Michael Burman from comment #35)
Please add the setup details, or logs, then.
Thanks

Comment 37 Michael Burman 2016-01-19 12:32:08 UTC
Hi Michal, I'm trying to upload logs, but I'm having issues with Bugzilla for some reason.

I will provide setup details in private.

I think I'm failing to migrate because of BZ 1232338 - the rhev-h server is set to localhost after upgrade and reboot.

I will fix it and see if I can reproduce the original error ^^
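
For anyone hitting the same gaierror: it just means the destination host name used for the migration does not resolve on the source host. A quick check (host name is a placeholder):

# getent hosts rhevh-dest.example.com     (on the source host)
# hostname -f                             (on the destination, to see what name it reports)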

Comment 38 Michael Burman 2016-01-19 13:20:20 UTC
I can't reproduce this report.
After fixing the localhost issue, migration is successful after the upgrade.

Comment 39 Dr. David Alan Gilbert 2016-01-20 11:29:03 UTC
Dominique: I think the right way to clean up that box is to reinstall your qemu-kvm-ev package; you should then find /usr/share/qemu-kvm/rhel6-virtio.rom is a nice 53k file again. If you do that, you should be able to migrate from rhel6 hosts into that box; however, if you've got VMs running on the box with the messed-up ROM, you won't be able to migrate them out. You need to shut those guests down, fix the qemu package install and then restart them.
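Something along these lines should do it; rpm -V will also show whether the packaged file was replaced:

# rpm -V qemu-kvm-ev
# yum reinstall qemu-kvm-ev
# ls -l /usr/share/qemu-kvm/rhel6-virtio.rom      (should be a regular ~53k file again, not a symlink)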


Dave

Comment 40 Dr. David Alan Gilbert 2016-01-20 11:45:40 UTC
mburman's errors have two different cases:

2015-12-21T15:42:35.446059Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2015-12-21T15:42:35.446247Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config

2015-12-22 07:29:04.675+0000: 18144: info : libvirt version: 1.2.17, package: 13.el7_2.2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-11-23-07:46:04, x86-019.build.eng.bos.redhat.com)
2015-12-22 07:29:04.675+0000: 18144: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7fb75410b1e0
2015-12-22T07:29:04.742599Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2015-12-22T07:29:04.742812Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config
2015-12-22T07:29:05.023584Z qemu-kvm: Length mismatch: 0000:00:03.0/virtio-net-pci.rom: 0x20000 in != 0x40000: Invalid argument
2015-12-22T07:29:05.023620Z qemu-kvm: error while loading state for instance 0x0 of device 'ram'
2015-12-22T07:29:05.023722Z qemu-kvm: load of migration failed: Invalid argument

So the NUMA warning appears on both sides and probably needs looking at; but also note he's using rhel7.2 machine types.

Now, for rhel7 machine types we have:
lrwxrwxrwx. 1 root root     20 Dec 18 12:51 pxe-virtio.rom -> ../ipxe/1af41000.rom
-rw-r--r--. 1 root root 262144 May  6  2015 /usr/share/ipxe/1af41000.rom
which comes from:
ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch

I can see that back in 2013 we had ipxe-roms that were smaller (66k) - so maybe this is what's happening: an old ipxe-roms package?

Comment 41 Dr. David Alan Gilbert 2016-01-20 12:10:41 UTC
OK, I think I see the problem: we're shipping the wrong ipxe ROMs in the rhev-h image, and the ROM there is ~66k - i.e. it rounds up to 128k (which is where we get the 0x20000) - but if you install rhel you get the latest rhel ipxe ROM, which is 0x40000.
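
Just to spell out the numbers in the 'Length mismatch' errors above (the sizes are the ROM BAR sizes, rounded up to a power of two):

# printf '0x%x\n' $((128*1024))
0x20000
# printf '0x%x\n' $((256*1024))
0x40000

i.e. the ~66k ROM on the source rounds up to 128k (0x20000), while the 256k RHEL 7 iPXE ROM on the destination is 0x40000, hence "0x20000 in != 0x40000".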

Comment 54 Ohad Levy 2016-01-26 09:54:38 UTC
adding lzap and Mike, as they might have a clue.

Comment 55 Michal Skrivanek 2016-01-28 13:55:33 UTC
Closing, since we identified that the problem is caused by running a different ipxe on one of the hypervisors (the source), likely pulled in by Katello or other means, making the VMs incompatible with the target machine.

Comment 56 Fabian Deutsch 2016-01-28 14:09:41 UTC
Dominique, do you have an idea how you have ended up with two different iPXE roms on the two machines?

Comment 57 Lukas Zapletal 2016-02-15 09:24:58 UTC
Hello guys,

we use iPXE to build our bootdisk ISOs, which are used for provisioning in non-PXE or non-DHCP environments. And since we have customers with modern hardware that is not supported by the iPXE shipped with RHEL, we build the latest and greatest iPXE into Satellite 6. Since it is not supported to have a RHEV hypervisor on the same server as Satellite 6, we were not expecting problems - and that was not even the case here.

After reading this BZ, we should perhaps consider changing this rebasing policy and instead ask our platform team to backport new drivers or fixes into the RHEL iPXE branch. This was not the root cause of this bug, but it sheds some light on this grey area.

