Bug 1258757 - qemu-kvm aborted while resuming a VM after suspend
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Juan Quintela
QA Contact: Virtualization Bugs
Whiteboard: virt
Depends On:
Blocks: 1172230 1154205
Reported: 2015-09-01 04:24 EDT by Israel Pinto
Modified: 2015-09-20 08:30 EDT
CC List: 20 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-09-09 14:39:31 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Engine log (155.90 KB, application/x-xz)
2015-09-01 04:25 EDT, Israel Pinto
host_logs (1.23 MB, application/x-xz)
2015-09-01 04:26 EDT, Israel Pinto

Description Israel Pinto 2015-09-01 04:24:48 EDT
Description of problem:
While testing memory hotplug on a VM, I suspended and resumed the VM.
The VM is now stuck in the restoring state.

Version-Release number of selected component (if applicable):
Hosts: RHEL 7.2
RHEVM: 3.6.0-0.12.master.el6
VDSM: vdsm-4.17.3-1.el7ev
libvirt: libvirt-1.2.17-5.el7

How reproducible:
Happened once.

Steps to Reproduce:
1. Create a VM with 15GB of memory
2. Hot-plug 1GB and then 3GB of additional memory
3. Suspend the VM and resume it
4. Note: a migration failed before step 3

Actual results:
The VM is stuck in the restoring state, and libvirt appears to have crashed.
There is a core dump on the host.
Extracting a backtrace from the dump failed.

Expected results:
The VM is up and running with the new memory.

Additional info:
1. Hosts are nested
2. journalctl -u libvirtd
[root@virt-nested-vm12 ~]# journalctl -u libvirtd
-- Logs begin at Mon 2015-08-31 12:32:24 IDT, end at Mon 2015-08-31 22:10:25 IDT. --
Aug 31 09:33:22 virt-nested-vm12.scl.lab.tlv.redhat.com systemd[1]: Starting Virtualization daemon...
Aug 31 12:33:29 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: libvirt version: 1.2.17, package: 5.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-08-
Aug 31 12:33:29 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: Module /usr/lib64/libvirt/connection-driver/libvirt_driver_lxc.so not accessible
Aug 31 12:33:33 virt-nested-vm12.scl.lab.tlv.redhat.com systemd[1]: Started Virtualization daemon.
Aug 31 14:11:33 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: internal error: Unsupported migration cookie feature memory-hotplug
Aug 31 14:11:33 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: Domain id=2 name='linux_vm_with_gui' uuid=692f956c-8802-4536-a710-e252c6ab7887 is tainted: hook-script
Aug 31 14:12:13 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: Cannot start job (query, none) for domain linux_vm_with_gui; current job is (async nested, migration i
Aug 31 14:12:13 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigratePrepa
Aug 31 14:12:28 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: Unable to read from monitor: Connection reset by peer
Aug 31 14:12:28 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: internal error: early end of file from monitor: possible problem:
                                                                       2015-08-31T11:11:35.349646Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9
                                                                       2015-08-31T11:11:35.350249Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA co
                                                                       red_dispatcher_loadvm_commands: 
                                                                       id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
                                                                       id 1, group 1, virt start 7fcb74c00000, virt end 7fcb78bfe000, generation 0, delta 7fcb74c00000
                                                                       id 2, group 1, virt start 7fcb72a00000, virt end 7fcb74a00000, generation 0, delta 7fcb72a00000
                                                                       ((null):16326): Spice-CRITICAL **: red_memslots.c:123:get_virt: slot_id 160 too big, addr=a00000000000
Aug 31 14:12:28 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: internal error: End of file from monitor
Aug 31 14:12:28 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: Unable to get index for interface vnet0: No such device
Aug 31 14:19:26 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: Domain id=3 name='linux_vm_with_gui' uuid=692f956c-8802-4536-a710-e252c6ab7887 is tainted: hook-script
Aug 31 14:19:51 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: internal error: End of file from monitor
Aug 31 14:19:51 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: internal error: early end of file from monitor: possible problem:
                                                                       2015-08-31T11:19:26.781204Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9
                                                                       2015-08-31T11:19:26.781332Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA co
                                                                       red_dispatcher_loadvm_commands: 
                                                                       id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
                                                                       id 1, group 1, virt start 7f3fdca00000, virt end 7f3fe09fe000, generation 0, delta 7f3fdca00000
                                                                       id 2, group 1, virt start 7f3fda800000, virt end 7f3fdc800000, generation 0, delta 7f3fda800000
                                                                       ((null):17309): Spice-CRITICAL **: red_memslots.c:123:get_virt: slot_id 160 too big, addr=a00000000400

3. vdsClient -s 0 list table
[root@virt-nested-vm12 ~]# vdsClient -s 0 list table
692f956c-8802-4536-a710-e252c6ab7887      0  linux_vm_with_gui    Restoring state
Comment 1 Israel Pinto 2015-09-01 04:25:24 EDT
Created attachment 1068876 [details]
Engine log
Comment 2 Israel Pinto 2015-09-01 04:26:16 EDT
Created attachment 1068877 [details]
host_logs
Comment 4 Jiri Denemark 2015-09-01 05:32:05 EDT
I don't see any sign of a libvirt crash here. The coredump says it was generated by /usr/libexec/qemu-kvm and according to logs it seems qemu-kvm aborted itself. Moving to qemu-kvm-rhev for further investigation.
Comment 5 Juan Quintela 2015-09-01 12:37:54 EDT
What do "suspend" and "resume" mean in this context?
"savevm" in qemu-devel terms, or something different?

BTW, the name of the machine is "virt-nested-vm12"; are we talking about a nested VM here?

While we are at it, can we get the command line that was launched on both ends?

Thanks, Juan.
Comment 6 Juan Quintela 2015-09-01 12:48:46 EDT
To the spice team:

Log messages stripped of timestamps, hostnames, and the like:

warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9
warning: All CPU(s) up to maxcpus should be described in NUMA co
         red_dispatcher_loadvm_commands: 
         id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
         id 1, group 1, virt start 7fcb74c00000, virt end 7fcb78bfe000, generation 0, delta 7fcb74c00000
         id 2, group 1, virt start 7fcb72a00000, virt end 7fcb74a00000, generation 0, delta 7fcb72a00000
         ((null):16326): Spice-CRITICAL **: red_memslots.c:123:get_virt: slot_id 160 too big, addr=a00000000000

Does this ring any bell?
Comment 7 Amit Shah 2015-09-01 20:16:46 EDT
We definitely need answers to comment 6, but here are a few observations.

From attachment 1068877 [details], linux_vm_with_gui.log, I see a qemu command line, pasted below.

There are several runs before attempting migration, but migration seems to be attempted on 31 Aug, which corresponds to what's given in comment 0.

There are also several such messages:

2015-08-31T08:15:17.949271Z qemu-kvm: error while loading state section id 2(ram)
2015-08-31T08:15:17.949799Z qemu-kvm: load of migration failed: Input/output error

which definitely correspond to a migration.  Igor should know more, re-assigning to him for input.

Another (possibly unrelated) observation: after each qemu run there are these messages:

2015-08-31 08:15:17.983+0000: shutting down
2015-08-31 08:28:01.739+0000: starting up libvirt version: 1.2.17, package: 5.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-08-13-18:08:20, x86-024.build.eng.bos.redhat.com), qemu version: 2.3.0 (qemu-kvm-rhev-2.3.0-19.el7)

does this mean libvirt is being shut down, or is this due to some libvirt crash?  (Correlate this with the other logs, also mentioned in comment 0:


Aug 31 14:11:33 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: internal error: Unsupported migration cookie feature memory-hotplug
Aug 31 14:11:33 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: Domain id=2 name='linux_vm_with_gui' uuid=692f956c-8802-4536-a710-e252c6ab7887 is tainted: hook-script
Aug 31 14:12:13 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: Cannot start job (query, none) for domain linux_vm_with_gui; current job is (async nested, migration i
Aug 31 14:12:13 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigratePrepa
Aug 31 14:12:28 virt-nested-vm12.scl.lab.tlv.redhat.com libvirtd[983]: Unable to read from monitor: Connection reset by peer

which indicate they happen during a failed migration.)

Are we tainting a domain that has memory hotplug enabled?  In any case, migration was attempted, and it failed.


LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=spice /usr/libexec/qemu-kvm -name linux_vm_with_gui -S -machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off -cpu Conroe -m size=15360000k,slots=16,maxmem=4294967296k -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0,mem=15000 -uuid 692f956c-8802-4536-a710-e252c6ab7887 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.2-3.el7,serial=48F8B571-EF77-4920-8FBB-A913F5BBF2E2,uuid=692f956c-8802-4536-a710-e252c6ab7887 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/linux_vm_with_gui.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2015-08-31T08:08:07,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/00000001-0001-0001-0001-000000000105/18af3aae-92c2-4c0c-8315-2d67a988a0e3/images/55e5f52a-5828-4f40-9cd3-c8077fbc0a27/5fa8b899-a07c-41fa-8546-b003eb78b33f,if=none,id=drive-virtio-disk0,format=raw,serial=55e5f52a-5828-4f40-9cd3-c8077fbc0a27,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=28,id=hostnet0,vhost=on,vhostfd=29 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:16:01:53,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/692f956c-8802-4536-a710-e252c6ab7887.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/692f956c-8802-4536-a710-e252c6ab7887.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice port=5900,tls-port=5901,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=33554432,vgamem_mb=16,bus=pci.0,addr=0x2 -incoming tcp:[::]:49152 -msg timestamp=on
Comment 8 Amit Shah 2015-09-02 00:07:09 EDT
From comment 0, you start VM with 15G, then hotplug 1G and 3G -- making it 4G total, right?

On the destination, do you start the VM with the 19G as the -m parameter?

From the command line in comment 7, you have:

   -m size=15360000k,slots=16,maxmem=4294967296k

what are the exact sizes you provide for hotplug, and the dest VM?
Comment 9 Gerd Hoffmann 2015-09-02 02:50:22 EDT
>          ((null):16326): Spice-CRITICAL **: red_memslots.c:123:get_virt:
> slot_id 160 too big, addr=a00000000000

qxl command fails sanity checks, slot_id should be 1 or 2.
Most likely this comes from corrupted qxl device memory.
Comment 10 Amit Shah 2015-09-02 03:13:06 EDT
(In reply to Amit Shah from comment #8)
> From comment 0, you start VM with 15G, then hotplug 1G and 3G -- making it
> 4G total, right?
> 
> On the destination, do you start the VM with the 19G as the -m parameter?
> 
> From the command line in comment 7, you have:
> 
>    -m size=15360000k,slots=16,maxmem=4294967296k
> 
> what are the exact sizes you provide for hotplug, and the dest VM?

BTW it looks like these logs are from the destination, because they have the -incoming parameter:

    -incoming tcp:[::]:49152

which means the amount of RAM provided to the dest VM is inconsistent with that on the host (15G + 4G hotplugged).  This will definitely fail migration.
Comment 11 Israel Pinto 2015-09-02 03:43:28 EDT
comment 5:
1. "suspend" means hibernate
2. The host is "virt-nested-vm12"
3. Command: 
   With vdsClient: 
   pause vm: vdsClient -s0 pause <vmId> 
   resume vm: vdsClient -s0 continue <vmId> 
comment 8:
We start the VM with 15GB and then hot-plug 1GB and 3GB --> 4GB hotplugged in total.
The VM had 19096MB and we ran a migration. The migration failed.
I think it is because the destination host did not have enough available memory.
Comment 12 Amit Shah 2015-09-02 03:51:35 EDT
Can you please provide the qemu command line from both the src and dest hosts?

Does this work if you start the VM with much lower RAM (say 10G) and then hotplug 4G, so the dest has enough RAM to work with?  Does the migration fail then as well?
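
For reference, the exact command line can be pulled from libvirt's per-domain
log on each host. A minimal sketch, assuming the default log location and the
domain name seen in the attached logs:

    # Each "starting up" entry in the per-domain log is followed by the full
    # qemu-kvm command line that libvirt launched.
    grep -A 1 'starting up' /var/log/libvirt/qemu/linux_vm_with_gui.log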
Comment 13 Israel Pinto 2015-09-02 04:21:19 EDT
I don't use the CLI for migration or hot-plug, only the RHEVM UI.
I ran hot-plug tests where I created a 10GB VM, hot-plugged 1, 3 and 4 GB,
and invoked a migration; the migration worked.
Then, on the destination host, I hot-plugged 1, 3 and 4 GB and migrated the VM again; the migration worked.
Comment 14 Amit Shah 2015-09-02 04:30:12 EDT
(In reply to Israel Pinto from comment #13)
> I don't use the CLI for migration or hot-plug, only the RHEVM UI.

Understood; but the qemu CLI can be found from the logs.  That is the best way for us qemu developers to know how the machine was started and how it would behave.

> I ran hot-plug tests where I created a 10GB VM, hot-plugged 1, 3 and 4 GB,
> and invoked a migration; the migration worked.
> Then, on the destination host, I hot-plugged 1, 3 and 4 GB and migrated the
> VM again; the migration worked.

So is it alright to assume that this issue was due to lack of available memory on the destination, as you said in comment 11, and we can close this as NOTABUG?
Comment 15 Juan Quintela 2015-09-02 04:59:51 EDT
Igor, how do memory hotplug slots work?

Something like:

Source:
  slot 0=15GB
  slot 1 = 1GB
  slot 2 = 3GB
on destination we just have
  slot 0 = 19GB

And if so, why does only qxl seem to care? Or why does it work with only one hotplugged module?

For migration of memory, we don't really care, but it appears that spice cares.
Comment 16 Juan Quintela 2015-09-02 05:01:13 EDT
Israel, can you confirm that when migration with hotplug works, you are using spice?  Thanks
Comment 17 Igor Mammedov 2015-09-02 05:25:57 EDT
(In reply to Juan Quintela from comment #15)
> Igor, how do mem-plug slots work?
> 
> Something like:
> 
> Source:
>   slot 0=15GB
>   slot 1 = 1GB
>   slot 2 = 3GB
> on destination we just have
>   slot 0 = 19GB
That's wrong; doing this will result in a different memory map, and migration will fail when it tries to migrate RAMBlocks that don't match.

Memory hotplug operates in terms of DIMM devices, so if you have N DIMM devices on the source, you should have the same number of DIMM devices on the target, where each device is exactly the same as on the source, meaning all of its properties (slot, size, memdev, node, addr) match.

Usually, if addr/slot are not provided, auto-assignment does a good job of picking the same values as on the source, provided the DIMM devices appear on the CLI in the same order as they were added to QEMU (coldplug/hotplug).
But libvirt shouldn't rely on that; it should query all the DIMMs' properties on the source and specify them explicitly on the target.

> 
> And if so, why does only qxl seem to care? Or why does it work with only one
> hotplugged module?
> 
> For migration of memory, we don't really care, but it appears that spice
> cares.
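
As a rough illustration of that query step (a sketch using virsh and the QMP
query-memory-devices command, not necessarily how libvirt implements it), the
DIMM properties can be read on the source like this:

    # Dump addr, slot, node, size and memdev of every plugged DIMM on the source
    # so the same values can be specified explicitly on the target.
    virsh qemu-monitor-command linux_vm_with_gui --pretty \
        '{"execute":"query-memory-devices"}'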
Comment 18 Amit Shah 2015-09-02 05:34:08 EDT
Juan, Jiri and I discussed this bug and took a close look at the logs (linux_vm_with_gui.log).  All the VM invocations in that log have the -incoming set, so it means the VMs have been received, and this must be the dest host.

Our findings are:
* The spice error in comment 0 and comment 6 only happens when the migration did not fail.  Did a migration actually succeed before this?  We don't know.

* When the spice failure happens, new memslots are inserted via the cmdline to reflect memory hotplug:

    -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0,addr=17179869184 -object memory-backend-ram,id=memdimm1,size=3221225472,host-nodes=0,policy=interleave -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1,addr=18253611008

* When the migration fails, the memslots are not present in the cmdline.  Was mem hotplug attempted before this?  We can't say.

* Suspend/resume from comment 0 is most likely not hibernation.  The only way to hibernate a guest is by invoking it from within the guest.  Was that done?  (E.g. pm-hibernate, or similar, in the guest?)  In the likely case that this was not hibernation, we think it was actually migrate-to-file that was attempted.  This adds a new dimension, where a live migration followed by migrate-to-file (and restore-from-file) was attempted.

* There seem to be various different tests run in different invocations going by the different failures observed.  Please record all runs, tests, and their corresponding logs for us to examine.  There might be more than one bug here.
Comment 19 Amit Shah 2015-09-02 05:34:51 EDT
Please see comments 18, 16 for questions.
Comment 20 Jiri Denemark 2015-09-02 05:50:48 EDT
(In reply to Igor Mammedov from comment #17)
> > Source:
> >   slot 0=15GB
> >   slot 1 = 1GB
> >   slot 2 = 3GB
> > on destination we just have
> >   slot 0 = 19GB
> That's wrong; doing this will result in a different memory map, and migration
> will fail when it tries to migrate RAMBlocks that don't match.
> 
> Memory hotplug operates in terms of DIMM devices, so if you have N DIMM
> devices on the source, you should have the same number of DIMM devices on the
> target, where each device is exactly the same as on the source, meaning all
> of its properties (slot, size, memdev, node, addr) match.
> 
> Usually, if addr/slot are not provided, auto-assignment does a good job of
> picking the same values as on the source, provided the DIMM devices appear on
> the CLI in the same order as they were added to QEMU (coldplug/hotplug).
> But libvirt shouldn't rely on that; it should query all the DIMMs' properties
> on the source and specify them explicitly on the target.

I think libvirt is doing the right thing (and it corresponds to what Peter explained to me when he was implementing memory hotplug in libvirt)... the process with -incoming has

-m size=15360000k,slots=16,maxmem=4294967296k \
-object memory-backend-ram,id=memdimm0,size=1073741824,host-nodes=0,policy=interleave \
-device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0,addr=17179869184 \
-object memory-backend-ram,id=memdimm1,size=3221225472,host-nodes=0,policy=interleave \
-device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1,addr=18253611008

That is 15G base memory + 1G module + 3G module.
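
As a quick sanity check on those numbers (and on the 19096MB figure in comment 11), in MiB:

    # base (-m size=15360000k) plus the two memory-backend sizes, converted to MiB
    echo $(( 15360000 / 1024 + 1073741824 / 1048576 + 3221225472 / 1048576 ))   # prints 19096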
Comment 21 Israel Pinto 2015-09-02 06:52:16 EDT
comment 16: 
I don't use spice while migrating the VM.
I only have an open ssh session to the VM.
comment 18:
1. This is the first time I am checking migration after memory hot-plug.
2. I hot-plugged the memory and then ran suspend/resume. I did not hibernate
the VM from within the guest.
Comment 22 Michal Skrivanek 2015-09-02 07:16:33 EDT
(In reply to Israel Pinto from comment #21)
> comment 16: 
> I don't use spice while migrating the VM.
> I only have an open ssh session to the VM.
> comment 18:
> 1. This is the first time I am checking migration after memory hot-plug.
> 2. I hot-plugged the memory and then ran suspend/resume. I did not
> hibernate the VM from within the guest.

Please clarify once more what you mean by suspend/resume. In a previous comment you mentioned vdsClient... but that does a very different thing from Suspend/Resume in the UI (it does pause/unpause).
Comment 23 Jiri Denemark 2015-09-02 07:27:21 EDT
I noticed one more thing: I can see -incoming tcp:... and -incoming fd:... in the log, which means there was a normal incoming migration attempt followed by an incoming migration from a file. And since both logs are about the same domain and it wasn't started in another way between the two logs, the following seems to be what happened here:

0) start a VM, do some memory hotplug, etc., we can't really tell exactly from the qemu log
1) hostA: migrate VM to hostB (failed)
2) hostA: migrate VM to file (aka virsh save or Suspend in RHEV UI)
3) hostB: migrate VM from file (aka virsh restore or Resume in RHEV UI)

Is this what you did Israel?

So doing 2 and 3 is pretty much the same as 1 except that if it fails, the VM is still running on hostA in 1 while it is not running anywhere in 2+3.
Comment 24 Israel Pinto 2015-09-02 07:42:41 EDT
Comment 22: As I wrote at first, I did the suspend from the GUI and the resume (Run) from the GUI only. I did not run any CLI commands.
Comment 23:
0) start a VM, do some memory hotplug, etc., we can't really tell exactly from the qemu log >>  YES 
1) hostA: migrate VM to hostB (failed) >> YES
2) hostA: migrate VM to file (aka virsh save or Suspend in RHEV UI) >> YES (did suspend) 
The VM got stuck in the restoring state; I only clicked Run VM (from the GUI).
3) hostB: migrate VM from file (aka virsh restore or Resume in RHEV UI)
Comment 25 Michal Skrivanek 2015-09-03 11:09:51 EDT
Do we need a follow-up on the spice error? Comment 9 indicates a serious issue.
Perhaps try to reproduce it separately and open a new bug?

Isolating the problematic run would be very helpful; I don't think we have a clear picture of the exact state of the VM.
Comment 26 Amit Shah 2015-09-03 23:02:04 EDT
We are more or less groping in the dark due to the lack of logs and of an accurate report of what was tried and how.

From our findings, there have been multiple VM runs in various stages, and various tests have been tried.  The spice error and the live migration error seem to be unrelated based on the one log we have.  However, this is just guesswork, and we don't know what steps were tried.

So until we get a clear set of steps that were tried and the corresponding logs, we can't really proceed further.

Israel, please go through all the comments we've made and try to answer all of them in as much detail as possible; that will help move this forward.
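
As a sketch of what would help, assuming the usual log locations and a
placeholder path for the qemu-kvm core file:

    # From both hosts, for the time window of the failed run:
    journalctl -u libvirtd --since "2015-08-31" > libvirtd.journal.txt
    cp /var/log/libvirt/qemu/linux_vm_with_gui.log .
    cp /var/log/vdsm/vdsm.log .
    # If the qemu-kvm core dump is still around, a backtrace from it:
    gdb /usr/libexec/qemu-kvm /path/to/core.qemu-kvm -batch -ex 'thread apply all bt'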
Comment 27 Jiri Denemark 2015-09-04 03:25:36 EDT
Actually, we could try to reproduce the migration error using save/restore after memory hotplug on a single machine, provided there is a machine that can run a 19GB VM.
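
A minimal sketch of that single-host reproduction (the DIMM XML and the file
paths are only examples; the domain name is the one from this report). With a
file dimm-1g.xml describing a 1GiB DIMM on guest NUMA node 0:

    <memory model='dimm'>
      <target>
        <size unit='GiB'>1</size>
        <node>0</node>
      </target>
    </memory>

the steps would be roughly:

    # Hot-plug the DIMM into the running guest
    virsh attach-device linux_vm_with_gui dimm-1g.xml --live
    # Save the VM state to a file and restore it on the same machine
    virsh save linux_vm_with_gui /var/tmp/linux_vm_with_gui.save
    virsh restore /var/tmp/linux_vm_with_gui.save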
Comment 29 Dr. David Alan Gilbert 2015-09-07 07:07:33 EDT
I wonder: there's a message from Andrey Korolyov on qemu-devel about corruption with migration/suspend plus hot-plugged DIMMs; are we hitting that:

https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg03117.html

and he followed up saying he still gets it (but more rarely):

https://lists.gnu.org/archive/html/qemu-devel/2015-09/msg00940.html

but those are reports of the guest crashing, not qemu itself.
Comment 30 Igor Mammedov 2015-09-07 10:21:41 EDT
(In reply to Dr. David Alan Gilbert from comment #29)
> I wonder; there's a message from Andrey Korolyov on qemu-devel about
> corruption with migration/suspend+hot plug DIMMs; are we hitting that:
> 
> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg03117.html
> 
> and he followed up saying he still gets it (but more rarely):
> 
> https://lists.gnu.org/archive/html/qemu-devel/2015-09/msg00940.html
> 
> but those are reports of the guest crashing not the qemu.

The crash above supposedly happens under I/O workload, with QEMU running without graphics.
Comment 31 Gil Klein 2015-09-08 09:02:56 EDT
Israel, please try reproducing it again. 
If it does not reproduce, I suggest we remove the blocker flag, and not block RHEL 7.2 on this.
Comment 32 Israel Pinto 2015-09-08 09:27:40 EDT
re-tested it with:
Red Hat Enterprise Virtualization Manager Version: 3.6.0-0.13.master.el6
Host rhel7.2: vdsm-4.17.5-1.el7ev

Scenarios: 
1. 
 1.1 Vm with 5GB memory, hot plug memory of 256M
 1.2 Suspend VM , Resume VM  -- PASS
2. 
 2.1 On the same VM hot plug memory of 1GB and 2GB
 2.2 Suspend VM , Resume VM  -- PASS

Removing blocker
Comment 33 Dr. David Alan Gilbert 2015-09-08 09:30:37 EDT
(In reply to Israel Pinto from comment #32)
> re-tested it with:
> Red Hat Enterprise Virtualization Manager Version: 3.6.0-0.13.master.el6
> Host rhel7.2: vdsm-4.17.5-1.el7ev
> 
> Scenarios: 
> 1. 
>  1.1 Vm with 5GB memory, hot plug memory of 256M
>  1.2 Suspend VM , Resume VM  -- PASS
> 2. 
>  2.1 On the same VM hot plug memory of 1GB and 2GB
>  2.2 Suspend VM , Resume VM  -- PASS
> 
> Removing blocker


Great; now please try it with the 16GB memory and then hot plug 1GB and 2GB
(on a host with at least 20GB RAM)
Comment 34 Dr. David Alan Gilbert 2015-09-09 10:29:59 EDT
I've tried to recreate this on my pair of test hosts and haven't managed to make it fail. I've done a fairly random mix of adding RAM, migrating, saving, restoring and a few more migrations, with a spice display and 16GB RAM as a starting point.
I haven't been stressing the guest hard while doing it, though.
Comment 35 Juan Quintela 2015-09-09 14:39:31 EDT
As we are not able to reproduce this, I am closing the bug.

If the reporter is able to reproduce it and provide exact steps, we will work on it again.

Thanks.
Comment 36 Israel Pinto 2015-09-20 08:30:45 EDT
re-tested it with:
Red Hat Enterprise Virtualization Manager Version: 
rhevm-3.6.0-0.16.master.el6.noarch
Host rhel7.2: libvirt-daemon-1.2.17-5.el7.x86_64
vdsm-4.17.7-1.el7ev.noarch


Scenario: 
 1.1 Vm with 16GB memory, hot plug memory of 1GB and 2GB
 1.2 Suspend VM , Resume VM  -- PASS
