Bug 894723 - libvirt: cannot resume a suspended vm - vm is shut down
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
6.4
x86_64 Linux
high Severity urgent
: rc
: ---
Assigned To: Michal Privoznik
Virtualization Bugs
: Regression, TestBlocker, ZStream
: 910706 (view as bug list)
Depends On:
Blocks: 907972 910706 915347 947865
Reported: 2013-01-13 07:47 EST by Dafna Ron
Modified: 2013-11-21 03:37 EST (History)
18 users

See Also:
Fixed In Version: libvirt-0.10.2-19.el6
Doc Type: Bug Fix
Doc Text:
When a VM was saved into a compressed file and decompression of that file failed while libvirt was trying to resume the VM, libvirt removed the VM from the list of running VMs, but did not remove the corresponding QEMU process. With this update, the QEMU process is killed in such cases. Moreover, non-fatal decompression errors are now ignored and a VM can be successfully resumed if such an error occurs.
Story Points: ---
Clone Of:
: 913226 (view as bug list)
Environment:
Last Closed: 2013-11-21 03:37:29 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
logs (1.32 MB, application/x-gzip)
2013-01-13 07:47 EST, Dafna Ron
no flags Details
logs (675.48 KB, application/x-gzip)
2013-01-20 04:46 EST, Dafna Ron
no flags Details

Description Dafna Ron 2013-01-13 07:47:11 EST
Created attachment 677692 [details]
logs

Description of problem:

I suspended a vm and failed to resume it.
It seems that the vm is killed in libvirt and we are able to rerun it, but the qemu pid is still alive (see additional info).
So aside from not being able to resume a suspended vm, we have two qemu processes for the same vm.

Version-Release number of selected component (if applicable):

libvirt-devel-0.10.2-14.el6.x86_64
vdsm-4.10.2-3.0.el6ev.x86_64
qemu-kvm-rhev-0.12.1.2-2.348.el6.x86_64

How reproducible:

100%

Steps to Reproduce:
1. run and suspend a vm
2. try to resume the vm
3. try to restart the vm
  
Actual results:

We fail to resume the vm, and the vm shuts down.

Expected results:

We should be able to resume the vm.

Additional info: logs

I suspended the same vm twice - the third time I ran it there were 3 qemu processes running, all for the same vm:

Tasks: 353 total,   1 running, 350 sleeping,   1 stopped,   1 zombie
Cpu(s):  0.8%us,  0.5%sy,  0.0%ni, 98.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32868340k total,  1050024k used, 31818316k free,    49848k buffers
Swap: 16383992k total,        0k used, 16383992k free,   372624k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                        
 7056 qemu      20   0 1043m  29m 5948 S  1.0  0.1   0:25.00 qemu-kvm                                                                                                                                                                       
 6804 qemu      20   0 1042m  19m 5872 S  0.3  0.1   0:03.50 qemu-kvm                                                                                                                                                                       
27311 qemu      20   0 1042m  34m 5872 S  0.3  0.1   0:26.69 qemu-kvm                                                                                                                                                                       
10202 qemu      15  -5     0    0    0 Z  0.0  0.0   0:00.00 python <defunct> 



ps shows the same vm name and same disk, but different pids:

[root@gold-vdsc ~]# ps -elf |grep qemu
6 S qemu      6804     1  0  80   0 - 266960 poll_s 14:34 ?       00:00:03 /usr/libexec/qemu-kvm -name DESKTOP -S -M rhel6.4.0 -cpu Conroe -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -uuid 213d5348-4215-4f79-b313-81b23af4f502 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.4.0.3.el6,serial=58BF6CC0-8ED1-11E0-A8E7-CFC2C2C2C2D6,uuid=213d5348-4215-4f79-b313-81b23af4f502 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/DESKTOP.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2013-01-13T00:34:12,driftfix=slew -no-shutdown -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/afcde1c5-6022-4077-ab06-2beed7e5e404/4230e713-9cc4-40ba-9c96-876ccca30a9d/images/5b2ce745-d7a8-469d-889a-4f23ccf623f9/8b8e5554-fc54-4452-a513-e213a69fd354,if=none,id=drive-virtio-disk0,format=raw,serial=5b2ce745-d7a8-469d-889a-4f23ccf623f9,cache=none,werror=stop,rerror=stop,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=31,id=hostnet0,vhost=on,vhostfd=32 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:23:61:5a,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/DESKTOP.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/DESKTOP.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device 
virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -chardev pty,id=charconsole0 -device virtconsole,chardev=charconsole0,id=console0 -spice port=5902,tls-port=5903,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -incoming fd:27 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7
6 S qemu      7056     1  5  80   0 - 267026 poll_s 14:35 ?       00:00:25 /usr/libexec/qemu-kvm -name DESKTOP -S -M rhel6.4.0 -cpu Conroe -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -uuid 213d5348-4215-4f79-b313-81b23af4f502 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.4.0.3.el6,serial=58BF6CC0-8ED1-11E0-A8E7-CFC2C2C2C2D6,uuid=213d5348-4215-4f79-b313-81b23af4f502 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/DESKTOP.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2013-01-13T00:35:02,driftfix=slew -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/afcde1c5-6022-4077-ab06-2beed7e5e404/4230e713-9cc4-40ba-9c96-876ccca30a9d/images/5b2ce745-d7a8-469d-889a-4f23ccf623f9/8b8e5554-fc54-4452-a513-e213a69fd354,if=none,id=drive-virtio-disk0,format=raw,serial=5b2ce745-d7a8-469d-889a-4f23ccf623f9,cache=none,werror=stop,rerror=stop,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:23:61:5a,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/DESKTOP.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/DESKTOP.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device 
virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -chardev pty,id=charconsole0 -device virtconsole,chardev=charconsole0,id=console0 -spice port=5904,tls-port=5905,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7
0 S root      8176 25402  0  80   0 - 25812 pipe_w 14:43 pts/0    00:00:00 grep qemu
5 Z qemu     10202  8666  0  75  -5 -     0 exit   09:34 ?        00:00:00 [python] <defunct>
6 S qemu     27311     1  0  80   0 - 266960 poll_s 13:32 ?       00:00:27 /usr/libexec/qemu-kvm -name DESKTOP -S -M rhel6.4.0 -cpu Conroe -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -uuid 213d5348-4215-4f79-b313-81b23af4f502 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.4.0.3.el6,serial=58BF6CC0-8ED1-11E0-A8E7-CFC2C2C2C2D6,uuid=213d5348-4215-4f79-b313-81b23af4f502 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/DESKTOP.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2013-01-12T23:32:29,driftfix=slew -no-shutdown -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/afcde1c5-6022-4077-ab06-2beed7e5e404/4230e713-9cc4-40ba-9c96-876ccca30a9d/images/5b2ce745-d7a8-469d-889a-4f23ccf623f9/8b8e5554-fc54-4452-a513-e213a69fd354,if=none,id=drive-virtio-disk0,format=raw,serial=5b2ce745-d7a8-469d-889a-4f23ccf623f9,cache=none,werror=stop,rerror=stop,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=29,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:23:61:5a,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/DESKTOP.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/DESKTOP.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device 
virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -chardev pty,id=charconsole0 -device virtconsole,chardev=charconsole0,id=console0 -spice port=5900,tls-port=5901,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -incoming fd:27 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7


[root@gold-vdsc ~]# virsh -r list
 Id    Name                           State
----------------------------------------------------
 5     DESKTOP                        running
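The duplicate-process situation above can be spotted quickly from ps output. A minimal sketch (the helper name is hypothetical; it assumes each qemu command line carries a uuid= token, as in the listings above):

```shell
# Print the count and UUID of any VM that owns more than one qemu process.
# Reads ps-style command lines on stdin; each qemu-kvm line is expected to
# contain exactly one "uuid=<uuid>" token (from the -smbios argument).
count_dup_uuids() {
  grep -o 'uuid=[0-9a-f-]*' | sort | uniq -c | awk '$1 > 1 { print $1, $2 }'
}
```

Run as, e.g., `ps -eo args | grep '[q]emu-kvm' | count_dup_uuids`; against the listing above it would report three processes for the single DESKTOP uuid.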
Comment 1 Peter Krempa 2013-01-14 06:14:59 EST
Hi Dafna,

could you please be more specific on the reproducer case?

with packages:
libvirt-0.10.2-15.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.351.el6.x86_64

I'm not able to reproduce the issue with the following steps:

$ virsh start RHEL_nightly
Domain RHEL_nightly started

$ virsh list
 Id    Name                           State
----------------------------------------------------
 2     RHEL_nightly                   running

$ virsh suspend 2
Domain 2 suspended

$ virsh list
 Id    Name                           State
----------------------------------------------------
 2     RHEL_nightly                   paused

$ virsh resume 2
Domain 2 resumed

$ virsh list
 Id    Name                           State
----------------------------------------------------
 2     RHEL_nightly                   running

Please try the steps above, try upgrading the packages to the current newest versions, or provide more detailed reproducer steps.

Thanks
Comment 2 Dafna Ron 2013-01-14 11:00:32 EST
Still reproduced with:
libvirt-0.10.2-15.el6.x86_64 
qemu-kvm-rhev-0.12.1.2-2.348.el6.x86_64

I am using rhevm (meaning vdsm is installed and all commands are sent through engine -> vdsm) 

1. run a vm -> suspend the vm -> resume the vm. 

The ERROR shown in the vdsm log comes from libvirt:

Thread-1370::ERROR::2013-01-14 17:46:44,298::vm::680::vm.Vm::(_startUnderlyingVm) vmId=`213d5348-4215-4f79-b313-81b23af4f502`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 642, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/libvirtvm.py", line 1472, in _run
    self._connection.restore(fname)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 3155, in restore
    if ret == -1: raise libvirtError ('virDomainRestore() failed', conn=self)
libvirtError: internal error Child process (lzop -dc) unexpected exit status 2


I ran the same vm again after it exited when the restore failed:

[root@gold-vdsc ~]# virsh -r list
 Id    Name                           State
----------------------------------------------------
 4     DESKTOP                        running

As you can see, we have one vm shown as running in virsh, while there are two qemu pids for the same exact vm:

[root@gold-vdsc ~]# ps -elf |grep qemu
6 S qemu     23501     1  1  80   0 - 266960 poll_s 17:46 ?       00:00:00 /usr/libexec/qemu-kvm -name DESKTOP -S -M rhel6.4.0 -cpu Conroe -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -uuid 213d5348-4215-4f79-b313-81b23af4f502 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.4.0.3.el6,serial=58BF6CC0-8ED1-11E0-A8E7-CFC2C2C2C2D6,uuid=213d5348-4215-4f79-b313-81b23af4f502 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/DESKTOP.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2013-01-14T03:46:40,driftfix=slew -no-shutdown -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/afcde1c5-6022-4077-ab06-2beed7e5e404/4230e713-9cc4-40ba-9c96-876ccca30a9d/images/5b2ce745-d7a8-469d-889a-4f23ccf623f9/8b8e5554-fc54-4452-a513-e213a69fd354,if=none,id=drive-virtio-disk0,format=raw,serial=5b2ce745-d7a8-469d-889a-4f23ccf623f9,cache=none,werror=stop,rerror=stop,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=29,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:23:61:5a,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/DESKTOP.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/DESKTOP.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device 
virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -chardev pty,id=charconsole0 -device virtconsole,chardev=charconsole0,id=console0 -spice port=5900,tls-port=5901,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -incoming fd:27 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7
6 S qemu     23749     1 10  80   0 - 266894 poll_s 17:47 ?       00:00:00 /usr/libexec/qemu-kvm -name DESKTOP -S -M rhel6.4.0 -cpu Conroe -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -uuid 213d5348-4215-4f79-b313-81b23af4f502 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.4.0.3.el6,serial=58BF6CC0-8ED1-11E0-A8E7-CFC2C2C2C2D6,uuid=213d5348-4215-4f79-b313-81b23af4f502 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/DESKTOP.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2013-01-14T03:47:25,driftfix=slew -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/afcde1c5-6022-4077-ab06-2beed7e5e404/4230e713-9cc4-40ba-9c96-876ccca30a9d/images/5b2ce745-d7a8-469d-889a-4f23ccf623f9/8b8e5554-fc54-4452-a513-e213a69fd354,if=none,id=drive-virtio-disk0,format=raw,serial=5b2ce745-d7a8-469d-889a-4f23ccf623f9,cache=none,werror=stop,rerror=stop,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:23:61:5a,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/DESKTOP.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/DESKTOP.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device 
virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -chardev pty,id=charconsole0 -device virtconsole,chardev=charconsole0,id=console0 -spice port=5902,tls-port=5903,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7
0 S root     23788  7099  0  80   0 - 25812 pipe_w 17:47 pts/0    00:00:00 grep qemu
Comment 4 EricLee 2013-01-15 06:12:25 EST
Hi Dafna,

I cannot reproduce this issue in rhevm with the packages:
# rpm -qa libvirt qemu-kvm-rhev vdsm
vdsm-4.10.2-1.1.el6.x86_64
libvirt-0.10.2-15.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.351.el6.x86_64

run a vm -> suspend the vm -> run the vm
completed successfully without error.

and # ps aux | grep qemu
qemu     28531  0.0  0.0      0     0 ?        Z<   05:23   0:00 [python] <defunct>
qemu     29978  1.2  3.2 1221752 260068 ?      Sl   05:33   0:00 /usr/libexec/qemu-kvm -name libing -S -M rhel6.3.0 -cpu Penryn -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -uuid c2525d68-9c29-4548-b78d-3ed326cb1378 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.4.0.3.el6,serial=0FBD2800-7BD9-11E1-0000-E839354BFEEA_e8:39:35:4b:fe:ea,uuid=c2525d68-9c29-4548-b78d-3ed326cb1378 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/libing.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2013-01-14T22:33:50,driftfix=slew -no-shutdown -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/4005fe5d-b136-4342-b478-1f6a3bc38f1a/e0dfdb68-4f68-40ca-8c54-7506c6c88127/images/b8455d40-bbf4-4da8-82af-51ac274d0e96/fe25752f-1067-4bd5-b29c-67a0605be546,if=none,id=drive-virtio-disk0,format=raw,serial=b8455d40-bbf4-4da8-82af-51ac274d0e96,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=31,id=hostnet0,vhost=on,vhostfd=32 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:a8:7a:8b,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/libing.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/libing.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device 
virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -chardev pty,id=charconsole0 -device virtconsole,chardev=charconsole0,id=console0 -spice port=5900,tls-port=5901,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -incoming fd:29 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7
root     30071  0.0  0.0 103248   852 pts/0    S+   05:34   0:00 grep qemu

Only one qemu process; it seems to be working normally.

Did I miss some steps?

Thanks, 
EricLee
Comment 5 Dafna Ron 2013-01-15 06:38:58 EST
I can see a difference in qemu: 
qemu-kvm-rhev-0.12.1.2-2.348.el6.x86_64

my kernel is 2.6.32-353.el6.x86_64

I am using iscsi storage (perhaps it's a file/block issue).
Comment 6 EricLee 2013-01-15 07:28:17 EST
(In reply to comment #5)
> I can see a difference in qemu: 
> qemu-kvm-rhev-0.12.1.2-2.348.el6.x86_64
> 
> my kernel is 2.6.32-353.el6.x86_64
> 
> I am using iscsi storage (perhaps its a file/block issue).

Yes, I think that's the root cause.

I can reproduce it with iscsi storage in rhevm.

Resuming the suspended guest fails:
# cat /var/log/vdsm/vdsm.log
.....
Thread-3212::ERROR::2013-01-15 07:23:22,089::vm::680::vm.Vm::(_startUnderlyingVm) vmId=`963a1200-881b-4b39-89ae-1c05fcfae2db`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 642, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/libvirtvm.py", line 1427, in _run
    self._connection.restore(fname)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 3155, in restore
    if ret == -1: raise libvirtError ('virDomainRestore() failed', conn=self)
libvirtError: internal error Child process (lzop -dc) unexpected exit status 2
....

Two processes appear after restarting the same guest:
# ps aux | grep qemu
qemu      2619  1.2  6.4 1068164 521320 ?      Sl   07:23   0:00 /usr/libexec/qemu-kvm -name bug -S -M rhel6.3.0 -cpu Penryn -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -uuid 963a1200-881b-4b39-89ae-1c05fcfae2db -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.4.0.3.el6,serial=0FBD2800-7BD9-11E1-0000-E839354BFEEA_e8:39:35:4b:fe:ea,uuid=963a1200-881b-4b39-89ae-1c05fcfae2db -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/bug.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2013-01-15T00:23:16,driftfix=slew -no-shutdown -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/99cd5a84-fd8e-40aa-a0f8-ddc78ff95c83/d6414725-3778-46c1-9215-fdd8d52bfdde/images/c7e4f16d-9004-4a39-9a38-9b87535675a0/6b6331ef-1b4e-445f-8b5b-f129b96100ae,if=none,id=drive-virtio-disk0,format=raw,serial=c7e4f16d-9004-4a39-9a38-9b87535675a0,cache=none,werror=stop,rerror=stop,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2 -netdev tap,fd=31,id=hostnet0,vhost=on,vhostfd=32 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:a8:7a:09,bus=pci.0,addr=0x3,bootindex=1 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/bug.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/bug.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device 
virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -chardev pty,id=charconsole0 -device virtconsole,chardev=charconsole0,id=console0 -spice port=5900,tls-port=5901,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -incoming fd:29 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7
qemu      2881 27.8  0.3 1068280 29008 ?       Sl   07:23   0:01 /usr/libexec/qemu-kvm -name bug -S -M rhel6.3.0 -cpu Penryn -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -uuid 963a1200-881b-4b39-89ae-1c05fcfae2db -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.4.0.3.el6,serial=0FBD2800-7BD9-11E1-0000-E839354BFEEA_e8:39:35:4b:fe:ea,uuid=963a1200-881b-4b39-89ae-1c05fcfae2db -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/bug.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2013-01-15T00:23:57,driftfix=slew -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/99cd5a84-fd8e-40aa-a0f8-ddc78ff95c83/d6414725-3778-46c1-9215-fdd8d52bfdde/images/c7e4f16d-9004-4a39-9a38-9b87535675a0/6b6331ef-1b4e-445f-8b5b-f129b96100ae,if=none,id=drive-virtio-disk0,format=raw,serial=c7e4f16d-9004-4a39-9a38-9b87535675a0,cache=none,werror=stop,rerror=stop,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=29,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:a8:7a:09,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/bug.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/bug.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device 
virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -chardev pty,id=charconsole0 -device virtconsole,chardev=charconsole0,id=console0 -spice port=5902,tls-port=5903,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7
root      2918  0.0  0.0 103248   852 pts/0    S+   07:24   0:00 grep qemu
qemu     28531  0.0  0.0      0     0 ?        Z<   05:23   0:00 [python] <defunct>
Comment 7 Michal Privoznik 2013-01-17 16:28:30 EST
Dafna, I've created a scratch build for you:

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=5284793

I think it should fix your problem - can you please confirm that? And if it doesn't, please attach new full debug logs (there are some patches within the build that extend debugging in crucial parts). Thanks!
Comment 8 Dafna Ron 2013-01-20 04:45:54 EST
The vm is still killed when trying to restore,
but the qemu pid is now killed as well:

[root@gold-vdsc ~]# ps -elf |grep qemu
5 Z qemu     12853 10320  0  75  -5 -     0 exit   11:23 ?        00:00:00 [python] <defunct>
0 S root     23200  6074  0  80   0 - 25812 pipe_w 11:42 pts/0    00:00:00 grep qemu


will attach full logs
Comment 9 Dafna Ron 2013-01-20 04:46:46 EST
Created attachment 683504 [details]
logs
Comment 10 Michal Privoznik 2013-01-21 05:51:38 EST
Cool, so I think I know where the problem is now:

When using a compressed image, these are the steps libvirt takes when resuming:

1) Read the compressed image header (where the domain XML is stored, among other run-time info)
2) Start the decompression binary, which is fed the compressed image on stdin. Its stdout, where the decompressed data are expected to be written, is connected to the qemu process
3) Start the qemu process
4) If any of the above steps fails and the domain is transient, remove it from the internal list of domains

The problem was that if the decompression binary failed after we had started the qemu process, we removed the domain from the internal list, so 'virsh list' no longer reported it - but we did not kill the qemu process. The scratch build I created introduces logging of the decompression binary's stderr (among other features/fixes), and we can now see the result:

2013-01-20 09:33:16.257+0000: 10048: debug : qemuDomainSaveImageStartVM:5018 : Decompression binary stderr: lzop: <stdin>: warning: ignoring trailing garbage in lzop file

Which is weird, because the file the decompression binary is reading from was created by that very same binary. Or maybe something is appending data to the saved image (perhaps vdsm is storing some data there)?

The proper solution may be to ignore errors from the decompression binary. The qemu process will fail with "load of migration failed" anyway if there are unrecoverable problems.
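The trailing-garbage failure can be reproduced without libvirt at all. A sketch using gzip rather than lzop (gzip is assumed here only because it is more commonly installed; GNU gzip uses the same exit-status convention, 0 for success and 2 for warnings such as trailing garbage, matching the "unexpected exit status 2" in the traceback above):

```shell
# Simulate a saved image that picked up extra bytes past the end of the
# compressed stream (as writing to a raw partition would cause), then
# decompress it and inspect the exit status.
tmp=$(mktemp)
printf 'saved vm state' | gzip > "$tmp"
printf 'uninitialized-blocks' >> "$tmp"    # the simulated trailing garbage

status=0
gzip -dc "$tmp" > /dev/null 2>&1 || status=$?
echo "decompressor exit status: $status"   # 2 on GNU gzip: data fine, warning only
rm -f "$tmp"
```

The data decompresses correctly; only the warning pushes the exit status to 2, which is why treating any nonzero decompressor status as fatal killed otherwise-resumable VMs.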
Comment 17 Michal Privoznik 2013-01-23 04:49:20 EST
Patches proposed upstream:

https://www.redhat.com/archives/libvir-list/2013-January/msg01639.html
Comment 18 Michal Privoznik 2013-01-29 04:13:02 EST
Patch fixing this bug has been pushed upstream:

commit 93e5a1432d1304fafde4b2186cef63692f171c57
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Mon Jan 28 15:13:27 2013 +0100
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Tue Jan 29 09:51:47 2013 +0100

    qemu: Destroy domain on decompression binary error
    
    https://bugzilla.redhat.com/show_bug.cgi?id=894723
    
    Currently, if qemuProcessStart() succeeds, but it's decompression
    binary that returns nonzero status, we don't kill the qemu process,
    but remove it from internal domain list, leaving the qemu process
    hanging around totally uncontrolled.


I think the rest of the patches in the set are worth backporting as well, since they catch the decompression binary's stderr, so if we kill it we at least know what the error was. However, upstream is currently frozen as we are very close to the upstream release, scheduled for Wed Jan 23 (tomorrow), so I am not moving to POST yet.

And one more thing, Ayal/Dafna: the path you are passing to libvirt to save a domain to - is it really a path on a filesystem, or could it be a bare iSCSI partition? The difference is that on a partition the decompression binary cannot know where the compressed data end and where uninitialized blocks start. That would explain the rubbish appended at the end of the saved image.
Comment 20 Michal Privoznik 2013-02-05 10:47:26 EST
Okay, the patches have been pushed upstream now:


commit 137229bf4aec7647f3f04033ad390bcc872bd7e1
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Thu Jan 17 11:59:23 2013 +0100
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Tue Feb 5 15:45:21 2013 +0100

    qemu: Catch stderr of image compression binary
    
    If a compression binary prints something to stderr, currently
    it is discarded. However, it can contain useful data from
    debugging POV, so we should catch it.

commit cc6c425f94f4285261d4d12534f1944372f533a6
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Thu Jan 17 11:42:00 2013 +0100
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Tue Feb 5 15:45:21 2013 +0100

    qemu: Catch stderr of image decompression binary
    
    If a decompression binary prints something to stderr, currently
    it is discarded. However, it can contain useful data from
    debugging POV, so we should catch it.

commit 1f25194ad1e044d2fe192e871081dd570102a62a
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Thu Jan 17 11:09:39 2013 +0100
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Tue Feb 5 15:45:21 2013 +0100

    virFileWrapperFd: Switch to new virCommandDoAsyncIO
    
    Commit 34e8f63a32f83 introduced support for catching errors from
    libvirt iohelper. However, at those times there wasn't such fancy
    API as virCommandDoAsyncIO(), so everything has to be implemented
    on our own. But since we do have the API now, we can use it and
    drop our implementation then.

commit f0154959b3f8c3213c611883d04da1a5bac81df9
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Wed Jan 16 18:55:06 2013 +0100
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Tue Feb 5 15:45:21 2013 +0100

    tests: Create test for virCommandDoAsyncIO
    
    This is just a basic test, so we don't break virCommand in the
    future. A "Hello world\n" string is written to commanhelper,
    which copies input to stdout and stderr where we read it from.
    Then the read strings are compared with expected values.

commit 39c77fe586baccd0a4a9862e8cf7c78ac7af3494
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Wed Jan 16 11:58:00 2013 +0100
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Tue Feb 5 15:45:21 2013 +0100

    Introduce event loop to commandtest
    
    This is just preparing environment for the next patch, which is
    going to need an event loop.

commit 68fb755002da73db4dad1f2ec41bfa317855c206
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Wed Jan 16 11:33:17 2013 +0100
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Tue Feb 5 15:45:21 2013 +0100

    virCommand: Introduce virCommandDoAsyncIO
    
    Currently, if we want to feed stdin, or catch stdout or stderr of a
    virCommand we have to use virCommandRun(). When using virCommandRunAsync()
    we have to register FD handles by hand. This may lead to code duplication.
    Hence, introduce an internal API, which does this automatically within
    virCommandRunAsync(). The intended usage looks like this:
    
        virCommandPtr cmd = virCommandNew*(...);
        char *buf = NULL;
    
        ...
    
        virCommandSetOutputBuffer(cmd, &buf);
        virCommandDoAsyncIO(cmd);
    
        if (virCommandRunAsync(cmd, NULL) < 0)
            goto cleanup;
    
        ...
    
        if (virCommandWait(cmd, NULL) < 0)
            goto cleanup;
    
        /* @buf now contains @cmd's stdout */
        VIR_DEBUG("STDOUT: %s", NULLSTR(buf));
    
        ...
    
    cleanup:
        VIR_FREE(buf);
        virCommandFree(cmd);
    
    Note, that both stdout and stderr buffers may change until virCommandWait()
    returns.

v1.0.2-49-g137229b
Comment 21 Ayal Baron 2013-02-07 03:18:51 EST
(In reply to comment #18)
> Patch fixing this bug has been pushed upstream:
> 
> commit 93e5a1432d1304fafde4b2186cef63692f171c57
> Author:     Michal Privoznik <mprivozn@redhat.com>
> AuthorDate: Mon Jan 28 15:13:27 2013 +0100
> Commit:     Michal Privoznik <mprivozn@redhat.com>
> CommitDate: Tue Jan 29 09:51:47 2013 +0100
> 
>     qemu: Destroy domain on decompression binary error
>     
>     https://bugzilla.redhat.com/show_bug.cgi?id=894723
>     
>     Currently, if qemuProcessStart() succeeds, but it's decompression
>     binary that returns nonzero status, we don't kill the qemu process,
>     but remove it from internal domain list, leaving the qemu process
>     hanging around totally uncontrolled.
> 
> 
> Although, I think the rest of patches in the set is worth backporting as
> well, as they catch stderr of decompression binary so if we kill it, we know
> at least what was the error. However, upstream is currently frozen as we are
> very close to upstream release, which is scheduled on Wed Jan 23 (tomorrow).
> So I am not moving to POST yet.
> 
> And one more thing - Ayal/Dafna the path you are passing to libvirt to save
> a domain onto - is it really a path on a filesystem or could it be a bare
> iSCSI partition? The difference would be - on a partition, decompression
> binary can't know where is the end of compressed data and where do
> uninitialized blocks start. That would explain appending some rubbish at the
> end of saved image.

It can be either a file or an LV.
Comment 22 Michal Privoznik 2013-02-07 04:14:43 EST
Ayal et Dafna,

I don't think you should use an LV or a partition directly. Fortunately, you are using compression that is wise enough to spot garbage at the end of the compressed data. The problem is, when you use an LV directly, even for storing linear data, you cannot tell where the data end and where uninitialized blocks start. That's why you should create a filesystem on top of the LV. If you were not using compression, qemu would be quite surprised by the rubbish migration data when restoring a domain.
Comment 23 Ayal Baron 2013-02-08 02:12:07 EST
(In reply to comment #22)
> Ayal et Dafna,
> 
> I don't think you should use a LV or a partition directly. Fortunately, you
> are using a compression which is wise enough to see a garbage at the end of
> compressed data. Problem is, when you are using a LV directly - even for
> storing linear data - you cannot simply tell where does data end and where
> do uninitialized block start. That's why you should create a filesystem on
> the top of LV. If you were not using compression, qemu would be very
> impressed about rubbish migration data when restoring a domain.

It's not fortunate, it's deliberate, for exactly this reason. Managing a single-file filesystem is a big overhead (the error flows are ridiculous for this use case).

Compression solves this problem very neatly.
The only thing is that we need to make sure virt-qe covers this use case in their testing.
Preferably with some unit tests as well.
Comment 24 Michal Privoznik 2013-02-08 02:50:47 EST
I don't think so. The decompression binary exits with non-zero status when something goes wrong during decompression, e.g. when garbage appended to the end of the compressed data is found. From the binary's POV it's mangled data. And when libvirt detects that the binary hasn't exited cleanly, it kills the domain being resumed. We cannot let a domain continue with corrupted memory, right? And corrupted memory it is: there is no special exit code (at least for the lzop binary you are using) to distinguish:

a) garbage at EOF = hopefully safe to continue, provided the binary did not pass the garbage on to qemu

b) all other error states, e.g. a corrupted image = unsafe to continue.

And I don't think libvirt should parse the stderr of the decompression binary just to tell which case we are dealing with.

IOW, Dafna, you will still be unable to resume a domain if the decompression binary doesn't exit cleanly, but libvirt will no longer leave a leaked qemu process behind.
Comment 25 Ayal Baron 2013-02-08 05:06:57 EST
Dan, I may be missing something here.  Isn't this the way it has been working for years?
Comment 26 Dan Kenigsberg 2013-02-08 10:10:07 EST
(In reply to comment #25)
> Dan, I may be missing something here.  Isn't this the way it has been
> working for years?

I have not read through this bug, but I suspect that it was tickled by

  http://gerrit.ovirt.org/5928
  Move option 'save_image_format' to qemu.conf

as before it we were not using compression at all.
Comment 27 Ayal Baron 2013-02-09 04:50:49 EST
(In reply to comment #26)
> (In reply to comment #25)
> > Dan, I may be missing something here.  Isn't this the way it has been
> > working for years?
> 
> I have not read through this bug, but I suspect that it was tickled by
> 
>   http://gerrit.ovirt.org/5928
>   Move option 'save_image_format' to qemu.conf
> 
> as before it, we have not been using compression at all.

OK, so does qemu just know where to stop reading based on the amount of memory, or have we been having silent memory corruption?
According to Michal, we should not be using block devices, only files, when suspending to disk (i.e. create a FS on the LV for the single file we're going to save the state in).
I totally disagree with this requirement: compression works with block devices in other places and does not require a filesystem; that is the point of adding a size variable. It should just ignore the rest of the device, and if it cannot ignore it, then it should zero out the rest of the LV.
Comment 28 Michal Privoznik 2013-02-13 05:19:38 EST
Turns out there's a race, so we need another patch to prevent it:

commit 3178df9afa45cf9d0694536f7fcefd0384def488
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Fri Feb 8 15:17:44 2013 +0100
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Wed Feb 13 09:54:19 2013 +0100

    virCommand: Don't misuse the eventloop for async IO
    
    Currently, if a command wants to do asynchronous IO, a callback
    is registered in the libvirtd eventloop to handle writes and
    reads. However, there's a race in virCommandWait. The eventloop
    may already be executing the callback, while virCommandWait is
    mangling internal state of virCommand. To deal with it, we need
    to either introduce locking or spawn a separate thread where we
    poll() on stdio from child. The former, however, requires to
    unlock all mutexes held, as the event loop may execute other
    callbacks which tries to lock one of the mutexes, deadlock and
    thus never wake us up. So it's safer to spawn a separate thread.

v1.0.2-152-g3178df9

The patch has been pushed upstream, so no changes to the bug are really needed. I am adding this comment just for the record.
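The pattern the commit above describes can be sketched as follows. This is an illustrative Python model of the design, not libvirt code (`run_with_async_io` is a hypothetical name): instead of registering stdio callbacks in a shared event loop, which can race against the waiter mangling the command's state, a dedicated thread owns all of the child's I/O and the waiter simply joins that thread.

```python
import subprocess
import threading

def run_with_async_io(argv, input_data):
    # Sketch of the approach from commit 3178df9: one thread does all the
    # reads/writes on the child's stdio, touching state only it owns, so
    # no locking against an event loop is needed.
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    buf = {}

    def io_worker():
        # communicate() pumps stdin and drains stdout/stderr to completion.
        buf["out"], buf["err"] = proc.communicate(input_data)

    t = threading.Thread(target=io_worker)
    t.start()
    # ... the caller may do other work here, no shared mutable state ...
    t.join()  # the virCommandWait() analogue: just wait for the I/O thread
    return proc.returncode, buf["out"], buf["err"]
```

This mirrors the commandtest scenario from the earlier patch series: a "Hello world\n" string fed to a child that copies its input back out, with the buffers guaranteed stable only after the join/wait returns.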
Comment 29 Dan Kenigsberg 2013-02-17 16:42:34 EST
(In reply to comment #27)
> 
> Ok, then qemu just knows where to stop reading according to the amount of
> memory? or have we been having silent memory corruptions?

I doubt that we have such corruptions (we have hibernated into block devices since rhev-2.2; someone should have noticed already). But if we cannot restore a hibernated VM from a block volume, we must issue a quick fix.
Comment 30 Michal Privoznik 2013-02-19 12:21:11 EST
Okay, following my discussion with Ayal, we decided to solve this issue in libvirt. I've just posted a patch upstream:

https://www.redhat.com/archives/libvir-list/2013-February/msg01062.html
Comment 31 Jiri Denemark 2013-02-21 04:54:27 EST
Fixed upstream by v1.0.2-209-g0eeedf5:

commit 0eeedf52e7cb6a1794796856ac077d17a7d7def4
Author: Michal Privoznik <mprivozn@redhat.com>
Date:   Tue Feb 19 18:07:58 2013 +0100

    qemu: Run lzop with '--ignore-warn'
    
    Currently, if lzop decompression binary produces a warning, it
    doesn't exit with zero status but 2 instead. Terrifying, but
    true. However, warnings may be ignored using '--ignore-warn'
    command line argument.  Moreover, in which case, the exit status
    will be zero.
Comment 33 Michal Skrivanek 2013-03-11 06:53:34 EDT
*** Bug 910706 has been marked as a duplicate of this bug. ***
Comment 35 EricLee 2013-07-11 05:14:08 EDT
Verified pass with package: libvirt-0.10.2-19.el6

1. Prepare an iSCSI storage domain in RHEVM.
2. Run a VM on the iSCSI storage domain.
3. Suspend the VM; `ps aux | grep qemu` shows only one qemu process. After the suspend completes, no qemu process is found, and the VM's state in RHEVM is Suspended.
4. Rerun the VM; `ps aux | grep qemu` again shows only one qemu process.
5. Shut off the VM.
6. Rerun the VM.

All above steps passed.

So setting VERIFIED.
Comment 37 errata-xmlrpc 2013-11-21 03:37:29 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1581.html
