Description of problem:
VM freezes during the first minutes of a live merge.

Version-Release number of selected component (if applicable):
Hypervisor node: FC20
Engine: CentOS 6.6
oVirt 3.5.4
VDSM 4.16.20
qemu 2.1.3-8 (FC21)

How reproducible:
Happened once. Do not want to retry.

Steps to Reproduce:
1. Remove a snapshot (live merge) of a running VM.

Actual results:
Snapshot removal starts. The machine freezes for 3 minutes.

Expected results:
The machine should keep running.

Additional info:
Added some screenshots that give an idea of the CPU and disk I/O of qemu during the operation.

Engine log:
2015-09-11 09:51:25,942 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp--127.0.0.1-8702-8) Correlation ID: 574c98c0, Job ID: bb11cac3-8eab-49dc-bf76-743b279cdd52, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'vor_sps12' deletion for VM 'colvm02' was initiated by admin@internal.
2015-09-11 09:51:28,725 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVdsCommand] (DefaultQuartzScheduler_Worker-51) [fa7d485] FINISH, FullListVdsCommand, return: [{acpiEnable=true, emulatedMachine=pc-1.0, afterMigrationStatus=, tabletEnable=true, pid=6006, memGuaranteedSize=10240, transparentHugePages=true, displaySecurePort=-1, spiceSslCipherSuite=DEFAULT, cpuType=Nehalem, smp=2, smartcardEnable=false, custom={device_5fa98756-fb7e-4149-8f33-fe8c71070dd9=VmDevice {vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, deviceId=5fa98756-fb7e-4149-8f33-fe8c71070dd9, device=ide, type=CONTROLLER, bootOrder=0, specParams={}, address={bus=0x00, domain=0x0000, type=pci, slot=0x01, function=0x1}, managed=false, plugged=true, readOnly=false, deviceAlias=ide0, customProperties={}, snapshotId=null}, device_5fa98756-fb7e-4149-8f33-fe8c71070dd9device_6030cf01-6e6a-4c2f-87e1-f05ce7ccc646device_2b1f4bfe-8306-4331-9c43-894ef9021a71=VmDevice {vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, deviceId=2b1f4bfe-8306-4331-9c43-894ef9021a71, device=unix, type=CHANNEL, bootOrder=0, specParams={}, address={port=1, bus=0, controller=0, 
type=virtio-serial}, managed=false, plugged=true, readOnly=false, deviceAlias=channel0, customProperties={}, snapshotId=null}, device_5fa98756-fb7e-4149-8f33-fe8c71070dd9device_6030cf01-6e6a-4c2f-87e1-f05ce7ccc646device_2b1f4bfe-8306-4331-9c43-894ef9021a71device_0ccea91b-7755-4503-9f8f-03d59bcd877f=VmDevice {vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, deviceId=0ccea91b-7755-4503-9f8f-03d59bcd877f, device=unix, type=CHANNEL, bootOrder=0, specParams={}, address={port=2, bus=0, controller=0, type=virtio-serial}, managed=false, plugged=true, readOnly=false, deviceAlias=channel1, customProperties={}, snapshotId=null}, device_5fa98756-fb7e-4149-8f33-fe8c71070dd9device_6030cf01-6e6a-4c2f-87e1-f05ce7ccc646=VmDevice {vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, deviceId=6030cf01-6e6a-4c2f-87e1-f05ce7ccc646, device=virtio-serial, type=CONTROLLER, bootOrder=0, specParams={}, address={bus=0x00, domain=0x0000, type=pci, slot=0x04, function=0x0}, managed=false, plugged=true, readOnly=false, deviceAlias=virtio-serial0, customProperties={}, snapshotId=null}}, vmType=kvm, memSize=14336, smpCoresPerSocket=1, vmName=colvm02, nice=0, status=Up, vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, displayIp=192.168.11.46, displayPort=5910, guestDiskMapping={}, spiceSecureChannels=smain,sinputs,scursor,splayback,srecord,sdisplay,susbredir,ssmartcard, nicModel=rtl8139,pv, keyboardLayout=de, kvmEnable=true, pitReinjection=false, displayNetwork=ovirtmgmt, devices=[Ljava.lang.Object;@5864b1dd, timeOffset=0, maxVCpus=160, clientIp=, display=vnc}], log id: 12e29022 2015-09-11 09:51:59,948 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVdsCommand] (DefaultQuartzScheduler_Worker-27) [6e20583d] FINISH, FullListVdsCommand, return: [{acpiEnable=true, emulatedMachine=pc-1.0, afterMigrationStatus=, tabletEnable=true, pid=6006, memGuaranteedSize=10240, transparentHugePages=true, displaySecurePort=-1, spiceSslCipherSuite=DEFAULT, cpuType=Nehalem, smp=2, smartcardEnable=false, 
custom={device_5fa98756-fb7e-4149-8f33-fe8c71070dd9=VmDevice {vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, deviceId=5fa98756-fb7e-4149-8f33-fe8c71070dd9, device=ide, type=CONTROLLER, bootOrder=0, specParams={}, address={bus=0x00, domain=0x0000, type=pci, slot=0x01, function=0x1}, managed=false, plugged=true, readOnly=false, deviceAlias=ide0, customProperties={}, snapshotId=null}, device_5fa98756-fb7e-4149-8f33-fe8c71070dd9device_6030cf01-6e6a-4c2f-87e1-f05ce7ccc646device_2b1f4bfe-8306-4331-9c43-894ef9021a71=VmDevice {vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, deviceId=2b1f4bfe-8306-4331-9c43-894ef9021a71, device=unix, type=CHANNEL, bootOrder=0, specParams={}, address={port=1, bus=0, controller=0, type=virtio-serial}, managed=false, plugged=true, readOnly=false, deviceAlias=channel0, customProperties={}, snapshotId=null}, device_5fa98756-fb7e-4149-8f33-fe8c71070dd9device_6030cf01-6e6a-4c2f-87e1-f05ce7ccc646device_2b1f4bfe-8306-4331-9c43-894ef9021a71device_0ccea91b-7755-4503-9f8f-03d59bcd877f=VmDevice {vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, deviceId=0ccea91b-7755-4503-9f8f-03d59bcd877f, device=unix, type=CHANNEL, bootOrder=0, specParams={}, address={port=2, bus=0, controller=0, type=virtio-serial}, managed=false, plugged=true, readOnly=false, deviceAlias=channel1, customProperties={}, snapshotId=null}, device_5fa98756-fb7e-4149-8f33-fe8c71070dd9device_6030cf01-6e6a-4c2f-87e1-f05ce7ccc646=VmDevice {vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, deviceId=6030cf01-6e6a-4c2f-87e1-f05ce7ccc646, device=virtio-serial, type=CONTROLLER, bootOrder=0, specParams={}, address={bus=0x00, domain=0x0000, type=pci, slot=0x04, function=0x0}, managed=false, plugged=true, readOnly=false, deviceAlias=virtio-serial0, customProperties={}, snapshotId=null}}, vmType=kvm, memSize=14336, smpCoresPerSocket=1, vmName=colvm02, nice=0, status=Up, vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, displayIp=192.168.11.46, displayPort=5910, guestDiskMapping={}, 
spiceSecureChannels=smain,sinputs,scursor,splayback,srecord,sdisplay,susbredir,ssmartcard, nicModel=rtl8139,pv, keyboardLayout=de, kvmEnable=true, pitReinjection=false, displayNetwork=ovirtmgmt, devices=[Ljava.lang.Object;@19cecac, timeOffset=0, maxVCpus=160, clientIp=, display=vnc}], log id: 29c9d47b
2015-09-11 09:52:46,136 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-29) VM colvm02 223ca7c2-c510-48fa-b29b-d09d9a64f3a8 moved from Up --> NotResponding
Created attachment 1072503 [details] cpu
Created attachment 1072504 [details] disk read
Created attachment 1072505 [details] disk write
Created attachment 1072508 [details] vdsm log
The VM has ~300 GB of disk data and 14 GB of RAM.
Created attachment 1072519 [details] engine screenshot
Reproduced the problem with a test VM:

1) Install Linux (e.g. Ubuntu 14.04) on a single 200 GB virtio disk.
2) Stop the VM.
3) Create a snapshot.
4) Start the VM.
5) Fill the disk with random data (/dev/urandom).
6) Wait until the snapshot qcow2 file reaches ~150 GB.
7) Create a simple CPU load/disk load script:

#!/bin/bash
while true; do
    echo `date`
    echo `date` >> date.txt
    dd if=/dev/zero of=/dev/null bs=1M count=200
    sleep 0.3
done

8) Run the above script.
9) Remove the snapshot in oVirt.
10) Wait 30 seconds and watch the VM stall.

What is even more interesting, and an error in my opinion: oVirt 3.5.4 reports the snapshot deletion as successful although qemu is still merging the backing chain.
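One way to quantify the stall from inside the guest (my addition, not part of the original repro): the load script above appends a `date` line to date.txt roughly every 0.3 seconds, so the longest gap between consecutive timestamps measures the freeze. A small parser sketch, assuming the default `date` output format:

```python
from datetime import datetime, timedelta

# Assumed format of `date` output in the guest, e.g. "Fri Sep 11 09:52:46 UTC 2015".
DATE_FMT = "%a %b %d %H:%M:%S %Z %Y"

def longest_gap(lines, fmt=DATE_FMT):
    """Return the largest gap between consecutive timestamps in date.txt."""
    times = [datetime.strptime(line.strip(), fmt) for line in lines if line.strip()]
    gaps = (b - a for a, b in zip(times, times[1:]))
    return max(gaps, default=timedelta(0))

# Example: a ~3 minute freeze shows up as one large gap between lines.
sample = [
    "Fri Sep 11 09:52:45 UTC 2015",
    "Fri Sep 11 09:52:46 UTC 2015",
    "Fri Sep 11 09:55:46 UTC 2015",  # VM was frozen in between
]
print(longest_gap(sample))  # → 0:03:00
```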
Created attachment 1072741 [details] test stall
Created attachment 1072742 [details] test merging io
Created attachment 1072743 [details] test engine log
Tentatively targeting, until we understand what's going on here. Adam, can you take a look please?
Had the chance to catch some gdb backtraces during the hang situation.

backtrace 1:

#0  0x00007f103027fe7d in lseek64 () from /lib64/libpthread.so.0
#1  0x00007f1031a0e3de in try_seek_hole (bs=0x7f1033fe0dc0, pnum=0x7f103a8f6274, hole=<synthetic pointer>, data=<synthetic pointer>, start=35239559168) at block/raw-posix.c:1458
#2  raw_co_get_block_status (bs=0x7f1033fe0dc0, sector_num=<optimized out>, nb_sectors=128, pnum=0x7f103a8f6274) at block/raw-posix.c:1512
#3  0x00007f10319d5070 in bdrv_co_get_block_status (bs=0x7f1033fe0dc0, sector_num=sector_num@entry=68827264, nb_sectors=128, pnum=pnum@entry=0x7f103a8f6274) at block.c:3968
#4  0x00007f10319d50c1 in bdrv_co_get_block_status (bs=0x7f1033fde240, sector_num=74874112, nb_sectors=<optimized out>, pnum=0x7f103a8f6274) at block.c:3999
#5  0x00007f10319d5217 in bdrv_get_block_status_co_entry (opaque=opaque@entry=0x7f103a8f6200) at block.c:4018
#6  0x00007f10319d52d8 in bdrv_get_block_status (bs=bs@entry=0x7f1033fde240, sector_num=sector_num@entry=74874112, nb_sectors=nb_sectors@entry=128, pnum=pnum@entry=0x7f103a8f6274) at block.c:4042
#7  0x00007f10319d5ae4 in bdrv_is_allocated (pnum=0x7f103a8f6274, nb_sectors=128, sector_num=74874112, bs=0x7f1033fde240) at block.c:4058
#8  bdrv_is_allocated_above (top=top@entry=0x7f1033fde240, base=base@entry=0x7f1034145da0, sector_num=sector_num@entry=74874112, nb_sectors=nb_sectors@entry=128, pnum=pnum@entry=0x7f103a8f6340) at block.c:4088
#9  0x00007f10318b9abd in mirror_run (opaque=0x7f1038042a40) at block/mirror.c:379
#10 0x00007f10319e5a0a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at coroutine-ucontext.c:118
#11 0x00007f102693a5a0 in ?? () from /lib64/libc.so.6
#12 0x00007fff6cf9b540 in ?? ()
#13 0x0000000000000000 in ?? ()

backtrace 2 (a few seconds later):

#0  0x00007f103027fe7d in lseek64 () from /lib64/libpthread.so.0
#1  0x00007f1031a0e3de in try_seek_hole (bs=0x7f1033fe0dc0, pnum=0x7f103a8f6274, hole=<synthetic pointer>, data=<synthetic pointer>, start=117926068224) at block/raw-posix.c:1458
#2  raw_co_get_block_status (bs=0x7f1033fe0dc0, sector_num=<optimized out>, nb_sectors=128, pnum=0x7f103a8f6274) at block/raw-posix.c:1512
#3  0x00007f10319d5070 in bdrv_co_get_block_status (bs=0x7f1033fe0dc0, sector_num=sector_num@entry=230324352, nb_sectors=128, pnum=pnum@entry=0x7f103a8f6274) at block.c:3968
#4  0x00007f10319d50c1 in bdrv_co_get_block_status (bs=0x7f1033fde240, sector_num=248575360, nb_sectors=<optimized out>, pnum=0x7f103a8f6274) at block.c:3999
#5  0x00007f10319d5217 in bdrv_get_block_status_co_entry (opaque=opaque@entry=0x7f103a8f6200) at block.c:4018
#6  0x00007f10319d52d8 in bdrv_get_block_status (bs=bs@entry=0x7f1033fde240, sector_num=sector_num@entry=248575360, nb_sectors=nb_sectors@entry=128, pnum=pnum@entry=0x7f103a8f6274) at block.c:4042
#7  0x00007f10319d5ae4 in bdrv_is_allocated (pnum=0x7f103a8f6274, nb_sectors=128, sector_num=248575360, bs=0x7f1033fde240) at block.c:4058
#8  bdrv_is_allocated_above (top=top@entry=0x7f1033fde240, base=base@entry=0x7f1034145da0, sector_num=sector_num@entry=248575360, nb_sectors=nb_sectors@entry=128, pnum=pnum@entry=0x7f103a8f6340) at block.c:4088
#9  0x00007f10318b9abd in mirror_run (opaque=0x7f1038042a40) at block/mirror.c:379
#10 0x00007f10319e5a0a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at coroutine-ucontext.c:118
#11 0x00007f102693a5a0 in ?? () from /lib64/libc.so.6
#12 0x00007fff6cf9b540 in ?? ()
#13 0x0000000000000000 in ?? ()
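My reading of the backtraces, hedged: frame #9 is the mirror block job scanning the image for allocated blocks 128 sectors (64 KiB) at a time, and each step bottoms out in an lseek(SEEK_HOLE) probe (frames #0-#2). A back-of-the-envelope sketch of why a ~150 GB image stalls, plus a demonstration of the SEEK_DATA/SEEK_HOLE primitive that try_seek_hole() relies on (Linux-specific; exact hole offsets are filesystem-dependent):

```python
import os
import tempfile

# Chunk size visible in the backtraces: nb_sectors=128 -> 128 * 512 B = 64 KiB.
SECTOR = 512
CHUNK = 128 * SECTOR

# A ~150 GB snapshot (as in the repro) means millions of sequential probes,
# all serialized inside one coroutine while the guest appears frozen.
disk_bytes = 150 * 1024**3
probes = disk_bytes // CHUNK
print(probes)  # → 2457600

# The primitive each probe uses: ask the kernel where data and holes start.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"x" * 4096)       # 4 KiB of real data at the start
    f.flush()
    f.truncate(1024 * 1024)    # extend to 1 MiB, leaving a hole at the end
    fd = f.fileno()
    data_start = os.lseek(fd, 0, os.SEEK_DATA)  # first byte backed by data
    hole_start = os.lseek(fd, 0, os.SEEK_HOLE)  # first byte inside a hole
    print(data_start, hole_start)
```

On filesystems without SEEK_HOLE support the kernel falls back to reporting the whole file as data, which is why the hole offset is only approximately the end of the written region.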
Maybe fixed by http://git.qemu.org/?p=qemu.git;a=commit;h=4c0cbd6fec7db182a6deb52d5a8a8e7b0c5cbe64
I backported the above commit to qemu 2.1.3. Live merge no longer stalls a machine with large disks during bitmap initialization. Maybe worth a backport for Fedora & RHEV?
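Per the comment above, the stall occurs during the mirror job's bitmap initialization, so the general shape of the fix is to yield control periodically during the long scan instead of running it to completion in one coroutine slice. A toy illustration of that pattern (not QEMU's actual code; SLICE is an arbitrary value chosen for the example):

```python
CHUNK = 128 * 512   # 64 KiB per allocation check, matching the backtraces
SLICE = 16384       # checks performed per scheduling slice -- illustrative only

def scan_bitmap(disk_bytes):
    """Scan the image chunk by chunk, yielding control between slices."""
    chunks = (disk_bytes + CHUNK - 1) // CHUNK
    scanned = 0
    while scanned < chunks:
        scanned += min(SLICE, chunks - scanned)
        # In QEMU this point would be a coroutine yield/sleep, letting vCPUs
        # and the monitor make progress instead of freezing for minutes.
        yield scanned

slices = sum(1 for _ in scan_bitmap(150 * 1024**3))
print(slices)  # → 150
```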
Regarding the second issue: oVirt does not show the live merge in the Tasks tab. It ONLY reports the completion in the log. Is this intended?
Created BZ1262901 to track the backport for Fedora 21/22. Remaining problem: oVirt does not show the live merge in the Tasks tab.
Markus, thanks for your help in solving this bug! Please don't re-purpose a bug in this way. If the core problem was solved by this bug (which it appears to have been) then we should allow this to be closed as fixed. The issue about reporting live merge as a task is more of a feature request with a different priority than an issue with hanging VMs. Please report that as an RFE if you think it's still relevant.
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
Already solved in Fedora; waiting for the RHEL 7.2 release, which will solve the issue on RHEL as well.
Consuming the qemu fix is being tracked by bug 1264673 (Fedora) and bug 1271173 (RHEL). This BZ is for tracking the display of the task. Greg - can you take a look please? Haven't you already tackled something like this?
(In reply to Allon Mureinik from comment #20) > This BZ is for tracking the display of the task. > Greg - can you take a look please? Haven't you already tackled something > like this? I'm currently looking into making the task display more detailed (an item for each merged disk) in bug 1275655. As for it not displaying, this appears to be fixed (at least in master): the Tasks tab is reporting the task properly and the Events tab shows the initiation and completion of the snapshot removal operation. I'll verify on 3.6 when I get a chance and will close this bug if it's working properly.
Verified vdsm-4.17.15-0.el7ev.noarch rhevm-3.6.2-0.1.el6.noarch
Created attachment 1111911 [details] verification screenshot