Bug 1262251 - live merge: not displayed in oVirt task tab
Summary: live merge: not displayed in oVirt task tab
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 3.5.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-3.6.2
Target Release: 3.6.2
Assignee: Greg Padgett
QA Contact: Aharon Canan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-09-11 09:35 UTC by Markus Stockhausen
Modified: 2016-03-10 15:22 UTC (History)
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-18 11:03:32 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-3.6.z+
ylavi: planning_ack+
rule-engine: devel_ack+
rule-engine: testing_ack+


Attachments
cpu (53.93 KB, image/png)
2015-09-11 09:36 UTC, Markus Stockhausen
disk read (31.95 KB, image/png)
2015-09-11 09:37 UTC, Markus Stockhausen
disk write (30.68 KB, image/png)
2015-09-11 09:37 UTC, Markus Stockhausen
vdsm log (104.68 KB, application/zip)
2015-09-11 09:59 UTC, Markus Stockhausen
engine screenshot (190.39 KB, image/png)
2015-09-11 10:44 UTC, Markus Stockhausen
test stall (149.02 KB, image/png)
2015-09-12 20:21 UTC, Markus Stockhausen
test merging io (25.09 KB, image/png)
2015-09-12 20:22 UTC, Markus Stockhausen
test engine log (64.41 KB, image/png)
2015-09-12 20:23 UTC, Markus Stockhausen
verification screenshot (152.58 KB, image/png)
2016-01-05 16:42 UTC, Aharon Canan

Description Markus Stockhausen 2015-09-11 09:35:50 UTC
Description of problem:

VM freezes during first minutes of live merge.


Version-Release number of selected component (if applicable):

Hypervisor Node FC20
Engine Centos 6.6
oVirt 3.5.4
VDSM 4.16.20
qemu 2.1.3-8 (FC21)

How reproducible:

Happened once; do not want to retry.

Steps to Reproduce:

Remove snapshot (live merge) of running VM

Actual results:

Snapshot removal starts. The machine freezes for 3 minutes.

Expected results:

The machine should keep running.

Added some screenshots that give an idea of qemu's CPU & disk I/O during the operation.

Engine Log:

2015-09-11 09:51:25,942 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp--127.0.0.1-8702-8) Correlation ID: 574c98c0, Job ID: bb11cac3-8eab-49dc-bf76-743b279cdd52, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'vor_sps12' deletion for VM 'colvm02' was initiated by admin@internal.
2015-09-11 09:51:28,725 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVdsCommand] (DefaultQuartzScheduler_Worker-51) [fa7d485] FINISH, FullListVdsCommand, return: [{acpiEnable=true, emulatedMachine=pc-1.0, afterMigrationStatus=, tabletEnable=true, pid=6006, memGuaranteedSize=10240, transparentHugePages=true, displaySecurePort=-1, spiceSslCipherSuite=DEFAULT, cpuType=Nehalem, smp=2, smartcardEnable=false, custom={device_5fa98756-fb7e-4149-8f33-fe8c71070dd9=VmDevice {vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, deviceId=5fa98756-fb7e-4149-8f33-fe8c71070dd9, device=ide, type=CONTROLLER, bootOrder=0, specParams={}, address={bus=0x00, domain=0x0000, type=pci, slot=0x01, function=0x1}, managed=false, plugged=true, readOnly=false, deviceAlias=ide0, customProperties={}, snapshotId=null}, device_5fa98756-fb7e-4149-8f33-fe8c71070dd9device_6030cf01-6e6a-4c2f-87e1-f05ce7ccc646device_2b1f4bfe-8306-4331-9c43-894ef9021a71=VmDevice {vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, deviceId=2b1f4bfe-8306-4331-9c43-894ef9021a71, device=unix, type=CHANNEL, bootOrder=0, specParams={}, address={port=1, bus=0, controller=0, type=virtio-serial}, managed=false, plugged=true, readOnly=false, deviceAlias=channel0, customProperties={}, snapshotId=null}, device_5fa98756-fb7e-4149-8f33-fe8c71070dd9device_6030cf01-6e6a-4c2f-87e1-f05ce7ccc646device_2b1f4bfe-8306-4331-9c43-894ef9021a71device_0ccea91b-7755-4503-9f8f-03d59bcd877f=VmDevice {vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, deviceId=0ccea91b-7755-4503-9f8f-03d59bcd877f, device=unix, type=CHANNEL, bootOrder=0, specParams={}, address={port=2, bus=0, controller=0, type=virtio-serial}, managed=false, plugged=true, readOnly=false, deviceAlias=channel1, customProperties={}, snapshotId=null}, device_5fa98756-fb7e-4149-8f33-fe8c71070dd9device_6030cf01-6e6a-4c2f-87e1-f05ce7ccc646=VmDevice {vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, deviceId=6030cf01-6e6a-4c2f-87e1-f05ce7ccc646, device=virtio-serial, type=CONTROLLER, bootOrder=0, specParams={}, address={bus=0x00, domain=0x0000, type=pci, slot=0x04, function=0x0}, managed=false, plugged=true, readOnly=false, deviceAlias=virtio-serial0, customProperties={}, snapshotId=null}}, vmType=kvm, memSize=14336, smpCoresPerSocket=1, vmName=colvm02, nice=0, status=Up, vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, displayIp=192.168.11.46, displayPort=5910, guestDiskMapping={}, spiceSecureChannels=smain,sinputs,scursor,splayback,srecord,sdisplay,susbredir,ssmartcard, nicModel=rtl8139,pv, keyboardLayout=de, kvmEnable=true, pitReinjection=false, displayNetwork=ovirtmgmt, devices=[Ljava.lang.Object;@5864b1dd, timeOffset=0, maxVCpus=160, clientIp=, display=vnc}], log id: 12e29022
2015-09-11 09:51:59,948 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVdsCommand] (DefaultQuartzScheduler_Worker-27) [6e20583d] FINISH, FullListVdsCommand, return: [{acpiEnable=true, emulatedMachine=pc-1.0, afterMigrationStatus=, tabletEnable=true, pid=6006, memGuaranteedSize=10240, transparentHugePages=true, displaySecurePort=-1, spiceSslCipherSuite=DEFAULT, cpuType=Nehalem, smp=2, smartcardEnable=false, custom={device_5fa98756-fb7e-4149-8f33-fe8c71070dd9=VmDevice {vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, deviceId=5fa98756-fb7e-4149-8f33-fe8c71070dd9, device=ide, type=CONTROLLER, bootOrder=0, specParams={}, address={bus=0x00, domain=0x0000, type=pci, slot=0x01, function=0x1}, managed=false, plugged=true, readOnly=false, deviceAlias=ide0, customProperties={}, snapshotId=null}, device_5fa98756-fb7e-4149-8f33-fe8c71070dd9device_6030cf01-6e6a-4c2f-87e1-f05ce7ccc646device_2b1f4bfe-8306-4331-9c43-894ef9021a71=VmDevice {vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, deviceId=2b1f4bfe-8306-4331-9c43-894ef9021a71, device=unix, type=CHANNEL, bootOrder=0, specParams={}, address={port=1, bus=0, controller=0, type=virtio-serial}, managed=false, plugged=true, readOnly=false, deviceAlias=channel0, customProperties={}, snapshotId=null}, device_5fa98756-fb7e-4149-8f33-fe8c71070dd9device_6030cf01-6e6a-4c2f-87e1-f05ce7ccc646device_2b1f4bfe-8306-4331-9c43-894ef9021a71device_0ccea91b-7755-4503-9f8f-03d59bcd877f=VmDevice {vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, deviceId=0ccea91b-7755-4503-9f8f-03d59bcd877f, device=unix, type=CHANNEL, bootOrder=0, specParams={}, address={port=2, bus=0, controller=0, type=virtio-serial}, managed=false, plugged=true, readOnly=false, deviceAlias=channel1, customProperties={}, snapshotId=null}, device_5fa98756-fb7e-4149-8f33-fe8c71070dd9device_6030cf01-6e6a-4c2f-87e1-f05ce7ccc646=VmDevice {vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, deviceId=6030cf01-6e6a-4c2f-87e1-f05ce7ccc646, device=virtio-serial, type=CONTROLLER, bootOrder=0, specParams={}, address={bus=0x00, domain=0x0000, type=pci, slot=0x04, function=0x0}, managed=false, plugged=true, readOnly=false, deviceAlias=virtio-serial0, customProperties={}, snapshotId=null}}, vmType=kvm, memSize=14336, smpCoresPerSocket=1, vmName=colvm02, nice=0, status=Up, vmId=223ca7c2-c510-48fa-b29b-d09d9a64f3a8, displayIp=192.168.11.46, displayPort=5910, guestDiskMapping={}, spiceSecureChannels=smain,sinputs,scursor,splayback,srecord,sdisplay,susbredir,ssmartcard, nicModel=rtl8139,pv, keyboardLayout=de, kvmEnable=true, pitReinjection=false, displayNetwork=ovirtmgmt, devices=[Ljava.lang.Object;@19cecac, timeOffset=0, maxVCpus=160, clientIp=, display=vnc}], log id: 29c9d47b
2015-09-11 09:52:46,136 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-29) VM colvm02 223ca7c2-c510-48fa-b29b-d09d9a64f3a8 moved from Up --> NotResponding
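
For reference, a transition like the last log line can be watched for live on the engine host; a minimal sketch, assuming the default oVirt engine log location:

#!/bin/bash
# On the engine host: follow the engine log and flag VMs moving to NotResponding.
tail -F /var/log/ovirt-engine/engine.log | grep --line-buffered 'moved from Up --> NotResponding'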

Comment 1 Markus Stockhausen 2015-09-11 09:36:44 UTC
Created attachment 1072503 [details]
cpu

Comment 2 Markus Stockhausen 2015-09-11 09:37:14 UTC
Created attachment 1072504 [details]
disk read

Comment 3 Markus Stockhausen 2015-09-11 09:37:36 UTC
Created attachment 1072505 [details]
disk write

Comment 4 Markus Stockhausen 2015-09-11 09:59:03 UTC
Created attachment 1072508 [details]
vdsm log

Comment 5 Markus Stockhausen 2015-09-11 10:04:09 UTC
The VM has ~300 GB of disk data and 14 GB of RAM.

Comment 6 Markus Stockhausen 2015-09-11 10:44:55 UTC
Created attachment 1072519 [details]
engine screenshot

Comment 7 Markus Stockhausen 2015-09-12 20:12:38 UTC
Reproduced the problem with a test VM.

1) install Linux (e.g. Ubuntu 14.04) on a single 200GB virtio disk
2) stop VM
3) create snapshot
4) start VM
5) fill the disk with random data from /dev/urandom (see the sketch after this list)
6) wait until the snapshot qcow2 file reaches ~150 GB
7) create a simple CPU/disk load script

#!/bin/bash
# Generate steady CPU and read load: print and log a timestamp,
# then read 200 MB from /dev/zero on each iteration.
while true; do
  echo `date`
  echo `date` >> date.txt
  dd if=/dev/zero of=/dev/null bs=1M count=200   # GNU dd needs uppercase M
  sleep 0.3
done

8) run the above script
9) remove the snapshot in ovirt
10) wait 30 seconds and see how the VM stalls
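
A minimal sketch for steps 5 and 6 (the fill runs inside the guest, the size check on the hypervisor; the volume path is a hypothetical placeholder that depends on the storage domain layout):

#!/bin/bash
# Inside the guest: write incompressible data so the snapshot layer grows.
dd if=/dev/urandom of=/fill.bin bs=1M count=153600   # ~150 GB

# On the hypervisor: watch the allocated size of the qcow2 snapshot volume.
# SNAP_IMG is a placeholder; substitute the real volume path.
SNAP_IMG=/rhev/data-center/mnt/.../images/DISK_ID/VOLUME_ID
watch -n 10 "du -h $SNAP_IMG"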

What is even more interesting, and in my opinion an error: oVirt 3.5.4 reports the snapshot deletion as successful although qemu is still merging the backing chain.
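
Even after the engine reports success, the still-running merge can be confirmed on the hypervisor; a minimal sketch using libvirt and qemu-img (the domain name, disk target, and image path are assumptions):

#!/bin/bash
# On the hypervisor: check whether a block job is still running for the disk.
# 'colvm02' and 'vda' are assumptions; substitute the real domain and target.
virsh blockjob colvm02 vda --info

# Inspect the backing chain; more than one image listed means the
# merge has not yet collapsed the chain.
qemu-img info --backing-chain /path/to/active-layer.qcow2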

Comment 8 Markus Stockhausen 2015-09-12 20:21:52 UTC
Created attachment 1072741 [details]
test stall

Comment 9 Markus Stockhausen 2015-09-12 20:22:31 UTC
Created attachment 1072742 [details]
test merging io

Comment 10 Markus Stockhausen 2015-09-12 20:23:07 UTC
Created attachment 1072743 [details]
test engine log

Comment 11 Allon Mureinik 2015-09-13 12:38:18 UTC
Tentatively targeting, until we understand what's going on here.

Adam, can you take a look please?

Comment 12 Markus Stockhausen 2015-09-13 15:13:48 UTC
Had the chance to catch some gdb backtraces during the hang situation.

backtrace 1:
#0  0x00007f103027fe7d in lseek64 () from /lib64/libpthread.so.0
#1  0x00007f1031a0e3de in try_seek_hole (bs=0x7f1033fe0dc0, pnum=0x7f103a8f6274, hole=<synthetic pointer>, data=<synthetic pointer>,
    start=35239559168) at block/raw-posix.c:1458
#2  raw_co_get_block_status (bs=0x7f1033fe0dc0, sector_num=<optimized out>, nb_sectors=128, pnum=0x7f103a8f6274) at block/raw-posix.c:1512
#3  0x00007f10319d5070 in bdrv_co_get_block_status (bs=0x7f1033fe0dc0, sector_num=sector_num@entry=68827264, nb_sectors=128,
    pnum=pnum@entry=0x7f103a8f6274) at block.c:3968
#4  0x00007f10319d50c1 in bdrv_co_get_block_status (bs=0x7f1033fde240, sector_num=74874112, nb_sectors=<optimized out>, pnum=0x7f103a8f6274)
    at block.c:3999
#5  0x00007f10319d5217 in bdrv_get_block_status_co_entry (opaque=opaque@entry=0x7f103a8f6200) at block.c:4018
#6  0x00007f10319d52d8 in bdrv_get_block_status (bs=bs@entry=0x7f1033fde240, sector_num=sector_num@entry=74874112, nb_sectors=nb_sectors@entry=128,
    pnum=pnum@entry=0x7f103a8f6274) at block.c:4042
#7  0x00007f10319d5ae4 in bdrv_is_allocated (pnum=0x7f103a8f6274, nb_sectors=128, sector_num=74874112, bs=0x7f1033fde240) at block.c:4058
#8  bdrv_is_allocated_above (top=top@entry=0x7f1033fde240, base=base@entry=0x7f1034145da0, sector_num=sector_num@entry=74874112,
    nb_sectors=nb_sectors@entry=128, pnum=pnum@entry=0x7f103a8f6340) at block.c:4088
#9  0x00007f10318b9abd in mirror_run (opaque=0x7f1038042a40) at block/mirror.c:379
#10 0x00007f10319e5a0a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at coroutine-ucontext.c:118
#11 0x00007f102693a5a0 in ?? () from /lib64/libc.so.6
#12 0x00007fff6cf9b540 in ?? ()
#13 0x0000000000000000 in ?? ()

backtrace 2 (a few seconds later):

#0  0x00007f103027fe7d in lseek64 () from /lib64/libpthread.so.0
#1  0x00007f1031a0e3de in try_seek_hole (bs=0x7f1033fe0dc0, pnum=0x7f103a8f6274, hole=<synthetic pointer>, data=<synthetic pointer>,
    start=117926068224) at block/raw-posix.c:1458
#2  raw_co_get_block_status (bs=0x7f1033fe0dc0, sector_num=<optimized out>, nb_sectors=128, pnum=0x7f103a8f6274) at block/raw-posix.c:1512
#3  0x00007f10319d5070 in bdrv_co_get_block_status (bs=0x7f1033fe0dc0, sector_num=sector_num@entry=230324352, nb_sectors=128,
    pnum=pnum@entry=0x7f103a8f6274) at block.c:3968
#4  0x00007f10319d50c1 in bdrv_co_get_block_status (bs=0x7f1033fde240, sector_num=248575360, nb_sectors=<optimized out>, pnum=0x7f103a8f6274)
    at block.c:3999
#5  0x00007f10319d5217 in bdrv_get_block_status_co_entry (opaque=opaque@entry=0x7f103a8f6200) at block.c:4018
#6  0x00007f10319d52d8 in bdrv_get_block_status (bs=bs@entry=0x7f1033fde240, sector_num=sector_num@entry=248575360, nb_sectors=nb_sectors@entry=128,
    pnum=pnum@entry=0x7f103a8f6274) at block.c:4042
#7  0x00007f10319d5ae4 in bdrv_is_allocated (pnum=0x7f103a8f6274, nb_sectors=128, sector_num=248575360, bs=0x7f1033fde240) at block.c:4058
#8  bdrv_is_allocated_above (top=top@entry=0x7f1033fde240, base=base@entry=0x7f1034145da0, sector_num=sector_num@entry=248575360,
    nb_sectors=nb_sectors@entry=128, pnum=pnum@entry=0x7f103a8f6340) at block.c:4088
#9  0x00007f10318b9abd in mirror_run (opaque=0x7f1038042a40) at block/mirror.c:379
#10 0x00007f10319e5a0a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at coroutine-ucontext.c:118
#11 0x00007f102693a5a0 in ?? () from /lib64/libc.so.6
#12 0x00007fff6cf9b540 in ?? ()
#13 0x0000000000000000 in ?? ()
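
For reference, backtraces like the two above can be captured non-interactively from the running qemu process; a minimal sketch (resolved symbols require the distro's qemu debuginfo packages):

#!/bin/bash
# Attach gdb briefly to the qemu process and dump backtraces for all threads.
# Assumes a single qemu-system-x86_64 process on the host.
pid=$(pidof qemu-system-x86_64)
gdb -p "$pid" -batch -ex 'thread apply all bt'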

Comment 14 Markus Stockhausen 2015-09-13 16:40:09 UTC
I backported the above commit to qemu 2.1.3. Live merge no longer stalls a machine with large disks during bitmap initialization.

Maybe worth a backport for Fedora & RHEV?

Comment 15 Markus Stockhausen 2015-09-13 17:02:55 UTC
Regarding the second issue:

oVirt does not show the live merge in the Tasks tab; it ONLY reports the completion in the log. Is this really intended?

Comment 16 Markus Stockhausen 2015-09-14 15:04:59 UTC
Created BZ1262901 to track the backport for Fedora 21/22.

Remaining problem: oVirt does not show the live merge in the Tasks tab.

Comment 17 Adam Litke 2015-10-12 19:48:21 UTC
Markus, thanks for your help in solving this bug!

Please don't re-purpose a bug in this way.  If the core problem was solved by this bug (which it appears to have been) then we should allow this to be closed as fixed.  

The issue about reporting live merge as a task is more of a feature request with a different priority than an issue with hanging VMs.  Please report that as an RFE if you think it's still relevant.

Comment 18 Red Hat Bugzilla Rules Engine 2015-10-19 10:56:42 UTC
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED status, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 19 Tal Nisan 2015-10-26 12:40:31 UTC
Already solved in Fedora; waiting for the RHEL 7.2 release, which will solve the issue on RHEL as well.

Comment 20 Allon Mureinik 2015-11-16 13:26:44 UTC
Consuming the qemu fix is being tracked by bug 1264673 (fedora) and bug 1271173 (RHEL).

This BZ is for tracking the display of the task.
Greg - can you take a look please? Haven't you already tackled something like this?

Comment 21 Greg Padgett 2015-11-17 21:54:23 UTC
(In reply to Allon Mureinik from comment #20)
> This BZ is for tracking the display of the task.
> Greg - can you take a look please? Haven't you already tackled something
> like this?

I'm currently looking into making the task display more detailed (an item for each merged disk) in bug 1275655.

As for it not displaying, this appears to be fixed (at least in master): the Tasks tab is reporting the task properly and the Events tab shows the initiation and completion of the snapshot removal operation.

I'll verify on 3.6 when I get a chance and will close this bug if it's working properly.

Comment 25 Aharon Canan 2016-01-05 16:41:49 UTC
Verified 

vdsm-4.17.15-0.el7ev.noarch
rhevm-3.6.2-0.1.el6.noarch

Comment 26 Aharon Canan 2016-01-05 16:42:19 UTC
Created attachment 1111911 [details]
verification screenshot

