Bug 1326900 - OVirt remove single disk snapshot. Task does not end [NEEDINFO]
Summary: OVirt remove single disk snapshot. Task does not end
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 3.6.4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-3.6.7
Target Release: ---
Assignee: Ala Hino
QA Contact: Aharon Canan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-13 17:32 UTC by Markus Stockhausen
Modified: 2016-05-29 12:06 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-29 12:06:35 UTC
oVirt Team: Storage
ylavi: needinfo? (mst)
tnisan: ovirt-3.6.z?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments (Terms of Use)
snap2 (28.44 KB, image/png), 2016-04-13 17:34 UTC, Markus Stockhausen
snap1 (13.67 KB, image/png), 2016-04-13 17:35 UTC, Markus Stockhausen
engine.log (415.03 KB, text/plain), 2016-04-13 18:11 UTC, Markus Stockhausen
vdsm.log (310.10 KB, application/zip), 2016-04-13 18:19 UTC, Markus Stockhausen
live merge disk write (18.84 KB, image/png), 2016-04-13 19:19 UTC, Markus Stockhausen

Description Markus Stockhausen 2016-04-13 17:32:01 UTC
Description of problem:

According to BZ1319400, qemu 2.3.0 has severe problems when removing snapshots (aka live merge). We were advised to remove disk snapshots one by one, via the Storage tab -> Disk Snapshots sub-tab, to avoid machine stalls during the process.

This process has two bugs:

1) At the start of the live merge the message "snapshot deletion has been completed" is logged. This should read "snapshot deletion has been initiated".

2) The remove snapshot task is never finalized; it remains on the Tasks tab.


Version-Release number of selected component (if applicable):

oVirt 3.6.4.1

How reproducible:

100%

Steps to Reproduce:
1. Create & start a VM.
2. Take a snapshot.
3. Do some disk writes so that the snapshot disks get some data.
4. Go to the Storage -> Disk Snapshots tab.
5. Select & remove the disk snapshot.

Actual results:

The wrong message is logged and the task does not finish.

Expected results:

The right message is logged and the task finishes.

Additional info:

screenshots attached

Comment 1 Markus Stockhausen 2016-04-13 17:34:46 UTC
Created attachment 1146920 [details]
snap2

Comment 2 Markus Stockhausen 2016-04-13 17:35:06 UTC
Created attachment 1146921 [details]
snap1

Comment 3 Nir Soffer 2016-04-13 17:59:43 UTC
Markus, can you attach engine and vdsm logs?

Comment 4 Markus Stockhausen 2016-04-13 18:11:51 UTC
Created attachment 1146924 [details]
engine.log

Comment 5 Markus Stockhausen 2016-04-13 18:19:22 UTC
Created attachment 1146926 [details]
vdsm.log

Comment 6 Markus Stockhausen 2016-04-13 18:19:39 UTC
Et voilà

Comment 7 Nir Soffer 2016-04-13 18:43:36 UTC
In vdsm log we see:

1. We start a live merge

jsonrpc.Executor/1::DEBUG::2016-04-13 19:12:00,527::__init__::503::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.merge' in bridge with {u'topVolUUID': u'a5acfb9f-f774-4
f05-a527-d6dd4250ab95', u'vmID': u'0a25649f-97fb-4ced-bbf8-e03159ceb158', u'drive': {u'domainID': u'c86d8409-4dd6-4e30-86dc-b5175a7ceb86', u'volumeID': u'a5acfb9f-f774-4f05-
a527-d6dd4250ab95', u'poolID': u'94ed7a19-fade-4bd6-83f2-2cbb2f730b95', u'imageID': u'30338c17-6644-4f9d-9277-2b5fbbface04'}, u'bandwidth': u'0', u'jobUUID': u'c8045708-c52f
-453e-989a-60d79d10d108', u'baseVolUUID': u'691778c3-d3af-4263-aea4-fa9556ef8986'}

2. Live merge starts without an error

jsonrpc.Executor/1::DEBUG::2016-04-13 19:12:02,118::__init__::533::jsonrpc.JsonRpcServer::(_serveRequest) Return 'VM.merge' in bridge with True

3. Vdsm fails to connect to the qemu guest-agent channel

VM Channels Listener::DEBUG::2016-04-13 19:12:07,474::guestagent::225::virt.vm::(_connect) vmId=`dd04fad6-69dc-4a76-a0f4-b627cf8dbeb2`::Attempting connection to /var/lib/libvirt/qemu/channels/dd04fad6-69dc-4a76-a0f4-b627cf8dbeb2.com.redhat.rhevm.vdsm
VM Channels Listener::DEBUG::2016-04-13 19:12:07,475::guestagent::239::virt.vm::(_connect) vmId=`dd04fad6-69dc-4a76-a0f4-b627cf8dbeb2`::Failed to connect to /var/lib/libvirt/qemu/channels/dd04fad6-69dc-4a76-a0f4-b627cf8dbeb2.com.redhat.rhevm.vdsm with 11

This is EAGAIN; the error is expected when using non-blocking I/O, so this
looks like noisy logging rather than a real failure.
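For reference, errno 11 is EAGAIN on Linux. A minimal sketch (assuming a Linux errno layout) showing why a non-blocking read produces exactly this errno:

```python
import errno
import os
import socket

# On Linux, errno 11 is EAGAIN ("Resource temporarily unavailable").
# Non-blocking I/O returns it whenever the operation would block, so
# the guest-agent connect failure above is expected noise, not an error.
print(errno.EAGAIN, os.strerror(errno.EAGAIN))

a, b = socket.socketpair()
a.setblocking(False)
try:
    a.recv(1)                       # nothing has been written yet
except BlockingIOError as e:
    print(e.errno == errno.EAGAIN)  # True
```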

4. A huge number of periodic workers is created and discarded

vdsm.Scheduler::DEBUG::2016-04-13 19:12:17,095::executor::209::Executor::(_discard) Worker periodic/9891 discarded
vdsm.Scheduler::DEBUG::2016-04-13 19:12:17,095::executor::157::Executor::(__init__) Starting worker periodic/9900
periodic/9900::DEBUG::2016-04-13 19:12:17,096::executor::171::Executor::(_run) Worker started
periodic/9891::DEBUG::2016-04-13 19:12:19,780::executor::178::Executor::(_run) Worker was discarded

Typically there are only 4 periodic threads; here we are already starting
thread 9900 while discarding thread 9891.
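This churn is Vdsm's executor replacing workers whose periodic task blocked past its deadline; under a stalled qemu it happens continuously. A toy sketch of that discard mechanism (class and method names invented for illustration, not Vdsm's actual code):

```python
import itertools

class ToyExecutor:
    """Minimal model of a worker pool that discards blocked workers.

    A worker stuck on a blocked task is abandoned (not killed) and a
    fresh worker is started in its place; the abandoned worker logs
    "discarded" and keeps counting up, producing the
    periodic/9891 -> periodic/9900 churn seen in the log above.
    """

    def __init__(self):
        self._ids = itertools.count()
        self.log = []

    def start_worker(self):
        wid = next(self._ids)
        self.log.append("Starting worker periodic/%d" % wid)
        return wid

    def discard(self, wid):
        # Blocked worker is abandoned; a replacement starts immediately.
        self.log.append("Worker periodic/%d discarded" % wid)
        return self.start_worker()

ex = ToyExecutor()
w = ex.start_worker()   # periodic/0
w = ex.discard(w)       # its task blocked: discard, start periodic/1
print(ex.log)
```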

5. Error checking drive watermark

periodic/9894::WARNING::2016-04-13 19:12:20,858::periodic::258::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'542affb0-7a2f-4297-84f4-728a315a6abd']

Generally, on the vdsm side this looks like issues with libvirt and qemu,
which is expected given bug 1319400.

I guess that libvirt is not starting the block job because qemu is not
responding, and this fools the engine into thinking that the block job has
completed while it has not even started yet.

Needs deeper inspection on the engine side. Ala, can you take a look?

Comment 8 Markus Stockhausen 2016-04-13 19:19:11 UTC
Regarding whether the block job starts or not: from our system metrics we can see that qemu indeed starts the live merge and performs disk I/O. See the attached picture.

Comment 9 Markus Stockhausen 2016-04-13 19:19:45 UTC
Created attachment 1146943 [details]
live merge disk write

Comment 10 Nir Soffer 2016-04-13 20:25:30 UTC
A second pass at the vdsm log shows that the merge succeeded:

1. Starting merge

jsonrpc.Executor/1::DEBUG::2016-04-13 19:12:00,527::__init__::503::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.merge' in bridge with {u'topVolUUID': u'a5acfb9f-f774-4
f05-a527-d6dd4250ab95', u'vmID': u'0a25649f-97fb-4ced-bbf8-e03159ceb158', u'drive': {u'domainID': u'c86d8409-4dd6-4e30-86dc-b5175a7ceb86', u'volumeID': u'a5acfb9f-f774-4f05-
a527-d6dd4250ab95', u'poolID': u'94ed7a19-fade-4bd6-83f2-2cbb2f730b95', u'imageID': u'30338c17-6644-4f9d-9277-2b5fbbface04'}, u'bandwidth': u'0', u'jobUUID': u'c8045708-c52f
-453e-989a-60d79d10d108', u'baseVolUUID': u'691778c3-d3af-4263-aea4-fa9556ef8986'}

2. Starting block job

jsonrpc.Executor/1::INFO::2016-04-13 19:12:00,631::vm::4874::virt.vm::(merge) vmId=`0a25649f-97fb-4ced-bbf8-e03159ceb158`::Starting merge with jobUUID='c8045708-c52f-453e-989a-60d79d10d108'

3. Merge finished, trying to pivot to the merged image

Thread-46628::INFO::2016-04-13 19:14:35,854::vm::5159::virt.vm::(tryPivot) vmId=`0a25649f-97fb-4ced-bbf8-e03159ceb158`::Requesting pivot to complete active layer commit (job c8045708-c52f-453e-989a-60d79d10d108)

4. Pivot completed, using new merged image

Thread-46628::INFO::2016-04-13 19:14:35,897::vm::5172::virt.vm::(tryPivot) vmId=`0a25649f-97fb-4ced-bbf8-e03159ceb158`::Pivot completed (job c8045708-c52f-453e-989a-60d79d10d108)

5. Synchronizing volume chain

Thread-46628::INFO::2016-04-13 19:14:35,897::vm::5195::virt.vm::(run) vmId=`0a25649f-97fb-4ced-bbf8-e03159ceb158`::Synchronizing volume chain after live merge (job c8045708-c52f-453e-989a-60d79d10d108)
...
Thread-46628::INFO::2016-04-13 19:14:36,001::vm::5201::virt.vm::(run) vmId=`0a25649f-97fb-4ced-bbf8-e03159ceb158`::Synchronization completed (job c8045708-c52f-453e-989a-60d79d10d108)

So this is an issue on the engine side.
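This conclusion can be double-checked mechanically: filtering the vdsm log for the job UUID shows the complete lifecycle. A small sketch over abridged copies of the lines quoted above:

```python
import re

# Abridged vdsm.log lines for the merge job quoted above.
LOG = """\
19:12:00,631::vm::4874::virt.vm::(merge) Starting merge with jobUUID='c8045708-c52f-453e-989a-60d79d10d108'
19:14:35,854::vm::5159::virt.vm::(tryPivot) Requesting pivot to complete active layer commit (job c8045708-c52f-453e-989a-60d79d10d108)
19:14:35,897::vm::5172::virt.vm::(tryPivot) Pivot completed (job c8045708-c52f-453e-989a-60d79d10d108)
19:14:36,001::vm::5201::virt.vm::(run) Synchronization completed (job c8045708-c52f-453e-989a-60d79d10d108)
"""

JOB = "c8045708-c52f-453e-989a-60d79d10d108"

def lifecycle(log, job):
    """Return the human-readable event messages mentioning `job`, in order."""
    return [re.sub(r".*::\(\w+\)\s*", "", line)
            for line in log.splitlines() if job in line]

for event in lifecycle(LOG, JOB):
    print(event)
```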

Comment 11 Yaniv Lavi 2016-05-02 11:41:28 UTC
We made many fixes in oVirt 3.6.7 to address all the live merge issues we had. Can you test and see if this resolves your issue?

Comment 12 Ala Hino 2016-05-04 08:21:24 UTC
Waiting for response from submitter to see if issue fixed in ovirt 3.6.7

Comment 13 Allon Mureinik 2016-05-05 12:17:40 UTC
(In reply to Ala Hino from comment #12)
> Waiting for response from submitter to see if issue fixed in ovirt 3.6.7
All the relevant fixes [that we know about] were included in 3.6.6

Comment 14 Allon Mureinik 2016-05-29 12:06:35 UTC
We can't reproduce this on the latest 3.6.6 - closing.
If you're able to reproduce it, please reopen and attach the steps to reproduce and updated logs.

