Bug 1326900
| Summary: | OVirt remove single disk snapshot. Task does not end | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Markus Stockhausen <mst> |
| Component: | Backend.Core | Assignee: | Ala Hino <ahino> |
| Status: | CLOSED WORKSFORME | QA Contact: | Aharon Canan <acanan> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.6.4.1 | CC: | ahino, amureini, bugs, mst, nsoffer, tnisan, ylavi |
| Target Milestone: | ovirt-3.6.7 | Flags: | tnisan: ovirt-3.6.z? rule-engine: planning_ack? rule-engine: devel_ack? rule-engine: testing_ack? |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-05-29 12:06:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description (Markus Stockhausen, 2016-04-13 17:32:01 UTC)
Created attachment 1146920 [details]
snap2
Created attachment 1146921 [details]
snap1
Markus, can you attach engine and vdsm logs?

Created attachment 1146924 [details]
engine.log
Created attachment 1146926 [details]
vdsm.log
Et voilà.

In the vdsm log we see:
1. We start a live merge
jsonrpc.Executor/1::DEBUG::2016-04-13 19:12:00,527::__init__::503::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.merge' in bridge with {u'topVolUUID': u'a5acfb9f-f774-4
f05-a527-d6dd4250ab95', u'vmID': u'0a25649f-97fb-4ced-bbf8-e03159ceb158', u'drive': {u'domainID': u'c86d8409-4dd6-4e30-86dc-b5175a7ceb86', u'volumeID': u'a5acfb9f-f774-4f05-
a527-d6dd4250ab95', u'poolID': u'94ed7a19-fade-4bd6-83f2-2cbb2f730b95', u'imageID': u'30338c17-6644-4f9d-9277-2b5fbbface04'}, u'bandwidth': u'0', u'jobUUID': u'c8045708-c52f
-453e-989a-60d79d10d108', u'baseVolUUID': u'691778c3-d3af-4263-aea4-fa9556ef8986'}
2. Live merge starts without an error
jsonrpc.Executor/1::DEBUG::2016-04-13 19:12:02,118::__init__::533::jsonrpc.JsonRpcServer::(_serveRequest) Return 'VM.merge' in bridge with True
3. Vdsm fails to connect to the guest agent channel
VM Channels Listener::DEBUG::2016-04-13 19:12:07,474::guestagent::225::virt.vm::(_connect) vmId=`dd04fad6-69dc-4a76-a0f4-b627cf8dbeb2`::Attempting connection to /var/lib/libvirt/qemu/channels/dd04fad6-69dc-4a76-a0f4-b627cf8dbeb2.com.redhat.rhevm.vdsm
VM Channels Listener::DEBUG::2016-04-13 19:12:07,475::guestagent::239::virt.vm::(_connect) vmId=`dd04fad6-69dc-4a76-a0f4-b627cf8dbeb2`::Failed to connect to /var/lib/libvirt/qemu/channels/dd04fad6-69dc-4a76-a0f4-b627cf8dbeb2.com.redhat.rhevm.vdsm with 11
This is EAGAIN (errno 11), which is expected when using non-blocking I/O;
the "Failed to connect" message looks like a misleading log rather than a real failure.
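As an aside, errno 11 (EAGAIN) is the normal "would block, retry later" result on a non-blocking file descriptor. A minimal, self-contained demonstration with plain Python sockets (illustrative only, not vdsm code):

```python
import errno
import socket

# A connected pair of UNIX stream sockets; nothing has been written yet.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
a.setblocking(False)  # non-blocking, like vdsm's guest agent channel

try:
    a.recv(1)  # no data available -> does not block, raises instead
except BlockingIOError as e:
    # EAGAIN means "would block"; the event loop should simply retry,
    # so logging it as "Failed to connect ... with 11" is misleading.
    print(e.errno == errno.EAGAIN)  # True
```

The point is that errno 11 on a non-blocking channel is an expected transient condition, not a connection failure.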
4. A huge number of periodic workers are created and discarded
vdsm.Scheduler::DEBUG::2016-04-13 19:12:17,095::executor::209::Executor::(_discard) Worker periodic/9891 discarded
vdsm.Scheduler::DEBUG::2016-04-13 19:12:17,095::executor::157::Executor::(__init__) Starting worker periodic/9900
periodic/9900::DEBUG::2016-04-13 19:12:17,096::executor::171::Executor::(_run) Worker started
periodic/9891::DEBUG::2016-04-13 19:12:19,780::executor::178::Executor::(_run) Worker was discarded
Typically there are only 4 periodic threads; here we are already starting
worker 9900 and discarding worker 9891.
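The climbing worker numbers follow from the executor's discard-and-replace policy: a blocked worker is abandoned (it just logs "Worker was discarded" when it eventually returns) and a fresh worker with the next sequential name takes its place, so every stall bumps the counter. A simplified sketch of that bookkeeping (hypothetical class, not vdsm's actual Executor):

```python
import itertools

class MiniExecutor:
    """Illustrative discard-and-replace bookkeeping, not vdsm's code."""

    def __init__(self, name="periodic"):
        self._name = name
        self._ids = itertools.count()
        self.log = []

    def start_worker(self):
        worker = "%s/%d" % (self._name, next(self._ids))
        self.log.append("Starting worker %s" % worker)
        return worker

    def discard(self, worker):
        # The stuck worker is abandoned and a replacement is started;
        # every discard therefore increments the worker counter by one.
        self.log.append("Worker %s discarded" % worker)
        return self.start_worker()

ex = MiniExecutor()
w = ex.start_worker()   # periodic/0
for _ in range(3):      # three stuck workers in a row
    w = ex.discard(w)
print(w)                # periodic/3
```

Worker numbers near 9900 thus suggest thousands of such stalls, consistent with libvirt/qemu being unresponsive.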
5. Error checking drive watermark
periodic/9894::WARNING::2016-04-13 19:12:20,858::periodic::258::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'542affb0-7a2f-4297-84f4-728a315a6abd']
Generally, on the vdsm side this looks like issues with libvirt and qemu,
which is expected when we consider bug 1319400.
My guess is that libvirt is not starting the block job because qemu is not
responding, and this fools the engine into thinking that the block job has
completed when it has not started yet.
Needs deeper inspection on the engine side. Ala, can you take a look?
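The hypothesized failure mode can be stated as a small decision function: when polling libvirt, an absent block job is indistinguishable from a finished one unless there is independent confirmation that the job ever ran. A hypothetical illustration (not the engine's actual code):

```python
def block_job_state(active_jobs, job_id, start_confirmed):
    """Illustrative only: how an absent job can be misread as done.

    active_jobs:     job ids libvirt currently reports as running
    job_id:          the merge job being tracked
    start_confirmed: whether the job was ever observed running
    """
    if job_id in active_jobs:
        return "running"
    # The job is absent from libvirt's list. Without confirmation that
    # it ever started, "completed" is a guess -- the hypothesis above
    # is that the engine makes exactly this guess.
    return "completed" if start_confirmed else "unknown"

# Unresponsive qemu: the job never appears and no start was confirmed.
print(block_job_state(set(), "c8045708", start_confirmed=False))  # unknown
```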
Regarding whether the block job starts or not: from our system metrics we can
see that qemu does indeed start the live merge and performs disk I/O. See the
attached picture.

Created attachment 1146943 [details]
live merge disk write
A second pass at the vdsm log shows that the merge succeeded:
1. Starting merge
jsonrpc.Executor/1::DEBUG::2016-04-13 19:12:00,527::__init__::503::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.merge' in bridge with {u'topVolUUID': u'a5acfb9f-f774-4
f05-a527-d6dd4250ab95', u'vmID': u'0a25649f-97fb-4ced-bbf8-e03159ceb158', u'drive': {u'domainID': u'c86d8409-4dd6-4e30-86dc-b5175a7ceb86', u'volumeID': u'a5acfb9f-f774-4f05-
a527-d6dd4250ab95', u'poolID': u'94ed7a19-fade-4bd6-83f2-2cbb2f730b95', u'imageID': u'30338c17-6644-4f9d-9277-2b5fbbface04'}, u'bandwidth': u'0', u'jobUUID': u'c8045708-c52f
-453e-989a-60d79d10d108', u'baseVolUUID': u'691778c3-d3af-4263-aea4-fa9556ef8986'}
2. Starting block job
jsonrpc.Executor/1::INFO::2016-04-13 19:12:00,631::vm::4874::virt.vm::(merge) vmId=`0a25649f-97fb-4ced-bbf8-e03159ceb158`::Starting merge with jobUUID='c8045708-c52f-453e-989a-60d79d10d108'
3. Merge finished, trying to pivot to the merged image
Thread-46628::INFO::2016-04-13 19:14:35,854::vm::5159::virt.vm::(tryPivot) vmId=`0a25649f-97fb-4ced-bbf8-e03159ceb158`::Requesting pivot to complete active layer commit (job c8045708-c52f-453e-989a-60d79d10d108)
4. Pivot completed, using new merged image
Thread-46628::INFO::2016-04-13 19:14:35,897::vm::5172::virt.vm::(tryPivot) vmId=`0a25649f-97fb-4ced-bbf8-e03159ceb158`::Pivot completed (job c8045708-c52f-453e-989a-60d79d10d108)
5. Synchronizing volume chain
Thread-46628::INFO::2016-04-13 19:14:35,897::vm::5195::virt.vm::(run) vmId=`0a25649f-97fb-4ced-bbf8-e03159ceb158`::Synchronizing volume chain after live merge (job c8045708-c52f-453e-989a-60d79d10d108)
...
Thread-46628::INFO::2016-04-13 19:14:36,001::vm::5201::virt.vm::(run) vmId=`0a25649f-97fb-4ced-bbf8-e03159ceb158`::Synchronization completed (job c8045708-c52f-453e-989a-60d79d10d108)
So this is an issue on the engine side.
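The successful sequence above reduces to a linear phase progression per job UUID; a compact sketch of that progression (phase names are illustrative, not vdsm's actual states):

```python
# Phases observed in the vdsm log, in order, for one live merge job.
PHASES = ("merge requested", "block job started", "pivot requested",
          "pivot completed", "chain synchronized")

class LiveMergeJob:
    """Illustrative tracker for the phases seen in the vdsm log."""

    def __init__(self, job_uuid):
        self.job_uuid = job_uuid
        self._idx = -1  # no phase entered yet

    @property
    def phase(self):
        return None if self._idx < 0 else PHASES[self._idx]

    def advance(self):
        if self._idx + 1 >= len(PHASES):
            raise RuntimeError("job %s already finished" % self.job_uuid)
        self._idx += 1
        return self.phase

job = LiveMergeJob("c8045708-c52f-453e-989a-60d79d10d108")
while job.phase != "chain synchronized":
    job.advance()
print(job.phase)  # chain synchronized
```

On the vdsm side every phase completed, which is why the stuck task points at the engine's tracking of the job rather than at the merge itself.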
We made a lot of fixes in oVirt 3.6.7 to address all the known live merge
issues. Can you test and see if this resolves your issue?

Waiting for a response from the submitter to see if the issue is fixed in
oVirt 3.6.7.

(In reply to Ala Hino from comment #12)
> Waiting for response from submitter to see if issue fixed in ovirt 3.6.7

All the relevant fixes [that we know about] were included in 3.6.6.

We can't reproduce this on the latest 3.6.6, so we are closing the bug. If
you're able to reproduce it, please reopen and attach the steps to reproduce
and updated logs.

The needinfo request[s] on this closed bug have been removed as they have
been unresolved for 1000 days.