Description of problem:

According to BZ1319400, qemu 2.3.0 has severe problems when removing snapshots (aka live merge). We have been advised to remove disk snapshots one by one, via the Storage tab -> Disk Snapshots sub-tab, to avoid machine stalls during the process.

This process has two bugs:

1) At the start of the live merge, the message "snapshot deletion has been completed" is logged. This should read "snapshot deletion has been initiated".

2) The remove-snapshot task is never finalized. It remains on the Tasks tab.

Version-Release number of selected component (if applicable):
oVirt 3.6.4.1

How reproducible:
100%

Steps to Reproduce:
1. Create and start a VM
2. Take a snapshot
3. Do some disk writes so that the snapshot disks get some data
4. Go to Storage -> Disk Snapshots tab
5. Select and remove the disk snapshot

Actual results:
The wrong message is logged and the task never finishes.

Expected results:
The right message is logged and the task finishes.

Additional info:
Screenshots attached.
Created attachment 1146920 [details] snap2
Created attachment 1146921 [details] snap1
Markus, can you attach engine and vdsm logs?
Created attachment 1146924 [details] engine.log
Created attachment 1146926 [details] vdsm.log
There you go.
In the vdsm log we see:

1. We start a live merge:

jsonrpc.Executor/1::DEBUG::2016-04-13 19:12:00,527::__init__::503::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.merge' in bridge with {u'topVolUUID': u'a5acfb9f-f774-4f05-a527-d6dd4250ab95', u'vmID': u'0a25649f-97fb-4ced-bbf8-e03159ceb158', u'drive': {u'domainID': u'c86d8409-4dd6-4e30-86dc-b5175a7ceb86', u'volumeID': u'a5acfb9f-f774-4f05-a527-d6dd4250ab95', u'poolID': u'94ed7a19-fade-4bd6-83f2-2cbb2f730b95', u'imageID': u'30338c17-6644-4f9d-9277-2b5fbbface04'}, u'bandwidth': u'0', u'jobUUID': u'c8045708-c52f-453e-989a-60d79d10d108', u'baseVolUUID': u'691778c3-d3af-4263-aea4-fa9556ef8986'}

2. The live merge starts without an error:

jsonrpc.Executor/1::DEBUG::2016-04-13 19:12:02,118::__init__::533::jsonrpc.JsonRpcServer::(_serveRequest) Return 'VM.merge' in bridge with True

3. Vdsm fails to connect to qemu:

VM Channels Listener::DEBUG::2016-04-13 19:12:07,474::guestagent::225::virt.vm::(_connect) vmId=`dd04fad6-69dc-4a76-a0f4-b627cf8dbeb2`::Attempting connection to /var/lib/libvirt/qemu/channels/dd04fad6-69dc-4a76-a0f4-b627cf8dbeb2.com.redhat.rhevm.vdsm
VM Channels Listener::DEBUG::2016-04-13 19:12:07,475::guestagent::239::virt.vm::(_connect) vmId=`dd04fad6-69dc-4a76-a0f4-b627cf8dbeb2`::Failed to connect to /var/lib/libvirt/qemu/channels/dd04fad6-69dc-4a76-a0f4-b627cf8dbeb2.com.redhat.rhevm.vdsm with 11

This is EAGAIN - the error is expected when using non-blocking io, so this looks like a misleading log message rather than a real failure (see the sketch at the end of this comment).

4. A huge amount of periodic workers is created and discarded:

vdsm.Scheduler::DEBUG::2016-04-13 19:12:17,095::executor::209::Executor::(_discard) Worker periodic/9891 discarded
vdsm.Scheduler::DEBUG::2016-04-13 19:12:17,095::executor::157::Executor::(__init__) Starting worker periodic/9900
periodic/9900::DEBUG::2016-04-13 19:12:17,096::executor::171::Executor::(_run) Worker started
periodic/9891::DEBUG::2016-04-13 19:12:19,780::executor::178::Executor::(_run) Worker was discarded

Typically there are only 4 periodic threads - here we are starting thread 9900 and discarding thread 9891.

5. Error checking drive watermark:

periodic/9894::WARNING::2016-04-13 19:12:20,858::periodic::258::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'542affb0-7a2f-4297-84f4-728a315a6abd']

Generally, on the vdsm side these look like issues with libvirt and qemu, which is expected considering bug 1319400. I guess that libvirt is not starting the block job because qemu is not responding, and this fools the engine into thinking that the block job has completed while it did not even start yet.

This needs deeper inspection on the engine side. Ala, can you take a look?
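For reference, a minimal sketch (not vdsm's actual code) of why errno 11 (EAGAIN) is normal for point 3 above: a non-blocking connect to a UNIX socket simply reports that the connection could not be completed immediately, and the caller retries later. The channel path below is an example only.

import errno
import socket

# Example path; vdsm uses the per-VM virtio channel under
# /var/lib/libvirt/qemu/channels/<vmId>.com.redhat.rhevm.vdsm
SOCK_PATH = "/var/lib/libvirt/qemu/channels/example.com.redhat.rhevm.vdsm"

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.setblocking(False)
try:
    sock.connect(SOCK_PATH)
    print("connected immediately")
except OSError as e:
    if e.errno in (errno.EAGAIN, errno.EINPROGRESS):
        # Not an error: the non-blocking connect is simply not done yet;
        # retry later or wait for the socket to become writable.
        print("connect in progress (errno %d), will retry" % e.errno)
    else:
        # Any other errno (e.g. ENOENT if the channel does not exist)
        # is a real failure.
        print("connect failed: %s" % e)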
Regarding whether the block job starts or not: from our system metrics we can see that qemu indeed starts the live merge and performs disk IO. See the attached picture.
Created attachment 1146943 [details] live merge disk write
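As a side note, the same thing can be confirmed from the hypervisor itself, independent of the engine UI, by asking libvirt whether a block job is active on the disk. A rough sketch using libvirt-python; the VM name and disk target ("vda") are placeholders.

import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("example-vm")   # placeholder VM name

info = dom.blockJobInfo("vda", 0)       # empty dict if no job is active
if info:
    print("block job type=%s cur=%d end=%d" %
          (info.get("type"), info.get("cur", 0), info.get("end", 0)))
else:
    print("no block job active on vda")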
A second pass at the vdsm log shows that the merge succeeded:

1. Starting the merge:

jsonrpc.Executor/1::DEBUG::2016-04-13 19:12:00,527::__init__::503::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.merge' in bridge with {u'topVolUUID': u'a5acfb9f-f774-4f05-a527-d6dd4250ab95', u'vmID': u'0a25649f-97fb-4ced-bbf8-e03159ceb158', u'drive': {u'domainID': u'c86d8409-4dd6-4e30-86dc-b5175a7ceb86', u'volumeID': u'a5acfb9f-f774-4f05-a527-d6dd4250ab95', u'poolID': u'94ed7a19-fade-4bd6-83f2-2cbb2f730b95', u'imageID': u'30338c17-6644-4f9d-9277-2b5fbbface04'}, u'bandwidth': u'0', u'jobUUID': u'c8045708-c52f-453e-989a-60d79d10d108', u'baseVolUUID': u'691778c3-d3af-4263-aea4-fa9556ef8986'}

2. Starting the block job:

jsonrpc.Executor/1::INFO::2016-04-13 19:12:00,631::vm::4874::virt.vm::(merge) vmId=`0a25649f-97fb-4ced-bbf8-e03159ceb158`::Starting merge with jobUUID='c8045708-c52f-453e-989a-60d79d10d108'

3. Merge finished, trying to pivot to the merged image:

Thread-46628::INFO::2016-04-13 19:14:35,854::vm::5159::virt.vm::(tryPivot) vmId=`0a25649f-97fb-4ced-bbf8-e03159ceb158`::Requesting pivot to complete active layer commit (job c8045708-c52f-453e-989a-60d79d10d108)

4. Pivot completed, using the new merged image:

Thread-46628::INFO::2016-04-13 19:14:35,897::vm::5172::virt.vm::(tryPivot) vmId=`0a25649f-97fb-4ced-bbf8-e03159ceb158`::Pivot completed (job c8045708-c52f-453e-989a-60d79d10d108)

5. Synchronizing the volume chain:

Thread-46628::INFO::2016-04-13 19:14:35,897::vm::5195::virt.vm::(run) vmId=`0a25649f-97fb-4ced-bbf8-e03159ceb158`::Synchronizing volume chain after live merge (job c8045708-c52f-453e-989a-60d79d10d108)
...
Thread-46628::INFO::2016-04-13 19:14:36,001::vm::5201::virt.vm::(run) vmId=`0a25649f-97fb-4ced-bbf8-e03159ceb158`::Synchronization completed (job c8045708-c52f-453e-989a-60d79d10d108)

So this is an issue on the engine side.
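For readers unfamiliar with the flow above, here is a rough sketch of the underlying libvirt calls (active-layer commit, wait for the job to become ready, then pivot). This is not vdsm's actual implementation; the domain name, disk target and volume paths are placeholders.

import time
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("example-vm")          # placeholder VM name

disk = "vda"                                   # placeholder disk target
base = "/rhev/data-center/example/base-volume" # placeholder image paths
top = "/rhev/data-center/example/top-volume"

# Steps 1-2: start the active-layer commit (the "merge" vdsm requests).
dom.blockCommit(disk, base, top, 0,
                libvirt.VIR_DOMAIN_BLOCK_COMMIT_ACTIVE)

# Step 3: wait until qemu reports the job as ready (cur == end).
while True:
    info = dom.blockJobInfo(disk, 0)
    if info and info.get("cur") == info.get("end"):
        break
    time.sleep(1)

# Step 4: pivot to the merged image; the base volume becomes the active layer.
dom.blockJobAbort(disk, libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)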
We made a lot of fixes in ovirt 3.6.7 to address all the live merge issues we had. Can you test and see if this resolves your issues?
Waiting for response from submitter to see if issue fixed in ovirt 3.6.7
(In reply to Ala Hino from comment #12)
> Waiting for response from submitter to see if issue fixed in ovirt 3.6.7

All the relevant fixes [that we know about] were included in 3.6.6.
We can't reproduce this on the latest 3.6.6 - closing. If you're able to reproduce it, please reopen and attach the steps to reproduce and updated logs.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days