Bug 1155583 - [Block storage] Basic Live Merge after Delete Snapshot fails
Summary: [Block storage] Basic Live Merge after Delete Snapshot fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.0
Hardware: x86_64
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Adam Litke
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Duplicates: 1160336 1205642
Depends On: 1206355 1206365 1206722 1207808
Blocks: 647386 1177222 1196199
 
Reported: 2014-10-22 12:38 UTC by Kevin Alon Goldblatt
Modified: 2016-03-09 19:26 UTC
CC List: 22 users

Fixed In Version: ovirt-engine-3.6.0_qa1
Doc Type: Bug Fix
Doc Text:
When live merging snapshots on a block storage domain, the merge target volume is pro-actively extended to accommodate active writing of the source volume. This may cause some over-extension of the target volume.
Clone Of:
Clones: 1177222
Environment:
Last Closed: 2016-03-09 19:26:11 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
engine vdsm and server logs (840.48 KB, application/x-gzip)
2014-10-22 12:38 UTC, Kevin Alon Goldblatt
NEW engine vdsm and server logs (287.76 KB, application/x-gzip)
2014-11-02 13:11 UTC, Kevin Alon Goldblatt
server, engine and vdsm logs (650.92 KB, application/x-gzip)
2015-03-24 21:40 UTC, Kevin Alon Goldblatt


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1170712 0 high CLOSED Live Merge: Failed to remove snapshot on block storage due to -ENOSPC 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHBA-2016:0362 0 normal SHIPPED_LIVE vdsm 3.6.0 bug fix and enhancement update 2016-03-09 23:49:32 UTC
oVirt gerrit 35573 0 ovirt-3.5 MERGED Live Merge: Proactively extend merge target volume 2020-12-02 00:44:37 UTC
oVirt gerrit 35614 0 master MERGED Live Merge: Proactively extend merge target volume 2020-12-02 00:44:10 UTC
oVirt gerrit 35757 0 master MERGED Do not extend volume beyond truesize 2020-12-02 00:44:37 UTC
oVirt gerrit 35768 0 ovirt-3.5 MERGED Do not extend volume beyond truesize 2020-12-02 00:44:11 UTC

Internal Links: 1170712

Description Kevin Alon Goldblatt 2014-10-22 12:38:40 UTC
Created attachment 949378 [details]
engine vdsm and server logs

Description of problem:
Live Merge failed during delete snapshot operation

Version-Release number of selected component (if applicable):
rhevm-3.5.0-0.14.beta.el6ev.noarch
vdsm-4.16.7-1.gitdb83943.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create VM with 5 disks (block preallocated, block thin provisioned, nfs preallocated and nfs thin provisioned)
2.Created snapshot 1 while VM is down
3.Created snapshot 2 while VM is down
4.Created snapshot 3 while VM is down
5.Start VM
6.Delete snapshot 2 >>>> The operation fails with failed live merge

Actual results:
The snapshot was not deleted - Live merge failed

Expected results:
The snapshot should be deleted and Live merge should pass


Additional info:
2014-10-22 12:50:43,673 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-28) VM job 58d1af1e-c92a-4e20-a55b-f04dd4b32515: In progress (no change)
2014-10-22 12:50:43,673 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-28) VM job 1c772a93-8015-44f1-b715-aa4561a66202: In progress (no change)
2014-10-22 12:50:46,732 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-61) VM job 1c772a93-8015-44f1-b715-aa4561a66202: In progress, updating
2014-10-22 12:50:46,733 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-61) VM job 58d1af1e-c92a-4e20-a55b-f04dd4b32515: Deleting
2014-10-22 12:50:48,488 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-42) [3d71a8a0] Waiting on Live Merge command step MERGE to complete
2014-10-22 12:50:48,512 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-42) [227b9015] Failed child command status for step MERGE
2014-10-22 12:50:48,517 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-42) [227b9015] Waiting on Live Merge child commands to complete
2014-10-22 12:50:48,530 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-42) [7579da70] Waiting on Live Merge command step MERGE to complete
2014-10-22 12:50:48,554 INFO  [org.ovirt.engine.core.bll.MergeCommandCallback] (DefaultQuartzScheduler_Worker-42) [27a4ff17] Merge command has completed for images bf122b09-bc4a-4c24-9a30-d7d6360f5b7f..13dfe8f2-d5d6-470a-b60d-fb02d7be52de
2014-10-22 12:50:48,559 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-42) [5df0af11] Failed child command status for step MERGE
2014-10-22 12:50:48,571 INFO  [org.ovirt.engine.core.bll.MergeCommandCallback] (DefaultQuartzScheduler_Worker-42) [2ee537d4] Waiting on merge command to complete
2014-10-22 12:50:48,577 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-42) [186c164f] Failed child command status for step MERGE
2014-10-22 12:50:49,834 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-84) VM job 1c772a93-8015-44f1-b715-aa4561a66202: In progress (no change)
2014-10-22 12:50:52,858 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-86) VM job 1c772a93-8015-44f1-b715-aa4561a66202: In progress (no change)
2014-10-22 12:50:55,948 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-97) VM job 1c772a93-8015-44f1-b715-aa4561a66202: In progress (no change)
2014-10-22 12:50:58,591 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-69) [3d71a8a0] Executing Live Merge command step MERGE_STATUS
2014-10-22 12:50:58,634 INFO  [org.ovirt.engine.core.bll.MergeStatusCommand] (pool-6-thread-2) [c47c56c] Running command: MergeStatusCommand internal: true. Entities affected :  ID: 01e68b31-0e5a-4f76-9416-11b3e0cc9182 Type: Storage
2014-10-22 12:50:58,689 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-69) [227b9015] Merging of snapshot d390fbdd-b7a8-4587-a95f-6289b70d6d8b images b55971ae-544b-4c8f-9656-306a2c06ba52..397d10f6-56f4-45fa-83b6-ec5976246a9a failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2014-10-22 12:50:58,691 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-69) [227b9015] Waiting on Live Merge child commands to complete
2014-10-22 12:50:58,696 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-69) [7579da70] Waiting on Live Merge command step MERGE to complete
2014-10-22 12:50:58,710 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-69) [5df0af11] Merging of snapshot d390fbdd-b7a8-4587-a95f-6289b70d6d8b images 6e832976-0217-477f-a2d1-b34991e0f702..5d5499e8-3784-40dc-93a9-a08c3c66c2cb failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2014-10-22 12:50:58,716 INFO  [org.ovirt.engine.core.bll.MergeCommandCallback] (DefaultQuartzScheduler_Worker-69) [2ee537d4] Waiting on merge command to complete
2014-10-22 12:50:58,729 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-69) [186c164f] Merging of snapshot d390fbdd-b7a8-4587-a95f-6289b70d6d8b images 35fb1276-44ca-4f0f-8b15-83a2cb3b28e2..4f0ceabd-d5e9-468d-9327-1a51fc67c571 failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2014-10-22 12:50:58,777 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVdsCommand] (pool-6-thread-2) [c47c56c] START, FullListVdsCommand(HostName = nott-vds1, HostId = 36bc5b33-214c-48d0-828e-f0fb7f903597, vds=Host[nott-vds1,36bc5b33-214c-48d0-828e-f0fb7f903597], vmIds=[b7c8f32b-5ad9-42e3-9616-ee929fb90af8]), log id: 6fdfd5f3

Comment 1 Kevin Alon Goldblatt 2014-10-22 12:49:02 UTC
I used the libvirt rpms supplied by Eric Blake at:
http://people.redhat.com/jdenemar/libvirt/7.0/ on a vdsm host with RHEL 7.0 and v3.5 vt5

Comment 2 Adam Litke 2014-10-22 15:45:36 UTC
A couple of things to note.

1. Live merge for volumes on block storage will not work until Bug 1041569 is fixed.  I would expect everything to work fine if the test was repeated for a VM with disks located solely on NFS storage.  This bug is targeted for RHEL-7.1 and we'll be providing builds of libvirt for 7.0 to select customers until 7.1 is released.  We'll work to get an updated build as soon as Eric Blake completes development.

2. The vdsm log attached to this bug has no evidence of any live merge activity whatsoever.  If multiple vdsm hosts are involved in the test, please be sure to post the full log from each host.

Could you please retry the test with only NFS-based disks.  Could you also retry this exact test and be sure to include the full vdsm logs?  I like to truncate the log on the host before attempting a merge "truncate --size 0 /var/log/vdsm/vdsm.log" and then collect the logs immediately after the test completes.  You can tell if the log contains the necessary info if you can grep for the following:

grep 'Starting merge with jobUUID' vdsm.log

And you might be able to confirm that this is a problem with unsupported block storage with the following grep:

grep '<allocation/> missing from backing chain' vdsm.log
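
A minimal, illustrative triage helper (not part of vdsm; the marker strings are just the two greps above, and the default log path is the usual /var/log/vdsm/vdsm.log) that reports whether a collected log shows any merge activity and whether the block-storage limitation was hit:

#!/usr/bin/env python
# Illustrative sketch only -- it automates the two greps from this comment
# and counts how many lines in a vdsm.log match each marker.
import sys

MARKERS = {
    "merge started": "Starting merge with jobUUID",
    "block storage limitation": "<allocation/> missing from backing chain",
}


def scan(path):
    counts = dict.fromkeys(MARKERS, 0)
    with open(path) as log:
        for line in log:
            for name, needle in MARKERS.items():
                if needle in line:
                    counts[name] += 1
    return counts


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/vdsm/vdsm.log"
    for name, count in sorted(scan(path).items()):
        print("%-28s %6d line(s)" % (name, count))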

Comment 4 Kevin Alon Goldblatt 2014-11-02 13:11:14 UTC
Created attachment 952903 [details]
NEW engine vdsm and server logs

Reproduced the same scenario and added new logs


From vdsm log:
-----------------------------------
grep 'Starting merge with jobUUID' vdsm.log

Thread-50163::INFO::2014-11-02 14:45:14,413::vm::5741::vm.Vm::(merge) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::Starting merge with jobUUID='16b56816-2508-4670-8f47-0b9b1ce1115f'
Thread-50164::INFO::2014-11-02 14:45:14,482::vm::5741::vm.Vm::(merge) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::Starting merge with jobUUID='64942a37-a3c2-4dd5-8a83-e251e58e6316'


grep '<allocation/> missing from backing chain' /var/log/vdsm/vdsm.log
-------------------------------------------------------------------------

Thread-50161::DEBUG::2014-11-02 14:45:14,167::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-50162::DEBUG::2014-11-02 14:45:14,248::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:15,627::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:15,636::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:17,667::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:17,676::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:19,705::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:19,715::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:21,744::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:21,756::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:23,787::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:23,797::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:25,827::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:25,837::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:27,867::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:27,877::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:29,906::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:29,973::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:32,014::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:32,022::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:34,053::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:34,061::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:36,092::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:36,100::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:38,131::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:38,146::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:40,186::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:40,194::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:42,227::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:42,234::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:44,266::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:44,274::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb



More from vdsm.log
--------------------------

Thread-50160::DEBUG::2014-11-02 14:45:12,897::stompReactor::163::yajsonrpc.StompServer::(send) Sending response
Dummy-262::DEBUG::2014-11-02 14:45:13,814::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) dd if=/rhev/data-center/b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
Dummy-262::DEBUG::2014-11-02 14:45:13,836::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0147028 s, 69.6 MB/s\n'; <rc> = 0
JsonRpc (StompReactor)::DEBUG::2014-11-02 14:45:14,133::stompReactor::98::Broker.StompAdapter::(handle_frame) Handling message <StompFrame command='SEND'>
JsonRpcServer::DEBUG::2014-11-02 14:45:14,134::__init__::504::jsonrpc.JsonRpcServer::(serve_requests) Waiting for request
Thread-50161::DEBUG::2014-11-02 14:45:14,134::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.merge' in bridge with {u'topVolUUID': u'6035dc5f-6715-477d-a227-2cd8dde2de2f', u'vmID': u'116b380c-458d-4aea-be5a-865d033dc60a', u'drive': {u'domainID': u'1670633f-ed37-485e-98e8-d9c6f8687754', u'volumeID': u'20303b64-bc2a-4c85-b1ff-82769e8367c3', u'poolID': u'b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd', u'imageID': u'cdeb134c-7512-4ae0-af4d-5bed7e65d7cf'}, u'bandwidth': u'0', u'jobUUID': u'a77992d0-15e2-44df-900f-06d8c116528d', u'baseVolUUID': u'b7403493-c5d5-4534-9a5a-d12e0dab4618'}
Thread-50161::DEBUG::2014-11-02 14:45:14,167::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-50161::ERROR::2014-11-02 14:45:14,167::vm::5690::vm.Vm::(merge) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::merge: libvirt does not support volume chain monitoring.  Unable to perform live merge.
Thread-50161::DEBUG::2014-11-02 14:45:14,168::stompReactor::163::yajsonrpc.StompServer::(send) Sending response
JsonRpc (StompReactor)::DEBUG::2014-11-02 14:45:14,200::stompReactor::98::Broker.StompAdapter::(handle_frame) Handling message <StompFrame command='SEND'>
JsonRpcServer::DEBUG::2014-11-02 14:45:14,201::__init__::504::jsonrpc.JsonRpcServer::(serve_requests) Waiting for request
Thread-50162::DEBUG::2014-11-02 14:45:14,201::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.merge' in bridge with {u'topVolUUID': u'b6074af3-510f-4240-8673-42b99307f40d', u'vmID': u'116b380c-458d-4aea-be5a-865d033dc60a', u'drive': {u'domainID': u'1670633f-ed37-485e-98e8-d9c6f8687754', u'volumeID': u'b814ba88-503b-4a77-956d-051c08a5906a', u'poolID': u'b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd', u'imageID': u'23bc016e-9afa-42ff-b855-b8f7e2de2a60'}, u'bandwidth': u'0', u'jobUUID': u'b7c2f1c2-bcaa-40e1-b5f2-ce19e5853fd6', u'baseVolUUID': u'fa86d61e-6115-4bb5-9ba1-546df45c897e'}
Thread-50162::DEBUG::2014-11-02 14:45:14,248::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-50162::ERROR::2014-11-02 14:45:14,248::vm::5690::vm.Vm::(merge) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::merge: libvirt does not support volume chain monitoring.  Unable to perform live merge.
Thread-50162::DEBUG::2014-11-02 14:45:14,249::stompReactor::163::yajsonrpc.StompServer::(send) Sending response
JsonRpc (StompReactor)::DEBUG::2014-11-02 14:45:14,255::stompReactor::98::Broker.StompAdapter::(handle_frame) Handling message <StompFrame command='SEND'>
JsonRpcServer::DEBUG::2014-11-02 14:45:14,256::__init__::504::jsonrpc.JsonRpcServer::(serve_requests) Waiting for request
Thread-50163::DEBUG::2014-11-02 14:45:14,256::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.merge' in bridge with {u'topVolUUID': u'708ce999-4b00-4d2e-97e8-237f15b7581e', u'vmID': u'116b380c-458d-4aea-be5a-865d033dc60a', u'drive': {u'domainID': u'b473691a-e6f6-4c44-81cb-6bee354608fd', u'volumeID': u'269c518b-5a9e-4264-9703-ed5c3216ef82', u'poolID': u'b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd', u'imageID': u'1c018658-d26b-4af5-a808-6625320e5845'}, u'bandwidth': u'0', u'jobUUID': u'16b56816-2508-4670-8f47-0b9b1ce1115f', u'baseVolUUID': u'6136ee84-99b6-4a8e-9c49-4f9bd60525f1'}
Thread-50163::DEBUG::2014-11-02 14:45:14,288::task::595::Storage.TaskManager.Task::(_updateState) Task=`94b6bf28-58b8-480e-9258-501c4aeb184c`::moving from state init -> state preparing
Thread-50163::INFO::2014-11-02 14:45:14,289::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeInfo(sdUUID=u'b473691a-e6f6-4c44-81cb-6bee354608fd', spUUID=u'b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd', imgUUID=u'1c018658-d26b-4af5-a808-6625320e5845', volUUID=u'6136ee84-99b6-4a8e-9c49-4f9bd60525f1', options=None)
Thread-50163::DEBUG::2014-11-02 14:45:14,289::resourceManager::198::Storage.ResourceManager.Request::(__init__) ResName=`Storage.b473691a-e6f6-4c44-81cb-6bee354608fd`ReqID=`a12f997e-25a2-4e27-a458-92ec8ad04155`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '3133' at 'getVolumeInfo'


From engine log:
---------------------

2014-11-02 14:44:51,099 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (ajp-/127.0.0.1:8702-17) [648f30a7] Lock Acquired to object EngineLock [exclusiveLocks= key: 116b380c-458d-4aea-be5a-865d033dc60a value: VM
2014-11-02 14:44:51,157 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (ajp-/127.0.0.1:8702-17) [648f30a7] Running command: RemoveSnapshotCommand internal: false. Entities affected :  ID: 116b380c-458d-4aea-be5a-865d033dc60a Type: VMAction group MANIPULATE_VM_SNAPSHOTS with role type USER
2014-11-02 14:44:51,164 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (ajp-/127.0.0.1:8702-17) [648f30a7] Lock freed to object EngineLock [exclusiveLocks= key: 116b380c-458d-4aea-be5a-865d033dc60a value: VM
2014-11-02 14:44:51,297 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-17) Correlation ID: 648f30a7, Job ID: bf9f11c7-b662-4b39-b3b5-2b253763d4ec, Call Stack: null, Custom Event ID: -1, Message: Snapshot 's2' deletion for VM 'nfs_block' was initiated by admin.
2014-11-02 14:46:04,990 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-14) [648f30a7] All Live Merge child commands have completed, status FAILED
2014-11-02 14:46:15,326 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (DefaultQuartzScheduler_Worker-7) [648f30a7] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotCommand
2014-11-02 14:46:15,353 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-7) [648f30a7] Correlation ID: 648f30a7, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 's2' for VM 'nfs_block'.

Comment 5 Kevin Alon Goldblatt 2014-11-02 13:12:58 UTC
Provided needinfo in comment 4 with a mix of block and NFS disks

Comment 6 Kevin Alon Goldblatt 2014-11-02 15:03:29 UTC
(In reply to Adam Litke from comment #2)
> A couple of things to note.
> 
> 1. Live merge for volumes on block storage will not work until Bug 1041569
> is fixed.  I would expect everything to work fine if the test was repeated
> for a VM with disks located solely on NFS storage.  This bug is targeted for
> RHEL-7.1 and we'll be providing builds of libvirt for 7.0 to select
> customers until 7.1 is released.  We'll work to get an updated build as soon
> as Eric Blake completes development.
> 
> 2. The vdsm log attached to this bug has no evidence of any live merge
> activity whatsoever.  If multiple vdsm hosts are involved in the test,
> please be sure to post the full log from each host.
> 
> Could you please retry the test with only NFS-based disks. I ran this with only NFS-based disks and it worked fine.  Could you also
> retry this exact test and be sure to include the full vdsm logs? I reproduced this same scenario as before using nfs and block disks and add new logs. I like to
> truncate the log on the host before attempting a merge "truncate --size 0
> /var/log/vdsm/vdsm.log" and then collect the logs immediately after the test
> completes.  You can tell if the log contains the necessary info if you can
> grep for the following:
> 
> grep 'Starting merge with jobUUID' vdsm.log
> 
> And you might be able to confirm that this is a problem with unsupported
> block storage with the following grep:
> 
> grep '<allocation/> missing from backing chain' vdsm.log

Comment 7 Kevin Alon Goldblatt 2014-11-02 15:06:48 UTC
(In reply to Adam Litke from comment #2)
> A couple of things to note.
> 
> 1. Live merge for volumes on block storage will not work until Bug 1041569
> is fixed.  I would expect everything to work fine if the test was repeated
> for a VM with disks located solely on NFS storage.  This bug is targeted for
> RHEL-7.1 and we'll be providing builds of libvirt for 7.0 to select
> customers until 7.1 is released.  We'll work to get an updated build as soon
> as Eric Blake completes development.
> 
> 2. The vdsm log attached to this bug has no evidence of any live merge
> activity whatsoever.  If multiple vdsm hosts are involved in the test,
> please be sure to post the full log from each host.
> 
> Could you please retry the test with only NFS-based disks.
 I ran this with only NFS-based disks and it worked fine. 
> Could you also
 > retry this exact test and be sure to include the full vdsm logs? 
I reproduced this same scenario as before using nfs and block disks and added new logs.
> I like to
> truncate the log on the host before attempting a merge "truncate --size 0
> /var/log/vdsm/vdsm.log" and then collect the logs immediately after the test
> completes.  You can tell if the log contains the necessary info if you can
> grep for the following:
> 
> grep 'Starting merge with jobUUID' vdsm.log
> 
> And you might be able to confirm that this is a problem with unsupported
> block storage with the following grep:
> 
> grep '<allocation/> missing from backing chain' vdsm.log

Comment 8 Adam Litke 2014-11-04 18:36:09 UTC
*** Bug 1160336 has been marked as a duplicate of this bug. ***

Comment 9 Ori Gofen 2014-11-05 17:19:30 UTC
Adam, Kevin, after reviewing this bug very carefully I have found that even though the VM had disks located solely on NFS, the snapshot that was taken was a live snapshot with RAM enabled, which basically means that 2 more images were associated with that snapshot and were created on an iSCSI domain that belonged to the DC.

from engine logs we see that live merge failed due to:

"2014-11-02 14:45:14,256 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-81) [1f3e0e85] Merging of snapshot
 02849eb3-1328-4479-8a0f-f63832036a6e images b7403493-c5d5-4534-9a5a-d12e0dab4618..6035dc5f-6715-477d-a227-2cd8dde2de2f failed "

vdsm log expands on the problem:

"Thread-50161::DEBUG::2014-11-02 14:45:14,134::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.merge' in bridge with {u'topVolUUID': u'6035dc5
f-6715-477d-a227-2cd8dde2de2f', u'vmID': u'116b380c-458d-4aea-be5a-865d033dc60a', u'drive': {u'domainID': u'1670633f-ed37-485e-98e8-d9c6f8687754', u'volumeID':
 u'20303b64-bc2a-4c85-b1ff-82769e8367c3', u'poolID': u'b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd', u'imageID': u'cdeb134c-7512-4ae0-af4d-5bed7e65d7cf'}, u'bandwidth
': u'0', u'jobUUID': u'a77992d0-15e2-44df-900f-06d8c116528d', u'baseVolUUID': u'b7403493-c5d5-4534-9a5a-d12e0dab4618'}
Thread-50161::DEBUG::2014-11-02 14:45:14,167::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing 
from backing chain for block device sda
"

Vdsm is looking for the 2 images (RAM & conf images) that the engine complained it couldn't merge:

b7403493-c5d5-4534-9a5a-d12e0dab4618 and 6035dc5f-6715-477d-a227-2cd8dde2de2f

and that are "missing from backing chain"

Those images are found on another domain (not a file-based one):
name: blk1
id: 1670633f-ed37-485e-98e8-d9c6f8687754
which is iSCSI:

mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };], deviceSize=40, vendorName=DGC, pathsDictionary={sde=true, sdr=true, sdar=true, sdae=true}, lunType=ISCSI, status=null, diskId=null, diskAlias=null, storageDomainId=1670633f-ed37-485e-98e8-d9c6f8687754, storageDomainName=null]], log id: 30aadb61
2014-11-02 14:07:11,705 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-11) [749c1778] Correlation ID: 749c1778, Job ID: 34ab9217-fffe-44af-9434-3621faa991ec, Call Stack: null, Custom Event ID: -1, Message: Storage Domain blk1 was added by admin
2014-11-02 14:07:11,861 INFO  [org.ovirt.engine.core.bll.storage.AttachStorageDomainToPoolCommand] (ajp-/127.0.0.1:8702-13) [b076fb0] Lock Acquired to object EngineLock [exclusiveLocks= key: 1670633f-ed37-485e-98e8-d9c6f8687754 value: STORAGE

During the live snapshot command, vdsm never creates those 2 images on the same domain that holds the VM's data disks (there is an unclear algorithm, or a bug, that chooses the target domain; it seems random to me). In Kevin's scenario it always chose "blk1", which is iSCSI.
So what caused this problem is a libvirt failure to perform this action:

"merge: libvirt does not support volume chain monitoring.  Unable to perform live merge."

due, I guess, to the open bug on enabling the watermark for inactive block volumes.

Comment 10 Adam Litke 2014-11-25 21:29:49 UTC
We are going to pursue a workaround for 3.5 where instead of extending merge volumes on-demand, we'll extend them before starting the merge.  This will allow us to support live merge for block storage while we wait for the proper API from libvirt.
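
To make the intent concrete, here is a small sketch of the idea (an illustration only, not the actual vdsm change; the helper name, the 1 GiB headroom chunk, and using the disk's virtual size as the upper bound are my assumptions, though the cap echoes the "Do not extend volume beyond truesize" follow-up patches):

GiB = 1024 ** 3


def required_merge_size(base_alloc, top_alloc, virtual_size, chunk=1 * GiB):
    """Apparent size (bytes) to extend the merge target (base) to *before*
    the live merge starts, instead of extending on demand during the merge.

    base_alloc   -- bytes currently allocated to the base volume
    top_alloc    -- bytes allocated to the volume being merged into the base
    virtual_size -- capacity of the disk, used here as the hard upper bound
    """
    # Worst case: all data in the top volume is new to the base, plus one
    # chunk of headroom for guest writes that land while the merge runs.
    wanted = base_alloc + top_alloc + chunk
    # Never extend the volume beyond its capacity.
    return min(wanted, virtual_size)


# Example: base has 3 GiB allocated, top has 2 GiB, the disk is 10 GiB.
print("%.1f GiB" % (required_merge_size(3 * GiB, 2 * GiB, 10 * GiB) / float(GiB)))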

Comment 11 Allon Mureinik 2014-11-26 08:16:04 UTC
(In reply to Adam Litke from comment #10)
> We are going to pursue a workaround for 3.5 where instead of extending merge
> volumes on-demand, we'll extend them before starting the merge.  This will
> allow us to support live merge for block storage while we wait for the
> proper API from libvirt.
Do we have a separate BZ to track the consumption of libvirt's proper API once it's delivered?

Comment 12 Adam Litke 2014-11-26 16:41:16 UTC
Created https://bugzilla.redhat.com/show_bug.cgi?id=1168327 to track the adoption of the eventual libvirt api.

Comment 13 Julie 2014-12-02 04:58:30 UTC
Hi Adam,
   Please provide the doc text for release notes.

Cheers,
Julie

Comment 14 Andrew Burden 2014-12-02 07:28:56 UTC
Hey Adam,

Can I get you to review the doc text to ensure it's technically accurate so that this can be included in the Release Notes.

Cheers,
Andrew

Comment 15 Allon Mureinik 2014-12-02 09:12:58 UTC
Hi Andrew,

A couple of notes on the doctext:

> Previously, live merging was not possible for block-based storage. 
Live merge is a 3.5 feature. This feature gap shouldn't be visible to any normal customer (except maybe early beta customers), so it shouldn't be relevant to document.

>  is a workaround until the appropriate libvirt API is finalized.
While this is true, I don't think our documentation should be a place to point fingers "component XYZ was not ready, so component ABC had to have a workaround"

IMHO, this should merely report a known issue, something down these lines:
"""
When live merging snapshots on a block storage domain the merge target volume is pro-actively extended to accommodate active writing of the source volume. This may cause some over-extension of the target volume.
"""

Comment 16 Adam Litke 2014-12-02 16:23:36 UTC
I agree with Allon in comment #15.  Please consider changing the doctext according to his suggestion.

Comment 17 Andrew Burden 2014-12-03 02:04:33 UTC
Works for me. 

Thank you both for the feedback.

Comment 18 Kevin Alon Goldblatt 2014-12-17 12:44:42 UTC
I ran the original scenario with block and nfs devices. I was able to successfully delete snapshots. Moving to Verified.

Used the following libvirt rpms:
-----------------------------------
From http://people.redhat.com/jdenemar/libvirt/7.0/

libvirt-client-1.2.8-10.el7ost.x86_64
libvirt-daemon-driver-qemu-1.2.8-10.el7ost.x86_64
libvirt-daemon-driver-storage-1.2.8-10.el7ost.x86_64
libvirt-python-1.2.8-6.el7ost.x86_64
libvirt-daemon-1.2.8-10.el7ost.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-10.el7ost.x86_64
libvirt-daemon-driver-interface-1.2.8-10.el7ost.x86_64
libvirt-daemon-driver-secret-1.2.8-10.el7ost.x86_64
libvirt-daemon-kvm-1.2.8-10.el7ost.x86_64
libvirt-lock-sanlock-1.2.8-10.el7ost.x86_64
libvirt-daemon-driver-network-1.2.8-10.el7ost.x86_64
libvirt-daemon-driver-nodedev-1.2.8-10.el7ost.x86_64
libvirt-daemon-config-nwfilter-1.2.8-10.el7ost.x86_64

Comment 23 Kevin Alon Goldblatt 2015-03-24 21:38:27 UTC
Tested again with new libvirt and the problem recurred:
v3.5.1 vt14.1

http://download.devel.redhat.com/brewroot/packages/libvirt/1.2.8/16.el7_1.2/x86_64/


Engine log:
2015-03-24 20:11:19,631 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-8) Correlation ID: 5e7b6c9f, Job ID: d79ab6e5-0e4c-467d-899a-986df506c486, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'snap111' deletion for VM 'vm_test1' was initiated by admin@internal.


2015-03-24 20:13:14,740 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-54) [5e7b6c9f] Correlation ID: 5e7b6c9f, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'snap111' for VM 'vm_test1'.

Comment 24 Kevin Alon Goldblatt 2015-03-24 21:40:49 UTC
Created attachment 1006001 [details]
server, engine and vdsm logs

Added new logs

Comment 25 Adam Litke 2015-03-25 13:44:54 UTC
Kevin,

Thanks for reporting this issue.  The following libvirt error

libvirtError: invalid argument: invalid path /rhev/data-center/6d96f52d-d791-4f66-83bd-2553ca0f3012/3c9387ca-e89f-48c5-9dba-824639e604af/images/db96b8bc-9aac-416b-bba6-1d66599ab0d2/38e715e6-d3af-44e6-b153-7463fcb894e9 not assigned to domain

definitely seems fishy.  Unfortunately your vdsm logs don't show any of the merge-related commands that caused the situation.  My guess is that the high rate of tracebacks due to the failure in the VM sampling thread caused the vdsm logs to rotate quickly.

Please attach logs that contain messages like: 

"Starting merge with jobUUID="

and:

"Synchronizing volume chain after live merge"

Without these, I have no way to see how we got into this situation.

Comment 26 Adam Litke 2015-03-26 20:49:26 UTC
*** Bug 1205642 has been marked as a duplicate of this bug. ***

Comment 27 Adam Litke 2015-03-26 20:50:45 UTC
After looking at this bug and Bug 1155583, I have determined that they both have the same root issue.

In the vdsm log I see messages like:

Thread-50::ERROR::2015-03-26 10:53:09,314::sampling::488::vm.Vm::(collect) vmId=`b3fbc637-9cc9-4b15-ba94-5a2ee1607785`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x3133870>
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 484, in collect
    statsFunction()
  File "/usr/share/vdsm/virt/sampling.py", line 359, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 292, in _highWrite
    self._vm.extendDrivesIfNeeded()
  File "/usr/share/vdsm/virt/vm.py", line 2537, in extendDrivesIfNeeded
    extend = [x for x in self._getExtendCandidates()
  File "/usr/share/vdsm/virt/vm.py", line 2489, in _getExtendCandidates
    capacity, alloc, physical = self._dom.blockInfo(drive.path, 0)
  File "/usr/share/vdsm/virt/vm.py", line 689, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 646, in blockInfo
    if ret is None: raise libvirtError ('virDomainGetBlockInfo() failed', dom=self)
libvirtError: invalid argument: invalid path /rhev/data-center/dee92f86-2fef-4edf-ac51-135d12646bde/01f563fa-1aef-41ab-92d3-0d6561d4a731/images/31cfd267-e627-4cfe-8f33-f66b0db627c3/1863e178-c40a-42eb-803a-3d1f1ac6658a not assigned to domain

After talking to Eric Blake on IRC, it looks like another problem with libvirt where the domain XML is not updated before the virDomainBlockJobAbort API (used for pivoting an active layer merge) returns.  Therefore, our volume chain sync code gets an old value for the disk path which causes _highWrite to fail in this manner.
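
For illustration only, here is one defensive pattern a caller could use so monitoring code never consumes the stale path (this is not the vdsm fix; the helper name and polling interval are my own, it only uses public libvirt-python calls, and the libvirt-side race itself is tracked separately):

import time
import xml.etree.ElementTree as ET

import libvirt


def pivot_and_wait(dom, disk_target, old_path, timeout=10.0):
    """Pivot an active-layer merge, then poll the domain XML until the disk
    source no longer points at old_path, so callers such as a _highWrite-style
    sampler never call virDomainGetBlockInfo() with a path that is no longer
    assigned to the domain."""
    dom.blockJobAbort(disk_target, libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
    deadline = time.time() + timeout
    while time.time() < deadline:
        tree = ET.fromstring(dom.XMLDesc(0))
        for disk in tree.findall("./devices/disk"):
            target = disk.find("target")
            if target is None or target.get("dev") != disk_target:
                continue
            source = disk.find("source")
            path = None
            if source is not None:
                path = source.get("dev") or source.get("file")
            if path and path != old_path:
                return path  # XML reflects the pivot; safe to monitor again
        time.sleep(0.5)
    raise RuntimeError("domain XML still reports the pre-pivot disk path")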

Comment 28 Adam Litke 2015-03-26 21:04:54 UTC
I don't want to repurpose this bug for a new issue so I created Bug 1206355 to track the libvirt race condition.

Comment 30 Allon Mureinik 2015-04-14 11:52:30 UTC
Yaniv/Aharon - resetting the component reset the BZ's flags.
Please re-grant your acks.
Thanks!

Comment 31 Kevin Alon Goldblatt 2015-04-28 14:51:39 UTC
Verified with version:
-----------------------
v3.6
ovirt-engine-3.6.0-0.0.master.20150412172306.git55ba764.el6.noarch
vdsm-4.17.0-632.git19a83a2.el7.x86_64


Verified using scenario:
------------------------
Steps to Reproduce:
1. Create VM with 5 disks (block preallocated, block thin provisioned, nfs preallocated and nfs thin provisioned)
2.Created snapshot 1 while VM is down
3.Created snapshot 2 while VM is down
4.Created snapshot 3 while VM is down
5.Start VM
6.Delete snapshot 2 - successful - Verifies this bz

In addition ran:
----------------
7. Delete snapshot 3 - successful
8. Delete snapshot 1 - successful

Moving to verified!

Comment 34 errata-xmlrpc 2016-03-09 19:26:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html

