Created attachment 949378 [details]
engine vdsm and server logs

Description of problem:
Live Merge failed during delete snapshot operation

Version-Release number of selected component (if applicable):
rhevm-3.5.0-0.14.beta.el6ev.noarch
vdsm-4.16.7-1.gitdb83943.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a VM with 5 disks (block preallocated, block thin provisioned, NFS preallocated and NFS thin provisioned)
2. Create snapshot 1 while the VM is down
3. Create snapshot 2 while the VM is down
4. Create snapshot 3 while the VM is down
5. Start the VM
6. Delete snapshot 2 >>>> The operation fails with a failed live merge

Actual results:
The snapshot was not deleted - Live Merge failed

Expected results:
The snapshot should be deleted and Live Merge should pass

Additional info:
2014-10-22 12:50:43,673 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-28) VM job 58d1af1e-c92a-4e20-a55b-f04dd4b32515: In progress (no change)
2014-10-22 12:50:43,673 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-28) VM job 1c772a93-8015-44f1-b715-aa4561a66202: In progress (no change)
2014-10-22 12:50:46,732 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-61) VM job 1c772a93-8015-44f1-b715-aa4561a66202: In progress, updating
2014-10-22 12:50:46,733 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-61) VM job 58d1af1e-c92a-4e20-a55b-f04dd4b32515: Deleting
2014-10-22 12:50:48,488 INFO [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-42) [3d71a8a0] Waiting on Live Merge command step MERGE to complete
2014-10-22 12:50:48,512 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-42) [227b9015] Failed child command status for step MERGE
2014-10-22 12:50:48,517 INFO [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-42) [227b9015] Waiting on Live Merge child commands to complete
2014-10-22 12:50:48,530 INFO [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-42) [7579da70] Waiting on Live Merge command step MERGE to complete
2014-10-22 12:50:48,554 INFO [org.ovirt.engine.core.bll.MergeCommandCallback] (DefaultQuartzScheduler_Worker-42) [27a4ff17] Merge command has completed for images bf122b09-bc4a-4c24-9a30-d7d6360f5b7f..13dfe8f2-d5d6-470a-b60d-fb02d7be52de
2014-10-22 12:50:48,559 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-42) [5df0af11] Failed child command status for step MERGE
2014-10-22 12:50:48,571 INFO [org.ovirt.engine.core.bll.MergeCommandCallback] (DefaultQuartzScheduler_Worker-42) [2ee537d4] Waiting on merge command to complete
2014-10-22 12:50:48,577 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-42) [186c164f] Failed child command status for step MERGE
2014-10-22 12:50:49,834 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-84) VM job 1c772a93-8015-44f1-b715-aa4561a66202: In progress (no change)
2014-10-22 12:50:52,858 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-86) VM job 1c772a93-8015-44f1-b715-aa4561a66202: In progress (no change)
2014-10-22 12:50:55,948 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-97) VM job 1c772a93-8015-44f1-b715-aa4561a66202: In progress (no change)
2014-10-22 12:50:58,591 INFO [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-69) [3d71a8a0] Executing Live Merge command step MERGE_STATUS
2014-10-22 12:50:58,634 INFO [org.ovirt.engine.core.bll.MergeStatusCommand] (pool-6-thread-2) [c47c56c] Running command: MergeStatusCommand internal: true. Entities affected : ID: 01e68b31-0e5a-4f76-9416-11b3e0cc9182 Type: Storage
2014-10-22 12:50:58,689 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-69) [227b9015] Merging of snapshot d390fbdd-b7a8-4587-a95f-6289b70d6d8b images b55971ae-544b-4c8f-9656-306a2c06ba52..397d10f6-56f4-45fa-83b6-ec5976246a9a failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2014-10-22 12:50:58,691 INFO [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-69) [227b9015] Waiting on Live Merge child commands to complete
2014-10-22 12:50:58,696 INFO [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-69) [7579da70] Waiting on Live Merge command step MERGE to complete
2014-10-22 12:50:58,710 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-69) [5df0af11] Merging of snapshot d390fbdd-b7a8-4587-a95f-6289b70d6d8b images 6e832976-0217-477f-a2d1-b34991e0f702..5d5499e8-3784-40dc-93a9-a08c3c66c2cb failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2014-10-22 12:50:58,716 INFO [org.ovirt.engine.core.bll.MergeCommandCallback] (DefaultQuartzScheduler_Worker-69) [2ee537d4] Waiting on merge command to complete
2014-10-22 12:50:58,729 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-69) [186c164f] Merging of snapshot d390fbdd-b7a8-4587-a95f-6289b70d6d8b images 35fb1276-44ca-4f0f-8b15-83a2cb3b28e2..4f0ceabd-d5e9-468d-9327-1a51fc67c571 failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2014-10-22 12:50:58,777 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVdsCommand] (pool-6-thread-2) [c47c56c] START, FullListVdsCommand(HostName = nott-vds1, HostId = 36bc5b33-214c-48d0-828e-f0fb7f903597, vds=Host[nott-vds1,36bc5b33-214c-48d0-828e-f0fb7f903597], vmIds=[b7c8f32b-5ad9-42e3-9616-ee929fb90af8]), log id: 6fdfd5f3
I used the libvirt rpms supplied by Eric Blake at http://people.redhat.com/jdenemar/libvirt/7.0/ on a vdsm host with RHEL 7.0 and v3.5 vt5.
A couple of things to note.

1. Live merge for volumes on block storage will not work until Bug 1041569 is fixed. I would expect everything to work fine if the test was repeated for a VM with disks located solely on NFS storage. That bug is targeted for RHEL-7.1, and we'll be providing builds of libvirt for 7.0 to select customers until 7.1 is released. We'll work to get an updated build as soon as Eric Blake completes development.

2. The vdsm log attached to this bug has no evidence of any live merge activity whatsoever. If multiple vdsm hosts are involved in the test, please be sure to post the full log from each host.

Could you please retry the test with only NFS-based disks? Could you also retry this exact test and be sure to include the full vdsm logs? I like to truncate the log on the host before attempting a merge:

  truncate --size 0 /var/log/vdsm/vdsm.log

and then collect the logs immediately after the test completes. You can tell if the log contains the necessary info if you can grep for the following:

  grep 'Starting merge with jobUUID' vdsm.log

And you might be able to confirm that this is a problem with unsupported block storage with the following grep:

  grep '<allocation/> missing from backing chain' vdsm.log
Created attachment 952903 [details]
NEW engine vdsm and server logs

Reproduced the same scenario and added new logs

From vdsm log:
-----------------------------------
grep 'Starting merge with jobUUID' vdsm.log
Thread-50163::INFO::2014-11-02 14:45:14,413::vm::5741::vm.Vm::(merge) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::Starting merge with jobUUID='16b56816-2508-4670-8f47-0b9b1ce1115f'
Thread-50164::INFO::2014-11-02 14:45:14,482::vm::5741::vm.Vm::(merge) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::Starting merge with jobUUID='64942a37-a3c2-4dd5-8a83-e251e58e6316'

grep '<allocation/> missing from backing chain' /var/log/vdsm/vdsm.log
-------------------------------------------------------------------------
Thread-50161::DEBUG::2014-11-02 14:45:14,167::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-50162::DEBUG::2014-11-02 14:45:14,248::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:15,627::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:15,636::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:17,667::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:17,676::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:19,705::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:19,715::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:21,744::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:21,756::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:23,787::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:23,797::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:25,827::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:25,837::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:27,867::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:27,877::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:29,906::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:29,973::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:32,014::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:32,022::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:34,053::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:34,061::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:36,092::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:36,100::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:38,131::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:38,146::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:40,186::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:40,194::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:42,227::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:42,234::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-48595::DEBUG::2014-11-02 14:45:44,266::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-48595::DEBUG::2014-11-02 14:45:44,274::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb

More from vdsm.log
--------------------------
Thread-50160::DEBUG::2014-11-02 14:45:12,897::stompReactor::163::yajsonrpc.StompServer::(send) Sending response
Dummy-262::DEBUG::2014-11-02 14:45:13,814::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) dd if=/rhev/data-center/b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
Dummy-262::DEBUG::2014-11-02 14:45:13,836::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0147028 s, 69.6 MB/s\n'; <rc> = 0
JsonRpc (StompReactor)::DEBUG::2014-11-02 14:45:14,133::stompReactor::98::Broker.StompAdapter::(handle_frame) Handling message <StompFrame command='SEND'>
JsonRpcServer::DEBUG::2014-11-02 14:45:14,134::__init__::504::jsonrpc.JsonRpcServer::(serve_requests) Waiting for request
Thread-50161::DEBUG::2014-11-02 14:45:14,134::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.merge' in bridge with {u'topVolUUID': u'6035dc5f-6715-477d-a227-2cd8dde2de2f', u'vmID': u'116b380c-458d-4aea-be5a-865d033dc60a', u'drive': {u'domainID': u'1670633f-ed37-485e-98e8-d9c6f8687754', u'volumeID': u'20303b64-bc2a-4c85-b1ff-82769e8367c3', u'poolID': u'b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd', u'imageID': u'cdeb134c-7512-4ae0-af4d-5bed7e65d7cf'}, u'bandwidth': u'0', u'jobUUID': u'a77992d0-15e2-44df-900f-06d8c116528d', u'baseVolUUID': u'b7403493-c5d5-4534-9a5a-d12e0dab4618'}
Thread-50161::DEBUG::2014-11-02 14:45:14,167::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda
Thread-50161::ERROR::2014-11-02 14:45:14,167::vm::5690::vm.Vm::(merge) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::merge: libvirt does not support volume chain monitoring. Unable to perform live merge.
Thread-50161::DEBUG::2014-11-02 14:45:14,168::stompReactor::163::yajsonrpc.StompServer::(send) Sending response
JsonRpc (StompReactor)::DEBUG::2014-11-02 14:45:14,200::stompReactor::98::Broker.StompAdapter::(handle_frame) Handling message <StompFrame command='SEND'>
JsonRpcServer::DEBUG::2014-11-02 14:45:14,201::__init__::504::jsonrpc.JsonRpcServer::(serve_requests) Waiting for request
Thread-50162::DEBUG::2014-11-02 14:45:14,201::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.merge' in bridge with {u'topVolUUID': u'b6074af3-510f-4240-8673-42b99307f40d', u'vmID': u'116b380c-458d-4aea-be5a-865d033dc60a', u'drive': {u'domainID': u'1670633f-ed37-485e-98e8-d9c6f8687754', u'volumeID': u'b814ba88-503b-4a77-956d-051c08a5906a', u'poolID': u'b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd', u'imageID': u'23bc016e-9afa-42ff-b855-b8f7e2de2a60'}, u'bandwidth': u'0', u'jobUUID': u'b7c2f1c2-bcaa-40e1-b5f2-ce19e5853fd6', u'baseVolUUID': u'fa86d61e-6115-4bb5-9ba1-546df45c897e'}
Thread-50162::DEBUG::2014-11-02 14:45:14,248::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sdb
Thread-50162::ERROR::2014-11-02 14:45:14,248::vm::5690::vm.Vm::(merge) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::merge: libvirt does not support volume chain monitoring. Unable to perform live merge.
Thread-50162::DEBUG::2014-11-02 14:45:14,249::stompReactor::163::yajsonrpc.StompServer::(send) Sending response
JsonRpc (StompReactor)::DEBUG::2014-11-02 14:45:14,255::stompReactor::98::Broker.StompAdapter::(handle_frame) Handling message <StompFrame command='SEND'>
JsonRpcServer::DEBUG::2014-11-02 14:45:14,256::__init__::504::jsonrpc.JsonRpcServer::(serve_requests) Waiting for request
Thread-50163::DEBUG::2014-11-02 14:45:14,256::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.merge' in bridge with {u'topVolUUID': u'708ce999-4b00-4d2e-97e8-237f15b7581e', u'vmID': u'116b380c-458d-4aea-be5a-865d033dc60a', u'drive': {u'domainID': u'b473691a-e6f6-4c44-81cb-6bee354608fd', u'volumeID': u'269c518b-5a9e-4264-9703-ed5c3216ef82', u'poolID': u'b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd', u'imageID': u'1c018658-d26b-4af5-a808-6625320e5845'}, u'bandwidth': u'0', u'jobUUID': u'16b56816-2508-4670-8f47-0b9b1ce1115f', u'baseVolUUID': u'6136ee84-99b6-4a8e-9c49-4f9bd60525f1'}
Thread-50163::DEBUG::2014-11-02 14:45:14,288::task::595::Storage.TaskManager.Task::(_updateState) Task=`94b6bf28-58b8-480e-9258-501c4aeb184c`::moving from state init -> state preparing
Thread-50163::INFO::2014-11-02 14:45:14,289::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeInfo(sdUUID=u'b473691a-e6f6-4c44-81cb-6bee354608fd', spUUID=u'b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd', imgUUID=u'1c018658-d26b-4af5-a808-6625320e5845', volUUID=u'6136ee84-99b6-4a8e-9c49-4f9bd60525f1', options=None)
Thread-50163::DEBUG::2014-11-02 14:45:14,289::resourceManager::198::Storage.ResourceManager.Request::(__init__) ResName=`Storage.b473691a-e6f6-4c44-81cb-6bee354608fd`ReqID=`a12f997e-25a2-4e27-a458-92ec8ad04155`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '3133' at 'getVolumeInfo'

From engine log:
---------------------
2014-11-02 14:44:51,099 INFO [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (ajp-/127.0.0.1:8702-17) [648f30a7] Lock Acquired to object EngineLock [exclusiveLocks= key: 116b380c-458d-4aea-be5a-865d033dc60a value: VM
2014-11-02 14:44:51,157 INFO [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (ajp-/127.0.0.1:8702-17) [648f30a7] Running command: RemoveSnapshotCommand internal: false. Entities affected : ID: 116b380c-458d-4aea-be5a-865d033dc60a Type: VMAction group MANIPULATE_VM_SNAPSHOTS with role type USER
2014-11-02 14:44:51,164 INFO [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (ajp-/127.0.0.1:8702-17) [648f30a7] Lock freed to object EngineLock [exclusiveLocks= key: 116b380c-458d-4aea-be5a-865d033dc60a value: VM
2014-11-02 14:44:51,297 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-17) Correlation ID: 648f30a7, Job ID: bf9f11c7-b662-4b39-b3b5-2b253763d4ec, Call Stack: null, Custom Event ID: -1, Message: Snapshot 's2' deletion for VM 'nfs_block' was initiated by admin.
2014-11-02 14:46:04,990 INFO [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-14) [648f30a7] All Live Merge child commands have completed, status FAILED
2014-11-02 14:46:15,326 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (DefaultQuartzScheduler_Worker-7) [648f30a7] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotCommand
2014-11-02 14:46:15,353 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-7) [648f30a7] Correlation ID: 648f30a7, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 's2' for VM 'nfs_block'.
Provided the needed info in comment 4 with a mix of block and NFS disks.
(In reply to Adam Litke from comment #2)
> A couple of things to note.
>
> 1. Live merge for volumes on block storage will not work until Bug 1041569
> is fixed. I would expect everything to work fine if the test was repeated
> for a VM with disks located solely on NFS storage. This bug is targeted for
> RHEL-7.1 and we'll be providing builds of libvirt for 7.0 to select
> customers until 7.1 is released. We'll work to get an updated build as soon
> as Eric Blake completes development.
>
> 2. The vdsm log attached to this bug has no evidence of any live merge
> activity whatsoever. If multiple vdsm hosts are involved in the test,
> please be sure to post the full log from each host.
>
> Could you please retry the test with only NFS-based disks.

I ran this with only NFS-based disks and it worked fine.

> Could you also retry this exact test and be sure to include the full vdsm logs?

I reproduced the same scenario as before using NFS and block disks and added new logs.

> I like to truncate the log on the host before attempting a merge "truncate --size 0
> /var/log/vdsm/vdsm.log" and then collect the logs immediately after the test
> completes. You can tell if the log contains the necessary info if you can
> grep for the following:
>
> grep 'Starting merge with jobUUID' vdsm.log
>
> And you might be able to confirm that this is a problem with unsupported
> block storage with the following grep:
>
> grep '<allocation/> missing from backing chain' vdsm.log
*** Bug 1160336 has been marked as a duplicate of this bug. ***
Adam, Kevin, after reviewing this bug very carefully I have found that even though the VM had disks located solely on NFS, the snapshot that was taken was a live snapshot with RAM enabled, which basically means that 2 more images were associated with that snapshot and were created on an iSCSI domain that belonged to the DC.

From the engine log we see that live merge failed due to:

"2014-11-02 14:45:14,256 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-81) [1f3e0e85] Merging of snapshot 02849eb3-1328-4479-8a0f-f63832036a6e images b7403493-c5d5-4534-9a5a-d12e0dab4618..6035dc5f-6715-477d-a227-2cd8dde2de2f failed"

The vdsm log expands on the problem:

"Thread-50161::DEBUG::2014-11-02 14:45:14,134::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.merge' in bridge with {u'topVolUUID': u'6035dc5f-6715-477d-a227-2cd8dde2de2f', u'vmID': u'116b380c-458d-4aea-be5a-865d033dc60a', u'drive': {u'domainID': u'1670633f-ed37-485e-98e8-d9c6f8687754', u'volumeID': u'20303b64-bc2a-4c85-b1ff-82769e8367c3', u'poolID': u'b2da04d8-13d3-4a4c-9e91-e6ee493c4fdd', u'imageID': u'cdeb134c-7512-4ae0-af4d-5bed7e65d7cf'}, u'bandwidth': u'0', u'jobUUID': u'a77992d0-15e2-44df-900f-06d8c116528d', u'baseVolUUID': u'b7403493-c5d5-4534-9a5a-d12e0dab4618'}
Thread-50161::DEBUG::2014-11-02 14:45:14,167::vm::5790::vm.Vm::(_diskXMLGetVolumeChainInfo) vmId=`116b380c-458d-4aea-be5a-865d033dc60a`::<allocation/> missing from backing chain for block device sda"

Vdsm is looking for the 2 images (RAM & conf images) that the engine complained it couldn't merge, b7403493-c5d5-4534-9a5a-d12e0dab4618 and 6035dc5f-6715-477d-a227-2cd8dde2de2f, and that are "missing from backing chain". Those images are found on another domain (not a file one):

name: blk1
id: 1670633f-ed37-485e-98e8-d9c6f8687754

which is iSCSI:

mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };], deviceSize=40, vendorName=DGC, pathsDictionary={sde=true, sdr=true, sdar=true, sdae=true}, lunType=ISCSI, status=null, diskId=null, diskAlias=null, storageDomainId=1670633f-ed37-485e-98e8-d9c6f8687754, storageDomainName=null]], log id: 30aadb61
2014-11-02 14:07:11,705 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-11) [749c1778] Correlation ID: 749c1778, Job ID: 34ab9217-fffe-44af-9434-3621faa991ec, Call Stack: null, Custom Event ID: -1, Message: Storage Domain blk1 was added by admin
2014-11-02 14:07:11,861 INFO [org.ovirt.engine.core.bll.storage.AttachStorageDomainToPoolCommand] (ajp-/127.0.0.1:8702-13) [b076fb0] Lock Acquired to object EngineLock [exclusiveLocks= key: 1670633f-ed37-485e-98e8-d9c6f8687754 value: STORAGE

During the live snapshot command, vdsm never creates those 2 images on the same domain that holds the VM's data disks (there's an unclear algorithm | bug which chooses a target domain; it seems random to me). In Kevin's scenario it always chose "blk1", which is iSCSI. So what caused this problem is a libvirt failure to perform this action:

"merge: libvirt does not support volume chain monitoring. Unable to perform live merge."

due, I guess, to the open bug on enabling the watermark on inactive block volumes.
We are going to pursue a workaround for 3.5 where instead of extending merge volumes on-demand, we'll extend them before starting the merge. This will allow us to support live merge for block storage while we wait for the proper API from libvirt.
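To illustrate the workaround described above, here is a minimal, hypothetical sketch of the pre-extend sizing decision. This is not vdsm's actual code; the function name and the sizing rule (grow the base volume by the top volume's current allocation, capped at the image's virtual size) are assumptions made for illustration only.

```python
def preextend_size(base_apparent_gb, top_allocated_gb, virtual_gb):
    """Pick a new size for the merge base volume before the merge starts.

    Hypothetical sizing rule: assume every allocated block of the top
    (source) volume may need to land in the base (target) during the
    merge, so grow the base by that much up front, but never past the
    image's virtual size. This replaces on-demand extension, which
    would require libvirt allocation watermarks on inactive volumes.
    """
    wanted = base_apparent_gb + top_allocated_gb
    return min(wanted, virtual_gb)

# A 10G thin base with 4G of data in the snapshot being merged down,
# on a 20G virtual disk, would be extended to 14G before the merge:
print(preextend_size(10, 4, 20))  # -> 14
```

Note the trade-off this implies: the target can end up over-extended, since not all of the top volume's data necessarily has to be copied down.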
(In reply to Adam Litke from comment #10)
> We are going to pursue a workaround for 3.5 where instead of extending merge
> volumes on-demand, we'll extend them before starting the merge. This will
> allow us to support live merge for block storage while we wait for the
> proper API from libvirt.

Do we have a separate BZ to track the consumption of libvirt's proper API once it's delivered?
Created https://bugzilla.redhat.com/show_bug.cgi?id=1168327 to track the adoption of the eventual libvirt api.
Hi Adam,

Please provide the doc text for release notes.

Cheers,
Julie
Hey Adam,

Can I get you to review the doc text to ensure it's technically accurate, so that it can be included in the Release Notes?

Cheers,
Andrew
Hi Andrew,

A couple of notes on the doctext:

> Previously, live merging was not possible for block-based storage.

Live merge is a 3.5 feature. This feature gap shouldn't be visible to any normal customer (except maybe early beta customers), so it shouldn't be relevant to document.

> is a workaround until the appropriate libvirt API is finalized.

While this is true, I don't think our documentation should be a place to point fingers ("component XYZ was not ready, so component ABC had to have a workaround").

IMHO, this should merely report a known issue, something along these lines:

"""
When live merging snapshots on a block storage domain, the merge target volume is proactively extended to accommodate active writing of the source volume. This may cause some over-extension of the target volume.
"""
I agree with Allon in comment #15. Please consider changing the doctext according to his suggestion.
Works for me. Thank you both for the feedback.
I ran the original scenario with block and NFS devices. I was able to successfully delete snapshots. Moving to Verified.

Used the following libvirt rpms:
-----------------------------------
From http://people.redhat.com/jdenemar/libvirt/7.0/
libvirt-client-1.2.8-10.el7ost.x86_64
libvirt-daemon-driver-qemu-1.2.8-10.el7ost.x86_64
libvirt-daemon-driver-storage-1.2.8-10.el7ost.x86_64
libvirt-python-1.2.8-6.el7ost.x86_64
libvirt-daemon-1.2.8-10.el7ost.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-10.el7ost.x86_64
libvirt-daemon-driver-interface-1.2.8-10.el7ost.x86_64
libvirt-daemon-driver-secret-1.2.8-10.el7ost.x86_64
libvirt-daemon-kvm-1.2.8-10.el7ost.x86_64
libvirt-lock-sanlock-1.2.8-10.el7ost.x86_64
libvirt-daemon-driver-network-1.2.8-10.el7ost.x86_64
libvirt-daemon-driver-nodedev-1.2.8-10.el7ost.x86_64
libvirt-daemon-config-nwfilter-1.2.8-10.el7ost.x86_64
Tested again with new libvirt and the problem reoccurred:

v3.5.1 vt14.1
http://download.devel.redhat.com/brewroot/packages/libvirt/1.2.8/16.el7_1.2/x86_64/

Engine log:
2015-03-24 20:11:19,631 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-8) Correlation ID: 5e7b6c9f, Job ID: d79ab6e5-0e4c-467d-899a-986df506c486, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'snap111' deletion for VM 'vm_test1' was initiated by admin@internal.
2015-03-24 20:13:14,740 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-54) [5e7b6c9f] Correlation ID: 5e7b6c9f, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'snap111' for VM 'vm_test1'.
Created attachment 1006001 [details]
server, engine and vdsm logs

Added new logs
Kevin,

Thanks for reporting this issue. The following libvirt error:

libvirtError: invalid argument: invalid path /rhev/data-center/6d96f52d-d791-4f66-83bd-2553ca0f3012/3c9387ca-e89f-48c5-9dba-824639e604af/images/db96b8bc-9aac-416b-bba6-1d66599ab0d2/38e715e6-d3af-44e6-b153-7463fcb894e9 not assigned to domain

definitely seems fishy. Unfortunately your vdsm logs don't show any of the merge-related commands that caused the situation. My guess is that the high rate of tracebacks due to the failure in the VM sampling thread caused the vdsm logs to rotate quickly.

Please attach logs that contain messages like:

"Starting merge with jobUUID="

and:

"Synchronizing volume chain after live merge"

Without these, I have no way to see how we got into this situation.
*** Bug 1205642 has been marked as a duplicate of this bug. ***
After looking at this bug and Bug 1155583, I have determined that they both have the same root issue. In the vdsm log I see messages like:

Thread-50::ERROR::2015-03-26 10:53:09,314::sampling::488::vm.Vm::(collect) vmId=`b3fbc637-9cc9-4b15-ba94-5a2ee1607785`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x3133870>
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 484, in collect
    statsFunction()
  File "/usr/share/vdsm/virt/sampling.py", line 359, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 292, in _highWrite
    self._vm.extendDrivesIfNeeded()
  File "/usr/share/vdsm/virt/vm.py", line 2537, in extendDrivesIfNeeded
    extend = [x for x in self._getExtendCandidates()
  File "/usr/share/vdsm/virt/vm.py", line 2489, in _getExtendCandidates
    capacity, alloc, physical = self._dom.blockInfo(drive.path, 0)
  File "/usr/share/vdsm/virt/vm.py", line 689, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 646, in blockInfo
    if ret is None: raise libvirtError ('virDomainGetBlockInfo() failed', dom=self)
libvirtError: invalid argument: invalid path /rhev/data-center/dee92f86-2fef-4edf-ac51-135d12646bde/01f563fa-1aef-41ab-92d3-0d6561d4a731/images/31cfd267-e627-4cfe-8f33-f66b0db627c3/1863e178-c40a-42eb-803a-3d1f1ac6658a not assigned to domain

After talking to Eric Blake on IRC, it looks like another problem with libvirt where the domain XML is not updated before the virDomainBlockJobAbort API (used for pivoting an active layer merge) returns. Therefore, our volume chain sync code gets an old value for the disk path, which causes _highWrite to fail in this manner.
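The shape of that race can be shown with a small, self-contained sketch. This is not vdsm's code and does not use the real libvirt bindings; the class and function names are made up, and a fake domain object stands in for libvirt. It only illustrates the idea of tolerating a briefly-stale disk path by re-reading the current path and retrying, rather than letting the sampling call fail.

```python
class StalePathError(Exception):
    """Stands in for libvirtError('invalid path ... not assigned to domain')."""


def block_info_with_refresh(dom, drive):
    """Query block info, refreshing drive['path'] once if it went stale.

    After a pivot (virDomainBlockJobAbort), the path cached by the caller
    may no longer match the domain XML; re-read it and retry once.
    """
    try:
        return dom.block_info(drive["path"])
    except StalePathError:
        drive["path"] = dom.current_path()  # re-read from domain XML
        return dom.block_info(drive["path"])


class FakeDomain:
    """Tiny fake reproducing the race: the old path is rejected,
    the refreshed path succeeds."""

    def __init__(self):
        self.path = "/rhev/.../new-top-volume"

    def current_path(self):
        return self.path

    def block_info(self, path):
        if path != self.path:
            raise StalePathError(path)
        return (20, 4, 4)  # capacity, allocation, physical (made-up numbers)


drive = {"path": "/rhev/.../old-merged-volume"}
print(block_info_with_refresh(FakeDomain(), drive))  # -> (20, 4, 4)
```

In the real fix this retry is unnecessary once libvirt updates the domain XML before virDomainBlockJobAbort returns; the sketch only shows why a stale path surfaces as "invalid path ... not assigned to domain" in _highWrite.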
I don't want to repurpose this bug for a new issue so I created Bug 1206355 to track the libvirt race condition.
Yaniv/Aharon - resetting the component reset the BZ's flags. Please re-grant your acks. Thanks!
Verified with version:
-----------------------
v3.6
ovirt-engine-3.6.0-0.0.master.20150412172306.git55ba764.el6.noarch
vdsm-4.17.0-632.git19a83a2.el7.x86_64

Verified using scenario:
------------------------
Steps to Reproduce:
1. Create a VM with 5 disks (block preallocated, block thin provisioned, NFS preallocated and NFS thin provisioned)
2. Create snapshot 1 while the VM is down
3. Create snapshot 2 while the VM is down
4. Create snapshot 3 while the VM is down
5. Start the VM
6. Delete snapshot 2 - successful - verifies this bz

In addition ran:
----------------
7. Delete snapshot 3 - successful
8. Delete snapshot 1 - successful

Moving to verified!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0362.html