Bug 1232481
Summary: VDSM: Live merge fails after a disk containing a snapshot has been extended

Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 3.5.1
Target Milestone: ovirt-3.6.1
Target Release: 3.6.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: v4.17.2
Doc Type: Bug Fix
Reporter: Bimal Chollera <bcholler>
Assignee: Adam Litke <alitke>
QA Contact: Aharon Canan <acanan>
CC: acanan, alitke, amureini, bazulay, ebenahar, fdelorey, gveitmic, gwatson, kgoldbla, lpeer, lsurette, mkalinin, ratamir, tnisan, v.astafiev, yeylon, ykaul, ylavi
Keywords: Automation, ZStream
Cloned to: 1236758, 1251958 (view as bug list)
Last Closed: 2016-03-09 19:41:24 UTC
Type: Bug
oVirt Team: Storage
Bug Depends On: 1274323, 1291569, 1292129
Bug Blocks: 1251958
Description (Bimal Chollera, 2015-06-16 20:53:14 UTC)
Adam, can you take a look please?

I've tested 4 scenarios:

1. Block disk, merging into smaller cow base: merge succeeds, but the engine still shows the smaller disk size originally associated with the base volume.
2. Block disk, merging into smaller raw base: merge fails to start with a libvirt error (failed to resize base).
3. File disk, merging into smaller cow base: merge succeeds, but the engine still shows the smaller disk size originally associated with the base volume.
4. File disk, merging into smaller raw base: merge succeeds, but the engine still shows the smaller disk size originally associated with the base volume.

This matches what is being reported. I think we have two bugs here:

1. If the base is a raw block volume, we must extend it to the virtual size of top.
2. When the engine completes a live merge, if top was larger than base, the engine must update the size of base to match the size of top.

I think the solution to this problem is as follows:

1. In vdsm, we need to detect when we are merging into an undersized raw block volume and prevent it with a new error message.
2. Before submitting the merge command to vdsm, the engine should check if the base volume is RAW and if the top volume is larger. In that case, it should use the SPM to request an extension of the RAW volume to the size of the top volume.

This is a really severe bug; blocking the operation is a good workaround, but we need a fix from qemu/libvirt. Can you please clone it to their tracker and make them aware of this issue?

(In reply to Yaniv Dary from comment #12)
> This is a really severe bug; blocking the operation is a good workaround,
> but we need a fix from qemu/libvirt. Can you please clone it to their
> tracker and make them aware of this issue?

Why do you think the so-called real fix isn't in vdsm?

(In reply to Allon Mureinik from comment #13)
> Why do you think the so-called real fix isn't in vdsm?

Since the offered solution in comment #11 blocks the flow rather than resolving the issue, and since live merge is a general use case not specific to RHEV, I think the fix needs to come from the component doing the merge, but I could be mistaken. Anyway, this is a serious limitation of the feature, and I expect the fix to extend the base as suggested in comment #10.

Bimal, thank you for raising the priority. I think the outcome can be quite unpleasant and would require the customer to open a ticket, with a manual fix involved. If the full fix is unavailable by the next release, let's at least block this operation and notify the user to perform it offline, which, I understand, succeeds.

(In reply to Yaniv Dary from comment #14)
> Since the offered solution in comment #11 blocks the flow rather than
> resolving the issue, and since live merge is a general use case not
> specific to RHEV, I think the fix needs to come from the component doing
> the merge, but I could be mistaken.

oVirt Engine should orchestrate this by explicitly extending the disk before merging if the two volumes have different logical sizes. libvirt/qemu cannot do this; it is a storage management task. Bug 1236758 handles the extension (solution POSTED, pending reviews), and the patch for this bug will serve as a gatekeeper to prevent redundant and potentially dangerous operations where we know they may fail.
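The two-sided fix described above (a vdsm-side guard plus an engine-side extension before the merge is submitted) can be sketched roughly as follows. This is an illustrative sketch only, written in Python for both sides even though the real engine is Java; `VolumeInfo`, `check_merge`, and `prepare_live_merge` are hypothetical names, not actual vdsm or oVirt Engine APIs.

```python
RAW = "raw"
COW = "cow"


class UndersizedRawBaseError(Exception):
    """Raised up front instead of letting qemu's block-commit fail mid-merge."""


class VolumeInfo(object):
    def __init__(self, fmt, capacity, on_block_storage):
        self.format = fmt                   # "raw" or "cow"
        self.capacity = capacity            # virtual size in bytes
        self.on_block_storage = on_block_storage


def check_merge(base, top):
    """vdsm-side guard: refuse to merge into a raw block base volume
    that is smaller than the top volume, since qemu cannot resize it."""
    if (base.format == RAW and base.on_block_storage
            and base.capacity < top.capacity):
        raise UndersizedRawBaseError(
            "base (%d bytes) is smaller than top (%d bytes); "
            "extend the base before merging"
            % (base.capacity, top.capacity))


def prepare_live_merge(base, top, extend_via_spm):
    """Engine-side orchestration: before submitting the merge, extend a
    raw base to the top volume's virtual size through the SPM.
    extend_via_spm is a placeholder for the SPM volume-extension call."""
    if base.format == RAW and base.capacity < top.capacity:
        extend_via_spm(base, top.capacity)
        base.capacity = top.capacity
    # After extension (or when none was needed) the vdsm guard passes.
    check_merge(base, top)
```

With this split, an updated engine never triggers the guard, while a new vdsm paired with an older engine fails fast with a clear error instead of leaving images marked illegal, matching the behavior described in this comment.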
Reducing severity: the engine orchestration (in bug 1236758 and its z-stream clone) should prevent this issue from happening if you have an updated engine. So this fix should only matter when running a new VDSM against an older engine that doesn't contain the relevant fix, and even there all it would do is fail the merge command faster and with a nicer error.

Tested using the following versions:
---------------------------------------
rhevm-3.6.0-0.11.master.el6.noarch - running on RHEL 6.7
vdsm-4.17.2-1.el7ev.noarch - running on RHEL 7.2

Tested using the following steps:

Scenario 1:
-----------
1. Create a VM with a 1 GB preallocated disk (probably thin as well, didn't test that) in a block-based domain.
2. Create a snapshot.
3. Extend the disk to 2 GB.
4. Start the VM.
5. Delete the snapshot - live merge works well!

Scenario 2:
-----------
1) Create a VM with a 1 GB preallocated disk in a block-based domain.
2) Create a snapshot.
3) Create a second snapshot.
4) Extend the disk to 2 GB.
5) Start the VM.
6) Delete the second snapshot - live merge works well!

Moving to VERIFIED!
Still occurs.

Version-Release number of selected component (if applicable):
rhevm-3.6.0.1-0.1.el6.noarch
vdsm-4.17.9-1.el7ev.noarch

From engine.log:

2015-10-15 17:55:09,795 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-7) [5372d9c] Failed in 'MergeVDS' method
2015-10-15 17:55:09,799 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-6-thread-7) [5372d9c] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_1 command failed: Merge failed
2015-10-15 17:55:09,799 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-7) [5372d9c] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand' return value 'StatusOnlyReturnForXmlRpc [status=StatusForXmlRpc [code=52, message=Merge failed]]'
2015-10-15 17:55:09,799 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-7) [5372d9c] HostName = host_mixed_1
2015-10-15 17:55:09,799 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-7) [5372d9c] Command 'MergeVDSCommand(HostName = host_mixed_1, MergeVDSCommandParameters:{runAsync='true', hostId='15cb84b3-1420-4363-8e3a-f05f56aa1d41', vmId='7b089b30-9c15-4ea7-8112-03be938b78af', storagePoolId='9f5fb006-6d26-4651-9d8e-83881690b7c4', storageDomainId='1b3117e3-d68b-48d5-ac02-8e3ccb4037eb', imageGroupId='02bb83d9-0bd6-4500-abbe-91f8a9d41b8a', imageId='78078a61-edb5-4a8a-a5b9-3932e0ad3d84', baseImageId='4056bbaa-e8cb-4d2f-963e-6da626c2e6a6', topImageId='78078a61-edb5-4a8a-a5b9-3932e0ad3d84', bandwidth='0'})' execution failed: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52
2015-10-15 17:55:09,799 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-7) [5372d9c] FINISH, MergeVDSCommand, log id: e25d3bd
2015-10-15 17:55:09,799 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-7) [5372d9c] Command 'org.ovirt.engine.core.bll.MergeCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52 (Failed with error mergeErr and code 52)
2015-10-15 17:55:09,806 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-7) [5372d9c] Transaction rolled-back for command 'org.ovirt.engine.core.bll.MergeCommand'.
2015-10-15 17:55:09,870 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-6) [31a7c0c0] Failed in 'MergeVDS' method
2015-10-15 17:55:09,874 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-6-thread-6) [31a7c0c0] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_1 command failed: Merge failed
2015-10-15 17:55:09,874 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-6) [31a7c0c0] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand' return value 'StatusOnlyReturnForXmlRpc [status=StatusForXmlRpc [code=52, message=Merge failed]]'
2015-10-15 17:55:09,874 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-6) [31a7c0c0] HostName = host_mixed_1
2015-10-15 17:55:09,874 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-6) [31a7c0c0] Command 'MergeVDSCommand(HostName = host_mixed_1, MergeVDSCommandParameters:{runAsync='true', hostId='15cb84b3-1420-4363-8e3a-f05f56aa1d41', vmId='7b089b30-9c15-4ea7-8112-03be938b78af', storagePoolId='9f5fb006-6d26-4651-9d8e-83881690b7c4', storageDomainId='1b3117e3-d68b-48d5-ac02-8e3ccb4037eb', imageGroupId='0d22f65a-3f57-4144-8eed-d49bc83e2521', imageId='d95cb175-81ab-40cb-96bd-9f0c92df64ef', baseImageId='b55de5e5-bb2e-45c9-b4bb-8271b1023fd1', topImageId='d95cb175-81ab-40cb-96bd-9f0c92df64ef', bandwidth='0'})' execution failed: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52
2015-10-15 17:55:09,874 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-6-thread-6) [31a7c0c0] FINISH, MergeVDSCommand, log id: 224defb3
2015-10-15 17:55:09,874 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-6) [31a7c0c0] Command 'org.ovirt.engine.core.bll.MergeCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52 (Failed with error mergeErr and code 52)
2015-10-15 17:55:09,884 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-6-thread-6) [31a7c0c0] Transaction rolled-back for command 'org.ovirt.engine.core.bll.MergeCommand'.
2015-10-15 17:55:10,789 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-19) [5072fa86] Failed child command status for step 'MERGE'
2015-10-15 17:55:10,797 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-19) [3ffd2194] Failed child command status for step 'MERGE'
2015-10-15 17:55:11,817 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-35) [5072fa86] Merging of snapshot '41918c41-8409-49a8-a11b-e870b6c7fd79' images 'b55de5e5-bb2e-45c9-b4bb-8271b1023fd1'..'d95cb175-81ab-40cb-96bd-9f0c92df64ef' failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2015-10-15 17:55:11,833 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-35) [3ffd2194] Merging of snapshot '41918c41-8409-49a8-a11b-e870b6c7fd79' images '4056bbaa-e8cb-4d2f-963e-6da626c2e6a6'..'78078a61-edb5-4a8a-a5b9-3932e0ad3d84' failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2015-10-15 17:55:11,840 INFO [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-35) [228418f8] All Live Merge child commands have completed, status 'FAILED'

From vdsm.log:

Thread-5851::ERROR::2015-10-15 17:55:09,770::vm::4776::virt.vm::(merge) vmId=`7b089b30-9c15-4ea7-8112-03be938b78af`::Live merge failed (job: 1171f208-531f-4db7-828d-6c19abb176cc)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 4772, in merge
    flags)
  File "/usr/share/vdsm/virt/virdomain.py", line 68, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 668, in blockCommit
    if ret == -1: raise libvirtError ('virDomainBlockCommit() failed', dom=self)
libvirtError: internal error: unable to execute QEMU command 'block-commit': Top image /rhev/data-center/9f5fb006-6d26-4651-9d8e-83881690b7c4/1b3117e3-d68b-48d5-ac02-8e3ccb4037eb/images/02bb83d9-0bd6-4500-abbe-91f8a9d41b8a/78078a61-edb5-4a8a-a5b9-3932e0ad3d84 is larger than base image /rhev/data-center/9f5fb006-6d26-4651-9d8e-83881690b7c4/1b3117e3-d68b-48d5-ac02-8e3ccb4037eb/images/02bb83d9-0bd6-4500-abbe-91f8a9d41b8a/../02bb83d9-0bd6-4500-abbe-91f8a9d41b8a/4056bbaa-e8cb-4d2f-963e-6da626c2e6a6, and resize of base image failed: Invalid argument
Thread-5851::DEBUG::2015-10-15 17:55:09,790::stompreactor::304::yajsonrpc.StompServer::(send) Sending response

Created attachment 1083672 [details]
engine and vdsm logs
The root cause of the failure is described in bug 1274323.

This is an automated message. oVirt 3.6.0 RC3 has been released and GA is targeted to next week, Nov 4th 2015. Please review this bug and, if it is not a blocker, please postpone it to a later release. All bugs not postponed on GA release will be automatically re-targeted to:
- 3.6.1 if severity >= high
- 4.0 if severity < high

Comment 21 describes a scenario not handled by the original patch. To keep things simple, I'd like to move this back to MODIFIED and track the backport of the fix described in bug 1274323 to 3.6.1 in a new clone of 1274323. To re-verify, execute the steps described in comment 19, but ensure that the VM being used runs on the SPM host.

Hi Adam,
Followed the steps described in comment #19 (second scenario):

Scenario 2:
-----------
1) Create a VM with a 1 GB preallocated disk in a block-based domain.
2) Create a snapshot.
3) Create a second snapshot.
4) Extend the disk to 2 GB.
5) Start the VM.
6) Delete the second snapshot.

Live merge of this snapshot succeeded. Then I tried to delete the other snapshot created in step 2, and this deletion attempt failed with the following [1] [2]. Should we re-open this bug or file a new one?
[1] engine.log:

2015-12-10 13:31:40,616 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-79) [149e4d39] BaseAsyncTask::logEndTaskFailure: Task '7f1a67f9-43c8-4d41-9eb9-8149c9ebae90' (Parent Command 'MergeExtend', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Incorrect size value for volume resize, code = 232',
-- Exception: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Incorrect size value for volume resize, code = 232'

[2] vdsm.log:

7f1a67f9-43c8-4d41-9eb9-8149c9ebae90::ERROR::2015-12-10 15:21:58,072::task::866::Storage.TaskManager.Task::(_setError) Task=`7f1a67f9-43c8-4d41-9eb9-8149c9ebae90`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 332, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/securable.py", line 77, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1333, in extendVolumeSize
    .produceVolume(imgUUID, volUUID).extendSize(int(newSize))
  File "/usr/share/vdsm/storage/volume.py", line 618, in extendSize
    raise se.VolumeResizeValueError(newSize)
VolumeResizeValueError: Incorrect size value for volume resize: (4194304,)
7f1a67f9-43c8-4d41-9eb9-8149c9ebae90::DEBUG::2015-12-10 15:21:58,073::task::885::Storage.TaskManager.Task::(_run) Task=`7f1a67f9-43c8-4d41-9eb9-8149c9ebae90`::Task._run: 7f1a67f9-43c8-4d41-9eb9-8149c9ebae90 () {} failed - stopping task
7f1a67f9-43c8-4d41-9eb9-8149c9ebae90::DEBUG::2015-12-10 15:21:58,073::task::1246::Storage.TaskManager.Task::(stop) Task=`7f1a67f9-43c8-4d41-9eb9-8149c9ebae90`::stopping in state running (force False)
7f1a67f9-43c8-4d41-9eb9-8149c9ebae90::DEBUG::2015-12-10 15:21:58,074::task::993::Storage.TaskManager.Task::(_decref) Task=`7f1a67f9-43c8-4d41-9eb9-8149c9ebae90`::ref 1 aborting True
7f1a67f9-43c8-4d41-9eb9-8149c9ebae90::DEBUG::2015-12-10 15:21:58,074::task::919::Storage.TaskManager.Task::(_runJobs) Task=`7f1a67f9-43c8-4d41-9eb9-8149c9ebae90`::aborting: Task is aborted: 'Incorrect size value for volume resize' - code 232

Full logs will be attached.

Created attachment 1104349 [details]
logs-10.12.15
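For context on the `VolumeResizeValueError` in the traceback above: a minimal sketch of the kind of size validation that would produce it, assuming vdsm rejects a resize request that does not strictly grow the volume (which could explain a failure when an extension is re-requested to a size the volume already has). The function and exception below are illustrative only, not the actual vdsm code, and the sector-based unit is an assumption.

```python
SECTOR_SIZE = 512  # assumed unit for the sizes seen in the log


class VolumeResizeValueError(ValueError):
    """Mirrors the error string seen in the vdsm log above."""

    def __init__(self, new_size):
        super(VolumeResizeValueError, self).__init__(
            "Incorrect size value for volume resize: (%d,)" % new_size)


def extend_size(current_sectors, new_sectors):
    """Accept only a strictly growing resize; a request equal to or
    below the current size is rejected rather than silently ignored."""
    if new_sectors <= current_sectors:
        raise VolumeResizeValueError(new_sectors)
    return new_sectors * SECTOR_SIZE  # new capacity in bytes
```

Under this assumption, the 4194304 in the log (4194304 sectors of 512 bytes, i.e. 2 GiB) would be a request that matched the volume's existing size.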
Tried to retest and didn't manage to reproduce the issue mentioned in comment #28.

Verified using the following versions:
---------------------------------------
rhevm-3.6.2-0.1.el6.noarch - running on RHEL 6.7
vdsm-4.17.14-0.el7ev.noarch - running on RHEL 7.2

Scenario:
---------
1) Create a VM with a 1 GB preallocated disk in a block-based domain.
2) Create a snapshot.
3) Create a second snapshot.
4) Extend the disk to 2 GB.
5) Start the VM on an HSM host.
6) Delete the second snapshot.
7) Delete the first snapshot.

Adam - in case you think there is a real issue as described in comment #28, please let us know.

(In reply to Elad from comment #28)
> Live merge of this snapshot succeeded. Then I tried to delete the other
> snapshot created in step 2, and this deletion attempt failed with the
> following [1] [2]. Should we re-open this bug or file a new one?

Hi Elad,
I was unable to reproduce this issue on master. We've fixed a number of engine-side live merge issues recently and I suspect we swept this up too. If you can reproduce this problem on the latest code, please open a new bug to track it.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html

Tested the scenario described in comment #28 using the latest code; could not reproduce.
vdsm-4.17.23-0.el7ev.noarch
libvirt-daemon-1.2.17-13.el7_2.4.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.8.x86_64
rhevm-3.6.3.4-0.1.el6.noarch

*** Bug 1577060 has been marked as a duplicate of this bug. ***