Created attachment 1894436 [details]
engine, vdsm logs

Description of problem:
When deleting two snapshots of a running VM, the second snapshot can't be removed. Errors in the engine log:

2022-07-04 09:20:29,379+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-9) [5edf34c0-a007-49e9-be4f-7f21eeeae095] Command 'MergeVDSCommand(HostName = host_mixed_2, MergeVDSCommandParameters:{hostId='69e08ae3-553c-4564-9dae-f8bacc773ae4', vmId='94a5efff-ebb4-4876-b424-3fd2c8f48321', storagePoolId='3ebd784d-d9cd-454d-9a44-01275d8856bc', storageDomainId='877192bd-b310-4913-9732-8c43d40166af', imageGroupId='b75195e9-5f17-4169-9fdd-f64c6773cc86', imageId='812cc42d-6fb0-442e-976a-2bc30e304ee7', baseImageId='f1f07afd-7934-4151-a5b4-0d5e7d106f83', topImageId='812cc42d-6fb0-442e-976a-2bc30e304ee7', bandwidth='0'})' execution failed: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = General Exception: ("Unable to measure domain 877192bd-b310-4913-9732-8c43d40166af image b75195e9-5f17-4169-9fdd-f64c6773cc86 top 812cc42d-6fb0-442e-976a-2bc30e304ee7 base f1f07afd-7934-4151-a5b4-0d5e7d106f83: Volume does not exist: ('c363cc65-cccc-48a2-8146-94cdec4698ff',)",), code = 100

2022-07-04 09:20:29,379+03 ERROR [org.ovirt.engine.core.bll.MergeCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-9) [5edf34c0-a007-49e9-be4f-7f21eeeae095] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = General Exception: ("Unable to measure domain 877192bd-b310-4913-9732-8c43d40166af image b75195e9-5f17-4169-9fdd-f64c6773cc86 top 812cc42d-6fb0-442e-976a-2bc30e304ee7 base f1f07afd-7934-4151-a5b4-0d5e7d106f83: Volume does not exist: ('c363cc65-cccc-48a2-8146-94cdec4698ff',)",), code = 100 (Failed with error GeneralException and code 100)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:114)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2121)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.MergeCommand.executeCommand(MergeCommand.java:45)

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.1.2-0.11.el8ev.noarch
vdsm-4.50.1.3-1.el8ev.x86_64

How reproducible:
Can't be reproduced 100%, but at a high percentage.

Steps to Reproduce:
1. Create and run a VM (used VM template latest-rhel-guest-image-8.6-infra; the storage type of the disk image is iSCSI).
2. Create two live snapshots, snap1 and snap2, both with memory.
3. Remove snap1.
4. Remove snap2 right after snap1 is removed.

Actual results:
1. The second snapshot can't be removed, with the above errors.

Expected results:
1. The second snapshot can be removed successfully.

Additional info:
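For reference, the same steps expressed with the Python SDK (ovirtsdk4). This is only a rough sketch, assuming a running VM named 'test-vm' already created from the template and the standard snapshot service calls; it is not the script that was actually used here:

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Assumed connection details; adjust for the actual environment.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=test-vm')[0]   # VM created from the template, already running
snapshots_service = vms_service.vm_service(vm.id).snapshots_service()

def wait_ok(snapshot_id):
    # Poll until the snapshot leaves the LOCKED state.
    while (snapshots_service.snapshot_service(snapshot_id).get().snapshot_status
           != types.SnapshotStatus.OK):
        time.sleep(5)

# Step 2: two live snapshots with memory.
snaps = []
for name in ('snap1', 'snap2'):
    snap = snapshots_service.add(types.Snapshot(description=name, persist_memorystate=True))
    wait_ok(snap.id)
    snaps.append(snap)

# Steps 3-4: remove snap1, then remove snap2 right after the first removal completes.
for snap in snaps:
    snapshots_service.snapshot_service(snap.id).remove()
    # Removal is asynchronous; the second remove is issued as soon as the
    # first snapshot disappears from the snapshot list.
    while any(s.id == snap.id for s in snapshots_service.list()):
        time.sleep(5)

connection.close()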
(In reply to Qin Yuan from comment #0)
> Steps to Reproduce:
> 1. Create and run a VM(used vm template latest-rhel-guest-image-8.6-infra,
> the storage type of disk image is iSCSI)
> 2. Create two live snapshots both with memory, snap1, snap2.

The issue is likely to happen also without memory, since the problem is with the volumes.

> 3. Remove snap1.
> 4. Remove snap2 right after snap1 is removed.
>
> Actual results:
> 1. The second snapshot can't be removed with above errors.
>
> Expected results:
> 1. The second snapshot can be removed successfully.

Ideally, when the first remove-snapshot is done, we should be able to proceed with the second removal, but if that is too challenging, then even blocking the second operation (on the engine side) would be OK.

Benny, can you please take a deeper look at what's going on here?
Can you share the logs from the SPM as well?
Initial notes:

DestroyImage call for volume 710f86ca-0381-40c0-bb94-e0dbb6d10c7f:

2022-07-06 13:31:07,123+03 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-6) [b36e5c07-a5f3-4460-9824-db9977a4c923] START, DestroyImageVDSCommand( DestroyImageVDSCommandParameters:{storagePoolId='3ebd784d-d9cd-454d-9a44-01275d8856bc', ignoreFailoverLimit='false', storageDomainId='877192bd-b310-4913-9732-8c43d40166af', imageGroupId='2b116aed-735c-4683-9d13-c6585bea13af', imageId='00000000-0000-0000-0000-000000000000', imageList='[710f86ca-0381-40c0-bb94-e0dbb6d10c7f]', postZero='false', force='false'}), log id: 4e20959

Tags updated:

7441:2022-07-06 13:31:07,706+0300 INFO (tasks/5) [storage.lvm] Change LVs tags (vg=877192bd-b310-4913-9732-8c43d40166af, lvs=('4d027efa-8828-49c6-b69e-7631355777c8',), delTags=('PU_710f86ca-0381-40c0-bb94-e0dbb6d10c7f',), addTags=('PU_c79ae80e-f60c-483c-be32-8fa31b659545',)) (lvm:1891)

Second live merge starts:

2022-07-06 13:31:43,678+03 INFO [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (default task-292) [16764fae-71ec-46dc-b7bf-4aa49ee6c7b8] Lock Acquired to object 'EngineLock:{exclusiveLocks='[bf6d6f84-2815-400b-8de3-049040c4dccf=VM]', sharedLocks=''}'

For some reason 710f86ca-0381-40c0-bb94-e0dbb6d10c7f is still reported as 4d027efa-8828-49c6-b69e-7631355777c8's parent:

2022-07-06 13:31:46,860+0300 INFO (jsonrpc/3) [vdsm.api] FINISH getVolumeInfo return={'info': {'uuid': '4d027efa-8828-49c6-b69e-7631355777c8', 'type': 'SPARSE', 'format': 'COW', 'disktype': 'DATA', 'voltype': 'LEAF', 'capacity': '10737418240', 'parent': '710f86ca-0381-40c0-bb94-e0dbb6d10c7f', ...} from=::ffff:10.37.142.57,45342, flow_id=16764fae-71ec-46dc-b7bf-4aa49ee6c7b8, task_id=4ebd95b0-4351-41f6-8e6e-b01c58dba6e1 (api:54)

Measure later fails:

2022-07-06 13:31:47,092+0300 ERROR (jsonrpc/3) [storage.dispatcher] FINISH measure error=Volume does not exist: ('710f86ca-0381-40c0-bb94-e0dbb6d10c7f',) (dispatcher:83)
2022-07-06 13:31:47,092+0300 ERROR (jsonrpc/3) [api] FINISH merge error=Unable to measure domain 877192bd-b310-4913-9732-8c43d40166af image 2b116aed-735c-4683-9d13-c6585bea13af top 4d027efa-8828-49c6-b69e-7631355777c8 base c79ae80e-f60c-483c-be32-8fa31b659545: Volume does not exist: ('710f86ca-0381-40c0-bb94-e0dbb6d10c7f',) (api:134)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 124, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 782, in merge
    drive, baseVolUUID, topVolUUID, bandwidth, jobUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5979, in merge
    driveSpec, baseVolUUID, topVolUUID, bandwidth, jobUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/livemerge.py", line 308, in merge
    drive, job, base_info)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/livemerge.py", line 456, in _base_needs_extend
    baseID=job.base)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 6135, in measure
    f"Unable to measure domain {domainID} image {imageID} top "
vdsm.virt.errors.StorageUnavailableError: Unable to measure domain 877192bd-b310-4913-9732-8c43d40166af image 2b116aed-735c-4683-9d13-c6585bea13af top 4d027efa-8828-49c6-b69e-7631355777c8 base c79ae80e-f60c-483c-be32-8fa31b659545: Volume does not exist: ('710f86ca-0381-40c0-bb94-e0dbb6d10c7f',)

2022-07-06 13:31:47,093+0300 INFO (jsonrpc/3) [api.virt] FINISH merge return={'status': {'code': 100, 'message': 'General Exception: ("Unable to measure domain 877192bd-b310-4913-9732-8c43d40166af image 2b116aed-735c-4683-9d13-c6585bea13af top 4d027efa-8828-49c6-b69e-7631355777c8 base c79ae80e-f60c-483c-be32-8fa31b659545: Volume does not exist: (\'710f86ca-0381-40c0-bb94-e0dbb6d10c7f\',)",)'}} from=::ffff:10.37.142.57,45342, flow_id=16764fae-71ec-46dc-b7bf-4aa49ee6c7b8, vmId=bf6d6f84-2815-400b-8de3-049040c4dccf (api:54)
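For reference, the stale parent can be re-checked from the host while debugging. A sketch only, using vdsm's Python client, assuming it is run on the hypervisor with the default vdsm TLS setup and the pool/domain/image/volume UUIDs from the logs above:

# Debugging sketch: repeats the getVolumeInfo call seen in the log and
# prints the parent UUID vdsm reports for the child volume.
from vdsm import client

cli = client.connect('localhost', 54321)
info = cli.Volume.getInfo(
    storagepoolID='3ebd784d-d9cd-454d-9a44-01275d8856bc',
    storagedomainID='877192bd-b310-4913-9732-8c43d40166af',
    imageID='2b116aed-735c-4683-9d13-c6585bea13af',
    volumeID='4d027efa-8828-49c6-b69e-7631355777c8',
)
# During the buggy window this still prints the removed volume
# 710f86ca-0381-40c0-bb94-e0dbb6d10c7f instead of the new base.
print(info['parent'])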
Albert, can you take a look? It looks like the metadata is not updated for some reason, or at least that's what I think the surface issue is.
Looking into it.
I have been able to reproduce the bug. The problem is that when a volume in a chain is deleted, the host running the VM updates only the child volume's metadata; the child volume's tags are updated only by the SPM, after the volume is removed. In between these two events, any operation that retrieves the parent ID for the child volume will return an outdated UUID. We have a patch posted that should fix this issue.
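To make the window concrete, here is a minimal toy sketch (plain Python, not vdsm code; the names and ordering are illustrative only): the two views of the parent link are updated at different times, so a merge that starts in between resolves the parent to a volume that has already been removed.

from dataclasses import dataclass

@dataclass
class ChildVolume:
    md_parent: str    # parent UUID in the volume metadata (updated by the VM host)
    tag_parent: str   # parent UUID in the LV tags (updated later by the SPM)

def vm_host_removes_parent(child: ChildVolume, new_base: str) -> None:
    # The host running the VM rewires the chain in the volume metadata.
    child.md_parent = new_base

def spm_removes_parent(child: ChildVolume, new_base: str) -> None:
    # The SPM updates the PU_<uuid> LV tag only after the LV is removed.
    child.tag_parent = new_base

old_parent, new_base = "710f86ca", "c79ae80e"
child = ChildVolume(md_parent=old_parent, tag_parent=old_parent)

vm_host_removes_parent(child, new_base)
# A second live merge starting here, resolving the parent from the
# not-yet-updated view, still sees the removed volume:
assert child.tag_parent == old_parent    # stale -> "Volume does not exist"

spm_removes_parent(child, new_base)
assert child.tag_parent == new_base      # consistent again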
Do you have better verification flow steps for this bug? The main flow is not 100% reproducible, and it's a little unclear from the comments how to reproduce it properly.
I have used this automatic stress test:
https://github.com/oVirt/ovirt-stress/tree/master/delete-snapshot-chain

The test was consistently reproducing the issue. It comes with some instructions to set it up and run it. Please try it, and if any point is unclear, you can ping me.
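If running the full stress test is not practical, a simplified stand-in for its flow can be scripted with the Python SDK (ovirtsdk4). This is only an assumption about the test's general shape (repeatedly build a snapshot chain on a running VM and delete the snapshots back to back), not the repository's actual script:

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Assumes a running VM named 'stress-vm' and the same SDK calls as the
# reproduction sketch in comment #0.
connection = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                            username='admin@internal', password='password',
                            insecure=True)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=stress-vm')[0]
snaps_service = vms_service.vm_service(vm.id).snapshots_service()

def wait_until(predicate):
    while not predicate():
        time.sleep(5)

for iteration in range(10):        # repeat to increase the chance of hitting the race
    chain = []
    for i in range(3):             # build a small snapshot chain
        snap = snaps_service.add(types.Snapshot(description=f'iter{iteration}-snap{i}'))
        wait_until(lambda: snaps_service.snapshot_service(snap.id).get()
                   .snapshot_status == types.SnapshotStatus.OK)
        chain.append(snap)
    for snap in chain:             # delete the snapshots one right after another
        snaps_service.snapshot_service(snap.id).remove()
        wait_until(lambda: all(s.id != snap.id for s in snaps_service.list()))

connection.close()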
Verified. Snapshots were removed successfully.

Versions:
engine-4.5.2.1-0.1.el8ev
vdsm-4.50.2.2-1.el8ev.x86_64
This bugzilla is included in the oVirt 4.5.2 release, published on August 10th 2022. Since the problem described in this bug report should be resolved in the oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.