Bug 2103582

Summary: Failed to delete the second snapshot when deleting two snapshots of a running VM
Product: [oVirt] vdsm Reporter: Qin Yuan <qiyuan>
Component: GeneralAssignee: Albert Esteve <aesteve>
Status: CLOSED CURRENTRELEASE QA Contact: Ilia Markelov <imarkelo>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.50.1CC: aesteve, ahadas, bugs, sfishbai
Target Milestone: ovirt-4.5.2Keywords: Automation
Target Release: 4.50.2.2Flags: pm-rhel: ovirt-4.5?
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: vdsm-4.50.2.2 Doc Type: Bug Fix
Doc Text:
Cause: The parent UUID from a volume is obtained using a special LV tag, which could be outdated after a live merge on a non-SPM host. Consequence: When the host tries to get the LV pointed by the stale UUID, fails. Fix: Parent UUID is kept also in the volume metadata. When retrieving the parent UUID, check both UUIDs, update the cached LVM data and return metadata parent, which is updated immediately after a live merge. Result: Bug is fixed with metadata parent, and LV tags are updated.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-30 08:47:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
engine, vdsm logs none

Description Qin Yuan 2022-07-04 07:13:43 UTC
Created attachment 1894436 [details]
engine, vdsm logs

Description of problem:
When deleting two snapshots of a running VM, the second snapshot can't be removed, errors in engine log:

2022-07-04 09:20:29,379+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-9) [5edf34c0-a007-49e9-be4f-7f21eeeae095] Command 'MergeVDSCommand(HostName = host_mixed_2, MergeVDSCommandParameters:{hostId='69e08ae3-553c-4564-9dae-f8bacc773ae4', vmId='94a5efff-ebb4-4876-b424-3fd2c8f48321', storagePoolId='3ebd784d-d9cd-454d-9a44-01275d8856bc', storageDomainId='877192bd-b310-4913-9732-8c43d40166af', imageGroupId='b75195e9-5f17-4169-9fdd-f64c6773cc86', imageId='812cc42d-6fb0-442e-976a-2bc30e304ee7', baseImageId='f1f07afd-7934-4151-a5b4-0d5e7d106f83', topImageId='812cc42d-6fb0-442e-976a-2bc30e304ee7', bandwidth='0'})' execution failed: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = General Exception: ("Unable to measure domain 877192bd-b310-4913-9732-8c43d40166af image b75195e9-5f17-4169-9fdd-f64c6773cc86 top 812cc42d-6fb0-442e-976a-2bc30e304ee7 base f1f07afd-7934-4151-a5b4-0d5e7d106f83: Volume does not exist: ('c363cc65-cccc-48a2-8146-94cdec4698ff',)",), code = 100
2022-07-04 09:20:29,379+03 ERROR [org.ovirt.engine.core.bll.MergeCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-9) [5edf34c0-a007-49e9-be4f-7f21eeeae095] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = General Exception: ("Unable to measure domain 877192bd-b310-4913-9732-8c43d40166af image b75195e9-5f17-4169-9fdd-f64c6773cc86 top 812cc42d-6fb0-442e-976a-2bc30e304ee7 base f1f07afd-7934-4151-a5b4-0d5e7d106f83: Volume does not exist: ('c363cc65-cccc-48a2-8146-94cdec4698ff',)",), code = 100 (Failed with error GeneralException and code 100)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:114)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2121)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.MergeCommand.executeCommand(MergeCommand.java:45)


Version-Release number of selected component (if applicable):
ovirt-engine-4.5.1.2-0.11.el8ev.noarch
vdsm-4.50.1.3-1.el8ev.x86_64

How reproducible:
Can't be reproduced 100%, but at a high percentage.

Steps to Reproduce:
1. Create and run a VM(used vm template latest-rhel-guest-image-8.6-infra, the storage type of disk image is iSCSI)
2. Create two live snapshots both with memory, snap1, snap2.
3. Remove snap1.
4. Remove snap2 right after snap1 is removed.

Actual results:
1. The second snapshot can't be removed with above errors.

Expected results:
1. The second snapshot can be removed successfully.

Additional info:

Comment 1 Arik 2022-07-04 11:29:21 UTC
(In reply to Qin Yuan from comment #0)
> Steps to Reproduce:
> 1. Create and run a VM(used vm template latest-rhel-guest-image-8.6-infra,
> the storage type of disk image is iSCSI)
> 2. Create two live snapshots both with memory, snap1, snap2.

The issue is likely to happen also without memory since the problem is with the volumes

> 3. Remove snap1.
> 4. Remove snap2 right after snap1 is removed.
> 
> Actual results:
> 1. The second snapshot can't be removed with above errors.
> 
> Expected results:
> 1. The second snapshot can be removed successfully.

Ideally when the first remove-snapshot is done, we should be able to proceed with the second removal but if it's too challenging then even blocking the second operation (on the engine side) would be ok
Benny, can you please take a deeper look at what's going on here?

Comment 2 Benny Zlotnik 2022-07-06 08:20:03 UTC
Can you share the logs from the SPM as well?

Comment 4 Benny Zlotnik 2022-07-07 09:28:37 UTC
Initial notes:
DestroyImage call for volume 710f86ca-0381-40c0-bb94-e0dbb6d10c7f

2022-07-06 13:31:07,123+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-6) [b36e5c07-a5f3-4460-9824-db9977a4c923] START, DestroyImageVDSCommand( DestroyImageVDSCommandParameters:{storagePoolId='3ebd784d-d9cd-454d-9a44-01275d8856bc', ignoreFailoverLimit='false', storageDomainId='877192bd-b310-4913-9732-8c43d40166af', imageGroupId='2b116aed-735c-4683-9d13-c6585bea13af', imageId='00000000-0000-0000-0000-000000000000', imageList='[710f86ca-0381-40c0-bb94-e0dbb6d10c7f]', postZero='false', force='false'}), log id: 4e20959

Tags updated
7441:2022-07-06 13:31:07,706+0300 INFO  (tasks/5) [storage.lvm] Change LVs tags (vg=877192bd-b310-4913-9732-8c43d40166af, lvs=('4d027efa-8828-49c6-b69e-7631355777c8',), delTags=('PU_710f86ca-0381-40c0-bb94-e0dbb6d10c7f',), addTags=('PU_c79ae80e-f60c-483c-be32-8fa31b659545',)) (lvm:1891)

Second live merge starts
2022-07-06 13:31:43,678+03 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (default task-292) [16764fae-71ec-46dc-b7bf-4aa49ee6c7b8] Lock Acquired to object 'EngineLock:{exclusiveLocks='[bf6d6f84-2815-400b-8de3-049040c4dccf=VM]', sharedLocks=''}'


For some reason 710f86ca-0381-40c0-bb94-e0dbb6d10c7f is still reported as 4d027efa-8828-49c6-b69e-7631355777c8's parent
2022-07-06 13:31:46,860+0300 INFO  (jsonrpc/3) [vdsm.api] FINISH getVolumeInfo return={'info': 
{'uuid': '4d027efa-8828-49c6-b69e-7631355777c8',
 'type': 'SPARSE',
 'format': 'COW',
 'disktype': 'DATA',
 'voltype': 'LEAF',
 'capacity': '10737418240',
 'parent': '710f86ca-0381-40c0-bb94-e0dbb6d10c7f',
 ...} 
from=::ffff:10.37.142.57,45342, flow_id=16764fae-71ec-46dc-b7bf-4aa49ee6c7b8, task_id=4ebd95b0-4351-41f6-8e6e-b01c58dba6e1 (api:54)


Measure later fails

2022-07-06 13:31:47,092+0300 ERROR (jsonrpc/3) [storage.dispatcher] FINISH measure error=Volume does not exist: ('710f86ca-0381-40c0-bb94-e0dbb6d10c7f',) (dispatcher:83)
2022-07-06 13:31:47,092+0300 ERROR (jsonrpc/3) [api] FINISH merge error=Unable to measure domain 877192bd-b310-4913-9732-8c43d40166af image 2b116aed-735c-4683-9d13-c6585bea13af top 4d027efa-8828-49c6-b69e-7631355777c8 base c79ae80e-f60c-483c-be32-8fa31b659545: Volume does not exist: ('710f86ca-0381-40c0-bb94-e0dbb6d10c7f',) (api:134)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 124, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 782, in merge
    drive, baseVolUUID, topVolUUID, bandwidth, jobUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5979, in merge
    driveSpec, baseVolUUID, topVolUUID, bandwidth, jobUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/livemerge.py", line 308, in merge
    drive, job, base_info)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/livemerge.py", line 456, in _base_needs_extend
    baseID=job.base)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 6135, in measure
    f"Unable to measure domain {domainID} image {imageID} top "
vdsm.virt.errors.StorageUnavailableError: Unable to measure domain 877192bd-b310-4913-9732-8c43d40166af image 2b116aed-735c-4683-9d13-c6585bea13af top 4d027efa-8828-49c6-b69e-7631355777c8 base c79ae80e-f60c-483c-be32-8fa31b659545: Volume does not exist: ('710f86ca-0381-40c0-bb94-e0dbb6d10c7f',)
2022-07-06 13:31:47,093+0300 INFO  (jsonrpc/3) [api.virt] FINISH merge return={'status': {'code': 100, 'message': 'General Exception: ("Unable to measure domain 877192bd-b31
0-4913-9732-8c43d40166af image 2b116aed-735c-4683-9d13-c6585bea13af top 4d027efa-8828-49c6-b69e-7631355777c8 base c79ae80e-f60c-483c-be32-8fa31b659545: Volume does not exist: (\'710f86ca-0381-40c0-bb94-e0dbb6d10c7f\',)",)'}} from=::ffff:10.37.142.57,45342, flow_id=16764fae-71ec-46dc-b7bf-4aa49ee6c7b8, vmId=bf6d6f84-2815-400b-8de3-049040c4dccf (api:54)

Comment 5 Benny Zlotnik 2022-07-14 08:35:26 UTC
Albert, can you take a look? It looks like the metadata is not updated for some reason, or at least that's what I think the surface issue is

Comment 6 Albert Esteve 2022-07-18 10:58:21 UTC
Looking into it.

Comment 7 Albert Esteve 2022-07-20 10:51:34 UTC
I have been able to reproduce the bug. The problem is that the host that is
running the VM only updates the child volume metadata when a volume in a
chain is deleted. The child volume tags are updated only by the SPM,
after the volume is removed.

In between these two events, any operation that retrieves the parent ID for the
child volume will return an outdated UUID. We have a patch posted that should
fix this issue.

Comment 8 Ilia Markelov 2022-08-08 14:31:09 UTC
Do you have better verification flow steps for this bug? The main flow is not 100% reproducible and it's a little unclear from the comments how to reproduce it properly

Comment 9 Albert Esteve 2022-08-08 15:37:33 UTC
I have used this automatic stress test: https://github.com/oVirt/ovirt-stress/tree/master/delete-snapshot-chain
The test was consistently reproducing the issue. It comes with some instructions to set it up and run it.
Please, try it, and if there is any point unclear, you can ping me.

Comment 10 Ilia Markelov 2022-08-10 07:42:09 UTC
Verified.

Snapshots were removed successfully.

Versions:
engine-4.5.2.1-0.1.el8ev
vdsm-4.50.2.2-1.el8ev.x86_64

Comment 11 Sandro Bonazzola 2022-08-30 08:47:42 UTC
This bugzilla is included in oVirt 4.5.2 release, published on August 10th 2022.
Since the problem described in this bug report should be resolved in oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.