Bug 2103582 - Failed to delete the second snapshot when deleting two snapshots of a running VM
Summary: Failed to delete the second snapshot when deleting two snapshots of a running VM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.50.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.2
Target Release: 4.50.2.2
Assignee: Albert Esteve
QA Contact: Ilia Markelov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-07-04 07:13 UTC by Qin Yuan
Modified: 2022-08-30 08:47 UTC
CC List: 4 users

Fixed In Version: vdsm-4.50.2.2
Clone Of:
Environment:
Last Closed: 2022-08-30 08:47:42 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.5?


Attachments
engine, vdsm logs (74.87 KB, application/x-xz)
2022-07-04 07:13 UTC, Qin Yuan


Links
Github oVirt vdsm pull 285: blockVolume: check parent tag and meta match (open; last updated 2022-07-20 09:47:04 UTC)
Red Hat Issue Tracker RHV-46950 (last updated 2022-07-04 07:26:12 UTC)

Description Qin Yuan 2022-07-04 07:13:43 UTC
Created attachment 1894436
engine, vdsm logs

Description of problem:
When deleting two snapshots of a running VM, the second snapshot cannot be removed. Errors in the engine log:

2022-07-04 09:20:29,379+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-9) [5edf34c0-a007-49e9-be4f-7f21eeeae095] Command 'MergeVDSCommand(HostName = host_mixed_2, MergeVDSCommandParameters:{hostId='69e08ae3-553c-4564-9dae-f8bacc773ae4', vmId='94a5efff-ebb4-4876-b424-3fd2c8f48321', storagePoolId='3ebd784d-d9cd-454d-9a44-01275d8856bc', storageDomainId='877192bd-b310-4913-9732-8c43d40166af', imageGroupId='b75195e9-5f17-4169-9fdd-f64c6773cc86', imageId='812cc42d-6fb0-442e-976a-2bc30e304ee7', baseImageId='f1f07afd-7934-4151-a5b4-0d5e7d106f83', topImageId='812cc42d-6fb0-442e-976a-2bc30e304ee7', bandwidth='0'})' execution failed: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = General Exception: ("Unable to measure domain 877192bd-b310-4913-9732-8c43d40166af image b75195e9-5f17-4169-9fdd-f64c6773cc86 top 812cc42d-6fb0-442e-976a-2bc30e304ee7 base f1f07afd-7934-4151-a5b4-0d5e7d106f83: Volume does not exist: ('c363cc65-cccc-48a2-8146-94cdec4698ff',)",), code = 100
2022-07-04 09:20:29,379+03 ERROR [org.ovirt.engine.core.bll.MergeCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-9) [5edf34c0-a007-49e9-be4f-7f21eeeae095] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = General Exception: ("Unable to measure domain 877192bd-b310-4913-9732-8c43d40166af image b75195e9-5f17-4169-9fdd-f64c6773cc86 top 812cc42d-6fb0-442e-976a-2bc30e304ee7 base f1f07afd-7934-4151-a5b4-0d5e7d106f83: Volume does not exist: ('c363cc65-cccc-48a2-8146-94cdec4698ff',)",), code = 100 (Failed with error GeneralException and code 100)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:114)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2121)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.MergeCommand.executeCommand(MergeCommand.java:45)


Version-Release number of selected component (if applicable):
ovirt-engine-4.5.1.2-0.11.el8ev.noarch
vdsm-4.50.1.3-1.el8ev.x86_64

How reproducible:
Not reproducible 100% of the time, but it reproduces at a high rate.

Steps to Reproduce:
1. Create and run a VM (using the VM template latest-rhel-guest-image-8.6-infra; the disk image is on iSCSI storage).
2. Create two live snapshots, snap1 and snap2, both with memory.
3. Remove snap1.
4. Remove snap2 right after snap1 is removed.

Actual results:
1. The second snapshot cannot be removed; the errors above appear in the engine log.

Expected results:
1. The second snapshot can be removed successfully.

Additional info:
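For reference, the reproduction steps above can be scripted. Below is a minimal sketch using the oVirt Python SDK (ovirtsdk4); the engine URL, credentials, CA file, and VM name are placeholders, and the wait loops are simplified:

    import time
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    # Placeholder connection details; adjust for your engine.
    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='secret',
        ca_file='ca.pem',
    )

    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search='name=test-vm')[0]   # running VM from step 1
    snaps_service = vms_service.vm_service(vm.id).snapshots_service()

    def wait_until_ok(snap_id):
        # Poll until the snapshot leaves the LOCKED state (simplified).
        while snaps_service.snapshot_service(snap_id).get().snapshot_status != types.SnapshotStatus.OK:
            time.sleep(5)

    def wait_until_removed(snap_id):
        # Poll until the snapshot no longer exists (simplified).
        while any(s.id == snap_id for s in snaps_service.list()):
            time.sleep(5)

    # Step 2: two live snapshots, both with memory.
    snaps = []
    for name in ('snap1', 'snap2'):
        snap = snaps_service.add(types.Snapshot(description=name, persist_memorystate=True))
        wait_until_ok(snap.id)
        snaps.append(snap)

    # Step 3: remove snap1 and wait until it is gone.
    snaps_service.snapshot_service(snaps[0].id).remove()
    wait_until_removed(snaps[0].id)

    # Step 4: remove snap2 right after snap1 is removed.
    # In the failing runs, this second removal hits the MergeVDS error above.
    snaps_service.snapshot_service(snaps[1].id).remove()

    connection.close()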

Comment 1 Arik 2022-07-04 11:29:21 UTC
(In reply to Qin Yuan from comment #0)
> Steps to Reproduce:
> 1. Create and run a VM(used vm template latest-rhel-guest-image-8.6-infra,
> the storage type of disk image is iSCSI)
> 2. Create two live snapshots both with memory, snap1, snap2.

The issue is likely to happen also without memory, since the problem is with the volumes.

> 3. Remove snap1.
> 4. Remove snap2 right after snap1 is removed.
> 
> Actual results:
> 1. The second snapshot can't be removed with above errors.
> 
> Expected results:
> 1. The second snapshot can be removed successfully.

Ideally, when the first remove-snapshot is done, we should be able to proceed with the second removal. But if that is too challenging, then even blocking the second operation (on the engine side) would be OK.
Benny, can you please take a deeper look at what's going on here?

Comment 2 Benny Zlotnik 2022-07-06 08:20:03 UTC
Can you share the logs from the SPM as well?

Comment 4 Benny Zlotnik 2022-07-07 09:28:37 UTC
Initial notes:
DestroyImage call for volume 710f86ca-0381-40c0-bb94-e0dbb6d10c7f

2022-07-06 13:31:07,123+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-6) [b36e5c07-a5f3-4460-9824-db9977a4c923] START, DestroyImageVDSCommand( DestroyImageVDSCommandParameters:{storagePoolId='3ebd784d-d9cd-454d-9a44-01275d8856bc', ignoreFailoverLimit='false', storageDomainId='877192bd-b310-4913-9732-8c43d40166af', imageGroupId='2b116aed-735c-4683-9d13-c6585bea13af', imageId='00000000-0000-0000-0000-000000000000', imageList='[710f86ca-0381-40c0-bb94-e0dbb6d10c7f]', postZero='false', force='false'}), log id: 4e20959

Tags updated
7441:2022-07-06 13:31:07,706+0300 INFO  (tasks/5) [storage.lvm] Change LVs tags (vg=877192bd-b310-4913-9732-8c43d40166af, lvs=('4d027efa-8828-49c6-b69e-7631355777c8',), delTags=('PU_710f86ca-0381-40c0-bb94-e0dbb6d10c7f',), addTags=('PU_c79ae80e-f60c-483c-be32-8fa31b659545',)) (lvm:1891)
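
The PU_ prefix in the tag change above is how vdsm block storage records a volume's parent UUID on the LV itself, alongside the copy kept in the volume metadata. A minimal illustrative sketch of reading the parent from such tags (the helper and the example values are only for illustration):

    # Illustrative only: how a parent UUID can be recovered from LV tags
    # like the PU_... values shown in the log line above.
    PARENT_TAG_PREFIX = "PU_"

    def parent_from_tags(lv_tags):
        """Return the parent UUID encoded in the LV tags, or None if absent."""
        for tag in lv_tags:
            if tag.startswith(PARENT_TAG_PREFIX):
                return tag[len(PARENT_TAG_PREFIX):]
        return None

    # With the new tag added by the SPM above:
    tags = ["PU_c79ae80e-f60c-483c-be32-8fa31b659545"]
    print(parent_from_tags(tags))  # -> c79ae80e-f60c-483c-be32-8fa31b659545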

Second live merge starts
2022-07-06 13:31:43,678+03 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (default task-292) [16764fae-71ec-46dc-b7bf-4aa49ee6c7b8] Lock Acquired to object 'EngineLock:{exclusiveLocks='[bf6d6f84-2815-400b-8de3-049040c4dccf=VM]', sharedLocks=''}'


For some reason 710f86ca-0381-40c0-bb94-e0dbb6d10c7f is still reported as 4d027efa-8828-49c6-b69e-7631355777c8's parent
2022-07-06 13:31:46,860+0300 INFO  (jsonrpc/3) [vdsm.api] FINISH getVolumeInfo return={'info': 
{'uuid': '4d027efa-8828-49c6-b69e-7631355777c8',
 'type': 'SPARSE',
 'format': 'COW',
 'disktype': 'DATA',
 'voltype': 'LEAF',
 'capacity': '10737418240',
 'parent': '710f86ca-0381-40c0-bb94-e0dbb6d10c7f',
 ...} 
from=::ffff:10.37.142.57,45342, flow_id=16764fae-71ec-46dc-b7bf-4aa49ee6c7b8, task_id=4ebd95b0-4351-41f6-8e6e-b01c58dba6e1 (api:54)


Measure later fails

2022-07-06 13:31:47,092+0300 ERROR (jsonrpc/3) [storage.dispatcher] FINISH measure error=Volume does not exist: ('710f86ca-0381-40c0-bb94-e0dbb6d10c7f',) (dispatcher:83)
2022-07-06 13:31:47,092+0300 ERROR (jsonrpc/3) [api] FINISH merge error=Unable to measure domain 877192bd-b310-4913-9732-8c43d40166af image 2b116aed-735c-4683-9d13-c6585bea13af top 4d027efa-8828-49c6-b69e-7631355777c8 base c79ae80e-f60c-483c-be32-8fa31b659545: Volume does not exist: ('710f86ca-0381-40c0-bb94-e0dbb6d10c7f',) (api:134)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 124, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 782, in merge
    drive, baseVolUUID, topVolUUID, bandwidth, jobUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5979, in merge
    driveSpec, baseVolUUID, topVolUUID, bandwidth, jobUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/livemerge.py", line 308, in merge
    drive, job, base_info)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/livemerge.py", line 456, in _base_needs_extend
    baseID=job.base)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 6135, in measure
    f"Unable to measure domain {domainID} image {imageID} top "
vdsm.virt.errors.StorageUnavailableError: Unable to measure domain 877192bd-b310-4913-9732-8c43d40166af image 2b116aed-735c-4683-9d13-c6585bea13af top 4d027efa-8828-49c6-b69e-7631355777c8 base c79ae80e-f60c-483c-be32-8fa31b659545: Volume does not exist: ('710f86ca-0381-40c0-bb94-e0dbb6d10c7f',)
2022-07-06 13:31:47,093+0300 INFO  (jsonrpc/3) [api.virt] FINISH merge return={'status': {'code': 100, 'message': 'General Exception: ("Unable to measure domain 877192bd-b31
0-4913-9732-8c43d40166af image 2b116aed-735c-4683-9d13-c6585bea13af top 4d027efa-8828-49c6-b69e-7631355777c8 base c79ae80e-f60c-483c-be32-8fa31b659545: Volume does not exist: (\'710f86ca-0381-40c0-bb94-e0dbb6d10c7f\',)",)'}} from=::ffff:10.37.142.57,45342, flow_id=16764fae-71ec-46dc-b7bf-4aa49ee6c7b8, vmId=bf6d6f84-2815-400b-8de3-049040c4dccf (api:54)
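
To make the failure mode concrete: the measure fails, presumably because resolving the volume chain still follows the child's stale parent link, which points at a volume the SPM has already deleted. A simplified, hypothetical sketch of that chain walk (not the actual vdsm code), using the UUIDs from the logs above:

    class VolumeDoesNotExist(Exception):
        pass

    def walk_chain(volumes, top, base):
        """Yield volume UUIDs from top down to base, following parent links."""
        current = top
        while True:
            info = volumes.get(current)
            if info is None:
                raise VolumeDoesNotExist(current)
            yield current
            if current == base:
                return
            current = info["parent"]

    # State after the first merge: 710f86ca... was deleted by the SPM,
    # but the child's record still names it as parent.
    volumes = {
        "4d027efa-8828-49c6-b69e-7631355777c8":
            {"parent": "710f86ca-0381-40c0-bb94-e0dbb6d10c7f"},
        "c79ae80e-f60c-483c-be32-8fa31b659545":
            {"parent": "00000000-0000-0000-0000-000000000000"},
    }

    try:
        list(walk_chain(volumes,
                        top="4d027efa-8828-49c6-b69e-7631355777c8",
                        base="c79ae80e-f60c-483c-be32-8fa31b659545"))
    except VolumeDoesNotExist as e:
        print("Volume does not exist:", e)  # 710f86ca-0381-40c0-bb94-e0dbb6d10c7f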

Comment 5 Benny Zlotnik 2022-07-14 08:35:26 UTC
Albert, can you take a look? It looks like the metadata is not updated for some reason, or at least that is what I think the surface issue is.

Comment 6 Albert Esteve 2022-07-18 10:58:21 UTC
Looking into it.

Comment 7 Albert Esteve 2022-07-20 10:51:34 UTC
I have been able to reproduce the bug. The problem is that, when a volume in a
chain is deleted, the host running the VM updates only the child volume's
metadata, while the child volume's tags are updated only by the SPM, after
the volume is removed.

In between these two events, any operation that retrieves the parent ID of the
child volume can return an outdated UUID. We have posted a patch that should
fix this issue.
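
The linked pull request title ("blockVolume: check parent tag and meta match") suggests the fix validates that the parent recorded in the LV tag agrees with the one in the volume metadata. A rough, hypothetical sketch of that kind of consistency check (not the actual patch; names and the fallback policy are illustrative):

    class InconsistentParentError(Exception):
        pass

    def effective_parent(tag_parent, md_parent, strict=False):
        """Reconcile the parent UUID in the LV tag with the one in metadata.

        The tag is rewritten by the SPM when the parent volume is removed,
        while the metadata is updated by the host running the VM, so the two
        can temporarily disagree (the window described in the comment above).
        """
        if tag_parent == md_parent:
            return md_parent
        if strict:
            # Fail loudly instead of acting on a stale value.
            raise InconsistentParentError(f"tag={tag_parent} metadata={md_parent}")
        # Otherwise prefer the tag, which reflects the SPM-side removal.
        return tag_parent

    # Example with the values from comment 4:
    print(effective_parent(
        tag_parent="c79ae80e-f60c-483c-be32-8fa31b659545",  # updated by the SPM
        md_parent="710f86ca-0381-40c0-bb94-e0dbb6d10c7f",   # stale, already deleted
    ))  # -> c79ae80e-f60c-483c-be32-8fa31b659545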

Comment 8 Ilia Markelov 2022-08-08 14:31:09 UTC
Do you have better verification flow steps for this bug? The main flow is not 100% reproducible, and it is a little unclear from the comments how to reproduce it properly.

Comment 9 Albert Esteve 2022-08-08 15:37:33 UTC
I used this automated stress test: https://github.com/oVirt/ovirt-stress/tree/master/delete-snapshot-chain
The test reproduced the issue consistently. It comes with instructions for setting it up and running it.
Please try it, and if anything is unclear, ping me.

Comment 10 Ilia Markelov 2022-08-10 07:42:09 UTC
Verified.

Snapshots were removed successfully.

Versions:
engine-4.5.2.1-0.1.el8ev
vdsm-4.50.2.2-1.el8ev.x86_64

Comment 11 Sandro Bonazzola 2022-08-30 08:47:42 UTC
This bug is included in the oVirt 4.5.2 release, published on August 10th 2022.
Since the problem described in this bug report should be resolved in the oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.

