Bug 1545026 - Cannot delete snapshot after failed disk move while template copying.
Summary: Cannot delete snapshot after failed disk move while template copying.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.9
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.2
Target Release: ---
Assignee: Eyal Shenitzky
QA Contact: Avihai
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-02-14 05:14 UTC by Germano Veit Michel
Modified: 2019-05-16 13:05 UTC
8 users

Fixed In Version: ovirt-engine-4.2.2.4
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-28 07:27:20 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 88823 0 master MERGED core: add validation for template disk status prior to storage migration 2018-03-13 14:36:33 UTC
oVirt gerrit 88951 0 ovirt-engine-4.2 MERGED core: add validation for template disk status prior to storage migration 2018-03-14 08:53:39 UTC

Description Germano Veit Michel 2018-02-14 05:14:59 UTC
Description of problem:

Start with a VM based on a template (thin)

1. Start the VM
2. Copy template from SD A to SD B
3. While the copy is still running, the engine allows migrating the VM's disk from SD A to SD B. Do it.
4. The disk copy succeeds; the disk move fails.
5. The auto-generated live-move snapshot is left in an illegal state; retrying the deletion doesn't work, and snapshotting again fails as well.

Version-Release number of selected component (if applicable):
rhevm-4.1.9.2-0.1.el7.noarch
vdsm-4.19.31-1.el7ev.x86_64

How reproducible:
Did not try again yet.

Steps to Reproduce:
1. Have 2 Storage Domains, A and B
2. Create Template on A
3. Create VM based on template (thin) on A
4. Copy template to B
5. While step 4 is still running, live migrate the VM's disk from SD A to SD B.

Actual results:
Live move fails, leaving an illegal snapshot in the chain; retrying the deletion doesn't work, and a cold retry also fails.

Expected results:
Either don't allow steps 4 and 5 to run at the same time, or have the operation succeed.

Additional info:

* germano-he1 disk lives on rhevh5-nfs, it's based on RHEL-H-7.4-template which is also on rhevh5-nfs.

1. Start copy template
Feb 14, 2018 2:22:46 PM User admin@internal is copying disk RHEL-H-7.4-template to domain rhevh6-nfs.

2. Start live disk move
Feb 14, 2018 2:24:56 PM Snapshot 'Auto-generated for Live Storage Migration' creation for VM 'germano-he1' was initiated by admin@internal.
Feb 14, 2018 2:26:15 PM Snapshot 'Auto-generated for Live Storage Migration' creation for VM 'germano-he1' has been completed.
Feb 14, 2018 2:26:17 PM User admin@internal moving disk RHEL-H-7.4-template to domain rhevh6-nfs.

3. Copy finishes, live move still going...
Feb 14, 2018 2:26:34 PM User admin@internal finished copying disk RHEL-H-7.4-template to domain rhevh6-nfs.

4. Live move fails, so does snapshot remove
Feb 14, 2018 2:35:03 PM User admin@internal have failed to move disk RHEL-H-7.4-template to domain rhevh6-nfs.
Feb 14, 2018 2:35:03 PM Snapshot 'Auto-generated for Live Storage Migration' deletion for VM 'germano-he1' was initiated by admin@internal.
Feb 14, 2018 2:35:13 PM Failed to delete snapshot 'Auto-generated for Live Storage Migration' for VM 'germano-he1'.

5. Retry remove:
Feb 14, 2018 2:36:25 PM Failed to delete snapshot 'Auto-generated for Live Storage Migration' for VM 'germano-he1'.
Feb 14, 2018 2:37:05 PM Failed to delete snapshot 'Auto-generated for Live Storage Migration' for VM 'germano-he1'.

6. Power off and try cold:
Feb 14, 2018 2:41:54 PM VM germano-he1 powered off by admin@internal (Host: rhevh7).
Feb 14, 2018 2:42:22 PM Failed to delete snapshot 'Auto-generated for Live Storage Migration' for VM 'germano-he1'.

Not really sure if this is actually supposed to be allowed, but the live move failed here:

2018-02-14 14:35:01,028+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (DefaultQuartzScheduler7) [217aa54b-3a2e-4d2e-b46c-f34c294b7a84] START, VmReplicateDiskFinishVDSCommand(HostName = rhevh7, VmReplicateDiskParameters:{runAsync='true', hostId='cc54ddf1-507e-443e-a688-06e37290d2f0', vmId='2fcf1180-1193-4d9e-b432-d9e48885e195', storagePoolId='8922eadb-09a6-4a42-88ca-e6298e95b605', srcStorageDomainId='f50a1e6e-5b88-4d1d-ab44-0c0b2bb804f8', targetStorageDomainId='a22db68a-00e5-43e2-afd1-b42e5689629f', imageGroupId='045b89e9-23c6-4e64-86db-aef96478a008', imageId='1a3520e4-b8c1-48d9-a52b-84c15cecbbb7'}), log id: 3b181270

2018-02-14 14:35:01,800+10 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler7) [217aa54b-3a2e-4d2e-b46c-f34c294b7a84] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: VDSM rhevh7 command VmReplicateDiskFinishVDS failed: Resource unavailable

VDSM side:

2018-02-14 14:35:01,510+1000 ERROR (jsonrpc/1) [virt.vm] (vmId='2fcf1180-1193-4d9e-b432-d9e48885e195') Replication job unfinished (drive: 'vda', srcDisk: {u'device': u'disk', u'poolID': u'8922eadb-09a6-4a42-88ca-e6298e95b605', u'volumeID': u'1a3520e4-b8c1-48d9-a52b-84c15cecbbb7', u'domainID': u'f50a1e6e-5b88-4d1d-ab44-0c0b2bb804f8', u'imageID': u'045b89e9-23c6-4e64-86db-aef96478a008'}, job: {'end': 1097203712L, 'bandwidth': 0L, 'type': 2, 'cur': 1092878336L}) (vm:3839)
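
The job dict in that VDSM error is libvirt's block-job status for the drive: 'cur' is bytes processed and 'end' is the total size, so cur < end means the mirror is not yet in sync and VDSM refuses to finish the replication. A minimal sketch of the same check, using the libvirt Python bindings that VDSM builds on (connection URI and domain name are illustrative):

import libvirt

# Connect to the hypervisor (URI is illustrative).
conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('germano-he1')  # the VM from this report

# blockJobInfo() returns the same cur/end/bandwidth/type dict seen
# in the VDSM log above, or an empty dict when no job is active.
info = dom.blockJobInfo('vda', 0)
if info and info['cur'] < info['end']:
    # Mirror not in sync yet: finishing the replication now fails,
    # which VDSM reports as "Replication job unfinished".
    print('copy still in progress: %d / %d bytes' % (info['cur'], info['end']))
else:
    print('mirror in sync (or no active job)')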


Then the Live Merge failed like this:

2018-02-14 14:35:08,080+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-5) [217aa54b-3a2e-4d2e-b46c-f34c294b7a84] FINISH, MergeVDSCommand, log id: 2af67cab
2018-02-14 14:35:08,080+10 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-5-thread-5) [217aa54b-3a2e-4d2e-b46c-f34c294b7a84] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52 (Failed with error mergeErr and code 52)
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:118) [bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2170) [bll.jar:]
        at org.ovirt.engine.core.bll.MergeCommand.executeCommand(MergeCommand.java:45) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1255) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1395) [bll.jar:]

2018-02-14 14:35:10,103+10 ERROR [org.ovirt.engine.core.bll.MergeStatusCommand] (pool-5-thread-6) [217aa54b-3a2e-4d2e-b46c-f34c294b7a84] Failed to live merge, still in volume chain: [1a3520e4-b8c1-48d9-a52b-84c15cecbbb7, f1790894-c977-4878-a0d6-3e8d16faf41a]

VDSM side:

2018-02-14 14:35:07,016+1000 ERROR (jsonrpc/3) [virt.vm] (vmId='2fcf1180-1193-4d9e-b432-d9e48885e195') Live merge failed (job: f7c40e44-ddcf-4691-9f8b-26ff176395bb) (vm:4926)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 4924, in merge
    bandwidth, flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 1006, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 678, in blockCommit
    if ret == -1: raise libvirtError ('virDomainBlockCommit() failed', dom=self)
libvirtError: block copy still active: disk 'vda' already in active block job
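
libvirt permits only one active block job per disk, so the blockCommit() issued for the live merge is rejected while the unfinished block copy (the replication above) still owns 'vda'. A minimal sketch of that constraint with the libvirt Python bindings (domain name and the recovery choice are illustrative; in practice VDSM and the engine manage these jobs):

import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('germano-he1')  # the VM from this report

# Only one block job may own a disk at a time, so check first.
if dom.blockJobInfo('vda', 0):
    # A copy/mirror job is still active; calling blockCommit() now
    # raises libvirtError: "disk 'vda' already in active block job".
    # One possible recovery is to cancel the stale copy job:
    dom.blockJobAbort('vda', 0)

# With no competing job, an active-layer commit can proceed.
dom.blockCommit('vda', None, None, 0,
                libvirt.VIR_DOMAIN_BLOCK_COMMIT_ACTIVE)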

Comment 4 Allon Mureinik 2018-02-14 12:23:42 UTC
IMHO the problem is in step 3 - this should be blocked until the copy of the template's disk is done
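
That is the approach the linked gerrit patches (88823, 88951) took: validate the template disk's status before allowing the storage migration. The real fix is Java code in ovirt-engine's validators; the sketch below only illustrates the idea in Python, and every name in it is hypothetical:

# Hypothetical sketch of the validation idea (the actual fix is
# Java code in ovirt-engine, gerrit 88823).

class ValidationError(Exception):
    pass

OK, LOCKED, ILLEGAL = 'OK', 'LOCKED', 'ILLEGAL'  # image statuses

def validate_live_storage_migration(disk, lookup_template_disk):
    """Refuse to move a thin-provisioned disk while the template
    image it is based on is not in the OK state (e.g. still copying)."""
    template_disk = lookup_template_disk(disk)  # hypothetical lookup
    if template_disk is not None and template_disk.status != OK:
        raise ValidationError(
            'Cannot move disk %s: template disk %s is %s'
            % (disk.id, template_disk.id, template_disk.status))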

Comment 5 RHV bug bot 2018-03-16 15:03:04 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 6 Avihai 2018-03-18 09:31:08 UTC
Verified at 4.2.2.4-0.1.el7

Comment 7 Franta Kust 2019-05-16 13:05:53 UTC
BZ<2>Jira Resync

