Bug 1833780 - Live storage migration failed - Failed to change disk image
Summary: Live storage migration failed - Failed to change disk image
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.4.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.4
Target Release: 4.4.4.2
Assignee: Benny Zlotnik
QA Contact: Evelina Shames
URL:
Whiteboard:
Duplicates: 1788464, 1899360
Depends On:
Blocks:
 
Reported: 2020-05-10 14:48 UTC by Evelina Shames
Modified: 2023-09-15 01:29 UTC
CC List: 12 users

Fixed In Version: vdsm-4.40.38 ovirt-engine-4.4.4.2
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-21 12:36:27 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.4+
aoconnor: blocker-


Attachments
Logs (7.31 MB, application/zip)
2020-05-10 14:48 UTC, Evelina Shames
no flags Details
Logs (6.82 MB, application/zip)
2020-06-02 10:05 UTC, Evelina Shames
no flags Details
Libvirt-logs (9.78 MB, application/zip)
2020-06-02 10:32 UTC, Evelina Shames
no flags Details
vdsm-logs (10.23 MB, application/zip)
2020-06-02 11:02 UTC, Evelina Shames
no flags Details
vdsm-logs (15.13 MB, application/zip)
2020-06-02 11:03 UTC, Evelina Shames
no flags Details
vdsm-logs (10.99 MB, application/zip)
2020-06-02 11:04 UTC, Evelina Shames
no flags Details
Logs_debug (5.14 MB, application/zip)
2020-09-29 13:00 UTC, Evelina Shames
no flags Details
full_logs_debug (3.88 MB, application/zip)
2020-09-29 15:16 UTC, Evelina Shames
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5585691 0 None None None 2021-03-14 06:40:06 UTC
oVirt gerrit 111859 0 master MERGED core: introduce prepared parameter for copy data endpoints 2021-01-10 09:38:54 UTC
oVirt gerrit 111869 0 master MERGED storage: add prepared paramter to CopyDataEndpoint 2021-01-10 09:38:57 UTC
oVirt gerrit 112236 0 master MERGED core: propagate vds id 2021-01-10 09:39:34 UTC

Description Evelina Shames 2020-05-10 14:48:34 UTC
Created attachment 1687022 [details]
Logs

Description of problem:
Live disk migration failed with the following errors:

Engine Log:
2020-05-09 08:38:52,180+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [disks_syncAction_68296fc9-aa35-4344] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_3 command VmReplicateDiskFinishVDS failed: Failed to change disk image
2020-05-09 08:38:52,185+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [disks_syncAction_68296fc9-aa35-4344] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand' return value 'StatusOnlyReturn [status=Status [code=41, message=Failed to change disk image]]'
2020-05-09 08:38:52,185+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [disks_syncAction_68296fc9-aa35-4344] HostName = host_mixed_3
2020-05-09 08:38:52,185+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [disks_syncAction_68296fc9-aa35-4344] Command 'VmReplicateDiskFinishVDSCommand(HostName = host_mixed_3, VmReplicateDiskParameters:{hostId='6160bd12-6402-4ac3-9778-35309d1e1a41', vmId='eee87a74-65a2-44f5-8de0-b3d177f09786', storagePoolId='d6a7b010-de07-4903-8785-cddb7935de83', srcStorageDomainId='cdacab39-079b-4c6d-9810-1f276f9db490', targetStorageDomainId='93bacf81-4b0c-4f73-8662-8ec4981930c4', imageGroupId='a8bf807d-1cb3-40be-8712-d088991d1ddf', imageId='2e465545-58e8-46ee-a226-0a9b1845be3c'})' execution failed: VDSGenericException: VDSErrorException: Failed in vdscommand to VmReplicateDiskFinishVDS, error = Failed to change disk image
2020-05-09 08:38:52,185+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [disks_syncAction_68296fc9-aa35-4344] FINISH, VmReplicateDiskFinishVDSCommand, return: , log id: 77ac87f9
2020-05-09 08:38:52,192+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [disks_syncAction_68296fc9-aa35-4344] Failed VmReplicateDiskFinish (Disk 'a8bf807d-1cb3-40be-8712-d088991d1ddf', VM 'eee87a74-65a2-44f5-8de0-b3d177f09786')
2020-05-09 08:38:52,193+03 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [disks_syncAction_68296fc9-aa35-4344] Command 'LiveMigrateDisk' id: 'cb38c68a-4998-4d3d-ae7c-c1766869e4da' with children [7b481b87-9080-4340-b447-c4c789271b2e, c5bebd47-06ad-4fd4-b035-39eca92da372, a4758288-af08-4c7f-8716-3229ae1e2847] failed when attempting to perform the next operation, marking as 'ACTIVE'
2020-05-09 08:38:52,193+03 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [disks_syncAction_68296fc9-aa35-4344] EngineException: Failed to change disk image (Failed with error FAILED_CHANGE_CD_IS_MOUNTED and code 41): org.ovirt.engine.core.common.errors.EngineException: EngineException: Failed to change disk image (Failed with error FAILED_CHANGE_CD_IS_MOUNTED and code 41)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.replicateDiskFinish(LiveMigrateDiskCommand.java:434)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.completeLiveMigration(LiveMigrateDiskCommand.java:398)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.performNextOperation(LiveMigrateDiskCommand.java:247)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback.childCommandsExecutionEnded(SerialChildCommandsExecutionCallback.java:32)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.ChildCommandsCallbackBase.doPolling(ChildCommandsCallbackBase.java:80)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:175)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
	at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383)
	at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
	at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250)

2020-05-09 08:38:52,193+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [disks_syncAction_68296fc9-aa35-4344] Command 'LiveMigrateDisk' id: 'cb38c68a-4998-4d3d-ae7c-c1766869e4da' child commands '[7b481b87-9080-4340-b447-c4c789271b2e, c5bebd47-06ad-4fd4-b035-39eca92da372, a4758288-af08-4c7f-8716-3229ae1e2847]' executions were completed, status 'FAILED'
2020-05-09 08:38:53,240+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-100) [disks_syncAction_68296fc9-aa35-4344] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.

2020-05-09 08:38:53,240+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-100) [disks_syncAction_68296fc9-aa35-4344] Failed during live storage migration of disk 'a8bf807d-1cb3-40be-8712-d088991d1ddf' of vm 'eee87a74-65a2-44f5-8de0-b3d177f09786', attempting to end replication before deleting the target disk
2020-05-09 08:38:53,777+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-100) [disks_syncAction_68296fc9-aa35-4344] Attempting to delete the target of disk 'a8bf807d-1cb3-40be-8712-d088991d1ddf' of vm 'eee87a74-65a2-44f5-8de0-b3d177f09786'
2020-05-09 08:38:54,334+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-100) [] EVENT_ID: USER_MOVED_DISK_FINISHED_FAILURE(2,011), User admin@internal-authz have failed to move disk vm_TestCase6004_0907391533_Disk_0 to domain iscsi_0.

VDSM Log:

2020-05-09 08:38:52,170+0300 ERROR (jsonrpc/3) [virt.vm] (vmId='eee87a74-65a2-44f5-8de0-b3d177f09786') Unable to stop the replication for the drive: vda (vm:4101)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 4098, in diskReplicateFinish
    self._dom.blockJobAbort(drive.name, blockJobFlags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 888, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirt.libvirtError: internal error: unable to execute QEMU command 'blockdev-add': Could not open '/rhev/data-center/mnt/blockSD/93bacf81-4b0c-4f73-8662-8ec4981930c4/images/a8bf807d-1cb3-40be-8712-d088991d1ddf/6f48ea77-944e-4fd0-88d3-313558444ccd': No such device or address
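
For reference, the replication is finished by pivoting the libvirt block job, which is the call that fails above. A minimal standalone sketch of that call, assuming libvirt-python; the VM name and drive alias are illustrative and not taken from the attached logs:

import libvirt

# Hedged sketch of the pivot performed at the end of live storage migration.
conn = libvirt.open('qemu:///system')
try:
    dom = conn.lookupByName('example-vm')  # hypothetical VM name
    # VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT switches the guest to the copy on the
    # destination storage domain; without the flag the copy is cancelled and
    # the guest keeps using the source volume.
    dom.blockJobAbort('vda', libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
except libvirt.libvirtError as e:
    # In this bug the pivot fails because QEMU's blockdev-add cannot open the
    # destination volume path on the host ("No such device or address").
    print('pivot failed:', e)
finally:
    conn.close()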


Version-Release number of selected component (if applicable):
ovirt-engine-4.4.0-0.33.master.el8ev.noarch
vdsm-4.40.14-1.el8ev.x86_64
libvirt-6.0.0-17.module+el8.2.0+6257+0d066c28.x86_64


How reproducible:
Once

Steps to Reproduce:
1. Create a VM with a disk on an NFS domain.
2. Run the VM.
3. Move the disk to an iSCSI domain.

Actual results:
Operation failed.

Expected results:
Operation should succeed.

Additional info:
Logs are attached.

Comment 1 Tal Nisan 2020-05-11 12:10:13 UTC
Looks similar to bug 1597019 and bug 1788464. Benny, please note they might be duplicates.

Comment 2 Evelina Shames 2020-06-02 10:05:31 UTC
Created attachment 1694377 [details]
Logs

Happened again in our automation runs.
ovirt-engine-4.4.1.1-0.5.el8ev.noarch
vdsm-4.40.18-1.el8ev.x86_64

Attaching logs.

Comment 3 Benny Zlotnik 2020-06-02 10:07:17 UTC
(In reply to Evelina Shames from comment #2)
> Created attachment 1694377 [details]
> Logs
> 
> Happened again in our automation runs.
> ovirt-engine-4.4.1.1-0.5.el8ev.noarch
> vdsm-4.40.18-1.el8ev.x86_64
> 
> Attaching logs.

please provide libvirt and qemu logs as well

Comment 4 Evelina Shames 2020-06-02 10:32:22 UTC
Created attachment 1694379 [details]
Libvirt-logs

Comment 5 Benny Zlotnik 2020-06-02 10:55:26 UTC
the vdsm log does not contain the error

Comment 6 Evelina Shames 2020-06-02 11:02:10 UTC
Created attachment 1694384 [details]
vdsm-logs

(In reply to Benny Zlotnik from comment #5)
> the vdsm log does not contain the error

I'll add all the vdsm logs we have; the environment is not available anymore.

Comment 7 Evelina Shames 2020-06-02 11:03:48 UTC
Created attachment 1694385 [details]
vdsm-logs

Comment 8 Evelina Shames 2020-06-02 11:04:48 UTC
Created attachment 1694386 [details]
vdsm-logs

Comment 9 Evelina Shames 2020-09-21 12:36:12 UTC
Lately we have many tests failing on this.

How reproducible:
~50%

Reproduction steps I managed to reproduce the issue with:
1. Create a VM from template (RHEL 8.2)
2. Create 4 disks and add them to the VM:
- VirtIO-SCSI Thin
- VirtIO-SCSI Preallocated
- VirtIO Thin
- VirtIO Preallocated
3. Run the VM
4. Move all the VM disks concurrently to a different iSCSI storage domain.
--> Do this several times, one of the attempts will fail.
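
For reference, a hedged sketch of how step 4 could be scripted with the oVirt Python SDK, assuming ovirtsdk4 and its disk "move" action; the engine URL, credentials, VM name and target storage domain name are placeholders:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Hedged sketch: move all disks of a running VM to another storage domain
# concurrently. Connection details and names are placeholders.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,
)
try:
    system = connection.system_service()
    vms_service = system.vms_service()
    vm = vms_service.list(search='name=my-vm')[0]
    attachments = vms_service.vm_service(vm.id).disk_attachments_service().list()

    disks_service = system.disks_service()
    for attachment in attachments:
        # The move action is asynchronous on the engine side, so issuing it
        # for each disk in a loop starts the migrations concurrently.
        disks_service.disk_service(attachment.disk.id).move(
            storage_domain=types.StorageDomain(name='iscsi_1'),
        )
finally:
    connection.close()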

Comment 10 RHEL Program Management 2020-09-21 12:36:17 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 11 Michal Skrivanek 2020-09-25 09:41:21 UTC
does this still reproduce with the same frequency in 4.4.3/8.3?

Comment 12 Evelina Shames 2020-09-29 12:59:23 UTC
(In reply to Evelina Shames from comment #9)
> Lately we have many tests failing on this.
> 
> How reproducible:
> ~50%
> 
> Reproduction steps that I managed to reproduce the issue with:
> 1. Create a VM from template (RHEL 8.2)
> 2. Create 4 disks and add them to the VM:
> - VirtIO-SCSI Thin
> - VirtIO-SCSI Preallocated
> - VirtIO Thin
> - VirtIO Preallocated
> 3. Run the VM
> 4. Move all the VM disks concurrently to different iSCSI storage domain.
> --> Do this several times, one of the attempts will fail.

Reproduced again with these steps on:
vdsm-4.40.31-1.el8ev.x86_64
ovirt-engine-4.4.3.3-0.19.el8ev

Attaching engine, vdsm, libvirt and qemu logs.
Let me know if you need anything else.

Comment 13 Evelina Shames 2020-09-29 13:00:37 UTC
Created attachment 1717528 [details]
Logs_debug

Comment 14 Benny Zlotnik 2020-09-29 14:57:08 UTC
(In reply to Evelina Shames from comment #13)
> Created attachment 1717528 [details]
> Logs_debug

Are these the right logs?
I only see this error:
VolumeImageHasChildren: Cannot delete volume which has children (non-ethical): [u'sdUUID: b182c581-4be4-4e5c-a65d-eebad47a6c23', u'imgUUID: b9a07815-d165-4a9d-87b3-e439d8e03ac8', u'volUUID: e2ba36c4-95cf-4589-9545-9659001e501f']

which is not related, and there are no libvirt and qemu logs

Comment 15 Evelina Shames 2020-09-29 15:16:25 UTC
Created attachment 1717572 [details]
full_logs_debug

Comment 16 Evelina Shames 2020-09-29 15:17:24 UTC
(In reply to Benny Zlotnik from comment #14)
> (In reply to Evelina Shames from comment #13)
> > Created attachment 1717528 [details]
> > Logs_debug
> 
> Are these the right logs?
> I only see this error:
> VolumeImageHasChildren: Cannot delete volume which has children
> (non-ethical): [u'sdUUID: b182c581-4be4-4e5c-a65d-eebad47a6c23', u'imgUUID:
> b9a07815-d165-4a9d-87b3-e439d8e03ac8', u'volUUID:
> e2ba36c4-95cf-4589-9545-9659001e501f']
> 
> which is not related, and there are libvirt and qemu logs

Oh, sorry...
I attached the right logs this time.

Comment 17 Benny Zlotnik 2020-10-08 12:47:51 UTC
I see the error is:
2020-09-29 15:44:31,784+0300 ERROR (jsonrpc/7) [virt.vm] (vmId='f3e3bba8-7bf8-4965-95a2-450d910af819') Unable to stop the replication for the drive: sdb (vm:4159)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 4156, in diskReplicateFinish
    self._dom.blockJobAbort(drive.name, blockJobFlags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 888, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirt.libvirtError: internal error: unable to execute QEMU command 'blockdev-add': Image is not in qcow2 format

This is different from the original error in this bug.

The most recent getVolumeInfo response I found for the relevant volume shows that it is qcow2 (which it has to be, since it's a snapshot):
2020-09-29 15:44:38,373+0300 INFO  (jsonrpc/2) [vdsm.api] START getVolumeInfo(sdUUID='98bce80e-1a06-4322-9f47-a83573e84cd4', spUUID='db771b4e-c7c3-4593-ae63-ce6c946604c2', imgUUID='8619d42c-e4b4-47a8-8ae7-7e780a079127', volUUID='a8161202-7494-46b3-beae-7fb763b5a855', options=None) from=::ffff:10.35.162.6,38426, flow_id=1d10eac4-e3bb-492e-a573-76f0280cf1bc, task_id=0a3b7feb-4b6b-48d2-864f-f8ad78b815c7 (api:48)

2020-09-29 15:44:38,411+0300 INFO  (jsonrpc/2) [storage.VolumeManifest] 98bce80e-1a06-4322-9f47-a83573e84cd4/8619d42c-e4b4-47a8-8ae7-7e780a079127/a8161202-7494-46b3-beae-7fb763b5a855 info is {'uuid': 'a8161202-7494-46b3-beae-7fb763b5a855', 'type': 'SPARSE', 'format': 'COW', 'disktype': 'DATA', 'voltype': 'LEAF', 'capacity': '5368709120', 'parent': 'fa566c2d-c807-4b6e-bb49-f68e8dcdc929', 'description': '', 'pool': '', 'domain': '98bce80e-1a06-4322-9f47-a83573e84cd4', 'image': '8619d42c-e4b4-47a8-8ae7-7e780a079127', 'ctime': '1601383375', 'mtime': '0', 'legality': 'LEGAL', 'generation': 0, 'apparentsize': '3221225472', 'truesize': '3221225472', 'status': 'OK', 'lease': {'path': '/dev/98bce80e-1a06-4322-9f47-a83573e84cd4/leases', 'offset': 116391936, 'owners': [], 'version': None}, 'children': []} (volume:277)

So I'm not sure if it's the original issue or something else. It could be an issue with libvirt.

Evelina, is it possible to abort the operation via automation when this is hit so the volume is not removed, allowing us to analyze it with `qemu-img info` and additional tools?
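
For reference, a hedged sketch of the kind of check that could be run on a preserved volume; the path is a placeholder following the /rhev layout seen in the logs, not a real volume from this bug:

import json
import subprocess

# Hedged sketch: inspect a suspect snapshot volume with qemu-img before it is
# removed. The path below is a placeholder.
path = '/rhev/data-center/mnt/blockSD/<sd_uuid>/images/<img_uuid>/<vol_uuid>'
out = subprocess.check_output(
    ['qemu-img', 'info', '--output=json', path]
).decode('utf-8')
info = json.loads(out)
# A snapshot volume is expected to be qcow2 with a backing file pointing to
# its parent volume.
print(info['format'])
print(info.get('backing-filename'))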

Comment 18 Evelina Shames 2020-10-11 04:55:43 UTC
(In reply to Benny Zlotnik from comment #17)
> I see the error is:
> 2020-09-29 15:44:31,784+0300 ERROR (jsonrpc/7) [virt.vm]
> (vmId='f3e3bba8-7bf8-4965-95a2-450d910af819') Unable to stop the replication
> for the drive: sdb (vm:4159)
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 4156, in
> diskReplicateFinish
>     self._dom.blockJobAbort(drive.name, blockJobFlags)
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101,
> in f
>     ret = attr(*args, **kwargs)
>   File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py",
> line 131, in wrapper
>     ret = f(*args, **kwargs)
>   File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94,
> in wrapper
>     return func(inst, *args, **kwargs)
>   File "/usr/lib64/python3.6/site-packages/libvirt.py", line 888, in
> blockJobAbort
>     if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed',
> dom=self)
> libvirt.libvirtError: internal error: unable to execute QEMU command
> 'blockdev-add': Image is not in qcow2 format
> 
> Which is different than the original error in the bug
> 
> The most recent getInfo response I found on the relevant volume show that it
> is qcow2 (which it has to be since it's a snapshot)
> 2020-09-29 15:44:38,373+0300 INFO  (jsonrpc/2) [vdsm.api] START
> getVolumeInfo(sdUUID='98bce80e-1a06-4322-9f47-a83573e84cd4',
> spUUID='db771b4e-c7c3-4593-ae63-ce6c946604c2',
> imgUUID='8619d42c-e4b4-47a8-8ae7-7e780a079127',
> volUUID='a8161202-7494-46b3-beae-7fb763b5a855', options=None)
> from=::ffff:10.35.162.6,38426, flow_id=1d10eac4-e3bb-492e-a573-76f0280cf1bc,
> task_id=0a3b7feb-4b6b-48d2-864f-f8ad78b815c7 (api:48)
> 
> 2020-09-29 15:44:38,411+0300 INFO  (jsonrpc/2) [storage.VolumeManifest]
> 98bce80e-1a06-4322-9f47-a83573e84cd4/8619d42c-e4b4-47a8-8ae7-7e780a079127/
> a8161202-7494-46b3-beae-7fb763b5a855 info is {'uuid':
> 'a8161202-7494-46b3-beae-7fb763b5a855', 'type': 'SPARSE', 'format': 'COW',
> 'disktype': 'DATA', 'voltype': 'LEAF', 'capacity': '5368709120', 'parent':
> 'fa566c2d-c807-4b6e-bb49-f68e8dcdc929', 'description': '', 'pool': '',
> 'domain': '98bce80e-1a06-4322-9f47-a83573e84cd4', 'image':
> '8619d42c-e4b4-47a8-8ae7-7e780a079127', 'ctime': '1601383375', 'mtime': '0',
> 'legality': 'LEGAL', 'generation': 0, 'apparentsize': '3221225472',
> 'truesize': '3221225472', 'status': 'OK', 'lease': {'path':
> '/dev/98bce80e-1a06-4322-9f47-a83573e84cd4/leases', 'offset': 116391936,
> 'owners': [], 'version': None}, 'children': []} (volume:277)
> 
> So I'm not sure if it's the original issue or something else. It could be an
> issue with libvirt.
> 
> Evelina, is it possible to abort the operation via automation when this is
> hit so the volume is not removed, allowing us to analyze it with `qemu-img
> info` and additional tools?
I'll try to do that and let you know.

Comment 23 Benny Zlotnik 2020-10-21 18:43:05 UTC
Summary of findings so far:
1. The root cause is still unclear. A possible workaround (which is needed anyway) is to not perform teardown when running copy_data as part of LSM; a conceptual sketch of this follows at the end of this comment.
This works when copy_data and the VM run on the same host; I believe all the failures observed in this bug happen in this layout.

2. Another issue I found while forcing copy_data and the VM to run on different hosts is that a first successful LSM is followed by a failure when moving back to the original SD. This failure seems to be caused by an incorrect read of the snapshot's backing file when starting blockCopy (the failure in item 1 happens when pivoting), and looks like this:
2020-10-21 14:06:57,501+0300 INFO  (jsonrpc/7) [virt.vm] (vmId='0bf6345f-5128-4803-bde2-a34c1b8f5b39') failed start drive replication libvirtError("Cannot access backing file '/rhev/data-center/mnt/blockSD/bc024016-bea8-4e0c-af1f-1a24c93b170e/images/03d348df-19f6-4737-883c-7a6da09607eb/5a8bd998-3002-4f4d-97b7-f30e553bc233' of storage file '/rhev/data-center/mnt/blockSD/bc024016-bea8-4e0c-af1f-1a24c93b170e/images/03d348df-19f6-4737-883c-7a6da09607eb/19b8ecda-69a9-480a-98ae-77a0898ce011' (as uid:107, gid:107): No such file or directory",) (vm:4065)

However, 5a8bd998-3002-4f4d-97b7-f30e553bc233 is not the backing file of 19b8ecda-69a9-480a-98ae-77a0898ce011; it is the backing file of another volume in another disk participating in the LSM. `qemu-img info 19b8ecda-69a9-480a-98ae-77a0898ce011` shows the correct backing file, as does `virsh dumpxml`.

I am currently investigating why an incorrect backing file is seen and implementing the workaround.
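
To illustrate the workaround idea in item 1: a conceptual sketch only, showing the "prepared" notion from the linked gerrit patches (skip preparing and tearing down an endpoint that is already in use). The class and function names are illustrative and this is not the actual vdsm code:

class Endpoint:
    def __init__(self, path, prepared=False):
        self.path = path
        # True when the volume is already active, e.g. because the VM uses it
        # during LSM on the same host.
        self.prepared = prepared

    def prepare(self):
        print('preparing', self.path)

    def teardown(self):
        print('tearing down', self.path)


def copy_data(source, destination, copy):
    """Prepare only endpoints that are not already prepared, and tear down
    only what this operation prepared, so a volume in use by the VM is never
    deactivated behind its back."""
    prepared_here = []
    try:
        for endpoint in (source, destination):
            if not endpoint.prepared:
                endpoint.prepare()
                prepared_here.append(endpoint)
        copy(source, destination)
    finally:
        for endpoint in reversed(prepared_here):
            endpoint.teardown()


# Example: the source is already prepared by the running VM, so only the
# destination gets prepared and torn down here.
copy_data(
    Endpoint('/rhev/.../src-volume', prepared=True),
    Endpoint('/rhev/.../dst-volume'),
    lambda src, dst: print('copying', src.path, '->', dst.path),
)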

Comment 24 Nir Soffer 2020-10-27 15:13:16 UTC
Should we deliver this in 4.4.3? This is a regression compared with earlier versions.

Comment 28 RHEL Program Management 2020-11-15 14:45:54 UTC
Target release should be set once a package build is known to fix the issue. Since this bug is not in MODIFIED status, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 29 Benny Zlotnik 2020-11-23 15:37:20 UTC
*** Bug 1788464 has been marked as a duplicate of this bug. ***

Comment 30 Evelina Shames 2020-12-03 08:59:01 UTC
(In reply to Evelina Shames from comment #9)
> Lately we have many tests failing on this.
> 
> How reproducible:
> ~50%
> 
> Reproduction steps that I managed to reproduce the issue with:
> 1. Create a VM from template (RHEL 8.2)
> 2. Create 4 disks and add them to the VM:
> - VirtIO-SCSI Thin
> - VirtIO-SCSI Preallocated
> - VirtIO Thin
> - VirtIO Preallocated
> 3. Run the VM
> 4. Move all the VM disks concurrently to different iSCSI storage domain.
> --> Do this several times, one of the attempts will fail.

Verified on rhv-4.4.4-3 with the above steps several times, and we didn't see it in our last automation runs.

Moving to 'Verified'.

Comment 31 Sandro Bonazzola 2020-12-21 12:36:27 UTC
This bug is included in the oVirt 4.4.4 release, published on December 21st 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 32 Eyal Shenitzky 2021-03-14 06:40:06 UTC
*** Bug 1899360 has been marked as a duplicate of this bug. ***

Comment 33 Red Hat Bugzilla 2023-09-15 01:29:34 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days.

