Created attachment 1455763 [details]
logs

Description of problem:
Live storage migration fails in the diskReplicateFinish phase with a libvirtError.

Version-Release number of selected component (if applicable):
vdsm-4.20.32-1.el7ev.x86_64
libvirt-daemon-3.9.0-14.el7_5.6.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64

How reproducible:
Happened once, after the scenario below.

Steps to Reproduce:
1. Create a VM from a template with 1 disk as a thin copy
2. Create a template from the VM - temp1
3. Create a second template from the VM - temp2
4. Create a VM from temp1 - vm1
5. Create a VM from temp2 - vm2
6. Start vm2
7. Migrate vm2's disk

Actual results:
The prepareImage task in vdsm fails with:

2018-06-29 15:04:31,211+0300 DEBUG (jsonrpc/3) [storage.TaskManager.Task] (Task='952832ae-065d-4d36-9f7f-aa278954c611') ref 0 aborting False (task:1002)
2018-06-29 15:04:31,212+0300 INFO (jsonrpc/3) [vds] prepared volume path: /rhev/data-center/mnt/blockSD/650b75f7-b1c8-4e55-b3c3-4eaec4c17a83/images/fee6faec-ebfc-440a-8e88-84fe4ecf68d1/8e7b9ab2-9d49-46b0-b822-a2e6f0852a26 (clientIF:497)
2018-06-29 15:04:31,212+0300 INFO (jsonrpc/3) [virt.vm] (vmId='a09ab6e7-57bf-4010-b616-003df33ff5b5') Disabling drive monitoring (drivemonitor:58)
2018-06-29 15:04:31,276+0300 ERROR (jsonrpc/3) [virt.vm] (vmId='a09ab6e7-57bf-4010-b616-003df33ff5b5') Unable to stop the replication for the drive: vda (vm:4750)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4747, in diskReplicateFinish
    self._dom.blockJobAbort(drive.name, blockJobFlags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 773, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirtError: internal error: unable to execute QEMU command 'block-job-complete': Block job 'drive-ua-fee6faec-ebfc-440a-8e88-84fe4ecf68d1' not found

On the engine side, the following exception was logged under the same correlation ID. It refers to a "change CD" failure, which does not appear to be related:

2018-06-29 15:04:31,743+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_3 command VmReplicateDiskFinishVDS failed: Failed to change disk image
2018-06-29 15:04:31,743+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand' return value 'StatusOnlyReturn [status=Status [code=41, message=Failed to change disk image]]'
2018-06-29 15:04:31,743+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] HostName = host_mixed_3
2018-06-29 15:04:31,743+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] Command 'VmReplicateDiskFinishVDSCommand(HostName = host_mixed_3, VmReplicateDiskParameters:{hostId='2177cf70-edbf-40a6-93a3-b73e25f0df54', vmId='a09ab6e7-57bf-4010-b616-003df33ff5b5', storagePoolId='5eb0ed42-7acf-11e8-ae76-00163e7be006', srcStorageDomainId='f54afc10-74ad-4ae5-99b5-32f1ac874c44', targetStorageDomainId='650b75f7-b1c8-4e55-b3c3-4eaec4c17a83', imageGroupId='fee6faec-ebfc-440a-8e88-84fe4ecf68d1', imageId='8e7b9ab2-9d49-46b0-b822-a2e6f0852a26'})' execution failed: VDSGenericException: VDSErrorException: Failed in vdscommand to VmReplicateDiskFinishVDS, error = Failed to change disk image
2018-06-29 15:04:31,743+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] FINISH, VmReplicateDiskFinishVDSCommand, log id: 55cc2a
2018-06-29 15:04:31,756+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] Failed VmReplicateDiskFinish (Disk 'fee6faec-ebfc-440a-8e88-84fe4ecf68d1', VM 'a09ab6e7-57bf-4010-b616-003df33ff5b5')
2018-06-29 15:04:31,756+03 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] Command 'LiveMigrateDisk' id: '61b97882-639b-4e30-aeee-7b10f72c95f3' with children [fce0d999-11c1-43cb-bf9b-1a4392273b9a, 56969072-7530-4f85-b1fe-437aabcf478a, 48b367c6-9318-47ca-884f-0099fb7c5e1d] failed when attempting to perform the next operation, marking as 'ACTIVE'
2018-06-29 15:04:31,757+03 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] EngineException: Failed to change disk image (Failed with error FAILED_CHANGE_CD_IS_MOUNTED and code 41): org.ovirt.engine.core.common.errors.EngineException: EngineException: Failed to change disk image (Failed with error FAILED_CHANGE_CD_IS_MOUNTED and code 41)
    at org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.replicateDiskFinish(LiveMigrateDiskCommand.java:402) [bll.jar:]
    at org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.completeLiveMigration(LiveMigrateDiskCommand.java:374) [bll.jar:]
    at org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.performNextOperation(LiveMigrateDiskCommand.java:242) [bll.jar:]
    at org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback.childCommandsExecutionEnded(SerialChildCommandsExecutionCallback.java:32) [bll.jar:]
    at org.ovirt.engine.core.bll.ChildCommandsCallbackBase.doPolling(ChildCommandsCallbackBase.java:68) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:146) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:107) [bll.jar:]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_171]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_171]
    at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
    at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_171]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_171]
    at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_171]
    at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
    at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)

Expected results:
Live migration succeeds.

Additional info:
Logs attached.
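For context, the failing call is vdsm asking libvirt to pivot the block copy (mirror) job over to the destination volume; QEMU's 'block-job-complete' is what libvirt issues under the hood for that pivot. A minimal sketch of the equivalent call with plain libvirt-python (an illustration only, not vdsm's actual wrapper; the domain name is hypothetical):

    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('vm2')  # hypothetical domain name

    # With VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT, blockJobAbort() completes the
    # copy job (switching the VM to the destination image) instead of
    # cancelling it. This is the call that raises the libvirtError above
    # when QEMU no longer knows about the job or the job is not ready.
    try:
        dom.blockJobAbort('vda', libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
    except libvirt.libvirtError as e:
        print('pivot failed: %s' % e)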
Can you attach the VM's libvirt log?
Created attachment 1455937 [details] libvirt logs
Created attachment 1455941 [details] logs_with_qemu
Encountered again (2/2). Raising severity and marking as a regression.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Removing the blocker+ flag, as this is not easily reproduced.
Created attachment 1510634 [details]
logs2

Encountered once again. Logs attached.

2018-11-30 01:20:06,230+0200 ERROR (jsonrpc/6) [virt.vm] (vmId='467c5bf8-f165-4473-bbe7-8d1599bf194e') Unable to stop the replication for the drive: vda (vm:4803)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4800, in diskReplicateFinish
    self._dom.blockJobAbort(drive.name, blockJobFlags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 784, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirtError: internal error: unable to execute QEMU command 'block-job-complete': The active block job 'drive-ua-288a92ab-90f5-4d31-be8b-ea7f2c15c092' cannot be completed

Versions:
vdsm-4.20.44-1.el7ev.x86_64
qemu-img-rhev-2.12.0-18.el7_6.1.x86_64
vdsm-python-4.20.44-1.el7ev.noarch
sanlock: sanlock-3.6.0-1.el7.x86_64
redhat_release: Red Hat Enterprise Linux Server 7.6 (Maipo)
kernel_release: 3.10.0-957.1.3.el7.x86_64
*** Bug 1650981 has been marked as a duplicate of this bug. ***
I think it's the same as this: https://bugzilla.redhat.com/show_bug.cgi?id=1643707, which is currently being investigated by libvirt.

Do we have the "permission denied" message in the latest qemu logs as well?
The last occurrence we saw was on Nov 30, with the latest build available at that time; see the logs in comment #8. I cannot find 'permission denied' in the vdsm logs there. This bug has a downstream clone filed for a customer who hit it (bug 1660265); please raise this question with its reporter.
Seems to be a libvirt issue. Moving to 4.2.9.
Hi Fred,

We will not have a 4.2.9; please move this back to 4.2.8 or 4.3.
We can consider this for an async release once there is a libvirt fix.
None of the libvirt logs attached to the BZ contain the described failure. Do you have any libvirtd debug logs that capture the problem? I'm trying to figure out what's happening in https://bugzilla.redhat.com/show_bug.cgi?id=1643707, and that BZ also does not have any libvirt debug logs.
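For reference, libvirtd debug logs can be collected by setting something like the following in /etc/libvirt/libvirtd.conf and restarting libvirtd (this is the filter set suggested by the libvirt debug-logging documentation; adjust as needed):

    log_filters="3:remote 4:event 3:util.json 3:rpc 1:*"
    log_outputs="1:file:/var/log/libvirt/libvirtd.log"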
Unfortunately, I don't have libvirt logs with the error
I asked Peter about this issue; see https://bugzilla.redhat.com/show_bug.cgi?id=1643707#c38. I think we need to fix the code that checks whether the block job is ready, as explained in that comment.
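To illustrate what checking block job readiness looks like: the usual pattern polls virDomainGetBlockJobInfo and treats cur == end as "mirror in sync, safe to pivot". A minimal sketch of that pattern with plain libvirt-python (not the actual vdsm code; the domain name is hypothetical):

    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('vm2')  # hypothetical domain name

    # blockJobInfo() returns an empty dict when no job exists, otherwise a
    # dict with 'type', 'bandwidth', 'cur' and 'end'. The concern raised in
    # the linked comment is that cur == end is only a heuristic: the
    # authoritative signal that a pivot will succeed is QEMU's "ready"
    # state, surfaced via the VIR_DOMAIN_BLOCK_JOB_READY event, not the
    # progress counters.
    info = dom.blockJobInfo('vda', 0)
    ready = bool(info) and info['cur'] == info['end']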
After discussing this with Benny, we don't think that fixing the way we monitor the block job will fix it. What we really need is libvirt debug logs showing what happens in libvirt when this fails.
Closing as insufficient data due to the missing libvirt logs requested in bug 1643707. Please reopen once the issue is reproduced and the logs can be provided.