+++ This bug was initially created as a clone of Bug #1597019 +++

Description of problem:
Live storage migration fails in the diskReplicateFinish phase with a libvirtError.

Version-Release number of selected component (if applicable):
vdsm-4.20.32-1.el7ev.x86_64
libvirt-daemon-3.9.0-14.el7_5.6.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64

How reproducible:
Happened once

Steps to Reproduce:
Reproduced once after the following scenario:
1. Create a VM from a template with 1 disk as a thin copy
2. Create a template from the VM - temp1
3. Create a second template from the VM - temp2
4. Create a VM from temp1 - vm1
5. Create a VM from temp2 - vm2
6. Start vm2
7. Migrate vm2's disk

Actual results:
The prepareImage task in vdsm fails with:

2018-06-29 15:04:31,211+0300 DEBUG (jsonrpc/3) [storage.TaskManager.Task] (Task='952832ae-065d-4d36-9f7f-aa278954c611') ref 0 aborting False (task:1002)
2018-06-29 15:04:31,212+0300 INFO (jsonrpc/3) [vds] prepared volume path: /rhev/data-center/mnt/blockSD/650b75f7-b1c8-4e55-b3c3-4eaec4c17a83/images/fee6faec-ebfc-440a-8e88-84fe4ecf68d1/8e7b9ab2-9d49-46b0-b822-a2e6f0852a26 (clientIF:497)
2018-06-29 15:04:31,212+0300 INFO (jsonrpc/3) [virt.vm] (vmId='a09ab6e7-57bf-4010-b616-003df33ff5b5') Disabling drive monitoring (drivemonitor:58)
2018-06-29 15:04:31,276+0300 ERROR (jsonrpc/3) [virt.vm] (vmId='a09ab6e7-57bf-4010-b616-003df33ff5b5') Unable to stop the replication for the drive: vda (vm:4750)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4747, in diskReplicateFinish
    self._dom.blockJobAbort(drive.name, blockJobFlags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 773, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirtError: internal error: unable to execute QEMU command 'block-job-complete': Block job 'drive-ua-fee6faec-ebfc-440a-8e88-84fe4ecf68d1' not found

On the engine, the following exception was logged. It refers to a change CD failure, which somehow does not seem to be related, but it appears under the same correlation id:

2018-06-29 15:04:31,743+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_3 command VmReplicateDiskFinishVDS failed: Failed to change disk image
2018-06-29 15:04:31,743+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand' return value 'StatusOnlyReturn [status=Status [code=41, message=Failed to change disk image]]'
2018-06-29 15:04:31,743+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] HostName = host_mixed_3
2018-06-29 15:04:31,743+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] Command 'VmReplicateDiskFinishVDSCommand(HostName = host_mixed_3, VmReplicateDiskParameters:{hostId='2177cf70-edbf-40a6-93a3-b73e25f0df54', vmId='a09ab6e7-57bf-4010-b616-003df33ff5b5', storagePoolId='5eb0ed42-7acf-11e8-ae76-00163e7be006', srcStorageDomainId='f54afc10-74ad-4ae5-99b5-32f1ac874c44', targetStorageDomainId='650b75f7-b1c8-4e55-b3c3-4eaec4c17a83', imageGroupId='fee6faec-ebfc-440a-8e88-84fe4ecf68d1', imageId='8e7b9ab2-9d49-46b0-b822-a2e6f0852a26'})' execution failed: VDSGenericException: VDSErrorException: Failed in vdscommand to VmReplicateDiskFinishVDS, error = Failed to change disk image
2018-06-29 15:04:31,743+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] FINISH, VmReplicateDiskFinishVDSCommand, log id: 55cc2a
2018-06-29 15:04:31,756+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] Failed VmReplicateDiskFinish (Disk 'fee6faec-ebfc-440a-8e88-84fe4ecf68d1', VM 'a09ab6e7-57bf-4010-b616-003df33ff5b5')
2018-06-29 15:04:31,756+03 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] Command 'LiveMigrateDisk' id: '61b97882-639b-4e30-aeee-7b10f72c95f3' with children [fce0d999-11c1-43cb-bf9b-1a4392273b9a, 56969072-7530-4f85-b1fe-437aabcf478a, 48b367c6-9318-47ca-884f-0099fb7c5e1d] failed when attempting to perform the next operation, marking as 'ACTIVE'
2018-06-29 15:04:31,757+03 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] EngineException: Failed to change disk image (Failed with error FAILED_CHANGE_CD_IS_MOUNTED and code 41): org.ovirt.engine.core.common.errors.EngineException: EngineException: Failed to change disk image (Failed with error FAILED_CHANGE_CD_IS_MOUNTED and code 41)
  at org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.replicateDiskFinish(LiveMigrateDiskCommand.java:402) [bll.jar:]
  at org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.completeLiveMigration(LiveMigrateDiskCommand.java:374) [bll.jar:]
  at org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.performNextOperation(LiveMigrateDiskCommand.java:242) [bll.jar:]
  at org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback.childCommandsExecutionEnded(SerialChildCommandsExecutionCallback.java:32) [bll.jar:]
  at org.ovirt.engine.core.bll.ChildCommandsCallbackBase.doPolling(ChildCommandsCallbackBase.java:68) [bll.jar:]
  at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:146) [bll.jar:]
  at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:107) [bll.jar:]
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_171]
  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_171]
  at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
  at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_171]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_171]
  at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_171]
  at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
  at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)

Expected results:
Live migration to succeed

Additional info:
logs

--- Additional comment from Benny Zlotnik on 2018-07-01 16:19:29 UTC ---

Can you attach the VM's libvirt log?

--- Additional comment from Elad on 2018-07-02 10:53 UTC ---

--- Additional comment from Elad on 2018-07-02 11:33 UTC ---

--- Additional comment from Elad on 2018-07-05 08:29:27 UTC ---

Encountered again (2/2). Raising severity and marking as a regression.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-07-05 08:29:32 UTC ---

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

--- Additional comment from Raz Tamir on 2018-08-22 14:37:37 UTC ---

Removing the blocker+ flag as this is not easily reproduced.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-08-22 14:37:41 UTC ---

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

--- Additional comment from Elad on 2018-12-02 16:03 UTC ---

Encountered once again. Logs attached.

2018-11-30 01:20:06,230+0200 ERROR (jsonrpc/6) [virt.vm] (vmId='467c5bf8-f165-4473-bbe7-8d1599bf194e') Unable to stop the replication for the drive: vda (vm:4803)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4800, in diskReplicateFinish
    self._dom.blockJobAbort(drive.name, blockJobFlags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 784, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirtError: internal error: unable to execute QEMU command 'block-job-complete': The active block job 'drive-ua-288a92ab-90f5-4d31-be8b-ea7f2c15c092' cannot be completed

vdsm-4.20.44-1.el7ev.x86_64
qemu-img-rhev-2.12.0-18.el7_6.1.x86_64
vdsm-python-4.20.44-1.el7ev.noarch
sanlock: sanlock-3.6.0-1.el7.x86_64
redhat_release: Red Hat Enterprise Linux Server 7.6 (Maipo)
kernel_release: 3.10.0-957.1.3.el7.x86_64

--- Additional comment from Tal Nisan on 2018-12-10 09:51:00 UTC ---
Seems a libvirt issue. Moving to 4.2.9.
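For readers following the tracebacks above: the failing step is vdsm's diskReplicateFinish pivoting the finished drive mirror through libvirt, which in turn issues QEMU's 'block-job-complete'. The sketch below is only a minimal illustration of that call path using libvirt-python; it is not vdsm's actual code, and the function name and connection handling are made up for illustration.

~~~
import libvirt

# Minimal sketch of the call that fails in the tracebacks above: pivoting a
# finished drive mirror via libvirt, which maps to QEMU's 'block-job-complete'.
# NOT vdsm's implementation; 'pivot_drive' and the connection handling are
# illustrative only.
def pivot_drive(domain_name, disk="vda", uri="qemu:///system"):
    conn = libvirt.open(uri)
    try:
        dom = conn.lookupByName(domain_name)
        # PIVOT switches the guest onto the copy; libvirt sends
        # 'block-job-complete' to QEMU, which is where this bug errors out.
        dom.blockJobAbort(disk, libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
    finally:
        conn.close()
~~~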
Hi Bimal,
I am a libvirt QE. I have recently been looking into the LSM problem BZ1643707. Could you please summarize the following information to help me reproduce it:
1. What is the latest ovirt/libvirt/vdsm/qemu-kvm version on which the customers encountered the bug?
2. What are the LSM source and destination SD types in the customers' tickets? Any special settings in the SDs?
3. For the reproduction steps in comment 0, are step 2 and step 4 necessary? I see the LSM is done on vm2, so I think the steps related to vm1 can be removed.
Hi Han,
The following are the versions and SD information.
~~~
ovirt-engine-4.2.7.4-0.1.el7ev.noarch
libvirt-client-3.2.0-14.el7_4.3.x86_64
vdsm-4.19.31-1.el7ev.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.5.x86_64
Both source and destination SD types are FC
~~~
~~~
ovirt-engine-4.2.7.4-0.1.el7ev.noarch
libvirt-4.5.0-10.el7_6.3.x86_64
vdsm-4.20.43-1.el7ev.x86_64
qemu-kvm-rhev-2.12.0-18.el7_6.1.x86_64
Both source and destination SD types are iSCSI
~~~
~~~
ovirt-engine-4.1.11.2-0.1.el7.noarch
libvirt-4.5.0-10.el7_6.2.x86_64
vdsm-4.20.43-1.el7ev.x86_64
qemu-kvm-rhev-2.12.0-18.el7_6.1.x86_64
Both source and destination SD types are FC
~~~
This was also seen with a couple of other end users.

Yes, I think we can eliminate the vm1 steps; I found those steps in the upstream BZ. I have tried several times to reproduce this but have not been successful. One thing I noticed was that if SELinux was disabled (with "setenforce 0"), the migration worked. I am speculating here, but I suspect SELinux was disabled when the disks were created and later re-enabled, which then causes the LSM to fail. We also noticed extra LV symlinks on both SDs; see comment 3 in the BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1660265#c3

Thanks,
Bimal
I tried to reproduce it on:
ovirt-4.2.8.1-0.1.el7ev
vdsm-4.20.46-1.el7ev.x86_64
libvirt-4.5.0-10.el7_6.4.x86_64
qemu-kvm-rhev-2.12.0-18.el7_6.3.x86_64

Steps:
1. Create a DC. Add a cluster. Add a host. Add two iSCSI SDs. Create a VM on an iSCSI domain.
2. LSM from one iSCSI SD to the other:
# ./ovirt-move.py -g ff -S rhv2 -D rhv1

LSM failed due to another issue:

2019-02-12 20:29:03,828-0500 ERROR (tasks/7) [storage.Volume] Failed to create volume /rhev/data-center/mnt/blockSD/90ab8104-4656-4f73-a81f-1c30f27d6a0a/images/ecac83c8-17f6-4802-8157-3b9a0b09964e/994197b3-e12d-4f14-8237-d84742f1a95e: Cannot create Logical Volume: 'vgname=90ab8104-4656-4f73-a81f-1c30f27d6a0a lvname=994197b3-e12d-4f14-8237-d84742f1a95e err=[\' Logical Volume "994197b3-e12d-4f14-8237-d84742f1a95e" already exists in volume group "90ab8104-4656-4f73-a81f-1c30f27d6a0a"\']' (volume:1209)
2019-02-12 20:29:03,828-0500 ERROR (tasks/7) [storage.Volume] Unexpected error (volume:1246)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1206, in create
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 505, in _create
    initialTags=(sc.TAG_VOL_UNINIT,))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1161, in createLV
    raise se.CannotCreateLogicalVolume(vgName, lvName, err)
CannotCreateLogicalVolume: Cannot create Logical Volume: 'vgname=90ab8104-4656-4f73-a81f-1c30f27d6a0a lvname=994197b3-e12d-4f14-8237-d84742f1a95e err=[\' Logical Volume "994197b3-e12d-4f14-8237-d84742f1a95e" already exists in volume group "90ab8104-4656-4f73-a81f-1c30f27d6a0a"\']'
2019-02-12 20:29:03,828-0500 ERROR (tasks/7) [storage.Image] Unexpected error (image:420)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 403, in _createTargetImage
    srcVolUUID=volParams['parent'])
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 820, in createVolume
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1206, in create
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 505, in _create
    initialTags=(sc.TAG_VOL_UNINIT,))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1161, in createLV
    raise se.CannotCreateLogicalVolume(vgName, lvName, err)
CannotCreateLogicalVolume: Cannot create Logical Volume: 'vgname=90ab8104-4656-4f73-a81f-1c30f27d6a0a lvname=994197b3-e12d-4f14-8237-d84742f1a95e err=[\' Logical Volume "994197b3-e12d-4f14-8237-d84742f1a95e" already exists in volume group "90ab8104-4656-4f73-a81f-1c30f27d6a0a"\']'
2019-02-12 20:29:03,830-0500 ERROR (tasks/7) [storage.TaskManager.Task] (Task='e2d24bc8-9241-448f-b5f0-8a8ac678086a') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1733, in cloneImageStructure
    img.cloneStructure(sdUUID, imgUUID, dstSdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 637, in cloneStructure
    self._createTargetImage(sdCache.produce(dstSdUUID), sdUUID, imgUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 403, in _createTargetImage
    srcVolUUID=volParams['parent'])
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 820, in createVolume

Elad, have you encountered a problem like this before? Any workaround for it?
Created attachment 1534286 [details] script and vdsm/ovirt log
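The attached ovirt-move.py is not reproduced here; for anyone without access to the attachment, a rough, hypothetical equivalent for triggering the disk move with ovirt-engine-sdk4 might look like the sketch below. The engine URL, credentials, disk name, and target SD name are placeholders, and the exact move() parameters should be checked against the SDK version in use.

~~~
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Hypothetical LSM trigger via the oVirt Python SDK; NOT the attached
# ovirt-move.py. URL, credentials, disk and storage domain names are placeholders.
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="password",
    insecure=True,
)
try:
    disks_service = connection.system_service().disks_service()
    disk = disks_service.list(search="name=vm2_Disk1")[0]
    # Moving an attached disk while the VM is running becomes a live storage migration.
    disks_service.disk_service(disk.id).move(
        storage_domain=types.StorageDomain(name="rhv1"),
    )
finally:
    connection.close()
~~~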
Seems like bug 1650985
Michel, it looks like a libvirt bug. Do you have any idea how to reproduce it?
(In reply to Han Han from comment #26)
> Michel, it looks like a libvirt bug. Do you have any idea how to reproduce it?

None, sorry. But I feel like there are two or even more problems discussed here (guessing from the different error messages reported here):

libvirtError: internal error: unable to execute QEMU command 'block-job-complete': Block job 'drive-ua-fee6faec-ebfc-440a-8e88-84fe4ecf68d1' not found

libvirtError: internal error: unable to execute QEMU command 'block-job-complete': The active block job 'drive-ua-288a92ab-90f5-4d31-be8b-ea7f2c15c092' cannot be completed

libvirtError: internal error: unable to execute QEMU command 'block-job-complete': Block job 'drive-virtio-disk0' not found

However, Peter might have more insight here.
From the described messages and the cross-linked customer cases it looks like https://bugzilla.redhat.com/show_bug.cgi?id=1643707. Given that I was not able to find any useful libvirtd debug logs, I can't really tell what's happening. It is fishy, though, that 'block-job-complete' was called in situations where the job is missing or cannot be completed, as libvirt makes sure the job is in the ready state before sending the command. Neither should have happened. Unfortunately, without debug logs or a reproducer it's very hard to figure out what's happening.
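If someone does hit this again, one cheap thing to capture alongside the libvirt debug logs is what libvirt itself reports about the block job right before the pivot. The snippet below is only a diagnostic sketch in libvirt-python, not a proposed vdsm change; 'dom' is assumed to be a libvirt.virDomain obtained elsewhere (e.g. conn.lookupByName(...)).

~~~
# Diagnostic sketch: dump the block job state immediately before the pivot,
# so a failing 'block-job-complete' can be correlated with what libvirt saw.
def dump_block_job_state(dom, disk="vda"):
    info = dom.blockJobInfo(disk, 0)  # empty/None when no job is active
    if not info:
        print("no active block job on %s" % disk)
    else:
        ready = info.get("cur") == info.get("end")
        print("job on %s: type=%s cur=%s end=%s ready=%s"
              % (disk, info.get("type"), info.get("cur"),
                 info.get("end"), ready))
~~~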
Avihai, can you guys please help with logs according to comment #28?
(In reply to Tal Nisan from comment #29)
> Avihai, can you guys please help with logs according to comment #28?

Hi Tal,
I did not encounter this issue in later runs (I checked the automation TCs for Bug 1597019) and it did not occur. @Elad, correct me if I'm wrong. See also Elad's comment #11 on Bug 1597019, noting that this issue has not been seen for a long time now.
What about those who reported this issue originally? Is it still seen in their environments?
I see the customer tickets attached here were updated today so I guess there are customers who face this.
Sync for product update in Jira.
I see. So we are in the same boat of no libvirt logs. See our request here to get those logs back: bz#1749634. For now, Antonio, sounds like the best way is to enable libvirt logging via virt-admin and get the missing libvirt logs.
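For whoever picks this up, the sketch below shows one way to switch libvirtd to debug logging at runtime via virt-admin, wrapped in a small Python helper for consistency with the other snippets here. The filter and output strings are only a suggested starting point, and the log file path is an example; tune them to whatever the libvirt developers request.

~~~
import subprocess

# Suggested starting point for runtime libvirtd debug logging via virt-admin,
# as requested above. Adjust filters/output as needed.
FILTERS = "1:qemu 1:libvirt 3:security 3:event 3:util 3:file"
OUTPUT = "1:file:/var/log/libvirt/libvirtd-debug.log"

subprocess.check_call(["virt-admin", "daemon-log-filters", FILTERS])
subprocess.check_call(["virt-admin", "daemon-log-outputs", OUTPUT])
~~~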
Closing as insufficient data due to the missing libvirt logs requested in bug 1643707. Please reopen once the issue is reproduced and the logs can be provided.