Bug 1597019

Summary: LSM's diskReplicateFinish failed on libvirtError: internal error: unable to execute QEMU command 'block-job-complete': Block job 'drive-ua....' not found
Product: [oVirt] vdsm
Reporter: Elad <ebenahar>
Component: Core
Assignee: Benny Zlotnik <bzlotnik>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Avihai <aefrat>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.20.31
CC: bcholler, bugs, dfediuck, eshames, frolland, gwatson, mkalinin, nsoffer, pkrempa, smaudet, tnisan
Target Milestone: ovirt-4.4.0
Keywords: Automation, Regression
Target Release: ---
Flags: rule-engine: ovirt-4.3+
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1660265 (view as bug list)
Environment:
Last Closed: 2019-10-29 16:43:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1643707
Bug Blocks: 1660265

Attachments (Description / Flags):
logs (flags: none)
libvirt logs (flags: none)
logs_with_qemu (flags: none)
logs2 (flags: none)

Description Elad 2018-07-01 13:56:20 UTC
Created attachment 1455763 [details]
logs

Description of problem:
Live storage migration fails in the diskReplicateFinish phase with a libvirtError.


Version-Release number of selected component (if applicable):
vdsm-4.20.32-1.el7ev.x86_64
libvirt-daemon-3.9.0-14.el7_5.6.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64


How reproducible:
Happened once

Steps to Reproduce:
Reproduced once after the following scenario (a scripted sketch of these steps is shown after the list):
1. Create a VM from a template with 1 disk as a thin copy
2. Create a template from the VM - temp1
3. Create a second template from the VM - temp2
4. Create a VM from temp1 - vm1
5. Create a VM from temp2 - vm2
6. Start vm2
7. Migrate vm2 disk
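
For reference, the last three steps of the scenario above can be scripted against the engine API with the Python SDK (ovirtsdk4). This is only a rough sketch; the engine URL, credentials, cluster, template and target storage domain names below are placeholders, and status polling between steps is omitted:

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    # Placeholder connection details, not taken from this environment.
    conn = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                          username='admin@internal', password='password',
                          ca_file='ca.pem')
    system = conn.system_service()
    vms = system.vms_service()

    # Step 5: create a VM from temp2.
    vm = vms.add(types.Vm(name='vm2',
                          cluster=types.Cluster(name='Default'),
                          template=types.Template(name='temp2')))
    vm_service = vms.vm_service(vm.id)

    # Step 6: start vm2 (wait until it is up before moving the disk).
    vm_service.start()

    # Step 7: live-migrate the VM disk to another storage domain.
    attachment = vm_service.disk_attachments_service().list()[0]
    disk_service = system.disks_service().disk_service(attachment.disk.id)
    disk_service.move(storage_domain=types.StorageDomain(name='target_sd'))

    conn.close()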

Actual results:
The VmReplicateDiskFinish flow in vdsm (right after prepareImage returns the prepared volume path) fails on:

2018-06-29 15:04:31,211+0300 DEBUG (jsonrpc/3) [storage.TaskManager.Task] (Task='952832ae-065d-4d36-9f7f-aa278954c611') ref 0 aborting False (task:1002)
2018-06-29 15:04:31,212+0300 INFO  (jsonrpc/3) [vds] prepared volume path: /rhev/data-center/mnt/blockSD/650b75f7-b1c8-4e55-b3c3-4eaec4c17a83/images/fee6faec-ebfc-440a-8e88-84fe4ecf68d1/8e7b9ab2-9d49-46b0-b822-a2e6f0852a26 (clientIF:497)
2018-06-29 15:04:31,212+0300 INFO  (jsonrpc/3) [virt.vm] (vmId='a09ab6e7-57bf-4010-b616-003df33ff5b5') Disabling drive monitoring (drivemonitor:58)
2018-06-29 15:04:31,276+0300 ERROR (jsonrpc/3) [virt.vm] (vmId='a09ab6e7-57bf-4010-b616-003df33ff5b5') Unable to stop the replication for the drive: vda (vm:4750)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4747, in diskReplicateFinish
    self._dom.blockJobAbort(drive.name, blockJobFlags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 773, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirtError: internal error: unable to execute QEMU command 'block-job-complete': Block job 'drive-ua-fee6faec-ebfc-440a-8e88-84fe4ecf68d1' not found
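
For context on where this fails: diskReplicateFinish ends the disk mirror by asking libvirt to pivot the block job, and libvirt translates that into the QEMU 'block-job-complete' command seen in the error. A stripped-down sketch of that call pattern (not the actual vdsm code; the domain UUID and the drive name 'vda' are taken from the log above only for illustration):

    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByUUIDString('a09ab6e7-57bf-4010-b616-003df33ff5b5')

    # Once the copy job reports the mirror as ready, vdsm pivots to the
    # destination by aborting the job with the PIVOT flag; libvirt then
    # issues 'block-job-complete' to QEMU.
    try:
        dom.blockJobAbort('vda', libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
    except libvirt.libvirtError as e:
        # The failure above: QEMU no longer knows the job
        # ('Block job ... not found'), so the pivot cannot complete.
        print(e)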



On the engine, the following exception was logged. It refers to a change-CD failure (FAILED_CHANGE_CD_IS_MOUNTED), which does not seem related, but it appears under the same correlation id:


2018-06-29 15:04:31,743+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_3 command VmReplicateDiskFinishVDS failed: Failed to change disk image
2018-06-29 15:04:31,743+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand' return value 'StatusOnlyReturn [status=Status [code=41, message=Failed to change disk image]]'
2018-06-29 15:04:31,743+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] HostName = host_mixed_3
2018-06-29 15:04:31,743+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] Command 'VmReplicateDiskFinishVDSCommand(HostName = host_mixed_3, VmReplicateDiskParameters:{hostId='2177cf70-edbf-40a6-93a3-b73e25f0df54', vmId='a09ab6e7-57bf-4010-b616-003df33ff5b5', storagePoolId='5eb0ed42-7acf-11e8-ae76-00163e7be006', srcStorageDomainId='f54afc10-74ad-4ae5-99b5-32f1ac874c44', targetStorageDomainId='650b75f7-b1c8-4e55-b3c3-4eaec4c17a83', imageGroupId='fee6faec-ebfc-440a-8e88-84fe4ecf68d1', imageId='8e7b9ab2-9d49-46b0-b822-a2e6f0852a26'})' execution failed: VDSGenericException: VDSErrorException: Failed in vdscommand to VmReplicateDiskFinishVDS, error = Failed to change disk image
2018-06-29 15:04:31,743+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] FINISH, VmReplicateDiskFinishVDSCommand, log id: 55cc2a
2018-06-29 15:04:31,756+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] Failed VmReplicateDiskFinish (Disk 'fee6faec-ebfc-440a-8e88-84fe4ecf68d1', VM 'a09ab6e7-57bf-4010-b616-003df33ff5b5')
2018-06-29 15:04:31,756+03 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] Command 'LiveMigrateDisk' id: '61b97882-639b-4e30-aeee-7b10f72c95f3' with children [fce0d999-11c1-43cb-bf9b-1a4392273b9a, 56969072-7530-4f85-b1fe-437aabcf478a, 48b367c6-9318-47ca-884f-0099fb7c5e1d] failed when attempting to perform the next operation, marking as 'ACTIVE'
2018-06-29 15:04:31,757+03 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] EngineException: Failed to change disk image (Failed with error FAILED_CHANGE_CD_IS_MOUNTED and code 41): org.ovirt.engine.core.common.errors.EngineException: EngineException: Failed to change disk image (Failed with error FAILED_CHANGE_CD_IS_MOUNTED and code 41)
        at org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.replicateDiskFinish(LiveMigrateDiskCommand.java:402) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.completeLiveMigration(LiveMigrateDiskCommand.java:374) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.performNextOperation(LiveMigrateDiskCommand.java:242) [bll.jar:]
        at org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback.childCommandsExecutionEnded(SerialChildCommandsExecutionCallback.java:32) [bll.jar:]
        at org.ovirt.engine.core.bll.ChildCommandsCallbackBase.doPolling(ChildCommandsCallbackBase.java:68) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:146) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:107) [bll.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_171]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_171]
        at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
        at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_171]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_171]
        at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
        at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)




Expected results:
Live storage migration to succeed

Additional info:
logs

Comment 1 Benny Zlotnik 2018-07-01 16:19:29 UTC
Can you attach the VM's libvirt log?

Comment 2 Elad 2018-07-02 10:53:04 UTC
Created attachment 1455937 [details]
libvirt logs

Comment 3 Elad 2018-07-02 11:33:00 UTC
Created attachment 1455941 [details]
logs_with_qemu

Comment 4 Elad 2018-07-05 08:29:27 UTC
Encountered again (2/2 attempts). Raising severity and marking as a regression.

Comment 5 Red Hat Bugzilla Rules Engine 2018-07-05 08:29:32 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 6 Raz Tamir 2018-08-22 14:37:37 UTC
Removing the blocker+ flag as this is not easily reproduced.

Comment 7 Red Hat Bugzilla Rules Engine 2018-08-22 14:37:41 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 8 Elad 2018-12-02 16:03:04 UTC
Created attachment 1510634 [details]
logs2

Encountered once again. Logs attached


2018-11-30 01:20:06,230+0200 ERROR (jsonrpc/6) [virt.vm] (vmId='467c5bf8-f165-4473-bbe7-8d1599bf194e') Unable to stop the replication for the drive: vda (vm:4803)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4800, in diskReplicateFinish
    self._dom.blockJobAbort(drive.name, blockJobFlags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 784, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirtError: internal error: unable to execute QEMU command 'block-job-complete': The active block job 'drive-ua-288a92ab-90f5-4d31-be8b-ea7f2c15c092' cannot be completed




vdsm-4.20.44-1.el7ev.x86_64
qemu-img-rhev-2.12.0-18.el7_6.1.x86_64
vdsm-python-4.20.44-1.el7ev.noarch
sanlock: sanlock-3.6.0-1.el7.x86_64
redhat_release: Red Hat Enterprise Linux Server 7.6 (Maipo)
kernel_release: 3.10.0-957.1.3.el7.x86_64

Comment 9 Tal Nisan 2018-12-10 09:51:00 UTC
*** Bug 1650981 has been marked as a duplicate of this bug. ***

Comment 10 Benny Zlotnik 2018-12-26 09:02:32 UTC
I think it's the same as https://bugzilla.redhat.com/show_bug.cgi?id=1643707, which is currently being investigated by the libvirt team.
Do we have the "permission denied" message in the latest qemu logs as well?

Comment 11 Elad 2018-12-26 14:05:08 UTC
The last occurrence we've seen was on Nov 30, with the latest build available at that time; see the logs in comment #8. I cannot find 'permission denied' in the vdsm logs there.

The bug has a downstream clone filed by a customer who hit it (bug 1660265). Please raise this question with its reporter.

Comment 12 Fred Rolland 2019-01-03 11:31:40 UTC
This seems to be a libvirt issue. Moving to 4.2.9.

Comment 13 Raz Tamir 2019-01-03 12:47:08 UTC
Hi Fred,

We will not have a 4.2.9; please move back to 4.2.8 or 4.3.

Comment 14 Doron Fediuck 2019-01-07 14:31:07 UTC
We can consider this for an async release once there is a libvirt fix.

Comment 16 Peter Krempa 2019-02-19 14:11:06 UTC
None of the libvirt logs attached to the BZ contain the described failure. Do you have any libvirtd debug logs which reproduce the problem? I'm trying to figure out what's happening in https://bugzilla.redhat.com/show_bug.cgi?id=1643707 and that BZ also does not have any libvirt debug logs.
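
In case it helps in collecting the requested data: libvirtd debug logging can typically be enabled by setting the following in /etc/libvirt/libvirtd.conf and restarting libvirtd (filter values are the commonly documented ones; adjust as the libvirt developers prefer):

    log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util"
    log_outputs="1:file:/var/log/libvirt/libvirtd.log"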

Comment 18 Elad 2019-03-01 14:11:56 UTC
Unfortunately, I don't have libvirt logs with the error.

Comment 19 Nir Soffer 2019-03-12 16:11:02 UTC
I asked Peter about this issue, see:
https://bugzilla.redhat.com/show_bug.cgi?id=1643707#c38

I think we need to fix the code that checks whether the block job is ready, as explained
in that comment; a possible sketch follows.
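
For illustration of that readiness check (this is not vdsm's actual code; the VM name and disk target are placeholders, and it assumes a disk copy/mirror job was already started for 'vda'): instead of only polling blockJobInfo() and comparing cur to end, the pivot can be gated on libvirt's block-job READY event, along these lines:

    import libvirt

    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('vm2')

    ready = {'vda': False}

    def on_block_job(conn, dom, disk, job_type, status, opaque):
        # BLOCK_JOB_2 reports the disk by its target name (e.g. 'vda').
        if status == libvirt.VIR_DOMAIN_BLOCK_JOB_READY:
            ready[disk] = True

    conn.domainEventRegisterAny(dom, libvirt.VIR_DOMAIN_EVENT_ID_BLOCK_JOB_2,
                                on_block_job, None)

    # In real code this loop would run in a dedicated event thread.
    while not ready['vda']:
        libvirt.virEventRunDefaultImpl()

    # Pivot only after the job reported READY.
    dom.blockJobAbort('vda', libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)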

Comment 20 Nir Soffer 2019-03-12 18:07:58 UTC
After discussing this with Benny, we don't think that fixing the way we monitor
the block job will fix it. What we really need is libvirt debug logs showing
what happens inside libvirt when this fails.

Comment 24 Tal Nisan 2019-10-29 16:43:33 UTC
Closing as insufficient data due to the missing libvirt logs requested in bug 1643707. Please reopen once the issue is reproduced and the logs can be provided.