Bug 1597019

Summary: LSM's diskReplicateFinish failed on libvirtError: internal error: unable to execute QEMU command 'block-job-complete': Block job 'drive-ua....' not found
Product: [oVirt] vdsm
Reporter: Elad <ebenahar>
Component: Core
Assignee: Benny Zlotnik <bzlotnik>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Avihai <aefrat>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.20.31
CC: bcholler, bugs, dfediuck, eshames, frolland, gwatson, mkalinin, nsoffer, pkrempa, smaudet, tnisan
Target Milestone: ovirt-4.4.0
Keywords: Automation, Regression
Target Release: ---
Flags: rule-engine: ovirt-4.3+
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1660265 (view as bug list)
Environment:
Last Closed: 2019-10-29 16:43:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1643707
Bug Blocks: 1660265

Attachments (Description / Flags):
logs (flags: none)
libvirt logs (flags: none)
logs_with_qemu (flags: none)
logs2 (flags: none)

Description Elad 2018-07-01 13:56:20 UTC
Created attachment 1455763 [details]
logs

Description of problem:
Live storage migration fails in the diskReplicateFinish phase with a libvirtError.


Version-Release number of selected component (if applicable):
vdsm-4.20.32-1.el7ev.x86_64
libvirt-daemon-3.9.0-14.el7_5.6.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64


How reproducible:
Happened once

Steps to Reproduce:
Reproduced once after the following scenario (a scripted sketch of these steps is shown after the list):
1. Create a VM from a template with 1 disk as a thin copy
2. Create a template from the VM - temp1
3. Create a second template from the VM - temp2
4. Create a VM from temp1 - vm1
5. Create a VM from temp2 - vm2
6. Start vm2
7. Migrate vm2 disk
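
For reference, the last three steps of the scenario above can be scripted against the engine API with the Python SDK (ovirtsdk4). This is only a rough sketch; the engine URL, credentials, cluster, template and target storage domain names below are placeholders, and status polling between steps is omitted:

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    # Placeholder connection details, not taken from this environment.
    conn = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                          username='admin@internal', password='password',
                          ca_file='ca.pem')
    system = conn.system_service()
    vms = system.vms_service()

    # Step 5: create a VM from temp2.
    vm = vms.add(types.Vm(name='vm2',
                          cluster=types.Cluster(name='Default'),
                          template=types.Template(name='temp2')))
    vm_service = vms.vm_service(vm.id)

    # Step 6: start vm2 (wait until it is up before moving the disk).
    vm_service.start()

    # Step 7: live-migrate the VM disk to another storage domain.
    attachment = vm_service.disk_attachments_service().list()[0]
    disk_service = system.disks_service().disk_service(attachment.disk.id)
    disk_service.move(storage_domain=types.StorageDomain(name='target_sd'))

    conn.close()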

Actual results:
The VmReplicateDiskFinish flow in vdsm (right after prepareImage returns the prepared volume path) fails on:

2018-06-29 15:04:31,211+0300 DEBUG (jsonrpc/3) [storage.TaskManager.Task] (Task='952832ae-065d-4d36-9f7f-aa278954c611') ref 0 aborting False (task:1002)
2018-06-29 15:04:31,212+0300 INFO  (jsonrpc/3) [vds] prepared volume path: /rhev/data-center/mnt/blockSD/650b75f7-b1c8-4e55-b3c3-4eaec4c17a83/images/fee6faec-ebfc-440a-8e88-84fe4ecf68d1/8e7b9ab2-9d49-46b0-b822-a2e6f0852a26 (clientIF:497)
2018-06-29 15:04:31,212+0300 INFO  (jsonrpc/3) [virt.vm] (vmId='a09ab6e7-57bf-4010-b616-003df33ff5b5') Disabling drive monitoring (drivemonitor:58)
2018-06-29 15:04:31,276+0300 ERROR (jsonrpc/3) [virt.vm] (vmId='a09ab6e7-57bf-4010-b616-003df33ff5b5') Unable to stop the replication for the drive: vda (vm:4750)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4747, in diskReplicateFinish
    self._dom.blockJobAbort(drive.name, blockJobFlags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 773, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirtError: internal error: unable to execute QEMU command 'block-job-complete': Block job 'drive-ua-fee6faec-ebfc-440a-8e88-84fe4ecf68d1' not found
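
For context on where this fails: diskReplicateFinish ends the disk mirror by asking libvirt to pivot the block job, and libvirt translates that into the QEMU 'block-job-complete' command seen in the error. A stripped-down sketch of that call pattern (not the actual vdsm code; the domain UUID and the drive name 'vda' are taken from the log above only for illustration):

    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByUUIDString('a09ab6e7-57bf-4010-b616-003df33ff5b5')

    # Once the copy job reports the mirror as ready, vdsm pivots to the
    # destination by aborting the job with the PIVOT flag; libvirt then
    # issues 'block-job-complete' to QEMU.
    try:
        dom.blockJobAbort('vda', libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
    except libvirt.libvirtError as e:
        # The failure above: QEMU no longer knows the job
        # ('Block job ... not found'), so the pivot cannot complete.
        print(e)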



On the engine, the following exception was logged. It refers to a change-CD failure (FAILED_CHANGE_CD_IS_MOUNTED), which does not seem related, but it appears under the same correlation id:


2018-06-29 15:04:31,743+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_3 command VmReplicateDiskFinishVDS failed: Failed to change disk image
2018-06-29 15:04:31,743+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand' return value 'StatusOnlyReturn [status=Status [code=41, message=Failed to change disk image]]'
2018-06-29 15:04:31,743+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] HostName = host_mixed_3
2018-06-29 15:04:31,743+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] Command 'VmReplicateDiskFinishVDSCommand(HostName = host_mixed_3, VmReplicateDiskParameters:{hostId='2177cf70-edbf-40a6-93a3-b73e25f0df54', vmId='a09ab6e7-57bf-4010-b616-003df33ff5b5', storagePoolId='5eb0ed42-7acf-11e8-ae76-00163e7be006', srcStorageDomainId='f54afc10-74ad-4ae5-99b5-32f1ac874c44', targetStorageDomainId='650b75f7-b1c8-4e55-b3c3-4eaec4c17a83', imageGroupId='fee6faec-ebfc-440a-8e88-84fe4ecf68d1', imageId='8e7b9ab2-9d49-46b0-b822-a2e6f0852a26'})' execution failed: VDSGenericException: VDSErrorException: Failed in vdscommand to VmReplicateDiskFinishVDS, error = Failed to change disk image
2018-06-29 15:04:31,743+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] FINISH, VmReplicateDiskFinishVDSCommand, log id: 55cc2a
2018-06-29 15:04:31,756+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] Failed VmReplicateDiskFinish (Disk 'fee6faec-ebfc-440a-8e88-84fe4ecf68d1', VM 'a09ab6e7-57bf-4010-b616-003df33ff5b5')
2018-06-29 15:04:31,756+03 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] Command 'LiveMigrateDisk' id: '61b97882-639b-4e30-aeee-7b10f72c95f3' with children [fce0d999-11c1-43cb-bf9b-1a4392273b9a, 56969072-7530-4f85-b1fe-437aabcf478a, 48b367c6-9318-47ca-884f-0099fb7c5e1d] failed when attempting to perform the next operation, marking as 'ACTIVE'
2018-06-29 15:04:31,757+03 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [disks_syncAction_c0feb9f6-ad90-44a0] EngineException: Failed to change disk image (Failed with error FAILED_CHANGE_CD_IS_MOUNTED and code 41): org.ovirt.engine.core.common.errors.EngineException: EngineException: Failed to change disk image (Failed with error FAILED_CHANGE_CD_IS_MOUNTED and code 41)
        at org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.replicateDiskFinish(LiveMigrateDiskCommand.java:402) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.completeLiveMigration(LiveMigrateDiskCommand.java:374) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand.performNextOperation(LiveMigrateDiskCommand.java:242) [bll.jar:]
        at org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback.childCommandsExecutionEnded(SerialChildCommandsExecutionCallback.java:32) [bll.jar:]
        at org.ovirt.engine.core.bll.ChildCommandsCallbackBase.doPolling(ChildCommandsCallbackBase.java:68) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:146) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:107) [bll.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_171]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_171]
        at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
        at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_171]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_171]
        at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
        at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)




Expected results:
Live storage migration to succeed

Additional info:
logs

Comment 1 Benny Zlotnik 2018-07-01 16:19:29 UTC
Can you attach the VM's libvirt log?

Comment 2 Elad 2018-07-02 10:53:04 UTC
Created attachment 1455937 [details]
libvirt logs

Comment 3 Elad 2018-07-02 11:33:00 UTC
Created attachment 1455941 [details]
logs_with_qemu

Comment 4 Elad 2018-07-05 08:29:27 UTC
Encountered again (2/2 attempts). Raising severity and marking as a regression.

Comment 5 Red Hat Bugzilla Rules Engine 2018-07-05 08:29:32 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 6 Raz Tamir 2018-08-22 14:37:37 UTC
Removing the blocker+ flag as this is not easily reproduced.

Comment 7 Red Hat Bugzilla Rules Engine 2018-08-22 14:37:41 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 8 Elad 2018-12-02 16:03:04 UTC
Created attachment 1510634 [details]
logs2

Encountered once again. Logs attached


2018-11-30 01:20:06,230+0200 ERROR (jsonrpc/6) [virt.vm] (vmId='467c5bf8-f165-4473-bbe7-8d1599bf194e') Unable to stop the replication for the drive: vda (vm:4803)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4800, in diskReplicateFinish
    self._dom.blockJobAbort(drive.name, blockJobFlags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 784, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirtError: internal error: unable to execute QEMU command 'block-job-complete': The active block job 'drive-ua-288a92ab-90f5-4d31-be8b-ea7f2c15c092' cannot be completed




vdsm-4.20.44-1.el7ev.x86_64
qemu-img-rhev-2.12.0-18.el7_6.1.x86_64
vdsm-python-4.20.44-1.el7ev.noarch
sanlock: sanlock-3.6.0-1.el7.x86_64
redhat_release: Red Hat Enterprise Linux Server 7.6 (Maipo)
kernel_release: 3.10.0-957.1.3.el7.x86_64

Comment 9 Tal Nisan 2018-12-10 09:51:00 UTC
*** Bug 1650981 has been marked as a duplicate of this bug. ***

Comment 10 Benny Zlotnik 2018-12-26 09:02:32 UTC
I think it's the same as https://bugzilla.redhat.com/show_bug.cgi?id=1643707, which is currently being investigated by the libvirt team.
Do we have the "permission denied" message in the latest qemu logs as well?

Comment 11 Elad 2018-12-26 14:05:08 UTC
The last occurrence we've seen was on Nov 30, with the latest build available at that time; see the logs in comment #8. I cannot find 'permission denied' in the vdsm logs there.

The bug has a downstream clone filed by a customer who hit it (bug 1660265). Please raise this question with its reporter.

Comment 12 Fred Rolland 2019-01-03 11:31:40 UTC
This seems to be a libvirt issue. Moving to 4.2.9.

Comment 13 Raz Tamir 2019-01-03 12:47:08 UTC
Hi Fred,

We will not have a 4.2.9; please move back to 4.2.8 or 4.3.

Comment 14 Doron Fediuck 2019-01-07 14:31:07 UTC
We can consider this for an async release once there is a libvirt fix.

Comment 16 Peter Krempa 2019-02-19 14:11:06 UTC
None of the libvirt logs attached to the BZ contain the described failure. Do you have any libvirtd debug logs which reproduce the problem? I'm trying to figure out what's happening in https://bugzilla.redhat.com/show_bug.cgi?id=1643707 and that BZ also does not have any libvirt debug logs.
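
In case it helps in collecting the requested data: libvirtd debug logging can typically be enabled by setting the following in /etc/libvirt/libvirtd.conf and restarting libvirtd (filter values are the commonly documented ones; adjust as the libvirt developers prefer):

    log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util"
    log_outputs="1:file:/var/log/libvirt/libvirtd.log"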

Comment 18 Elad 2019-03-01 14:11:56 UTC
Unfortunately, I don't have libvirt logs with the error.

Comment 19 Nir Soffer 2019-03-12 16:11:02 UTC
I asked Peter about this issue, see:
https://bugzilla.redhat.com/show_bug.cgi?id=1643707#c38

I think we need to fix the code that checks whether the block job is ready, as explained
in that comment; a possible sketch follows.
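
For illustration of that readiness check (this is not vdsm's actual code; the VM name and disk target are placeholders, and it assumes a disk copy/mirror job was already started for 'vda'): instead of only polling blockJobInfo() and comparing cur to end, the pivot can be gated on libvirt's block-job READY event, along these lines:

    import libvirt

    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('vm2')

    ready = {'vda': False}

    def on_block_job(conn, dom, disk, job_type, status, opaque):
        # BLOCK_JOB_2 reports the disk by its target name (e.g. 'vda').
        if status == libvirt.VIR_DOMAIN_BLOCK_JOB_READY:
            ready[disk] = True

    conn.domainEventRegisterAny(dom, libvirt.VIR_DOMAIN_EVENT_ID_BLOCK_JOB_2,
                                on_block_job, None)

    # In real code this loop would run in a dedicated event thread.
    while not ready['vda']:
        libvirt.virEventRunDefaultImpl()

    # Pivot only after the job reported READY.
    dom.blockJobAbort('vda', libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)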

Comment 20 Nir Soffer 2019-03-12 18:07:58 UTC
After discussing this with Benny, we don't think that fixing the way we monitor
the block job will fix it. What we really need is libvirt debug logs showing
what happens inside libvirt when this fails.

Comment 24 Tal Nisan 2019-10-29 16:43:33 UTC
Closing as insufficient data due to the missing libvirt logs requested in bug 1643707. Please reopen once the issue is reproduced and the logs can be provided.