Bug 1785939

Summary: LSM failed and Live merge fails and Images have been marked illegal and can no longer be previewed/reverted - Failed to MergeVDS, error = Merge failed, code = 52 (Failed with error mergeErr and code 52) - Top volume is still in qemu chain
Product: Red Hat Enterprise Virtualization Manager    Reporter: Avihai <aefrat>
Component: vdsm    Assignee: Eyal Shenitzky <eshenitz>
Status: CLOSED ERRATA QA Contact: Lukas Svaty <lsvaty>
Severity: urgent Docs Contact:
Priority: high    
Version: 4.4.0    CC: aoconnor, bugs, eshenitz, izuckerm, kwolf, lsurette, michal.skrivanek, mtessun, nsoffer, pelauter, pkrempa, sfishbai, srevivo, tnisan, ybenshim, ycui
Target Milestone: ovirt-4.4.0    Keywords: AutomationBlocker, Regression, TestBlocker
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: rhv-4.4.0-27 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-08-04 13:27:22 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1803092, 1805143    
Bug Blocks: 1686650, 1690475, 1733843    
Attachments:
  engine vdsm libvirt logs (flags: none)
  LSM_engine_log (flags: none)
  libvirtd logs. RHEL 8.2 (flags: none)

Description Avihai 2019-12-22 16:09:28 UTC
Created attachment 1647180 [details]
engine vdsm libvirt logs

Description of problem:
Deleting a live snapshot fails during live merge with the following errors.

Engine:
2019-12-22 17:26:48,138+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-9) [b754ed48-34d6-4d16-a82c-931fc7af1721] Failed in 'MergeVDS' method
2019-12-22 17:26:48,156+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedExecutorService-commandCoordinator-Thread-9) [b754ed48-34d6-4d16-a82c-931fc7af1721] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_3 command MergeVDS failed: Merge failed
2019-12-22 17:26:48,157+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-9) [b754ed48-34d6-4d16-a82c-931fc7af1721] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand' return value 'StatusOnlyReturn [status=Status [code=52, message=Merge failed]]'
2019-12-22 17:26:48,157+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-9) [b754ed48-34d6-4d16-a82c-931fc7af1721] HostName = host_mixed_3
2019-12-22 17:26:48,158+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-9) [b754ed48-34d6-4d16-a82c-931fc7af1721] Command 'MergeVDSCommand(HostName = host_mixed_3, MergeVDSCommandParameters:{hostId='1e7e5ff5-dc6a-4160-ac39-485ecd56546a', vmId='e72e5af2-9a14-41f0-bddf-fe4d6394121a', storagePoolId='6fa67130-314b-4b16-a2f5-dc93d5ab4e1e', storageDomainId='98c55ff6-4857-4b5e-8e56-31cb3707d5d8', imageGroupId='32c49b05-90f2-4289-b15e-521e1c81fa0f', imageId='fd40794a-dbd8-43d0-9357-9c68bb586870', baseImageId='19fd5cc1-3128-45f5-903e-3f6a9859b98d', topImageId='fd40794a-dbd8-43d0-9357-9c68bb586870', bandwidth='0'})' execution failed: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52
2019-12-22 17:26:48,158+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-9) [b754ed48-34d6-4d16-a82c-931fc7af1721] FINISH, MergeVDSCommand, return: , log id: 3bfc124
2019-12-22 17:26:48,158+02 ERROR [org.ovirt.engine.core.bll.MergeCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-9) [b754ed48-34d6-4d16-a82c-931fc7af1721] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52 (Failed with error mergeErr and code 52)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:114)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2116)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.MergeCommand.executeCommand(MergeCommand.java:45)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1168)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1326)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2005)
        at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202)
        at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInRequired(TransactionSupport.java:137)
:

2019-12-22 17:26:50,002+02 ERROR [org.ovirt.engine.core.bll.MergeStatusCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-10) [b754ed48-34d6-4d16-a82c-931fc7af1721] Failed to live merge. Top volume fd40794a-dbd8-43d0-9357-9c68bb586870 is still in qemu chain [709ce457-defb-479e-ba50-43a73dfecab6, 19fd5cc1-3128-45f5-903e-3f6a9859b98d, fd40794a-dbd8-43d0-9357-9c68bb586870]
2019-12-22 17:26:50,872+02 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-97) [b754ed48-34d6-4d16-a82c-931fc7af1721] Command id: '91e1f8ed-9643-4118-93cf-4167e649ada3 failed child command status for step 'MERGE_STATUS'
2019-12-22 17:26:50,872+02 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-97) [b754ed48-34d6-4d16-a82c-931fc7af1721] Command 'RemoveSnapshotSingleDiskLive' id: '91e1f8ed-9643-4118-93cf-4167e649ada3' child commands '[09605d58-8cf1-4d4c-a8e4-c517e64ab027, af9b5c8e-0bdd-4f34-bb13-91cd05eb8c5f, 7bcc49b6-4bbd-43ff-99bf-1bab69661f45]' executions were completed, status 'FAILED'
2019-12-22 17:26:51,884+02 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-94) [b754ed48-34d6-4d16-a82c-931fc7af1721] Command 'RemoveSnapshot' (id: 'b37fd078-0134-4eaf-8f35-18b79facbaa3') waiting on child command id: '91e1f8ed-9643-4118-93cf-4167e649ada3' type:'RemoveSnapshotSingleDiskLive' to complete
2019-12-22 17:26:51,895+02 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-94) [b754ed48-34d6-4d16-a82c-931fc7af1721] Merging of snapshot '11859df3-1d16-400e-b09a-7fa9426b3f0f' images '19fd5cc1-3128-45f5-903e-3f6a9859b98d'..'fd40794a-dbd8-43d0-9357-9c68bb586870' failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2019-12-22 17:26:51,902+02 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-94) [b754ed48-34d6-4d16-a82c-931fc7af1721] Ending command 'org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand' with failure.
2019-12-22 17:26:52,919+02 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-57) [b754ed48-34d6-4d16-a82c-931fc7af1721] Command 'RemoveSnapshot' id: 'b37fd078-0134-4eaf-8f35-18b79facbaa3' child commands '[91e1f8ed-9643-4118-93cf-4167e649ada3]' executions were completed, status 'FAILED'
2019-12-22 17:26:54,083+02 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-55) [b754ed48-34d6-4d16-a82c-931fc7af1721] Ending command 'org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand' with failure.
2019-12-22 17:26:54,142+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-55) [b754ed48-34d6-4d16-a82c-931fc7af1721] EVENT_ID: USER_REMOVE_SNAPSHOT_FINISHED_FAILURE(357), Failed to delete snapshot 's2' for VM 'vm_nfs_test'.

The host time is 7 hours behind the engine.
VDSM log:
2019-12-22 10:26:48,094-0500 ERROR (jsonrpc/2) [virt.vm] (vmId='e72e5af2-9a14-41f0-bddf-fe4d6394121a') Live merge failed (job: de256113-299b-45f9-9d44-e69ff4d884b8) (vm:5544)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5542, in merge
    bandwidth, flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 728, in blockCommit
    if ret == -1: raise libvirtError ('virDomainBlockCommit() failed', dom=self)
libvirt.libvirtError: internal error: qemu block name 'json:{"backing": {"backing": {"driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/709ce457-defb-479e-ba50-43a73dfecab6"}}, "driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/19fd5cc1-3128-45f5-903e-3f6a9859b98d"}}, "driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/fd40794a-dbd8-43d0-9357-9c68bb586870"}}' doesn't match expected '/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b
2019-12-22 10:26:48,130-0500 INFO  (jsonrpc/2) [api.virt] FINISH merge return={'status': {'code': 52, 'message': 'Merge failed'}} from=::ffff:10.35.162.6,41762, flow_id=b754ed48-34d6-4d16-a82c-931fc7af1721, vmId=e72e5af2-9a14-41f0-bddf-fe4d6394121a (api:54)
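
For context, vdsm's merge verb ends up calling libvirt's virDomainBlockCommit(), as the traceback above shows. The same code path can be exercised by hand with virsh (a rough sketch only: the disk target 'vda' is an assumption, and the base/top paths are the volumes from the logs above, abbreviated with '...'):

  virsh blockcommit vm_nfs_test vda \
      --base '/rhev/data-center/mnt/.../32c49b05-90f2-4289-b15e-521e1c81fa0f/19fd5cc1-3128-45f5-903e-3f6a9859b98d' \
      --top  '/rhev/data-center/mnt/.../32c49b05-90f2-4289-b15e-521e1c81fa0f/fd40794a-dbd8-43d0-9357-9c68bb586870' \
      --wait --verbose

On the affected libvirt this would likely fail with the same "qemu block name ... doesn't match expected" error, because the pre-blockdev code path matches the requested base/top volumes against filenames.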

Libvirt log:
2019-12-22 15:26:09.508+0000: 4529: error : virProcessRunInFork:1170 : internal error: child reported (status=125): internal error: child reported (status=125): 

2019-12-22 15:26:48.055+0000: 4529: error : qemuMonitorJSONDiskNameLookupOne:4648 : internal error: qemu block name 'json:{"backing": {"driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/709ce457-defb-479e-ba50-43a73dfecab6"}}, "driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/19fd5cc1-3128-45f5-903e-3f6a9859b98d"}}' doesn't match expected '/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/19fd5cc1-3128-45f5-903e-3f6a9859b98d'


2019-12-22 15:26:48.063+0000: 4529: error : qemuMonitorJSONDiskNameLookupOne:4648 : internal error: qemu block name 'json:{"backing": {"backing": {"driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/709ce457-defb-479e-ba50-43a73dfecab6"}}, "driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/19fd5cc1-3128-45f5-903e-3f6a9859b98d"}}, "driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/fd40794a-dbd8-43d0-9357-9c68bb586870"}}' doesn't match expected '/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b



Version-Release number of selected component (if applicable):
http://bob.eng.lab.tlv.redhat.com/builds/4.4/rhv-4.4.0-13/

ovirt-engine-4.4.0-0.13.master.el7.noarch
vdsm-4.40.0-164.git38a19bb.el8ev.x86_64
libvirt-5.6.0-6.module+el8.1.0+4244+9aa4e6bb.x86_64
qemu-kvm-4.1.0-14.module+el8.1.0+4754+8d38b36b.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a VM from a template (RHEL 8.0-based) on an iSCSI or NFS disk (tried both)
2. Start the VM
3. Create a snapshot
4. Delete the snapshot

Actual results:
The snapshot is not deleted and the snapshot disk changes to an illegal state, making it impossible to undo/preview or delete the snapshot.

Expected results:
The snapshot is deleted successfully and the snapshot disk remains in a legal state.

Additional info:
Even after moving the VM to the Down state, the snapshot still cannot be deleted.

This is a very basic flow that occurs on all storage flavors and blocks tests and automation, hence the severity.

Comment 1 Avihai 2019-12-22 16:21:10 UTC
More info:
Also tested live merge on a VM not based on a template -> merge fails with the same error.
Cold merge (deleting the snapshot while the VM is down) -> works.

Comment 2 Eyal Shenitzky 2019-12-23 06:07:07 UTC
This error comes from QEMU and can be seen in the libvirtd logs.

2019-12-22 15:26:48.063+0000: 4529: error : qemuMonitorJSONDiskNameLookupOne:4648 : internal error: qemu block name 'json:{"backing": {"backing": {"driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/709ce457-defb-479e-ba50-43a73dfecab6"}}, "driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/19fd5cc1-3128-45f5-903e-3f6a9859b98d"}}, "driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/fd40794a-dbd8-43d0-9357-9c68bb586870"}}' doesn't match expected '/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b

If I remember correctly live merge is broken in qemu-kvm-4.1.0-14 - 
https://bugs.launchpad.net/qemu/+bug/1846427

Comment 3 Kevin Wolf 2020-01-07 11:31:16 UTC
(In reply to Eyal Shenitzky from comment #2)
> If I remember correctly live merge is broken in qemu-kvm-4.1.0-14 - 
> https://bugs.launchpad.net/qemu/+bug/1846427

That Launchpad bug is not related to live merge. It has been fixed for RHEL-AV in bug 1764721.

> 2019-12-22 15:26:48.063+0000: 4529: error :
> qemuMonitorJSONDiskNameLookupOne:4648 : internal error: qemu block name
> 'json:{"backing": {"backing": {"driver": "qcow2", "file": {"driver": "file",
> "filename":
> "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:
> _Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-
> 31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/709ce457-defb-479e-
> ba50-43a73dfecab6"}}, "driver": "qcow2", "file": {"driver": "file",
> "filename":
> "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:
> _Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-
> 31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/19fd5cc1-3128-45f5-
> 903e-3f6a9859b98d"}}, "driver": "qcow2", "file": {"driver": "file",
> "filename":
> "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:
> _Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-
> 31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/fd40794a-dbd8-43d0-
> 9357-9c68bb586870"}}' doesn't match expected
> '/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:
> _Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b
 
Something must have led to a situation where a backing file link in the qcow2 header of an image in the backing chain didn't match the file name with which the backing file was actually opened in QEMU, so that QEMU ended up generating a json: filename describing the whole backing chain instead of a plain one when adding a snapshot. The libvirt logs show this in query-block after calling blockdev-snapshot-sync (line 13131):

"file": "json:{\"backing\": {\"backing\": {\"driver\": \"qcow2\", \"file\": {\"driver\": \"file\",
 \"filename\": \"/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/709ce457-defb-479e-ba50-43a73dfecab6\"}}, \"driver\": \"qcow2\", \"file\": {\"driver\": \"file\", \"filename\": \"/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/19fd5cc1-3128-45f5-903e-3f6a9859b98d\"}}, \"driver\": \"qcow2\", \"file\": {\"driver\": \"file\", \"filename\": \"/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/fd40794a-dbd8-43d0-9357-9c68bb586870\"}}"


The command to create the snapshot was:

{"execute":"transaction","arguments":{"actions":[{"type":"blockdev-snapshot-sync","data":{"device":"drive-ua-32c49b05-90f2-4289-b15e-521e1c81fa0f","snapshot-file":"/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/fd40794a-dbd8-43d0-9357-9c68bb586870","format":"qcow2","mode":"existing"}}]},"id":"libvirt-315"}


To make this a bit more legible, we have the following backing chain (only using the last component of the filename):

    43a73dfecab6 (existing backing file) <- 3f6a9859b98d (existing overlay) <- 9c68bb586870 (newly created snapshot)

That we use json: filenames must mean that the backing file link in 3f6a9859b98d differs from the path we have for 43a73dfecab6. This part of the backing chain is preexisting. I'm not sure if the backing file link stored in the image header of 3f6a9859b98d is somewhere in the libvirt logs, but I couldn't find it at least. Maybe manually checking with 'qemu-img info' could help here.
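
A minimal check along those lines (a sketch; the path is the pre-existing overlay 3f6a9859b98d from the chain above, abbreviated with '...'):

  qemu-img info --backing-chain '/rhev/data-center/mnt/.../32c49b05-90f2-4289-b15e-521e1c81fa0f/19fd5cc1-3128-45f5-903e-3f6a9859b98d'

The 'backing file:' line shows the exact string stored in the qcow2 header; if it differs from the path QEMU actually opened for 43a73dfecab6 (e.g. a relative path or a different mount prefix), QEMU falls back to a json: filename for the chain.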


Immediately after starting blockdev-snapshot-sync, there is this chunk in the log file, which doesn't look particularly healthy either, but I'm not sure what it's trying to tell me. Did the QEMU monitor connection die, but QEMU survived? Is this an indication of a problem in libvirt or one in QEMU?

2019-12-22 15:26:09.474+0000: 4527: info : qemuMonitorJSONIOProcessLine:242 : QEMU_MONITOR_RECV_REPLY: mon=0x7efd10036280 reply={"return": {}, "id": "libvirt-315"}
2019-12-22 15:26:09.474+0000: 4529: debug : qemuDomainObjExitMonitorInternal:8243 : Exited monitor (mon=0x7efd10036280 vm=0x7efd0400cf40 name=vm_nfs_test)
2019-12-22 15:26:09.474+0000: 4529: debug : qemuDomainObjEndJob:8104 : Stopping job: async nested (async=snapshot vm=0x7efd0400cf40 name=vm_nfs_test)
2019-12-22 15:26:09.479+0000: 4529: debug : virFileClose:114 : Closed fd 36
2019-12-22 15:26:09.493+0000: 4529: error : virProcessRunInFork:1170 : internal error: child reported (status=125): 
2019-12-22 15:26:09.493+0000: 4529: debug : virFileClose:114 : Closed fd 34
2019-12-22 15:26:09.497+0000: 4529: debug : virFileClose:114 : Closed fd 36
2019-12-22 15:26:09.508+0000: 4529: error : virProcessRunInFork:1170 : internal error: child reported (status=125): internal error: child reported (status=125): 
2019-12-22 15:26:09.508+0000: 4529: debug : virFileClose:114 : Closed fd 34
2019-12-22 15:26:09.508+0000: 4529: warning : qemuDomainSnapshotUpdateDiskSources:15405 : Unable to move disk metadata on vm vm_nfs_test

Comment 4 Peter Krempa 2020-01-07 11:59:22 UTC
(In reply to Kevin Wolf from comment #3)
> (In reply to Eyal Shenitzky from comment #2)
> > If I remember correctly live merge is broken in qemu-kvm-4.1.0-14 - 
> > https://bugs.launchpad.net/qemu/+bug/1846427
> 
> That Launchpad bug is not related to live merge. It has been fixed for
> RHEL-AV in bug 1764721.
> 
> > 2019-12-22 15:26:48.063+0000: 4529: error :
> > qemuMonitorJSONDiskNameLookupOne:4648 : internal error: qemu block name
> > 'json:{"backing": {"backing": {"driver": "qcow2", "file": {"driver": "file",
> > "filename":
> > "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:
> > _Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-
> > 31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/709ce457-defb-479e-
> > ba50-43a73dfecab6"}}, "driver": "qcow2", "file": {"driver": "file",
> > "filename":
> > "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:
> > _Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-
> > 31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/19fd5cc1-3128-45f5-
> > 903e-3f6a9859b98d"}}, "driver": "qcow2", "file": {"driver": "file",
> > "filename":
> > "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:
> > _Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-
> > 31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/fd40794a-dbd8-43d0-
> > 9357-9c68bb586870"}}' doesn't match expected
> > '/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:
> > _Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b
>  
> Something must have led to a situation where a backing file link in the
> qcow2 header of an image in the backing chain didn't match the file name
> with which the backing file was actually opened in QEMU, so that QEMU ended
> up generating a json: filename describing the whole backing chain instead of
> a plain one when adding a snapshot. The libvirt logs show this in
> query-block after calling blockdev-snapshot-sync (line 13131):
> 
> "file": "json:{\"backing\": {\"backing\": {\"driver\": \"qcow2\", \"file\":
> {\"driver\": \"file\",
>  \"filename\":
> \"/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:
> _Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-
> 31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/709ce457-defb-479e-
> ba50-43a73dfecab6\"}}, \"driver\": \"qcow2\", \"file\": {\"driver\":
> \"file\", \"filename\":
> \"/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:
> _Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-
> 31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/19fd5cc1-3128-45f5-
> 903e-3f6a9859b98d\"}}, \"driver\": \"qcow2\", \"file\": {\"driver\":
> \"file\", \"filename\":
> \"/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:
> _Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-
> 31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/fd40794a-dbd8-43d0-
> 9357-9c68bb586870\"}}"
> 
> 
> The command to create the snapshot was:
> 
> {"execute":"transaction","arguments":{"actions":[{"type":"blockdev-snapshot-
> sync","data":{"device":"drive-ua-32c49b05-90f2-4289-b15e-521e1c81fa0f",
> "snapshot-file":"/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:
> _Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-
> 31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/fd40794a-dbd8-43d0-
> 9357-9c68bb586870","format":"qcow2","mode":"existing"}}]},"id":"libvirt-315"}
> 
> 
> To make this a bit more legible, we have the following backing chain (only
> using the last component of the filename):
> 
>     43a73dfecab6 (existing backing file) <- 3f6a9859b98d (existing overlay)
> <- 9c68bb586870 (newly created snapshot)
> 
> That we use json: filenames must mean that the backing file link in
> 3f6a9859b98d differs from the path we have for 43a73dfecab6. This part of
> the backing chain is preexisting. I'm not sure if the backing file link
> stored in the image header of 3f6a9859b98d is somewhere in the libvirt logs,
> but I couldn't find it at least. Maybe manually checking with 'qemu-img
> info' could help here.
> 
> 
> Immediately after starting blockdev-snapshot-sync, there is this chunk in
> the log file, which doens't look particularly healthy either, but I'm not
> sure what it's trying to tell me. Did the QEMU monitor connection die, but
> QEMU survived? Is this an indication of a problem in libvirt or one in QEMU?
> 
> 2019-12-22 15:26:09.474+0000: 4527: info : qemuMonitorJSONIOProcessLine:242
> : QEMU_MONITOR_RECV_REPLY: mon=0x7efd10036280 reply={"return": {}, "id":
> "libvirt-315"}

This is the reply from qemu.

> 2019-12-22 15:26:09.474+0000: 4529: debug :
> qemuDomainObjExitMonitorInternal:8243 : Exited monitor (mon=0x7efd10036280
> vm=0x7efd0400cf40 name=vm_nfs_test)

This is healthy. This message means that the control flow left the monitor context and returned into the domain context.

> 2019-12-22 15:26:09.474+0000: 4529: debug : qemuDomainObjEndJob:8104 :
> Stopping job: async nested (async=snapshot vm=0x7efd0400cf40
> name=vm_nfs_test)

This means that the control flow left the domain job. Also healthy and expected.

> 2019-12-22 15:26:09.479+0000: 4529: debug : virFileClose:114 : Closed fd 36
> 2019-12-22 15:26:09.493+0000: 4529: error : virProcessRunInFork:1170 :
> internal error: child reported (status=125): 
> 2019-12-22 15:26:09.493+0000: 4529: debug : virFileClose:114 : Closed fd 34
> 2019-12-22 15:26:09.497+0000: 4529: debug : virFileClose:114 : Closed fd 36
> 2019-12-22 15:26:09.508+0000: 4529: error : virProcessRunInFork:1170 :
> internal error: child reported (status=125): internal error: child reported
> (status=125): 
> 2019-12-22 15:26:09.508+0000: 4529: debug : virFileClose:114 : Closed fd 34
> 2019-12-22 15:26:09.508+0000: 4529: warning :
> qemuDomainSnapshotUpdateDiskSources:15405 : Unable to move disk metadata on
> vm vm_nfs_test

This last block is from the security labelling code. The last message is produced by the code that is meant to remove the XATTR labels libvirt uses for tracking the original owner of the image. The image looks somewhat like a block device, in which case XATTR labels would not work, but other than reporting a warning this shouldn't pose a problem.
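
For reference, these XATTR labels can be inspected directly on the image file (a sketch; the 'trusted.libvirt' attribute namespace is an assumption about the label names, reading trusted.* xattrs requires root, and the path is the new snapshot volume from the chain above, abbreviated with '...'):

  getfattr -d -m 'trusted.libvirt' '/rhev/data-center/mnt/.../32c49b05-90f2-4289-b15e-521e1c81fa0f/fd40794a-dbd8-43d0-9357-9c68bb586870'

An empty result would be consistent with the reading above: either the labels were never set because the file behaves like a block device in this respect, or they were already removed, and the warning is harmless.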

Comment 5 Peter Krempa 2020-01-07 12:00:17 UTC
(In reply to Eyal Shenitzky from comment #2)
> This error come from QEMU and can be seen in libvirtd logs.
> 
> 2019-12-22 15:26:48.063+0000: 4529: error :
> qemuMonitorJSONDiskNameLookupOne:4648 : internal error: qemu block name
> 'json:{"backing": {"backing": {"driver": "qcow2", "file": {"driver": "file",
> "filename":

The above should be fixed by using blockdev, since we don't use qemuMonitorJSONDiskNameLookupOne in that control flow any more.

Comment 6 Tal Nisan 2020-01-13 15:26:28 UTC
(In reply to Peter Krempa from comment #5)
> (In reply to Eyal Shenitzky from comment #2)
> > This error come from QEMU and can be seen in libvirtd logs.
> > 
> > 2019-12-22 15:26:48.063+0000: 4529: error :
> > qemuMonitorJSONDiskNameLookupOne:4648 : internal error: qemu block name
> > 'json:{"backing": {"backing": {"driver": "qcow2", "file": {"driver": "file",
> > "filename":
> 
> This above should be fixed by using blockdev since we don't use
> qemuMonitorJSONDiskNameLookupOne in that control flow any more.

Peter, who should fix this?

Comment 7 Peter Krempa 2020-01-13 15:46:22 UTC
Well, that depends on what version of qemu this should be fixed in.

First of all, I don't really know what happened in qemu such that it switched to the JSON description for the image name. This means that the qemu team should probably investigate it. If that happens, libvirt can't do anything about it. The idea of qemuMonitorJSONDiskNameLookupOne was to feed qemu the same string it might think it uses. This clearly doesn't work with the JSON-based strings, though. The only sane and stable fix is to use node-names, which are used starting from qemu-4.2 and libvirt-5.10.
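
The node-names in question can be listed on a host with the newer stack (a sketch; vm_nfs_test is the domain from this report, and it assumes the VM was started by a libvirt new enough to use -blockdev):

  virsh qemu-monitor-command vm_nfs_test --pretty '{"execute": "query-named-block-nodes"}'

With -blockdev every layer of the qcow2 chain gets a stable node-name that block-commit can reference directly, so libvirt no longer has to match images by filename and the json: pseudo-filename stops mattering.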

It would be beneficial to see what the output of 'qemu-img info' of the individual images looks like (sorry, I will not try to figure out which particular file I'm interested in at this point since oVirt uses those insane hash-based names).

At any rate, if you want to fix it without upgrading qemu or libvirt, we first need to understand what is happening with the json stuff and thus the qemu team is probably the best bet.

Comment 8 Benny Zlotnik 2020-01-13 15:48:05 UTC
*** Bug 1787378 has been marked as a duplicate of this bug. ***

Comment 9 Martin Tessun 2020-01-20 14:35:04 UTC
(In reply to Peter Krempa from comment #7)
> Well, that depends on what version of qemu this should be fixed in.
> 

This should be RHEL-AV 8.2 (qemu-4.2) for RHV 4.4.

> First of all I don't know really what happened in qemu to that it switched
> to the JSON description for the image name. This means that probably the
> qemu team should investigate it. If that happens libvirt can't do anything
> about that. The idea of qemuMonitorJSONDiskNameLookupOne was to feed qemu
> the same string it might think it uses. This clearly doesn't work with the
> JSON based strings though. The only sane and stable fix is to use node-names
> which are used starting from qemu-4.2 and libvirt-5.10.


This looks like something RHV/oVirt should switch to then - using blockjobs. So how does vdsm get to the qemuMonitor command? Does it issue it directly, or does it go via libvirt?
Is oVirt/RHV 4.4 using blockdev-*?

@Peter: It should, even without doing anything, as it is the default in libvirt 5.10 now?

> It would be beneficial to see what the output of 'qemu-img info' of the
> individual images looks like (sorry I will not try to figure out which
> particular file I'm interrested in at this point since oVirt uses those
> insane hash-based names).
> 
> At any rate, if you want to fix it without upgrading qemu or libvirt, we
> first need to understand what is happening with the json stuff and thus the
> qemu team is probably the best bet.

@Tal: Please open an appropriate qemu BZ with the affected flow on libvirt/qemu level (what is triggered by vdsm).

Comment 10 Peter Krempa 2020-01-20 14:41:55 UTC
(In reply to Martin Tessun from comment #9)
> (In reply to Peter Krempa from comment #7)
> > Well, that depends on what version of qemu this should be fixed in.
> > 
> 
> This should be RHEL-AV 8.2 (qemu-4.2) for RHV 4.4.
> 
> > First of all I don't know really what happened in qemu to that it switched
> > to the JSON description for the image name. This means that probably the
> > qemu team should investigate it. If that happens libvirt can't do anything
> > about that. The idea of qemuMonitorJSONDiskNameLookupOne was to feed qemu
> > the same string it might think it uses. This clearly doesn't work with the
> > JSON based strings though. The only sane and stable fix is to use node-names
> > which are used starting from qemu-4.2 and libvirt-5.10.
> 
> 
> This looks like something RHV/oVirt should switch to then - using blockjobs.
> So how does vdsm get the qemuMonitor command? Does it issue it directly or
> does it check via libvirt?
> Is oVirt/RHV 4.4 using blockdev-*?

I'm not sure whether they are based on RHEL-AV 8.2 currently.

> 
> @Peter: It should, even if not doing anything as it is default in libvirt
> 5.10 now?

Yes, that will be the only (and thus default) way on RHEL-AV 8.2, so no change other than upgrading libvirt and qemu should be necessary.

Comment 11 RHEL Program Management 2020-01-20 15:15:56 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 14 Nir Soffer 2020-02-13 12:12:36 UTC
Is this reproducible with RHEL 8.2 with advanced virt module?

We don't care about any issue in RHEL 8.1 since RHV 4.4 will be shipped
with RHEL 8.2.

Comment 16 Yosi Ben Shimon 2020-02-17 14:32:14 UTC
Tested on 4.4.0-20
RedHat release:
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)

I tried the steps in the bug description and faced an issue while creating a live snapshot (with memory) - https://bugzilla.redhat.com/show_bug.cgi?id=1798462
The VM was shut down during the process.
I started the VM again and deleted the snapshot.

Besides the crash after the snapshot creation, no errors were found and all went well.
Note that taking a live snapshot without memory worked fine.

IMO, this bug can be closed.

Comment 17 Michal Skrivanek 2020-02-19 08:12:59 UTC
(In reply to Nir Soffer from comment #14)
> Is this reproducible with RHEL 8.2 with advanced virt module?
> 
> We don't care about any issue in RHEL 8.1 since RHV 4.4 will be shipped
> with RHEL 8.2.

Well, we do for oVirt. There's no 8.2 AV in oVirt yet, and the current plans are not to wait for it.

Comment 18 Michal Skrivanek 2020-02-19 08:16:09 UTC
(In reply to Yosi Ben Shimon from comment #16)
> IMO, this bug can be closed.

Since this is an oVirt bug, it can't be closed until this is solved for oVirt.

Comment 19 Nir Soffer 2020-02-19 09:38:34 UTC
We cannot solve this in oVirt. I don't think there is a chance to get a fix in libvirt
for this issue at this point, since libvirt 6.0 uses totally different code.

Comment 20 Peter Krempa 2020-03-17 17:05:56 UTC
Please make sure to file appropriate bugs for libvirt once it's known what's required.

Comment 21 Avihai 2020-03-18 07:26:23 UTC
We are still seeing this issue with RHEL 8.2 hosts and engine, so the issue is still there.
Daniel M saw this when he tried to verify bug 1690475 and will add all the info soon.

Please check it out.

Comment 22 Daniel 2020-03-18 09:37:22 UTC
As Avihai mentioned, the bug still occurs on RHEL 8.2.
I found it while trying to verify bug 1690475.

working version:
4.4.0-0.24.master.el8ev | Red Hat Enterprise Linux release 8.2 Beta (Ootpa)

bug reproduced:
100%

steps:
1. Create an NFS / Gluster disk and attach it to a VM
2. Perform LSM to a different NFS / Gluster domain

Expected:
LSM should work properly and the disk should be migrated.

Actual:
An error is raised when trying to LSM in one of the next cases:
nfs -> nfs
nfs -> gluster
gluster -> gluster
gluster -> nfs

Engine Logs:
2020-03-17 12:38:59,268+02 INFO  [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-6) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command CopyData id: '19240136-d097-4416-b981-6deaee59c48a': execution was completed, the command status is 'FAILED'
2020-03-17 12:38:59,544+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-6) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command 'LiveMigrateDisk' (id: '4f37c28a-597f-4749-bc0a-a890358b9cf5') waiting on child command id: '0cedd7c9-06cd-417b-8b45-11146bb16134' type:'CopyImageGroupVolumesData' to complete
2020-03-17 12:39:00,557+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-45) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand' with failure.
2020-03-17 12:39:00,573+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-45) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command 'CopyImageGroupVolumesData' id: '0cedd7c9-06cd-417b-8b45-11146bb16134' child commands '[19240136-d097-4416-b981-6deaee59c48a]' executions were completed, status 'FAILED'
2020-03-17 12:39:01,587+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-94) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand' with failure.
2020-03-17 12:39:01,662+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-94) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command 'LiveMigrateDisk' id: '4f37c28a-597f-4749-bc0a-a890358b9cf5' child commands '[a469ef90-ee0b-46f4-b263-cc6c6ddced92, 2239dbf0-5dff-4384-ae2d-f531b2c9ad72, 0cedd7c9-06cd-417b-8b45-11146bb16134]' executions were completed, status 'FAILED'
2020-03-17 12:39:02,751+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.
2020-03-17 12:39:02,752+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Failed during live storage migration of disk '1bf2203b-1bb6-4d0f-84ac-5cc29fcb0d4c' of vm '0deb41e6-deb6-41c5-9f42-bc3dda1a1238', attempting to end replication before deleting the target disk
2020-03-17 12:39:02,753+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] START, VmReplicateDiskFinishVDSCommand(HostName = host_mixed_1, VmReplicateDiskParameters:{hostId='9804e9ec-b212-40ce-9c8b-934675f1e896', vmId='0deb41e6-deb6-41c5-9f42-bc3dda1a1238', storagePoolId='677451ec-4159-4fdc-8136-dec3fb9d861b', srcStorageDomainId='6543bae2-f959-48a7-8b82-84e2a0a5a1a7', targetStorageDomainId='6543bae2-f959-48a7-8b82-84e2a0a5a1a7', imageGroupId='1bf2203b-1bb6-4d0f-84ac-5cc29fcb0d4c', imageId='af3437c4-9fd3-4ebd-89c3-e3a9014eb206'}), log id: 387ea7c5
2020-03-17 12:39:02,882+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] FINISH, VmReplicateDiskFinishVDSCommand, return: , log id: 387ea7c5
2020-03-17 12:39:02,883+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Attempting to delete the target of disk '1bf2203b-1bb6-4d0f-84ac-5cc29fcb0d4c' of vm '0deb41e6-deb6-41c5-9f42-bc3dda1a1238'


Attachments:
Extended logs attached.

Comment 23 Daniel 2020-03-18 09:38:39 UTC
Created attachment 1671018 [details]
LSM_engine_log

Comment 24 Nir Soffer 2020-03-18 09:52:29 UTC
(In reply to Daniel from comment #22)
> As Avichai mentioned, bug still occurs on rhel8.2.
> Found it out while trying to verify 1690475. 
> 
> working version:
> 4.4.0-0.24.master.el8ev | Red Hat Enterprise Linux release 8.2 Beta (Ootpa)

This is meaningless. Please show output of:

rpm -q libvirt-daemon
rpm -q qemu

> 
> bug reproduced:
> 100%

This is meaningless. How many tries? how many failures?

> steps:
> 1. create nfs \ gluster disk and attach it to VM
> 2. make LSM to different nfs \ gluster
> 
> Expected:
> LSM should work properly and the disk should migrated.
> 
> Actual:
> An error raised when trying to LSM in one of the next cases:

What do you mean by "in one"? All flows failed? only one failed?

> nfs -> nfs
> nfs -> gluster
> gluster -> gluster
> gluster -> nfs
> 
> Engine Logs:

Are not helpful. You must check the error in vdsm and libvirt logs.

Note that the original bug was this error in libvirt logs:

error : qemuMonitorJSONDiskNameLookupOne:4648 : internal error: qemu block name 'json:{"backing": {"backing": {"driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/709ce457-defb-479e-ba50-43a73dfecab6"}}, "driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/19fd5cc1-3128-45f5-903e-3f6a9859b98d"}}, "driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/fd40794a-dbd8-43d0-9357-9c68bb586870"}}' doesn't match expected '/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b

If the flow does not fail with this error, this bug is fixed and we can close it.

If the flow still fails with 8.2, please file a new bug about a new issue, unless
we already have a bug.

Comment 25 Liran Rotenberg 2020-03-19 13:10:41 UTC
*** Bug 1779644 has been marked as a duplicate of this bug. ***

Comment 26 Daniel 2020-03-22 08:20:55 UTC
(In reply to Nir Soffer from comment #24)
> 
> This is meaningless. Please show output of:
> 
> rpm -q libvirt-daemon

libvirt-daemon-6.0.0-7.module+el8.2.0+5869+c23fe68b.x86_64

> rpm -q qemu

rpm -qa | grep qemu:
qemu-img-4.2.0-13.module+el8.2.0+5898+fb4bceae.x86_64

> 
> > 
> > bug reproduced:
> > 100%
> 
> This is meaningless. How many tries? how many failures?
I meant that the bug reproduced 100% of the times I tried (more than 10).
> 
> > steps:
> > 1. create nfs \ gluster disk and attach it to VM
> > 2. make LSM to different nfs \ gluster
> > 
> > Expected:
> > LSM should work properly and the disk should migrated.
> > 
> > Actual:
> > An error raised when trying to LSM in one of the next cases:
> 
> What do you mean by "in one"? All flows failed? only one failed?
My bad. An error is raised when trying to LSM in every one of the following cases:
nfs -> nfs
nfs -> gluster
gluster -> gluster
gluster -> nfs

> > Engine Logs:
> 
> Are not helpful. You must check the error in vdsm and libvirt logs.
> 
> Note that the original bug was this error in libvirt logs:
> 
> error : qemuMonitorJSONDiskNameLookupOne:4648 : internal error: qemu block
> name 'json:{"backing": {"backing": {"driver": "qcow2", "file": {"driver":
> "file", "filename":
> "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:
> _Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-
> 31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/709ce457-defb-479e-
> ba50-43a73dfecab6"}}, "driver": "qcow2", "file": {"driver": "file",
> "filename":
> "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:
> _Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-
> 31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/19fd5cc1-3128-45f5-
> 903e-3f6a9859b98d"}}, "driver": "qcow2", "file": {"driver": "file",
> "filename":
> "/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:
> _Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b5e-8e56-
> 31cb3707d5d8/images/32c49b05-90f2-4289-b15e-521e1c81fa0f/fd40794a-dbd8-43d0-
> 9357-9c68bb586870"}}' doesn't match expected
> '/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:
> _Storage__NFS_storage__local__ge7__nfs__0/98c55ff6-4857-4b
> 

The above error does not occur in libvirtd.log.
The vdsm logs are clean.
The engine logs contain the error which I attached in my first comment.


> If the flow does not fail with this error, this bug is fixed and we can
> close it.
> 
> If the flow still fails with 8.2, please file a new bug about a new issue,
> unless
> we already have a bug.


However, there are some errors raised in libvirtd.log.
Can you advise whether it's something new or related, please? (Attached as libvirtd.log)

2020-03-22 07:33:11.883+0000: 3816323: info : virDBusCall:1555 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -D PREROUTING -i vnet0 -j libvirt-J-vnet0' failed: ebtables: Bad rule (does a matching rule exist in that chain?)
..
2020-03-22 07:33:11.912+0000: 3816323: info : virDBusCall:1555 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -D POSTROUTING -o vnet0 -j libvirt-P-vnet0' failed: ebtables: Bad rule (does a matching rule exist in that chain?)

2020-03-22 07:33:11.912+0000: 3816323: info : virFirewallApplyRule:720 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -L libvirt-J-vnet0'
2020-03-22 07:33:11.912+0000: 3816323: info : virDBusCall:1545 : DBUS_METHOD_CALL: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1'
2020-03-22 07:33:11.927+0000: 3816323: info : virDBusCall:1555 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -L libvirt-J-vnet0' failed: ebtables: No chain/target/match by that name

2020-03-22 07:33:11.927+0000: 3816323: info : virFirewallApplyRule:720 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -L libvirt-P-vnet0'
2020-03-22 07:33:11.927+0000: 3816323: info : virDBusCall:1545 : DBUS_METHOD_CALL: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1'
2020-03-22 07:33:11.935+0000: 3816323: info : virDBusCall:1555 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -L libvirt-P-vnet0' failed: ebtables: No chain/target/match by that name

2020-03-22 07:33:11.935+0000: 3816323: info : virFirewallApplyRule:720 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -F libvirt-J-vnet0'
2020-03-22 07:33:11.935+0000: 3816323: info : virDBusCall:1545 : DBUS_METHOD_CALL: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1'
2020-03-22 07:33:11.943+0000: 3816323: info : virDBusCall:1555 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -F libvirt-J-vnet0' failed: ebtables: No chain/target/match by that name

2020-03-22 07:33:11.944+0000: 3816323: info : virFirewallApplyRule:720 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -X libvirt-J-vnet0'
2020-03-22 07:33:11.944+0000: 3816323: info : virDBusCall:1545 : DBUS_METHOD_CALL: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1'
2020-03-22 07:33:11.952+0000: 3816323: info : virDBusCall:1555 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -X libvirt-J-vnet0' failed: ebtables: No chain/target/match by that name

2020-03-22 07:33:11.952+0000: 3816323: info : virFirewallApplyRule:720 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -F libvirt-P-vnet0'
2020-03-22 07:33:11.952+0000: 3816323: info : virDBusCall:1545 : DBUS_METHOD_CALL: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1'
2020-03-22 07:33:11.960+0000: 3816323: info : virDBusCall:1555 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -F libvirt-P-vnet0' failed: ebtables: No chain/target/match by that name

2020-03-22 07:33:11.960+0000: 3816323: info : virFirewallApplyRule:720 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -X libvirt-P-vnet0'
2020-03-22 07:33:11.960+0000: 3816323: info : virDBusCall:1545 : DBUS_METHOD_CALL: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1'
2020-03-22 07:33:11.968+0000: 3816323: info : virDBusCall:1555 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -X libvirt-P-vnet0' failed: ebtables: No chain/target/match by that name

2020-03-22 07:33:11.968+0000: 3816323: info : virFirewallApplyGroup:774 : Starting transaction for firewall=0x7f61f81179a0 group=0x7f61f8122580 flags=0x0
2020-03-22 07:33:11.968+0000: 3816323: info : virFirewallApplyRule:720 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -N libvirt-J-vnet0'
2020-03-22 07:33:11.968+0000: 3816323: info : virDBusCall:1545 : DBUS_METHOD_CALL: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1'
2020-03-22 07:33:11.976+0000: 3816323: info : virDBusCall:1572 : DBUS_METHOD_REPLY: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1'
2020-03-22 07:33:11.976+0000: 3816323: info : virFirewallApplyRule:720 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -F J-vnet0-mac'
2020-03-22 07:33:11.976+0000: 3816323: info : virDBusCall:1545 : DBUS_METHOD_CALL: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1'
2020-03-22 07:33:11.985+0000: 3816323: info : virDBusCall:1555 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -F J-vnet0-mac' failed: ebtables: No chain/target/match by that name

2020-03-22 07:33:11.985+0000: 3816323: info : virFirewallApplyRule:720 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -X J-vnet0-mac'
2020-03-22 07:33:11.985+0000: 3816323: info : virDBusCall:1545 : DBUS_METHOD_CALL: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1'
2020-03-22 07:33:11.994+0000: 3816323: info : virDBusCall:1555 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -X J-vnet0-mac' failed: ebtables: No chain/target/match by that name
..
2020-03-22 07:36:52.225+0000: 3816312: info : virObjectUnref:348 : OBJECT_UNREF: obj=0x7f61f81028d0
2020-03-22 07:36:52.230+0000: 3816312: error : virProcessRunInFork:1161 : internal error: child reported (status=125): 
2020-03-22 07:36:52.235+0000: 3816312: error : virProcessRunInFork:1161 : internal error: child reported (status=125): internal error: child reported (status=125): 
2020-03-22 07:36:52.236+0000: 3816312: warning : qemuBlockJobProcessEventCompletedActiveCommit:1203 : Unable to move disk metadata on vm golden_env_mixed_virtio_1_0
2020-03-22 07:36:52.237+0000: 3816312: info : virObjectRef:386 : OBJECT_REF: obj=0x7f6204002470

Comment 27 Daniel 2020-03-22 08:37:26 UTC
Created attachment 1672297 [details]
libvirtd logs. RHEL 8.2

Comment 28 Michal Skrivanek 2020-03-24 11:18:46 UTC
> rpm -q libvirt-daemon

libvirt-daemon-6.0.0-7.module+el8.2.0+5869+c23fe68b.x86_64

> rpm -q qemu

rpm -qa | grep qemu:
qemu-img-4.2.0-13.module+el8.2.0+5898+fb4bceae.x86_64

This is still old, please retest with an up-to-date AV stack.

Comment 29 Avihai 2020-03-24 13:43:30 UTC
(In reply to Michal Skrivanek from comment #28)
> > rpm -q libvirt-daemon
> 
> libvirt-daemon-6.0.0-7.module+el8.2.0+5869+c23fe68b.x86_64
> 
> > rpm -q qemu
> 
> rpm -qa | grep qemu:
> qemu-img-4.2.0-13.module+el8.2.0+5898+fb4bceae.x86_64
> 
> This is still old, please retest with up to date AV stack

Clearing Nir/Daniel NEEDINFOs as the questions were answered (add again if necessary).
Once QE gets an official 4.4 compose with the needed libvirt/qemu versions, please notify us and we will retest.

Michal, this is the official compose we got for rhv-4.4.0-24, and Daniel answered Nir's questions while reproducing the issue.

When will the fixed libvirt/qemu versions be included in an official compose that QE gets?
Can you please add the patch that deals with this issue to the bug?

For the current/latest rhv-4.4.0-26 we have the following; is this good enough?

rpm -q rpm -q libvirt-daemon
rpm-4.14.2-37.el8.x86_64
libvirt-daemon-6.0.0-13.module+el8.2.0+6048+0fa476b4.x86_64

rpm -qa | grep qemu:
qemu-img-4.2.0-15.module+el8.2.0+6029+618ef2ec.x86_64

Comment 30 Michal Skrivanek 2020-03-26 08:26:16 UTC
(In reply to Avihai from comment #29)
> (In reply to Michal Skrivanek from comment #28)
> > > rpm -q libvirt-daemon
> > 
> > libvirt-daemon-6.0.0-7.module+el8.2.0+5869+c23fe68b.x86_64
> > 
> > > rpm -q qemu
> > 
> > rpm -qa | grep qemu:
> > qemu-img-4.2.0-13.module+el8.2.0+5898+fb4bceae.x86_64
> > 
> > This is still old, please retest with up to date AV stack
> 
> Clearing Nir/Daniel NEEDINFO's as Q's were answered.(add if nessasary)
> Once QE get an official 4.4 compose with the needed libvirt/qemu versions we
> please notify and we will retest. 

Acceptance of the build is not a Dev responsibility, so you need to talk to your internal QE team to find out what's preventing you from using it. That said, it shouldn't really be needed, see below.

> Michal, this is official compose we got for rhv-4.4.0-24 and the issue
> Daniel answered Nir's questions while reproducing the issue.

The AV stack we're using points to nightly repos, so if you use the correct ones you should have seen up-to-date packages. This is not a RHV deliverable, but a platform one. You will see those updates regardless of which RHV compose you use, as long as you have the right nightly repos for AV.

That said, we'll bump up the requirements to make sure it's there, but you do not need to wait for that.
As of today (i.e. if you install today), the versions are:
qemu-kvm-4.2.0-16.module+el8.2.0+6092+4f2391c1.x86_64   
libvirt-6.0.0-15.module+el8.2.0+6106+b6345808.x86_64

Comment 31 Avihai 2020-03-29 08:19:49 UTC
(In reply to Michal Skrivanek from comment #30)
> (In reply to Avihai from comment #29)
> > (In reply to Michal Skrivanek from comment #28)
> > > > rpm -q libvirt-daemon
> > > 
> > > libvirt-daemon-6.0.0-7.module+el8.2.0+5869+c23fe68b.x86_64
> > > 
> > > > rpm -q qemu
> > > 
> > > rpm -qa | grep qemu:
> > > qemu-img-4.2.0-13.module+el8.2.0+5898+fb4bceae.x86_64
> > > 
> > > This is still old, please retest with up to date AV stack
> > 
> > Clearing Nir/Daniel NEEDINFO's as Q's were answered.(add if nessasary)
> > Once QE get an official 4.4 compose with the needed libvirt/qemu versions we
> > please notify and we will retest. 
> 
> Acceptance of the build is not a Dev responsibility, so you need to talk to
> your internal QE team to find out what's preventing you from using it. That
> said, it shouldn't really be needed, see below.
> 
> > Michal, this is official compose we got for rhv-4.4.0-24 and the issue
> > Daniel answered Nir's questions while reproducing the issue.
> 
> AV stack we're using point to nightly repos so if you use the correct ones
> you should have seen up to date repos This is not a RHV deliverable, but a
> platform one. You will see those updates regardless what RHV compose you use
> as long as you have the right nightly repos for AV
> 
> That said, we'll bump up reqs to make sure it' there, but you do not need to
> wait for that or anything.
> As of today(i.e. if you install today), the versions are:
> qemu-kvm-4.2.0-16.module+el8.2.0+6092+4f2391c1.x86_64   
> libvirt-6.0.0-15.module+el8.2.0+6106+b6345808.x86_64
We are working only with downstream build composes and not nightly builds for verifying bugs.
I still did not get an answer to the simple question of which libvirt/qemu version this was fixed in; can you please address that?

Comment 32 Avihai 2020-03-29 08:40:27 UTC
Ilan, please retest on the latest downstream RHV-4.4.0-27, which has the needed packages:

libvirt-6.0.0-15.module+el8.2.0+6106+b6345808.x86_64
qemu-img-4.2.0-16.module+el8.2.0+6092+4f2391c1.x86_64

Comment 33 Michal Skrivanek 2020-03-31 10:35:44 UTC
(In reply to Avihai from comment #31)
> We are working only with downsteam build composed and not nightly builds for
> verifing bugs.

Not correct, you are working with repos provided by the CI team, which are set to RHEL nightly builds.

> I still did not get an answer to the simple question of on what libvirt/qemu
> version was this fixed, can you please address that ?

You can see that in the dependencies of bug 1798072; let me add that here too.

Comment 34 Michal Skrivanek 2020-03-31 10:43:29 UTC
You know what, to distinguish, let's flip this one to a RHV bug. For upstream tracking we still have bug 1798072 (until CentOS 8.2 is around), and we can have a more accurate state from the QE perspective this way.

Comment 41 Daniel 2020-03-31 13:19:03 UTC
Bug verified via web-admin.
The snapshot was deleted successfully and the engine, vdsm, and libvirtd logs are all clean of errors, as they should be.
In addition, LSM worked as expected as well.

worked on:
rhvm-4.4.0-0.29.master.el8ev.noarch
vdsm-jsonrpc-java-1.5.4-1.el8ev.noarch
qemu-kvm-4.2.0-16.module+el8.2.0+6092+4f2391c1.x86_64
libvirt-client-6.0.0-15.module+el8.2.0+6106+b6345808.x86_64

Comment 49 errata-xmlrpc 2020-08-04 13:27:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3246