Bug 1552344 - [Backup restore API][libgfapi] Start VM, with snapshot disk attached, resides on Gluster domain, fails with libvirtError
Summary: [Backup restore API][libgfapi] Start VM, with snapshot disk attached, resides...
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.20.19
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high (1 vote)
Target Milestone: ---
Assignee: bugs@ovirt.org
QA Contact: Avihai
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-03-07 00:07 UTC by Elad
Modified: 2022-01-18 11:18 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-29 10:21:03 UTC
oVirt Team: Storage
Embargoed:


Attachments
logs (976.75 KB, application/x-gzip), 2018-03-07 00:07 UTC, Elad
vdsm.log-iscsi (670.39 KB, application/x-gzip), 2018-03-07 14:01 UTC, Elad
w/o libgfsapi (176.61 KB, application/x-gzip), 2018-03-08 09:04 UTC, Elad


Links
Red Hat Issue Tracker RHV-37614 (last updated 2021-08-23 12:30:45 UTC)

Description Elad 2018-03-07 00:07:18 UTC
Created attachment 1405076
logs

Description of problem:
A basic backup restore API scenario fails: starting a VM that has a snapshot disk attached, residing on a Gluster domain, fails with libvirtError: Cannot access storage file (the transient volume).


Version-Release number of selected component (if applicable):
vdsm-4.20.20-1.el7ev.x86_64
libvirt-3.9.0-13.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64
rhvm-4.2.2.2-0.1.el7.noarch


How reproducible:
Reproduces only with Gluster. Also tested with NFS and iSCSI, where it does not reproduce.

Steps to Reproduce:
1. Create a VM (source VM) with a disk on a Gluster domain (replica 3)
2. Create a snapshot of the source VM
3. Create a second VM (backup VM)
4. Attach the source VM's snapshot disk (the backup disk) to the backup VM via REST (a Python SDK sketch follows these steps)
5. Start the backup VM
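
For reference, a minimal sketch of step 4 using the oVirt Python SDK (ovirtsdk4, shipped as python-ovirt-engine-sdk4); the engine URL, credentials, and the disk and snapshot IDs below are placeholders, not values from this bug:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Connect to the engine (placeholder URL and credentials).
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)

vms_service = connection.system_service().vms_service()
backup_vm = vms_service.list(search='name=backup_vm')[0]
attachments_service = vms_service.vm_service(backup_vm.id).disk_attachments_service()

# Attach the source VM's disk, as captured in the snapshot, to the backup VM.
attachments_service.add(
    types.DiskAttachment(
        disk=types.Disk(
            id='disk-uuid',                              # disk of the source VM (placeholder)
            snapshot=types.Snapshot(id='snapshot-uuid'), # snapshot from step 2 (placeholder)
        ),
        interface=types.DiskInterface.VIRTIO,
        bootable=False,
        active=True,
    )
)

connection.close()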


Actual results:

Start VM fails:

2018-03-07 01:05:25,486+0200 ERROR (vm/3a1ad46b) [virt.vm] (vmId='3a1ad46b-3854-4602-bd85-330c799eba96') The vm start process failed (vm:940)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 869, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2832, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: Cannot access storage file '/var/lib/vdsm/transient/d8930944-2305-4859-b616-03e135968ad9-534d83aa-7125-43ce-a440-58abd08a4ebf.CXRJka' (as uid:107, gid:107): No such file or directory
2018-03-07 01:05:25,488+0200 INFO  (vm/3a1ad46b) [virt.vm] (vmId='3a1ad46b-3854-4602-bd85-330c799eba96') Changed state to Down: Cannot access storage file '/var/lib/vdsm/transient/d8930944-2305-4859-b616-03e135968ad9-534d83aa-7125-43ce-a440-58abd08a4ebf.CXRJka' (as uid:107, gid:107): No such file or directory (code=1) (vm:1677)


libvirtd.log:

2018-03-06 23:05:25.409+0000: 1498: error : virStorageFileReportBrokenChain:416 : Cannot access storage file '/var/lib/vdsm/transient/d8930944-2305-4859-b616-03e135968ad9-534d83aa-7125-43ce-a440-58abd08a4ebf.CXRJka' (as uid:107, gid:107): No such file or directory


Expected results:
VM start should succeed

Additional info:
logs

Comment 1 Idan Shaby 2018-03-07 13:20:49 UTC
It seems like the problem is that vdsm couldn't access this file because it couldn't create it in the first place:

2018-03-07 01:05:23,423+0200 DEBUG (vm/3a1ad46b) [storage.operation] /usr/bin/taskset --cpu-list 0-0 /usr/bin/nice -n 19 /usr/bin/ionice -c 3 /usr/bin/qemu-img create -f qcow2 -o compat=1.1 -b storage_local_ge6_volume_0/d8930944-2305-4859-b616-03e135968ad9/images/a9dc2e8a-d6ad-4c8a-aa3f-0132bab5b705/534d83aa-7125-43ce-a440-58abd08a4ebf -F qcow2 /var/lib/vdsm/transient/d8930944-2305-4859-b616-03e135968ad9-534d83aa-7125-43ce-a440-58abd08a4ebf.CXRJka (cwd /var/lib/vdsm/transient) (operation:150)
2018-03-07 01:05:23,706+0200 DEBUG (vm/3a1ad46b) [storage.operation] FAILED: <err> = "qemu-img: /var/lib/vdsm/transient/d8930944-2305-4859-b616-03e135968ad9-534d83aa-7125-43ce-a440-58abd08a4ebf.CXRJka: Could not open '/var/lib/vdsm/transient/storage_local_ge6_volume_0/d8930944-2305-4859-b616-03e135968ad9/images/a9dc2e8a-d6ad-4c8a-aa3f-0132bab5b705/534d83aa-7125-43ce-a440-58abd08a4ebf': No such file or directory\nCould not open backing image to determine size.\n"; <rc> = 1 (operation:169)
2018-03-07 01:05:23,707+0200 INFO  (vm/3a1ad46b) [virt.vm] (vmId='3a1ad46b-3854-4602-bd85-330c799eba96') Unlinking transient disk volume '/var/lib/vdsm/transient/d8930944-2305-4859-b616-03e135968ad9-534d83aa-7125-43ce-a440-58abd08a4ebf.CXRJka' (vm:3758)
2018-03-07 01:05:23,708+0200 ERROR (vm/3a1ad46b) [virt.vm] (vmId='3a1ad46b-3854-4602-bd85-330c799eba96') Failed to create the transient disk for volume 534d83aa-7125-43ce-a440-58abd08a4ebf (vm:3761)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 3755, in _createTransientDisk
    operation.run()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/operation.py", line 81, in run
    self._finalize(out, err)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/operation.py", line 178, in _finalize
    raise cmdutils.Error(self._cmd, rc, out, err)
Error: Command ['/usr/bin/taskset', '--cpu-list', '0-0', '/usr/bin/nice', '-n', '19', '/usr/bin/ionice', '-c', '3', '/usr/bin/qemu-img', 'create', '-f', 'qcow2', '-o', 'compat=1.1', '-b', u'storage_local_ge6_volume_0/d8930944-2305-4859-b616-03e135968ad9/images/a9dc2e8a-d6ad-4c8a-aa3f-0132bab5b705/534d83aa-7125-43ce-a440-58abd08a4ebf', '-F', 'qcow2', '/var/lib/vdsm/transient/d8930944-2305-4859-b616-03e135968ad9-534d83aa-7125-43ce-a440-58abd08a4ebf.CXRJka'] failed with rc=1 out='' err="qemu-img: /var/lib/vdsm/transient/d8930944-2305-4859-b616-03e135968ad9-534d83aa-7125-43ce-a440-58abd08a4ebf.CXRJka: Could not open '/var/lib/vdsm/transient/storage_local_ge6_volume_0/d8930944-2305-4859-b616-03e135968ad9/images/a9dc2e8a-d6ad-4c8a-aa3f-0132bab5b705/534d83aa-7125-43ce-a440-58abd08a4ebf': No such file or directory\nCould not open backing image to determine size.\n"

This may be related to the new qemu version in rhel 7.5.
Are you using rhel 7.5?
Also, when you tested NFS and iSCSI, did you use the same host (same qemu version)?

Comment 2 Elad 2018-03-07 14:01:49 UTC
Created attachment 1405356
vdsm.log-iscsi

(In reply to Idan Shaby from comment #1)
> 
> This may be related to the new qemu version in rhel 7.5.
> Are you using rhel 7.5?
Yes
> Also, when you tested NFS and iSCSI, did you use the same host (same qemu
> version)?



This was found in an automation run in which the test is executed on iSCSI, NFS and Gluster in sequence, using the same environment.
All hosts in the environment use the same qemu version: qemu-kvm-rhev-2.10.0-21.el7.x86_64


In the attached logs, here is the same case executed on iSCSI (attach snapshot disk to backup VM):

2018-03-07 00:57:25,404+02 INFO  [org.ovirt.engine.core.bll.storage.disk.AttachDiskToVmCommand] (default task-7) [diskattachments_create_f0ff61b3-1394] Running command: AttachDiskToVmCommand internal: false. Entities affected :  ID: 763f7609-1eac-413a-8aca-bf24eb004707 Type: VMAction group CONFIGURE_VM_STORAGE with role type USER,  ID: 69cc6dea-d017-4a21-97d7-d92763983812 Type: DiskAction group ATTACH_DISK with role type USER


vdsm.log (from another host, vdsm.log is attached):

2018-03-07 00:57:28,324+0200 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call VM.create succeeded in 0.05 seconds (__init__:573)


Gluster volume details:


Volume Name: storage_local_ge6_volume_0
Type: Replicate
Volume ID: 4a33039e-f131-4315-a74b-0d7ac7664f03
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster01.scl.lab.tlv.redhat.com:/gluster_volumes/storage_local_ge6_volume_0
Brick2: gluster02.scl.lab.tlv.redhat.com:/gluster_volumes/storage_local_ge6_volume_0
Brick3: gluster03.scl.lab.tlv.redhat.com:/gluster_volumes/storage_local_ge6_volume_0
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
storage.owner-gid: 36
storage.owner-uid: 36





Packages on hosts:

vdsm-yajsonrpc-4.20.20-1.el7ev.noarch
vdsm-http-4.20.20-1.el7ev.noarch
vdsm-network-4.20.20-1.el7ev.x86_64
vdsm-4.20.20-1.el7ev.x86_64
vdsm-hook-vfio-mdev-4.20.20-1.el7ev.noarch
vdsm-python-4.20.20-1.el7ev.noarch
vdsm-client-4.20.20-1.el7ev.noarch
vdsm-api-4.20.20-1.el7ev.noarch
vdsm-hook-ethtool-options-4.20.20-1.el7ev.noarch
vdsm-hook-openstacknet-4.20.20-1.el7ev.noarch
vdsm-common-4.20.20-1.el7ev.noarch
vdsm-hook-vmfex-dev-4.20.20-1.el7ev.noarch
vdsm-hook-fcoe-4.20.20-1.el7ev.noarch
vdsm-hook-vhostmd-4.20.20-1.el7ev.noarch
vdsm-jsonrpc-4.20.20-1.el7ev.noarch
ovirt-setup-lib-1.1.4-1.el7ev.noarch
cockpit-ovirt-dashboard-0.11.14-0.1.el7ev.noarch
ovirt-imageio-common-1.2.1-0.el7ev.noarch
ovirt-host-dependencies-4.2.2-1.el7ev.x86_64
ovirt-provider-ovn-driver-1.2.8-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-host-deploy-1.7.2-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.12-1.el7ev.noarch
ovirt-imageio-daemon-1.2.1-0.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
python-ovirt-engine-sdk4-4.2.4-1.el7ev.x86_64
ovirt-host-4.2.2-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.2.6-1.el7ev.noarch
libvirt-libs-3.9.0-13.el7.x86_64
libvirt-daemon-driver-secret-3.9.0-13.el7.x86_64
libvirt-daemon-driver-qemu-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-3.9.0-13.el7.x86_64
libvirt-client-3.9.0-13.el7.x86_64
libvirt-lock-sanlock-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-logical-3.9.0-13.el7.x86_64
libvirt-daemon-kvm-3.9.0-13.el7.x86_64
libvirt-daemon-driver-network-3.9.0-13.el7.x86_64
libvirt-daemon-config-nwfilter-3.9.0-13.el7.x86_64
libvirt-daemon-driver-interface-3.9.0-13.el7.x86_64
libvirt-daemon-driver-lxc-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-disk-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-mpath-3.9.0-13.el7.x86_64
libvirt-python-3.9.0-1.el7.x86_64
libvirt-daemon-3.9.0-13.el7.x86_64
libvirt-daemon-driver-nodedev-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-iscsi-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-scsi-3.9.0-13.el7.x86_64
libvirt-daemon-driver-nwfilter-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-rbd-3.9.0-13.el7.x86_64
libvirt-daemon-config-network-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-core-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-13.el7.x86_64
libvirt-3.9.0-13.el7.x86_64
libvirt-daemon-driver-qemu-3.9.0-13.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64
qemu-guest-agent-2.8.0-2.el7.x86_64
qemu-img-rhev-2.10.0-21.el7.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-kvm-common-rhev-2.10.0-21.el7.x86_64
glusterfs-libs-3.8.4-54.el7.x86_64
glusterfs-cli-3.8.4-54.el7.x86_64
glusterfs-client-xlators-3.8.4-54.el7.x86_64
glusterfs-fuse-3.8.4-54.el7.x86_64
glusterfs-3.8.4-54.el7.x86_64
glusterfs-api-3.8.4-54.el7.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-13.el7.x86_64
glusterfs-rdma-3.8.4-54.el7.x86_64
sanlock-python-3.6.0-1.el7.x86_64
libvirt-lock-sanlock-3.9.0-13.el7.x86_64
sanlock-3.6.0-1.el7.x86_64
sanlock-lib-3.6.0-1.el7.x86_64
libselinux-utils-2.5-12.el7.x86_64
selinux-policy-3.13.1-192.el7.noarch
selinux-policy-targeted-3.13.1-192.el7.noarch
libselinux-python-2.5-12.el7.x86_64
libselinux-2.5-12.el7.x86_64
kernel 3.10.0-858.el7.x86_64 
Red Hat Enterprise Linux Server 7.5 (Maipo)

Comment 3 Idan Shaby 2018-03-08 07:10:33 UTC
Thanks for the details, Elad.
I saw that you're using libgfapi. Can you please try to run the same test without it and see if it works?

Comment 4 Elad 2018-03-08 09:04:17 UTC
Created attachment 1405766
w/o libgfsapi

Does not reproduce when libgfapi is disabled.



2018-03-08 10:48:55,271+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-21) [diskattachments_create_66e05680-82c4] EVENT_ID: USER_ATTACH_DISK_TO_VM(2,016), Disk source_vm_TestCase6183_0810475513_Disk_0 was successfully attached to VM backup_vm_TestCase6183_0810483573 by admin@internal-authz.


2018-03-08 10:48:59,308+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-27) [vms_syncAction_ad7fb3ed-2ed0-4764] EVENT_ID: USER_STARTED_VM(153), VM backup_vm_TestCase6183_0810483573 was started by admin@internal-authz (Host: host_mixed_1).

2018-03-08 10:50:10,816+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [] VM 'ace84102-718f-43d0-88be-ab3e6751e0be'(backup_vm_TestCase6183_0810483573) moved from 'PoweringUp' --> 'Up'



2018-03-08 10:48:59,214+0200 INFO  (jsonrpc/6) [api.virt] START create

2018-03-08 10:49:02,490+0200 INFO  (vm/ace84102) [virt.vm] (vmId='ace84102-718f-43d0-88be-ab3e6751e0be') <?xml version='1.0' encoding='utf-8'?>




RHEL 7.5
rhvm-4.2.2.2-0.1.el7.noarch
vdsm-4.20.20-1.el7ev.x86_64
libvirt-3.9.0-13.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64


[root@storage-ge-04 ~]# engine-config -a | grep -i libgfapi
LibgfApiSupported: false version: 3.6
LibgfApiSupported: false version: 4.0
LibgfApiSupported: false version: 4.1
LibgfApiSupported: false version: 4.2

Comment 5 Allon Mureinik 2018-03-08 09:13:27 UTC
As per comment 4, pushing this out to 4.3, since we currently don't use libgfapi.

Comment 6 Idan Shaby 2018-03-08 11:31:24 UTC
For later inspection - what we should basically do is create the transient disk (with qemu-img create) using the -u option (unsafe) to avoid checking if the backing file exists when using libgfapi.
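
A minimal sketch (not actual vdsm code) of what the adjusted invocation could look like; the function and parameter names are illustrative, and the virtual size is assumed to be available from the volume metadata, since with -u qemu-img does not open the backing image to determine it:

import subprocess

def create_transient_disk(transient_path, backing_path, backing_format, virtual_size):
    # -u ("unsafe") skips opening/validating the backing file, so the size
    # must be passed explicitly instead of being probed from the backing image.
    cmd = [
        '/usr/bin/qemu-img', 'create',
        '-f', 'qcow2',
        '-o', 'compat=1.1',
        '-u',
        '-b', backing_path,
        '-F', backing_format,
        transient_path,
        str(virtual_size),   # virtual size in bytes
    ]
    subprocess.check_call(cmd)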

Comment 7 Allon Mureinik 2018-03-08 13:27:38 UTC
(In reply to Idan Shaby from comment #6)
> For later inspection - what we should basically do is create the transient
> disk (with qemu-img create) using the -u option (unsafe) to avoid checking
> if the backing file exists when using libgfapi.

This sounds necessary, but probably not sufficient; I wonder whether the snapshot is usable by the VM once you get past the qemu-img create blocker.

Comment 9 Tal Nisan 2019-03-01 03:50:37 UTC
We currently don't use libgfapi in 4.3; pushing out to 4.4.

Comment 12 Lukas Svaty 2020-03-24 12:25:34 UTC
This does not seem like a regression to me if the support was never introduced. Removing the Regression flag to unblock 4.4 Beta. Feel free to re-add it if you believe this is indeed a regression.

Comment 13 Eyal Shenitzky 2021-08-23 12:29:49 UTC
This bug/RFE is more than 2 years old, has not received enough attention so far, and is now flagged as pending close.
Please review whether it is still relevant and provide additional details/justification/patches if you believe it should get more attention for the next oVirt release.

Comment 14 Guillaume Pavese 2021-08-26 20:35:41 UTC
It would be nice to have libgfapi in oVirt; this bug was one of the blockers, if I remember correctly.
Users on the users mailing list have reported seeing a huge I/O performance increase with libgfapi over FUSE.

Comment 15 Michal Skrivanek 2021-09-29 10:21:03 UTC
Unfortunately, libgfapi is not really planned for the foreseeable future. This bug is 3.5 years old and stale, so I am closing it.

If you feel libgfapi should still be added, please bring it up as a feature, with a plan to contribute the code and testing. Unfortunately it won't happen otherwise, as the oVirt development team does not plan to add it.

