Bug 1855321 - VM on gluster unable to start after snapshot (Could not read L1 table: Input/output error)
Summary: VM on gluster unable to start after snapshot (Could not read L1 table: Input/output error)
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Gluster
Version: 4.3.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Kaustav Majumder
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-09 14:41 UTC by Timo Eissler
Modified: 2020-09-07 07:42 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-07 07:42:04 UTC
oVirt Team: Gluster
Embargoed:


Attachments: none

Description Timo Eissler 2020-07-09 14:41:30 UTC
Description of problem:

The VM cannot be started after a snapshot was created of a disk that resides on a gluster volume with libgfapi enabled.


Version-Release number of selected component (if applicable):

oVirt 4.3.10.4-1.el7

# rpm -qa gluster*
gluster-ansible-cluster-1.0.0-1.el7.noarch
glusterfs-events-6.9-1.el7.x86_64
gluster-ansible-maintenance-1.0.1-2.el7.noarch
glusterfs-client-xlators-6.9-1.el7.x86_64
glusterfs-6.9-1.el7.x86_64
glusterfs-cli-6.9-1.el7.x86_64
glusterfs-api-6.9-1.el7.x86_64
glusterfs-fuse-6.9-1.el7.x86_64
glusterfs-server-6.9-1.el7.x86_64
gluster-ansible-repositories-1.0.1-2.el7.noarch
gluster-ansible-features-1.0.5-6.el7.noarch
glusterfs-geo-replication-6.9-1.el7.x86_64
glusterfs-libs-6.9-1.el7.x86_64
gluster-ansible-infra-1.0.4-8.el7.noarch
glusterfs-rdma-6.9-1.el7.x86_64
gluster-ansible-roles-1.0.5-10.el7.noarch


How reproducible:

Always


Steps to Reproduce:
1. Create a VM on a libgfapi enabled gluster volume
2. Stop VM
3. Create a disk snapshot
4. Try to start VM
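
To confirm the failure mode before starting the VM, the backing-file reference of the new snapshot overlay can be inspected directly on the storage domain. A minimal sketch, assuming the domain is also mounted via FUSE on a host; the mount path and UUIDs below are illustrative placeholders, not taken from this environment:

# qemu-img info /rhev/data-center/mnt/glusterSD/<server>:_vmstore/<sd_uuid>/images/<image_uuid>/<overlay_uuid>

The field of interest in the output is the "backing file" line.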

Actual results:
VM fails to start.


Expected results:
VM starts normally.

Additional info:

vdsm.log:

2020-07-09 15:02:08,971+0200 ERROR (vm/fdca3bae) [virt.vm] (vmId='fdca3bae-28c4-40da-b73b-eabf3cacd1fc') The vm start process failed (vm:934)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 868, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2895, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1110, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: qemu unexpectedly closed the monitor: [2020-07-09 13:02:06.221697] E [MSGID: 108006] [afr-common.c:5325:__afr_handle_child_down_event] 0-vmstore-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2020-07-09 13:02:06.222023] E [MSGID: 108006] [afr-common.c:5325:__afr_handle_child_down_event] 0-vmstore-replicate-1: All subvolumes are down. Going offline until at least one of them comes back up.
[2020-07-09 13:02:06.222316] I [io-stats.c:4027:fini] 0-vmstore: io-stats translator unloaded
2020-07-09T13:02:08.000577Z qemu-kvm: -drive file=gluster://192.168.0.21:24007/vmstore/d47ccd1a-9750-4012-938e-3d93e81ad9e4/images/c267bd3d-ecc8-40d1-9bbb-31597c253815/86de26db-3d15-425e-ab9f-aae397b01464,file.debug=4,format=qcow2,if=none,id=drive-ua-c267bd3d-ecc8-40d1-9bbb-31597c253815,serial=c267bd3d-ecc8-40d1-9bbb-31597c253815,werror=stop,rerror=stop,cache=none,aio=native: Could not read L1 table: Input/output error
2020-07-09 15:02:08,972+0200 INFO  (vm/fdca3bae) [virt.vm] (vmId='fdca3bae-28c4-40da-b73b-eabf3cacd1fc') Changed state to Down: internal error: qemu unexpectedly closed the monitor: [2020-07-09 13:02:06.221697] E [MSGID: 108006] [afr-common.c:5325:__afr_handle_child_down_event] 0-vmstore-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2020-07-09 13:02:06.222023] E [MSGID: 108006] [afr-common.c:5325:__afr_handle_child_down_event] 0-vmstore-replicate-1: All subvolumes are down. Going offline until at least one of them comes back up.
[2020-07-09 13:02:06.222316] I [io-stats.c:4027:fini] 0-vmstore: io-stats translator unloaded
2020-07-09T13:02:08.000577Z qemu-kvm: -drive file=gluster://192.168.0.21:24007/vmstore/d47ccd1a-9750-4012-938e-3d93e81ad9e4/images/c267bd3d-ecc8-40d1-9bbb-31597c253815/86de26db-3d15-425e-ab9f-aae397b01464,file.debug=4,format=qcow2,if=none,id=drive-ua-c267bd3d-ecc8-40d1-9bbb-31597c253815,serial=c267bd3d-ecc8-40d1-9bbb-31597c253815,werror=stop,rerror=stop,cache=none,aio=native: Could not read L1 table: Input/output error (code=1) (vm:1702)
2020-07-09 15:02:08,978+0200 INFO  (vm/fdca3bae) [virt.vm] (vmId='fdca3bae-28c4-40da-b73b-eabf3cacd1fc') Stopping connection (guestagent:455)
2020-07-09 15:02:08,984+0200 INFO  (jsonrpc/7) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:10.10.0.20,60536, vmId=fdca3bae-28c4-40da-b73b-eabf3cacd1fc (api:48)
2020-07-09 15:02:08,984+0200 INFO  (jsonrpc/7) [virt.vm] (vmId='fdca3bae-28c4-40da-b73b-eabf3cacd1fc') Release VM resources (vm:5186)
2020-07-09 15:02:08,984+0200 WARN  (jsonrpc/7) [virt.vm] (vmId='fdca3bae-28c4-40da-b73b-eabf3cacd1fc') trying to set state to Powering down when already Down (vm:626)
2020-07-09 15:02:08,984+0200 INFO  (jsonrpc/7) [virt.vm] (vmId='fdca3bae-28c4-40da-b73b-eabf3cacd1fc') Stopping connection (guestagent:455)
2020-07-09 15:02:08,985+0200 INFO  (jsonrpc/7) [vdsm.api] START teardownImage(sdUUID='d47ccd1a-9750-4012-938e-3d93e81ad9e4', spUUID='cbd35eda-9ab9-11ea-a226-00163e399c8f', imgUUID='c267bd3d-ecc8-40d1-9bbb-31597c253815', volUUID=None) from=::ffff:10.10.0.20,60536, task_id=5913b3e6-b5b7-44c6-b34c-b5744eb133cb (api:48)
2020-07-09 15:02:08,986+0200 INFO  (jsonrpc/7) [storage.StorageDomain] Removing image rundir link u'/var/run/vdsm/storage/d47ccd1a-9750-4012-938e-3d93e81ad9e4/c267bd3d-ecc8-40d1-9bbb-31597c253815' (fileSD:592)
2020-07-09 15:02:08,986+0200 INFO  (jsonrpc/7) [vdsm.api] FINISH teardownImage return=None from=::ffff:10.10.0.20,60536, task_id=5913b3e6-b5b7-44c6-b34c-b5744eb133cb (api:54)
2020-07-09 15:02:08,986+0200 INFO  (jsonrpc/7) [virt.vm] (vmId='fdca3bae-28c4-40da-b73b-eabf3cacd1fc') Stopping connection (guestagent:455)
2020-07-09 15:02:08,986+0200 WARN  (jsonrpc/7) [root] File: /var/lib/libvirt/qemu/channels/fdca3bae-28c4-40da-b73b-eabf3cacd1fc.ovirt-guest-agent.0 already removed (fileutils:54)
2020-07-09 15:02:08,987+0200 WARN  (jsonrpc/7) [root] File: /var/lib/libvirt/qemu/channels/fdca3bae-28c4-40da-b73b-eabf3cacd1fc.org.qemu.guest_agent.0 already removed (fileutils:54)
2020-07-09 15:02:08,987+0200 INFO  (jsonrpc/7) [vdsm.api] START inappropriateDevices(thiefId='fdca3bae-28c4-40da-b73b-eabf3cacd1fc') from=::ffff:10.10.0.20,60536, task_id=fb813d21-6307-419e-9b1e-40784ccfc4af (api:48)

Comment 1 Timo Eissler 2020-07-10 09:47:55 UTC
This seems to be the same problem: https://www.mail-archive.com/users@ovirt.org/msg59460.html

Comment 2 andreas.mack@konsec.com 2020-07-15 07:18:05 UTC
Same problem, on a 4.3 installation.

The culprit seems to be that the new snapshot image is created with its backing file recorded as a bare filename rather than the complete gluster:// URL:

This is the image created by oVirt via the web UI (renamed to "oldsnap" for readability):
> qemu-img info oldsnap 
image: oldsnap
file format: qcow2
virtual size: 20G (21474836480 bytes)
disk size: 193K
cluster_size: 65536
backing file: d06c3e2b-5895-414d-902f-cba7fe44252d
backing file format: raw
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

This one was created by hand as a workaround; with it, the snapshotted VM can be started:

> qemu-img info 41c1120b-aaf1-4eec-aa85-e800f3480241
image: 41c1120b-aaf1-4eec-aa85-e800f3480241
file format: qcow2
virtual size: 20G (21474836480 bytes)
disk size: 124M
cluster_size: 65536
backing file: gluster://192.168.1.11:24007/ssd-01/d97c648c-d26a-4874-8715-d1ba7b4f5c74/images/44831548-c3c6-4391-b3f3-db22c83b05d1/d06c3e2b-5895-414d-902f-cba7fe44252d
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
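
For reference, one way to repair an existing overlay in place is an unsafe rebase that only rewrites the backing-file reference. This is a sketch under the assumption that qemu-img rebase was used; the paths are taken from the output above, and the exact command used to produce the working image is not confirmed:

> qemu-img rebase -u -b gluster://192.168.1.11:24007/ssd-01/d97c648c-d26a-4874-8715-d1ba7b4f5c74/images/44831548-c3c6-4391-b3f3-db22c83b05d1/d06c3e2b-5895-414d-902f-cba7fe44252d 41c1120b-aaf1-4eec-aa85-e800f3480241

The -u flag only rewrites the backing-file string in the qcow2 header and does not copy or merge any data, so it should be run while the VM is down.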

Comment 3 RHEL Program Management 2020-08-25 11:38:51 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 4 Gobinda Das 2020-09-07 07:42:04 UTC
libgfapi doesn't provide significant value over the non-libgfapi (glusterfs-fuse) access path and has some limitations. Could you please try with glusterfs-fuse enabled?
I am closing this for now; please reopen if you hit the issue again.
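
For completeness, a sketch of how libgfapi can be turned off so that VMs go back to accessing disks through the glusterfs-fuse mount (assuming the standard LibgfApiSupported engine-config key; adjust --cver to the cluster compatibility level):

# engine-config -s LibgfApiSupported=false --cver=4.3
# systemctl restart ovirt-engine

Running VMs are expected to pick up the change only after being powered off and started again.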

