Bug 1388560 - I/O Errors seen while accessing VM images on gluster volumes using libgfapi
Summary: I/O Errors seen while accessing VM images on gluster volumes using libgfapi
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: libgfapi
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: rjoseph
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On: 1390521
Blocks: Gluster-HC-2 1351528
 
Reported: 2016-10-25 16:16 UTC by SATHEESARAN
Modified: 2017-03-23 06:14 UTC
CC List: 8 users

Fixed In Version: glusterfs-3.8.4-5
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
RHEL 7.2
Last Closed: 2017-03-23 06:14:47 UTC


Attachments
QEMU logs from the RHEL 7.2 hypervisor (92.27 KB, text/plain)
2016-10-25 16:29 UTC, SATHEESARAN


Links
System ID: Red Hat Product Errata RHSA-2017:0486
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update
Last Updated: 2017-03-23 09:18:45 UTC

Description SATHEESARAN 2016-10-25 16:16:59 UTC
Description of problem:
-----------------------
When VMs are created with QEMU's native GlusterFS driver (which uses libgfapi), I/O errors are seen.
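
For context, the image is accessed directly over libgfapi through a gluster:// URI rather than a FUSE mount. A minimal sketch of probing such an image (host, volume, and image names are taken from the QEMU log below and are only illustrative):

# Probe the VM image over libgfapi with qemu-img (no FUSE mount involved)
qemu-img info gluster://dhcp37-172.lab.eng.blr.redhat.com:24007/rep3vol/vm3.img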

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHEL 7.2
RHGS 3.2.0 interim build (glusterfs-3.8.4-3.el7rhgs)
libvirt-1.2.17-13.el7_2.5.x86_64
qemu-img-1.5.3-105.el7_2.7.x86_64
qemu-kvm-1.5.3-105.el7_2.7.x86_64
qemu-kvm-common-1.5.3-105.el7_2.7.x86_64

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Create a replica 3 volume and optimize the volume for storing VM images
2. Create a VM image on the volume
3. Create a VM to use that image file and start the VM
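
For reference, a rough command-line equivalent of the steps above (a sketch only; host names and image size are illustrative, and the brick paths match the logs in comment 2):

# 1. Create a replica 3 volume and apply the virt option group
#    (the usual way to optimize a volume for storing VM images)
gluster volume create rep3vol replica 3 \
    host1:/gluster/brick1/b1 host2:/gluster/brick1/b1 host3:/gluster/brick1/b1
gluster volume set rep3vol group virt
gluster volume start rep3vol

# 2. Create a VM image on the volume through libgfapi
qemu-img create -f raw gluster://host1:24007/rep3vol/vm3.img 20G

# 3. Define a VM that uses this image as a network disk (protocol 'gluster')
#    and start it, e.g. via libvirt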

Actual results:
---------------
Unable to install an OS on the VM; I/O errors are observed.

Expected results:
-----------------
No I/O errors.

Comment 2 SATHEESARAN 2016-10-25 16:23:28 UTC
Error messages as reported in the QEMU log: /var/log/libvirt/qemu/vm2.log

2016-10-25 15:03:55.357+0000: starting up libvirt version: 1.2.17, package: 13.el7_2.5 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2016-05-26-07:48:46, x86-020.build.eng.bos.redhat.com), qemu version: 
1.5.3 (qemu-kvm-1.5.3-105.el7_2.7)
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=spice /usr/libexec/qemu-kvm -name vm2 -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -cpu SandyBridge -m 4096 -realtime mlock=off
 -smp 2,sockets=2,cores=1,threads=1 -uuid abf54c1a-e1be-4c8e-a3ef-fd04191395ba -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-vm2/monitor.sock,server,nowait -mon cha
rdev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot order=c,menu=
on,strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x6.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x6 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci
.0,addr=0x6.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x6.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=gluster://dhcp37-172.lab.eng.blr.redhat.com:24
007/rep3vol/vm3.img,if=none,id=drive-virtio-disk0,format=raw -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0 -netdev tap,fd=24,id=hostnet0 -device rtl8139,netdev=hostn
et0,id=net0,mac=52:54:00:ac:13:40,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -spice port=5900,addr=127.0.0.1,disable-ticketing,image-compression=off,seamless-migration=on -vga qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=67108864 -global qxl-vga.vgamem_mb=16 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
char device redirected to /dev/pts/1 (label charserial0)
[2016-10-25 15:03:56.483026] I [MSGID: 104045] [glfs-master.c:91:notify] 0-gfapi: New graph 7268732d-636c-6965-6e74-31302e6c6162 (0) coming up
[2016-10-25 15:03:56.483076] I [MSGID: 114020] [client.c:2356:notify] 0-rep3vol-client-0: parent translators are ready, attempting connect on transport
[2016-10-25 15:03:56.487260] I [MSGID: 114020] [client.c:2356:notify] 0-rep3vol-client-1: parent translators are ready, attempting connect on transport
[2016-10-25 15:03:56.489211] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-rep3vol-client-0: changing port to 49152 (from 0)
[2016-10-25 15:03:56.490961] I [MSGID: 114020] [client.c:2356:notify] 0-rep3vol-client-2: parent translators are ready, attempting connect on transport
[2016-10-25 15:03:56.495305] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-rep3vol-client-1: changing port to 49152 (from 0)
[2016-10-25 15:03:56.496954] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 0-rep3vol-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-10-25 15:03:56.498682] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-rep3vol-client-2: changing port to 49152 (from 0)
[2016-10-25 15:03:56.500015] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 0-rep3vol-client-0: Connected to rep3vol-client-0, attached to remote volume '/gluster/brick1/b1'.
[2016-10-25 15:03:56.500040] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 0-rep3vol-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2016-10-25 15:03:56.500112] I [MSGID: 108005] [afr-common.c:4430:afr_notify] 0-rep3vol-replicate-0: Subvolume 'rep3vol-client-0' came back up; going online.
[2016-10-25 15:03:56.500882] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-rep3vol-client-0: Server lk version = 1
[2016-10-25 15:03:56.501540] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 0-rep3vol-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-10-25 15:03:56.503869] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 0-rep3vol-client-1: Connected to rep3vol-client-1, attached to remote volume '/gluster/brick1/b1'.
[2016-10-25 15:03:56.503889] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 0-rep3vol-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2016-10-25 15:03:56.504367] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 0-rep3vol-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-10-25 15:03:56.504522] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-rep3vol-client-1: Server lk version = 1
[2016-10-25 15:03:56.506745] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 0-rep3vol-client-2: Connected to rep3vol-client-2, attached to remote volume '/gluster/brick1/b1'.
[2016-10-25 15:03:56.506767] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 0-rep3vol-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2016-10-25 15:03:56.525522] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-rep3vol-client-2: Server lk version = 1
[2016-10-25 15:03:56.529824] I [MSGID: 104041] [glfs-resolve.c:885:__glfs_active_subvol] 0-rep3vol: switched to graph 7268732d-636c-6965-6e74-31302e6c6162 (0)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
main_channel_link: add main channel client
main_channel_handle_parsed: net test: latency 537.039000 ms, bitrate 372552 bps (0.355293 Mbps) LOW BANDWIDTH
red_dispatcher_set_cursor_peer: 
inputs_connect: inputs channel client create
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)

Comment 3 SATHEESARAN 2016-10-25 16:29:01 UTC
Created attachment 1213992 [details]
QEMU logs from the RHEL 7.2 hypervisor

Comment 4 SATHEESARAN 2016-10-25 16:39:47 UTC
I disabled compound-fops and client-io-threads and then tested the same scenario. I still see the same problem - I/O errors.
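
For reference, those options can be toggled with volume set. The exact option names below are an assumption (the usual 3.8 names), not a transcript of what was run:

# Assumed option names for the two features mentioned above
gluster volume set rep3vol cluster.use-compound-fops off
gluster volume set rep3vol performance.client-io-threads off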

Comment 5 Michael Adam 2016-11-01 06:46:15 UTC
> [...] 0-rep3vol-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> [...] 0-rep3vol-client-0: Server and Client lk-version numbers are not same, reopening the fds

Is it possible that the client version in the test setup is simply too old?

Comment 6 Michael Adam 2016-11-01 07:11:33 UTC
(In reply to Michael Adam from comment #5)
> > [...] 0-rep3vol-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> > [...] 0-rep3vol-client-0: Server and Client lk-version numbers are not same, reopening the fds
> 
> Is it possible that the client version in the test setup is simply too old?

OK, I confused the client version with the client LK version...
And that is only informational.

But still, the client seems very old.
Would it be possible that there are some incompatible changes between 3.3 and 3.8?

Comment 7 Daryl Lee 2016-11-02 21:29:24 UTC
I'm seeing almost the same issue when trying to connect a VM to a QCOW2 drive. This seems to be limited to the glusterfs 3.8.5-1.el7 package; 3.8.4-1.el7 works fine when I downgrade back to it. I am using a replica 3 arbiter volume. I've been seeing those informational client version and client LK version alerts since I started working with Gluster.
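
A rough sketch of the downgrade that works around it (assuming the 3.8.4-1.el7 packages are still available in the configured repositories):

# Roll the gluster client stack back to the last known-good version
yum downgrade glusterfs-3.8.4-1.el7 glusterfs-api-3.8.4-1.el7 \
    glusterfs-client-xlators-3.8.4-1.el7 glusterfs-fuse-3.8.4-1.el7 \
    glusterfs-libs-3.8.4-1.el7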

GlusterFS Installed when it works:
glusterfs.x86_64                                                                3.8.4-1.el7  
glusterfs-api.x86_64                                                            3.8.4-1.el7 
glusterfs-client-xlators.x86_64                                                 3.8.4-1.el7 
glusterfs-fuse.x86_64                                                           3.8.4-1.el7
glusterfs-libs.x86_64          

GlusterFS  Installed when it doesn't work:
glusterfs                                   x86_64                    3.8.5-1.el7 
glusterfs-api                               x86_64                    3.8.5-1.el7
glusterfs-client-xlators                    x86_64                    3.8.5-1.el7
glusterfs-fuse                              x86_64                    3.8.5-1.el7
glusterfs-libs                              x86_64                    3.8.5-1.el7

Error:
----------------------------------------------------------------------------------
2016-11-02T17:28:51.970295Z qemu-kvm: -drive file=gluster://gluster1:24007/opennebula/361d9f69c43ca458f037b8afb23eed5a,if=none,id=drive-ide0-1-0,format=qcow2,cache=none: could not open disk image gluster://gluster1:24007/opennebula/361d9f69c43ca458f037b8afb23eed5a: Could not read L1 table: Input/output error
2016-11-02 17:28:51.996+0000: shutting down

Comment 8 SATHEESARAN 2016-11-04 06:28:07 UTC
(In reply to Daryl Lee from comment #7)
> I'm seeing almost the same issue when trying to connect to a QCOW2 drive
> with a VM.  This seems to be limited to glusterfs package 3.8.5-1.el7.  
> 3.8.4-1.el7 works fine when I downgrade back to it. I am using a replica 3
> arbiter volume.   I've been seeing those informational client version and
> client LK version alerts since I started working with gluster.  
> 
> GlusterFS Installed when it works:
> glusterfs.x86_64                                                            
> 3.8.4-1.el7  
> glusterfs-api.x86_64                                                        
> 3.8.4-1.el7 
> glusterfs-client-xlators.x86_64                                             
> 3.8.4-1.el7 
> glusterfs-fuse.x86_64                                                       
> 3.8.4-1.el7
> glusterfs-libs.x86_64          
> 
> GlusterFS  Installed when it doesn't work:
> glusterfs                                   x86_64                   
> 3.8.5-1.el7 
> glusterfs-api                               x86_64                   
> 3.8.5-1.el7
> glusterfs-client-xlators                    x86_64                   
> 3.8.5-1.el7
> glusterfs-fuse                              x86_64                   
> 3.8.5-1.el7
> glusterfs-libs                              x86_64                   
> 3.8.5-1.el7
> 
> Error:
> -----------------------------------------------------------------------------
> -----
> 2016-11-02T17:28:51.970295Z qemu-kvm: -drive
> file=gluster://gluster1:24007/opennebula/361d9f69c43ca458f037b8afb23eed5a,
> if=none,id=drive-ide0-1-0,format=qcow2,cache=none: could not open disk image
> gluster://gluster1:24007/opennebula/361d9f69c43ca458f037b8afb23eed5a: Could
> not read L1 table: Input/output error
> 2016-11-02 17:28:51.996+0000: shutting down

Hi Daryl,

This issue is reported against the product 'Red Hat Gluster Storage', which is the downstream version of the GlusterFS project.

The same issue has been reported[1] against the upstream product, 'GlusterFS':
[1] - https://bugzilla.redhat.com/show_bug.cgi?id=1390521

Please follow that bug for the resolution of this issue.

Comment 9 SATHEESARAN 2016-11-04 06:31:36 UTC
(In reply to Michael Adam from comment #6)
> 
> But still, the client seems very old.
> Would it be possible that there are some incompatible changes between 3.3
> and 3.8?

No, I am using the latest client binaries on RHEL 7.3.

I am not seeing this issue with the previous interim build, glusterfs-3.8.4-2.el7rhgs. This is a regression in the glusterfs-3.8.4-3.el7rhgs build.

Comment 11 rjoseph 2016-11-08 05:55:43 UTC
From the initial investigation, this issue looks similar to BZ 1391086. The fix for that bug is already merged downstream. Can we retest this with the latest build?

Comment 12 SATHEESARAN 2016-11-08 06:23:48 UTC
(In reply to rjoseph from comment #11)
> From the initial investigation, this issue looks similar to BZ 1391086. The
> fix for that bug is already merged downstream. Can we retest this with the
> latest build?

I don't see any new downstream build after the interim build glusterfs-3.8.4-3.el7rhgs. I will check this issue with the next downstream build.

Comment 13 SATHEESARAN 2016-11-14 08:18:12 UTC
All,

I have tested with the latest downstream RHGS 3.2.0 interim build - glusterfs-3.8.4-5.el7rhgs - on RHEL 7.3.

Version of the other components:
--------------------------------
qemu-kvm-rhev-2.6.0-27.el7.x86_64
libvirt-daemon-driver-qemu-2.0.0-10.el7.x86_64
ipxe-roms-qemu-20160127-5.git6366fa7a.el7.noarch
qemu-img-rhev-2.6.0-27.el7.x86_64
qemu-kvm-tools-rhev-2.6.0-27.el7.x86_64
qemu-kvm-common-rhev-2.6.0-27.el7.x86_64
qemu-guest-agent-2.5.0-3.el7.x86_64
libvirt-daemon-driver-qemu-2.0.0-10.el7.x86_64
libvirt-python-2.0.0-2.el7.x86_64
libvirt-daemon-driver-network-2.0.0-10.el7.x86_64
libvirt-daemon-driver-nodedev-2.0.0-10.el7.x86_64
libvirt-daemon-driver-nwfilter-2.0.0-10.el7.x86_64
libvirt-daemon-config-network-2.0.0-10.el7.x86_64
libvirt-daemon-driver-secret-2.0.0-10.el7.x86_64
libvirt-client-2.0.0-10.el7.x86_64
libvirt-daemon-driver-storage-2.0.0-10.el7.x86_64
libvirt-daemon-driver-lxc-2.0.0-10.el7.x86_64
libvirt-2.0.0-10.el7.x86_64
libvirt-daemon-2.0.0-10.el7.x86_64
libvirt-daemon-driver-interface-2.0.0-10.el7.x86_64
libvirt-daemon-config-nwfilter-2.0.0-10.el7.x86_64


I am not seeing this issue anymore.

Please provide the patch URL for the fix and move the bug to ON_QA with the proper Fixed In Version, so that this bug can be VERIFIED.

Comment 14 rjoseph 2016-11-14 08:36:22 UTC
The fix for BZ 1391093 also fixes this issue. The following is the corresponding downstream patch:
https://code.engineering.redhat.com/gerrit/89229

Therefore moving the bug to ON_QA.

Comment 20 SATHEESARAN 2016-11-14 10:03:03 UTC
Tested with the RHGS 3.2.0 interim build (glusterfs-3.8.4-5.el7rhgs) installed on RHEL 7.3, with the following components:
qemu-img-1.5.3-126.el7.x86_64
ipxe-roms-qemu-20160127-5.git6366fa7a.el7.noarch
qemu-kvm-common-1.5.3-126.el7.x86_64
libvirt-daemon-driver-qemu-2.0.0-10.el7.x86_64
qemu-kvm-1.5.3-126.el7.x86_64
libvirt-daemon-config-network-2.0.0-10.el7.x86_64
libvirt-python-2.0.0-2.el7.x86_64
libvirt-daemon-driver-storage-2.0.0-10.el7.x86_64
libvirt-daemon-2.0.0-10.el7.x86_64
libvirt-daemon-driver-qemu-2.0.0-10.el7.x86_64
libvirt-daemon-driver-interface-2.0.0-10.el7.x86_64
libvirt-daemon-driver-network-2.0.0-10.el7.x86_64
libvirt-daemon-config-nwfilter-2.0.0-10.el7.x86_64
libvirt-daemon-driver-nodedev-2.0.0-10.el7.x86_64
libvirt-2.0.0-10.el7.x86_64
libvirt-daemon-driver-nwfilter-2.0.0-10.el7.x86_64
libvirt-daemon-driver-secret-2.0.0-10.el7.x86_64
libvirt-client-2.0.0-10.el7.x86_64
libvirt-daemon-driver-lxc-2.0.0-10.el7.x86_64


VMs could access their disks through gfapi without any I/O errors.
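
A quick way to double-check this from the hypervisor (a sketch; the VM name and log path are the ones used earlier in this bug):

# Expect no matches in the QEMU log of the freshly installed VM
grep -c 'Input/output error' /var/log/libvirt/qemu/vm2.log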

Comment 22 errata-xmlrpc 2017-03-23 06:14:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

