Description of problem:

While running a fio workload on the VMs created in an RHHI setup, the fio job reports the errors below in the json.out file.

json.out file:
=======================
hostname=dhcp35-71.lab.eng.blr.redhat.com, be=0, 64-bit, os=Linux, arch=x86-64, fio=fio-2.1.7, flags=1
hostname=dhcp35-168.lab.eng.blr.redhat.com, be=0, 64-bit, os=Linux, arch=x86-64, fio=fio-2.1.7, flags=1
hostname=dhcp35-141.lab.eng.blr.redhat.com, be=0, 64-bit, os=Linux, arch=x86-64, fio=fio-2.1.7, flags=1
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> {
<dhcp35-168.lab.eng.blr.redhat.com>   "fio version" : "fio-2.1.7",
<dhcp35-168.lab.eng.blr.redhat.com>   "jobs" : [
<dhcp35-168.lab.eng.blr.redhat.com>
<dhcp35-168.lab.eng.blr.redhat.com>   ]
<dhcp35-168.lab.eng.blr.redhat.com> }

Run status group 0 (all jobs):
client <10.70.35.168>: exited with error 1
<dhcp35-71.lab.eng.blr.redhat.com> fio: pid=0, err=5/file:filesetup.c:174, func=fsync, error=Input/output error
<dhcp35-71.lab.eng.blr.redhat.com> {
<dhcp35-71.lab.eng.blr.redhat.com>   "fio version" : "fio-2.1.7",
<dhcp35-71.lab.eng.blr.redhat.com>   "jobs" : [
<dhcp35-71.lab.eng.blr.redhat.com>
<dhcp35-71.lab.eng.blr.redhat.com>   ]
<dhcp35-71.lab.eng.blr.redhat.com> }

Run status group 0 (all jobs):
client <10.70.35.71>: exited with error 1
<dhcp35-141.lab.eng.blr.redhat.com> fio: pid=0, err=5/file:filesetup.c:174, func=fsync, error=Input/output error
<dhcp35-141.lab.eng.blr.redhat.com> {
<dhcp35-141.lab.eng.blr.redhat.com>   "fio version" : "fio-2.1.7",
<dhcp35-141.lab.eng.blr.redhat.com>   "jobs" : [
<dhcp35-141.lab.eng.blr.redhat.com>
<dhcp35-141.lab.eng.blr.redhat.com>   ]
<dhcp35-141.lab.eng.blr.redhat.com> }

Run status group 0 (all jobs):
client <10.70.35.141>: exited with error 1
{
  "fio version" : "fio-2.1.7",
  "client_stats" : [
    {
      "jobname" : "workload",
      "groupid" : 0,

Attaching the error which is seen on the VM console during the fio run.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-45.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create an RHHI setup with the latest glusterfs & RHV bits.
2. Create 3 app VMs.
3. Start running fio.

Actual results:
Running fio fails with fsync, error=Input/output error.

Expected results:
fsync should run successfully.

Additional info:
A screenshot of the VM console when fio runs is captured. Attaching the json.out file and all the sosreports.
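For reference, the "err=5" in the fio output above is the raw errno value; a quick Python check (purely illustrative, not part of the original report) confirms that errno 5 is EIO and carries exactly the "Input/output error" message fio prints:

```python
import errno
import os

# fio reports err=5; on Linux, errno 5 is EIO
assert errno.EIO == 5

# The human-readable message matches what fio prints in json.out
print(os.strerror(errno.EIO))  # -> Input/output error
```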
Created attachment 1332346 [details] screenshot for the errors shown on vm console
sosreports have been copied to the location below: ====================================================== http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1497156/
Are you using gfapi access for VM disks?
Sahina, I was seeing this issue with FUSE access.
We get the same results while testing with bonnie++ on a VM with VirtIO-SCSI running in a HyperConverged environment. If the disk type is changed to the older VirtIO backend, this behavior disappears.
Kasturi,

Could you enable strict-write-ordering and check if you see this issue?

# gluster volume set <VOL> strict-write-ordering on

-Krutika
(In reply to Evgheni Dereveanchin from comment #6)
> We get the same results while testing with bonnie++ on a VM with VirtIO-SCSI
> running in a HyperConverged environment. If disk type is changed to the
> older VirtIO backend, this behavior disappears.

Hi Evgheni,

Which version of glusterfs did you see this issue with?

-Krutika
Hi Krutika, I got this behavior on CentOS7 with glusterfs-3.8.15-2.el7.x86_64, which is included in the latest stable Node NG image of oVirt 4.1.6.

As this is 4.1, it wasn't using libgfapi - after enabling it the problem went away, so it's visible only when qemu writes to a mounted Gluster volume with VirtIO-SCSI used in the VM.
Out of interest I ran a test with libgfapi disabled and strict-write-ordering on, and still got the same behavior on VirtIO-SCSI: SCSI aborts and lost writes appeared almost instantly after starting bonnie++.
Hi Krutika,

I ran with strict-write-ordering on & write-behind off, but I still see the same issue. I have emailed the setup details to you.

Thanks
kasturi
(In reply to RamaKasturi from comment #11) > Hi Krutika, > > I ran with strict-write-ordering on & write-behind off but i still see the same issue. I have emailed the setup details to you. > > Thanks > kasturi.
Hi Raghavendra,

We have encountered an issue where fio from within the VMs is failing with EIO, as is evident from the output Kasturi has pasted in the Description above. I traced the return status of fsync through the client stack, and all translators have returned success up to fuse-bridge:

[2017-10-18 10:53:32.357700] T [MSGID: 0] [client-rpc-fops.c:978:client3_3_fsync_cbk] 0-stack-trace: stack-address: 0x7f88b4028e20, data-client-1 returned 0
[2017-10-18 10:53:32.357747] T [MSGID: 0] [afr-common.c:3232:afr_fsync_unwind_cbk] 0-stack-trace: stack-address: 0x7f88b4028e20, data-replicate-0 returned 0
[2017-10-18 10:53:32.357769] T [MSGID: 0] [dht-inode-read.c:925:dht_fsync_cbk] 0-stack-trace: stack-address: 0x7f88b4028e20, data-dht returned 0
[2017-10-18 10:53:32.357788] T [MSGID: 0] [shard.c:4228:shard_fsync_cbk] 0-stack-trace: stack-address: 0x7f88b4028e20, data-shard returned 0
[2017-10-18 10:53:32.357800] T [MSGID: 0] [io-stats.c:2150:io_stats_fsync_cbk] 0-stack-trace: stack-address: 0x7f88b4028e20, data returned 0
[2017-10-18 10:53:32.357819] T [fuse-bridge.c:1281:fuse_err_cbk] 0-glusterfs-fuse: 105177: FSYNC() ERR => 0

This still does not explain how the 0 here turned into EIO. We thought of capturing a fuse-dump through RHEV, but unfortunately the mount.glusterfs.in script does NOT parse the dump-fuse option; it is only parsed and recognised when the volume is mounted through the 'glusterfs --volfile-server=...' command, and RHEV uses the 'mount -t glusterfs ...' command.

Is there any other way to know what transpired between fuse-bridge and the kernel?

-Krutika
(In reply to Evgheni Dereveanchin from comment #9) > Hi Krutika, I got this behavior on CentOS7 with > glusterfs-3.8.15-2.el7.x86_64 which is included into the latest stable Node > NG image of oVirt 4.1.6 > > As this is 4.1, it wasn't using libgfapi - after enabling it the problem > went away, so it's visible only when qemu writes to a mounted Gluster volume > with VirtIO-SCSI used in the VM. Hi Evgheni, Could you provide the output of the following command from your host machines: rpm -qa | grep qemu -Krutika
Hi Krutika, here are the requested qemu package versions we have on our 4.1.6 nodes:

qemu-img-ev-2.9.0-16.el7_4.5.1.x86_64
libvirt-daemon-driver-qemu-3.2.0-14.el7_4.3.x86_64
qemu-kvm-common-ev-2.9.0-16.el7_4.5.1.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-kvm-tools-ev-2.9.0-16.el7_4.5.1.x86_64
qemu-guest-agent-2.8.0-2.el7.x86_64
qemu-kvm-ev-2.9.0-16.el7_4.5.1.x86_64
We tried the same fio job (randrw8020_100iops.job) on the dev setup as well as the setup that performance engineering is using to measure runs - we did not hit the error in either.
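For anyone trying to reproduce: the actual randrw8020_100iops.job file is not attached to this bug, but going by the name (80/20 random read/write mix, rate-limited to 100 IOPS), a hypothetical job file might look like the sketch below. Every parameter value here is an assumption, not the real job file:

```ini
; Hypothetical reconstruction -- the real randrw8020_100iops.job was not
; shared in this bug; values are guesses based on the job name only.
[workload]
rw=randrw
rwmixread=80
rate_iops=100
bs=4k
size=1g
ioengine=libaio
direct=1
fsync=1
runtime=300
time_based=1
```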
Removing the flags and Test blocker as per QE status sent to rhhi-release-team on Nov 13, 2017
(In reply to Sahina Bose from comment #20)
> Removing the flags and Test blocker as per QE status sent to
> rhhi-release-team on Nov 13, 2017

Thanks Sahina for updating this information.
Sahina, sure. Please move it to ON_QA.