Bug 1497156
| Summary: | [gfapi] Running fio on vms results in fsync, error=Input/output error | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | RamaKasturi <knarra> |
| Component: | rhhi | Assignee: | Sahina Bose <sabose> |
| Status: | CLOSED DEFERRED | QA Contact: | SATHEESARAN <sasundar> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | Target Release: | --- |
| Version: | unspecified | CC: | ederevea, godas, guillaume.pavese, kdhananj, knarra, rcyriac, rhs-bugs, sasundar |
| Target Milestone: | --- | Keywords: | Tracking |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | | Clone Of: | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| : | 1501309 (view as bug list) | Environment: | |
| Last Closed: | 2020-05-15 06:35:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1501309, 1515107 | Bug Blocks: | |
| Attachments: | | | |
Created attachment 1332346 [details]
screenshot of the errors shown on the VM console
sosreports have been copied to the location below:
======================================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1497156/

Are you using gfapi access for the VM disks?

Sahina, I was seeing this issue with FUSE access.

We get the same results while testing with bonnie++ on a VM with VirtIO-SCSI running in a HyperConverged environment. If the disk type is changed to the older VirtIO backend, this behavior disappears.

Kasturi,

Could you enable strict-write-ordering and check if you still see this issue?

# gluster volume set <VOL> strict-write-ordering on

-Krutika

(In reply to Evgheni Dereveanchin from comment #6)
> We get the same results while testing with bonnie++ on a VM with VirtIO-SCSI
> running in a HyperConverged environment. If disk type is changed to the
> older VirtIO backend, this behavior disappears.

Hi Evgheni,

Which version of glusterfs did you see this issue with?

-Krutika

Hi Krutika, I got this behavior on CentOS7 with glusterfs-3.8.15-2.el7.x86_64, which is included in the latest stable Node NG image of oVirt 4.1.6.

As this is 4.1, it wasn't using libgfapi. After enabling libgfapi the problem went away, so it is visible only when qemu writes to a mounted Gluster volume with VirtIO-SCSI used in the VM.

Out of interest I ran a test with libgfapi disabled and strict-write-ordering on, and still got the same behavior with VirtIO-SCSI: SCSI aborts and lost writes appeared almost instantly after starting bonnie++.

Hi Krutika,

I ran with strict-write-ordering on and write-behind off, but I still see the same issue. I have emailed the setup details to you.

Thanks
kasturi.

Thanks kasturi.

(In reply to RamaKasturi from comment #11)
> Hi Krutika,
>
> I ran with strict-write-ordering on & write-behind off but i still see the same issue. I have emailed the setup details to you.
>
> Thanks
> kasturi.

Hi Raghavendra,

We have encountered an issue where fio from within the VMs is failing with EIO, as evident from the output Kasturi has pasted in the Description comment above. I traced the return status of fsync through the client stack, and all translators returned success up to fuse-bridge:

```
[2017-10-18 10:53:32.357700] T [MSGID: 0] [client-rpc-fops.c:978:client3_3_fsync_cbk] 0-stack-trace: stack-address: 0x7f88b4028e20, data-client-1 returned 0
[2017-10-18 10:53:32.357747] T [MSGID: 0] [afr-common.c:3232:afr_fsync_unwind_cbk] 0-stack-trace: stack-address: 0x7f88b4028e20, data-replicate-0 returned 0
[2017-10-18 10:53:32.357769] T [MSGID: 0] [dht-inode-read.c:925:dht_fsync_cbk] 0-stack-trace: stack-address: 0x7f88b4028e20, data-dht returned 0
[2017-10-18 10:53:32.357788] T [MSGID: 0] [shard.c:4228:shard_fsync_cbk] 0-stack-trace: stack-address: 0x7f88b4028e20, data-shard returned 0
[2017-10-18 10:53:32.357800] T [MSGID: 0] [io-stats.c:2150:io_stats_fsync_cbk] 0-stack-trace: stack-address: 0x7f88b4028e20, data returned 0
[2017-10-18 10:53:32.357819] T [fuse-bridge.c:1281:fuse_err_cbk] 0-glusterfs-fuse: 105177: FSYNC() ERR => 0
```

This still does not explain how the 0 here turned into EIO. We thought of capturing a fuse dump through RHEV, but unfortunately the mount.glusterfs.in script does NOT parse the dump-fuse option; it is only parsed and recognised when the volume is mounted through the 'glusterfs --volfile-server=...' command, and RHEV uses the 'mount -t glusterfs ...' command. Is there any other way to know what transpired between fuse-bridge and the kernel?

-Krutika
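As an aside on the fuse-dump limitation described above: a minimal sketch of mounting the volume with the glusterfs client binary directly, assuming the client build still accepts the dump-fuse option; the host name, mount point, and dump path are placeholders, and the volume name `data` is inferred from the translator names in the trace.

```
# Mount via the glusterfs binary instead of 'mount -t glusterfs', since only this
# path parses --dump-fuse; host name and paths below are placeholders.
glusterfs --volfile-server=rhhi-host1.example.com \
          --volfile-id=data \
          --dump-fuse=/var/tmp/data-fuse.dump \
          /mnt/data-debug

# Reproduce the fsync failure against /mnt/data-debug, then unmount and
# inspect /var/tmp/data-fuse.dump for the fuse request/response pairs.
umount /mnt/data-debug
```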
(In reply to Evgheni Dereveanchin from comment #9)
> Hi Krutika, I got this behavior on CentOS7 with
> glusterfs-3.8.15-2.el7.x86_64 which is included into the latest stable Node
> NG image of oVirt 4.1.6
>
> As this is 4.1, it wasn't using libgfapi - after enabling it the problem
> went away, so it's visible only when qemu writes to a mounted Gluster volume
> with VirtIO-SCSI used in the VM.

Hi Evgheni,

Could you provide the output of the following command from your host machines:

rpm -qa | grep qemu

-Krutika

Hi Krutika, here are the requested qemu package versions we have on our 4.1.6 nodes:

```
qemu-img-ev-2.9.0-16.el7_4.5.1.x86_64
libvirt-daemon-driver-qemu-3.2.0-14.el7_4.3.x86_64
qemu-kvm-common-ev-2.9.0-16.el7_4.5.1.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-kvm-tools-ev-2.9.0-16.el7_4.5.1.x86_64
qemu-guest-agent-2.8.0-2.el7.x86_64
qemu-kvm-ev-2.9.0-16.el7_4.5.1.x86_64
```

We tried the same fio job (randrw8020_100iops.job) on the dev setup as well as the setup that performance engineering uses to measure runs; we did not hit the error in either.

Removing the flags and Test blocker as per QE status sent to rhhi-release-team on Nov 13, 2017.

(In reply to Sahina Bose from comment #20)
> Removing the flags and Test blocker as per QE status sent to
> rhhi-release-team on Nov 13, 2017

Thanks Sahina for updating this information.

Sahina, sure. Please move it to ON_QA.
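The randrw8020_100iops.job file mentioned above is not attached to this bug, so the following is only a hypothetical sketch of an 80/20 random read/write job capped at roughly 100 IOPS, matching the name; every parameter value, the test file path, and the driver usage are assumptions.

```
# Hypothetical reconstruction of an 80/20 randrw job capped at ~100 IOPS
# (80 read / 20 write); the real randrw8020_100iops.job may differ.
cat > randrw8020_100iops.job <<'EOF'
[workload]
ioengine=libaio
direct=1
rw=randrw
rwmixread=80
rate_iops=80,20
bs=4k
size=1g
filename=/var/tmp/fio-testfile
runtime=300
time_based
EOF

# Run it from a driver node against one of the app VMs (the VM must be running
# 'fio --server'); repeat for the other VMs, as was done for the json.out above.
fio --client=10.70.35.71 --output-format=json --output=json.out randrw8020_100iops.job
```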
Description of problem:
While running an fio workload on the VMs created in the RHHI setup, the fio job gives the below error in the json.out file.

json.out file:

```
hostname=dhcp35-71.lab.eng.blr.redhat.com, be=0, 64-bit, os=Linux, arch=x86-64, fio=fio-2.1.7, flags=1
hostname=dhcp35-168.lab.eng.blr.redhat.com, be=0, 64-bit, os=Linux, arch=x86-64, fio=fio-2.1.7, flags=1
hostname=dhcp35-141.lab.eng.blr.redhat.com, be=0, 64-bit, os=Linux, arch=x86-64, fio=fio-2.1.7, flags=1
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> {
<dhcp35-168.lab.eng.blr.redhat.com>   "fio version" : "fio-2.1.7",
<dhcp35-168.lab.eng.blr.redhat.com>   "jobs" : [
<dhcp35-168.lab.eng.blr.redhat.com>
<dhcp35-168.lab.eng.blr.redhat.com>   ]
<dhcp35-168.lab.eng.blr.redhat.com> }
Run status group 0 (all jobs):
client <10.70.35.168>: exited with error 1
<dhcp35-71.lab.eng.blr.redhat.com> fio: pid=0, err=5/file:filesetup.c:174, func=fsync, error=Input/output error
<dhcp35-71.lab.eng.blr.redhat.com> {
<dhcp35-71.lab.eng.blr.redhat.com>   "fio version" : "fio-2.1.7",
<dhcp35-71.lab.eng.blr.redhat.com>   "jobs" : [
<dhcp35-71.lab.eng.blr.redhat.com>
<dhcp35-71.lab.eng.blr.redhat.com>   ]
<dhcp35-71.lab.eng.blr.redhat.com> }
Run status group 0 (all jobs):
client <10.70.35.71>: exited with error 1
<dhcp35-141.lab.eng.blr.redhat.com> fio: pid=0, err=5/file:filesetup.c:174, func=fsync, error=Input/output error
<dhcp35-141.lab.eng.blr.redhat.com> {
<dhcp35-141.lab.eng.blr.redhat.com>   "fio version" : "fio-2.1.7",
<dhcp35-141.lab.eng.blr.redhat.com>   "jobs" : [
<dhcp35-141.lab.eng.blr.redhat.com>
<dhcp35-141.lab.eng.blr.redhat.com>   ]
<dhcp35-141.lab.eng.blr.redhat.com> }
Run status group 0 (all jobs):
client <10.70.35.141>: exited with error 1
{
  "fio version" : "fio-2.1.7",
  "client_stats" : [
    {
      "jobname" : "workload",
      "groupid" : 0,
```

Attaching the error which is seen on the VM console during the fio run.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-45.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a RHHI setup with the latest glusterfs & RHV bits.
2. Create 3 app VMs.
3. Start running fio.

Actual results:
Running fio results in fsync, error=Input/output error.

Expected results:
fsync should run successfully.

Additional info:
A screenshot of the VM console during the fio run is captured. Attaching the json.out file and all the sosreports.
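For completeness, a short sketch of applying the volume options that were tried in the comments above (strict-write-ordering on, write-behind off), assuming the standard option keys; the volume name `data` is inferred from the translator names in the fsync trace and may need to be adjusted for the actual setup.

```
# Toggle the options tried during triage; 'data' is the volume name inferred
# from the trace (data-client-1, data-replicate-0, data-dht, data-shard).
gluster volume set data performance.strict-write-ordering on
gluster volume set data performance.write-behind off

# Confirm the reconfigured options before rerunning the fio workload.
gluster volume info data
```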