Bug 1497156
| Summary: | [gfapi] Running fio on vms results in fsync, error=Input/output error | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | RamaKasturi <knarra> |
| Component: | rhhi | Assignee: | Sahina Bose <sabose> |
| Status: | CLOSED DEFERRED | QA Contact: | SATHEESARAN <sasundar> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | Target Release: | --- |
| Version: | unspecified | CC: | ederevea, godas, guillaume.pavese, kdhananj, knarra, rcyriac, rhs-bugs, sasundar |
| Target Milestone: | --- | Keywords: | Tracking |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | | Clone Of: | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| : | 1501309 (view as bug list) | Environment: | |
| Last Closed: | 2020-05-15 06:35:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1501309, 1515107 | Bug Blocks: | |
| Attachments: | | | |
Created attachment 1332346 [details]
screenshot of the errors shown on the VM console
sosreports have been copied to the location below:
======================================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1497156/

Are you using gfapi access for the VM disks?

Sahina, I was seeing this issue with FUSE access.

We get the same results while testing with bonnie++ on a VM with VirtIO-SCSI running in a HyperConverged environment. If the disk type is changed to the older VirtIO backend, this behavior disappears.

Kasturi,

Could you enable strict-write-ordering and check if you still see this issue?

# gluster volume set <VOL> strict-write-ordering on

-Krutika

(In reply to Evgheni Dereveanchin from comment #6)
> We get the same results while testing with bonnie++ on a VM with VirtIO-SCSI
> running in a HyperConverged environment. If disk type is changed to the
> older VirtIO backend, this behavior disappears.

Hi Evgheni,

Which version of glusterfs did you see this issue with?

-Krutika

Hi Krutika, I got this behavior on CentOS7 with glusterfs-3.8.15-2.el7.x86_64, which is included in the latest stable Node NG image of oVirt 4.1.6.

As this is 4.1, it wasn't using libgfapi. After enabling libgfapi the problem went away, so it is visible only when qemu writes to a mounted Gluster volume with VirtIO-SCSI used in the VM.

Out of interest I ran a test with libgfapi disabled and strict-write-ordering on, and still got the same behavior with VirtIO-SCSI: SCSI aborts and lost writes appeared almost instantly after starting bonnie++.

Hi Krutika,

I ran with strict-write-ordering on and write-behind off, but I still see the same issue. I have emailed the setup details to you.

Thanks
kasturi.

Thanks kasturi.

(In reply to RamaKasturi from comment #11)
> Hi Krutika,
>
> I ran with strict-write-ordering on & write-behind off but i still see the same issue. I have emailed the setup details to you.
>
> Thanks
> kasturi.

Hi Raghavendra,

We have encountered an issue where fio from within the VMs is failing with EIO, as evident from the output Kasturi has pasted in the Description comment above. I traced the return status of fsync through the client stack, and all translators returned success up to fuse-bridge:

```
[2017-10-18 10:53:32.357700] T [MSGID: 0] [client-rpc-fops.c:978:client3_3_fsync_cbk] 0-stack-trace: stack-address: 0x7f88b4028e20, data-client-1 returned 0
[2017-10-18 10:53:32.357747] T [MSGID: 0] [afr-common.c:3232:afr_fsync_unwind_cbk] 0-stack-trace: stack-address: 0x7f88b4028e20, data-replicate-0 returned 0
[2017-10-18 10:53:32.357769] T [MSGID: 0] [dht-inode-read.c:925:dht_fsync_cbk] 0-stack-trace: stack-address: 0x7f88b4028e20, data-dht returned 0
[2017-10-18 10:53:32.357788] T [MSGID: 0] [shard.c:4228:shard_fsync_cbk] 0-stack-trace: stack-address: 0x7f88b4028e20, data-shard returned 0
[2017-10-18 10:53:32.357800] T [MSGID: 0] [io-stats.c:2150:io_stats_fsync_cbk] 0-stack-trace: stack-address: 0x7f88b4028e20, data returned 0
[2017-10-18 10:53:32.357819] T [fuse-bridge.c:1281:fuse_err_cbk] 0-glusterfs-fuse: 105177: FSYNC() ERR => 0
```

This still does not explain how the 0 here turned into EIO. We thought of capturing a fuse dump through RHEV, but unfortunately the mount.glusterfs.in script does NOT parse the dump-fuse option; it is only parsed and recognised when the volume is mounted through the 'glusterfs --volfile-server=...' command, and RHEV uses the 'mount -t glusterfs ...' command. Is there any other way to know what transpired between fuse-bridge and the kernel?

-Krutika
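As an aside on the fuse-dump limitation described above: a minimal sketch of mounting the volume with the glusterfs client binary directly, assuming the client build still accepts the dump-fuse option; the host name, mount point, and dump path are placeholders, and the volume name `data` is inferred from the translator names in the trace.

```
# Mount via the glusterfs binary instead of 'mount -t glusterfs', since only this
# path parses --dump-fuse; host name and paths below are placeholders.
glusterfs --volfile-server=rhhi-host1.example.com \
          --volfile-id=data \
          --dump-fuse=/var/tmp/data-fuse.dump \
          /mnt/data-debug

# Reproduce the fsync failure against /mnt/data-debug, then unmount and
# inspect /var/tmp/data-fuse.dump for the fuse request/response pairs.
umount /mnt/data-debug
```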
(In reply to Evgheni Dereveanchin from comment #9)
> Hi Krutika, I got this behavior on CentOS7 with
> glusterfs-3.8.15-2.el7.x86_64 which is included into the latest stable Node
> NG image of oVirt 4.1.6
>
> As this is 4.1, it wasn't using libgfapi - after enabling it the problem
> went away, so it's visible only when qemu writes to a mounted Gluster volume
> with VirtIO-SCSI used in the VM.

Hi Evgheni,

Could you provide the output of the following command from your host machines:

rpm -qa | grep qemu

-Krutika

Hi Krutika, here are the requested qemu package versions we have on our 4.1.6 nodes:

```
qemu-img-ev-2.9.0-16.el7_4.5.1.x86_64
libvirt-daemon-driver-qemu-3.2.0-14.el7_4.3.x86_64
qemu-kvm-common-ev-2.9.0-16.el7_4.5.1.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-kvm-tools-ev-2.9.0-16.el7_4.5.1.x86_64
qemu-guest-agent-2.8.0-2.el7.x86_64
qemu-kvm-ev-2.9.0-16.el7_4.5.1.x86_64
```

We tried the same fio job (randrw8020_100iops.job) on the dev setup as well as the setup that performance engineering uses to measure runs; we did not hit the error in either.

Removing the flags and Test blocker as per QE status sent to rhhi-release-team on Nov 13, 2017.

(In reply to Sahina Bose from comment #20)
> Removing the flags and Test blocker as per QE status sent to
> rhhi-release-team on Nov 13, 2017

Thanks Sahina for updating this information.

Sahina, sure. Please move it to ON_QA.
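The randrw8020_100iops.job file mentioned above is not attached to this bug, so the following is only a hypothetical sketch of an 80/20 random read/write job capped at roughly 100 IOPS, matching the name; every parameter value, the test file path, and the driver usage are assumptions.

```
# Hypothetical reconstruction of an 80/20 randrw job capped at ~100 IOPS
# (80 read / 20 write); the real randrw8020_100iops.job may differ.
cat > randrw8020_100iops.job <<'EOF'
[workload]
ioengine=libaio
direct=1
rw=randrw
rwmixread=80
rate_iops=80,20
bs=4k
size=1g
filename=/var/tmp/fio-testfile
runtime=300
time_based
EOF

# Run it from a driver node against one of the app VMs (the VM must be running
# 'fio --server'); repeat for the other VMs, as was done for the json.out above.
fio --client=10.70.35.71 --output-format=json --output=json.out randrw8020_100iops.job
```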
Description of problem:
While running an fio workload on the VMs created in the RHHI setup, the fio job gives the below error in the json.out file.

json.out file:

```
hostname=dhcp35-71.lab.eng.blr.redhat.com, be=0, 64-bit, os=Linux, arch=x86-64, fio=fio-2.1.7, flags=1
hostname=dhcp35-168.lab.eng.blr.redhat.com, be=0, 64-bit, os=Linux, arch=x86-64, fio=fio-2.1.7, flags=1
hostname=dhcp35-141.lab.eng.blr.redhat.com, be=0, 64-bit, os=Linux, arch=x86-64, fio=fio-2.1.7, flags=1
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> file:filesetup.c:270, func=fstat, error=Input/output error
<dhcp35-168.lab.eng.blr.redhat.com> {
<dhcp35-168.lab.eng.blr.redhat.com>   "fio version" : "fio-2.1.7",
<dhcp35-168.lab.eng.blr.redhat.com>   "jobs" : [
<dhcp35-168.lab.eng.blr.redhat.com>
<dhcp35-168.lab.eng.blr.redhat.com>   ]
<dhcp35-168.lab.eng.blr.redhat.com> }
Run status group 0 (all jobs):
client <10.70.35.168>: exited with error 1
<dhcp35-71.lab.eng.blr.redhat.com> fio: pid=0, err=5/file:filesetup.c:174, func=fsync, error=Input/output error
<dhcp35-71.lab.eng.blr.redhat.com> {
<dhcp35-71.lab.eng.blr.redhat.com>   "fio version" : "fio-2.1.7",
<dhcp35-71.lab.eng.blr.redhat.com>   "jobs" : [
<dhcp35-71.lab.eng.blr.redhat.com>
<dhcp35-71.lab.eng.blr.redhat.com>   ]
<dhcp35-71.lab.eng.blr.redhat.com> }
Run status group 0 (all jobs):
client <10.70.35.71>: exited with error 1
<dhcp35-141.lab.eng.blr.redhat.com> fio: pid=0, err=5/file:filesetup.c:174, func=fsync, error=Input/output error
<dhcp35-141.lab.eng.blr.redhat.com> {
<dhcp35-141.lab.eng.blr.redhat.com>   "fio version" : "fio-2.1.7",
<dhcp35-141.lab.eng.blr.redhat.com>   "jobs" : [
<dhcp35-141.lab.eng.blr.redhat.com>
<dhcp35-141.lab.eng.blr.redhat.com>   ]
<dhcp35-141.lab.eng.blr.redhat.com> }
Run status group 0 (all jobs):
client <10.70.35.141>: exited with error 1
{
  "fio version" : "fio-2.1.7",
  "client_stats" : [
    {
      "jobname" : "workload",
      "groupid" : 0,
```

Attaching the error which is seen on the VM console during the fio run.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-45.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a RHHI setup with the latest glusterfs & RHV bits.
2. Create 3 app VMs.
3. Start running fio.

Actual results:
Running fio results in fsync, error=Input/output error.

Expected results:
fsync should run successfully.

Additional info:
A screenshot of the VM console during the fio run is captured. Attaching the json.out file and all the sosreports.
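For completeness, a short sketch of applying the volume options that were tried in the comments above (strict-write-ordering on, write-behind off), assuming the standard option keys; the volume name `data` is inferred from the translator names in the fsync trace and may need to be adjusted for the actual setup.

```
# Toggle the options tried during triage; 'data' is the volume name inferred
# from the trace (data-client-1, data-replicate-0, data-dht, data-shard).
gluster volume set data performance.strict-write-ordering on
gluster volume set data performance.write-behind off

# Confirm the reconfigured options before rerunning the fio workload.
gluster volume info data
```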