Bug 1507776 - [Ganesha] : Filebench errors out with an "Input/Output Error" almost immediately when IO resumes post failback.
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: nfs-ganesha
Version: rhgs-3.3
Hardware: x86_64
OS: Linux
medium
high
Target Milestone: ---
: ---
Assignee: Soumya Koduri
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-10-31 06:29 UTC by Ambarish
Modified: 2020-07-26 09:11 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-06 11:56:12 UTC
Embargoed:



Description Ambarish 2017-10-31 06:29:29 UTC
Description of problem:
------------------------

6-node Ganesha/Gluster cluster with a 12 x (4+2) volume mounted via NFSv3/v4 on 6 clients.
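
For readers unfamiliar with the notation, the geometry of this setup can be worked out with a short sketch. It assumes that "12 x (4+2)" means 12 distribute subvolumes, each dispersed over 4 data bricks plus 2 redundancy bricks, and that bricks are spread evenly across the 6 nodes (the report does not state the brick layout). The class and field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DispersedVolume:
    """Illustrative model of a distributed-dispersed volume."""
    subvolumes: int   # number of distribute subvolumes (12 here)
    data: int         # data bricks per subvolume (4 here)
    redundancy: int   # redundancy bricks per subvolume (2 here)

    @property
    def bricks(self) -> int:
        # Total bricks across all subvolumes.
        return self.subvolumes * (self.data + self.redundancy)

    @property
    def usable_fraction(self) -> float:
        # Share of raw capacity that holds data rather than redundancy.
        return self.data / (self.data + self.redundancy)


vol = DispersedVolume(subvolumes=12, data=4, redundancy=2)
NODES = 6  # from the report

# Assumption: bricks are distributed evenly over the nodes.
bricks_per_node = vol.bricks // NODES

print("total bricks:", vol.bricks)
print("bricks per node (even layout assumed):", bricks_per_node)
print("usable fraction of raw capacity:", round(vol.usable_fraction, 3))
```

Under these assumptions the volume has 72 bricks, 12 per node, and each subvolume can lose up to 2 bricks without losing data.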

The 6 clients run Filebench; the WML is shared in the comments.

Killed Ganesha on one of the nodes and restarted it after a while to simulate failover/failback.

I/O failed with EIO on 2 of the 6 clients (I hit it three times out of five attempts):


*Attempt 1* :

<snip>

117.183: Running...
246.827: Failed to open file 9503, /gluster-mount/d4/bigfileset/00000001/00009504, with status 10: Input/output error
246.827: failed to create file createfile1
246.827: proxycache-10: flowop createfile1-1 failed
246.847: Failed to open file 9505, /gluster-mount/d4/bigfileset/00000001/00009506, with status 10: Input/output error
246.847: failed to create file createfile1
246.847: proxycache-16: flowop createfile1-1 failed
246.848: Failed to open file 9506, /gluster-mount/d4/bigfileset/00000001/00009507, with status 10: Input/output error
246.848: failed to create file createfile1
246.848: proxycache-69: flowop createfile1-1 failed
246.849: Failed to open file 9509, /gluster-mount/d4/bigfileset/00000001/00009510, with status 10: Input/output error
246.849: failed to create file createfile1
246.849: proxycache-2: flowop createfile1-1 failed
247.203: Run took 130 seconds...
247.203: NO VALID RESULTS! Filebench run terminated prematurely around line 65
247.203: Shutting down processes
[root@gqac010 /]# 

</snip>


*Attempt 2* :

<snip>

178.144: Running...
346.952: Failed to open file 9169, /gluster-mount/d6/bigfileset/00000001/00009170, with status 10: Input/output error
346.952: failed to create file createfile1
346.952: proxycache-86: flowop createfile1-1 failed
346.967: Failed to open file 9173, /gluster-mount/d6/bigfileset/00000001/00009174, with status 10: Input/output error
346.967: failed to create file createfile1
346.967: proxycache-48: flowop createfile1-1 failed
347.168: Run took 169 seconds...
347.168: NO VALID RESULTS! Filebench run terminated prematurely around line 65
347.168: Shutting down processes
[root@gqac011 /]# 

</snip>

*Attempt 3* :

<snip>

426.741: Failed to open file 5696, /gluster-mount/d1/bigfileset/00000001/00005697, with status 10: Input/output error
426.741: failed to create file createfile1
426.741: proxycache-24: flowop createfile1-1 failed
426.747: Failed to open file 5698, /gluster-mount/d1/bigfileset/00000001/00005699, with status 10: Input/output error
426.748: failed to create file createfile1
426.748: proxycache-30: flowop createfile1-1 failed
427.600: Run took 288 seconds...
427.600: NO VALID RESULTS! Filebench run terminated prematurely around line 65
427.600: Shutting down processes
[root@gqac008 /]# 
[root@gqac008 /]# 

AND..

134.257: Waiting for pre-allocation to finish (in case of a parallel pre-allocation)
134.257: Population and pre-allocation of filesets completed
134.257: Starting 1 proxycache instances
135.264: Running...
426.737: Failed to open file 8354, /gluster-mount/d6/bigfileset/00000001/00008355, with status 10: Input/output error
426.737: failed to create file createfile1
426.737: proxycache-98: flowop createfile1-1 failed
426.738: Failed to open file 8356, /gluster-mount/d6/bigfileset/00000001/00008357, with status 10: Input/output error
426.738: failed to create file createfile1
426.738: proxycache-74: flowop createfile1-1 failed
427.309: Run took 292 seconds...
427.309: NO VALID RESULTS! Filebench run terminated prematurely around line 65
427.309: Shutting down processes  

</snip>
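
To line these failures up against the time of the Ganesha restart, the "Failed to open file" lines can be pulled out of a Filebench console capture. The following is a minimal sketch; the regular expression is derived only from the lines pasted above, and the function name `parse_open_failures` is made up for illustration.

```python
import re
from collections import Counter

# Shape taken from the console output in this report, e.g.
# "246.827: Failed to open file 9503, <path>, with status 10: Input/output error"
OPEN_FAILURE = re.compile(
    r"^(?P<ts>\d+\.\d+): Failed to open file (?P<index>\d+), "
    r"(?P<path>\S+), with status (?P<status>\d+): (?P<error>.+)$"
)


def parse_open_failures(lines):
    """Return one dict per 'Failed to open file' line; other lines are skipped."""
    failures = []
    for line in lines:
        match = OPEN_FAILURE.match(line.strip())
        if match:
            failures.append({
                "ts": float(match["ts"]),      # seconds since Filebench start
                "path": match["path"],
                "status": int(match["status"]),
                "error": match["error"],
            })
    return failures


sample = [
    "117.183: Running...",
    "246.827: Failed to open file 9503, /gluster-mount/d4/bigfileset/00000001/00009504, with status 10: Input/output error",
    "246.827: failed to create file createfile1",
    "246.847: Failed to open file 9505, /gluster-mount/d4/bigfileset/00000001/00009506, with status 10: Input/output error",
]

failures = parse_open_failures(sample)
by_error = Counter(f["error"] for f in failures)
print("first failure at", failures[0]["ts"], "s:", failures[0]["path"])
print(dict(by_error))
```

The timestamp of the first failure, relative to the "Running..." line, gives a rough window to compare with the failback time on the server side.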


Logs and tcpdumps will be shared in comments.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

nfs-ganesha-gluster-2.4.4-17.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-50.el7rhgs.x86_64

How reproducible:
-----------------

Often, 3/5.

Actual results:
---------------

EIO on the application side.

Expected results:
-----------------

Seamless failovers and failbacks. No errors on client workloads.

Additional info:
----------------

Comment 2 Ambarish 2017-10-31 06:31:47 UTC
Filebench WML:

set $dir=/gluster-mount/d1
set $nfiles=100000
set $meandirwidth=200
set $filesize=cvar(type=cvar-gamma,parameters=mean:131072;gamma:1.5)
set $nthreads=50
set $iosize=1m
set $meanappendsize=16k

define fileset name=bigfileset,path=$dir,size=$filesize,entries=$nfiles,dirwidth=$meandirwidth,prealloc=80

define process name=filereader,instances=1
{
  thread name=filereaderthread,memsize=10m,instances=$nthreads
  {
    flowop createfile name=createfile1,filesetname=bigfileset,fd=1
    flowop writewholefile name=wrtfile1,srcfd=1,fd=1,iosize=$iosize
    flowop closefile name=closefile1,fd=1
    flowop openfile name=openfile1,filesetname=bigfileset,fd=1
    flowop appendfilerand name=appendfilerand1,iosize=$meanappendsize,fd=1
    flowop closefile name=closefile2,fd=1
    flowop openfile name=openfile2,filesetname=bigfileset,fd=1
    flowop readwholefile name=readfile1,fd=1,iosize=$iosize
    flowop closefile name=closefile3,fd=1
    flowop deletefile name=deletefile1,filesetname=bigfileset
    flowop statfile name=statfile1,filesetname=bigfileset
  }
}

echo  "File-server Version 3.0 personality successfully loaded"


run 72000
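
As a rough sizing aid for anyone reproducing this, the fileset parameters above translate into the following expected footprint per client. This is only a back-of-the-envelope sketch: it assumes `prealloc=80` means 80 percent of the entries are created before the run, and it uses the gamma distribution's mean as the size of every file, so real totals will vary.

```python
# Values copied from the WML above.
NFILES = 100_000          # $nfiles
PREALLOC_PERCENT = 80     # prealloc=80 on the fileset
MEAN_FILE_SIZE = 131_072  # mean of $filesize, in bytes (128 KiB)
NTHREADS = 50             # $nthreads

# Files created during the pre-allocation phase.
preallocated_files = NFILES * PREALLOC_PERCENT // 100

# Expected data volume if every file had exactly the mean size.
expected_bytes = preallocated_files * MEAN_FILE_SIZE
expected_gib = expected_bytes / 2**30

print("preallocated files:", preallocated_files)
print("expected preallocated data: %.2f GiB" % expected_gib)
print("worker threads per client:", NTHREADS)
```

With these numbers each client pre-creates about 80,000 files, on the order of 10 GiB, before the workload that hits the EIO starts.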

Comment 3 Ambarish 2017-10-31 06:33:12 UTC
I don't think this is a regression (I can check if required). I have hit bugs like this before (with Bonnie etc.) where I failed to provide a reproducer to Dev.

With Filebench, we have an easy reproducer.

Comment 4 Ambarish 2017-10-31 06:39:44 UTC
From brick logs:

[2017-10-30 09:37:14.084037] W [inodelk.c:399:pl_inodelk_log_cleanup] 0-butcher-server: releasing lock on 00000000-0000-0000-0000-000000000001 held by {client=0x7f41b0013400, pid=-3 lk-owner=60da0318fd7e0000}
[2017-10-30 09:37:14.084071] I [MSGID: 115013] [server-helpers.c:289:do_fd_cleanup] 0-butcher-server: fd cleanup on /
[2017-10-30 09:37:14.084221] I [MSGID: 101055] [client_t.c:440:gf_client_unref] 0-butcher-server: Shutting down connection gqas007.sbu.lab.eng.bos.redhat.com-28664-2017/10/30-09:36:28:32553-butcher-client-23-0-0
[2017-10-30 09:49:00.864932] I [MSGID: 113109] [posix.c:1613:posix_mkdir] 0-butcher-posix: mkdir (00000000-0000-0000-0000-000000000001/bigfileset): failing preop of mkdir (/bricks4/brick/bigfileset) as on-disk xattr value differs from argument value for key trusted.glusterfs.dht [Input/output error]
[2017-10-30 09:49:00.865015] E [MSGID: 115056] [server-rpc-fops.c:516:server_mkdir_cbk] 0-butcher-server: 24: MKDIR /bigfileset (00000000-0000-0000-0000-000000000001/bigfileset) client: gqas013.sbu.lab.eng.bos.redhat.com-31011-2017/10/30-09:35:03:531335-butcher-client-23-0-0 [Input/output error]
[2017-10-30 09:49:35.552548] I [MSGID: 115056] [server-rpc-fops.c:472:server_rmdir_cbk] 0-butcher-server: 192: RMDIR /bigfileset/00000005/00000003/00000012/00000005/00000009/00000017/00000085 (fe369743-f7f0-4d7b-8892-b3a2e21f869a/00000085) ==> (No such file or directory) [No such file or directory]
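
For triage across several bricks, lines like these can be split into fields and filtered for the ones that end in an EIO. The sketch below is illustrative only: the regular expression is inferred from the six lines pasted above rather than from any Gluster log-format specification, and the names `BRICK_LINE` and `parse_brick_log` are made up.

```python
import re
from datetime import datetime

# "[timestamp] LEVEL [MSGID: n] [file:line:function] domain: message"
# The MSGID block is optional (the first line above has none).
BRICK_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<domain>\S+): (?P<message>.*)$"
)


def parse_brick_log(lines):
    """Parse brick log lines into dicts; lines that do not match are skipped."""
    entries = []
    for raw in lines:
        match = BRICK_LINE.match(raw.strip())
        if not match:
            continue
        entry = match.groupdict()
        entry["ts"] = datetime.strptime(entry["ts"], "%Y-%m-%d %H:%M:%S.%f")
        entries.append(entry)
    return entries


sample = [
    "[2017-10-30 09:37:14.084037] W [inodelk.c:399:pl_inodelk_log_cleanup] 0-butcher-server: releasing lock on 00000000-0000-0000-0000-000000000001 held by {client=0x7f41b0013400, pid=-3 lk-owner=60da0318fd7e0000}",
    "[2017-10-30 09:37:14.084071] I [MSGID: 115013] [server-helpers.c:289:do_fd_cleanup] 0-butcher-server: fd cleanup on /",
    "[2017-10-30 09:49:00.865015] E [MSGID: 115056] [server-rpc-fops.c:516:server_mkdir_cbk] 0-butcher-server: 24: MKDIR /bigfileset (00000000-0000-0000-0000-000000000001/bigfileset) client: gqas013.sbu.lab.eng.bos.redhat.com-31011-2017/10/30-09:35:03:531335-butcher-client-23-0-0 [Input/output error]",
]

entries = parse_brick_log(sample)
# Keep only entries whose message ends with the EIO marker.
eio_entries = [e for e in entries if e["message"].endswith("[Input/output error]")]
for e in eio_entries:
    print(e["ts"], e["level"], e["msgid"], e["func"])
```

Filtering on the trailing "[Input/output error]" picks out the MKDIR failure on /bigfileset, which is the entry that lines up with the client-side error string.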

Comment 5 Soumya Koduri 2017-10-31 06:56:39 UTC
(In reply to Ambarish from comment #3)
> I don't think this is a regression (I can check if required). I have hit
> bugs like this before (with Bonnie etc.) where I failed to provide a
> reproducer to Dev.
> 
> With Filebench, we have an easy reproducer.

As the nfs-ganesha sources haven't changed from RHGS 3.3.0 to RHGS 3.3.1, this is definitely not a regression in the ganesha build. Most likely the error was generated in the gluster stack. I will look into the sos reports and tcpdumps once they are attached.

