Description of problem:
------------------------

6-node Ganesha/Gluster cluster with a 12*(4+2) volume mounted via NFS v3/v4 on 6 clients.

The 6 clients run Filebench; the WML is shared in the comments.

Killed Ganesha on one of the nodes and restarted it after a while to simulate failover/failback.

I/O errored with an EIO on 2/6 clients (I could hit it three times out of five):

*Attempt 1*:

<snip>
117.183: Running...
246.827: Failed to open file 9503, /gluster-mount/d4/bigfileset/00000001/00009504, with status 10: Input/output error
246.827: failed to create file createfile1
246.827: proxycache-10: flowop createfile1-1 failed
246.847: Failed to open file 9505, /gluster-mount/d4/bigfileset/00000001/00009506, with status 10: Input/output error
246.847: failed to create file createfile1
246.847: proxycache-16: flowop createfile1-1 failed
246.848: Failed to open file 9506, /gluster-mount/d4/bigfileset/00000001/00009507, with status 10: Input/output error
246.848: failed to create file createfile1
246.848: proxycache-69: flowop createfile1-1 failed
246.849: Failed to open file 9509, /gluster-mount/d4/bigfileset/00000001/00009510, with status 10: Input/output error
246.849: failed to create file createfile1
246.849: proxycache-2: flowop createfile1-1 failed
247.203: Run took 130 seconds...
247.203: NO VALID RESULTS! Filebench run terminated prematurely around line 65
247.203: Shutting down processes
[root@gqac010 /]#
</snip>

*Attempt 2*:

<snip>
178.144: Running...
346.952: Failed to open file 9169, /gluster-mount/d6/bigfileset/00000001/00009170, with status 10: Input/output error
346.952: failed to create file createfile1
346.952: proxycache-86: flowop createfile1-1 failed
346.967: Failed to open file 9173, /gluster-mount/d6/bigfileset/00000001/00009174, with status 10: Input/output error
346.967: failed to create file createfile1
346.967: proxycache-48: flowop createfile1-1 failed
347.168: Run took 169 seconds...
347.168: NO VALID RESULTS! Filebench run terminated prematurely around line 65
347.168: Shutting down processes
[root@gqac011 /]#
</snip>

*Attempt 3*:

<snip>
426.741: Failed to open file 5696, /gluster-mount/d1/bigfileset/00000001/00005697, with status 10: Input/output error
426.741: failed to create file createfile1
426.741: proxycache-24: flowop createfile1-1 failed
426.747: Failed to open file 5698, /gluster-mount/d1/bigfileset/00000001/00005699, with status 10: Input/output error
426.748: failed to create file createfile1
426.748: proxycache-30: flowop createfile1-1 failed
427.600: Run took 288 seconds...
427.600: NO VALID RESULTS! Filebench run terminated prematurely around line 65
427.600: Shutting down processes
[root@gqac008 /]#
[root@gqac008 /]#

AND..

134.257: Waiting for pre-allocation to finish (in case of a parallel pre-allocation)
134.257: Population and pre-allocation of filesets completed
134.257: Starting 1 proxycache instances
135.264: Running...
426.737: Failed to open file 8354, /gluster-mount/d6/bigfileset/00000001/00008355, with status 10: Input/output error
426.737: failed to create file createfile1
426.737: proxycache-98: flowop createfile1-1 failed
426.738: Failed to open file 8356, /gluster-mount/d6/bigfileset/00000001/00008357, with status 10: Input/output error
426.738: failed to create file createfile1
426.738: proxycache-74: flowop createfile1-1 failed
427.309: Run took 292 seconds...
427.309: NO VALID RESULTS! Filebench run terminated prematurely around line 65
427.309: Shutting down processes
</snip>

Logs and tcpdumps will be shared in the comments.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

nfs-ganesha-gluster-2.4.4-17.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-50.el7rhgs.x86_64

How reproducible:
-----------------

Often, 3/5.

Actual results:
---------------

EIO on the application side.

Expected results:
-----------------

Seamless failovers and failbacks. Zero error status on client workloads.
Additional info:
----------------
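For triage, the "Failed to open file" lines in the Filebench output from the attempts above can be tallied per directory, which makes it easier to compare which mount paths failed on each client. The following is only a minimal sketch: the function name `summarize_open_failures` and the regex are illustrative assumptions based on the log lines quoted in this report, not part of Filebench.

```python
import re
from collections import Counter

# Assumed line shape, taken from the Filebench output quoted in this report:
#   <ts>: Failed to open file <idx>, <path>, with status <n>: <message>
FAILED_OPEN = re.compile(
    r"^\s*(?P<ts>\d+\.\d+): Failed to open file (?P<idx>\d+), "
    r"(?P<path>[^,]+), with status (?P<status>\d+): (?P<msg>.+?)\s*$"
)


def summarize_open_failures(lines):
    """Return (failures, by_dir) for the open failures found in `lines`.

    failures is a list of (timestamp, path, message) tuples;
    by_dir counts failures per parent directory of the failing file.
    """
    failures = []
    for line in lines:
        m = FAILED_OPEN.match(line)
        if m:
            failures.append((float(m["ts"]), m["path"], m["msg"]))
    by_dir = Counter(path.rsplit("/", 1)[0] for _, path, _ in failures)
    return failures, by_dir


# Sample lines copied from Attempt 1 of this report.
SAMPLE = [
    "246.827: Failed to open file 9503, /gluster-mount/d4/bigfileset/00000001/00009504, with status 10: Input/output error",
    "246.827: failed to create file createfile1",
    "246.847: Failed to open file 9505, /gluster-mount/d4/bigfileset/00000001/00009506, with status 10: Input/output error",
]

if __name__ == "__main__":
    found, per_dir = summarize_open_failures(SAMPLE)
    for ts, path, msg in found:
        print(ts, path, msg)
    print(dict(per_dir))
```

To use it on a real run, pass the lines of a saved Filebench console log instead of `SAMPLE`.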
FileBench WML:

set $dir=/gluster-mount/d1
set $nfiles=100000
set $meandirwidth=200
set $filesize=cvar(type=cvar-gamma,parameters=mean:131072;gamma:1.5)
set $nthreads=50
set $iosize=1m
set $meanappendsize=16k

define fileset name=bigfileset,path=$dir,size=$filesize,entries=$nfiles,dirwidth=$meandirwidth,prealloc=80

define process name=filereader,instances=1
{
  thread name=filereaderthread,memsize=10m,instances=$nthreads
  {
    flowop createfile name=createfile1,filesetname=bigfileset,fd=1
    flowop writewholefile name=wrtfile1,srcfd=1,fd=1,iosize=$iosize
    flowop closefile name=closefile1,fd=1
    flowop openfile name=openfile1,filesetname=bigfileset,fd=1
    flowop appendfilerand name=appendfilerand1,iosize=$meanappendsize,fd=1
    flowop closefile name=closefile2,fd=1
    flowop openfile name=openfile2,filesetname=bigfileset,fd=1
    flowop readwholefile name=readfile1,fd=1,iosize=$iosize
    flowop closefile name=closefile3,fd=1
    flowop deletefile name=deletefile1,filesetname=bigfileset
    flowop statfile name=statfile1,filesetname=bigfileset
  }
}

echo "File-server Version 3.0 personality successfully loaded"

run 72000
I don't think this is a regression (can try if required). I have hit bugs like this before (with Bonnie etc.) where I failed to provide a reproducer to Dev.

With Filebench, we have an easy reproducer.
From brick logs:

[2017-10-30 09:37:14.084037] W [inodelk.c:399:pl_inodelk_log_cleanup] 0-butcher-server: releasing lock on 00000000-0000-0000-0000-000000000001 held by {client=0x7f41b0013400, pid=-3 lk-owner=60da0318fd7e0000}
[2017-10-30 09:37:14.084071] I [MSGID: 115013] [server-helpers.c:289:do_fd_cleanup] 0-butcher-server: fd cleanup on /
[2017-10-30 09:37:14.084221] I [MSGID: 101055] [client_t.c:440:gf_client_unref] 0-butcher-server: Shutting down connection gqas007.sbu.lab.eng.bos.redhat.com-28664-2017/10/30-09:36:28:32553-butcher-client-23-0-0
[2017-10-30 09:49:00.864932] I [MSGID: 113109] [posix.c:1613:posix_mkdir] 0-butcher-posix: mkdir (00000000-0000-0000-0000-000000000001/bigfileset): failing preop of mkdir (/bricks4/brick/bigfileset) as on-disk xattr value differs from argument value for key trusted.glusterfs.dht [Input/output error]
[2017-10-30 09:49:00.865015] E [MSGID: 115056] [server-rpc-fops.c:516:server_mkdir_cbk] 0-butcher-server: 24: MKDIR /bigfileset (00000000-0000-0000-0000-000000000001/bigfileset) client: gqas013.sbu.lab.eng.bos.redhat.com-31011-2017/10/30-09:35:03:531335-butcher-client-23-0-0 [Input/output error]
[2017-10-30 09:49:35.552548] I [MSGID: 115056] [server-rpc-fops.c:472:server_rmdir_cbk] 0-butcher-server: 192: RMDIR /bigfileset/00000005/00000003/00000012/00000005/00000009/00000017/00000085 (fe369743-f7f0-4d7b-8892-b3a2e21f869a/00000085) ==> (No such file or directory) [No such file or directory]
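When going through larger brick logs, entries like the ones above can be filtered down to those that end in "[Input/output error]". The sketch below assumes every entry has the shape seen in this excerpt (timestamp, one-letter level, optional MSGID, source location, domain, message); the names `parse_brick_log` and `eio_entries` are hypothetical helpers, not Gluster tooling.

```python
import re

# Assumed entry shape, based only on the brick log lines quoted in this report:
#   [<timestamp>] <L> [MSGID: <n>] [<file>:<line>:<func>] <domain>: <message>
# The MSGID field is optional (the W entry in the excerpt has none).
ENTRY = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"
    r"\[(?P<src>[^\]]+)\] (?P<domain>\S+): (?P<msg>.*)$"
)


def parse_brick_log(lines):
    """Parse log lines into dicts; lines that do not match are skipped."""
    entries = []
    for line in lines:
        m = ENTRY.match(line.strip())
        if m:
            entries.append(m.groupdict())
    return entries


def eio_entries(entries):
    """Keep only entries whose message ends with '[Input/output error]'."""
    return [e for e in entries if e["msg"].endswith("[Input/output error]")]


# Sample entries copied from the brick log excerpt in this report.
SAMPLE = [
    "[2017-10-30 09:37:14.084037] W [inodelk.c:399:pl_inodelk_log_cleanup] "
    "0-butcher-server: releasing lock on 00000000-0000-0000-0000-000000000001 "
    "held by {client=0x7f41b0013400, pid=-3 lk-owner=60da0318fd7e0000}",
    "[2017-10-30 09:49:00.865015] E [MSGID: 115056] "
    "[server-rpc-fops.c:516:server_mkdir_cbk] 0-butcher-server: 24: MKDIR "
    "/bigfileset (00000000-0000-0000-0000-000000000001/bigfileset) client: "
    "gqas013.sbu.lab.eng.bos.redhat.com-31011-2017/10/30-09:35:03:531335-"
    "butcher-client-23-0-0 [Input/output error]",
    "[2017-10-30 09:49:35.552548] I [MSGID: 115056] "
    "[server-rpc-fops.c:472:server_rmdir_cbk] 0-butcher-server: 192: RMDIR "
    "/bigfileset/00000005/00000003/00000012/00000005/00000009/00000017/00000085 "
    "(fe369743-f7f0-4d7b-8892-b3a2e21f869a/00000085) ==> "
    "(No such file or directory) [No such file or directory]",
]

if __name__ == "__main__":
    for entry in eio_entries(parse_brick_log(SAMPLE)):
        print(entry["ts"], entry["level"], entry["src"], entry["msg"])
```

Matching on the trailing error string rather than on the log level is deliberate here, since the posix_mkdir entry in the excerpt reports the I/O error at level I.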
(In reply to Ambarish from comment #3)
> I dont think this is a regression(Can try if required).I have hit bugs like
> this before (with Bonnie etc) where I've failed to provide a reproducer to
> Dev.
>
> With Filebench,we have an easy reproducer.

As the nfs-ganesha sources haven't changed from RHGS 3.3.0 to RHGS 3.3.1, this is definitely not a regression in the ganesha build. Most likely the error was generated by the gluster stack. Will look into the sos reports and tcpdumps once they are attached.