Bug 1411219

Summary: [Ganesha] : I/O Error post failover/failback
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ambarish <asoman>
Component: nfs-ganesha
Assignee: Soumya Koduri <skoduri>
Status: CLOSED CURRENTRELEASE
QA Contact: Ambarish <asoman>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, asoman, bturner, dang, ffilz, jthottan, kkeithle, mbenjamin, rcyriac, rhinduja, rhs-bugs, skoduri, storage-qa-internal
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-21 12:44:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ambarish 2017-01-09 07:02:19 UTC
Description of problem:
-----------------------

4-node cluster, EC 2 x (4+2), mounted with vers=4.

Was running bonnie++ on 1 client (4 instances of Bonnie, each in a different directory).
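
A minimal sketch of how such a run might be launched (the mount point matches the logs below, but the directory layout and bonnie++ options are assumptions):

    # four parallel bonnie++ instances, each in its own run directory (assumed layout)
    for i in 1 2 3 4; do
        mkdir -p /gluster-mount/run$i
        bonnie++ -d /gluster-mount/run$i -u root &
    done
    wait    # returns once all four instances have exited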

All four instances errored out:

******************
gqac005 Instance 1 
******************

Changing to the specified mountpoint
/gluster-mount/run3671
executing bonnie
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...
done
Rewriting...Bonnie: drastic I/O error (re-write read): Device or resource busy
Can't read a full block, only got 8193 bytes.
Can't read a full block, only got 8193 bytes.
Can't read a full block, only got 8193 bytes.
Can't write block.: Stale file handle

real    69m35.539s
user    0m2.267s
sys     1m17.413s
bonnie failed

*******************
gqac005 Instance 2 
*******************
Changing to the specified mountpoint
/gluster-mount/run3670
executing bonnie
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...
done
Rewriting...

Bonnie: drastic I/O error (re-write read): Stale file handle
Can't write block.: Bad file descriptor
Can't sync file.

real	80m19.318s
user	1m24.500s
sys	10m38.587s
bonnie failed
0

*******************
gqac005 Instance 3 
*******************

Changing to the specified mountpoint
/gluster-mount/run3648
executing bonnie
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...

Can't write block.: Stale file handle
Can't write block 11132488.

real    69m44.168s
user    0m1.992s
sys     1m0.215s
bonnie failed
0
Total 0 tests were successful

*******************
gqac005 Instance 4
*******************

/gluster-mount/run3647
executing bonnie
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...

Can't write block.: Stale file handle
Can't write block 11148389.

real	69m43.853s
user	0m2.034s
sys	1m0.031s
bonnie failed
0
Total 0 tests were successful


On the first try, I could see an error in creates as well:

<snip>
Changing to the specified mountpoint
/gluster-mount/run6446
executing bonnie
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...
done
Rewriting...
done
Reading a byte at a time...done
Reading intelligently...

done
start 'em...
done...done...done...done...done...
Create files in sequential order...Can't create file 000000133e1lebGWeE
Cleaning up test directory after error.

real    75m26.431s
user    0m6.082s
sys     3m0.132s
bonnie failed 

</snip>

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

nfs-ganesha-2.4.1-3.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.4.1-3.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.1-3.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-10.el7rhgs.x86_64


How reproducible:
-----------------

3 out of 6 attempts; fairly reproducible.
MTTF: 1.5 hours.

Steps to Reproduce:
-------------------

1. Create an EC volume and mount it via NFSv4.

2. Run Bonnie++.

3. Fail over/fail back multiple times (a hedged command sketch follows).
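
A rough sketch of these steps (host names, brick paths, the volume name, and the HA virtual IP are placeholders; the ganesha HA cluster itself is assumed to be configured already):

    # on a server: 2 x (4+2) dispersed volume; brick placement is simplified
    # here, so gluster may require 'force' to co-locate subvolume bricks
    gluster volume create vol_ec disperse-data 4 redundancy 2 \
        host1:/b/1 host2:/b/1 host3:/b/1 host4:/b/1 \
        host1:/b/2 host2:/b/2 host3:/b/2 host4:/b/2 \
        host1:/b/3 host2:/b/3 host3:/b/3 host4:/b/3 force
    gluster volume start vol_ec

    # export the volume through nfs-ganesha
    gluster nfs-ganesha enable
    gluster volume set vol_ec ganesha.enable on

    # on the client: mount over NFSv4 via the HA virtual IP
    mount -t nfs -o vers=4 VIP:/vol_ec /gluster-mount

    # on one server, repeatedly: stop ganesha so the VIP fails over, then
    # start it again to fail back, while bonnie++ runs on the client
    systemctl stop nfs-ganesha
    sleep 600
    systemctl start nfs-ganesha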

Actual results:
----------------

I/O errors on the application.

Expected results:
-----------------

Zero error status from the workload generator.

Additional info:
----------------

*Server & Client*: RHEL 7.3

sosreports in description.

Comment 6 Soumya Koduri 2017-01-12 10:10:41 UTC
I could reproduce this error after running 4 instances of Bonnie++ as done by Ambarish.


Changing to the specified mountpoint
/mnt/nfs/dir1/run7163
executing bonnie
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...Bonnie: drastic I/O error (re-write read): Stale file handle

Total 0 tests were successful
Switching over to the previous working directory
Removing /mnt/nfs/dir2/run7164/
rmdir: failed to remove ‘/mnt/nfs/dir2/run7164/’: Stale file handle
rmdir failed:Directory not empty
[root@dhcp9 ~]# 

...
...

The mount point itself had become stale, and hence all 4 bonnie instances failed with STALE_FILE_HANDLE. This error was returned after failback (i.e., when nfs-ganesha was restarted):

[root@dhcp46-111 ~]# showmount -e
Export list for dhcp46-111.lab.eng.blr.redhat.com:
[root@dhcp46-111 ~]#

12/01/2017 15:17:26 : epoch c2810000 : dhcp46-111.lab.eng.blr.redhat.com : ganesha.nfsd-11504[main] glusterfs_create_export :FSAL :EVENT :Volume vol_ec exported at : '/'
12/01/2017 15:17:30 : epoch c2810000 : dhcp46-111.lab.eng.blr.redhat.com : ganesha.nfsd-11504[main] glusterfs_create_export :FSAL :CRIT :Unable to initialize volume. Export: /vol_ec
12/01/2017 15:17:31 : epoch c2810000 : dhcp46-111.lab.eng.blr.redhat.com : ganesha.nfsd-11504[main] fsal_cfg_commit :CONFIG :CRIT :Could not create export for (/vol_ec) to (/vol_ec)
12/01/2017 15:17:32 : epoch c2810000 : dhcp46-111.lab.eng.blr.redhat.com : ganesha.nfsd-11504[main] main :NFS STARTUP :WARN :No export entries found in configuration file !!!


[2017-01-12 09:47:26.703717] E [socket.c:2309:socket_connect_finish] 0-gfapi: connection to ::1:24007 failed (Connection refused)
[2017-01-12 09:47:26.703930] E [MSGID: 104024] [glfs-mgmt.c:735:mgmt_rpc_notify] 0-glfs-mgmt: failed to connect with remote-host: localhost (Transport endpoint is not connected) [Transport endpoint is not connected]
[2017-01-12 09:47:30.672554] E [MSGID: 104007] [glfs-mgmt.c:633:glfs_mgmt_getspec_cbk] 0-glfs-mgmt: failed to fetch volume file (key:vol_ec) [Invalid argument]
[2017-01-12 09:47:30.689123] E [MSGID: 104024] [glfs-mgmt.c:735:mgmt_rpc_notify] 0-glfs-mgmt: failed to connect with remote-host: localhost (No space left on device) [No space left on device]


Volume initialization failed either because of network connection failures or because there was no space left on the machine. In either case, the nfs-ganesha server did not export the volume, and hence the mount point became stale.

Since the volume hasn't been unexported, I shall give it one more test run and see if we hit the same issue.
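
When the mount goes stale like this, the export state on the server can be checked with something like the following (log path and service names are RHGS defaults and may vary):

    # is the volume still exported? an empty list matches the failure above
    showmount -e localhost

    # look for export-creation failures around the restart
    grep -E 'glusterfs_create_export|fsal_cfg_commit' /var/log/ganesha.log | tail

    # rule out the two causes suspected here: glusterd reachability and disk space
    systemctl status glusterd
    df -h /var

    # once the underlying cause is cleared, a ganesha restart should re-create the export
    systemctl restart nfs-ganesha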

Comment 7 Ambarish 2017-01-25 10:44:59 UTC
I did not hit this issue while trying to repro it.

I hit some other network issue, even before the EIO was hit (at the time of reporting).


So, till now I have not had a clean run with Bonnie++ on Ganesha mounts with continuous failover/failback.

I would say defer this and bring it back in if and when QE gets a reproducer.

Comment 11 Kaleb KEITHLEY 2017-08-21 12:44:05 UTC
Reopen if seen again.