Description of problem:
-----------------------
4-node cluster, EC 2 x (4+2), mounted with vers=4. Was running bonnie++ on one client (4 instances of Bonnie++, in different directories). All four instances errored out:

******************
gqac005 Instance 1
******************

Changing to the specified mountpoint
/gluster-mount/run3671
executing bonnie
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...Bonnie: drastic I/O error (re-write read): Device or resource busy
Can't read a full block, only got 8193 bytes.
Can't read a full block, only got 8193 bytes.
Can't read a full block, only got 8193 bytes.
Can't write block.: Stale file handle

real    69m35.539s
user    0m2.267s
sys     1m17.413s

bonnie failed

*******************
gqac005 Instance 2
*******************

Changing to the specified mountpoint
/gluster-mount/run3670
executing bonnie
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...
Bonnie: drastic I/O error (re-write read): Stale file handle
Can't write block.: Bad file descriptor
Can't sync file.

real    80m19.318s
user    1m24.500s
sys     10m38.587s

bonnie failed 0

*******************
gqac005 Instance 3
*******************

Changing to the specified mountpoint
/gluster-mount/run3648
executing bonnie
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...
Can't write block.: Stale file handle
Can't write block 11132488.

real    69m44.168s
user    0m1.992s
sys     1m0.215s

bonnie failed 0
Total 0 tests were successful

*******************
gqac005 Instance 4
*******************

/gluster-mount/run3647
executing bonnie
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...
Can't write block.: Stale file handle
Can't write block 11148389.
real    69m43.853s
user    0m2.034s
sys     1m0.031s

bonnie failed 0
Total 0 tests were successful

On the first try, I could see an error in creates as well:

<snip>
Changing to the specified mountpoint
/gluster-mount/run6446
executing bonnie
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...Can't create file 000000133e1lebGWeE
Cleaning up test directory after error.

real    75m26.431s
user    0m6.082s
sys     3m0.132s

bonnie failed
</snip>

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-2.4.1-3.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.4.1-3.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.1-3.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-10.el7rhgs.x86_64

How reproducible:
-----------------
Fairly reproducible: 3 out of 6 attempts. MTTF: ~1.5 hours.

Steps to Reproduce:
-------------------
1. Create an EC volume and mount it via NFSv4.
2. Run Bonnie++.
3. Fail over/fail back multiple times.

Actual results:
---------------
I/O errors in the application.

Expected results:
-----------------
Zero error status from the workload generator.

Additional info:
----------------
*Server & Client*: RHEL 7.3

sosreports in description.
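For reference, the reproduction steps above can be sketched as a shell session. This is only an outline against a live RHGS cluster, not something runnable standalone; node names, brick paths, and the VIP are illustrative placeholders, not taken from this setup:

```shell
# 1. Create a 2 x (4+2) dispersed (EC) volume and export it via nfs-ganesha.
#    Placeholder bricks: 12 bricks spread over node1..node4.
gluster volume create vol_ec disperse-data 4 redundancy 2 \
    node{1..4}:/bricks/b{1..3} force
gluster volume start vol_ec
gluster nfs-ganesha enable                    # bring up the ganesha HA cluster
gluster volume set vol_ec ganesha.enable on   # export the volume

# 2. On the client, mount over NFSv4 and run 4 bonnie++ instances
#    in separate directories, as in the report.
mount -t nfs -o vers=4 <VIP>:/vol_ec /gluster-mount
for i in 1 2 3 4; do
    mkdir -p /gluster-mount/run$i
    bonnie++ -d /gluster-mount/run$i -u root &
done

# 3. While the workload runs, fail the active ganesha node over and back
#    repeatedly, e.g. by stopping/starting nfs-ganesha on the VIP holder.
systemctl stop nfs-ganesha     # failover
systemctl start nfs-ganesha    # failback, after the VIP has moved
```

The errors in this report surfaced during step 3, after one or more failover/failback cycles.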
I could reproduce this error after running 4 instances of Bonnie++, as done by Ambarish.

Changing to the specified mountpoint
/mnt/nfs/dir1/run7163
executing bonnie
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...Bonnie: drastic I/O error (re-write read): Stale file handle
Total 0 tests were successful
Switching over to the previous working directory
Removing /mnt/nfs/dir2/run7164/
rmdir: failed to remove ‘/mnt/nfs/dir2/run7164/’: Stale file handle
rmdir failed: Directory not empty
[root@dhcp9 ~]#
...
...

The mount point itself has become stale, and hence all 4 Bonnie++ instances failed with STALE_FILE_HANDLE. This error was returned after failback (i.e., when nfs-ganesha was restarted):

[root@dhcp46-111 ~]# showmount -e
Export list for dhcp46-111.lab.eng.blr.redhat.com:
[root@dhcp46-111 ~]#

12/01/2017 15:17:26 : epoch c2810000 : dhcp46-111.lab.eng.blr.redhat.com : ganesha.nfsd-11504[main] glusterfs_create_export :FSAL :EVENT :Volume vol_ec exported at : '/'
12/01/2017 15:17:30 : epoch c2810000 : dhcp46-111.lab.eng.blr.redhat.com : ganesha.nfsd-11504[main] glusterfs_create_export :FSAL :CRIT :Unable to initialize volume. Export: /vol_ec
12/01/2017 15:17:31 : epoch c2810000 : dhcp46-111.lab.eng.blr.redhat.com : ganesha.nfsd-11504[main] fsal_cfg_commit :CONFIG :CRIT :Could not create export for (/vol_ec) to (/vol_ec)
12/01/2017 15:17:32 : epoch c2810000 : dhcp46-111.lab.eng.blr.redhat.com : ganesha.nfsd-11504[main] main :NFS STARTUP :WARN :No export entries found in configuration file !!!
The gfapi log shows:

[2017-01-12 09:47:26.703717] E [socket.c:2309:socket_connect_finish] 0-gfapi: connection to ::1:24007 failed (Connection refused)
[2017-01-12 09:47:26.703930] E [MSGID: 104024] [glfs-mgmt.c:735:mgmt_rpc_notify] 0-glfs-mgmt: failed to connect with remote-host: localhost (Transport endpoint is not connected) [Transport endpoint is not connected]
[2017-01-12 09:47:30.672554] E [MSGID: 104007] [glfs-mgmt.c:633:glfs_mgmt_getspec_cbk] 0-glfs-mgmt: failed to fetch volume file (key:vol_ec) [Invalid argument]
[2017-01-12 09:47:30.689123] E [MSGID: 104024] [glfs-mgmt.c:735:mgmt_rpc_notify] 0-glfs-mgmt: failed to connect with remote-host: localhost (No space left on device) [No space left on device]

Volume initialization failed either because of network connection failures or because there was no space left on the machine. In either case, the result was that the nfs-ganesha server did not export the volume, and hence the mount point became stale. I shall give it one more test run and see if we hit the same issue. Since the volume hasn't got unexported
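Based on the log analysis above, the two suspected failure modes can be checked quickly after a failback. These are generic diagnostic commands run as root on the ganesha node; the ganesha log path varies by version, and the volume name follows this report's vol_ec:

```shell
# Is the volume actually exported after the ganesha restart?
# An empty export list means clients will start seeing ESTALE.
showmount -e localhost

# Did export creation fail at startup? Look for the FSAL CRIT messages
# seen above (log path may be /var/log/ganesha.log or
# /var/log/ganesha/ganesha.log depending on the build).
grep -E 'glusterfs_create_export.*CRIT|No export entries found' /var/log/ganesha.log

# Rule out the two causes named in the gfapi log:
gluster volume status vol_ec    # is glusterd reachable on port 24007?
df -h /                         # "No space left on device"?
```

If showmount returns an empty list while the volume is still started, the export was lost during the ganesha restart rather than explicitly unexported.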
I did not hit this issue while trying to reproduce it. I hit some other network issue even before the EIO was hit (at the time of reporting). So, until now, I have not got a clean run with Bonnie++ on Ganesha mounts with continuous failover/failback. I would say defer this and bring it back if and when QE gets a reproducer.
Reopen if seen again.