Description of problem:
-----------------------
4-node cluster, 2x2 volume, mounted via NFSv3 on 2 clients and via NFSv4 on 2 clients. Was running I/O (dd, untar, iozone reads) from different clients (v3 and v4). On one of the clients, dd started erroring out:

dd: failed to open ‘/gluster-mount/ambarnew7666.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7667.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7668.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7669.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7670.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7671.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7672.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7673.txt’: Remote I/O error
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.410611 s, 255 MB/s
dd: failed to open ‘/gluster-mount/ambarnew7675.txt’: Remote I/O error
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.437637 s, 240 MB/s
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.424957 s, 247 MB/s
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.414986 s, 253 MB/s
dd: failed to open ‘/gluster-mount/ambarnew7679.txt’: Remote I/O error

On another client, the untar failed as well (it exited due to the earlier errors):

tar: Exiting with failure status due to previous errors

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-gluster-2.4.1-1.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64

How reproducible:
------------------
1/1

Steps to Reproduce:
-------------------
1. Mount a 2x2 volume via NFSv3 and NFSv4 on 4 clients (2 each).
2. Run dd, iozone, and a tarball untar from the different clients (a minimal sketch of these steps follows the volume configuration below).

Actual results:
---------------
Remote I/O error on one of the clients.

Expected results:
-----------------
No errors on the application side.

Additional info:
-----------------
OS: RHEL 7.3

*Vol Config*:

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: aeab0f8a-1e34-4681-bdf4-5b1416e46f27
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
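For anyone trying to reproduce this, here is a minimal sketch of the mount and workload steps, pieced together from the dd output and volume config above. The mount source, file count, tarball choice, and iozone flags are assumptions, not taken from this report:

# Mount the volume via NFSv3 on two clients and NFSv4 on the other two.
# SERVER is one of the Ganesha nodes (or the HA VIP); the exact mount
# source used here is not stated in the report.
SERVER=gqas013.sbu.lab.eng.bos.redhat.com
mount -t nfs -o vers=3 ${SERVER}:/testvol /gluster-mount    # clients 1-2
mount -t nfs -o vers=4 ${SERVER}:/testvol /gluster-mount    # clients 3-4

# Client 1: stream of 100 MB dd writes, matching the file names in the
# errors above (the loop bound is an assumption).
i=0
while [ ${i} -lt 10000 ]; do
    dd if=/dev/zero of=/gluster-mount/ambarnew${i}.txt bs=1M count=100
    i=$((i + 1))
done

# Client 2: untar a large tarball onto the mount (tarball is arbitrary).
tar xf linux-4.8.tar.xz -C /gluster-mount

# Client 3: iozone write + read pass on the mount (flags are an assumption).
iozone -a -i 0 -i 1 -f /gluster-mount/iozone.tmp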
gfapi logs are flooded with frame unwind error messages:

[2016-11-27 07:23:29.245456] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7ff154296642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7ff15405c75e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7ff15405c86e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7ff15405dfc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7ff15405e8a0] ))))) 0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2016-11-27 07:23:28.595291 (xid=0x139abc)

And EBADFD errors:

[2016-11-27 07:23:11.231093] W [MSGID: 114061] [client-common.c:705:client_pre_fstat] 0-testvol-client-3: (ec3fca88-2c5e-4391-b699-27f27426c687) remote_fd is -1. EBADFD [File descriptor in bad state]
[2016-11-27 07:23:25.321676] W [MSGID: 114061] [client-common.c:705:client_pre_fstat] 0-testvol-client-2: (aeafb564-9f8f-4e18-9a61-7dd82334e865) remote_fd is -1. EBADFD [File descriptor in bad state]
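To quantify the flood and line the gfapi timestamps up against the Ganesha side, something like the following can be run on each node. The log paths are assumed defaults and may differ on a given install:

# Bucket forced-unwind and EBADFD messages per minute in the gfapi log
# (path assumed; adjust to wherever ganesha-gfapi.log lives on your nodes).
grep -E 'saved_frames_unwind|EBADFD' /var/log/ganesha-gfapi.log |
    awk '{ print substr($2, 1, 5) }' | sort | uniq -c

# Pull Ganesha-side errors from the same window for comparison.
grep -iE 'error|crit' /var/log/ganesha.log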
From the Ganesha logs, the I/O errors were generated around 27/11/2016 03:20:02, but the gfapi logs attached to the bug have no entries at that point in time; in fact, the gfapi logs only start at 2016-11-27 04:29:51. There is also no information about any errors in the brick logs. So can you please provide the logs from the clients? If you need to reproduce the bug to capture the exact logs, please let me know.
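In case it helps, a sketch of what to collect, assuming standard RHEL 7 log locations (the paths are assumptions):

# On each NFS client: kernel-side NFS messages around the failure window.
dmesg -T | grep -i nfs
grep -i nfs /var/log/messages

# On each Ganesha node: bundle the Ganesha and gfapi logs covering 03:20.
tar czf /tmp/ganesha-logs-$(hostname).tar.gz /var/log/ganesha*.log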
I have seen some disconnections with xdr_decode failing; I have posted more information on this in bug 1397473. Since those disconnections took place only after the I/O errors were logged, I do not suspect them as the root cause of this bug. But that is also a severe bug which we need to debug.