Bug 1398930

Summary: [Ganesha] : "Remote I/O Error" during IO from heterogeneous mounts.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ambarish <asoman>
Component: nfs-ganesha
Assignee: Kaleb KEITHLEY <kkeithle>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Ambarish <asoman>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, asoman, bturner, dang, ffilz, jthottan, mbenjamin, nbalacha, pkarampu, rgowdapp, rhinduja, rhs-bugs, rjoseph, rkavunga, rtalur, skoduri, spalai, storage-qa-internal
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-23 12:27:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1385605
Bug Blocks:

Description Ambarish 2016-11-27 12:32:45 UTC
Description of problem:
-----------------------

4-node cluster, 2x2 volume, mounted via NFSv3 on 2 clients and via NFSv4 on 2 clients.

Was running I/O (dd, untar, iozone reads) from the different clients (v3 and v4).

From one of the clients, dd started erroring out:

dd: failed to open ‘/gluster-mount/ambarnew7666.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7667.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7668.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7669.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7670.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7671.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7672.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7673.txt’: Remote I/O error
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.410611 s, 255 MB/s
dd: failed to open ‘/gluster-mount/ambarnew7675.txt’: Remote I/O error
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.437637 s, 240 MB/s
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.424957 s, 247 MB/s
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.414986 s, 253 MB/s
dd: failed to open ‘/gluster-mount/ambarnew7679.txt’: Remote I/O error

On another client, untar failed as well (it exited due to previous errors).

tar: Exiting with failure status due to previous errors


Version-Release number of selected component (if applicable):
-------------------------------------------------------------

nfs-ganesha-gluster-2.4.1-1.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64



How reproducible:
------------------

1/1

Steps to Reproduce:
-------------------

1. Mount a 2x2 volume via NFSv3 and NFSv4 on 4 clients (2 each).

2. Run dd, iozone, and tarball untar workloads (see the sketch of the commands after these steps).
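
For reference, a rough sketch of the mounts and workload. The Ganesha VIP/hostname, export path, tarball, and iozone invocation are illustrative placeholders; the /gluster-mount mount point, file names, and the 100 MB dd size are taken from the output above.

# NFSv3 mount on two of the clients (<ganesha-vip> is a placeholder)
mount -t nfs -o vers=3 <ganesha-vip>:/testvol /gluster-mount

# NFSv4 mount on the other two clients
mount -t nfs -o vers=4 <ganesha-vip>:/testvol /gluster-mount

# dd workload: a stream of 100 MB files, matching the byte counts in the output above
for i in $(seq 1 10000); do
    dd if=/dev/zero of=/gluster-mount/ambarnew${i}.txt bs=1M count=100
done

# untar workload on another client (any large tarball)
tar xf <some-tarball>.tar.xz -C /gluster-mount

# iozone read workload (one possible invocation; the write pass creates the file for the read pass)
iozone -i 0 -i 1 -s 1g -r 64k -f /gluster-mount/iozone.tmp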


Actual results:
---------------

Remote I/O error on one of the clients.

Expected results:
-----------------

No errors on the application side.

Additional info:
-----------------

OS: RHEL 7.3

*Vol Config* (a sketch of the corresponding CLI workflow follows the output):

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: aeab0f8a-1e34-4681-bdf4-5b1416e46f27
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
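
For context, a volume with this profile would typically be created and exported along the following lines. This is a sketch of the standard RHGS/ganesha workflow, not the exact commands run on this setup; several of the options above (e.g. performance.stat-prefetch, server.allow-insecure) are normally applied by the ganesha enable hooks rather than by hand.

# 2x2 distributed-replicate volume from the four bricks listed above
gluster volume create testvol replica 2 \
    gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0 \
    gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1 \
    gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2 \
    gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
gluster volume start testvol

# shared storage and the ganesha HA cluster, then export the volume
gluster volume set all cluster.enable-shared-storage enable
gluster nfs-ganesha enable
gluster volume set testvol ganesha.enable on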

Comment 2 Ambarish 2016-11-27 12:34:15 UTC
gfapi logs are flooded with frame unwind error messages:

[2016-11-27 07:23:29.245456] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7ff154296642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7ff15405c75e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7ff15405c86e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7ff15405dfc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7ff15405e8a0] ))))) 0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2016-11-27 07:23:28.595291 (xid=0x139abc)


And EBADFD:

[2016-11-27 07:23:11.231093] W [MSGID: 114061] [client-common.c:705:client_pre_fstat] 0-testvol-client-3:  (ec3fca88-2c5e-4391-b699-27f27426c687) remote_fd is -1. EBADFD [File descriptor in bad state]
[2016-11-27 07:23:25.321676] W [MSGID: 114061] [client-common.c:705:client_pre_fstat] 0-testvol-client-2:  (aeafb564-9f8f-4e18-9a61-7dd82334e865) remote_fd is -1. EBADFD [File descriptor in bad state]

Comment 8 Mohammed Rafi KC 2016-11-29 14:46:34 UTC
From the Ganesha logs, the I/O errors were generated around 2016-11-27 03:20:02, but the gfapi logs attached to the bug don't have any entries at that point in time. In fact, gfapi log entries are only present from 2016-11-27 04:29:51 onwards.

And there is no information regarding any errors in the brick logs. So can you please provide the log information from the clients?

If you want to reproduce the bug to get the exact logs, please do let me know.

Comment 9 Mohammed Rafi KC 2016-11-29 14:52:10 UTC
I have seen some disconnects with xdr_decode failing; I have posted more information on this in bug 1397473. Since those disconnects took place only after the I/O errors were logged, I don't suspect them as the root cause of this bug. But that is also a severe bug which we need to debug.

Comment 17 Red Hat Bugzilla 2023-09-14 03:35:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days