Bug 1398930 - [Ganesha] : "Remote I/O Error" during IO from heterogeneous mounts. [NEEDINFO]
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: nfs-ganesha
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: ---
: ---
Assignee: Kaleb KEITHLEY
QA Contact: Ambarish
URL:
Whiteboard:
Depends On: 1385605
Blocks:
Reported: 2016-11-27 12:32 UTC by Ambarish
Modified: 2017-08-23 12:27 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-23 12:27:39 UTC
skoduri: needinfo? (asoman)



Description Ambarish 2016-11-27 12:32:45 UTC
Description of problem:
-----------------------

4-node cluster, 2x2 volume, mounted via NFSv3 on 2 clients and via NFSv4 on 2 clients.

I was running IO (dd, untar, iozone reads) from different clients (v3 and v4).

From one of the clients dd started erroring out:

dd: failed to open ‘/gluster-mount/ambarnew7666.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7667.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7668.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7669.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7670.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7671.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7672.txt’: Remote I/O error
dd: failed to open ‘/gluster-mount/ambarnew7673.txt’: Remote I/O error
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.410611 s, 255 MB/s
dd: failed to open ‘/gluster-mount/ambarnew7675.txt’: Remote I/O error
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.437637 s, 240 MB/s
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.424957 s, 247 MB/s
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.414986 s, 253 MB/s
dd: failed to open ‘/gluster-mount/ambarnew7679.txt’: Remote I/O error
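
To quantify how many dd runs failed versus succeeded in a captured output like the one above, a small parser can help. This is only a sketch: the function name summarize_dd_output is made up for illustration, and it assumes the output format is exactly what is pasted in this report (a "failed to open" line per failure, a "bytes ... copied" line per success).

```python
import re

# Matches lines such as:
#   dd: failed to open ‘/gluster-mount/ambarnew7666.txt’: Remote I/O error
FAILED_OPEN = re.compile(r"^dd: failed to open [‘'](?P<path>.+?)[’']: (?P<reason>.+)$")
# Matches dd's summary line, e.g. "104857600 bytes (105 MB) copied, ..."
COPIED = re.compile(r"^\d+ bytes .* copied,")


def summarize_dd_output(text):
    """Return (successful_runs, [(path, reason), ...]) for captured dd output."""
    failures = []
    successes = 0
    for line in text.splitlines():
        line = line.strip()
        match = FAILED_OPEN.match(line)
        if match:
            failures.append((match.group("path"), match.group("reason")))
        elif COPIED.match(line):
            successes += 1
    return successes, failures
```

Feeding it the client's terminal capture gives the count of completed writes and the list of files that could not be opened, which makes it easier to line the failures up against server-side log timestamps.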

On another client, untar failed as well (it exited due to previous errors).

tar: Exiting with failure status due to previous errors


Version-Release number of selected component (if applicable):
-------------------------------------------------------------

nfs-ganesha-gluster-2.4.1-1.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64



How reproducible:
------------------

1/1

Steps to Reproduce:
-------------------

1. Mount a 2x2 volume via NFSv3 and NFSv4 on 4 clients (2 each).

2. Run dd, iozone and a tarball untar from the clients.


Actual results:
---------------

Remote I/O error on one of the clients.

Expected results:
-----------------

No Errors on the application side.

Additional info:
-----------------

OS : RHEL 7.3

*Vol Config* :

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: aeab0f8a-1e34-4681-bdf4-5b1416e46f27
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Comment 2 Ambarish 2016-11-27 12:34:15 UTC
The gfapi logs are flooded with frame unwind error messages:

[2016-11-27 07:23:29.245456] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7ff154296642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7ff15405c75e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7ff15405c86e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7ff15405dfc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7ff15405e8a0] ))))) 0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2016-11-27 07:23:28.595291 (xid=0x139abc)


And with EBADFD warnings:

[2016-11-27 07:23:11.231093] W [MSGID: 114061] [client-common.c:705:client_pre_fstat] 0-testvol-client-3:  (ec3fca88-2c5e-4391-b699-27f27426c687) remote_fd is -1. EBADFD [File descriptor in bad state]
[2016-11-27 07:23:25.321676] W [MSGID: 114061] [client-common.c:705:client_pre_fstat] 0-testvol-client-2:  (aeafb564-9f8f-4e18-9a61-7dd82334e865) remote_fd is -1. EBADFD [File descriptor in bad state]
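
To see which client translators are affected and how often, the two message types above can be tallied per translator. This is a rough sketch, not a Gluster tool: the function name tally_client_errors and the category labels are invented here, and the regular expression assumes translator names of the form seen in these excerpts (for example 0-testvol-client-1).

```python
import re
from collections import Counter

# Picks out the client translator name, e.g. "testvol-client-1" from " 0-testvol-client-1:"
XLATOR = re.compile(r"\s\d+-(?P<name>[\w-]+-client-\d+):")


def tally_client_errors(lines):
    """Count forced-unwind and EBADFD messages per client translator."""
    counts = Counter()
    for line in lines:
        match = XLATOR.search(line)
        if not match:
            continue
        if "saved_frames_unwind" in line:
            kind = "forced_unwind"
        elif "EBADFD" in line:
            kind = "EBADFD"
        else:
            continue  # some other message; ignored in this sketch
        counts[(match.group("name"), kind)] += 1
    return counts
```

Calling tally_client_errors(open(path)) on a gfapi log (path being wherever the log was saved) yields a Counter keyed by (translator, kind), which shows at a glance whether the errors cluster on one brick connection or are spread across all of them.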

Comment 8 Mohammed Rafi KC 2016-11-29 14:46:34 UTC
From the Ganesha logs, the I/O errors were generated around 27/11/2016 03:20:02. But the gfapi logs attached to the bug don't have any entries at that point in time. In fact, gfapi log entries are only present from 2016-11-27 04:29:51 onward.

There is also no information about any errors in the brick logs. Can you please provide the client logs?

If you want to reproduce the bug to capture the exact logs, please let me know.
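
The coverage check described in this comment (does the gfapi log span the time of the Ganesha error?) can be sketched as below. The helper names log_time_range and covers are hypothetical, the timestamp formats are taken from the excerpts in this bug, and the sketch assumes both logs use the same time zone, which would need to be verified before trusting the result.

```python
import re
from datetime import datetime

# Leading timestamp of a gfapi/Gluster log line, e.g. "[2016-11-27 07:23:29.245456]"
GLUSTER_TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]")


def log_time_range(lines):
    """Return (earliest, latest) timestamps found in the log, or None if there are none."""
    stamps = []
    for line in lines:
        match = GLUSTER_TS.match(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    if not stamps:
        return None
    return min(stamps), max(stamps)


def covers(lines, ganesha_time):
    """True if a Ganesha-style time ("27/11/2016 03:20:02") falls inside the log's range.

    Assumption: both logs are in the same time zone.
    """
    when = datetime.strptime(ganesha_time, "%d/%m/%Y %H:%M:%S")
    span = log_time_range(lines)
    return span is not None and span[0] <= when <= span[1]
```

With a log whose first entry is at 2016-11-27 04:29:51, covers(lines, "27/11/2016 03:20:02") is False, matching the observation that the attached logs start after the errors occurred.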

Comment 9 Mohammed Rafi KC 2016-11-29 14:52:10 UTC
I have seen some disconnections with xdr_decode failing; I have posted more information on this in bug 1397473. Since those disconnections took place only after the I/O errors were logged, I don't suspect them as the root cause of this bug. But this is also a severe bug that we need to debug.

