Bug 1460639

Summary: [Stress] : IO errored out with ENOTCONN.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ambarish <asoman>
Component: rpc
Assignee: Milind Changire <mchangir>
Status: CLOSED ERRATA
QA Contact: Rajesh Madaka <rmadaka>
Severity: high
Priority: unspecified
Version: rhgs-3.3
CC: amukherj, bturner, mchangir, msaini, nbalacha, nchilaka, rgowdapp, rhinduja, rhs-bugs, sheggodu, storage-qa-internal, vdas
Keywords: ZStream
Target Milestone: ---
Target Release: RHGS 3.4.0
Hardware: x86_64
OS: Linux
Whiteboard: rebase
Fixed In Version: glusterfs-3.12.2-1
Doc Type: If docs needed, set a value
Clones: 1461092 (view as bug list)
Last Closed: 2018-09-04 06:32:21 UTC
Type: Bug
Bug Blocks: 1461092, 1503134

Description Ambarish 2017-06-12 09:49:46 UTC
Description of problem:
-----------------------

Plain distribute volume, mounted on 4 clients via FUSE.

<snip>
[root@gqac016 ~]# mount|grep gqas014
gqas014.sbu.lab.eng.bos.redhat.com:/butcher on /gluster-mount2 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
</snip>
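(For completeness, a mount invocation consistent with the output above would look like the following; this is reconstructed from the mount listing, not copied from the test scripts.)

<snip>
mount -t glusterfs gqas014.sbu.lab.eng.bos.redhat.com:/butcher /gluster-mount2
</snip>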


Each client runs a different workload generator: kernel untar, dd, and smallfile creates.
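
The exact generator invocations were not captured in this report; the following is only a rough sketch of the per-client workloads (client numbering, mount points, tarball name, and smallfile parameters are illustrative, not taken from the test setup):

<snip>
# client 1: kernel untar into the volume
tar xf linux-4.11.tar.xz -C /gluster-mount1/untar

# client 2: large sequential writes (the failing dd command appears verbatim below)
dd if=/dev/zero of=/gluster-mount2/bigFile1 bs=1G count=100 conv=fdatasync

# clients 3 and 4: smallfile create workload
python smallfile_cli.py --operation create --threads 8 --files 10000 \
       --file-size 64 --top /gluster-mount3/smallfile
</snip>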

While creating the data set, dd failed with ENOTCONN:

<snip>

[root@gqac016 gluster-mount2]# dd if=/dev/zero of=bigFile1 bs=1G count=100 conv=fdatasync
dd: error writing ‘bigFile1’: Transport endpoint is not connected
dd: closing output file ‘bigFile1’: Transport endpoint is not connected
[root@gqac016 gluster-mount2]# 

</snip>

Nothing was manipulated from the back end.

All brick processes are up and running.

Disks have ample space as well:

[root@gqas014 /]# df -Th
Filesystem                    Type      Size  Used Avail Use% Mounted on
/dev/mapper/rhel_gqas014-root xfs        50G  2.2G   48G   5% /
devtmpfs                      devtmpfs   24G     0   24G   0% /dev
tmpfs                         tmpfs      24G   36K   24G   1% /dev/shm
tmpfs                         tmpfs      24G  121M   24G   1% /run
tmpfs                         tmpfs      24G     0   24G   0% /sys/fs/cgroup
/dev/sda1                     xfs      1014M  141M  874M  14% /boot
/dev/mapper/rhel_gqas014-home xfs       391G   33M  391G   1% /home
tmpfs                         tmpfs     4.7G     0  4.7G   0% /run/user/0
/dev/mapper/vg1-rhs_lv1       xfs       875G  6.2G  869G   1% /bricks1
/dev/mapper/vg2-rhs_lv2       xfs       875G  6.9G  868G   1% /bricks2
/dev/mapper/vg3-rhs_lv3       xfs       875G  6.9G  868G   1% /bricks3
/dev/mapper/vg4-rhs_lv4       xfs       875G  203G  673G  24% /bricks4
/dev/mapper/vg5-rhs_lv5       xfs       875G  6.2G  869G   1% /bricks5
/dev/mapper/vg6-rhs_lv6       xfs       875G  6.1G  869G   1% /bricks6
/dev/mapper/vg7-rhs_lv7       xfs       875G  7.6G  868G   1% /bricks7
/dev/mapper/vg8-rhs_lv8       xfs       875G  7.6G  867G   1% /bricks8
/dev/mapper/vg9-rhs_lv9       xfs       875G  7.6G  867G   1% /bricks9
/dev/mapper/vg10-rhs_lv10     xfs       875G  107G  768G  13% /bricks10
/dev/mapper/vg11-rhs_lv11     xfs       875G   34M  875G   1% /bricks11
/dev/mapper/vg12-rhs_lv12     xfs       875G   34M  875G   1% /bricks12
[root@gqas014 /]# 

[root@gqas015 /]# df -Th
Filesystem                    Type      Size  Used Avail Use% Mounted on
/dev/mapper/rhel_gqas015-root xfs        50G  2.2G   48G   5% /
devtmpfs                      devtmpfs   24G     0   24G   0% /dev
tmpfs                         tmpfs      24G   36K   24G   1% /dev/shm
tmpfs                         tmpfs      24G  345M   24G   2% /run
tmpfs                         tmpfs      24G     0   24G   0% /sys/fs/cgroup
/dev/sda1                     xfs      1014M  141M  874M  14% /boot
/dev/mapper/rhel_gqas015-home xfs       391G   33M  391G   1% /home
tmpfs                         tmpfs     4.7G     0  4.7G   0% /run/user/0
/dev/mapper/vg1-rhs_lv1       xfs       875G  6.2G  869G   1% /bricks1
/dev/mapper/vg2-rhs_lv2       xfs       875G  6.9G  868G   1% /bricks2
/dev/mapper/vg3-rhs_lv3       xfs       875G  6.9G  868G   1% /bricks3
/dev/mapper/vg4-rhs_lv4       xfs       875G  6.9G  868G   1% /bricks4
/dev/mapper/vg5-rhs_lv5       xfs       875G  6.9G  868G   1% /bricks5
/dev/mapper/vg6-rhs_lv6       xfs       875G  6.9G  868G   1% /bricks6
/dev/mapper/vg7-rhs_lv7       xfs       875G  6.2G  869G   1% /bricks7
/dev/mapper/vg8-rhs_lv8       xfs       875G  7.7G  867G   1% /bricks8
/dev/mapper/vg9-rhs_lv9       xfs       875G  7.6G  868G   1% /bricks9
/dev/mapper/vg10-rhs_lv10     xfs       875G  6.9G  868G   1% /bricks10
/dev/mapper/vg11-rhs_lv11     xfs       875G   34M  875G   1% /bricks11
/dev/mapper/vg12-rhs_lv12     xfs       875G   34M  875G   1% /bricks12
[root@gqas015 /]# 


Version-Release number of selected component (if applicable):
-------------------------------------------------------------

3.8.4-27

How reproducible:
------------------

Reporting the first occurrence.


Additional info:
----------------

*Vol Info*:
[root@gqas014 /]# gluster v info
 
Volume Name: butcher
Type: Distribute
Volume ID: f297fb8e-f276-4f96-8a58-a1215112d3b2
Status: Started
Snapshot Count: 0
Number of Bricks: 20
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick2: gqas015.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick3: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick4: gqas015.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick5: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick6: gqas015.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick7: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick8: gqas015.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick9: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick10: gqas015.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick11: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick12: gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick13: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick14: gqas015.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick15: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick16: gqas015.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick17: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick18: gqas015.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick19: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick20: gqas015.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000
[root@gqas014 /]# 
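
For reference, the non-default options listed under "Options Reconfigured" would typically have been applied with gluster volume set; a hedged reconstruction (values are taken from the vol info above, the command form is assumed):

<snip>
gluster volume set butcher server.event-threads 4
gluster volume set butcher client.event-threads 4
gluster volume set butcher features.cache-invalidation on
gluster volume set butcher features.cache-invalidation-timeout 600
gluster volume set butcher performance.md-cache-timeout 600
gluster volume set butcher network.inode-lru-limit 50000
</snip>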




*Vol Status*:

[root@gqas014 /]# gluster v status
Status of volume: butcher
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks1/A1                                   49152     0          Y       19950
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks1/A1                                   49152     0          Y       19842
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks2/A1                                   49153     0          Y       19973
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks2/A1                                   49153     0          Y       19865
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks3/A1                                   49154     0          Y       19996
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks3/A1                                   49154     0          Y       19888
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks4/A1                                   49155     0          Y       20019
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks4/A1                                   49155     0          Y       19911
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks5/A1                                   49156     0          Y       20042
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks5/A1                                   49156     0          Y       19934
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks6/A1                                   49157     0          Y       20065
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks6/A1                                   49157     0          Y       19957
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks7/A1                                   49158     0          Y       20088
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks7/A1                                   49158     0          Y       19980
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks8/A1                                   49159     0          Y       20111
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks8/A1                                   49159     0          Y       20003
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks9/A1                                   49160     0          Y       20134
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks9/A1                                   49160     0          Y       20026
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks10/A1                                  49161     0          Y       20157
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks10/A1                                  49161     0          Y       20049
 
Task Status of Volume butcher
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@gqas014 /]#

Comment 2 Ambarish 2017-06-12 09:54:11 UTC
I see this in the mount logs:


<snip>

[2017-06-12 08:04:35.948610] W [rpc-clnt.c:1694:rpc_clnt_submit] 0-butcher-client-6: failed to submit rpc-request (XID: 0xfedb7 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (butcher-client-6)
[2017-06-12 08:04:35.950750] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-butcher-client-6: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-06-12 08:04:35.951130] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 6491583: WRITE => -1 gfid=f6e34aa2-4050-4cd4-896e-712660b2ef42 fd=0x7fd7e92e3630 (Transport endpoint is not connected)
[2017-06-12 08:04:36.026522] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 6491584: FLUSH() ERR => -1 (Transport endpoint is not connected)
[2017-06-12 08:04:36.026944] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-butcher-client-6: Connected to butcher-client-6, attached to remote volume '/bricks4/A1'.


</snip>


And these errors (from rpc?) appear in the brick logs at the same time:

<snip>
[2017-06-12 08:04:35.950650] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.950658] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.950697] E [server-helpers.c:395:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7efdf6f418c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x29c84) [0x7efde26ecc84] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe084) [0x7efde26d1084] ) 0-server: invalid argument: client [Invalid argument]
[2017-06-12 08:04:35.950719] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.950878] E [server-helpers.c:395:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7efdf6f418c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x29c84) [0x7efde26ecc84] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe084) [0x7efde26d1084] ) 0-server: invalid argument: client [Invalid argument]
[2017-06-12 08:04:35.950909] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.951028] E [server-helpers.c:395:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7efdf6f418c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x29c84) [0x7efde26ecc84] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe084) [0x7efde26d1084] ) 0-server: invalid argument: client [Invalid argument]
[2017-06-12 08:04:35.951054] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.951185] E [server-helpers.c:395:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7efdf6f418c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x29c84) [0x7efde26ecc84] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe084) [0x7efde26d1084] ) 0-server: invalid argument: client [Invalid argument]
[2017-06-12 08:04:35.951233] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.951354] E [server-helpers.c:395:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7efdf6f418c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x29c84) [0x7efde26ecc84] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe084) [0x7efde26d1084] ) 0-server: invalid argument: client [Invalid argument]
[2017-06-12 08:04:35.951382] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.951569] I [addr.c:182:gf_auth] 0-/bricks4/A1: allowed = "*", received addr = "192.168.97.182"
[2017-06-12 08:04:35.951611] I [MSGID: 115029] [server-handshake.c:712:server_setvolume] 0-butcher-server: accepted client from gqac016.sbu.lab.eng.bos.redhat.com-6077-2017/06/12-07:16:31:520108-butcher-client-6-0-1 (version: 3.8.4)
[2017-06-12 08:04:35.989077] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.989320] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:36.014154] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:36.014455] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:36.014605] W [socket.c:595:__socket_rwv] 0-tcp.butcher-server: writev on 192.168.97.182:1016 failed (Broken pipe)
[2017-06-12 08:04:36.015270] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:36.015701] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-06-12 08:04:36.015772] I [MSGID: 115036] [server.c:561:server_rpc_notify] 0-butcher-server: disconnecting connection from gqac016.sbu.lab.eng.bos.redhat.com-6077-2017/06/12-07:16:31:520108-butcher-client-6-0-0
[2017-06-12 08:04:36.015807] I [socket.c:3662:socket_submit_reply] 0-tcp.butcher-server: not connected (priv->connected = -1)
[2017-06-12 08:04:36.015836] E [rpcsvc.c:1333:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xfedb0, Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (tcp.butcher-server)

</snip>

Comment 4 Nithya Balachandran 2017-06-12 15:46:09 UTC
There is nothing to indicate that this is a dht issue. From the logs, it appears to be an rpc/transport issue. Moving this to the rpc component for further analysis.

Comment 9 Atin Mukherjee 2017-06-19 05:29:05 UTC
Setting the needinfo back on Ambarish based on comment 8.

Comment 12 Raghavendra G 2017-09-04 06:16:46 UTC
A patch merged upstream [1] is expected to reduce ping-timeout expiries under high load on bricks. We are planning to get this patch backported to rhgs-3.4.0.

With [1] the client disconnects will be fixed, but it is not clear how a hung XFS will affect I/O. @Milind, can you work with Ambarish to move this bug to the relevant component once #17105 is taken in?

[1] https://review.gluster.org/17105
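
For reference, the ping timeout that governs these client-side disconnects is a per-volume tunable; a minimal sketch of how it can be inspected or raised as a stop-gap (the default is 42 seconds; the value below is only illustrative, and this is separate from the actual fix in [1]):

<snip>
# show the effective value
gluster volume get butcher network.ping-timeout

# raise it while the bricks are under heavy load
gluster volume set butcher network.ping-timeout 60
</snip>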

Comment 15 Rajesh Madaka 2018-05-09 14:14:28 UTC
Followed the steps mentioned in the description above.

Created a plain distribute volume, mounted it (FUSE mount) on 4 different clients, and then started running I/O on all 4 clients with the tools below:

-> dd on one client
-> kernel untar on one client
-> smallfile creates on two clients

I did not see any dd I/O failing.
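
As an additional sanity check, the client and brick logs can be grepped for the signatures reported in this bug; a minimal sketch, assuming default log locations and the brick layout from the description:

<snip>
# client side: writes failing up to FUSE
grep -E "Transport endpoint is not connected|failed to submit rpc-request" \
     /var/log/glusterfs/gluster-mount*.log

# brick side: rpc actor / frame allocation errors
grep -E "rpcsvc_check_and_reply_error|server_alloc_frame" \
     /var/log/glusterfs/bricks/bricks*-A1.log
</snip>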

Verified the bug with the version below:

glusterfs-3.12.2-8

Comment 17 errata-xmlrpc 2018-09-04 06:32:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607