Bug 1460639 - [Stress] : IO errored out with ENOTCONN.
[Stress] : IO errored out with ENOTCONN.
Status: POST
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rpc (Show other bugs)
3.3
x86_64 Linux
unspecified Severity high
: ---
: ---
Assigned To: Milind Changire
Rahul Hinduja
rebase
: ZStream
Depends On:
Blocks: 1461092
  Show dependency treegraph
 
Reported: 2017-06-12 05:49 EDT by Ambarish
Modified: 2017-10-17 10:25 EDT (History)
10 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1461092 (view as bug list)
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Ambarish 2017-06-12 05:49:46 EDT
Description of problem:
-----------------------

Plain distribute volume,mounted on 4 clients via FUSE.

<snip>
[root@gqac016 ~]# mount|grep gqas014
gqas014.sbu.lab.eng.bos.redhat.com:/butcher on /gluster-mount2 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
</snip>


Each client has a different workload generator - Kernel untar/dd/smallfile creates.

While creating the data set,dd failed with an ENOTCONN :

<snip>

[root@gqac016 gluster-mount2]# dd if=/dev/zero of=bigFile1 bs=1G count=100 conv=fdatasync
dd: error writing ‘bigFile1’: Transport endpoint is not connected
dd: closing output file ‘bigFile1’: Transport endpoint is not connected
[root@gqac016 gluster-mount2]# 

</snip>

Nothing was manipulated from the back end.

All brick processes are up and running.

Disks have ample space as well :

[root@gqas014 /]# df -Th
Filesystem                    Type      Size  Used Avail Use% Mounted on
/dev/mapper/rhel_gqas014-root xfs        50G  2.2G   48G   5% /
devtmpfs                      devtmpfs   24G     0   24G   0% /dev
tmpfs                         tmpfs      24G   36K   24G   1% /dev/shm
tmpfs                         tmpfs      24G  121M   24G   1% /run
tmpfs                         tmpfs      24G     0   24G   0% /sys/fs/cgroup
/dev/sda1                     xfs      1014M  141M  874M  14% /boot
/dev/mapper/rhel_gqas014-home xfs       391G   33M  391G   1% /home
tmpfs                         tmpfs     4.7G     0  4.7G   0% /run/user/0
/dev/mapper/vg1-rhs_lv1       xfs       875G  6.2G  869G   1% /bricks1
/dev/mapper/vg2-rhs_lv2       xfs       875G  6.9G  868G   1% /bricks2
/dev/mapper/vg3-rhs_lv3       xfs       875G  6.9G  868G   1% /bricks3
/dev/mapper/vg4-rhs_lv4       xfs       875G  203G  673G  24% /bricks4
/dev/mapper/vg5-rhs_lv5       xfs       875G  6.2G  869G   1% /bricks5
/dev/mapper/vg6-rhs_lv6       xfs       875G  6.1G  869G   1% /bricks6
/dev/mapper/vg7-rhs_lv7       xfs       875G  7.6G  868G   1% /bricks7
/dev/mapper/vg8-rhs_lv8       xfs       875G  7.6G  867G   1% /bricks8
/dev/mapper/vg9-rhs_lv9       xfs       875G  7.6G  867G   1% /bricks9
/dev/mapper/vg10-rhs_lv10     xfs       875G  107G  768G  13% /bricks10
/dev/mapper/vg11-rhs_lv11     xfs       875G   34M  875G   1% /bricks11
/dev/mapper/vg12-rhs_lv12     xfs       875G   34M  875G   1% /bricks12
[root@gqas014 /]# 

Filesystem                    Type      Size  Used Avail Use% Mounted on
/dev/mapper/rhel_gqas015-root xfs        50G  2.2G   48G   5% /
devtmpfs                      devtmpfs   24G     0   24G   0% /dev
tmpfs                         tmpfs      24G   36K   24G   1% /dev/shm
tmpfs                         tmpfs      24G  345M   24G   2% /run
tmpfs                         tmpfs      24G     0   24G   0% /sys/fs/cgroup
/dev/sda1                     xfs      1014M  141M  874M  14% /boot
/dev/mapper/rhel_gqas015-home xfs       391G   33M  391G   1% /home
tmpfs                         tmpfs     4.7G     0  4.7G   0% /run/user/0
/dev/mapper/vg1-rhs_lv1       xfs       875G  6.2G  869G   1% /bricks1
/dev/mapper/vg2-rhs_lv2       xfs       875G  6.9G  868G   1% /bricks2
/dev/mapper/vg3-rhs_lv3       xfs       875G  6.9G  868G   1% /bricks3
/dev/mapper/vg4-rhs_lv4       xfs       875G  6.9G  868G   1% /bricks4
/dev/mapper/vg5-rhs_lv5       xfs       875G  6.9G  868G   1% /bricks5
/dev/mapper/vg6-rhs_lv6       xfs       875G  6.9G  868G   1% /bricks6
/dev/mapper/vg7-rhs_lv7       xfs       875G  6.2G  869G   1% /bricks7
/dev/mapper/vg8-rhs_lv8       xfs       875G  7.7G  867G   1% /bricks8
/dev/mapper/vg9-rhs_lv9       xfs       875G  7.6G  868G   1% /bricks9
/dev/mapper/vg10-rhs_lv10     xfs       875G  6.9G  868G   1% /bricks10
/dev/mapper/vg11-rhs_lv11     xfs       875G   34M  875G   1% /bricks11
/dev/mapper/vg12-rhs_lv12     xfs       875G   34M  875G   1% /bricks12
[root@gqas015 /]# 


Version-Release number of selected component (if applicable):
-------------------------------------------------------------

3.8.4-27

How reproducible:
------------------

Reporting the first occurrence.


Additional info:
----------------

*Vol Info * :
[root@gqas014 /]# gluster v info
 
Volume Name: butcher
Type: Distribute
Volume ID: f297fb8e-f276-4f96-8a58-a1215112d3b2
Status: Started
Snapshot Count: 0
Number of Bricks: 20
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick2: gqas015.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick3: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick4: gqas015.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick5: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick6: gqas015.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick7: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick8: gqas015.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick9: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick10: gqas015.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick11: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick12: gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick13: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick14: gqas015.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick15: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick16: gqas015.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick17: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick18: gqas015.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick19: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick20: gqas015.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000
[root@gqas014 /]# 




*Vol Status* :

[root@gqas014 /]# gluster v status
Status of volume: butcher
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks1/A1                                   49152     0          Y       19950
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks1/A1                                   49152     0          Y       19842
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks2/A1                                   49153     0          Y       19973
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks2/A1                                   49153     0          Y       19865
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks3/A1                                   49154     0          Y       19996
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks3/A1                                   49154     0          Y       19888
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks4/A1                                   49155     0          Y       20019
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks4/A1                                   49155     0          Y       19911
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks5/A1                                   49156     0          Y       20042
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks5/A1                                   49156     0          Y       19934
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks6/A1                                   49157     0          Y       20065
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks6/A1                                   49157     0          Y       19957
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks7/A1                                   49158     0          Y       20088
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks7/A1                                   49158     0          Y       19980
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks8/A1                                   49159     0          Y       20111
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks8/A1                                   49159     0          Y       20003
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks9/A1                                   49160     0          Y       20134
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks9/A1                                   49160     0          Y       20026
Brick gqas014.sbu.lab.eng.bos.redhat.com:/b
ricks10/A1                                  49161     0          Y       20157
Brick gqas015.sbu.lab.eng.bos.redhat.com:/b
ricks10/A1                                  49161     0          Y       20049
 
Task Status of Volume butcher
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@gqas014 /]#
Comment 2 Ambarish 2017-06-12 05:54:11 EDT
I see this in the mount logs :


<snip>

[2017-06-12 08:04:35.948610] W [rpc-clnt.c:1694:rpc_clnt_submit] 0-butcher-client-6: failed to submit rpc-request (XID: 0xfedb7 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (butcher-client-6)
[2017-06-12 08:04:35.950750] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-butcher-client-6: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-06-12 08:04:35.951130] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 6491583: WRITE => -1 gfid=f6e34aa2-4050-4cd4-896e-712660b2ef42 fd=0x7fd7e92e3630 (Transport endpoint is not connected)
[2017-06-12 08:04:36.026522] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 6491584: FLUSH() ERR => -1 (Transport endpoint is not connected)
[2017-06-12 08:04:36.026944] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-butcher-client-6: Connected to butcher-client-6, attached to remote volume '/bricks4/A1'.


</snip>


And these errors (from rpc?) in the brick logs at the same time :

</snip>
[2017-06-12 08:04:35.950650] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.950658] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.950697] E [server-helpers.c:395:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7efdf6f418c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x29c84) [0x7efde26ecc84] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe084) [0x7efde26d1084] ) 0-server: invalid argument: client [Invalid argument]
[2017-06-12 08:04:35.950719] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.950878] E [server-helpers.c:395:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7efdf6f418c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x29c84) [0x7efde26ecc84] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe084) [0x7efde26d1084] ) 0-server: invalid argument: client [Invalid argument]
[2017-06-12 08:04:35.950909] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.951028] E [server-helpers.c:395:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7efdf6f418c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x29c84) [0x7efde26ecc84] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe084) [0x7efde26d1084] ) 0-server: invalid argument: client [Invalid argument]
[2017-06-12 08:04:35.951054] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.951185] E [server-helpers.c:395:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7efdf6f418c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x29c84) [0x7efde26ecc84] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe084) [0x7efde26d1084] ) 0-server: invalid argument: client [Invalid argument]
[2017-06-12 08:04:35.951233] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.951354] E [server-helpers.c:395:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7efdf6f418c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x29c84) [0x7efde26ecc84] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe084) [0x7efde26d1084] ) 0-server: invalid argument: client [Invalid argument]
[2017-06-12 08:04:35.951382] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.951569] I [addr.c:182:gf_auth] 0-/bricks4/A1: allowed = "*", received addr = "192.168.97.182"
[2017-06-12 08:04:35.951611] I [MSGID: 115029] [server-handshake.c:712:server_setvolume] 0-butcher-server: accepted client from gqac016.sbu.lab.eng.bos.redhat.com-6077-2017/06/12-07:16:31:520108-butcher-client-6-0-1 (version: 3.8.4)
[2017-06-12 08:04:35.989077] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.989320] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:36.014154] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:36.014455] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:36.014605] W [socket.c:595:__socket_rwv] 0-tcp.butcher-server: writev on 192.168.97.182:1016 failed (Broken pipe)
[2017-06-12 08:04:36.015270] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:36.015701] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-06-12 08:04:36.015772] I [MSGID: 115036] [server.c:561:server_rpc_notify] 0-butcher-server: disconnecting connection from gqac016.sbu.lab.eng.bos.redhat.com-6077-2017/06/12-07:16:31:520108-butcher-client-6-0-0
[2017-06-12 08:04:36.015807] I [socket.c:3662:socket_submit_reply] 0-tcp.butcher-server: not connected (priv->connected = -1)
[2017-06-12 08:04:36.015836] E [rpcsvc.c:1333:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xfedb0, Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (tcp.butcher-server)

</snip>
Comment 4 Nithya Balachandran 2017-06-12 11:46:09 EDT
There is nothing to indicate that this is a dht issue. From the logs, it appears to be an rpc/transport issue. Moving this to the rpc component for further analysis.
Comment 9 Atin Mukherjee 2017-06-19 01:29:05 EDT
Setting back the needinfo on Ambarish based on comment 8
Comment 12 Raghavendra G 2017-09-04 02:16:46 EDT
A patch merged upstream [1] is expected to bring down ping-timeout expiries during high load on bricks. We are planning to get this patch backported to rhgs-3.4.0.

With [1] client disconnect will be fixed, but not sure how the hung xfs will affect I/O. @Milind, can you work with Ambarish to move this bug to relevant component once #17105 is taken in?

[1] https://review.gluster.org/17105

Note You need to log in before you can comment on or make changes to this bug.