Description of problem:
-----------------------

Plain distribute volume, mounted on 4 clients via FUSE.

<snip>
[root@gqac016 ~]# mount | grep gqas014
gqas014.sbu.lab.eng.bos.redhat.com:/butcher on /gluster-mount2 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
</snip>

Each client runs a different workload generator - kernel untar, dd, and smallfile creates.

While creating the data set, dd failed with ENOTCONN:

<snip>
[root@gqac016 gluster-mount2]# dd if=/dev/zero of=bigFile1 bs=1G count=100 conv=fdatasync
dd: error writing ‘bigFile1’: Transport endpoint is not connected
dd: closing output file ‘bigFile1’: Transport endpoint is not connected
[root@gqac016 gluster-mount2]#
</snip>

Nothing was manipulated on the back end. All brick processes are up and running, and the disks have ample space as well:

[root@gqas014 /]# df -Th
Filesystem                     Type      Size  Used Avail Use% Mounted on
/dev/mapper/rhel_gqas014-root  xfs        50G  2.2G   48G   5% /
devtmpfs                       devtmpfs   24G     0   24G   0% /dev
tmpfs                          tmpfs      24G   36K   24G   1% /dev/shm
tmpfs                          tmpfs      24G  121M   24G   1% /run
tmpfs                          tmpfs      24G     0   24G   0% /sys/fs/cgroup
/dev/sda1                      xfs      1014M  141M  874M  14% /boot
/dev/mapper/rhel_gqas014-home  xfs       391G   33M  391G   1% /home
tmpfs                          tmpfs     4.7G     0  4.7G   0% /run/user/0
/dev/mapper/vg1-rhs_lv1        xfs       875G  6.2G  869G   1% /bricks1
/dev/mapper/vg2-rhs_lv2        xfs       875G  6.9G  868G   1% /bricks2
/dev/mapper/vg3-rhs_lv3        xfs       875G  6.9G  868G   1% /bricks3
/dev/mapper/vg4-rhs_lv4        xfs       875G  203G  673G  24% /bricks4
/dev/mapper/vg5-rhs_lv5        xfs       875G  6.2G  869G   1% /bricks5
/dev/mapper/vg6-rhs_lv6        xfs       875G  6.1G  869G   1% /bricks6
/dev/mapper/vg7-rhs_lv7        xfs       875G  7.6G  868G   1% /bricks7
/dev/mapper/vg8-rhs_lv8        xfs       875G  7.6G  867G   1% /bricks8
/dev/mapper/vg9-rhs_lv9        xfs       875G  7.6G  867G   1% /bricks9
/dev/mapper/vg10-rhs_lv10      xfs       875G  107G  768G  13% /bricks10
/dev/mapper/vg11-rhs_lv11      xfs       875G   34M  875G   1% /bricks11
/dev/mapper/vg12-rhs_lv12      xfs       875G   34M  875G   1% /bricks12
[root@gqas014 /]#

And on gqas015:

Filesystem                     Type      Size  Used Avail Use% Mounted on
/dev/mapper/rhel_gqas015-root  xfs        50G  2.2G   48G   5% /
devtmpfs                       devtmpfs   24G     0   24G   0% /dev
tmpfs                          tmpfs      24G   36K   24G   1% /dev/shm
tmpfs                          tmpfs      24G  345M   24G   2% /run
tmpfs                          tmpfs      24G     0   24G   0% /sys/fs/cgroup
/dev/sda1                      xfs      1014M  141M  874M  14% /boot
/dev/mapper/rhel_gqas015-home  xfs       391G   33M  391G   1% /home
tmpfs                          tmpfs     4.7G     0  4.7G   0% /run/user/0
/dev/mapper/vg1-rhs_lv1        xfs       875G  6.2G  869G   1% /bricks1
/dev/mapper/vg2-rhs_lv2        xfs       875G  6.9G  868G   1% /bricks2
/dev/mapper/vg3-rhs_lv3        xfs       875G  6.9G  868G   1% /bricks3
/dev/mapper/vg4-rhs_lv4        xfs       875G  6.9G  868G   1% /bricks4
/dev/mapper/vg5-rhs_lv5        xfs       875G  6.9G  868G   1% /bricks5
/dev/mapper/vg6-rhs_lv6        xfs       875G  6.9G  868G   1% /bricks6
/dev/mapper/vg7-rhs_lv7        xfs       875G  6.2G  869G   1% /bricks7
/dev/mapper/vg8-rhs_lv8        xfs       875G  7.7G  867G   1% /bricks8
/dev/mapper/vg9-rhs_lv9        xfs       875G  7.6G  868G   1% /bricks9
/dev/mapper/vg10-rhs_lv10      xfs       875G  6.9G  868G   1% /bricks10
/dev/mapper/vg11-rhs_lv11      xfs       875G   34M  875G   1% /bricks11
/dev/mapper/vg12-rhs_lv12      xfs       875G   34M  875G   1% /bricks12
[root@gqas015 /]#

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
3.8.4-27

How reproducible:
-----------------
Reporting the first occurrence.
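For reference, dd's "Transport endpoint is not connected" message is the strerror text for errno ENOTCONN, which is what the FUSE client propagates to the application when its brick connection drops. A quick check (plain Python, not part of the original report):

```python
import errno
import os

# "Transport endpoint is not connected" is the strerror text for ENOTCONN.
# The FUSE client returns this errno to dd when the brick connection is lost.
print(errno.ENOTCONN)               # 107 on Linux
print(os.strerror(errno.ENOTCONN))  # "Transport endpoint is not connected" on Linux
```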
Additional info:
----------------

*Vol Info* :

[root@gqas014 /]# gluster v info

Volume Name: butcher
Type: Distribute
Volume ID: f297fb8e-f276-4f96-8a58-a1215112d3b2
Status: Started
Snapshot Count: 0
Number of Bricks: 20
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick2: gqas015.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick3: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick4: gqas015.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick5: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick6: gqas015.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick7: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick8: gqas015.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick9: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick10: gqas015.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick11: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick12: gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick13: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick14: gqas015.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick15: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick16: gqas015.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick17: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick18: gqas015.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick19: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick20: gqas015.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000
[root@gqas014 /]#

*Vol Status* :

[root@gqas014 /]# gluster v status
Status of volume: butcher
Gluster process                                        TCP Port  RDMA Port  Online  Pid
---------------------------------------------------------------------------------------
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/A1   49152     0          Y       19950
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks1/A1   49152     0          Y       19842
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/A1   49153     0          Y       19973
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks2/A1   49153     0          Y       19865
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/A1   49154     0          Y       19996
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks3/A1   49154     0          Y       19888
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/A1   49155     0          Y       20019
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks4/A1   49155     0          Y       19911
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/A1   49156     0          Y       20042
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks5/A1   49156     0          Y       19934
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/A1   49157     0          Y       20065
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/A1   49157     0          Y       19957
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/A1   49158     0          Y       20088
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks7/A1   49158     0          Y       19980
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/A1   49159     0          Y       20111
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks8/A1   49159     0          Y       20003
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/A1   49160     0          Y       20134
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks9/A1   49160     0          Y       20026
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/A1  49161     0          Y       20157
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks10/A1  49161     0          Y       20049

Task Status of Volume butcher
---------------------------------------------------------------------------------------
There are no active volume tasks

[root@gqas014 /]#
I see this in the mount logs:

<snip>
[2017-06-12 08:04:35.948610] W [rpc-clnt.c:1694:rpc_clnt_submit] 0-butcher-client-6: failed to submit rpc-request (XID: 0xfedb7 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (butcher-client-6)
[2017-06-12 08:04:35.950750] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-butcher-client-6: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-06-12 08:04:35.951130] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 6491583: WRITE => -1 gfid=f6e34aa2-4050-4cd4-896e-712660b2ef42 fd=0x7fd7e92e3630 (Transport endpoint is not connected)
[2017-06-12 08:04:36.026522] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 6491584: FLUSH() ERR => -1 (Transport endpoint is not connected)
[2017-06-12 08:04:36.026944] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-butcher-client-6: Connected to butcher-client-6, attached to remote volume '/bricks4/A1'.
</snip>

And these errors (from rpc?) in the brick logs at the same time:

<snip>
[2017-06-12 08:04:35.950650] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.950658] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.950697] E [server-helpers.c:395:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7efdf6f418c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x29c84) [0x7efde26ecc84] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe084) [0x7efde26d1084] ) 0-server: invalid argument: client [Invalid argument]
[2017-06-12 08:04:35.950719] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.950878] E [server-helpers.c:395:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7efdf6f418c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x29c84) [0x7efde26ecc84] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe084) [0x7efde26d1084] ) 0-server: invalid argument: client [Invalid argument]
[2017-06-12 08:04:35.950909] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.951028] E [server-helpers.c:395:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7efdf6f418c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x29c84) [0x7efde26ecc84] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe084) [0x7efde26d1084] ) 0-server: invalid argument: client [Invalid argument]
[2017-06-12 08:04:35.951054] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.951185] E [server-helpers.c:395:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7efdf6f418c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x29c84) [0x7efde26ecc84] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe084) [0x7efde26d1084] ) 0-server: invalid argument: client [Invalid argument]
[2017-06-12 08:04:35.951233] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.951354] E [server-helpers.c:395:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7efdf6f418c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x29c84) [0x7efde26ecc84] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe084) [0x7efde26d1084] ) 0-server: invalid argument: client [Invalid argument]
[2017-06-12 08:04:35.951382] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.951569] I [addr.c:182:gf_auth] 0-/bricks4/A1: allowed = "*", received addr = "192.168.97.182"
[2017-06-12 08:04:35.951611] I [MSGID: 115029] [server-handshake.c:712:server_setvolume] 0-butcher-server: accepted client from gqac016.sbu.lab.eng.bos.redhat.com-6077-2017/06/12-07:16:31:520108-butcher-client-6-0-1 (version: 3.8.4)
[2017-06-12 08:04:35.989077] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:35.989320] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:36.014154] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:36.014455] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:36.014605] W [socket.c:595:__socket_rwv] 0-tcp.butcher-server: writev on 192.168.97.182:1016 failed (Broken pipe)
[2017-06-12 08:04:36.015270] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-12 08:04:36.015701] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-06-12 08:04:36.015772] I [MSGID: 115036] [server.c:561:server_rpc_notify] 0-butcher-server: disconnecting connection from gqac016.sbu.lab.eng.bos.redhat.com-6077-2017/06/12-07:16:31:520108-butcher-client-6-0-0
[2017-06-12 08:04:36.015807] I [socket.c:3662:socket_submit_reply] 0-tcp.butcher-server: not connected (priv->connected = -1)
[2017-06-12 08:04:36.015836] E [rpcsvc.c:1333:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xfedb0, Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (tcp.butcher-server)
</snip>
There is nothing to indicate that this is a dht issue. From the logs, it appears to be an rpc/transport issue. Moving this to the rpc component for further analysis.
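The diagnosis above rests on correlating the client-side WRITE failure with the brick-side rpc errors by timestamp. That correlation can be sketched mechanically; this is a hypothetical helper (the log-line pattern is inferred from the snippets in this bug, and the sample lines are abridged copies of them), not a tool shipped with glusterfs:

```python
import re
from datetime import datetime, timedelta

# Inferred glusterfs log shape: "[<timestamp>] <level> [<file:line:func>] <message>"
LOG_RE = re.compile(r"^\[(?P<ts>[\d-]+ [\d:.]+)\]\s+(?P<lvl>[EWIC])\s+(?P<rest>.*)$")

def parse(line):
    """Return (timestamp, level, rest-of-line), or None if the line doesn't match."""
    m = LOG_RE.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
    return ts, m.group("lvl"), m.group("rest")

def correlate(client_lines, brick_lines, window=timedelta(seconds=1)):
    """Pair client ENOTCONN messages with brick E-level messages within `window`."""
    client = [p for p in map(parse, client_lines) if p and "not connected" in p[2]]
    brick = [p for p in map(parse, brick_lines) if p and p[1] == "E"]
    return [(c[2], b[2]) for c in client for b in brick if abs(c[0] - b[0]) <= window]

client_log = [
    "[2017-06-12 08:04:35.951130] W [fuse-bridge.c:2312:fuse_writev_cbk] "
    "0-glusterfs-fuse: 6491583: WRITE => -1 (Transport endpoint is not connected)",
]
brick_log = [
    "[2017-06-12 08:04:35.950650] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] "
    "0-rpcsvc: rpc actor failed to complete successfully",
]
pairs = correlate(client_log, brick_log)
print(len(pairs))  # 1
```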
Setting back the needinfo on Ambarish based on comment 8
A patch merged upstream [1] is expected to bring down ping-timeout expiries during high load on bricks. We are planning to get this patch backported to rhgs-3.4.0. With [1] the client disconnects will be fixed, but it is not clear how the hung xfs will affect I/O. @Milind, can you work with Ambarish to move this bug to the relevant component once #17105 is taken in?

[1] https://review.gluster.org/17105
Followed the steps mentioned in the description above. Created a plain distributed volume, mounted it (FUSE mount) on 4 different clients, then started running I/O on all 4 clients using the tools below:

-> dd on one client
-> kernel untar on one client
-> smallfile creates on two clients

I did not see any dd I/O failures.

Verified the bug with the following version:
glusterfs-3.12.2-8
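For a self-contained illustration of the dd leg of that workload, here is a minimal stand-in sketch (assumptions: a writable path is available; a temp directory is used here instead of a real gluster mount, and the sizes are scaled down from bs=1G count=100):

```python
import os
import tempfile

def write_big_file(path, block_size=1024 * 1024, count=8):
    """Miniature analogue of `dd if=/dev/zero of=... bs=... count=... conv=fdatasync`."""
    buf = b"\0" * block_size
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(buf)      # on a dying gluster mount this raises OSError(ENOTCONN)
        f.flush()
        os.fsync(f.fileno())  # force data to disk before close, like conv=fdatasync
    return os.path.getsize(path)

with tempfile.TemporaryDirectory() as d:
    size = write_big_file(os.path.join(d, "bigFile1"))
    print(size)  # 8388608
```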
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607