Bug 1409568
Summary: | seeing socket disconnects and transport endpoint not connected frequently on systemic setup | | |
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Nag Pavan Chilakam <nchilaka> |
Component: | rpc | Assignee: | Milind Changire <mchangir> |
Status: | CLOSED ERRATA | QA Contact: | Nag Pavan Chilakam <nchilaka> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | rhgs-3.2 | CC: | amukherj, bkunal, mchangir, moagrawa, nbalacha, rcyriac, rgowdapp, rhs-bugs, sanandpa, sheggodu, smali, storage-qa-internal, vdas |
Target Milestone: | --- | Keywords: | ZStream |
Target Release: | RHGS 3.4.z Batch Update 4 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | rpc-ping-timeout, rebase | ||
Fixed In Version: | glusterfs-3.12.2-41 | Doc Type: | Bug Fix |
Doc Text: |
Cause: Under heavy load, ping requests are not read quickly enough to be responded to within the timeout.
Consequence: When a ping request is not answered within the timeout, the transport is disconnected.
Fix:
Two changes were made:
1. Event threads no longer execute any code of the GlusterFS program. Instead, requests are queued to be consumed by GlusterFS program threads. Since event threads execute no GlusterFS program code, they are no longer slowed down.
2. Even with queuing, under sufficiently high load the default number of event threads is not enough to read all incoming requests, so increasing the number of event threads is recommended. While verifying this bug, client.event-threads and server.event-threads were set to 8; however, this value can be experimented with. We suggest trying a value of 4 first and using 8 threads only if 4 does not resolve the issue.
Result:
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2019-03-27 03:43:36 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1657798 |
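
The queuing change described in the Doc Text above can be illustrated with a small sketch. The code below is a hypothetical model, not GlusterFS source: event threads only read and enqueue requests (answering pings immediately), while separate program threads consume the queue and do the potentially slow work, so a slow request handler can no longer stall ping handling.

```python
import queue
import threading

# Illustrative model (not GlusterFS code) of the fix: event threads only
# enqueue incoming requests; worker ("program") threads execute them.
# Because event threads never run request-handling code, they stay
# responsive and can answer ping requests within the timeout.

request_queue = queue.Queue()
results = []

def event_thread(requests):
    """Reads requests off the wire and enqueues them -- no heavy work here."""
    for req in requests:
        if req == "PING":
            results.append("PONG")   # answered immediately, never queued
        else:
            request_queue.put(req)   # deferred to a program thread

def program_thread():
    """Consumes queued requests and performs the (potentially slow) work."""
    while True:
        req = request_queue.get()
        if req is None:              # shutdown sentinel
            break
        results.append("done:" + req)
        request_queue.task_done()

workers = [threading.Thread(target=program_thread) for _ in range(2)]
for w in workers:
    w.start()

event_thread(["WRITE", "PING", "READ"])
request_queue.join()                 # wait until all queued work is finished
for _ in workers:
    request_queue.put(None)          # stop the workers
for w in workers:
    w.join()
```

The second part of the fix is applied on a real volume with tuning options, e.g. `gluster volume set sysvol client.event-threads 4` and `gluster volume set sysvol server.event-threads 4` (raising to 8 only if 4 does not resolve the disconnects); `sysvol` is the volume from this report.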
Description
Nag Pavan Chilakam
2017-01-02 13:39:12 UTC
Note that 2 bricks are down, for which separate BZs have been raised:
1409472 - brick crashed on systemic setup
1397907 - seeing frequent kernel hangs when doing operations both on fuse client and gluster nodes on replica volumes

However, I am seeing disconnects even for bricks which are not down.

```
[root@dhcp35-20 ~]# gluster v status
Status of volume: sysvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.20:/rhs/brick1/sysvol        49154     0          Y       2404
Brick 10.70.37.86:/rhs/brick1/sysvol        49154     0          Y       26789
Brick 10.70.35.156:/rhs/brick1/sysvol       N/A       N/A        N       N/A
Brick 10.70.37.154:/rhs/brick1/sysvol       49154     0          Y       2192
Brick 10.70.35.20:/rhs/brick2/sysvol        49155     0          Y       2424
Brick 10.70.37.86:/rhs/brick2/sysvol        N/A       N/A        N       N/A
Brick 10.70.35.156:/rhs/brick2/sysvol       49155     0          Y       26793
Brick 10.70.37.154:/rhs/brick2/sysvol       49155     0          Y       2212
Snapshot Daemon on localhost                49156     0          Y       2449
Self-heal Daemon on localhost               N/A       N/A        Y       3131
Quota Daemon on localhost                   N/A       N/A        Y       3139
Snapshot Daemon on 10.70.37.86              49156     0          Y       26832
Self-heal Daemon on 10.70.37.86             N/A       N/A        Y       2187
Quota Daemon on 10.70.37.86                 N/A       N/A        Y       2195
Snapshot Daemon on 10.70.35.156             49156     0          Y       26816
Self-heal Daemon on 10.70.35.156            N/A       N/A        Y       1376
Quota Daemon on 10.70.35.156                N/A       N/A        Y       1384
Snapshot Daemon on 10.70.37.154             49156     0          Y       2235
Self-heal Daemon on 10.70.37.154            N/A       N/A        Y       9321
Quota Daemon on 10.70.37.154                N/A       N/A        Y       9329

Task Status of Volume sysvol
------------------------------------------------------------------------------
There are no active volume tasks
```

```
[root@dhcp35-20 ~]# gluster v info
Volume Name: sysvol
Type: Distributed-Replicate
Volume ID: 4efd4f77-85c7-4eb9-b958-6769d31d84c8
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: 10.70.35.20:/rhs/brick1/sysvol
Brick2: 10.70.37.86:/rhs/brick1/sysvol
Brick3: 10.70.35.156:/rhs/brick1/sysvol
Brick4: 10.70.37.154:/rhs/brick1/sysvol
Brick5: 10.70.35.20:/rhs/brick2/sysvol
Brick6: 10.70.37.86:/rhs/brick2/sysvol
Brick7: 10.70.35.156:/rhs/brick2/sysvol
Brick8: 10.70.37.154:/rhs/brick2/sysvol
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.uss: enable
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
```

There were also hung-task traces in dmesg:

```
[85321.290723] INFO: task tee:26646 blocked for more than 120 seconds.
[85321.290762] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[85321.290796] tee             D ffff8804726fa980     0 26646   6196 0x00000080
[85321.290800]  ffff88043274fe18 0000000000000082 ffff88020f3a2280 ffff88043274ffd8
[85321.290805]  ffff88043274ffd8 ffff88043274ffd8 ffff88020f3a2280 ffff880472993200
[85321.290808]  ffff88027360b800 ffff88043274fe48 ffff8804729932e0 ffff8804726fa980
[85321.290811] Call Trace:
[85321.290822]  [<ffffffff8163a909>] schedule+0x29/0x70
[85321.290828]  [<ffffffffa086c53d>] __fuse_request_send+0x13d/0x2c0 [fuse]
[85321.290832]  [<ffffffffa086fa80>] ? fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse]
[85321.290838]  [<ffffffff81197088>] ? handle_mm_fault+0x5b8/0xf50
[85321.290842]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[85321.290845]  [<ffffffffa086c6d2>] fuse_request_send+0x12/0x20 [fuse]
[85321.290849]  [<ffffffffa087577f>] fuse_flush+0xff/0x150 [fuse]
[85321.290855]  [<ffffffff811dc254>] filp_close+0x34/0x80
[85321.290859]  [<ffffffff811fcb98>] __close_fd+0x78/0xa0
[85321.290861]  [<ffffffff811dd963>] SyS_close+0x23/0x50
[85321.290866]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
```
(The same fuse_flush hung-task call trace repeats for tasks tee:26536, tee:29783, tee:8444, tee:17200, tee:29470, and tee:15511 between timestamps 104761 and 156841.)

```
[271499.233421] SELinux: initialized (dev binfmt_misc, type binfmt_misc), uses genfs_contexts
[271503.126098] nr_pdflush_threads exported in /proc is scheduled for removal
[271506.509137] warning: `turbostat' uses 32-bit capabilities (legacy support in use)
```

Client sosreports are available at:
scp -r /var/tmp/$HOSTNAME qe@rhsqe-repo:/var/www/html/sosreports/nchilaka/3.2_logs/systemic_testing_logs/regression_cycle/same_dir_create_clients/

Hi Nag,

I have checked the brick logs on the systemic servers and the mount logs on the clients. Below is my analysis of the disconnect messages seen on the servers.

There are 4 servers in total. Two of them show disconnect messages (for both clients and internal daemons), while the other two show no disconnect messages for clients (only for internal daemons such as quotad and glustershd).

Servers not showing any disconnect messages for clients:

```
[root@dhcp35-20 bricks]# grep -i disconnect * | grep rhs-client | grep -v "2017-04-13"
[root@dhcp35-156 log]# grep -i disconnect * | grep rhs-client | grep -v "2017-04-13"
```

Servers showing disconnect messages for clients:

```
[root@dhcp37-86 bricks]#
[root@dhcp37-154 log]#
```

On the servers showing disconnects, I observed that, compared to the servers without them, there are many watchdog messages thrown by the kernel; the minimum stuck time is 22 s and the maximum is more than 120 s.
In my opinion, if a client tries to communicate with the server while a task is stuck, it can get a ping-timer-expired message. I checked the mount logs as well: at almost the same time, the client also shows timer-expired messages. So I believe that with the latest package (glusterfs-3.8.4-22.el7rhgs.x86_64) we are not able to reproduce this issue.

Below are some of the messages thrown by the kernel:

```
Apr 16 19:22:29 dhcp37-154 kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 44s! [glusterfsd:17965]
Apr 18 06:02:21 dhcp37-86 kernel: NMI watchdog: BUG: soft lockup - CPU#2 stuck for 160s! [glusterfsd:8060]
Apr 18 06:02:21 dhcp37-86 kernel: NMI watchdog: BUG: soft lockup - CPU#3 stuck for 149s! [glusterfsd:21010]
```

As per your comment, you had also seen call traces for tasks stuck for more than 2 minutes on the client side, but this time there is no call trace in dmesg for any task hung for more than 2 minutes.

Thanks,
Mohit Agrawal

Upstream patch: https://review.gluster.org/#/c/17105/

Can we have a probable release candidate target assigned for this bug?

Based on the above comments (see comment #23), QE is moving this bug to VERIFIED. Test version: 3.12.2-43.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0658