Description of problem:
=======================
On my systemic setup, I am creating the same directory path simultaneously from 3 different clients. Each client used a different server IP to mount the volume over the FUSE protocol. In addition, each client was dumping sosreports into the volume mount every 5 minutes in a screen session, and top output was being appended to a file every minute. The dir-creations were run by different users, e.g.:

client1 (el 7.2) ran the dir-creation as pavan@rhs-client23
client2 (el 6.7) as root@rhs-client24
client3 (el 7.3) as cli21@rhs-client21

Disconnects are seen for different bricks, recurring regularly, as below:

[2017-01-02 13:06:36.538652] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-5: changing port to 49155 (from 0)
[2017-01-02 13:06:36.541830] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-2: changing port to 49154 (from 0)
[2017-01-02 13:06:36.543837] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-5: connection to 10.70.37.86:49155 failed (Connection refused)
[2017-01-02 13:06:36.546699] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-2: connection to 10.70.35.156:49154 failed (Connection refused)
[2017-01-02 13:06:40.545285] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-5: changing port to 49155 (from 0)
[2017-01-02 13:06:40.548294] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-2: changing port to 49154 (from 0)
[2017-01-02 13:06:40.550357] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-5: connection to 10.70.37.86:49155 failed (Connection refused)
[2017-01-02 13:06:40.553184] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-2: connection to 10.70.35.156:49154 failed (Connection refused)
[2017-01-02 13:06:44.551720] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-5: changing port to 49155 (from 0)
[2017-01-02 13:06:44.554735] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-2: changing port to 49154 (from 0)
[2017-01-02 13:06:44.556822] E
[socket.c:2309:socket_connect_finish] 0-sysvol-client-5: connection to 10.70.37.86:49155 failed (Connection refused)
[2017-01-02 13:06:44.559669] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-2: connection to 10.70.35.156:49154 failed (Connection refused)
[2017-01-02 13:06:48.558017] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-5: changing port to 49155 (from 0)
[2017-01-02 13:06:48.560978] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-2: changing port to 49154 (from 0)
[2017-01-02 13:06:48.563198] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-5: connection to 10.70.37.86:49155 failed (Connection refused)
[2017-01-02 13:06:48.565985] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-2: connection to 10.70.35.156:49154 failed (Connection refused)
[2017-01-02 13:06:52.564889] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-5: changing port to 49155 (from 0)
[2017-01-02 13:06:52.567533] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-2: changing port to 49154 (from 0)
[2017-01-02 13:06:52.569617] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-5: connection to 10.70.37.86:49155 failed (Connection refused)
[2017-01-02 13:06:52.572433] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-2: connection to 10.70.35.156:49154 failed (Connection refused)
[2017-01-02 13:06:56.570672] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-5: changing port to 49155 (from 0)
[2017-01-02 13:06:56.573884] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-2: changing port to 49154 (from 0)
[2017-01-02 13:06:56.575952] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-5: connection to 10.70.37.86:49155 failed (Connection refused)
[2017-01-02 13:06:56.578806] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-2: connection to 10.70.35.156:49154 failed (Connection refused)
[2017-01-02 13:07:00.577078] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-5: changing port to 49155 (from 0)
[2017-01-02
13:07:00.580224] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-2: changing port to 49154 (from 0)
[2017-01-02 13:07:00.582281] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-5: connection to 10.70.37.86:49155 failed (Connection refused)
[2017-01-02 13:07:00.585097] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-2: connection to 10.70.35.156:49154 failed (Connection refused)
[2017-01-02 13:07:04.583348] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-5: changing port to 49155 (from 0)
[2017-01-02 13:07:04.586460] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-2: changing port to 49154 (from 0)
[2017-01-02 13:07:04.588591] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-5: connection to 10.70.37.86:49155 failed (Connection refused)
[2017-01-02 13:07:04.591429] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-2: connection to 10.70.35.156:49154 failed (Connection refused)
[2017-01-02 13:07:08.589622] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-5: changing port to 49155 (from 0)
[2017-01-02 13:07:08.592680] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-2: changing port to 49154 (from 0)
[2017-01-02 13:07:08.594859] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-5: connection to 10.70.37.86:49155 failed (Connection refused)
[2017-01-02 13:07:08.597765] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-2: connection to 10.70.35.156:49154 failed (Connection refused)
[2017-01-02 13:07:12.596572] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-5: changing port to 49155 (from 0)
[2017-01-02 13:07:12.599081] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-2: changing port to 49154 (from 0)
[2017-01-02 13:07:12.601356] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-5: connection to 10.70.37.86:49155 failed (Connection refused)
[2017-01-02 13:07:12.604121] E [socket.c:2309:socket_connect_finish] 0-sysvol-client-2: connection to 10.70.35.156:49154 failed (Connection
refused)
[2017-01-02 13:07:15.598699] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-sysvol-client-3: server 10.70.37.154:49154 has not responded in the last 42 seconds, disconnecting.
[2017-01-02 13:07:15.599825] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fe4ca33a6a2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fe4ca10076e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fe4ca10087e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7fe4ca101fe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7fe4ca102874] ))))) 0-sysvol-client-3: forced unwinding frame type(GlusterFS 3.3) op(ENTRYLK(31)) called at 2017-01-02 13:06:24.394351 (xid=0x492eb3)
[2017-01-02 13:07:15.599855] E [MSGID: 114031] [client-rpc-fops.c:1654:client3_3_entrylk_cbk] 0-sysvol-client-3: remote operation failed [Transport endpoint is not connected]
[2017-01-02 13:07:15.602789] I [socket.c:3439:socket_submit_request] 0-sysvol-client-3: not connected (priv->connected = 0)
[2017-01-02 13:07:15.602807] W [rpc-clnt.c:1657:rpc_clnt_submit] 0-sysvol-client-3: failed to submit rpc-request (XID: 0x492eb7 Program: GlusterFS 3.3, ProgVers: 330, Proc: 31) to rpc-transport (sysvol-client-3)
[2017-01-02 13:07:15.602817] E [MSGID: 114031] [client-rpc-fops.c:1654:client3_3_entrylk_cbk] 0-sysvol-client-3: remote operation failed [Transport endpoint is not connected]
[2017-01-02 13:07:15.602827] W [MSGID: 108019] [afr-lk-common.c:1064:afr_log_entry_locks_failure] 0-sysvol-replicate-1: Unable to obtain sufficient blocking entry locks on at least one child while attempting MKDIR on {pgfid:565565d0-7a3c-47f8-af72-ca09d7da8f55, name:level5.34}.
[2017-01-02 13:07:15.602846] I [MSGID: 108019] [afr-transaction.c:1870:afr_post_blocking_entrylk_cbk] 0-sysvol-replicate-1: Blocking entrylks failed.
[2017-01-02 13:07:15.603218] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fe4ca33a6a2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fe4ca10076e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fe4ca10087e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7fe4ca101fe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7fe4ca102874] ))))) 0-sysvol-client-3: forced unwinding frame type(GlusterFS 3.3) op(STATFS(14)) called at 2017-01-02 13:06:24.391429 (xid=0x492eb2)
[2017-01-02 13:07:15.603236] W [MSGID: 114031] [client-rpc-fops.c:799:client3_3_statfs_cbk] 0-sysvol-client-3: remote operation failed [Transport endpoint is not connected]
[2017-01-02 13:07:15.603254] W [MSGID: 109075] [dht-diskusage.c:44:dht_du_info_cbk] 0-sysvol-dht: failed to get disk info from sysvol-replicate-1 [Transport endpoint is not connected]
[2017-01-02 13:07:15.603417] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fe4ca33a6a2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fe4ca10076e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fe4ca10087e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7fe4ca101fe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7fe4ca102874] ))))) 0-sysvol-client-3: forced unwinding frame type(GlusterFS 3.3) op(IPC(47)) called at 2017-01-02 13:06:28.788739 (xid=0x492eb4)
[2017-01-02 13:07:15.603433] W [MSGID: 114031] [client-rpc-fops.c:2159:client3_3_ipc_cbk] 0-sysvol-client-3: remote operation failed [Transport endpoint is not connected]
[2017-01-02 13:07:15.603646] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fe4ca33a6a2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fe4ca10076e] (-->
/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fe4ca10087e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7fe4ca101fe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7fe4ca102874] ))))) 0-sysvol-client-3: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2017-01-02 13:06:33.534599 (xid=0x492eb5)
[2017-01-02 13:07:15.603662] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-sysvol-client-3: socket disconnected
[2017-01-02 13:07:15.603843] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fe4ca33a6a2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fe4ca10076e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fe4ca10087e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7fe4ca101fe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7fe4ca102874] ))))) 0-sysvol-client-3: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2017-01-02 13:06:42.392304 (xid=0x492eb6)
[2017-01-02 13:07:15.603876] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-sysvol-client-3: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2017-01-02 13:07:15.603901] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-sysvol-dht: Found anomalies in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
[2017-01-02 13:07:15.603914] W [MSGID: 109005] [dht-selfheal.c:2106:dht_selfheal_directory] 0-sysvol-dht: Directory selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
[2017-01-02 13:07:15.604023] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-sysvol-client-3: disconnected from sysvol-client-3. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-02 13:07:15.604037] E [MSGID: 108006] [afr-common.c:4679:afr_notify] 0-sysvol-replicate-1: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-01-02 13:07:15.604271] I [MSGID: 108006] [afr-common.c:4823:afr_local_init] 0-sysvol-replicate-1: no subvolumes up
[2017-01-02 13:07:15.605886] W [MSGID: 109005] [dht-selfheal.c:2106:dht_selfheal_directory] 0-sysvol-dht: Directory selfheal failed: 1 subvolumes down.Not fixing. path = /logs, gfid = 8c561b98-75f9-47ca-afa4-cc93fb0b06e4
[2017-01-02 13:07:15.606534] E [MSGID: 108008] [afr-transaction.c:2602:afr_write_txn_refresh_done] 0-sysvol-replicate-2: Failing SETXATTR on gfid 097182ec-1b49-4b38-b11c-09f146297a27: split-brain observed.
The message "I [MSGID: 108006] [afr-common.c:4823:afr_local_init] 0-sysvol-replicate-1: no subvolumes up" repeated 3 times between [2017-01-02 13:07:15.604271] and [2017-01-02 13:07:15.606907]
[2017-01-02 13:07:15.607766] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-sysvol-dht: Found anomalies in /logs/rhs-client23.lab.eng.blr.redhat.com (gfid = 35574ecd-3993-430d-b439-6fa57c61ab3e). Holes=1 overlaps=0
[2017-01-02 13:07:15.607820] W [MSGID: 109005] [dht-selfheal.c:2106:dht_selfheal_directory] 0-sysvol-dht: Directory selfheal failed: 1 subvolumes down.Not fixing. path = /logs/rhs-client23.lab.eng.blr.redhat.com, gfid = 35574ecd-3993-430d-b439-6fa57c61ab3e
[2017-01-02 13:07:15.609158] I [MSGID: 108006] [afr-common.c:4823:afr_local_init] 0-sysvol-replicate-1: no subvolumes up
[2017-01-02 13:07:15.620555] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-sysvol-client-3: changing port to 49154 (from 0)
[2017-01-02 13:07:15.640231] I [MSGID: 108006] [afr-common.c:4823:afr_local_init] 0-sysvol-replicate-1: no subvolumes up
[2017-01-02 13:07:15.641005] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-sysvol-dht: Found anomalies in /data/same-dir-creation/level1.1 (gfid = fa970c7a-8b9b-462f-bddd-0663c15120a2). Holes=1 overlaps=0
[2017-01-02 13:07:15.641044] W [MSGID: 109005] [dht-selfheal.c:2106:dht_selfheal_directory] 0-sysvol-dht: Directory selfheal failed: 1 subvolumes down.Not fixing.
path = /data/same-dir-creation/level1.1, gfid = fa970c7a-8b9b-462f-bddd-0663c15120a2
[2017-01-02 13:07:15.655349] I [MSGID: 108006] [afr-common.c:4823:afr_local_init] 0-sysvol-replicate-1: no subvolumes up
[2017-01-02 13:07:15.703491] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-sysvol-dht: Found anomalies in /data/same-dir-creation/level1.1/level2.1 (gfid = 1a73aec2-faf1-4646-8e23-c6008c97db26). Holes=1 overlaps=0
[2017-01-02 13:07:15.703520] W [MSGID: 109005] [dht-selfheal.c:2106:dht_selfheal_directory] 0-sysvol-dht: Directory selfheal failed: 1 subvolumes down.Not fixing. path = /data/same-dir-creation/level1.1/level2.1, gfid = 1a73aec2-faf1-4646-8e23-c6008c97db26
[2017-01-02 13:07:15.721229] W [MSGID: 109011] [dht-layout.c:186:dht_layout_search] 0-sysvol-dht: no subvolume for hash (value) = 914953072
[2017-01-02 13:07:15.721363] I [MSGID: 108006] [afr-common.c:4823:afr_local_init] 0-sysvol-replicate-1: no subvolumes up
[2017-01-02 13:07:15.723552] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-sysvol-dht: Found anomalies in /data/same-dir-creation/level1.1/level2.1/level3.74 (gfid = ebfe6c0c-9b7c-4dde-9dbd-58666c5ccdb0). Holes=1 overlaps=0
[2017-01-02 13:07:15.723580] W [MSGID: 109005] [dht-selfheal.c:2106:dht_selfheal_directory] 0-sysvol-dht: Directory selfheal failed: 1 subvolumes down.Not fixing. path = /data/same-dir-creation/level1.1/level2.1/level3.74, gfid = ebfe6c0c-9b7c-4dde-9dbd-58666c5ccdb0
[2017-01-02 13:07:15.757517] I [MSGID: 108006] [afr-common.c:4823:afr_local_init] 0-sysvol-replicate-1: no subvolumes up

sosreports are available at:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/3.2_logs/systemic_testing_logs/regression_cycle/same_dir_create_clients/rhs-client23.lab.eng.blr.redhat.com/

Test execution details are available at:
https://docs.google.com/spreadsheets/d/1iP5Mi1TewBFVh8HTmlcBm9072Bgsbgkr3CLcGmawDys/edit#gid=632186609

Version-Release number of selected component (if applicable):
============
3.8.4-10
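For reference, the same-path dir-creation workload described above can be sketched roughly as below. This is a hedged sketch, not the actual test script: the tree depth, width, and the `MOUNT` default are assumptions (the real run went at least as deep as level5.34). Each of the 3 clients would run the equivalent of this loop concurrently against its own FUSE mount of sysvol.

```shell
#!/bin/sh
# Sketch of the concurrent same-path directory-creation workload.
# MOUNT defaults to a scratch directory so the sketch runs anywhere;
# point it at the fuse mount of the volume on a real run.
MOUNT=${MOUNT:-$(mktemp -d)}
BASE="$MOUNT/data/same-dir-creation"
for l1 in 1 2; do
  for l2 in 1 2; do
    for l3 in 1 2 3 4 5; do
      # mkdir -p tolerates the directory already existing, which is what
      # happens when another client wins the race for the same path.
      mkdir -p "$BASE/level1.$l1/level2.$l2/level3.$l3"
    done
  done
done
```

Because every client issues mkdir on identical paths, each MKDIR forces blocking entry locks on the parent directory across the replica, which is exactly the code path seen failing in the afr-lk-common.c messages above.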
Note that 2 bricks are down, for which separate BZs have been raised:
1409472 - brick crashed on systemic setup
1397907 - seeing frequent kernel hangs when doing operations both on fuse client and gluster nodes on replica volumes

But I am seeing disconnects even for bricks which are not down:

[root@dhcp35-20 ~]# gluster v status
Status of volume: sysvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.20:/rhs/brick1/sysvol        49154     0          Y       2404
Brick 10.70.37.86:/rhs/brick1/sysvol        49154     0          Y       26789
Brick 10.70.35.156:/rhs/brick1/sysvol       N/A       N/A        N       N/A
Brick 10.70.37.154:/rhs/brick1/sysvol       49154     0          Y       2192
Brick 10.70.35.20:/rhs/brick2/sysvol        49155     0          Y       2424
Brick 10.70.37.86:/rhs/brick2/sysvol        N/A       N/A        N       N/A
Brick 10.70.35.156:/rhs/brick2/sysvol       49155     0          Y       26793
Brick 10.70.37.154:/rhs/brick2/sysvol       49155     0          Y       2212
Snapshot Daemon on localhost                49156     0          Y       2449
Self-heal Daemon on localhost               N/A       N/A        Y       3131
Quota Daemon on localhost                   N/A       N/A        Y       3139
Snapshot Daemon on 10.70.37.86              49156     0          Y       26832
Self-heal Daemon on 10.70.37.86             N/A       N/A        Y       2187
Quota Daemon on 10.70.37.86                 N/A       N/A        Y       2195
Snapshot Daemon on 10.70.35.156             49156     0          Y       26816
Self-heal Daemon on 10.70.35.156            N/A       N/A        Y       1376
Quota Daemon on 10.70.35.156                N/A       N/A        Y       1384
Snapshot Daemon on 10.70.37.154             49156     0          Y       2235
Self-heal Daemon on 10.70.37.154            N/A       N/A        Y       9321
Quota Daemon on 10.70.37.154                N/A       N/A        Y       9329

Task Status of Volume sysvol
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp35-20 ~]# gluster v info
Volume Name: sysvol
Type: Distributed-Replicate
Volume ID: 4efd4f77-85c7-4eb9-b958-6769d31d84c8
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: 10.70.35.20:/rhs/brick1/sysvol
Brick2: 10.70.37.86:/rhs/brick1/sysvol
Brick3: 10.70.35.156:/rhs/brick1/sysvol
Brick4: 10.70.37.154:/rhs/brick1/sysvol
Brick5:
10.70.35.20:/rhs/brick2/sysvol
Brick6: 10.70.37.86:/rhs/brick2/sysvol
Brick7: 10.70.35.156:/rhs/brick2/sysvol
Brick8: 10.70.37.154:/rhs/brick2/sysvol
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.uss: enable
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
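To pull the offline bricks out of status output like the above, a small awk sketch works (an assumption here: the default `gluster v status` column layout, where the Online column is second from last on Brick rows):

```shell
# offline_bricks: print bricks whose Online column is "N" from
# `gluster v status`-style text on stdin, e.g.:
#   gluster v status sysvol | offline_bricks
offline_bricks() {
  awk '/^Brick / && $(NF-1) == "N" { print $2 }'
}
```

Run against the status above, this would list 10.70.35.156:/rhs/brick1/sysvol and 10.70.37.86:/rhs/brick2/sysvol, the two bricks covered by the separate BZs, making it easy to confirm that the remaining disconnecting bricks are in fact online.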
There were also hung-task traces in dmesg:

[85321.290723] INFO: task tee:26646 blocked for more than 120 seconds.
[85321.290762] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[85321.290796] tee D ffff8804726fa980 0 26646 6196 0x00000080
[85321.290800] ffff88043274fe18 0000000000000082 ffff88020f3a2280 ffff88043274ffd8
[85321.290805] ffff88043274ffd8 ffff88043274ffd8 ffff88020f3a2280 ffff880472993200
[85321.290808] ffff88027360b800 ffff88043274fe48 ffff8804729932e0 ffff8804726fa980
[85321.290811] Call Trace:
[85321.290822] [<ffffffff8163a909>] schedule+0x29/0x70
[85321.290828] [<ffffffffa086c53d>] __fuse_request_send+0x13d/0x2c0 [fuse]
[85321.290832] [<ffffffffa086fa80>] ? fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse]
[85321.290838] [<ffffffff81197088>] ? handle_mm_fault+0x5b8/0xf50
[85321.290842] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[85321.290845] [<ffffffffa086c6d2>] fuse_request_send+0x12/0x20 [fuse]
[85321.290849] [<ffffffffa087577f>] fuse_flush+0xff/0x150 [fuse]
[85321.290855] [<ffffffff811dc254>] filp_close+0x34/0x80
[85321.290859] [<ffffffff811fcb98>] __close_fd+0x78/0xa0
[85321.290861] [<ffffffff811dd963>] SyS_close+0x23/0x50
[85321.290866] [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[104761.281512] INFO: task tee:26536 blocked for more than 120 seconds.
[104761.281545] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[104761.281579] tee D ffff880036502200 0 26536 6196 0x00000080
[104761.281584] ffff8800622fbe18 0000000000000082 ffff8801f8fa1700 ffff8800622fbfd8
[104761.281588] ffff8800622fbfd8 ffff8800622fbfd8 ffff8801f8fa1700 ffff880078ef1db0
[104761.281591] ffff88027360b800 ffff8800622fbe48 ffff880078ef1e90 ffff880036502200
[104761.281595] Call Trace:
[104761.281605] [<ffffffff8163a909>] schedule+0x29/0x70
[104761.281611] [<ffffffffa086c53d>] __fuse_request_send+0x13d/0x2c0 [fuse]
[104761.281615] [<ffffffffa086fa80>] ?
fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse]
[104761.281620] [<ffffffff81197088>] ? handle_mm_fault+0x5b8/0xf50
[104761.281625] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[104761.281628] [<ffffffffa086c6d2>] fuse_request_send+0x12/0x20 [fuse]
[104761.281632] [<ffffffffa087577f>] fuse_flush+0xff/0x150 [fuse]
[104761.281637] [<ffffffff811dc254>] filp_close+0x34/0x80
[104761.281641] [<ffffffff811fcb98>] __close_fd+0x78/0xa0
[104761.281644] [<ffffffff811dd963>] SyS_close+0x23/0x50
[104761.281648] [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[121321.273512] INFO: task tee:29783 blocked for more than 120 seconds.
[121321.273546] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[121321.273579] tee D ffff8804726fae00 0 29783 6196 0x00000080
[121321.273584] ffff88042f817e18 0000000000000086 ffff88043e0eb980 ffff88042f817fd8
[121321.273589] ffff88042f817fd8 ffff88042f817fd8 ffff88043e0eb980 ffff880472991450
[121321.273592] ffff88027360b800 ffff88042f817e48 ffff880472991530 ffff8804726fae00
[121321.273595] Call Trace:
[121321.273606] [<ffffffff8163a909>] schedule+0x29/0x70
[121321.273612] [<ffffffffa086c53d>] __fuse_request_send+0x13d/0x2c0 [fuse]
[121321.273615] [<ffffffffa086fa80>] ? fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse]
[121321.273621] [<ffffffff81197088>] ? handle_mm_fault+0x5b8/0xf50
[121321.273626] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[121321.273629] [<ffffffffa086c6d2>] fuse_request_send+0x12/0x20 [fuse]
[121321.273633] [<ffffffffa087577f>] fuse_flush+0xff/0x150 [fuse]
[121321.273639] [<ffffffff811dc254>] filp_close+0x34/0x80
[121321.273643] [<ffffffff811fcb98>] __close_fd+0x78/0xa0
[121321.273645] [<ffffffff811dd963>] SyS_close+0x23/0x50
[121321.273650] [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[122041.273141] INFO: task tee:8444 blocked for more than 120 seconds.
[122041.273173] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[122041.273208] tee D ffff880273016f00 0 8444 6196 0x00000080
[122041.273212] ffff880272a6be18 0000000000000086 ffff88006aa40000 ffff880272a6bfd8
[122041.273217] ffff880272a6bfd8 ffff880272a6bfd8 ffff88006aa40000 ffff880078ef20d0
[122041.273220] ffff88027360b800 ffff880272a6be48 ffff880078ef21b0 ffff880273016f00
[122041.273223] Call Trace:
[122041.273233] [<ffffffff8163a909>] schedule+0x29/0x70
[122041.273239] [<ffffffffa086c53d>] __fuse_request_send+0x13d/0x2c0 [fuse]
[122041.273243] [<ffffffffa086fa80>] ? fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse]
[122041.273249] [<ffffffff81197088>] ? handle_mm_fault+0x5b8/0xf50
[122041.273253] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[122041.273256] [<ffffffffa086c6d2>] fuse_request_send+0x12/0x20 [fuse]
[122041.273260] [<ffffffffa087577f>] fuse_flush+0xff/0x150 [fuse]
[122041.273266] [<ffffffff811dc254>] filp_close+0x34/0x80
[122041.273270] [<ffffffff811fcb98>] __close_fd+0x78/0xa0
[122041.273273] [<ffffffff811dd963>] SyS_close+0x23/0x50
[122041.273278] [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[133921.267488] INFO: task tee:17200 blocked for more than 120 seconds.
[133921.267520] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[133921.267554] tee D ffff8801f8c2eb00 0 17200 6196 0x00000080
[133921.267558] ffff8801edf8be18 0000000000000082 ffff8801edc70000 ffff8801edf8bfd8
[133921.267563] ffff8801edf8bfd8 ffff8801edf8bfd8 ffff8801edc70000 ffff880078ef1a90
[133921.267566] ffff88027360b800 ffff8801edf8be48 ffff880078ef1b70 ffff8801f8c2eb00
[133921.267569] Call Trace:
[133921.267580] [<ffffffff8163a909>] schedule+0x29/0x70
[133921.267585] [<ffffffffa086c53d>] __fuse_request_send+0x13d/0x2c0 [fuse]
[133921.267589] [<ffffffffa086fa80>] ? fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse]
[133921.267595] [<ffffffff81197088>] ? handle_mm_fault+0x5b8/0xf50
[133921.267599] [<ffffffff810a6ae0>] ?
wake_up_atomic_t+0x30/0x30
[133921.267602] [<ffffffffa086c6d2>] fuse_request_send+0x12/0x20 [fuse]
[133921.267606] [<ffffffffa087577f>] fuse_flush+0xff/0x150 [fuse]
[133921.267612] [<ffffffff811dc254>] filp_close+0x34/0x80
[133921.267616] [<ffffffff811fcb98>] __close_fd+0x78/0xa0
[133921.267619] [<ffffffff811dd963>] SyS_close+0x23/0x50
[133921.267623] [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[134041.267457] INFO: task tee:17200 blocked for more than 120 seconds.
[134041.267491] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[134041.267525] tee D ffff8801f8c2eb00 0 17200 6196 0x00000080
[134041.267529] ffff8801edf8be18 0000000000000082 ffff8801edc70000 ffff8801edf8bfd8
[134041.267534] ffff8801edf8bfd8 ffff8801edf8bfd8 ffff8801edc70000 ffff880078ef1a90
[134041.267537] ffff88027360b800 ffff8801edf8be48 ffff880078ef1b70 ffff8801f8c2eb00
[134041.267540] Call Trace:
[134041.267550] [<ffffffff8163a909>] schedule+0x29/0x70
[134041.267556] [<ffffffffa086c53d>] __fuse_request_send+0x13d/0x2c0 [fuse]
[134041.267560] [<ffffffffa086fa80>] ? fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse]
[134041.267566] [<ffffffff81197088>] ? handle_mm_fault+0x5b8/0xf50
[134041.267570] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[134041.267574] [<ffffffffa086c6d2>] fuse_request_send+0x12/0x20 [fuse]
[134041.267577] [<ffffffffa087577f>] fuse_flush+0xff/0x150 [fuse]
[134041.267583] [<ffffffff811dc254>] filp_close+0x34/0x80
[134041.267587] [<ffffffff811fcb98>] __close_fd+0x78/0xa0
[134041.267590] [<ffffffff811dd963>] SyS_close+0x23/0x50
[134041.267594] [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[134161.267401] INFO: task tee:17200 blocked for more than 120 seconds.
[134161.267435] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[134161.267469] tee D ffff8801f8c2eb00 0 17200 6196 0x00000080
[134161.267474] ffff8801edf8be18 0000000000000082 ffff8801edc70000 ffff8801edf8bfd8
[134161.267478] ffff8801edf8bfd8 ffff8801edf8bfd8 ffff8801edc70000 ffff880078ef1a90
[134161.267481] ffff88027360b800 ffff8801edf8be48 ffff880078ef1b70 ffff8801f8c2eb00
[134161.267485] Call Trace:
[134161.267496] [<ffffffff8163a909>] schedule+0x29/0x70
[134161.267502] [<ffffffffa086c53d>] __fuse_request_send+0x13d/0x2c0 [fuse]
[134161.267506] [<ffffffffa086fa80>] ? fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse]
[134161.267512] [<ffffffff81197088>] ? handle_mm_fault+0x5b8/0xf50
[134161.267516] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[134161.267519] [<ffffffffa086c6d2>] fuse_request_send+0x12/0x20 [fuse]
[134161.267523] [<ffffffffa087577f>] fuse_flush+0xff/0x150 [fuse]
[134161.267529] [<ffffffff811dc254>] filp_close+0x34/0x80
[134161.267533] [<ffffffff811fcb98>] __close_fd+0x78/0xa0
[134161.267535] [<ffffffff811dd963>] SyS_close+0x23/0x50
[134161.267540] [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[150001.259811] INFO: task tee:29470 blocked for more than 120 seconds.
[150001.259844] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[150001.259878] tee D ffff880246a4a700 0 29470 6196 0x00000080
[150001.259882] ffff8801edfdbe18 0000000000000086 ffff88025f5edc00 ffff8801edfdbfd8
[150001.259887] ffff8801edfdbfd8 ffff8801edfdbfd8 ffff88025f5edc00 ffff880078ef0e10
[150001.259890] ffff88027360b800 ffff8801edfdbe48 ffff880078ef0ef0 ffff880246a4a700
[150001.259893] Call Trace:
[150001.259904] [<ffffffff8163a909>] schedule+0x29/0x70
[150001.259910] [<ffffffffa086c53d>] __fuse_request_send+0x13d/0x2c0 [fuse]
[150001.259914] [<ffffffffa086fa80>] ? fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse]
[150001.259919] [<ffffffff81197088>] ? handle_mm_fault+0x5b8/0xf50
[150001.259924] [<ffffffff810a6ae0>] ?
wake_up_atomic_t+0x30/0x30
[150001.259927] [<ffffffffa086c6d2>] fuse_request_send+0x12/0x20 [fuse]
[150001.259931] [<ffffffffa087577f>] fuse_flush+0xff/0x150 [fuse]
[150001.259937] [<ffffffff811dc254>] filp_close+0x34/0x80
[150001.259941] [<ffffffff811fcb98>] __close_fd+0x78/0xa0
[150001.259943] [<ffffffff811dd963>] SyS_close+0x23/0x50
[150001.259948] [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[156721.256615] INFO: task tee:15511 blocked for more than 120 seconds.
[156721.256648] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[156721.256681] tee D ffff8804726fa800 0 15511 6196 0x00000080
[156721.256686] ffff880414487e18 0000000000000082 ffff880422fa1700 ffff880414487fd8
[156721.256690] ffff880414487fd8 ffff880414487fd8 ffff880422fa1700 ffff8804729936b0
[156721.256694] ffff88027360b800 ffff880414487e48 ffff880472993790 ffff8804726fa800
[156721.256697] Call Trace:
[156721.256707] [<ffffffff8163a909>] schedule+0x29/0x70
[156721.256712] [<ffffffffa086c53d>] __fuse_request_send+0x13d/0x2c0 [fuse]
[156721.256716] [<ffffffffa086fa80>] ? fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse]
[156721.256722] [<ffffffff81197088>] ? handle_mm_fault+0x5b8/0xf50
[156721.256726] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[156721.256729] [<ffffffffa086c6d2>] fuse_request_send+0x12/0x20 [fuse]
[156721.256733] [<ffffffffa087577f>] fuse_flush+0xff/0x150 [fuse]
[156721.256739] [<ffffffff811dc254>] filp_close+0x34/0x80
[156721.256743] [<ffffffff811fcb98>] __close_fd+0x78/0xa0
[156721.256746] [<ffffffff811dd963>] SyS_close+0x23/0x50
[156721.256750] [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[156841.256561] INFO: task tee:15511 blocked for more than 120 seconds.
[156841.256591] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[156841.256625] tee D ffff8804726fa800 0 15511 6196 0x00000080
[156841.256628] ffff880414487e18 0000000000000082 ffff880422fa1700 ffff880414487fd8
[156841.256632] ffff880414487fd8 ffff880414487fd8 ffff880422fa1700 ffff8804729936b0
[156841.256635] ffff88027360b800 ffff880414487e48 ffff880472993790 ffff8804726fa800
[156841.256639] Call Trace:
[156841.256645] [<ffffffff8163a909>] schedule+0x29/0x70
[156841.256650] [<ffffffffa086c53d>] __fuse_request_send+0x13d/0x2c0 [fuse]
[156841.256654] [<ffffffffa086fa80>] ? fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse]
[156841.256658] [<ffffffff81197088>] ? handle_mm_fault+0x5b8/0xf50
[156841.256660] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[156841.256664] [<ffffffffa086c6d2>] fuse_request_send+0x12/0x20 [fuse]
[156841.256667] [<ffffffffa087577f>] fuse_flush+0xff/0x150 [fuse]
[156841.256672] [<ffffffff811dc254>] filp_close+0x34/0x80
[156841.256676] [<ffffffff811fcb98>] __close_fd+0x78/0xa0
[156841.256678] [<ffffffff811dd963>] SyS_close+0x23/0x50
[156841.256682] [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[271499.233421] SELinux: initialized (dev binfmt_misc, type binfmt_misc), uses genfs_contexts
[271503.126098] nr_pdflush_threads exported in /proc is scheduled for removal
[271506.509137] warning: `turbostat' uses 32-bit capabilities (legacy support in use)
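To summarize hung-task reports like these without reading the full traces, the task:pid tokens can be extracted from the dmesg output; a small sketch (the field position is an assumption matching the el7 hung-task message format shown above):

```shell
# hung_tasks: print each "name:pid" the kernel reported as blocked,
# de-duplicated, from dmesg-style text on stdin, e.g.:
#   dmesg | hung_tasks
hung_tasks() {
  awk '/blocked for more than/ { print $4 }' | sort -u
}
```

On the traces above this would boil roughly 200 lines down to a handful of tee:PID entries, all stuck in fuse_flush on close(), which matches the top-output-to-file workload piped through tee.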
Client sosreports are available on rhsqe-repo; they were uploaded with:
scp -r /var/tmp/$HOSTNAME qe@rhsqe-repo:/var/www/html/sosreports/nchilaka/3.2_logs/systemic_testing_logs/regression_cycle/same_dir_create_clients/
Hi Nag,

I have checked the brick logs on the servers of the systemic setup and the mnt logs on the clients. Below is my analysis, specific to the disconnect messages seen on the server side.

In total there are 4 servers. 2 of them show disconnect messages (for both clients and internal daemons), while the other 2 show no disconnect messages for clients (only for internal daemons such as quotad and the self-heal daemon).

Servers that show no disconnect messages for clients:
[root@dhcp35-20 bricks]# grep -i disconnect * | grep rhs-client | grep -v "2017-04-13"
[root@dhcp35-156 log]# grep -i disconnect * | grep rhs-client | grep -v "2017-04-13"

Servers that do show disconnect messages for clients:
[root@dhcp37-86 bricks]#
[root@dhcp37-154 log]

One thing I have observed on these servers, compared to the servers that show no disconnect messages: there are many watchdog messages thrown by the kernel. The minimum stuck time is 22s, and the maximum stuck time is more than 120 seconds. IMO, if a client tries to communicate with the server while a task is stuck like this, it can hit a ping-timer expiry. I have checked the mnt logs as well, and at almost the same times the clients show ping-timer-expired messages. So I believe that with the latest package (glusterfs-3.8.4-22.el7rhgs.x86_64) we are not able to reproduce this issue.

Below are some of the messages thrown by the kernel:
Apr 16 19:22:29 dhcp37-154 kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 44s! [glusterfsd:17965]
Apr 18 06:02:21 dhcp37-86 kernel: NMI watchdog: BUG: soft lockup - CPU#2 stuck for 160s! [glusterfsd:8060]
Apr 18 06:02:21 dhcp37-86 kernel: NMI watchdog: BUG: soft lockup - CPU#3 stuck for 149s! [glusterfsd:21010]

As per your comment, you had also seen call traces on the client side for tasks stuck for more than 2 minutes, but this time there is no call trace in dmesg for any task hung for more than 2 minutes.

Thanks,
Mohit Agrawal
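The correlation described above can be sketched as two small helpers: one prints the timestamp of each kernel soft lockup from a syslog-format file, the other prints the timestamp of each disconnect line in a brick log; lining the two outputs up side by side shows whether the lockups and the ping-timer expiries coincide. The field positions are assumptions based on the standard syslog and glusterfs log formats shown in this bug.

```shell
# lockup_times: date/time plus the stuck process for each soft-lockup
# line in syslog-format text (e.g. /var/log/messages) on stdin.
lockup_times() {
  awk '/soft lockup/ { print $1, $2, $3, $NF }'
}

# disconnect_times: the bracketed timestamp of each disconnect line in a
# glusterfs brick log on stdin.
disconnect_times() {
  grep -i disconnect | awk '{ print $1, $2 }'
}
```

Note the two sources use different clocks and formats (local syslog time vs. UTC gluster log time), so the comparison is an eyeball check, not an exact join.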
Upstream patch: https://review.gluster.org/#/c/17105/
Can we have a probable release candidate target assigned for this bug?
Based on the above comments (see comment#23), QE is moving this bug to verified. Test version: 3.12.2-43
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0658