Description of problem:
While running BVT test cases on RHEL 8.1, about 8 cases were failing. On further investigation, all of these test cases were found to be failing with an "Error : Request timed out" error.

############################################################
2020-03-12 19:36:28,134 INFO (run) root.eng.blr.redhat.com (cp): gluster volume rebalance testvol_distributed-dispersed status --xml
2020-03-12 19:36:28,135 DEBUG (_get_ssh_connection) Retrieved connection from cache: root.eng.blr.redhat.com
2020-03-12 19:38:28,194 INFO (_log_results) RETCODE (root.eng.blr.redhat.com): 1
2020-03-12 19:38:28,196 INFO (_log_results) STDOUT (root.eng.blr.redhat.com)...
Error : Request timed out
############################################################
2020-03-02 04:52:01,897 INFO (run) root.eng.blr.redhat.com (cp): gluster volume remove-brick testvol_distributed-dispersed rhsauto041.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick10 rhsauto042.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick11 rhsauto045.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick7 rhsauto025.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick6 rhsauto032.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick9 rhsauto028.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick8 status --mode=script --xml
2020-03-02 04:52:01,897 DEBUG (_get_ssh_connection) Retrieved connection from cache: root.eng.blr.redhat.com
2020-03-02 04:54:01,950 INFO (_log_results) RETCODE (root.eng.blr.redhat.com): 1
2020-03-02 04:54:01,951 DEBUG (_log_results) STDOUT (root.eng.blr.redhat.com)...
Error : Request timed out
############################################################
2020-03-02 06:33:09,642 INFO (run) root.eng.blr.redhat.com (cp): gluster vol status testvol_distributed-replicated --xml
2020-03-02 06:33:09,642 DEBUG (_get_ssh_connection) Retrieved connection from cache: root.eng.blr.redhat.com
2020-03-02 06:35:09,694 INFO (_log_results) RETCODE (root.eng.blr.redhat.com): 1
2020-03-02 06:35:09,694 DEBUG (_log_results) STDOUT (root.eng.blr.redhat.com)...
Error : Request timed out
############################################################

When the failed test cases were re-run on a local setup, they failed even when executed one at a time.
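The timestamps above show each failing command returning after roughly two minutes before the CLI prints "Error : Request timed out". The following is a minimal sketch (not part of the original test run; assumes it is executed directly on a node of the affected cluster with the gluster CLI installed, and the volume name is taken from the logs above) for re-checking one of the failing commands outside the glusto-tests/SSH layer:

############################################################
# check_gluster_cli_timeout.py -- minimal reproduction sketch, not part of
# the original BVT run. Assumption: run on a cluster node where the gluster
# CLI is installed; VOLNAME is taken from the failing test logs above.
import subprocess
import time

VOLNAME = "testvol_distributed-dispersed"
cmd = ["gluster", "volume", "rebalance", VOLNAME, "status", "--xml"]

start = time.time()
proc = subprocess.run(cmd, capture_output=True, text=True)
elapsed = time.time() - start

print("retcode=%d elapsed=%.0fs" % (proc.returncode, elapsed))
if "Request timed out" in proc.stdout + proc.stderr:
    print("gluster CLI request timed out, matching the BVT failures above")
############################################################

If the command also times out when run directly like this, the failure is in glusterd's handling of the request rather than in the test framework's SSH connection handling.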
############################################################
glusterd logs
############################################################
Snippet 1
############################################################
[2020-03-12 14:05:37.069585] I [MSGID: 106499] [glusterd-handler.c:4502:__glusterd_handle_status_volume] 0-management: Received status volume req for volume testvol_distributed-dispersed
[2020-03-12 14:05:37.163688] I [MSGID: 106539] [glusterd-utils.c:12461:glusterd_generate_and_set_task_id] 0-management: Generated task-id f06410aa-7a2c-4206-ad07-5fb47aca3582 for key rebalance-id
[2020-03-12 14:05:42.198882] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2020-03-12 14:05:42.199261] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
The message "E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict" repeated 40 times between [2020-03-12 14:05:42.199261] and [2020-03-12 14:07:18.776453]
[2020-03-12 14:07:29.445663] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
[2020-03-12 14:09:28.620158] I [glusterd-locks.c:740:gd_mgmt_v3_unlock_timer_cbk] 0-management: unlock timer is cancelled for volume testvol_distributed-disperse
[2020-03-12 14:10:54.708894] I [MSGID: 106163] [glusterd-handshake.c:1389:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 70000
[2020-03-12 14:11:10.409238] I [MSGID: 106490] [glusterd-handler.c:2675:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 5650b20
############################################################
Snippet 2
############################################################
[2020-03-12 14:12:31.550310] I [MSGID: 106502] [glusterd-handler.c:2901:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2020-03-12 14:12:55.662055] I [MSGID: 106163] [glusterd-handshake.c:1389:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 70000
[2020-03-12 14:13:08.296521] E [MSGID: 106115] [glusterd-mgmt.c:117:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on rhsauto052.lab.eng.blr.redhat.com. Please check log file for details.
[2020-03-12 14:13:08.296604] E [MSGID: 106115] [glusterd-mgmt.c:117:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on rhsauto022.lab.eng.blr.redhat.com. Please check log file for details.
[2020-03-12 14:13:08.296643] E [MSGID: 106115] [glusterd-mgmt.c:117:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on rhsauto024.lab.eng.blr.redhat.com. Please check log file for details.
[2020-03-12 14:13:08.296677] E [MSGID: 106115] [glusterd-mgmt.c:117:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on rhsauto021.lab.eng.blr.redhat.com. Please check log file for details.
[2020-03-12 14:13:10.869641] E [MSGID: 106115] [glusterd-mgmt.c:117:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on rhsauto020.lab.eng.blr.redhat.com. Please check log file for details.
[2020-03-12 14:13:10.869820] E [MSGID: 106117] [glusterd-mgmt.c:2182:glusterd_mgmt_v3_release_peer_locks] 0-management: Unlock failed on peers
[2020-03-12 14:13:10.869895] W [glusterd-locks.c:807:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf156a) [0x7feabf8a356a] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf1348) [0x7feabf8a3348] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf0688) [0x7feabf8a2688] ) 0-management: Lock for vol testvol_distributed-dispersed not held
[2020-03-12 14:13:10.869914] E [MSGID: 106117] [glusterd-locks.c:342:glusterd_mgmt_v3_unlock_entity] 0-management: Failed to release lock for vol testvol_distributed-dispersed on behalf of 1fc8fc20-a1f6-42f9-a403-02826ef934e1.
[2020-03-12 14:13:10.869926] E [MSGID: 106146] [glusterd-locks.c:457:glusterd_multiple_mgmt_v3_unlock] 0-management: Unable to unlock all vol
[2020-03-12 14:13:10.869937] E [MSGID: 106117] [glusterd-mgmt.c:2338:glusterd_mgmt_v3_initiate_all_phases_with_brickop_phase] 0-management: Failed to release mgmt_v3 locks on localhost
[2020-03-12 14:13:10.869996] I [socket.c:3844:socket_submit_outgoing_msg] 0-socket.management: not connected (priv->connected = -1)
[2020-03-12 14:13:10.870007] E [rpcsvc.c:1583:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x2, Program: GlusterD svc cli, ProgVers: 2, Proc: 11) to rpc-transport (socket.management)
[2020-03-12 14:13:10.870023] E [MSGID: 106430] [glusterd-utils.c:599:glusterd_submit_reply] 0-glusterd: Reply submission failed
[2020-03-12 14:13:49.459442] I [MSGID: 106490] [glusterd-handler.c:2675:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 5650b20d-a2b0-44bd-9abc-4597ccbf0dc5
[2020-03-12 14:14:03.720052] I [MSGID: 106493] [glusterd-handler.c:3953:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhsauto024.lab.eng.blr.redhat.com (0), ret: 0, op_ret: 0
[2020-03-12 14:14:12.851026] I [MSGID: 106492] [glusterd-handler.c:2860:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 5650b20d-a2b0-44bd-9abc-4597ccbf0dc5
[2020-03-12 14:14:12.851080] I [MSGID: 106502] [glusterd-handler.c:2901:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2020-03-12 14:14:23.258524] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 5650b20d-a2b0-44bd-9abc-4597ccbf0dc5
[2020-03-12 14:24:34.677880] I [MSGID: 106004] [glusterd-handler.c:6521:__glusterd_peer_rpc_notify] 0-management: Peer <rhsauto020.lab.eng.blr.redhat.com> (<091678e4-4723-41be-9696-1ff746e33ca8>), in state <Peer in Cluster>, has disconnected from glusterd.
[2020-03-12 14:24:34.678229] W [glusterd-locks.c:807:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x2400d) [0x7feabf7d600d] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x2e130) [0x7feabf7e0130] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf0688) [0x7feabf8a2688] ) 0-management: Lock for vol testvol_distributed-dispersed not held
[2020-03-12 14:24:34.678256] W [MSGID: 106117] [glusterd-handler.c:6542:__glusterd_peer_rpc_notify] 0-management: Lock not released for testvol_distributed-dispersed
[2020-03-12 14:26:10.328965] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 091678e4-4723-41be-9696-1ff746e33ca8, host: rhsauto020.lab.eng.blr.redhat.com, port: 0
[2020-03-12 14:26:12.456287] I [MSGID: 106492] [glusterd-handler.c:2860:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 091678e4-4723-41be-9696-1ff746e33ca8
[2020-03-12 14:26:12.456354] I [MSGID: 106502] [glusterd-handler.c:2901:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2020-03-12 14:26:33.732291] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 091678e4-4723-41be-9696-1ff746e33ca8
[2020-03-12 14:31:51.413557] I [MSGID: 106004] [glusterd-handler.c:6521:__glusterd_peer_rpc_notify] 0-management: Peer <rhsauto020.lab.eng.blr.redhat.com> (<091678e4-4723-41be-9696-1ff746e33ca8>), in state <Peer in Cluster>, has disconnected from glusterd.
[2020-03-12 14:31:51.413771] W [glusterd-locks.c:807:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x2400d) [0x7feabf7d600d] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x2e130) [0x7feabf7e0130] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf0688) [0x7feabf8a2688] ) 0-management: Lock for vol testvol_distributed-dispersed not held
############################################################

############################################################
pstack output:
############################################################
# pstack `pidof glusterd`
Thread 9 (Thread 0x7feabd7b4700 (LWP 24010)):
#0  0x00007feac4f51211 in poll () from /lib64/libc.so.6
#1  0x00007feac3edccbb in __res_context_send () from /lib64/libresolv.so.2
#2  0x00007feac3eda64a in __res_context_query () from /lib64/libresolv.so.2
#3  0x00007feabe1c25e3 in _nss_dns_gethostbyaddr2_r () from /lib64/libnss_dns.so.2
#4  0x00007feabe1c2917 in _nss_dns_gethostbyaddr_r () from /lib64/libnss_dns.so.2
#5  0x00007feac4f6e242 in gethostbyaddr_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
#6  0x00007feac4f76d55 in getnameinfo () from /lib64/libc.so.6
#7  0x00007feac6b958d6 in gf_get_hostname_from_ip () from /lib64/libglusterfs.so.0
#8  0x00007feabf805ba9 in glusterd_remote_hostname_get () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#9  0x00007feabf7dbd4e in __glusterd_handle_incoming_friend_req () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#10 0x00007feabf7d6071 in glusterd_big_locked_handler () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#11 0x00007feac692abad in rpcsvc_handle_rpc_call () from /lib64/libgfrpc.so.0
#12 0x00007feac692b21e in rpcsvc_notify () from /lib64/libgfrpc.so.0
#13 0x00007feac692d1fb in rpc_transport_notify () from /lib64/libgfrpc.so.0
#14 0x00007feabe9f904f in socket_event_poll_in (notify_handled=true, this=0x7feab802c1b0) at socket.c:2582
#15 socket_event_handler (event_thread_died=0 '\000', poll_err=0, poll_out=<optimized out>, poll_in=<optimized out>, data=0x7feab802c1b0, gen=7, idx=28, fd=<optimized out>) at socket.c:2991
#16 socket_event_handler (fd=<optimized out>, idx=28, gen=7, data=0x7feab802c1b0, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>, event_thread_died=0 '\000') at socket.c:2911
#17 0x00007feac6bea34b in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#18 0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#19 0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7feabdfb5700 (LWP 24009)):
#0  0x00007feac570f47c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007feabf89c90a in hooks_worker () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#3  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7feac04a7700 (LWP 24007)):
#0  0x00007feac4f5379f in select () from /lib64/libc.so.6
#1  0x00007feac6c01da5 in runner () from /lib64/libglusterfs.so.0
#2  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#3  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7feac0ca8700 (LWP 24006)):
#0  0x00007feac570f7ca in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007feac6bc801e in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007feac6bc8ca8 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#4  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7feac14a9700 (LWP 24005)):
#0  0x00007feac570f7ca in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007feac6bc801e in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007feac6bc8ca8 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#4  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7feac1caa700 (LWP 24004)):
#0  0x00007feac4f28548 in nanosleep () from /lib64/libc.so.6
#1  0x00007feac4f2844e in sleep () from /lib64/libc.so.6
#2  0x00007feac6bb4515 in pool_sweeper () from /lib64/libglusterfs.so.0
#3  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#4  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7feac24ab700 (LWP 24003)):
#0  0x00007feac4e986ac in sigtimedwait () from /lib64/libc.so.6
#1  0x00007feac571394c in sigwait () from /lib64/libpthread.so.0
#2  0x00005654f0cfb81b in glusterfs_sigwaiter ()
#3  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#4  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7feac2cac700 (LWP 24002)):
#0  0x00007feac5713250 in nanosleep () from /lib64/libpthread.so.0
#1  0x00007feac6b9975e in gf_timer_proc () from /lib64/libglusterfs.so.0
#2  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#3  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7feac7088580 (LWP 24001)):
#0  0x00007feac570a7bd in __pthread_timedjoin_ex () from /lib64/libpthread.so.0
#1  0x00007feac6be9a77 in event_dispatch_epoll () from /lib64/libglusterfs.so.0
#2  0x00005654f0cfb246 in main ()
############################################################

Version-Release number of selected component (if applicable):
glusterfs-6.0-30.1.el8rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Run BVT on a RHEL 8.1 setup: https://github.com/gluster/glusto-tests/blob/master/tests/functional/bvt/test_cvt.py

Actual results:
Commands are failing with "Error : Request timed out".

Expected results:
Commands shouldn't fail.

Additional info:
At first this bug might seem to be a duplicate of Bug 1810516; however, we are filing a new bug because the failure frequency on RHEL 8 machines is much higher. On RHEL 8 this behaviour was observed even when only a single test case was executed, which was not seen on RHEL 7. Hence, marking the severity urgent.
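Additional observation on the pstack output above: thread 9 (an epoll worker) is blocked in a reverse DNS lookup (getnameinfo -> gethostbyaddr_r, called from gf_get_hostname_from_ip) while handling an incoming friend request, and frame #10 shows this runs inside glusterd_big_locked_handler. While that lookup waits on the resolver, glusterd cannot make progress on other management requests, which would be consistent with the CLI reporting "Error : Request timed out". The snippet below is an illustration only (not glusterd code; the IP address is a hypothetical example) of how a single blocking reverse lookup can stall the calling thread for the resolver's full timeout when PTR records are missing or DNS is slow:

############################################################
# reverse_lookup_stall.py -- illustration only, not glusterd code.
# Shows that a reverse DNS lookup (the Python analogue of the
# getnameinfo()/gethostbyaddr_r() path in the pstack above) blocks the
# calling thread until the resolver answers or gives up.
import socket
import time

ip = "10.70.35.1"  # hypothetical peer address with a missing or slow PTR record

start = time.time()
try:
    host, aliases, addrs = socket.gethostbyaddr(ip)  # blocking reverse lookup
    print("resolved %s -> %s in %.1fs" % (ip, host, time.time() - start))
except (socket.herror, socket.gaierror):
    # With a slow or unreachable DNS server this point is reached only after
    # the resolver's own timeout/retry cycle, during which the caller is
    # stuck, just like glusterd's thread 9 above.
    print("reverse lookup failed after %.1fs" % (time.time() - start))
############################################################

If the failing commands stop timing out once forward and reverse DNS entries for all peers resolve quickly (for example, via /etc/hosts), that would support this being a hostname-resolution stall on the RHEL 8 setup rather than a problem in the test cases themselves.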