Bug 1813294

Summary: [RHEL 8] gluster-cli commands getting request timed out
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Kshithij Iyer <kiyer>
Component: glusterd
Assignee: Sanju <srakonde>
Status: CLOSED DUPLICATE
QA Contact: Bala Konda Reddy M <bmekala>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: rhgs-3.5
CC: moagrawa, rhs-bugs, storage-qa-internal, vbellur
Target Milestone: ---
Keywords: Automation, ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-03-13 17:10:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Kshithij Iyer 2020-03-13 13:29:51 UTC
Description of problem:
While running BVT test cases on RHEL 8.1, about 8 cases were failing. On further investigation it was observed that all of these test cases were failing with an "Error : Request timed out" error.

##############################################################################################################################################################################################################
2020-03-12 19:36:28,134 INFO (run) root.eng.blr.redhat.com (cp): gluster volume rebalance testvol_distributed-dispersed status --xml
2020-03-12 19:36:28,135 DEBUG (_get_ssh_connection) Retrieved connection from cache: root.eng.blr.redhat.com
2020-03-12 19:38:28,194 INFO (_log_results) RETCODE (root.eng.blr.redhat.com): 1
2020-03-12 19:38:28,196 INFO (_log_results) STDOUT (root.eng.blr.redhat.com)...
Error : Request timed out
##############################################################################################################################################################################################################
2020-03-02 04:52:01,897 INFO (run) root.eng.blr.redhat.com (cp): gluster volume remove-brick testvol_distributed-dispersed  rhsauto041.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick10 rhsauto042.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick11 rhsauto045.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick7 rhsauto025.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick6 rhsauto032.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick9 rhsauto028.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick8 status --mode=script --xml
2020-03-02 04:52:01,897 DEBUG (_get_ssh_connection) Retrieved connection from cache: root.eng.blr.redhat.com
2020-03-02 04:54:01,950 INFO (_log_results) RETCODE (root.eng.blr.redhat.com): 1
2020-03-02 04:54:01,951 DEBUG (_log_results) STDOUT (root.eng.blr.redhat.com)...
Error : Request timed out
##############################################################################################################################################################################################################
2020-03-02 06:33:09,642 INFO (run) root.eng.blr.redhat.com (cp): gluster vol status testvol_distributed-replicated   --xml
2020-03-02 06:33:09,642 DEBUG (_get_ssh_connection) Retrieved connection from cache: root.eng.blr.redhat.com
2020-03-02 06:35:09,694 INFO (_log_results) RETCODE (root.eng.blr.redhat.com): 1
2020-03-02 06:35:09,694 DEBUG (_log_results) STDOUT (root.eng.blr.redhat.com)...
Error : Request timed out
##############################################################################################################################################################################################################


When the failed test cases were rerun on a local setup, it was observed that they still failed even when executed one by one.
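
For reference, the timeout can also be reproduced outside the automation framework by replaying any one of the commands by hand (the volume name below is taken from the logs above). On the affected setup the command exits non-zero with "Error : Request timed out" after roughly two minutes, matching the two-minute gaps between the run and _log_results entries:

# time gluster volume status testvol_distributed-replicated --xml; echo "rc=$?"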

##############################################################################################################################################################################################################
glusterd logs
##############################################################################################################################################################################################################
Snippet 1
##############################################################################################################################################################################################################
[2020-03-12 14:05:37.069585] I [MSGID: 106499] [glusterd-handler.c:4502:__glusterd_handle_status_volume] 0-management: Received status volume req for volume testvol_distributed-dispersed
[2020-03-12 14:05:37.163688] I [MSGID: 106539] [glusterd-utils.c:12461:glusterd_generate_and_set_task_id] 0-management: Generated task-id f06410aa-7a2c-4206-ad07-5fb47aca3582 for key rebalance-id
[2020-03-12 14:05:42.198882] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2020-03-12 14:05:42.199261] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
The message "E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict" repeated 40 times between [2020-03-12 14:05:42.199261] and [2020-03-12 14:07:18.776453]
[2020-03-12 14:07:29.445663] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
[2020-03-12 14:09:28.620158] I [glusterd-locks.c:740:gd_mgmt_v3_unlock_timer_cbk] 0-management: unlock timer is cancelled for volume testvol_distributed-disperse
[2020-03-12 14:10:54.708894] I [MSGID: 106163] [glusterd-handshake.c:1389:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 70000
[2020-03-12 14:11:10.409238] I [MSGID: 106490] [glusterd-handler.c:2675:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 5650b20
##############################################################################################################################################################################################################

##############################################################################################################################################################################################################
Snippet 2
##############################################################################################################################################################################################################
[2020-03-12 14:12:31.550310] I [MSGID: 106502] [glusterd-handler.c:2901:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2020-03-12 14:12:55.662055] I [MSGID: 106163] [glusterd-handshake.c:1389:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 70000
[2020-03-12 14:13:08.296521] E [MSGID: 106115] [glusterd-mgmt.c:117:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on rhsauto052.lab.eng.blr.redhat.com. Please check log file for details.
[2020-03-12 14:13:08.296604] E [MSGID: 106115] [glusterd-mgmt.c:117:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on rhsauto022.lab.eng.blr.redhat.com. Please check log file for details.
[2020-03-12 14:13:08.296643] E [MSGID: 106115] [glusterd-mgmt.c:117:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on rhsauto024.lab.eng.blr.redhat.com. Please check log file for details.
[2020-03-12 14:13:08.296677] E [MSGID: 106115] [glusterd-mgmt.c:117:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on rhsauto021.lab.eng.blr.redhat.com. Please check log file for details.
[2020-03-12 14:13:10.869641] E [MSGID: 106115] [glusterd-mgmt.c:117:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on rhsauto020.lab.eng.blr.redhat.com. Please check log file for details.
[2020-03-12 14:13:10.869820] E [MSGID: 106117] [glusterd-mgmt.c:2182:glusterd_mgmt_v3_release_peer_locks] 0-management: Unlock failed on peers
[2020-03-12 14:13:10.869895] W [glusterd-locks.c:807:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf156a) [0x7feabf8a356a] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf1348) [0x7feabf8a3348] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf0688) [0x7feabf8a2688] ) 0-management: Lock for vol testvol_distributed-dispersed not held
[2020-03-12 14:13:10.869914] E [MSGID: 106117] [glusterd-locks.c:342:glusterd_mgmt_v3_unlock_entity] 0-management: Failed to release lock for vol testvol_distributed-dispersed on behalf of 1fc8fc20-a1f6-42f9-a403-02826ef934e1.
[2020-03-12 14:13:10.869926] E [MSGID: 106146] [glusterd-locks.c:457:glusterd_multiple_mgmt_v3_unlock] 0-management: Unable to unlock all vol
[2020-03-12 14:13:10.869937] E [MSGID: 106117] [glusterd-mgmt.c:2338:glusterd_mgmt_v3_initiate_all_phases_with_brickop_phase] 0-management: Failed to release mgmt_v3 locks on localhost
[2020-03-12 14:13:10.869996] I [socket.c:3844:socket_submit_outgoing_msg] 0-socket.management: not connected (priv->connected = -1)
[2020-03-12 14:13:10.870007] E [rpcsvc.c:1583:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x2, Program: GlusterD svc cli, ProgVers: 2, Proc: 11) to rpc-transport (socket.management)
[2020-03-12 14:13:10.870023] E [MSGID: 106430] [glusterd-utils.c:599:glusterd_submit_reply] 0-glusterd: Reply submission failed
[2020-03-12 14:13:49.459442] I [MSGID: 106490] [glusterd-handler.c:2675:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 5650b20d-a2b0-44bd-9abc-4597ccbf0dc5
[2020-03-12 14:14:03.720052] I [MSGID: 106493] [glusterd-handler.c:3953:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhsauto024.lab.eng.blr.redhat.com (0), ret: 0, op_ret: 0
[2020-03-12 14:14:12.851026] I [MSGID: 106492] [glusterd-handler.c:2860:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 5650b20d-a2b0-44bd-9abc-4597ccbf0dc5
[2020-03-12 14:14:12.851080] I [MSGID: 106502] [glusterd-handler.c:2901:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2020-03-12 14:14:23.258524] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 5650b20d-a2b0-44bd-9abc-4597ccbf0dc5
[2020-03-12 14:24:34.677880] I [MSGID: 106004] [glusterd-handler.c:6521:__glusterd_peer_rpc_notify] 0-management: Peer <rhsauto020.lab.eng.blr.redhat.com> (<091678e4-4723-41be-9696-1ff746e33ca8>), in state <Peer in Cluster>, has disconnected from glusterd.
[2020-03-12 14:24:34.678229] W [glusterd-locks.c:807:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x2400d) [0x7feabf7d600d] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x2e130) [0x7feabf7e0130] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf0688) [0x7feabf8a2688] ) 0-management: Lock for vol testvol_distributed-dispersed not held
[2020-03-12 14:24:34.678256] W [MSGID: 106117] [glusterd-handler.c:6542:__glusterd_peer_rpc_notify] 0-management: Lock not released for testvol_distributed-dispersed
[2020-03-12 14:26:10.328965] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 091678e4-4723-41be-9696-1ff746e33ca8, host: rhsauto020.lab.eng.blr.redhat.com, port: 0
[2020-03-12 14:26:12.456287] I [MSGID: 106492] [glusterd-handler.c:2860:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 091678e4-4723-41be-9696-1ff746e33ca8
[2020-03-12 14:26:12.456354] I [MSGID: 106502] [glusterd-handler.c:2901:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2020-03-12 14:26:33.732291] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 091678e4-4723-41be-9696-1ff746e33ca8
[2020-03-12 14:31:51.413557] I [MSGID: 106004] [glusterd-handler.c:6521:__glusterd_peer_rpc_notify] 0-management: Peer <rhsauto020.lab.eng.blr.redhat.com> (<091678e4-4723-41be-9696-1ff746e33ca8>), in state <Peer in Cluster>, has disconnected from glusterd.
[2020-03-12 14:31:51.413771] W [glusterd-locks.c:807:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x2400d) [0x7feabf7d600d] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x2e130) [0x7feabf7e0130] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf0688) [0x7feabf8a2688] ) 0-management: Lock for vol testvol_distributed-dispersed not held
##############################################################################################################################################################################################################
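
If the cluster is caught in this state again, a glusterd statedump can be collected on each node while the locks are stuck, so the daemon's in-memory state can be inspected alongside the messages above. A minimal sketch, assuming the default /var/run/gluster dump directory:

# kill -SIGUSR1 $(pidof glusterd)     # ask glusterd to write a statedump
# ls -lrt /var/run/gluster/ | tail    # newest glusterdump.<pid>.dump.* file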


##############################################################################################################################################################################################################
pstack output:
##############################################################################################################################################################################################################
# pstack `pidof glusterd`
Thread 9 (Thread 0x7feabd7b4700 (LWP 24010)):
#0  0x00007feac4f51211 in poll () from /lib64/libc.so.6
#1  0x00007feac3edccbb in __res_context_send () from /lib64/libresolv.so.2
#2  0x00007feac3eda64a in __res_context_query () from /lib64/libresolv.so.2
#3  0x00007feabe1c25e3 in _nss_dns_gethostbyaddr2_r () from /lib64/libnss_dns.so.2
#4  0x00007feabe1c2917 in _nss_dns_gethostbyaddr_r () from /lib64/libnss_dns.so.2
#5  0x00007feac4f6e242 in gethostbyaddr_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
#6  0x00007feac4f76d55 in getnameinfo () from /lib64/libc.so.6
#7  0x00007feac6b958d6 in gf_get_hostname_from_ip () from /lib64/libglusterfs.so.0
#8  0x00007feabf805ba9 in glusterd_remote_hostname_get () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#9  0x00007feabf7dbd4e in __glusterd_handle_incoming_friend_req () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#10 0x00007feabf7d6071 in glusterd_big_locked_handler () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#11 0x00007feac692abad in rpcsvc_handle_rpc_call () from /lib64/libgfrpc.so.0
#12 0x00007feac692b21e in rpcsvc_notify () from /lib64/libgfrpc.so.0
#13 0x00007feac692d1fb in rpc_transport_notify () from /lib64/libgfrpc.so.0
#14 0x00007feabe9f904f in socket_event_poll_in (notify_handled=true, this=0x7feab802c1b0) at socket.c:2582
#15 socket_event_handler (event_thread_died=0 '\000', poll_err=0, poll_out=<optimized out>, poll_in=<optimized out>, data=0x7feab802c1b0, gen=7, idx=28, fd=<optimized out>) at socket.c:2991
#16 socket_event_handler (fd=<optimized out>, idx=28, gen=7, data=0x7feab802c1b0, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>, event_thread_died=0 '\000') at socket.c:2911
#17 0x00007feac6bea34b in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#18 0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#19 0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7feabdfb5700 (LWP 24009)):
#0  0x00007feac570f47c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007feabf89c90a in hooks_worker () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#3  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7feac04a7700 (LWP 24007)):
#0  0x00007feac4f5379f in select () from /lib64/libc.so.6
#1  0x00007feac6c01da5 in runner () from /lib64/libglusterfs.so.0
#2  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#3  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7feac0ca8700 (LWP 24006)):
#0  0x00007feac570f7ca in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007feac6bc801e in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007feac6bc8ca8 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#4  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7feac14a9700 (LWP 24005)):
#0  0x00007feac570f7ca in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007feac6bc801e in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007feac6bc8ca8 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#4  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7feac1caa700 (LWP 24004)):
#0  0x00007feac4f28548 in nanosleep () from /lib64/libc.so.6
#1  0x00007feac4f2844e in sleep () from /lib64/libc.so.6
#2  0x00007feac6bb4515 in pool_sweeper () from /lib64/libglusterfs.so.0
#3  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#4  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7feac24ab700 (LWP 24003)):
#0  0x00007feac4e986ac in sigtimedwait () from /lib64/libc.so.6
#1  0x00007feac571394c in sigwait () from /lib64/libpthread.so.0
#2  0x00005654f0cfb81b in glusterfs_sigwaiter ()
#3  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#4  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7feac2cac700 (LWP 24002)):
#0  0x00007feac5713250 in nanosleep () from /lib64/libpthread.so.0
#1  0x00007feac6b9975e in gf_timer_proc () from /lib64/libglusterfs.so.0
#2  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#3  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7feac7088580 (LWP 24001)):
#0  0x00007feac570a7bd in __pthread_timedjoin_ex () from /lib64/libpthread.so.0
#1  0x00007feac6be9a77 in event_dispatch_epoll () from /lib64/libglusterfs.so.0
#2  0x00005654f0cfb246 in main ()
##############################################################################################################################################################################################################
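
Note that Thread 9 above (an epoll worker) is inside __glusterd_handle_incoming_friend_req -> gf_get_hostname_from_ip -> getnameinfo, i.e. it is stuck in a reverse DNS lookup while running under glusterd_big_locked_handler, which would be consistent with other requests timing out while it waits. A quick way to check whether reverse resolution of the peer addresses is slow on these nodes (PEER_IP_1/PEER_IP_2 below are placeholders for the actual peer addresses):

# for ip in PEER_IP_1 PEER_IP_2; do time getent hosts "$ip"; done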

Version-Release number of selected component (if applicable):
glusterfs-6.0-30.1.el8rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Run BVT on a RHEL 8.1 setup (an example invocation is sketched after the link below):
https://github.com/gluster/glusto-tests/blob/master/tests/functional/bvt/test_cvt.py
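
An illustrative invocation of the suite (the config file name and pytest options here are placeholders, not the exact CI command):

# glusto -c 'gluster_config.yml' --pytest='-v tests/functional/bvt/test_cvt.py'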

Actual results:
Commands are failing with "Error: Request timed out"

Expected results:
Commands shouldn't fail.

Additional info:
At first this bug might look like a duplicate of Bug 1810516; however, we are filing a new bug because the frequency on RHEL 8 machines is much higher. In addition, on RHEL 8 this behaviour was observed even when only a single test case was executed, which was not the case on RHEL 7. Hence the severity is marked as urgent.