Bug 1813294 - [RHEL 8] gluster-cli commands getting request timed out
Summary: [RHEL 8] gluster-cli commands getting request timed out
Keywords:
Status: CLOSED DUPLICATE of bug 1810516
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: rhgs-3.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Sanju
QA Contact: Bala Konda Reddy M
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-13 13:29 UTC by Kshithij Iyer
Modified: 2020-05-08 16:36 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-13 17:10:23 UTC
Embargoed:



Description Kshithij Iyer 2020-03-13 13:29:51 UTC
Description of problem:
While running BVT test cases on RHEL 8.1, about 8 test cases were failing. On further investigation it was observed that all of them were failing with an "Error : Request timed out" error.

##############################################################################################################################################################################################################
2020-03-12 19:36:28,134 INFO (run) root.eng.blr.redhat.com (cp): gluster volume rebalance testvol_distributed-dispersed status --xml
2020-03-12 19:36:28,135 DEBUG (_get_ssh_connection) Retrieved connection from cache: root.eng.blr.redhat.com
2020-03-12 19:38:28,194 INFO (_log_results) RETCODE (root.eng.blr.redhat.com): 1
2020-03-12 19:38:28,196 INFO (_log_results) STDOUT (root.eng.blr.redhat.com)...
Error : Request timed out
##############################################################################################################################################################################################################
2020-03-02 04:52:01,897 INFO (run) root.eng.blr.redhat.com (cp): gluster volume remove-brick testvol_distributed-dispersed  rhsauto041.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick10 rhsauto042.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick11 rhsauto045.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick7 rhsauto025.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick6 rhsauto032.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick9 rhsauto028.lab.eng.blr.redhat.com:/bricks/brick2/testvol_distributed-dispersed_brick8 status --mode=script --xml
2020-03-02 04:52:01,897 DEBUG (_get_ssh_connection) Retrieved connection from cache: root.eng.blr.redhat.com
2020-03-02 04:54:01,950 INFO (_log_results) RETCODE (root.eng.blr.redhat.com): 1
2020-03-02 04:54:01,951 DEBUG (_log_results) STDOUT (root.eng.blr.redhat.com)...
Error : Request timed out
##############################################################################################################################################################################################################
2020-03-02 06:33:09,642 INFO (run) root.eng.blr.redhat.com (cp): gluster vol status testvol_distributed-replicated   --xml
2020-03-02 06:33:09,642 DEBUG (_get_ssh_connection) Retrieved connection from cache: root.eng.blr.redhat.com
2020-03-02 06:35:09,694 INFO (_log_results) RETCODE (root.eng.blr.redhat.com): 1
2020-03-02 06:35:09,694 DEBUG (_log_results) STDOUT (root.eng.blr.redhat.com)...
Error : Request timed out
##############################################################################################################################################################################################################


When the failed test cases were re-run on a local setup, they were observed to fail even when executed one at a time.
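
The same failure can also be reproduced outside the test framework by timing a single CLI call on a storage node; in the snippets above each command returns "Error : Request timed out" roughly two minutes after it was issued. Below is a minimal, illustrative Python sketch (not part of glusto-tests); the volume name is taken from the logs above and everything else is a placeholder.

# Illustrative only: time one gluster CLI call and flag the ~2 minute
# "Error : Request timed out" behaviour seen in the BVT logs above.
import subprocess
import time

CMD = ["gluster", "volume", "status", "testvol_distributed-dispersed", "--xml"]

start = time.time()
proc = subprocess.run(CMD, capture_output=True, text=True)
elapsed = time.time() - start

print("retcode=%d elapsed=%.1fs" % (proc.returncode, elapsed))
if "Request timed out" in (proc.stdout + proc.stderr):
    print("CLI request timed out (glusterd did not respond in time)")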

##############################################################################################################################################################################################################
glusterd logs
##############################################################################################################################################################################################################
Snippet 1
##############################################################################################################################################################################################################
[2020-03-12 14:05:37.069585] I [MSGID: 106499] [glusterd-handler.c:4502:__glusterd_handle_status_volume] 0-management: Received status volume req for volume testvol_distributed-dispersed
[2020-03-12 14:05:37.163688] I [MSGID: 106539] [glusterd-utils.c:12461:glusterd_generate_and_set_task_id] 0-management: Generated task-id f06410aa-7a2c-4206-ad07-5fb47aca3582 for key rebalance-id
[2020-03-12 14:05:42.198882] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2020-03-12 14:05:42.199261] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
The message "E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict" repeated 40 times between [2020-03-12 14:05:42.199261] and [2020-03-12 14:07:18.776453]
[2020-03-12 14:07:29.445663] E [MSGID: 106061] [glusterd-utils.c:11159:glusterd_volume_rebalance_use_rsp_dict] 0-glusterd: failed to get index from rsp dict
[2020-03-12 14:09:28.620158] I [glusterd-locks.c:740:gd_mgmt_v3_unlock_timer_cbk] 0-management: unlock timer is cancelled for volume testvol_distributed-disperse
[2020-03-12 14:10:54.708894] I [MSGID: 106163] [glusterd-handshake.c:1389:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 70000
[2020-03-12 14:11:10.409238] I [MSGID: 106490] [glusterd-handler.c:2675:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 5650b20
##############################################################################################################################################################################################################

##############################################################################################################################################################################################################
Snippet 2
##############################################################################################################################################################################################################
[2020-03-12 14:12:31.550310] I [MSGID: 106502] [glusterd-handler.c:2901:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2020-03-12 14:12:55.662055] I [MSGID: 106163] [glusterd-handshake.c:1389:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 70000
[2020-03-12 14:13:08.296521] E [MSGID: 106115] [glusterd-mgmt.c:117:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on rhsauto052.lab.eng.blr.redhat.com. Please check log file for details.
[2020-03-12 14:13:08.296604] E [MSGID: 106115] [glusterd-mgmt.c:117:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on rhsauto022.lab.eng.blr.redhat.com. Please check log file for details.
[2020-03-12 14:13:08.296643] E [MSGID: 106115] [glusterd-mgmt.c:117:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on rhsauto024.lab.eng.blr.redhat.com. Please check log file for details.
[2020-03-12 14:13:08.296677] E [MSGID: 106115] [glusterd-mgmt.c:117:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on rhsauto021.lab.eng.blr.redhat.com. Please check log file for details.
[2020-03-12 14:13:10.869641] E [MSGID: 106115] [glusterd-mgmt.c:117:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on rhsauto020.lab.eng.blr.redhat.com. Please check log file for details.
[2020-03-12 14:13:10.869820] E [MSGID: 106117] [glusterd-mgmt.c:2182:glusterd_mgmt_v3_release_peer_locks] 0-management: Unlock failed on peers
[2020-03-12 14:13:10.869895] W [glusterd-locks.c:807:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf156a) [0x7feabf8a356a] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf1348) [0x7feabf8a3348] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf0688) [0x7feabf8a2688] ) 0-management: Lock for vol testvol_distributed-dispersed not held
[2020-03-12 14:13:10.869914] E [MSGID: 106117] [glusterd-locks.c:342:glusterd_mgmt_v3_unlock_entity] 0-management: Failed to release lock for vol testvol_distributed-dispersed on behalf of 1fc8fc20-a1f6-42f9-a403-02826ef934e1.
[2020-03-12 14:13:10.869926] E [MSGID: 106146] [glusterd-locks.c:457:glusterd_multiple_mgmt_v3_unlock] 0-management: Unable to unlock all vol
[2020-03-12 14:13:10.869937] E [MSGID: 106117] [glusterd-mgmt.c:2338:glusterd_mgmt_v3_initiate_all_phases_with_brickop_phase] 0-management: Failed to release mgmt_v3 locks on localhost
[2020-03-12 14:13:10.869996] I [socket.c:3844:socket_submit_outgoing_msg] 0-socket.management: not connected (priv->connected = -1)
[2020-03-12 14:13:10.870007] E [rpcsvc.c:1583:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x2, Program: GlusterD svc cli, ProgVers: 2, Proc: 11) to rpc-transport (socket.management)
[2020-03-12 14:13:10.870023] E [MSGID: 106430] [glusterd-utils.c:599:glusterd_submit_reply] 0-glusterd: Reply submission failed
[2020-03-12 14:13:49.459442] I [MSGID: 106490] [glusterd-handler.c:2675:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 5650b20d-a2b0-44bd-9abc-4597ccbf0dc5
[2020-03-12 14:14:03.720052] I [MSGID: 106493] [glusterd-handler.c:3953:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhsauto024.lab.eng.blr.redhat.com (0), ret: 0, op_ret: 0
[2020-03-12 14:14:12.851026] I [MSGID: 106492] [glusterd-handler.c:2860:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 5650b20d-a2b0-44bd-9abc-4597ccbf0dc5
[2020-03-12 14:14:12.851080] I [MSGID: 106502] [glusterd-handler.c:2901:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2020-03-12 14:14:23.258524] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 5650b20d-a2b0-44bd-9abc-4597ccbf0dc5
[2020-03-12 14:24:34.677880] I [MSGID: 106004] [glusterd-handler.c:6521:__glusterd_peer_rpc_notify] 0-management: Peer <rhsauto020.lab.eng.blr.redhat.com> (<091678e4-4723-41be-9696-1ff746e33ca8>), in state <Peer in Cluster>, has disconnected from glusterd.
[2020-03-12 14:24:34.678229] W [glusterd-locks.c:807:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x2400d) [0x7feabf7d600d] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x2e130) [0x7feabf7e0130] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf0688) [0x7feabf8a2688] ) 0-management: Lock for vol testvol_distributed-dispersed not held
[2020-03-12 14:24:34.678256] W [MSGID: 106117] [glusterd-handler.c:6542:__glusterd_peer_rpc_notify] 0-management: Lock not released for testvol_distributed-dispersed
[2020-03-12 14:26:10.328965] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 091678e4-4723-41be-9696-1ff746e33ca8, host: rhsauto020.lab.eng.blr.redhat.com, port: 0
[2020-03-12 14:26:12.456287] I [MSGID: 106492] [glusterd-handler.c:2860:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 091678e4-4723-41be-9696-1ff746e33ca8
[2020-03-12 14:26:12.456354] I [MSGID: 106502] [glusterd-handler.c:2901:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2020-03-12 14:26:33.732291] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 091678e4-4723-41be-9696-1ff746e33ca8
[2020-03-12 14:31:51.413557] I [MSGID: 106004] [glusterd-handler.c:6521:__glusterd_peer_rpc_notify] 0-management: Peer <rhsauto020.lab.eng.blr.redhat.com> (<091678e4-4723-41be-9696-1ff746e33ca8>), in state <Peer in Cluster>, has disconnected from glusterd.
[2020-03-12 14:31:51.413771] W [glusterd-locks.c:807:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x2400d) [0x7feabf7d600d] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x2e130) [0x7feabf7e0130] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0xf0688) [0x7feabf8a2688] ) 0-management: Lock for vol testvol_distributed-dispersed not held
##############################################################################################################################################################################################################


##############################################################################################################################################################################################################
pstack output:
##############################################################################################################################################################################################################
# pstack `pidof glusterd`
Thread 9 (Thread 0x7feabd7b4700 (LWP 24010)):
#0  0x00007feac4f51211 in poll () from /lib64/libc.so.6
#1  0x00007feac3edccbb in __res_context_send () from /lib64/libresolv.so.2
#2  0x00007feac3eda64a in __res_context_query () from /lib64/libresolv.so.2
#3  0x00007feabe1c25e3 in _nss_dns_gethostbyaddr2_r () from /lib64/libnss_dns.so.2
#4  0x00007feabe1c2917 in _nss_dns_gethostbyaddr_r () from /lib64/libnss_dns.so.2
#5  0x00007feac4f6e242 in gethostbyaddr_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
#6  0x00007feac4f76d55 in getnameinfo () from /lib64/libc.so.6
#7  0x00007feac6b958d6 in gf_get_hostname_from_ip () from /lib64/libglusterfs.so.0
#8  0x00007feabf805ba9 in glusterd_remote_hostname_get () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#9  0x00007feabf7dbd4e in __glusterd_handle_incoming_friend_req () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#10 0x00007feabf7d6071 in glusterd_big_locked_handler () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#11 0x00007feac692abad in rpcsvc_handle_rpc_call () from /lib64/libgfrpc.so.0
#12 0x00007feac692b21e in rpcsvc_notify () from /lib64/libgfrpc.so.0
#13 0x00007feac692d1fb in rpc_transport_notify () from /lib64/libgfrpc.so.0
#14 0x00007feabe9f904f in socket_event_poll_in (notify_handled=true, this=0x7feab802c1b0) at socket.c:2582
#15 socket_event_handler (event_thread_died=0 '\000', poll_err=0, poll_out=<optimized out>, poll_in=<optimized out>, data=0x7feab802c1b0, gen=7, idx=28, fd=<optimized out>) at socket.c:2991
#16 socket_event_handler (fd=<optimized out>, idx=28, gen=7, data=0x7feab802c1b0, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>, event_thread_died=0 '\000') at socket.c:2911
#17 0x00007feac6bea34b in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#18 0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#19 0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7feabdfb5700 (LWP 24009)):
#0  0x00007feac570f47c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007feabf89c90a in hooks_worker () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#3  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7feac04a7700 (LWP 24007)):
#0  0x00007feac4f5379f in select () from /lib64/libc.so.6
#1  0x00007feac6c01da5 in runner () from /lib64/libglusterfs.so.0
#2  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#3  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7feac0ca8700 (LWP 24006)):
#0  0x00007feac570f7ca in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007feac6bc801e in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007feac6bc8ca8 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#4  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7feac14a9700 (LWP 24005)):
#0  0x00007feac570f7ca in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007feac6bc801e in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007feac6bc8ca8 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#4  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7feac1caa700 (LWP 24004)):
#0  0x00007feac4f28548 in nanosleep () from /lib64/libc.so.6
#1  0x00007feac4f2844e in sleep () from /lib64/libc.so.6
#2  0x00007feac6bb4515 in pool_sweeper () from /lib64/libglusterfs.so.0
#3  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#4  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7feac24ab700 (LWP 24003)):
#0  0x00007feac4e986ac in sigtimedwait () from /lib64/libc.so.6
#1  0x00007feac571394c in sigwait () from /lib64/libpthread.so.0
#2  0x00005654f0cfb81b in glusterfs_sigwaiter ()
#3  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#4  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7feac2cac700 (LWP 24002)):
#0  0x00007feac5713250 in nanosleep () from /lib64/libpthread.so.0
#1  0x00007feac6b9975e in gf_timer_proc () from /lib64/libglusterfs.so.0
#2  0x00007feac57092de in start_thread () from /lib64/libpthread.so.0
#3  0x00007feac4f5c133 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7feac7088580 (LWP 24001)):
#0  0x00007feac570a7bd in __pthread_timedjoin_ex () from /lib64/libpthread.so.0
#1  0x00007feac6be9a77 in event_dispatch_epoll () from /lib64/libglusterfs.so.0
#2  0x00005654f0cfb246 in main ()
##############################################################################################################################################################################################################
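
In the pstack output above, Thread 9 (an epoll worker) is blocked in getnameinfo()/gethostbyaddr_r() inside gf_get_hostname_from_ip() while servicing __glusterd_handle_incoming_friend_req(), i.e. glusterd is stuck waiting on reverse DNS while handling a friend request. The following is a hypothetical, stand-alone Python sketch for checking reverse-lookup latency on the affected nodes; it is a diagnostic aid only, not part of glusterd or the test suite, and the peer IPs are placeholders.

# Hypothetical diagnostic: time reverse-DNS (PTR) lookups for each peer
# address to see whether slow or unresponsive DNS explains the blocked
# getnameinfo()/gethostbyaddr() call in Thread 9 above.
import socket
import time

PEER_IPS = ["10.70.0.1", "10.70.0.2"]  # placeholders; use the addresses from 'gluster peer status'

for ip in PEER_IPS:
    start = time.time()
    try:
        name, _, _ = socket.gethostbyaddr(ip)
    except OSError as err:
        name = "<unresolved: %s>" % err
    print("%s -> %s (%.1fs)" % (ip, name, time.time() - start))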

Version-Release number of selected component (if applicable):
glusterfs-6.0-30.1.el8rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Run BVT on a RHEL 8.1 setup:
https://github.com/gluster/glusto-tests/blob/master/tests/functional/bvt/test_cvt.py

Actual results:
Commands fail with "Error : Request timed out".

Expected results:
Commands shouldn't fail.

Additional info:
At first this bug might appear to be a duplicate of Bug 1810516; however, a new bug is being filed because the frequency on RHEL 8 machines is much higher. In addition, on RHEL 8 this behaviour was observed even when only a single test case was executed, which was not the case on RHEL 7. Hence the severity is being set to urgent.

