Bug 1710159

Summary:	glusterd: While upgrading (3-node cluster) 'gluster v status' times out on node to be upgraded
Product:	[Community] GlusterFS	Reporter:	Sanju <srakonde>
Component:	glusterd	Assignee:	Sanju <srakonde>
Status:	CLOSED NEXTRELEASE	QA Contact:
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	mainline	CC:	amukherj, bmekala, bugs, rallan, rhinduja, rhs-bugs, sankarshan, storage-qa-internal, vbellur
Target Milestone:	---	Keywords:	Regression
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1707246	Environment:
Last Closed:	2019-05-31 03:09:26 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1707246

Description Sanju 2019-05-15 02:47:53 UTC

Description of problem:
=======================
While performing an in-service upgrade (for geo-rep) of a 3 node cluster, 
2 nodes (from the master) upgraded successfully .
The third node which was on a previous build (3.4.4), while running 'gluster v status' it times out.

[root@dhcp41-155]# gluster v status
Error : Request timed out

Other gluster commands seem to be working on this node.

The same command on the upgraded nodes work as expected: (as shown)
[root@dhcp42-173 glusterfs]# gluster v status
Status of volume: gluster_shared_storage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.211:/var/lib/glusterd/ss_bri
ck                                          49152     0          Y       9661 
Brick 10.70.41.155:/var/lib/glusterd/ss_bri
ck                                          49155     0          Y       16101
Brick dhcp42-173.lab.eng.blr.redhat.com:/va
r/lib/glusterd/ss_brick                     49152     0          Y       4718 
Self-heal Daemon on localhost               N/A       N/A        Y       4809 
Self-heal Daemon on 10.70.41.155            N/A       N/A        Y       16524
Self-heal Daemon on 10.70.42.211            N/A       N/A        Y       9964 
 
Task Status of Volume gluster_shared_storage
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: master
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.173:/rhs/brick1/b1           49153     0          Y       4748 
Brick 10.70.42.211:/rhs/brick1/b2           49153     0          Y       9698 
Brick 10.70.41.155:/rhs/brick1/b3           49152     0          Y       15867
Brick 10.70.42.173:/rhs/brick2/b4           49154     0          Y       4757 
Brick 10.70.42.211:/rhs/brick2/b5           49154     0          Y       9707 
Brick 10.70.41.155:/rhs/brick2/b6           49153     0          Y       15888
Brick 10.70.42.173:/rhs/brick3/b7           49155     0          Y       4764 
Brick 10.70.42.211:/rhs/brick3/b8           49155     0          Y       9722 
Brick 10.70.41.155:/rhs/brick3/b9           49154     0          Y       15909
Self-heal Daemon on localhost               N/A       N/A        Y       4809 
Self-heal Daemon on 10.70.42.211            N/A       N/A        Y       9964 
Self-heal Daemon on 10.70.41.155            N/A       N/A        Y       16524
 
Task Status of Volume master




Error messages in glusterd log (on the 3.4.4 node) are as follows:
------------------------------------------------------------------
[2019-05-07 06:29:06.143535] E [rpc-clnt.c:185:call_bail] 0-management: bailing out frame type(glusterd mgmt) op(--(4)) xid = 0x19 sent = 2019-05-07 06:18:58.538764. timeout = 600 for 10.70.42.173:24007
[2019-05-07 06:29:06.143630] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Commit failed on dhcp42-173.lab.eng.blr.redhat.com. Please check log file for details.
[2019-05-07 06:29:06.144183] I [socket.c:3699:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2019-05-07 06:29:06.144234] E [rpcsvc.c:1573:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x2, Program: GlusterD svc cli, ProgVers: 2, Proc: 27) to rpc-transport (socket.management)
[2019-05-07 06:29:06.144279] E [MSGID: 106430] [glusterd-utils.c:560:glusterd_submit_reply] 0-glusterd: Reply submission failed
[2019-05-07 06:42:46.327616] E [rpc-clnt.c:185:call_bail] 0-management: bailing out frame type(glusterd mgmt) op(--(4)) xid = 0x1b sent = 2019-05-07 06:32:44.342818. timeout = 600 for 10.70.42.173:24007
[2019-05-07 06:42:46.327901] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Commit failed on dhcp42-173.lab.eng.blr.redhat.com. Please check log file for details.
[2019-05-07 06:42:50.328686] E [rpc-clnt.c:185:call_bail] 0-management: bailing out frame type(glusterd mgmt) op(--(4)) xid = 0x1a sent = 2019-05-07 06:32:44.342952. timeout = 600 for 10.70.42.211:24007
[2019-05-07 06:42:50.328839] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Commit failed on 10.70.42.211. Please check log file for details.
[2019-05-07 06:42:50.329321] I [socket.c:3699:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2019-05-07 06:42:50.329356] E [rpcsvc.c:1573:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x2, Program: GlusterD svc cli, ProgVers: 2, Proc: 27) to rpc-transport (socket.management)
[2019-05-07 06:42:50.329402] E [MSGID: 106430] [glusterd-utils.c:560:glusterd_submit_reply] 0-glusterd: Reply submission failed


On the upgraded node: (glusterd log)
----------------------
[2019-05-07 06:18:58.535711] E [glusterd-op-sm.c:8193:glusterd_op_sm] (-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x24c8e) [0x7f6ae684ec8e] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x1d05e) [0x7f6ae684705e] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x444ff) [0x7f6ae686e4ff] ) 0-management: Unable to get transaction opinfo for transaction ID :bc0a5ca5-f3a1-4c27-b263-d2d34289cbe3
[2019-05-07 06:32:44.339866] E [glusterd-op-sm.c:8193:glusterd_op_sm] (-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x24c8e) [0x7f6ae684ec8e] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x1d05e) [0x7f6ae684705e] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x444ff) [0x7f6ae686e4ff] ) 0-management: Unable to get transaction opinfo for transaction ID :949c5d20-fb42-4d4d-8175-3e3dc158d310
[2019-05-07 06:43:02.694136] E [glusterd-op-sm.c:8193:glusterd_op_sm] (-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x24c8e) [0x7f6ae684ec8e] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x1d05e) [0x7f6ae684705e] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x444ff) [0x7f6ae686e4ff] ) 0-management: Unable to get transaction opinfo for transaction ID :d150e923-3749-439a-ae3a-82ce997c8608




Version-Release number of selected component (if applicable):
==============================================================

UPGRADED NODE ==> [root@dhcp42-173]# rpm -qa | grep gluster
python2-gluster-6.0-2.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7_6.7.x86_64
glusterfs-cli-6.0-2.el7rhgs.x86_64
glusterfs-6.0-2.el7rhgs.x86_64
glusterfs-fuse-6.0-2.el7rhgs.x86_64
glusterfs-geo-replication-6.0-2.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-libs-6.0-2.el7rhgs.x86_64
glusterfs-api-6.0-2.el7rhgs.x86_64
glusterfs-events-6.0-2.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-12.el7rhgs.noarch
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-server-6.0-2.el7rhgs.x86_64
glusterfs-client-xlators-6.0-2.el7rhgs.x86_64
glusterfs-rdma-6.0-2.el7rhgs.x86_64


NODE TO BE UPGRADED ==> [root@dhcp41-155]# rpm -qa | grep gluster
glusterfs-events-3.12.2-47.el7rhgs.x86_64
glusterfs-rdma-3.12.2-47.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7_6.7.x86_64
glusterfs-cli-3.12.2-47.el7rhgs.x86_64
glusterfs-fuse-3.12.2-47.el7rhgs.x86_64
glusterfs-server-3.12.2-47.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-libs-3.12.2-47.el7rhgs.x86_64
glusterfs-3.12.2-47.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-47.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-12.el7rhgs.noarch
glusterfs-client-xlators-3.12.2-47.el7rhgs.x86_64
glusterfs-api-3.12.2-47.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
python2-gluster-3.12.2-47.el7rhgs.x86_64


How reproducible:
=================
1/1


Steps to Reproduce:
====================
In-service upgrade (geo-rep)

Actual results:
===============
While performing an in-service upgrade on a 3 node cluster, the third node which was on a previous build (3.4.4), while running 'gluster v status' it times out.

Expected results:
=================
'gluster v status' should not time out

Additional info:
===============
In-service upgrade was still in progress and hence cluster.op-version is still on the previous op-version. 
The cluster is essentially in the middle of an in-service upgrade scenario.

--- Additional comment from Rochelle on 2019-05-07 13:00:30 IST ---

I've sent a mail with the access and credentials to my systems.

--- Additional comment from Rochelle on 2019-05-07 14:25:32 IST ---

pstack <gluster pid> outputs from each node:

Node 1: (Upgraded node)
-----------------------

[root@dhcp42-173 ~]# pstack 4457
Thread 9 (Thread 0x7f6ae9d10700 (LWP 4458)):
#0  0x00007f6af1578e3d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f6af27421c6 in gf_timer_proc (data=0x563bd3637ca0) at timer.c:194
#2  0x00007f6af1571dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f6af0e38ead in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7f6ae950f700 (LWP 4459)):
#0  0x00007f6af1579361 in sigwait () from /lib64/libpthread.so.0
#1  0x0000563bd286143b in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2370
#2  0x00007f6af1571dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f6af0e38ead in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7f6ae8d0e700 (LWP 4460)):
#0  0x00007f6af0dffe2d in nanosleep () from /lib64/libc.so.6
#1  0x00007f6af0dffcc4 in sleep () from /lib64/libc.so.6
#2  0x00007f6af275f5bd in pool_sweeper (arg=<optimized out>) at mem-pool.c:473
#3  0x00007f6af1571dd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f6af0e38ead in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f6ae850d700 (LWP 4461)):
#0  0x00007f6af1575d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f6af2774800 in syncenv_task (proc=proc@entry=0x563bd3638430) at syncop.c:612
#2  0x00007f6af27756b0 in syncenv_processor (thdata=0x563bd3638430) at syncop.c:679
#3  0x00007f6af1571dd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f6af0e38ead in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f6ae7d0c700 (LWP 4462)):
#0  0x00007f6af1575d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f6af2774800 in syncenv_task (proc=proc@entry=0x563bd36387f0) at syncop.c:612
#2  0x00007f6af27756b0 in syncenv_processor (thdata=0x563bd36387f0) at syncop.c:679
#3  0x00007f6af1571dd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f6af0e38ead in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f6ae750b700 (LWP 4463)):
#0  0x00007f6af0e2ff73 in select () from /lib64/libc.so.6
#1  0x00007f6af27b37f4 in runner (arg=0x563bd363c590) at ../../contrib/timer-wheel/timer-wheel.c:186
#2  0x00007f6af1571dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f6af0e38ead in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f6ae3b28700 (LWP 4654)):
#0  0x00007f6af1575965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f6ae6917f2b in hooks_worker (args=<optimized out>) at glusterd-hooks.c:527
#2  0x00007f6af1571dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f6af0e38ead in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f6ae3327700 (LWP 4655)):
#0  0x00007f6af0e39483 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f6af2799050 in event_dispatch_epoll_worker (data=0x563bd3759870) at event-epoll.c:751
#2  0x00007f6af1571dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f6af0e38ead in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f6af2c31780 (LWP 4457)):
#0  0x00007f6af1572f47 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f6af2798468 in event_dispatch_epoll (event_pool=0x563bd362f5b0) at event-epoll.c:846
#2  0x0000563bd285d9b5 in main (argc=5, argv=<optimized out>) at glusterfsd.c:2866



Node 2 : (upgraded node)
------------------------

[root@dhcp42-211 ~]# pstack 4493
Thread 9 (Thread 0x7f9f5c2c2700 (LWP 4495)):
#0  0x00007f9f63b2ae3d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f9f64cf41c6 in gf_timer_proc (data=0x562895ff8ca0) at timer.c:194
#2  0x00007f9f63b23dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f9f633eaead in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7f9f5bac1700 (LWP 4496)):
#0  0x00007f9f63b2b361 in sigwait () from /lib64/libpthread.so.0
#1  0x00005628944ba43b in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2370
#2  0x00007f9f63b23dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f9f633eaead in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7f9f5b2c0700 (LWP 4497)):
#0  0x00007f9f633b1e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007f9f633b1cc4 in sleep () from /lib64/libc.so.6
#2  0x00007f9f64d115bd in pool_sweeper (arg=<optimized out>) at mem-pool.c:473
#3  0x00007f9f63b23dd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f9f633eaead in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f9f5aabf700 (LWP 4498)):
#0  0x00007f9f63b27d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f9f64d26800 in syncenv_task (proc=proc@entry=0x562895ff9430) at syncop.c:612
#2  0x00007f9f64d276b0 in syncenv_processor (thdata=0x562895ff9430) at syncop.c:679
#3  0x00007f9f63b23dd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f9f633eaead in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f9f5a2be700 (LWP 4499)):
#0  0x00007f9f63b27d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f9f64d26800 in syncenv_task (proc=proc@entry=0x562895ff97f0) at syncop.c:612
#2  0x00007f9f64d276b0 in syncenv_processor (thdata=0x562895ff97f0) at syncop.c:679
#3  0x00007f9f63b23dd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f9f633eaead in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f9f59abd700 (LWP 4500)):
#0  0x00007f9f633e1f73 in select () from /lib64/libc.so.6
#1  0x00007f9f64d657f4 in runner (arg=0x562895ffd590) at ../../contrib/timer-wheel/timer-wheel.c:186
#2  0x00007f9f63b23dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f9f633eaead in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f9f560da700 (LWP 4693)):
#0  0x00007f9f63b27965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f9f58ec9f2b in hooks_worker (args=<optimized out>) at glusterd-hooks.c:527
#2  0x00007f9f63b23dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f9f633eaead in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f9f558d9700 (LWP 4694)):
#0  0x00007f9f633eb483 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f9f64d4b050 in event_dispatch_epoll_worker (data=0x562896119ae0) at event-epoll.c:751
#2  0x00007f9f63b23dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f9f633eaead in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f9f651e3780 (LWP 4493)):
#0  0x00007f9f63b24f47 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f9f64d4a468 in event_dispatch_epoll (event_pool=0x562895ff05b0) at event-epoll.c:846
#2  0x00005628944b69b5 in main (argc=5, argv=<optimized out>) at glusterfsd.c:2866



Node 3 (Node to be upgraded)
-----------------------------
[root@dhcp41-155 yum.repos.d]# pstack 3942
Thread 8 (Thread 0x7f468c4a2700 (LWP 3943)):
#0  0x00007f4693d0ae3d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f4694ed5f86 in gf_timer_proc (data=0x556b28add800) at timer.c:174
#2  0x00007f4693d03dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f46935caead in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7f468bca1700 (LWP 3944)):
#0  0x00007f4693d0b361 in sigwait () from /lib64/libpthread.so.0
#1  0x0000556b26f90c7b in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2242
#2  0x00007f4693d03dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f46935caead in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f468b4a0700 (LWP 3945)):
#0  0x00007f4693591e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007f4693591cc4 in sleep () from /lib64/libc.so.6
#2  0x00007f4694ef0b9d in pool_sweeper (arg=<optimized out>) at mem-pool.c:470
#3  0x00007f4693d03dd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f46935caead in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f468ac9f700 (LWP 3946)):
#0  0x00007f4693d07d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f4694f03dd8 in syncenv_task (proc=proc@entry=0x556b28ade020) at syncop.c:603
#2  0x00007f4694f04ca0 in syncenv_processor (thdata=0x556b28ade020) at syncop.c:695
#3  0x00007f4693d03dd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f46935caead in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f468a49e700 (LWP 3947)):
#0  0x00007f4693d07d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f4694f03dd8 in syncenv_task (proc=proc@entry=0x556b28ade3e0) at syncop.c:603
#2  0x00007f4694f04ca0 in syncenv_processor (thdata=0x556b28ade3e0) at syncop.c:695
#3  0x00007f4693d03dd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f46935caead in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f46852f1700 (LWP 4320)):
#0  0x00007f4693d07965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f46898b783b in hooks_worker (args=<optimized out>) at glusterd-hooks.c:529
#2  0x00007f4693d03dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f46935caead in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f4684af0700 (LWP 4321)):
#0  0x00007f46935cb483 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f4694f266e8 in event_dispatch_epoll_worker (data=0x556b28af8780) at event-epoll.c:749
#2  0x00007f4693d03dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f46935caead in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f46953ad780 (LWP 3942)):
#0  0x00007f4693d04f47 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f4694f26f78 in event_dispatch_epoll (event_pool=0x556b28ad5a70) at event-epoll.c:846
#2  0x0000556b26f8d538 in main (argc=5, argv=<optimized out>) at glusterfsd.c:2692

--- Additional comment from Atin Mukherjee on 2019-05-07 14:42:05 IST ---

I don't see any threads being stuck. Until and unless this problem is reproducible, nothing can be done here. Does the volume status still times out? If so, a gdb session can be used and the problem can be debugged.

--- Additional comment from Rochelle on 2019-05-08 10:04:12 IST ---

Yes Atin, the issue still persists. 
I have sent the credentials of the systems in an email.

--- Additional comment from Rochelle on 2019-05-08 11:16:09 IST ---

Hit this issue again but differently. This time on RHGS 3.5 nodes

1. Ran the gdeploy conf file to create a geo-rep session. 
2. Once the run was successful, I ran a 'gluster v geo-rep status' on the 
   master node (to see if creation was successfull)and this timed out as shown below:

[root@dhcp42-131 ~]# gluster v geo-rep status
Error : Request timed out
geo-replication command failed


3. When I ran the full (specific) ge-rep status, it worked as expected. 

[root@dhcp42-131 ~]# gluster v geo-replication master 10.70.42.250::slave status
 
MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED                  
-----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.42.131    master        /rhs/brick1/b1    root          10.70.42.250::slave    10.70.42.250    Active     Changelog Crawl    2019-05-08 05:37:20          
10.70.42.131    master        /rhs/brick2/b4    root          10.70.42.250::slave    10.70.42.250    Active     Changelog Crawl    2019-05-08 05:37:20          
10.70.42.131    master        /rhs/brick2/b7    root          10.70.42.250::slave    10.70.42.250    Active     Changelog Crawl    2019-05-08 05:37:20          
10.70.42.255    master        /rhs/brick1/b3    root          10.70.42.250::slave    10.70.43.245    Passive    N/A                N/A                          
10.70.42.255    master        /rhs/brick2/b6    root          10.70.42.250::slave    10.70.43.245    Passive    N/A                N/A                          
10.70.42.255    master        /rhs/brick2/b9    root          10.70.42.250::slave    10.70.43.245    Passive    N/A                N/A                          
10.70.42.14     master        /rhs/brick1/b2    root          10.70.42.250::slave    10.70.41.224    Passive    N/A                N/A                          
10.70.42.14     master        /rhs/brick2/b5    root          10.70.42.250::slave    10.70.41.224    Passive    N/A                N/A                          
10.70.42.14     master        /rhs/brick2/b8    root          10.70.42.250::slave    10.70.41.224    Passive    N/A                N/A

--- Additional comment from Bala Konda Reddy M on 2019-05-09 11:27:14 IST ---

Atin/Sanju,

While upgarding my cluster(6 nodes), I am able to hit the issue. Upgraded first node and checked volume status and heal info working fine on upgraded node.
On node(2nd node) which is still in 3.4.4, gluster volume status is in hung state.

Regards,
Bala

Comment 1 Worker Ant 2019-05-15 02:51:26 UTC

REVIEW: https://review.gluster.org/22730 (glusterd: add an op-version check) posted (#1) for review on master by Sanju Rakonde

Comment 2 Sanju 2019-05-15 06:13:49 UTC

Root cause:
With commit 34e010d64, we have added some conditions to set txn-opinfo to avoid
the memory leak in txn-opinfo object. But, in a heterogeneous cluster the
upgraded and non-upgraded nodes are following different conditions to set
txn-opinfo. This is leading the get-txn-opinfo operation to fail and eventually
the process hungs.

[root@server2 glusterfs]# git show 34e010d64
commit 34e010d64905b7387de57840d3fb16a326853c9b
Author: Atin Mukherjee <amukherj>
Date:   Mon Mar 18 16:08:04 2019 +0530

    glusterd: fix txn-id mem leak

    This commit ensures the following:
    1. Don't send commit op request to the remote nodes when gluster v
    status all is executed as for the status all transaction the local
    commit gets the name of the volumes and remote commit ops are
    technically a no-op. So no need for additional rpc requests.
    2. In op state machine flow, if the transaction is in staged state and
    op_info.skip_locking is true, then no need to set the txn id in the
    priv->glusterd_txn_opinfo dictionary which never gets freed.

    Fixes: bz#1691164
    Change-Id: Ib6a9300ea29633f501abac2ba53fb72ff648c822
    Signed-off-by: Atin Mukherjee <amukherj>

diff --git a/xlators/mgmt/glusterd/src/glusterd-op-sm.c
b/xlators/mgmt/glusterd/src/glusterd-op-sm.c
index 6495a9d88..84c34f1fe 100644
--- a/xlators/mgmt/glusterd/src/glusterd-op-sm.c
+++ b/xlators/mgmt/glusterd/src/glusterd-op-sm.c
@@ -5652,6 +5652,9 @@ glusterd_op_ac_stage_op(glusterd_op_sm_event_t *event,
void *ctx)
     dict_t *dict = NULL;
     xlator_t *this = NULL;
     uuid_t *txn_id = NULL;
+    glusterd_op_info_t txn_op_info = {
+        {0},
+    };

     this = THIS;
     GF_ASSERT(this);
@@ -5686,6 +5689,7 @@ glusterd_op_ac_stage_op(glusterd_op_sm_event_t *event,
void *ctx)
         ret = -1;
         goto out;
     }
+    ret = glusterd_get_txn_opinfo(&event->txn_id, &txn_op_info);

     ret = dict_set_bin(rsp_dict, "transaction_id", txn_id, sizeof(*txn_id));
     if (ret) {
@@ -5704,6 +5708,12 @@ out:

     gf_msg_debug(this->name, 0, "Returning with %d", ret);

+    /* for no volname transactions, the txn_opinfo needs to be cleaned up
+     * as there's no unlock event triggered
+     */
+    if (txn_op_info.skip_locking)
+        ret = glusterd_clear_txn_opinfo(txn_id);
+
     if (rsp_dict)
         dict_unref(rsp_dict);

@@ -8159,12 +8169,16 @@ glusterd_op_sm()
                            "Unable to clear "
                            "transaction's opinfo");
             } else {
-                ret = glusterd_set_txn_opinfo(&event->txn_id, &opinfo);
-                if (ret)
-                    gf_msg(this->name, GF_LOG_ERROR, 0,
-                           GD_MSG_TRANS_OPINFO_SET_FAIL,
-                           "Unable to set "
-                           "transaction's opinfo");
+                if (!(event_type == GD_OP_EVENT_STAGE_OP &&
+                      opinfo.state.state == GD_OP_STATE_STAGED &&
+                      opinfo.skip_locking)) {     <---- now, upgraded nodes
will not set txn-opinfo when this condition is false, so the
glusterd_get_txn_opinfo() after this is failing. previously we used to set
txn-opinfo in every state of op-sm and glusterd_get_txn_opinfo will be called
in every phase. We need to add an op-version check for this change.
+                    ret = glusterd_set_txn_opinfo(&event->txn_id, &opinfo);
+                    if (ret)
+                        gf_msg(this->name, GF_LOG_ERROR, 0,
+                               GD_MSG_TRANS_OPINFO_SET_FAIL,
+                               "Unable to set "
+                               "transaction's opinfo");
+                }
             }

             glusterd_destroy_op_event_ctx(event);
diff --git a/xlators/mgmt/glusterd/src/glusterd-syncop.c
b/xlators/mgmt/glusterd/src/glusterd-syncop.c
index 45b221c2e..9bab2cfd5 100644
--- a/xlators/mgmt/glusterd/src/glusterd-syncop.c
+++ b/xlators/mgmt/glusterd/src/glusterd-syncop.c
@@ -1392,6 +1392,8 @@ gd_commit_op_phase(glusterd_op_t op, dict_t *op_ctx,
dict_t *req_dict,
     char *errstr = NULL;
     struct syncargs args = {0};
     int type = GF_QUOTA_OPTION_TYPE_NONE;
+    uint32_t cmd = 0;
+    gf_boolean_t origin_glusterd = _gf_false;

     this = THIS;
     GF_ASSERT(this);
@@ -1449,6 +1451,20 @@ commit_done:
     gd_syncargs_init(&args, op_ctx);
     synctask_barrier_init((&args));
     peer_cnt = 0;
+    origin_glusterd = is_origin_glusterd(req_dict);
+
+    if (op == GD_OP_STATUS_VOLUME) {
+        ret = dict_get_uint32(req_dict, "cmd", &cmd);
+        if (ret)
+            goto out;
+
+        if (origin_glusterd) {
+            if ((cmd & GF_CLI_STATUS_ALL)) {
+                ret = 0;
+                goto out;
+            }
+        }
+    }

     RCU_READ_LOCK;
     cds_list_for_each_entry_rcu(peerinfo, &conf->peers, uuid_list)
(END)

Thanks,
Sanju

Comment 3 Atin Mukherjee 2019-05-31 03:07:39 UTC

One correction to the above root cause is process never hangs here. It's just that the response back to the cli is not sent due to the incorrect state machine movement depending on the event trigger.

Comment 4 Worker Ant 2019-05-31 03:09:26 UTC

REVIEW: https://review.gluster.org/22730 (glusterd: add an op-version check) merged (#3) on master by Atin Mukherjee