Description of problem:
=======================
Had a 6-node cluster with a couple of distribute/dist-replicate volumes. Enabled brick-multiplexing and created a 1 x (4+2) disperse volume. Attached a plain distribute 4 x 1 volume as hot tier. Created a few more distribute/dist-rep volumes even after that, and continued with testing. At the end of the day, disabled brick-multiplexing and did not do anything further.

When the tier volume was assessed again after a few days, it was noticed that all the tier daemons were in failed state. The logs (the older ones, pointing to the day the testing was done on that setup) show a failed socket connection eventually resulting in failure of the daemon.

Sosreports will be copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/

Version-Release number of selected component (if applicable):
=============================================================
3.8.4-23

How reproducible:
=================
1:1

Steps to Reproduce:
===================
Nothing more than what is already mentioned in the description. However, I'll be trying out a few scenarios with tier volumes and brick-multiplexing to see if there are any sure-shot steps to reproduce this.
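For reference, the sequence described above can be sketched with the gluster CLI. This is a sketch only: hostnames and brick paths are placeholders (the actual setup used the 10.70.47.x nodes and /bricks/brick3, /bricks/brick4 paths shown in the volume info below), and it requires a running 6-node trusted storage pool.

```shell
# Sketch of the setup from the description -- placeholder hosts/paths.

# 1. Enable brick multiplexing cluster-wide (global option, hence "all")
gluster volume set all cluster.brick-multiplex on

# 2. Create and start a 1 x (4+2) disperse volume
gluster volume create disp disperse-data 4 redundancy 2 \
    host1:/bricks/brick3/disp_0 host2:/bricks/brick3/disp_1 \
    host3:/bricks/brick3/disp_2 host4:/bricks/brick3/disp_3 \
    host5:/bricks/brick3/disp_4 host6:/bricks/brick3/disp_5
gluster volume start disp

# 3. Attach a plain distribute 4 x 1 hot tier
gluster volume tier disp attach \
    host1:/bricks/brick4/disp_tier0 host2:/bricks/brick4/disp_tier1 \
    host3:/bricks/brick4/disp_tier2 host4:/bricks/brick4/disp_tier3

# 4. At the end of the day: disable brick multiplexing again
gluster volume set all cluster.brick-multiplex off
```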
Additional info:
================
tier-logs
---------
[2017-04-25 07:36:33.543874] I [glusterfsd-mgmt.c:2150:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2017-04-25 07:36:36.736800] W [glusterfsd.c:1288:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f5487964dc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7f5488ffbf05] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7f5488ffbd6b] ) 0-: received signum (15), shutting down
[2017-04-25 08:23:36.195081] I [MSGID: 100030] [glusterfsd.c:2417:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/disp --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *dht.readdir-optimize=on --xlator-option *tier-dht.xattr-name=trusted.tier.tier-dht --xlator-option *dht.rebalance-cmd=6 --xlator-option *dht.node-uuid=49610061-1788-4cbc-9205-0e59fe91d842 --xlator-option *dht.commit-hash=0 --socket-file /var/run/gluster/gluster-tier-ca8ba15e-1c0e-463c-b041-76bca48b0330.sock --pid-file /var/lib/glusterd/vols/disp/tier/49610061-1788-4cbc-9205-0e59fe91d842.pid -l /var/log/glusterfs/disp-tier.log)
[2017-04-25 08:23:36.217403] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-04-25 08:23:36.217511] E [socket.c:2318:socket_connect_finish] 0-glusterfs: connection to ::1:24007 failed (Connection refused); disconnecting socket
[2017-04-25 08:23:36.217543] I [glusterfsd-mgmt.c:2129:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost
...
[2017-04-25 08:23:46.628316] I [MSGID: 114020] [client.c:2356:notify] 0-disp-client-3: parent translators are ready, attempting connect on transport
[2017-04-25 08:23:46.641409] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2017-04-25 08:23:46.643834] E [MSGID: 114058] [client-handshake.c:1537:client_query_portmap_cbk] 0-disp-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2017-04-25 08:23:46.643915] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-disp-client-0: disconnected from disp-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-04-25 08:23:46.644158] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-disp-client-1: changing port to 49153 (from 0)
[2017-04-25 08:23:46.644268] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-disp-client-2: changing port to 49153 (from 0)
[2017-04-25 08:23:46.648628] I [MSGID: 114020] [client.c:2356:notify] 0-disp-client-4: parent translators are ready, attempting connect on transport
...
[2017-04-25 08:23:50.265413] I [MSGID: 0] [dht-rebalance.c:3730:gf_defrag_total_file_cnt] 0-disp-tier-dht: Total number of files = 75
[2017-04-25 08:23:50.265437] E [MSGID: 0] [dht-rebalance.c:3893:gf_defrag_start_crawl] 0-disp-tier-dht: Failed to get the total number of files. Unable to estimate time to complete rebalance.
[2017-04-25 08:23:50.265844] I [dht-rebalance.c:3938:gf_defrag_start_crawl] 0-DHT: Thread[0] creation successful
[2017-04-25 08:23:50.265918] I [dht-rebalance.c:3938:gf_defrag_start_crawl] 0-DHT: Thread[1] creation successful

[root@dhcp47-121 ~]# gluster pool list
UUID                                    Hostname                                State
a0557927-4e5e-4ff7-8dce-94873f867707    dhcp47-113.lab.eng.blr.redhat.com       Connected
c0dac197-5a4d-4db7-b709-dbf8b8eb0896    dhcp47-114.lab.eng.blr.redhat.com       Connected
f828fdfa-e08f-4d12-85d8-2121cafcf9d0    dhcp47-115.lab.eng.blr.redhat.com       Connected
a96e0244-b5ce-4518-895c-8eb453c71ded    dhcp47-116.lab.eng.blr.redhat.com       Connected
17eb3cef-17e7-4249-954b-fc19ec608304    dhcp47-117.lab.eng.blr.redhat.com       Connected
49610061-1788-4cbc-9205-0e59fe91d842    localhost                               Connected

[root@dhcp47-121 ~]# gluster v list
disp
dist
distrep
distrep2
distrep3

[root@dhcp47-121 ~]# gluster v info disp

Volume Name: disp
Type: Tier
Volume ID: ca8ba15e-1c0e-463c-b041-76bca48b0330
Status: Started
Snapshot Count: 0
Number of Bricks: 10
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distribute
Number of Bricks: 4
Brick1: 10.70.47.115:/bricks/brick4/disp_tier3
Brick2: 10.70.47.114:/bricks/brick4/disp_tier2
Brick3: 10.70.47.113:/bricks/brick4/disp_tier1
Brick4: 10.70.47.121:/bricks/brick4/disp_tier0
Cold Tier:
Cold Tier Type : Disperse
Number of Bricks: 1 x (4 + 2) = 6
Brick5: 10.70.47.121:/bricks/brick3/disp_0
Brick6: 10.70.47.113:/bricks/brick3/disp_1
Brick7: 10.70.47.114:/bricks/brick3/disp_2
Brick8: 10.70.47.115:/bricks/brick3/disp_3
Brick9: 10.70.47.116:/bricks/brick3/disp_4
Brick10: 10.70.47.117:/bricks/brick3/disp_5
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
features.bitrot: on
features.scrub: Active
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.scrub-freq: hourly
performance.stat-prefetch: on
features.ctr-enabled: on
cluster.tier-mode: cache
cluster.brick-multiplex: disable

[root@dhcp47-121 ~]# gluster v status disp
Status of volume: disp
Gluster process                                         TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.70.47.115:/bricks/brick4/disp_tier3            49152     0          Y       14509
Brick 10.70.47.114:/bricks/brick4/disp_tier2            49152     0          Y       16405
Brick 10.70.47.113:/bricks/brick4/disp_tier1            49152     0          Y       16834
Brick 10.70.47.121:/bricks/brick4/disp_tier0            49152     0          Y       23606
Cold Bricks:
Brick 10.70.47.121:/bricks/brick3/disp_0                49153     0          Y       23612
Brick 10.70.47.113:/bricks/brick3/disp_1                49153     0          Y       16841
Brick 10.70.47.114:/bricks/brick3/disp_2                49153     0          Y       16406
Brick 10.70.47.115:/bricks/brick3/disp_3                49153     0          Y       14515
Brick 10.70.47.116:/bricks/brick3/disp_4                49152     0          Y       30238
Brick 10.70.47.117:/bricks/brick3/disp_5                49152     0          Y       25198
Self-heal Daemon on localhost                           N/A       N/A        Y       23558
Quota Daemon on localhost                               N/A       N/A        Y       23567
Bitrot Daemon on localhost                              N/A       N/A        Y       13552
Scrubber Daemon on localhost                            N/A       N/A        Y       13566
Self-heal Daemon on dhcp47-113.lab.eng.blr.redhat.com   N/A       N/A        Y       16790
Quota Daemon on dhcp47-113.lab.eng.blr.redhat.com       N/A       N/A        Y       16804
Bitrot Daemon on dhcp47-113.lab.eng.blr.redhat.com      N/A       N/A        Y       27284
Scrubber Daemon on dhcp47-113.lab.eng.blr.redhat.com    N/A       N/A        Y       27298
Self-heal Daemon on dhcp47-114.lab.eng.blr.redhat.com   N/A       N/A        Y       16365
Quota Daemon on dhcp47-114.lab.eng.blr.redhat.com       N/A       N/A        Y       16374
Bitrot Daemon on dhcp47-114.lab.eng.blr.redhat.com      N/A       N/A        Y       27085
Scrubber Daemon on dhcp47-114.lab.eng.blr.redhat.com    N/A       N/A        Y       27099
Self-heal Daemon on dhcp47-115.lab.eng.blr.redhat.com   N/A       N/A        Y       14469
Quota Daemon on dhcp47-115.lab.eng.blr.redhat.com       N/A       N/A        Y       14478
Bitrot Daemon on dhcp47-115.lab.eng.blr.redhat.com      N/A       N/A        Y       24984
Scrubber Daemon on dhcp47-115.lab.eng.blr.redhat.com    N/A       N/A        Y       24998
Self-heal Daemon on dhcp47-117.lab.eng.blr.redhat.com   N/A       N/A        Y       25143
Quota Daemon on dhcp47-117.lab.eng.blr.redhat.com       N/A       N/A        Y       25152
Bitrot Daemon on dhcp47-117.lab.eng.blr.redhat.com      N/A       N/A        Y       2447
Scrubber Daemon on dhcp47-117.lab.eng.blr.redhat.com    N/A       N/A        Y       2460
Self-heal Daemon on dhcp47-116.lab.eng.blr.redhat.com   N/A       N/A        Y       30198
Quota Daemon on dhcp47-116.lab.eng.blr.redhat.com       N/A       N/A        Y       30207
Bitrot Daemon on dhcp47-116.lab.eng.blr.redhat.com      N/A       N/A        Y       8011
Scrubber Daemon on dhcp47-116.lab.eng.blr.redhat.com    N/A       N/A        Y       8024

Task Status of Volume disp
------------------------------------------------------------------------------
Task                 : Tier migration
ID                   : 31a36238-7edc-46d5-8eea-3bf8d25a2599
Status               : in progress

[root@dhcp47-121 ~]# gluster v tier disp status
Node                                Promoted files   Demoted files   Status
---------                           ---------        ---------       ---------
localhost                           0                0               in progress
dhcp47-113.lab.eng.blr.redhat.com   0                0               failed
dhcp47-114.lab.eng.blr.redhat.com   0                0               failed
dhcp47-115.lab.eng.blr.redhat.com   0                0               failed
dhcp47-116.lab.eng.blr.redhat.com   0                0               failed
dhcp47-117.lab.eng.blr.redhat.com   0                0               failed
Tiering Migration Functionality: disp: success

[root@dhcp47-121 ~]# rpm -qa | grep gluster
glusterfs-api-3.8.4-23.el7rhgs.x86_64
glusterfs-3.8.4-23.el7rhgs.x86_64
glusterfs-cli-3.8.4-23.el7rhgs.x86_64
glusterfs-fuse-3.8.4-23.el7rhgs.x86_64
glusterfs-server-3.8.4-23.el7rhgs.x86_64
glusterfs-events-3.8.4-23.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-client-xlators-3.8.4-23.el7rhgs.x86_64
python-gluster-3.8.4-23.el7rhgs.noarch
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-23.el7rhgs.x86_64
glusterfs-rdma-3.8.4-23.el7rhgs.x86_64
glusterfs-libs-3.8.4-23.el7rhgs.x86_64
[root@dhcp47-121 ~]#
[qe@rhsqe-repo 1447920]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1447920]$ pwd
/home/repo/sosreports/1447920
[qe@rhsqe-repo 1447920]$ ll
total 279940
-rwxr-xr-x. 1 qe qe 47589384 May  4 15:48 sosreport-sysreg-prod_dhcp47-113-20170504044728.tar.xz
-rwxr-xr-x. 1 qe qe 48654484 May  4 15:48 sosreport-sysreg-prod_dhcp47-114-20170504044728.tar.xz
-rwxr-xr-x. 1 qe qe 48254232 May  4 15:48 sosreport-sysreg-prod_dhcp47-115-20170504044728.tar.xz
-rwxr-xr-x. 1 qe qe 45685132 May  4 15:48 sosreport-sysreg-prod_dhcp47-116-20170504044728.tar.xz
-rwxr-xr-x. 1 qe qe 46089020 May  4 15:48 sosreport-sysreg-prod_dhcp47-117-20170504044728.tar.xz
-rwxr-xr-x. 1 qe qe 50375272 May  4 15:48 sosreport-sysreg-prod_dhcp47-121-20170504044728.tar.xz
[qe@rhsqe-repo 1447920]$
RCA:
From the logs, it can be seen that brick multiplexing was enabled, a volume was created and converted into a tiered volume, and multiplexing was then disabled, after which the upgrade was done. After the upgrade, tierd did not come up on the node, as it was unable to connect to its subvolumes:

[2017-04-25 08:23:46.587380] E [MSGID: 114058] [client-handshake.c:1537:client_query_portmap_cbk] 0-disp-client-6: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.

The daemon kept retrying the connection to glusterd for a while and, as it was unable to connect, it got a child-down event:

[2017-04-25 08:23:25.458734] E [socket.c:2318:socket_connect_finish] 0-glusterfs: connection to ::1:24007 failed (Connection refused); disconnecting socket

This patch fixes the socket issue: https://review.gluster.org/#/c/17101/
Upstream patches:
https://review.gluster.org/#/q/topic:bug-1444596

Downstream patches:
https://code.engineering.redhat.com/gerrit/#/c/105595/
https://code.engineering.redhat.com/gerrit/#/c/105596/
Build: 3.8.4-36

Enabled brick mux, created a volume, and attached a tier. Disabled brick mux and rebooted one node. The tier daemons are coming up after that.

@rahul, is switching brick-mux on and off while volumes are present a recommended operation?
BUILD: 3.8.4-38

1. Enabled brick-mux
2. Created a volume and made it a tiered volume
3. Disabled brick-mux (tried with and without this step)
4. Restarted glusterd
5. Tier daemons are coming up (visible in status too: gluster vol tier <vol> status)

Hence marking it as verified.
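Steps 3-5 above, as a command sketch. This assumes a tiered volume (here called "disp" as a placeholder) already exists on a pool running the build under test; it is illustrative, not a verbatim transcript of the verification run.

```shell
# Verification sketch for steps 3-5 -- placeholder volume name "disp".

gluster volume set all cluster.brick-multiplex off   # step 3 (also verified with this step skipped)
systemctl restart glusterd                           # step 4, on each node
gluster volume tier disp status                      # step 5: nodes should report "in progress", not "failed"
```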
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774