Description of problem:
=======================
In a brick-mux enabled setup (on both the master and slave clusters) a geo-rep session was created.
3 bricks became ACTIVE: b1, b7, and b13.
Killed b7 using gf_attach -d, and b8 became ACTIVE.
Killed b8 using gf_attach -d, and another brick became ACTIVE.
Did a volume start force to bring all the bricks back online.
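For reference, the detach and the subsequent start force look roughly like this (the socket path below is illustrative; the brick process's unix socket lives under /var/run/gluster/ on the brick node):

# gf_attach -d /var/run/gluster/<brick-process>.socket /rhs/brick2/b7
# gluster volume start master force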
gluster v status shows the following output:
--------------------------------------------
[root@dhcp42-131 scripts]# gluster v status

Status of volume: gluster_shared_storage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.14:/var/lib/glusterd/ss_bric
k                                           49153     0          Y       18029
Brick 10.70.42.255:/var/lib/glusterd/ss_bri
ck                                          49153     0          Y       18981
Brick dhcp42-131.lab.eng.blr.redhat.com:/va
r/lib/glusterd/ss_brick                     49153     0          Y       25181
Self-heal Daemon on localhost               N/A       N/A        Y       24973
Self-heal Daemon on 10.70.43.245            N/A       N/A        N       N/A
Self-heal Daemon on 10.70.42.250            N/A       N/A        N       N/A
Self-heal Daemon on 10.70.41.224            N/A       N/A        Y       16268
Self-heal Daemon on 10.70.42.14             N/A       N/A        Y       17852
Self-heal Daemon on 10.70.42.255            N/A       N/A        Y       18810

Task Status of Volume gluster_shared_storage
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: master
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.131:/rhs/brick1/b1           49152     0          Y       24935
Brick 10.70.42.14:/rhs/brick1/b2            49152     0          Y       17811
Brick 10.70.42.255:/rhs/brick1/b3           49152     0          Y       18761
Brick 10.70.42.250:/rhs/brick1/b4           49152     0          Y       31287
Brick 10.70.41.224:/rhs/brick1/b5           49152     0          Y       16227
Brick 10.70.43.245:/rhs/brick1/b6           49152     0          Y       11537
Brick 10.70.42.131:/rhs/brick2/b7           49152     0          Y       24935
Brick 10.70.42.14:/rhs/brick2/b8            49152     0          Y       17811
Brick 10.70.42.255:/rhs/brick2/b9           49152     0          Y       18761
Brick 10.70.42.250:/rhs/brick2/b10          49152     0          Y       31287
Brick 10.70.41.224:/rhs/brick2/b11          49152     0          Y       16227
Brick 10.70.43.245:/rhs/brick2/b12          49152     0          Y       11537
Brick 10.70.42.131:/rhs/brick3/b13          49152     0          Y       24935
Brick 10.70.42.14:/rhs/brick3/b14           49152     0          Y       17811
Brick 10.70.42.255:/rhs/brick3/b15          49152     0          Y       18761
Brick 10.70.42.250:/rhs/brick3/b16          49152     0          Y       31287
Brick 10.70.41.224:/rhs/brick3/b17          49152     0          Y       16227
Brick 10.70.43.245:/rhs/brick3/b18          49152     0          Y       11537
Self-heal Daemon on localhost               N/A       N/A        Y       24973
Self-heal Daemon on 10.70.42.250            N/A       N/A        N       N/A
Self-heal Daemon on 10.70.42.255            N/A       N/A        Y       18810
Self-heal Daemon on 10.70.41.224            N/A       N/A        Y       16268
Self-heal Daemon on 10.70.43.245            N/A       N/A        N       N/A
Self-heal Daemon on 10.70.42.14             N/A       N/A        Y       17852

Task Status of Volume master
------------------------------------------------------------------------------
There are no active volume tasks
Seems like shd (the self-heal daemon) is not coming up on 2 nodes.
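A suggested check (not output captured from this run) for why shd failed to start would be the self-heal daemon log on each affected node:

# less /var/log/glusterfs/glustershd.log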
In a geo-rep session, once the bricks are brought back online, all workers should be in ACTIVE/PASSIVE state. Currently, out of 18 bricks, 2 are ACTIVE, 14 are PASSIVE, and 2 are stuck in the 'Initializing...' state. (There should be 3 ACTIVE bricks in this setup.)
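The worker states above come from geo-rep status, which for this setup would be queried roughly as follows (slave host and volume names are placeholders):

# gluster volume geo-replication master <slave-host>::<slave-vol> status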
Version-Release number of selected component (if applicable):
============================================================
glusterfs-6.0-3.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7_6.9.x86_64
glusterfs-rdma-6.0-3.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-cli-6.0-3.el7rhgs.x86_64
glusterfs-geo-replication-6.0-3.el7rhgs.x86_64
glusterfs-libs-6.0-3.el7rhgs.x86_64
glusterfs-api-6.0-3.el7rhgs.x86_64
glusterfs-server-6.0-3.el7rhgs.x86_64
python2-gluster-6.0-3.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-client-xlators-6.0-3.el7rhgs.x86_64
glusterfs-fuse-6.0-3.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
How reproducible:
==================
1/1
Steps to Reproduce:
-------------------
1. Create and start a master and a slave volume (3 x (4 + 2) = 18 bricks) with brick-mux enabled
2. Create and start a geo-rep session (sketched after these steps)
3. Run I/O on the master:
   for i in {create,chmod,chown,chgrp,symlink,truncate,hardlink,rename,chmod,create,chown,chgrp,rename,symlink,rename,rename,rename,rename,symlink,hardlink,rename,rename,truncate}; do
       crefi --multi -n 5 -b 5 -d 5 --max=1K --min=50 --random -T 5 -t text --fop=$i /mnt/master/
       sleep 10
   done
4. Kill one ACTIVE brick using gf_attach -d (as sketched in the Description above)
5. Wait for another brick to become ACTIVE
6. Kill the new ACTIVE brick
7. Wait for 10 minutes and bring all bricks back online using: gluster v start force
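For step 2, the session would be created and started roughly like this (slave host and volume names are placeholders; push-pem assumes passwordless SSH is set up between the clusters):

# gluster system:: execute gsec_create
# gluster volume geo-replication master <slave-host>::<slave-vol> create push-pem
# gluster volume geo-replication master <slave-host>::<slave-vol> start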
Actual results:
---------------
1. Geo-rep is still in the 'Initializing...' state for some bricks.
2. shd is not coming up on 2 nodes.
Expected results:
-----------------
1. Geo-rep workers should be in ACTIVE/PASSIVE state.
2. shd should be up on all nodes.
Additional info:
=================
Filing this against geo-rep for now, as that is where the side effects are visible.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2019:3249