Description of problem:
=======================
In a brick-mux enabled setup (on both the master and slave clusters) a geo-rep session was created.
3 bricks became ACTIVE: b1, b7, and b13.
Killed b7 using gf_attach -d, and b8 became ACTIVE.
Killed b8 using gf_attach -d, and another brick became ACTIVE.
Did a volume start force to bring all the bricks back online.
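For reference, the detach and the subsequent start force look roughly like this (the socket path below is illustrative; the brick process's unix socket lives under /var/run/gluster/ on the brick node):

# gf_attach -d /var/run/gluster/<brick-process>.socket /rhs/brick2/b7
# gluster volume start master force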
gluster v status shows the following output:
--------------------------------------------
[root@dhcp42-131 scripts]# gluster v status

Status of volume: gluster_shared_storage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.14:/var/lib/glusterd/ss_bric
k                                           49153     0          Y       18029
Brick 10.70.42.255:/var/lib/glusterd/ss_bri
ck                                          49153     0          Y       18981
Brick dhcp42-131.lab.eng.blr.redhat.com:/va
r/lib/glusterd/ss_brick                     49153     0          Y       25181
Self-heal Daemon on localhost               N/A       N/A        Y       24973
Self-heal Daemon on 10.70.43.245            N/A       N/A        N       N/A
Self-heal Daemon on 10.70.42.250            N/A       N/A        N       N/A
Self-heal Daemon on 10.70.41.224            N/A       N/A        Y       16268
Self-heal Daemon on 10.70.42.14             N/A       N/A        Y       17852
Self-heal Daemon on 10.70.42.255            N/A       N/A        Y       18810

Task Status of Volume gluster_shared_storage
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: master
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.131:/rhs/brick1/b1           49152     0          Y       24935
Brick 10.70.42.14:/rhs/brick1/b2            49152     0          Y       17811
Brick 10.70.42.255:/rhs/brick1/b3           49152     0          Y       18761
Brick 10.70.42.250:/rhs/brick1/b4           49152     0          Y       31287
Brick 10.70.41.224:/rhs/brick1/b5           49152     0          Y       16227
Brick 10.70.43.245:/rhs/brick1/b6           49152     0          Y       11537
Brick 10.70.42.131:/rhs/brick2/b7           49152     0          Y       24935
Brick 10.70.42.14:/rhs/brick2/b8            49152     0          Y       17811
Brick 10.70.42.255:/rhs/brick2/b9           49152     0          Y       18761
Brick 10.70.42.250:/rhs/brick2/b10          49152     0          Y       31287
Brick 10.70.41.224:/rhs/brick2/b11          49152     0          Y       16227
Brick 10.70.43.245:/rhs/brick2/b12          49152     0          Y       11537
Brick 10.70.42.131:/rhs/brick3/b13          49152     0          Y       24935
Brick 10.70.42.14:/rhs/brick3/b14           49152     0          Y       17811
Brick 10.70.42.255:/rhs/brick3/b15          49152     0          Y       18761
Brick 10.70.42.250:/rhs/brick3/b16          49152     0          Y       31287
Brick 10.70.41.224:/rhs/brick3/b17          49152     0          Y       16227
Brick 10.70.43.245:/rhs/brick3/b18          49152     0          Y       11537
Self-heal Daemon on localhost               N/A       N/A        Y       24973
Self-heal Daemon on 10.70.42.250            N/A       N/A        N       N/A
Self-heal Daemon on 10.70.42.255            N/A       N/A        Y       18810
Self-heal Daemon on 10.70.41.224            N/A       N/A        Y       16268
Self-heal Daemon on 10.70.43.245            N/A       N/A        N       N/A
Self-heal Daemon on 10.70.42.14             N/A       N/A        Y       17852

Task Status of Volume master
------------------------------------------------------------------------------
There are no active volume tasks
Seems like shd (the self-heal daemon) is not coming up on 2 nodes.
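A suggested check (not output captured from this run) for why shd failed to start would be the self-heal daemon log on each affected node:

# less /var/log/glusterfs/glustershd.log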
In a geo-rep session, once the bricks are brought back online, all workers should be in ACTIVE/PASSIVE state. Currently, out of 18 bricks, 2 are ACTIVE, 14 are PASSIVE, and 2 are stuck in the 'Initializing...' state. (There should be 3 ACTIVE bricks in this setup.)
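The worker states above come from geo-rep status, which for this setup would be queried roughly as follows (slave host and volume names are placeholders):

# gluster volume geo-replication master <slave-host>::<slave-vol> status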
Version-Release number of selected component (if applicable):
============================================================
glusterfs-6.0-3.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7_6.9.x86_64
glusterfs-rdma-6.0-3.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-cli-6.0-3.el7rhgs.x86_64
glusterfs-geo-replication-6.0-3.el7rhgs.x86_64
glusterfs-libs-6.0-3.el7rhgs.x86_64
glusterfs-api-6.0-3.el7rhgs.x86_64
glusterfs-server-6.0-3.el7rhgs.x86_64
python2-gluster-6.0-3.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-client-xlators-6.0-3.el7rhgs.x86_64
glusterfs-fuse-6.0-3.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
How reproducible:
==================
1/1
Steps to Reproduce:
-------------------
1. Create and start a master and a slave volume (3 x (4 + 2) = 18 bricks) with brick-mux enabled
2. Create and start a geo-rep session (sketched after these steps)
3. Run I/O on the master:
   for i in {create,chmod,chown,chgrp,symlink,truncate,hardlink,rename,chmod,create,chown,chgrp,rename,symlink,rename,rename,rename,rename,symlink,hardlink,rename,rename,truncate}; do
       crefi --multi -n 5 -b 5 -d 5 --max=1K --min=50 --random -T 5 -t text --fop=$i /mnt/master/
       sleep 10
   done
4. Kill one ACTIVE brick using gf_attach -d (as sketched in the Description above)
5. Wait for another brick to become ACTIVE
6. Kill the new ACTIVE brick
7. Wait for 10 minutes and bring all bricks back online using: gluster v start force
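For step 2, the session would be created and started roughly like this (slave host and volume names are placeholders; push-pem assumes passwordless SSH is set up between the clusters):

# gluster system:: execute gsec_create
# gluster volume geo-replication master <slave-host>::<slave-vol> create push-pem
# gluster volume geo-replication master <slave-host>::<slave-vol> start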
Actual results:
---------------
1. Geo-rep is still in the 'Initializing...' state for some bricks.
2. shd is not coming up on 2 nodes.
Expected results:
-----------------
1. Geo-rep workers should be in ACTIVE/PASSIVE state.
2. shd should be up on all nodes.
Additional info:
=================
Filing this against geo-rep for now, as that is where the side effects are visible.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2019:3249