Description of problem:
=======================
In a brick-mux enabled setup (on both the master and slave clusters), a geo-rep session was created. Three bricks became ACTIVE: b1, b7, b13. Killed b7 using gf_attach -d, and b8 became ACTIVE. Killed b8 using gf_attach, and another brick became ACTIVE. Did a 'gluster volume start force' to bring all the bricks back online.

gluster v status shows the following output:
--------------------------------------------
[root@dhcp42-131 scripts]# gluster v status
Status of volume: gluster_shared_storage
Gluster process                                        TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.14:/var/lib/glusterd/ss_brick           49153     0          Y       18029
Brick 10.70.42.255:/var/lib/glusterd/ss_brick          49153     0          Y       18981
Brick dhcp42-131.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick  49153  0  Y  25181
Self-heal Daemon on localhost                          N/A       N/A        Y       24973
Self-heal Daemon on 10.70.43.245                       N/A       N/A        N       N/A
Self-heal Daemon on 10.70.42.250                       N/A       N/A        N       N/A
Self-heal Daemon on 10.70.41.224                       N/A       N/A        Y       16268
Self-heal Daemon on 10.70.42.14                        N/A       N/A        Y       17852
Self-heal Daemon on 10.70.42.255                       N/A       N/A        Y       18810

Task Status of Volume gluster_shared_storage
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: master
Gluster process                                        TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.131:/rhs/brick1/b1                      49152     0          Y       24935
Brick 10.70.42.14:/rhs/brick1/b2                       49152     0          Y       17811
Brick 10.70.42.255:/rhs/brick1/b3                      49152     0          Y       18761
Brick 10.70.42.250:/rhs/brick1/b4                      49152     0          Y       31287
Brick 10.70.41.224:/rhs/brick1/b5                      49152     0          Y       16227
Brick 10.70.43.245:/rhs/brick1/b6                      49152     0          Y       11537
Brick 10.70.42.131:/rhs/brick2/b7                      49152     0          Y       24935
Brick 10.70.42.14:/rhs/brick2/b8                       49152     0          Y       17811
Brick 10.70.42.255:/rhs/brick2/b9                      49152     0          Y       18761
Brick 10.70.42.250:/rhs/brick2/b10                     49152     0          Y       31287
Brick 10.70.41.224:/rhs/brick2/b11                     49152     0          Y       16227
Brick 10.70.43.245:/rhs/brick2/b12                     49152     0          Y       11537
Brick 10.70.42.131:/rhs/brick3/b13                     49152     0          Y       24935
Brick 10.70.42.14:/rhs/brick3/b14                      49152     0          Y       17811
Brick 10.70.42.255:/rhs/brick3/b15                     49152     0          Y       18761
Brick 10.70.42.250:/rhs/brick3/b16                     49152     0          Y       31287
Brick 10.70.41.224:/rhs/brick3/b17                     49152     0          Y       16227
Brick 10.70.43.245:/rhs/brick3/b18                     49152     0          Y       11537
Self-heal Daemon on localhost                          N/A       N/A        Y       24973
Self-heal Daemon on 10.70.42.250                       N/A       N/A        N       N/A
Self-heal Daemon on 10.70.42.255                       N/A       N/A        Y       18810
Self-heal Daemon on 10.70.41.224                       N/A       N/A        Y       16268
Self-heal Daemon on 10.70.43.245                       N/A       N/A        N       N/A
Self-heal Daemon on 10.70.42.14                        N/A       N/A        Y       17852

Task Status of Volume master
------------------------------------------------------------------------------
There are no active volume tasks

Seems like shd is not coming up on 2 nodes (10.70.43.245 and 10.70.42.250).

In a geo-rep session, once the bricks are brought back online they should be in ACTIVE/PASSIVE state, but currently, out of 18 bricks, 2 are ACTIVE, 14 are PASSIVE, and 2 are in 'Initializing...' state. (There should be 3 ACTIVE bricks in this setup.)
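For reference, the ACTIVE bricks were killed with gf_attach's detach mode, which removes one brick from the multiplexed glusterfsd process. A minimal sketch; the socket name under /var/run/gluster/ is a placeholder (it varies per node), so substitute the actual glusterfsd unix socket:

    # Detach (kill) brick b7 from its multiplexed glusterfsd process.
    # Syntax: gf_attach -d <glusterfsd-unix-socket> <brick-path>
    # <brick-socket>.socket is a placeholder; look up the real name under /var/run/gluster/.
    gf_attach -d /var/run/gluster/<brick-socket>.socket /rhs/brick2/b7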
Version-Release number of selected component (if applicable):
============================================================
glusterfs-6.0-3.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7_6.9.x86_64
glusterfs-rdma-6.0-3.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-cli-6.0-3.el7rhgs.x86_64
glusterfs-geo-replication-6.0-3.el7rhgs.x86_64
glusterfs-libs-6.0-3.el7rhgs.x86_64
glusterfs-api-6.0-3.el7rhgs.x86_64
glusterfs-server-6.0-3.el7rhgs.x86_64
python2-gluster-6.0-3.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-client-xlators-6.0-3.el7rhgs.x86_64
glusterfs-fuse-6.0-3.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

How reproducible:
==================
1/1

Steps to Reproduce:
-------------------
1. Start a master and a slave volume (3 x (4 + 2) = 18 bricks) with brick-mux enabled.
2. Create and start a geo-rep session.
3. Run IO on the master:

   for i in {create,chmod,chown,chgrp,symlink,truncate,hardlink,rename,chmod,create,chown,chgrp,rename,symlink,rename,rename,rename,rename,symlink,hardlink,rename,rename,truncate}; do crefi --multi -n 5 -b 5 -d 5 --max=1K --min=50 --random -T 5 -t text --fop=$i /mnt/master/ ; sleep 10 ; done

4. Kill one ACTIVE brick using gf_attach -d (see the detach sketch above).
5. Wait for another brick to become ACTIVE.
6. Kill the new ACTIVE brick.
7. Wait for 10 minutes and bring back all bricks using: gluster v start force (a verification sketch follows Additional info below).

Actual results:
---------------
1. Geo-rep is still in 'Initializing...' state for some bricks.
2. shd is not coming up on 2 nodes.

Expected results:
-----------------
1. Geo-rep should be in ACTIVE/PASSIVE state.
2. shd should be up on all nodes.

Additional info:
=================
Filed this on geo-rep for now, as I can see the side effects here.
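For completeness, this is how the brick, shd, and geo-rep worker states were verified after the 'start force'. A minimal sketch; 'slavenode::slavevol' is a hypothetical slave endpoint, so substitute the actual session's slave host and volume:

    # Bring all bricks back online.
    gluster volume start master force
    # Confirm every brick and self-heal daemon shows Online = Y.
    gluster volume status master
    # Check geo-rep worker states; expect ACTIVE/PASSIVE, no 'Initializing...'.
    # 'slavenode::slavevol' is hypothetical; use the real slave host::volume.
    gluster volume geo-replication master slavenode::slavevol status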
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:3249