Description of problem:
=======================
On upgrade to 3.5.0, the self-heal daemon fails to come up on a brick-mux setup.
Version-Release number of selected component (if applicable):
============================================================
3.5.0 (glusterfs-6.0-2.el7rhgs.x86_64)
How reproducible:
================
The issue is not consistent; it was hit 3 out of 6 times.
Steps to Reproduce:
==================
1. Upgrade a node from 3.4.4 to 3.5.0
2. Start glusterd
3. shd fails to come up
Another way to reproduce
==========================
On a 3.5.0 setup with brick-mux enabled
1.pkill glusterfsd
2.pkill glusterfs
3.systemctl stop glusterd
4.systemctl start glusterd
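The restart sequence above can be scripted. A hedged sketch, assuming root on a brick-mux enabled 3.5.0 node; the final pgrep check is a hypothetical verification step added here, not part of the original report:

```shell
# Reproducer: kill all gluster processes, then bounce glusterd.
pkill glusterfsd || true              # stop multiplexed brick processes
pkill glusterfs  || true              # stop client-side daemons (shd etc.)
systemctl stop  glusterd 2>/dev/null || true
systemctl start glusterd 2>/dev/null || true
sleep 5                               # give glusterd time to respawn daemons

# Failure signature: glustershd never spawns after the restart.
# The [g] bracket trick keeps pgrep from matching this script itself.
if pgrep -f '[g]lustershd' >/dev/null; then
  echo "shd up"
else
  echo "shd down"
fi
```

On an affected node the script prints "shd down" even though glusterd itself is running.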
Actual results:
===============
self-heal daemon does not come up
Expected results:
================
self-heal daemon should come up
Additional info:
================
[root@dhcp43-102 ~]# gluster v status
Status of volume: disperse-vol
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 10.70.43.44:/gluster/brick1/ec1 49152 0 Y 32720
Brick 10.70.42.80:/gluster/brick1/ec2 49152 0 Y 31885
Brick 10.70.43.116:/gluster/brick1/ec3 49152 0 Y 24287
Brick 10.70.43.211:/gluster/brick1/ec4 49152 0 Y 445
Brick 10.70.35.15:/gluster/brick1/ec5 49152 0 Y 4430
Brick 10.70.43.102:/gluster/brick1/ec6 49152 0 Y 1773
Brick 10.70.43.44:/gluster/brick1/ec7 49152 0 Y 32720
Brick 10.70.42.80:/gluster/brick1/ec8 49152 0 Y 31885
Brick 10.70.43.116:/gluster/brick1/ec9 49152 0 Y 24287
Brick 10.70.43.211:/gluster/brick1/ec10 49152 0 Y 445
Brick 10.70.35.15:/gluster/brick1/ec11 49152 0 Y 4430
Brick 10.70.43.102:/gluster/brick1/ec12 49152 0 Y 1773
Brick 10.70.43.44:/gluster/brick1/ec13 49152 0 Y 32720
Brick 10.70.42.80:/gluster/brick1/ec14 49152 0 Y 31885
Brick 10.70.43.116:/gluster/brick1/ec15 49152 0 Y 24287
Brick 10.70.43.211:/gluster/brick1/ec16 49152 0 Y 445
Brick 10.70.35.15:/gluster/brick1/ec17 49152 0 Y 4430
Brick 10.70.43.102:/gluster/brick1/ec18 49152 0 Y 1773
Self-heal Daemon on localhost N/A N/A N N/A
Self-heal Daemon on 10.70.42.80 N/A N/A Y 695
Self-heal Daemon on 10.70.43.211 N/A N/A Y 434
Self-heal Daemon on dhcp35-15.lab.eng.blr.redhat.com N/A N/A Y 5441
Self-heal Daemon on 10.70.43.116 N/A N/A Y 1738
Self-heal Daemon on 10.70.43.44 N/A N/A Y 302
Task Status of Volume disperse-vol
------------------------------------------------------------------------------
There are no active volume tasks
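The failure is visible in the status output above: every brick is online, but the Self-heal Daemon on localhost shows `N` in the Online column. A minimal sketch for auditing such output (sample lines abridged from the report; in practice pipe `gluster v status` into the awk command):

```shell
# Flag nodes whose Self-heal Daemon is offline in `gluster v status` output.
# Columns: "Self-heal Daemon on <host>  <tcp>  <rdma>  <Y|N>  <pid>",
# so the Online flag is the second-to-last field.
awk '
  /^Self-heal Daemon/ {
      if ($(NF - 1) == "N") print "shd offline on: " $4
  }
' <<'EOF'
Self-heal Daemon on localhost               N/A       N/A        N       N/A
Self-heal Daemon on 10.70.42.80             N/A       N/A        Y       695
Self-heal Daemon on 10.70.43.211            N/A       N/A        Y       434
EOF
```

Note this simple field-based parse assumes one status entry per line; long hostnames that gluster wraps across two lines (as with dhcp35-15 above) would need to be rejoined first.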
[root@dhcp43-102 ~]# rpm -qa|grep gluster
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7_6.7.x86_64
glusterfs-api-6.0-2.el7rhgs.x86_64
glusterfs-server-6.0-2.el7rhgs.x86_64
glusterfs-libs-6.0-2.el7rhgs.x86_64
glusterfs-geo-replication-6.0-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
python2-gluster-6.0-2.el7rhgs.x86_64
glusterfs-rdma-6.0-2.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-6.0-2.el7rhgs.x86_64
glusterfs-client-xlators-6.0-2.el7rhgs.x86_64
glusterfs-cli-6.0-2.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
glusterfs-fuse-6.0-2.el7rhgs.x86_64
[root@dhcp43-102 ~]#
[root@dhcp43-102 ~]#
[root@dhcp43-102 ~]# gluster v info
Volume Name: disperse-vol
Type: Distributed-Disperse
Volume ID: 6d36d014-8c14-4866-9e39-4d8e42a8b657
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x (4 + 2) = 18
Transport-type: tcp
Bricks:
Brick1: 10.70.43.44:/gluster/brick1/ec1
Brick2: 10.70.42.80:/gluster/brick1/ec2
Brick3: 10.70.43.116:/gluster/brick1/ec3
Brick4: 10.70.43.211:/gluster/brick1/ec4
Brick5: 10.70.35.15:/gluster/brick1/ec5
Brick6: 10.70.43.102:/gluster/brick1/ec6
Brick7: 10.70.43.44:/gluster/brick1/ec7
Brick8: 10.70.42.80:/gluster/brick1/ec8
Brick9: 10.70.43.116:/gluster/brick1/ec9
Brick10: 10.70.43.211:/gluster/brick1/ec10
Brick11: 10.70.35.15:/gluster/brick1/ec11
Brick12: 10.70.43.102:/gluster/brick1/ec12
Brick13: 10.70.43.44:/gluster/brick1/ec13
Brick14: 10.70.42.80:/gluster/brick1/ec14
Brick15: 10.70.43.116:/gluster/brick1/ec15
Brick16: 10.70.43.211:/gluster/brick1/ec16
Brick17: 10.70.35.15:/gluster/brick1/ec17
Brick18: 10.70.43.102:/gluster/brick1/ec18
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: enable
[root@dhcp43-102 ~]#
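With cluster.brick-multiplex enabled, all bricks hosted on a node are served by a single glusterfsd process, which is why the status output shows one PID per node (e.g. 32720 for all three bricks on 10.70.43.44). A rough sketch (sample lines abridged from the status output above) that counts distinct brick PIDs per host to confirm multiplexing is in effect:

```shell
# Count distinct glusterfsd PIDs per host; under brick-mux each host
# should map to exactly one PID. In practice pipe in: gluster v status
awk '
  /^Brick / {
      host = $2; sub(/:.*/, "", host)   # strip :/path from "ip:/brick/path"
      pids[host SUBSEP $NF] = 1         # remember (host, pid) pairs
  }
  END {
      for (k in pids) { split(k, a, SUBSEP); count[a[1]]++ }
      for (h in count) print h, "brick pids:", count[h]
  }
' <<'EOF'
Brick 10.70.43.44:/gluster/brick1/ec1       49152     0          Y       32720
Brick 10.70.43.44:/gluster/brick1/ec7       49152     0          Y       32720
Brick 10.70.43.44:/gluster/brick1/ec13      49152     0          Y       32720
Brick 10.70.42.80:/gluster/brick1/ec2       49152     0          Y       31885
Brick 10.70.42.80:/gluster/brick1/ec8       49152     0          Y       31885
EOF
```

Each host printing "brick pids: 1" confirms the bricks are multiplexed into one process per node.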
The setup was shared with Rafi. He looked into it and provided a custom build to test, with which the issue is no longer seen.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2019:3249