Brick for heketidbstorage failed to come ONLINE after node restart
Description of problem:
++++++++++++++++++++++++++++
We were regression-testing https://bugzilla.redhat.com/show_bug.cgi?id=1601341 on the new heketi build 7.0.0-5. Created 2 blockvolumes with HA=4, which succeeded.
Brought down one node (node X) and created 10 blockvolumes with HA=3, which succeeded with the fix for BZ#1601341.
At this point the system had 2 volumes, 1 replica-3 volume and 1 block-hosting volume, each using bricks on node X.
But when node X was powered back on, the brick for the heketidbstorage volume failed to come back online, while the brick for the block-hosting volume was restored successfully.
Note:
1. We did not hit the issue for any other replica-3/block-hosting volume. The issue is seen only for the heketidbstorage volume.
2. If we bring down a node (but do not create any block devices while it is down) and then bring it back up, no issue is seen and the brick process for the heketidbstorage volume is restored.
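The volume-creation sequence described above can be sketched with heketi-cli; the --size values below are illustrative assumptions, not taken from this report:

```shell
# Sketch of the scenario above; --size values are illustrative assumptions.
# Step 1: all 4 nodes up, create block volumes with full HA.
heketi-cli blockvolume create --size=2 --ha=4
heketi-cli blockvolume create --size=2 --ha=4

# Step 2: power off one gluster node out-of-band, then create 10 more
# block volumes, which can only place replicas on the 3 surviving nodes.
for i in $(seq 1 10); do
  heketi-cli blockvolume create --size=2 --ha=3
done
```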
Version-Release number of selected component (if applicable):
++++++++++++++++++++++++++++++++++++++++++++
[root@dhcp46-52 heketidb]# oc rsh glusterfs-storage-2qrqw
sh-4.2# rpm -qa|grep gluster
glusterfs-client-xlators-3.8.4-54.15.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.15.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.15.el7rhgs.x86_64
glusterfs-libs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-api-3.8.4-54.15.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.15.el7rhgs.x86_64
glusterfs-server-3.8.4-54.15.el7rhgs.x86_64
gluster-block-0.2.1-23.el7rhgs.x86_64
sh-4.2#
sh-4.2# rpm -qa|grep configshell
python-configshell-1.1.fb23-4.el7_5.noarch
sh-4.2# rpm -qa|grep tcmu-runner
tcmu-runner-1.2.0-23.el7rhgs.x86_64
sh-4.2#
# oc rsh heketi-storage-1-797hw rpm -qa|grep heketi
python-heketi-7.0.0-5.el7rhgs.x86_64
heketi-client-7.0.0-5.el7rhgs.x86_64
heketi-7.0.0-5.el7rhgs.x86_64
How reproducible:
+++++++++++++++++++
We reproduced this on two setups; every time, the brick for the heketidbstorage volume failed to come up in the corner case where block devices were created while one node was down.
Steps to Reproduce:
1. Create a 4-node CNS setup and create a blockvolume with HA=4. Two volumes now exist on the setup, heketidbstorage and the block-hosting volume. Confirm all brick processes are up.
2. Bring down 1 node from the CNS cluster and create some new block devices with HA=3.
3. Bring up the failed node.
4. Check gluster volume status and heal info for all the volumes. Only for the heketidbstorage volume does the brick process on the rebooted node fail to come up.
5. gluster volume heal info also lists "Status: Transport endpoint is not connected" for heketidbstorage.
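The checks in steps 4-5 can be sketched as follows. The awk filter assumes the usual gluster volume status column layout, where the next-to-last field is the Online flag (Y/N); the sample brick lines are illustrative, not captured from this setup:

```shell
# Run inside any glusterfs-storage pod (shown as comments here, since they
# need a live cluster):
#   gluster volume status heketidbstorage
#   gluster volume heal heketidbstorage info

# Flag offline bricks from a captured status listing; the sample lines
# below are illustrative placeholders.
awk '/^Brick/ && $(NF-1) == "N" { print $2, "is OFFLINE" }' <<'EOF'
Brick 10.70.47.62:/var/lib/heketi/mounts/brick1/brick   49152  0  Y  1234
Brick 10.70.46.17:/var/lib/heketi/mounts/brick2/brick   N/A    0  N  N/A
EOF
```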
Actual results:
+++++++++++++++++
The heketidbstorage brick process is not restored after a node reboot, in the scenario where block devices were created while the node was down.
Expected results:
++++++++++++++++
The brick process for the heketidbstorage volume should come online after node reboot.
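Not verified in this report, but a commonly used manual recovery for a brick process that fails to start is a force start of the volume; a sketch, assuming it is run from inside a glusterfs-storage pod:

```shell
# Assumption: run inside a glusterfs-storage pod on the affected cluster.
# Guarded so the sketch is a no-op where the gluster CLI is absent.
if command -v gluster >/dev/null 2>&1; then
  gluster volume start heketidbstorage force   # respawn the missing brick process
  gluster volume status heketidbstorage        # brick should now show Online=Y
fi
```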
Additional info:
++++++++++++++++
All setup and command outputs attached in the next comment.
[root@dhcp46-52 pvc-create]# oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
block3-1-1-h2ps9 1/1 Running 0 1h 10.128.2.10 dhcp47-62.lab.eng.blr.redhat.com
block4-1-88l58 1/1 Running 0 4h 10.129.0.8 dhcp46-177.lab.eng.blr.redhat.com
block4-2-1-8k8hv 1/1 Running 5 59m 10.128.0.21 dhcp46-52.lab.eng.blr.redhat.com
glusterblock-storage-provisioner-dc-1-s2m7w 1/1 Running 0 4h 10.128.2.9 dhcp47-62.lab.eng.blr.redhat.com
glusterfs-storage-8gz76 1/1 Running 0 5h 10.70.47.62 dhcp47-62.lab.eng.blr.redhat.com
glusterfs-storage-9gtv6 1/1 Running 0 5h 10.70.46.172 dhcp46-172.lab.eng.blr.redhat.com
glusterfs-storage-l4vfz 1/1 Running 0 5h 10.70.46.177 dhcp46-177.lab.eng.blr.redhat.com
glusterfs-storage-rc2mf 1/1 Running 1 5h 10.70.46.17 dhcp46-17.lab.eng.blr.redhat.com
heketi-storage-1-797hw 1/1 Running 0 5h 10.128.0.19 dhcp46-52.lab.eng.blr.redhat.com
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2019:0287