Brick for heketidbstorage failed to come ONLINE after node restart

Description of problem:
++++++++++++++++++++++++++++
We were regressing https://bugzilla.redhat.com/show_bug.cgi?id=1601341 on the new heketi build 7.0.5. Created 2 blockvolumes with HA=4, which succeeded. Brought down one node X and created 10 blockvolumes with HA=3, which succeeded with the fix for BZ#1601341. The system had 2 volumes, 1 replica-3 (heketidbstorage) and 1 block-hosting volume, each using bricks on node X. When node X was powered back on, the brick for the heketidbstorage volume failed to come back online; the brick for the block-hosting volume was restored successfully.

Note:
1. We didn't hit the issue for any other replica-3/block-hosting volume. The issue is seen only for the heketidbstorage volume.
2. If we bring down a node (but don't create any block device while it is down) and then bring it up, no issue is seen and the brick process for the heketidbstorage volume is restored.

Version-Release number of selected component (if applicable):
++++++++++++++++++++++++++++++++++++++++++++
[root@dhcp46-52 heketidb]# oc rsh glusterfs-storage-2qrqw
sh-4.2# rpm -qa|grep gluster
glusterfs-client-xlators-3.8.4-54.15.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.15.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.15.el7rhgs.x86_64
glusterfs-libs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-api-3.8.4-54.15.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.15.el7rhgs.x86_64
glusterfs-server-3.8.4-54.15.el7rhgs.x86_64
gluster-block-0.2.1-23.el7rhgs.x86_64
sh-4.2# rpm -qa|grep configshell
python-configshell-1.1.fb23-4.el7_5.noarch
sh-4.2# rpm -qa|grep tcmu-runner
tcmu-runner-1.2.0-23.el7rhgs.x86_64

# oc rsh heketi-storage-1-797hw rpm -qa|grep heketi
python-heketi-7.0.0-5.el7rhgs.x86_64
heketi-client-7.0.0-5.el7rhgs.x86_64
heketi-7.0.0-5.el7rhgs.x86_64

How reproducible:
+++++++++++++++++++
We tried this on two setups, and every time the brick for the heketidbstorage volume failed to come up in the
corner case where block devices had been created while one node was down.

Steps to Reproduce:
1. Create a 4-node CNS setup and create a blockvolume with HA=4. Two volumes now exist on the setup, heketidbstorage and a block-hosting volume. Confirm all brick processes are up.
2. Bring down 1 node from the CNS cluster and create some new block devices with HA=3.
3. Bring up the failed node.
4. Check the gluster volume status and heal status of all the volumes. Only for the heketidbstorage volume does the brick process fail to come up.
5. The gluster volume heal info output also lists "Status: Transport endpoint is not connected" for heketidbstorage.

Actual results:
+++++++++++++++++
The heketidbstorage brick process is not restored after a node reboot, in scenarios where block devices were created while the node was down.

Expected results:
++++++++++++++++
The brick process for the heketidbstorage volume should come online after a node reboot.

Additional info:
++++++++++++++++
All setup and command outputs are attached in the next comment.

[root@dhcp46-52 pvc-create]# oc get pods -o wide
NAME                                          READY  STATUS   RESTARTS  AGE  IP            NODE
block3-1-1-h2ps9                              1/1    Running  0         1h   10.128.2.10   dhcp47-62.lab.eng.blr.redhat.com
block4-1-88l58                                1/1    Running  0         4h   10.129.0.8    dhcp46-177.lab.eng.blr.redhat.com
block4-2-1-8k8hv                              1/1    Running  5         59m  10.128.0.21   dhcp46-52.lab.eng.blr.redhat.com
glusterblock-storage-provisioner-dc-1-s2m7w   1/1    Running  0         4h   10.128.2.9    dhcp47-62.lab.eng.blr.redhat.com
glusterfs-storage-8gz76                       1/1    Running  0         5h   10.70.47.62   dhcp47-62.lab.eng.blr.redhat.com
glusterfs-storage-9gtv6                       1/1    Running  0         5h   10.70.46.172  dhcp46-172.lab.eng.blr.redhat.com
glusterfs-storage-l4vfz                       1/1    Running  0         5h   10.70.46.177  dhcp46-177.lab.eng.blr.redhat.com
glusterfs-storage-rc2mf                       1/1    Running  1         5h   10.70.46.17   dhcp46-17.lab.eng.blr.redhat.com
heketi-storage-1-797hw                        1/1    Running  0         5h   10.128.0.19   dhcp46-52.lab.eng.blr.redhat.com
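For reference, the verification in steps 4-5 can be sketched as below. This is a minimal sketch, not part of the original report: it assumes you have shelled into one of the glusterfs-storage pods (e.g. via oc rsh glusterfs-storage-2qrqw), and the volume name heketidbstorage is taken from the report. The script skips the checks if the gluster CLI is not present.

```shell
#!/bin/sh
# Hypothetical sketch of the checks in steps 4-5. Run from inside a
# glusterfs-storage pod where the gluster CLI is available.
if command -v gluster >/dev/null 2>&1; then
  # Per-brick status: when this bug is hit, the heketidbstorage brick on
  # the rebooted node is expected to show Online "N" with no PID.
  gluster volume status heketidbstorage

  # Heal info: expected to report
  # "Status: Transport endpoint is not connected" for the offline brick.
  gluster volume heal heketidbstorage info
  checks_run=yes
else
  echo "gluster CLI not found; run this inside a glusterfs-storage pod"
  checks_run=no
fi
```

The same two commands can be repeated for the block-hosting volume to confirm that, unlike heketidbstorage, its brick comes back online after the reboot.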
I shall update the bug with more details and logs in some time. Apologies for the delay.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0287