Bug 1610903 - [Tracker-RHGS-BZ#1622452] Brick for heketidbstorage failed to come ONLINE after node restart
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhgs-server-container
Version: cns-3.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 3.11.1
Assignee: Raghavendra Talur
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On: 1622452 1623433
Blocks: OCS-3.11.1-devel-triage-done 1644154
 
Reported: 2018-08-01 15:06 UTC by Neha Berry
Modified: 2019-02-11 10:49 UTC
CC List: 15 users

Fixed In Version: rhgs-server-rhel7:3.11.1-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-07 04:12:47 UTC
Embargoed:




Links
Red Hat Bugzilla 1653283 (CLOSED): Heketi volume brick went offline while deleting/creating block pvc's followed by node shutdown. Last updated: 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHEA-2019:0287. Last updated: 2019-02-07 04:13:08 UTC

Internal Links: 1653283

Description Neha Berry 2018-08-01 15:06:46 UTC
Brick for heketidbstorage failed to come ONLINE after node restart

Description of problem:
++++++++++++++++++++++++++++
We were running regression tests for https://bugzilla.redhat.com/show_bug.cgi?id=1601341 with the new heketi build 7.0.5. Created 2 blockvolumes with HA=4, which succeeded.

Brought down one node (node X) and created 10 blockvolumes with HA=3, which succeeded thanks to the fix for BZ#1601341.

The system had 2 volumes, 1 replica-3 volume (heketidbstorage) and 1 block-hosting volume, each with a brick on node X.

But when the node was powered back on, the brick for the heketidbstorage volume failed to come back online. The brick for the block-hosting volume was restored successfully.

Note: 
1. We didn't hit the issue for any other replica-3/block-hosting volume. The issue is seen only for the heketidbstorage volume.
2. If we bring down a node (but don't create any block device while it is down) and then bring it up, no issue is seen and the brick process for the heketidbstorage volume is restored.
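
For reference, a minimal sketch of the commands used for the block device creation step, assuming heketi-cli is run from inside the heketi pod and that the server URL and credentials are already set in the pod environment; the size and count here are illustrative, not the exact values from this test:

# From the oc client host, enter the heketi pod
[root@dhcp46-52 ~]# oc rsh heketi-storage-1-797hw

# Inside the heketi pod: create block volumes while one gluster node is powered off
sh-4.2# heketi-cli blockvolume create --size=2 --ha=3
sh-4.2# heketi-cli blockvolume list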


Version-Release number of selected component (if applicable):
++++++++++++++++++++++++++++++++++++++++++++

[root@dhcp46-52 heketidb]# oc rsh glusterfs-storage-2qrqw 
sh-4.2# rpm -qa|grep gluster
glusterfs-client-xlators-3.8.4-54.15.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.15.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.15.el7rhgs.x86_64
glusterfs-libs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-api-3.8.4-54.15.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.15.el7rhgs.x86_64
glusterfs-server-3.8.4-54.15.el7rhgs.x86_64
gluster-block-0.2.1-23.el7rhgs.x86_64
sh-4.2# 

sh-4.2# rpm -qa|grep configshell
python-configshell-1.1.fb23-4.el7_5.noarch
sh-4.2# rpm -qa|grep tcmu-runner
tcmu-runner-1.2.0-23.el7rhgs.x86_64
sh-4.2# 

# oc rsh heketi-storage-1-797hw rpm -qa|grep heketi
python-heketi-7.0.0-5.el7rhgs.x86_64
heketi-client-7.0.0-5.el7rhgs.x86_64
heketi-7.0.0-5.el7rhgs.x86_64




How reproducible:
+++++++++++++++++++
We tried this on two setups, and every time the brick for the heketidbstorage volume failed to come up in the corner case where block devices had been created while one node was down.

Steps to Reproduce:
1. Create a 4-node CNS setup and create a blockvolume with HA=4. Two volumes now exist on the setup, heketidbstorage and a block-hosting volume. Confirm all brick processes are up.
2. Bring down 1 node from the CNS cluster and create some new block devices with HA=3.
3. Bring up the failed node.
4. Check the volume status and heal status of all the volumes (see the command sketch after this list). Only for the heketidbstorage volume does the brick process fail to come up.
5. The heal info output also lists "Status: Transport endpoint is not connected" for heketidbstorage.
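
A minimal sketch of the checks in steps 4-5, run from inside one of the gluster pods; the pod name is taken from this setup and will differ on other clusters:

# From the oc client host, enter a gluster pod
[root@dhcp46-52 ~]# oc rsh glusterfs-storage-2qrqw

# Inside the gluster pod: check brick and heal state for the affected volume
sh-4.2# gluster volume status heketidbstorage
sh-4.2# gluster volume heal heketidbstorage info

# And across all volumes
sh-4.2# gluster volume status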

Actual results:
+++++++++++++++++
The heketidbstorage brick process is not restored after a node reboot in the scenario where block devices were created while the node was down.

Expected results:
++++++++++++++++
The brick process for the heketidbstorage volume should come online after node reboot.

Additional info:
++++++++++++++++

All setup details and command outputs are attached in the next comment.




[root@dhcp46-52 pvc-create]# oc get pods -o wide
NAME                                          READY     STATUS    RESTARTS   AGE       IP             NODE
block3-1-1-h2ps9                              1/1       Running   0          1h        10.128.2.10    dhcp47-62.lab.eng.blr.redhat.com
block4-1-88l58                                1/1       Running   0          4h        10.129.0.8     dhcp46-177.lab.eng.blr.redhat.com
block4-2-1-8k8hv                              1/1       Running   5          59m       10.128.0.21    dhcp46-52.lab.eng.blr.redhat.com
glusterblock-storage-provisioner-dc-1-s2m7w   1/1       Running   0          4h        10.128.2.9     dhcp47-62.lab.eng.blr.redhat.com
glusterfs-storage-8gz76                       1/1       Running   0          5h        10.70.47.62    dhcp47-62.lab.eng.blr.redhat.com
glusterfs-storage-9gtv6                       1/1       Running   0          5h        10.70.46.172   dhcp46-172.lab.eng.blr.redhat.com
glusterfs-storage-l4vfz                       1/1       Running   0          5h        10.70.46.177   dhcp46-177.lab.eng.blr.redhat.com
glusterfs-storage-rc2mf                       1/1       Running   1          5h        10.70.46.17    dhcp46-17.lab.eng.blr.redhat.com
heketi-storage-1-797hw                        1/1       Running   0          5h        10.128.0.19    dhcp46-52.lab.eng.blr.redhat.com

Comment 2 Neha Berry 2018-08-01 15:51:13 UTC
I shall update the bug with more details and logs in some time. Apologies for the delay.

Comment 40 errata-xmlrpc 2019-02-07 04:12:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0287

