Brick for heketidbstorage failed to come ONLINE after node restart

Description of problem:
++++++++++++++++++++++++++++
We were regressing https://bugzilla.redhat.com/show_bug.cgi?id=1601341 on the new heketi build 7.0.5. Created 2 blockvolumes with HA=4, which succeeded. Brought down one node X and created 10 blockvolumes with HA=3, which succeeded with the fix for BZ#1601341. The system had 2 volumes, 1 replica-3 (heketidbstorage) and 1 block-hosting volume, each using bricks on node X. When node X was powered back on, the brick for the heketidbstorage volume failed to come back online; the brick for the block-hosting volume was restored successfully.

Note:
1. We didn't hit the issue for any other replica-3/block-hosting volume. The issue is seen only for the heketidbstorage volume.
2. If we bring down a node (but don't create any block device while it is down) and then bring it up, no issue is seen and the brick process for the heketidbstorage volume is restored.

Version-Release number of selected component (if applicable):
++++++++++++++++++++++++++++++++++++++++++++
[root@dhcp46-52 heketidb]# oc rsh glusterfs-storage-2qrqw
sh-4.2# rpm -qa|grep gluster
glusterfs-client-xlators-3.8.4-54.15.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.15.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.15.el7rhgs.x86_64
glusterfs-libs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-api-3.8.4-54.15.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.15.el7rhgs.x86_64
glusterfs-server-3.8.4-54.15.el7rhgs.x86_64
gluster-block-0.2.1-23.el7rhgs.x86_64
sh-4.2# rpm -qa|grep configshell
python-configshell-1.1.fb23-4.el7_5.noarch
sh-4.2# rpm -qa|grep tcmu-runner
tcmu-runner-1.2.0-23.el7rhgs.x86_64

# oc rsh heketi-storage-1-797hw rpm -qa|grep heketi
python-heketi-7.0.0-5.el7rhgs.x86_64
heketi-client-7.0.0-5.el7rhgs.x86_64
heketi-7.0.0-5.el7rhgs.x86_64

How reproducible:
+++++++++++++++++++
We tried this on two setups, and every time the brick for the heketidbstorage volume failed to come up in the
corner case where block devices had been created while one node was down.

Steps to Reproduce:
1. Create a 4-node CNS setup and create a blockvolume with HA=4. Two volumes now exist on the setup, heketidbstorage and a block-hosting volume. Confirm all brick processes are up.
2. Bring down 1 node from the CNS cluster and create some new block devices with HA=3.
3. Bring up the failed node.
4. Check the gluster volume status and heal status of all the volumes. Only for the heketidbstorage volume does the brick process fail to come up.
5. The gluster volume heal info output also lists "Status: Transport endpoint is not connected" for heketidbstorage.

Actual results:
+++++++++++++++++
The heketidbstorage brick process is not restored after a node reboot, in scenarios where block devices were created while the node was down.

Expected results:
++++++++++++++++
The brick process for the heketidbstorage volume should come online after a node reboot.

Additional info:
++++++++++++++++
All setup and command outputs are attached in the next comment.

[root@dhcp46-52 pvc-create]# oc get pods -o wide
NAME                                          READY  STATUS   RESTARTS  AGE  IP            NODE
block3-1-1-h2ps9                              1/1    Running  0         1h   10.128.2.10   dhcp47-62.lab.eng.blr.redhat.com
block4-1-88l58                                1/1    Running  0         4h   10.129.0.8    dhcp46-177.lab.eng.blr.redhat.com
block4-2-1-8k8hv                              1/1    Running  5         59m  10.128.0.21   dhcp46-52.lab.eng.blr.redhat.com
glusterblock-storage-provisioner-dc-1-s2m7w   1/1    Running  0         4h   10.128.2.9    dhcp47-62.lab.eng.blr.redhat.com
glusterfs-storage-8gz76                       1/1    Running  0         5h   10.70.47.62   dhcp47-62.lab.eng.blr.redhat.com
glusterfs-storage-9gtv6                       1/1    Running  0         5h   10.70.46.172  dhcp46-172.lab.eng.blr.redhat.com
glusterfs-storage-l4vfz                       1/1    Running  0         5h   10.70.46.177  dhcp46-177.lab.eng.blr.redhat.com
glusterfs-storage-rc2mf                       1/1    Running  1         5h   10.70.46.17   dhcp46-17.lab.eng.blr.redhat.com
heketi-storage-1-797hw                        1/1    Running  0         5h   10.128.0.19   dhcp46-52.lab.eng.blr.redhat.com
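For reference, the verification in steps 4-5 can be sketched as below. This is a minimal sketch, not part of the original report: it assumes you have shelled into one of the glusterfs-storage pods (e.g. via oc rsh glusterfs-storage-2qrqw), and the volume name heketidbstorage is taken from the report. The script skips the checks if the gluster CLI is not present.

```shell
#!/bin/sh
# Hypothetical sketch of the checks in steps 4-5. Run from inside a
# glusterfs-storage pod where the gluster CLI is available.
if command -v gluster >/dev/null 2>&1; then
  # Per-brick status: when this bug is hit, the heketidbstorage brick on
  # the rebooted node is expected to show Online "N" with no PID.
  gluster volume status heketidbstorage

  # Heal info: expected to report
  # "Status: Transport endpoint is not connected" for the offline brick.
  gluster volume heal heketidbstorage info
  checks_run=yes
else
  echo "gluster CLI not found; run this inside a glusterfs-storage pod"
  checks_run=no
fi
```

The same two commands can be repeated for the block-hosting volume to confirm that, unlike heketidbstorage, its brick comes back online after the reboot.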
I shall update the bug with more details and logs in some time. Apologies for the delay.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0287