Bug 1610903

Summary: [Tracker-RHGS-BZ#1622452] Brick for heketidbstorage failed to come ONLINE after node restart
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Neha Berry <nberry>
Component: rhgs-server-container
Assignee: Raghavendra Talur <rtalur>
Status: CLOSED ERRATA
QA Contact: Manisha Saini <msaini>
Severity: high
Docs Contact:
Priority: unspecified
Version: cns-3.10
CC: amukherj, hchiramm, jmulligan, kramdoss, madam, moagrawa, nberry, nigoyal, pprakash, rcyriac, rhs-bugs, rtalur, sankarshan, storage-qa-internal, vinug
Target Milestone: ---
Keywords: ZStream
Target Release: OCS 3.11.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: rhgs-server-rhel7:3.11.1-1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-02-07 04:12:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1622452, 1623433
Bug Blocks: 1641915, 1644154

Description Neha Berry 2018-08-01 15:06:46 UTC
Brick for heketidbstorage failed to come ONLINE after node restart

Description of problem:
++++++++++++++++++++++++++++
We were regression-testing the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1601341 on the new heketi build (heketi-7.0.0-5). We created 2 blockvolumes with HA=4, and that succeeded.

We then brought down one node (node X) and created 10 blockvolumes with HA=3, which succeeded as expected with the fix for BZ#1601341 (see the sketch after the note below).

At that point the system had 2 volumes, 1 replica-3 volume and 1 block-hosting volume, each with a brick on node X.

But when node X was powered back on, the brick for the heketidbstorage volume failed to come back online. The brick for the block-hosting volume was restored successfully.

Note:
1. We did not hit the issue for any other replica-3 or block-hosting volume. The issue is seen only for the heketidbstorage volume.
2. If we bring down a node (but do not create any block devices while it is down) and then bring it back up, no issue is seen and the brick process for the heketidbstorage volume is restored.
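
A rough sketch of the block-device creation step referenced above, assuming heketi-cli is run from inside the heketi pod with HEKETI_CLI_SERVER and the admin credentials already exported in its environment (pod name, size and count are illustrative, not taken from this setup):

# oc rsh heketi-storage-1-797hw
sh-4.2# heketi-cli blockvolume create --size=2 --ha=3   # repeat for the desired number of blockvolumes while node X is down
sh-4.2# heketi-cli blockvolume list                     # confirm the new blockvolumes were created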


Version-Release number of selected component (if applicable):
++++++++++++++++++++++++++++++++++++++++++++

[root@dhcp46-52 heketidb]# oc rsh glusterfs-storage-2qrqw 
sh-4.2# rpm -qa|grep gluster
glusterfs-client-xlators-3.8.4-54.15.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.15.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.15.el7rhgs.x86_64
glusterfs-libs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-api-3.8.4-54.15.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.15.el7rhgs.x86_64
glusterfs-server-3.8.4-54.15.el7rhgs.x86_64
gluster-block-0.2.1-23.el7rhgs.x86_64
sh-4.2# 

sh-4.2# rpm -qa|grep configshell
python-configshell-1.1.fb23-4.el7_5.noarch
sh-4.2# rpm -qa|grep tcmu-runner
tcmu-runner-1.2.0-23.el7rhgs.x86_64
sh-4.2# 

# oc rsh heketi-storage-1-797hw rpm -qa|grep heketi
python-heketi-7.0.0-5.el7rhgs.x86_64
heketi-client-7.0.0-5.el7rhgs.x86_64
heketi-7.0.0-5.el7rhgs.x86_64




How reproducible:
+++++++++++++++++++
We reproduced this on two setups; every time, the brick for the heketidbstorage volume failed to come up in the corner case where block devices had been created while one node was down.

Steps to Reproduce:
1. Create a 4-node CNS setup and create a blockvolume with HA=4. Two volumes now exist on the setup: heketidbstorage and a block-hosting volume. Confirm that all brick processes are up.
2. Bring down 1 node from the CNS cluster and create some new block devices with HA=3.
3. Bring up the failed node.
4. Check the volume status and heal status of all the volumes (see the sketch after these steps). Only for the heketidbstorage volume does the brick process fail to come up.
5. The heal info output also lists "Status: Transport endpoint is not connected" for heketidbstorage.
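
A minimal verification sketch for steps 4-5, run from inside any gluster pod after the node is back up (the pod name is just an example from this setup):

# oc rsh glusterfs-storage-8gz76
sh-4.2# gluster volume status heketidbstorage      # the brick on the rebooted node shows Online = N
sh-4.2# gluster volume heal heketidbstorage info   # reports "Status: Transport endpoint is not connected"
sh-4.2# gluster volume status <block-hosting-vol>  # for comparison, this volume's brick comes back online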

Actual results:
+++++++++++++++++
The heketidbstorage brick process is not restored after a node reboot in the scenario where block devices were created while the node was down.

Expected results:
++++++++++++++++
The brick process for the heketidbstorage volume should come online after node reboot.

Additional info:
++++++++++++++++

All setup details and command outputs are attached in the next comment.




[root@dhcp46-52 pvc-create]# oc get pods -o wide
NAME                                          READY     STATUS    RESTARTS   AGE       IP             NODE
block3-1-1-h2ps9                              1/1       Running   0          1h        10.128.2.10    dhcp47-62.lab.eng.blr.redhat.com
block4-1-88l58                                1/1       Running   0          4h        10.129.0.8     dhcp46-177.lab.eng.blr.redhat.com
block4-2-1-8k8hv                              1/1       Running   5          59m       10.128.0.21    dhcp46-52.lab.eng.blr.redhat.com
glusterblock-storage-provisioner-dc-1-s2m7w   1/1       Running   0          4h        10.128.2.9     dhcp47-62.lab.eng.blr.redhat.com
glusterfs-storage-8gz76                       1/1       Running   0          5h        10.70.47.62    dhcp47-62.lab.eng.blr.redhat.com
glusterfs-storage-9gtv6                       1/1       Running   0          5h        10.70.46.172   dhcp46-172.lab.eng.blr.redhat.com
glusterfs-storage-l4vfz                       1/1       Running   0          5h        10.70.46.177   dhcp46-177.lab.eng.blr.redhat.com
glusterfs-storage-rc2mf                       1/1       Running   1          5h        10.70.46.17    dhcp46-17.lab.eng.blr.redhat.com
heketi-storage-1-797hw                        1/1       Running   0          5h        10.128.0.19    dhcp46-52.lab.eng.blr.redhat.com
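
For completeness, the missing brick can also be confirmed from the gluster pod on the rebooted node itself; a rough sketch, with the pod name picked from the list above purely as an example:

# oc rsh glusterfs-storage-rc2mf
sh-4.2# ps -ef | grep glusterfsd | grep heketidbstorage   # no brick process running for heketidbstorage
sh-4.2# gluster volume info heketidbstorage               # lists the brick paths, including the brick on this node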

Comment 2 Neha Berry 2018-08-01 15:51:13 UTC
I shall update the bug with more details and logs in some time. Apologies for the delay.

Comment 40 errata-xmlrpc 2019-02-07 04:12:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0287