Bug 1653283

Summary: Heketi volume brick went offline while deleting/creating block pvc's followed by node shutdown
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Manisha Saini <msaini>
Component: rhgs-server-container
Assignee: Saravanakumar <sarumuga>
Status: CLOSED WORKSFORME
QA Contact: Prasanth <pprakash>
Severity: high
Docs Contact:
Priority: unspecified
Version: ocs-3.11
CC: hchiramm, kramdoss, madam, msaini, rhs-bugs, sankarshan
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-01-28 06:04:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Manisha Saini 2018-11-26 12:40:15 UTC
Description of problem:

While performing testing around block PVC creation/deletion followed by node shutdown (in order to reproduce an issue), one of the heketi volume bricks was observed to go offline.

Steps performed yesterday and today while doing this testing:

1. Create 20 block PVCs and assign them to 20 cirros pods (see the PVC creation sketch after these steps).
2. Power off one gluster node (choose a node that hosts one brick each of the heketi volume and the block hosting volume) so that one of the gluster pods goes down.
3. While that node is down, create 20 more block PVCs.
4. Power on the gluster node and wait for all the pods to reach the 1/1 state. Check the gluster bricks inside one of the pods; all bricks were up.
5. Power down one node again (same node selection criteria as in step 2).
6. Create 20 more PVCs and assign them to 20 cirros pods. Wait for the pods to come up.
7. Power on the node.

At this point two volumes are present in total (one block hosting volume and the heketi volume), with 60 block devices and 60 cirros pods consuming those 60 block volumes.

8. Delete all the PVCs and cirros pods.
9. Create 20 more block PVCs (5 GB each), hosted on 2 block hosting volumes.
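
For reference, a minimal sketch of how the block PVCs in steps 1, 3, 6 and 9 can be created, and how the brick-hosting node for steps 2 and 5 can be identified. The StorageClass name glusterfs-storage-block and the PVC/pod names are assumptions for illustration; the actual names in this setup may differ.

# cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-pvc-1
spec:
  storageClassName: glusterfs-storage-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
EOF
# oc get pvc block-pvc-1

# oc rsh <glusterfs-storage-pod>
sh-4.2# gluster volume status heketidbstorage | grep ^Brick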

When the node came back up at step 7, all bricks came up and were running.
But after some time, one of the heketidbstorage bricks went offline, as shown below.

==============
# oc rsh glusterfs-storage-2zr8j
sh-4.2# gluster v status
Status of volume: heketidbstorage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.141:/var/lib/heketi/mounts/v
g_ede1e924e59132dadfe79c5515986f70/brick_d5
f28da6882efa9fde21055004e08c35/brick        N/A       N/A        N       N/A  
Brick 10.70.46.38:/var/lib/heketi/mounts/vg
_d2108b159f0ef9188ef8c63ec7295540/brick_010
35b4a8dc955f63c06274610b0476c/brick         49152     0          Y       401  
Brick 10.70.47.190:/var/lib/heketi/mounts/v
g_10350b0a86976eaef1350b375c24abb4/brick_72
f9013ae459a3d868be53df72342616/brick        49152     0          Y       406  
Self-heal Daemon on localhost               N/A       N/A        Y       42600
Self-heal Daemon on 10.70.47.115            N/A       N/A        Y       56933
Self-heal Daemon on dhcp47-190.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       92848
Self-heal Daemon on 10.70.46.38             N/A       N/A        Y       46806
 
Task Status of Volume heketidbstorage
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: vol_1269c6d46399c7dab9f88fe1f4efcb02
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.141:/var/lib/heketi/mounts/v
g_ede1e924e59132dadfe79c5515986f70/brick_e4
f058aecea07ec8c7c517ecf56de002/brick        49152     0          Y       41946
Brick 10.70.46.38:/var/lib/heketi/mounts/vg
_d2108b159f0ef9188ef8c63ec7295540/brick_597
e233138a5b245c9e8fc5845d0e882/brick         49153     0          Y       46785
Brick 10.70.47.190:/var/lib/heketi/mounts/v
g_10350b0a86976eaef1350b375c24abb4/brick_9d
da7996e6d03862c26a391ec805cc4a/brick        49153     0          Y       91813
Self-heal Daemon on localhost               N/A       N/A        Y       42600
Self-heal Daemon on 10.70.47.115            N/A       N/A        Y       56933
Self-heal Daemon on dhcp47-190.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       92848
Self-heal Daemon on 10.70.46.38             N/A       N/A        Y       46806
 
Task Status of Volume vol_1269c6d46399c7dab9f88fe1f4efcb02
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: vol_c6100fe5fc5eae87bdd4972aa79853c7
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.190:/var/lib/heketi/mounts/v
g_e68a5ba833ae758bcb4e0e52c21126dc/brick_05
08a00d31f9426d3a4541cc43c77c5a/brick        49153     0          Y       91813
Brick 10.70.47.115:/var/lib/heketi/mounts/v
g_bd92a54d28d1ad649c0cb3e349e58e5d/brick_5a
8043c95bff8c5951ebd1905c3a2025/brick        49152     0          Y       56377
Brick 10.70.47.141:/var/lib/heketi/mounts/v
g_ede1e924e59132dadfe79c5515986f70/brick_0e
348632c4fcbd91bcaa16f728b2695d/brick        49152     0          Y       41946
Self-heal Daemon on localhost               N/A       N/A        Y       42600
Self-heal Daemon on 10.70.46.38             N/A       N/A        Y       46806
Self-heal Daemon on 10.70.47.115            N/A       N/A        Y       56933
Self-heal Daemon on dhcp47-190.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       92848
 
Task Status of Volume vol_c6100fe5fc5eae87bdd4972aa79853c7
------------------------------------------------------------------------------
There are no active volume tasks

# gluster peer status
Number of Peers: 3

Hostname: dhcp47-190.lab.eng.blr.redhat.com
Uuid: 6fce4225-bca3-4e04-98f4-f3e82a364566
State: Peer in Cluster (Connected)

Hostname: 10.70.47.115
Uuid: a07581ec-4a32-4da0-ab71-3d2612bdf100
State: Peer in Cluster (Connected)

Hostname: 10.70.46.38
Uuid: 5f14ee56-6a79-4e1f-975a-4f9c7aea1721
State: Peer in Cluster (Connected)

================
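
When a brick shows Online = N as above and the underlying device is healthy, the brick process can usually be restarted with a force start of the volume from inside one of the gluster pods. This is the general gluster workaround, not necessarily what was done in this setup:

sh-4.2# gluster volume start heketidbstorage force
sh-4.2# gluster volume status heketidbstorage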


Version-Release number of selected component (if applicable):


sh-4.2# rpm -qa|grep gluster
glusterfs-fuse-3.12.2-27.el7rhgs.x86_64
python2-gluster-3.12.2-27.el7rhgs.x86_64
glusterfs-server-3.12.2-27.el7rhgs.x86_64
gluster-block-0.2.1-29.el7rhgs.x86_64
glusterfs-api-3.12.2-27.el7rhgs.x86_64
glusterfs-cli-3.12.2-27.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-27.el7rhgs.x86_64
glusterfs-libs-3.12.2-27.el7rhgs.x86_64
glusterfs-3.12.2-27.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-27.el7rhgs.x86_64

# rpm -qa|grep tcmu-runner
tcmu-runner-1.2.0-27.el7rhgs.x86_64

# rpm -qa|grep heketi
heketi-client-8.0.0-1.el7rhgs.x86_64
heketi-8.0.0-1.el7rhgs.x86_64


How reproducible:

Hit while testing PV creation/deletion followed by node power-off.

Steps to Reproduce:

Mentioned above

Actual results:

One of the heketi volume bricks went offline.

Expected results:

No bricks should go offline.

Additional info:


# oc get nodes -o wide
NAME                                STATUS    ROLES     AGE       VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION          CONTAINER-RUNTIME
dhcp46-120.lab.eng.blr.redhat.com   Ready     infra     2d        v1.11.0+d4cacc0   10.70.46.120   <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp46-159.lab.eng.blr.redhat.com   Ready     compute   2d        v1.11.0+d4cacc0   10.70.46.159   <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp46-168.lab.eng.blr.redhat.com   Ready     infra     2d        v1.11.0+d4cacc0   10.70.46.168   <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp46-238.lab.eng.blr.redhat.com   Ready     compute   2d        v1.11.0+d4cacc0   10.70.46.238   <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp46-38.lab.eng.blr.redhat.com    Ready     compute   2d        v1.11.0+d4cacc0   10.70.46.38    <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp47-115.lab.eng.blr.redhat.com   Ready     compute   2d        v1.11.0+d4cacc0   10.70.47.115   <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp47-141.lab.eng.blr.redhat.com   Ready     compute   2d        v1.11.0+d4cacc0   10.70.47.141   <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp47-148.lab.eng.blr.redhat.com   Ready     infra     2d        v1.11.0+d4cacc0   10.70.47.148   <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp47-190.lab.eng.blr.redhat.com   Ready     compute   2d        v1.11.0+d4cacc0   10.70.47.190   <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp47-31.lab.eng.blr.redhat.com    Ready     compute   2d        v1.11.0+d4cacc0   10.70.47.31    <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp47-89.lab.eng.blr.redhat.com    Ready     master    2d        v1.11.0+d4cacc0   10.70.47.89    <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1


Attaching sosreports shortly

Comment 7 Manisha Saini 2019-01-23 11:16:05 UTC
After performing the above test case, the heketi volume and all gluster block hosting volumes were online. No issue was observed.