Bug 1605158

Summary: [Tracker] While creating and deleting PVs in a loop and running gluster volume heal in the gluster pods, 2 of the 4 gluster pods are in 0/1 state
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: vinutha <vinug>
Component: kubernetes
Assignee: Humble Chirammal <hchiramm>
Status: CLOSED CURRENTRELEASE
QA Contact: Prasanth <pprakash>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: cns-3.9
CC: dyocum, kramdoss, pprakash, rhs-bugs, rtalur, sankarshan
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-04-14 14:41:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1601874
Bug Blocks:

Description vinutha 2018-07-20 10:22:56 UTC
Description of problem:

*** This bug is created as a clone of 1601874 ***

While creating and deleting 200 PVs in a loop and running gluster volume heal in the gluster pods, 2 of the 4 gluster pods are in 0/1 state.


Version-Release number of selected component (if applicable):
# rpm -qa| grep openshift
openshift-ansible-roles-3.9.31-1.git.34.154617d.el7.noarch
atomic-openshift-excluder-3.9.31-1.git.0.ef9737b.el7.noarch
atomic-openshift-master-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-sdn-ovs-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-3.9.31-1.git.0.ef9737b.el7.x86_64
openshift-ansible-docs-3.9.31-1.git.34.154617d.el7.noarch
openshift-ansible-playbooks-3.9.31-1.git.34.154617d.el7.noarch
atomic-openshift-docker-excluder-3.9.31-1.git.0.ef9737b.el7.noarch
atomic-openshift-node-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-clients-3.9.31-1.git.0.ef9737b.el7.x86_64
openshift-ansible-3.9.31-1.git.34.154617d.el7.noarch

# oc rsh glusterfs-storage-mrfh4 
sh-4.2# rpm -qa| grep gluster 
glusterfs-libs-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-api-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-fuse-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-server-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
gluster-block-0.2.1-14.1.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-cli-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-geo-replication-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-debuginfo-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64

# oc rsh heketi-storage-1-55bw4 
sh-4.2# rpm -qa| grep heketi
python-heketi-6.0.0-7.4.el7rhgs.x86_64
heketi-client-6.0.0-7.4.el7rhgs.x86_64
heketi-6.0.0-7.4.el7rhgs.x86_64

How reproducible:
1:1

Steps to Reproduce:

CNS 4-node setup, each node with a 1 TB device, CPU = 32 (4 cores), and 72 GB memory.

1. Created 100 mongodb pods with 1 GB volumes and ran IO on them (using dd).

2. Upgraded the system from the 3.9 live build to the Experian hotfix build.

3. Verified that all 4 gluster pods had spun up and were in 1/1 Running state, and that all mongodb pods were also in Running state.

4. Initiated creation and deletion of 200 PVs while running gluster volume heal on all 4 gluster pods (scripts below).

---- creation and deletion of pvs ----------
while true
do
    for i in {101..300}
    do
        ./pvc_create.sh c$i 1; sleep 30;
    done

    sleep 40

    for i in {101..300}
    do
        oc delete pvc c$i; sleep 20;
    done
done
---------------------pv creation/deletion-------------
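
The pvc_create.sh helper called in the loop above is not attached to this bug; below is a minimal sketch of what such a script typically looks like for a CNS setup, not the exact script used in this run. The storage class name "glusterfs-storage" is an assumption.

---- pvc_create.sh (hypothetical sketch) ----------
#!/bin/bash
# Hypothetical sketch of pvc_create.sh; not the script used in this run.
# Usage: ./pvc_create.sh <claim-name> <size-in-gi>
NAME=$1
SIZE=$2

# Create a PVC against the glusterfs-backed storage class (name assumed).
cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${NAME}
spec:
  storageClassName: glusterfs-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${SIZE}Gi
EOF
---------------------------------------------------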

running gluster volume heal on all volumes in each gluster pod:

---- gluster volume heal loop ----------
while true
do
    for i in $(gluster v list | grep vol)
    do
        gluster v heal $i
        sleep 2
    done
done
---------------------gluster volume heal-------------

5. 2 of the 4 gluster pods went to 0/1 state (see the checks below).
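
For reference, a quick way to see which gluster pods are not ready and why (a sketch using standard oc commands; the pod name is a placeholder, and the glusterd check assumes the CNS gluster image runs glusterd under systemd):

# List the gluster pods and their READY column (0/1 vs 1/1)
oc get pods -o wide | grep glusterfs

# For a pod stuck at 0/1, the failing readiness probe shows up in the Events
# section at the bottom of the describe output
oc describe pod <glusterfs-pod-name>

# Check glusterd inside the pod directly (assumes glusterd is managed by
# systemd inside the container, as in CNS gluster images)
oc exec <glusterfs-pod-name> -- systemctl status glusterd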


Actual results:
2 gluster pods are in 0/1 state 

Expected results:
All 4 gluster pods should be in 1/1 Running state. 

Additional info:
logs attached

Comment 10 Humble Chirammal 2018-07-24 15:56:43 UTC
Karthick/Prasanth, can you ack for deferring this bug from the release?

Comment 11 krishnaram Karthick 2018-07-25 09:28:37 UTC
(In reply to Humble Chirammal from comment #10)
> Karthick/Prasanth, can you ack for deferring this bug from the release?

Ack. We can move this out of 3.10.