Bug 1678446

Summary: gluster pod stuck in 0/1 state for more than 4 hours during an upgrade (on large scale - > 850 volumes) - possibly LVM scale issue
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: RamaKasturi <knarra>
Component: rhgs-server-container
Assignee: Raghavendra Talur <rtalur>
Status: CLOSED DUPLICATE
QA Contact: Prasanth <pprakash>
Severity: high
Docs Contact:
Priority: unspecified
Version: ocs-3.11
CC: kramdoss, madam, pasik, puebele, rhs-bugs, rtalur, sankarshan, sarumuga
Target Milestone: ---
Keywords: Performance, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-05-07 11:45:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description RamaKasturi 2019-02-18 19:13:15 UTC
Description of problem:
While upgrading a setup from OCP 3.11 + OCS 3.11 to OCP 3.11 + OCS 3.11.1 with more than 850 volumes, the gluster pod remained stuck in the 0/1 state for more than 4 hours. Digging deeper showed that pvscan was hung, apparently because the number of LVs on the node is greater than 1000.
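One way to confirm the stall from inside the affected pod (the pod name below is a placeholder) is to look for a pvscan process that never exits and count the logical volumes it has to walk:

oc rsh <gluster-pod-name>
sh-4.2# ps -ef | grep [p]vscan      # a long-running pvscan here indicates the hang
sh-4.2# lvs --noheadings | wc -l    # on this setup the count was above 1000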


Version-Release number of selected component (if applicable):
oc v3.11.69
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://master.refarch311.ocsqeblr.com:443
openshift v3.11.69
kubernetes v1.11.0+d4cacc0
sh-4.2# rpm -qa | grep glusterfs
glusterfs-libs-3.12.2-32.el7rhgs.x86_64
glusterfs-3.12.2-32.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-32.el7rhgs.x86_64
glusterfs-server-3.12.2-32.el7rhgs.x86_64
glusterfs-api-3.12.2-32.el7rhgs.x86_64
glusterfs-cli-3.12.2-32.el7rhgs.x86_64
glusterfs-fuse-3.12.2-32.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-32.el7rhgs.x86_64


How reproducible:
Always

Steps to Reproduce:
1. Have a setup of OCP 3.11 + OCS 3.11 in an AWS environment.
2. Create 850 file volumes and 50 block volumes (see the PVC loop sketch after these steps).
3. Create 100 cirros pods attached to the file and block volumes.
4. Try upgrading the setup to the latest version, OCS 3.11.1.
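For step 2, the file PVCs can be created in a loop against the gluster file storage class. The sketch below is only illustrative; the storage class name (glusterfs-storage), claim size, and claim names are assumptions, not values taken from this setup, and the block volumes would be created the same way against the block storage class:

for i in $(seq 1 850); do
cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: file-claim-$i
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: glusterfs-storage
EOF
done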

Actual results:
The gluster pod stays in the 0/1 state for more than four hours, and pvscan hangs because more than 1000 LVs are present.

Expected results:
The gluster pod should not remain in the 0/1 state, and pvscan should complete successfully.

Additional info:

Below is the workaround we followed to get the pod up and running (see the oc sketch after these steps):

1. Remove the glusterfs=storage-host label from the node.
2. Reboot the node (or stop and start the AWS instance).
3. Re-label the node with glusterfs=storage-host.
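In oc terms, the relabel portion of the workaround looks roughly like this (the node name is a placeholder):

oc label node <node-name> glusterfs-                  # step 1: remove the label from the node
# step 2: reboot the node, or stop and start the AWS instance
oc label node <node-name> glusterfs=storage-host      # step 3: re-add the label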

Comment 8 Yaniv Kaul 2019-04-14 14:20:42 UTC
Status?